The HEP taxonomy: rationale and extensions
The DESY Library has maintained a thesaurus of high energy physics (HEP) terms
for a long time. The thesaurus is currently used as a subject controlled
vocabulary by HEP institutes worldwide. The need to convert the HEP text
thesaurus to a more complex taxonomy (in /opt/invenio/etc/bibclassify/HEP.rdf),
with richer structure and semantics, was mainly driven by the needs of the
BibClassify system. The current taxonomy of high energy physics is expressed
in the SKOS syntax. SKOS is an RDF-based vocabulary intended especially for
the representation of knowledge organization systems, such as thesauri,
taxonomies and basic ontologies.
NB. SKOS was adopted instead of other similar knowledge organization formats, notably OWL, because it is simple yet complete. The SKOS language contains all the basic properties needed to express the taxonomy, and it allows straightforward conversion from text to RDF. For a more detailed discussion of this matter, please check the CERN-DESY email correspondence.
In order to satisfy the needs of typical HEP classification schemes and practices, the SKOS language had to be extended to include additional properties. HEP keywords are often expressed as a combination - a pair - of keywords. An example of this is:
Born-Infeld model: monopole
Here both Born-Infeld model and monopole are standard HEP keywords (single
keywords). However, they also combine to express a collective concept. We call
it a composite keyword and have extended the SKOS language to accommodate this
new paradigm. The two property extensions created for this purpose are:
composite: expresses a relationship of combination with another single
keyword by pointing to the target composite keyword. It is a subProperty of
the SKOS property narrower.
compositeOf: expresses the dependence of a composite keyword on two single
concepts. It is a subProperty of the SKOS property broader.
These two extensions are probably best described by an example. The single
keyword Born-Infeld model is expressed in the HEP taxonomy as:

    <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel">
      <prefLabel xml:lang="en">Born-Infeld model</prefLabel>
      <hiddenLabel xml:lang="en">Born-Infeld</hiddenLabel>
      <altLabel xml:lang="en">DBI</altLabel>
      <broader rdf:resource="http://cern.ch/thesauri/HEP.rdf#fieldtheoreticalmodel"/>
      <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelrelativistic"/>
      <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelnonlinear"/>
      <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelnonabelian"/>
      <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"/>
      <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelchiral"/>
    </Concept>

The concept contains all the usual SKOS tags to express the relations and
denominations of a concept (prefLabel, broader, etc.). In addition, it
contains five composite tags: these link to five different combinations of
this keyword with fellow single keywords. For example, one of these points to
the composite keyword Born-Infeld model: monopole, whose entry is:

    <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole">
      <prefLabel xml:lang="en">Born-Infeld model: monopole</prefLabel>
      <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel"/>
      <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#monopole"/>
    </Concept>

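The two kinds of entry can be read and cross-checked with a few lines of Python. The following is only a minimal sketch, not BibClassify's own loader. It assumes that the file's root element declares SKOS as the default namespace and the usual rdf prefix (the fragments above omit the declarations), and it includes a cut-down monopole entry that is not shown in this document. The helper names load_concepts and missing_inverse_links are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: HEP.rdf declares these namespaces on its root element.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://cern.ch/thesauri/HEP.rdf#"

# Cut-down sample; the monopole entry is assumed, not quoted from the taxonomy.
SAMPLE = f"""
<rdf:RDF xmlns="{SKOS_NS}" xmlns:rdf="{RDF_NS}">
  <Concept rdf:about="{BASE}Born-Infeldmodel">
    <prefLabel xml:lang="en">Born-Infeld model</prefLabel>
    <composite rdf:resource="{BASE}Composite.Born-Infeldmodelmonopole"/>
  </Concept>
  <Concept rdf:about="{BASE}monopole">
    <prefLabel xml:lang="en">monopole</prefLabel>
    <composite rdf:resource="{BASE}Composite.Born-Infeldmodelmonopole"/>
  </Concept>
  <Concept rdf:about="{BASE}Composite.Born-Infeldmodelmonopole">
    <prefLabel xml:lang="en">Born-Infeld model: monopole</prefLabel>
    <compositeOf rdf:resource="{BASE}Born-Infeldmodel"/>
    <compositeOf rdf:resource="{BASE}monopole"/>
  </Concept>
</rdf:RDF>
""".strip()


def _targets(node, tag):
    """Collect the rdf:resource URIs of all child elements named `tag`."""
    return [el.get(f"{{{RDF_NS}}}resource")
            for el in node.findall(f"{{{SKOS_NS}}}{tag}")]


def load_concepts(rdf_text):
    """Map each concept URI to its label and its composite/compositeOf links."""
    root = ET.fromstring(rdf_text)
    concepts = {}
    for node in root.iter(f"{{{SKOS_NS}}}Concept"):
        uri = node.get(f"{{{RDF_NS}}}about")
        concepts[uri] = {
            "label": node.findtext(f"{{{SKOS_NS}}}prefLabel"),
            "composite": _targets(node, "composite"),
            "compositeOf": _targets(node, "compositeOf"),
        }
    return concepts


def missing_inverse_links(concepts):
    """List (composite, component) pairs lacking the inverse `composite` link."""
    missing = []
    for uri, entry in concepts.items():
        for component in entry["compositeOf"]:
            if uri not in concepts.get(component, {}).get("composite", []):
                missing.append((uri, component))
    return missing


concepts = load_concepts(SAMPLE)
composite_uri = BASE + "Composite.Born-Infeldmodelmonopole"
components = [concepts[u]["label"] for u in concepts[composite_uri]["compositeOf"]]
```

Here components holds the labels of the two single keywords behind the composite, and missing_inverse_links(concepts) is empty when every compositeOf link is mirrored by a composite link on the component side.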
The structure of single and composite keywords, as well as their associations
expressed by the properties composite and compositeOf, should be clear from
these entries. By using such a model, we are able to efficiently extract
keyword pairs from fulltext, as explained in the BibClassify extraction guide.
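The actual extraction procedure is the one described in the extraction guide. The sketch below only illustrates why the compositeOf links make pair lookup cheap: once the single keywords found in a text are known, a composite becomes a candidate whenever both of its components were found. The table COMPOSITE_OF and the function candidate_composites are hypothetical, and the second label pair is guessed from the URIs shown earlier.

```python
# Hypothetical in-memory form of the compositeOf links:
# composite keyword label -> labels of its two single-keyword components.
COMPOSITE_OF = {
    "Born-Infeld model: monopole": ("Born-Infeld model", "monopole"),
    # Labels below are guessed from the URI, not taken from the taxonomy.
    "Born-Infeld model: nonlinear": ("Born-Infeld model", "nonlinear"),
}


def candidate_composites(single_counts, composite_of=COMPOSITE_OF):
    """Return composites whose two components both occur in the text.

    `single_counts` maps single-keyword labels to occurrence counts.
    This is a simplification: it ignores where in the text the
    components occur.
    """
    return sorted(
        composite
        for composite, (first, second) in composite_of.items()
        if single_counts.get(first, 0) > 0 and single_counts.get(second, 0) > 0
    )


found = candidate_composites({"Born-Infeld model": 3, "monopole": 1})
```

With these counts, only Born-Infeld model: monopole qualifies, because nonlinear was not found in the text.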
Finally, it is worth pointing out a couple of other syntax practices that might be specific to the HEP taxonomy:
hiddenLabel: this lexical label (currently in an unstable state) is used
primarily to define alternative forms of a keyword that are hidden from the
user but still accessible to text parsing operations. In the HEP taxonomy,
these include misspellings and wildcards (strictly expressed as legal regular
expressions).
note: this label is currently reserved to describe a nostandalone condition,
a property of those single keywords that can only appear as part of composite
keywords (their occurrences as standalone keywords are discarded).
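These two conventions can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not BibClassify's matching code. The record layout, the nostandalone flags and the pattern monopoles? are made up for the example; only the hidden label Born-Infeld comes from the entry shown earlier.

```python
import re

# Hypothetical keyword records; flags and the second pattern are invented.
KEYWORDS = [
    {"label": "Born-Infeld model", "hidden": [r"Born-Infeld"], "nostandalone": False},
    {"label": "monopole", "hidden": [r"monopoles?"], "nostandalone": True},
]


def match_singles(text, keywords):
    """Count occurrences of each keyword via its label or hidden patterns.

    The preferred label is matched literally; hidden labels are treated
    as regular expressions, as the taxonomy requires them to be legal ones.
    """
    counts = {}
    for kw in keywords:
        patterns = [re.escape(kw["label"])] + kw["hidden"]
        regex = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)
        hits = len(regex.findall(text))
        if hits:
            counts[kw["label"]] = hits
    return counts


def drop_nostandalone(counts, keywords):
    """Discard standalone hits of keywords flagged as nostandalone."""
    blocked = {kw["label"] for kw in keywords if kw["nostandalone"]}
    return {label: n for label, n in counts.items() if label not in blocked}


text = "The Born-Infeld action admits monopoles."
counts = match_singles(text, KEYWORDS)
standalone = drop_nostandalone(counts, KEYWORDS)
```

Both keywords are matched through their hidden patterns, so the full counts remain available for building composite keywords, while the standalone output keeps only Born-Infeld model.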