The HEP taxonomy: rationale and extensions

The DESY Library has long maintained a thesaurus of high energy physics (HEP) terms, which HEP institutes worldwide currently use as a subject controlled vocabulary. The conversion of the HEP text thesaurus into a taxonomy with richer structure and semantics (in /opt/invenio/etc/bibclassify/HEP.rdf) was driven mainly by the needs of the BibClassify system. The current HEP taxonomy is expressed in SKOS, an RDF-based vocabulary intended for representing knowledge organization systems such as thesauri, taxonomies and basic ontologies.

NB. SKOS was adopted instead of other similar knowledge organization formats - notably OWL - because it is simple yet complete: it contains all the basic properties needed to express the taxonomy and allows straightforward conversion from text to RDF. For a more detailed discussion of this matter, please check the CERN-DESY email correspondence.

In order to satisfy the needs of typical HEP classification schemes and practices, the SKOS language had to be extended to include additional properties. HEP keywords are often expressed as a combination - a pair - of keywords. An example of this is:

Born-Infeld model: monopole

Here both Born-Infeld model and monopole are standard HEP keywords (single keywords). However, they also combine to express a collective concept. We call this a composite keyword, and we have extended the SKOS language to accommodate it. The two property extensions created for this purpose are:

composite: placed on a single keyword, it expresses that the keyword combines with another single keyword, and it points to the resulting composite keyword. It is a subProperty of the SKOS property narrower.
compositeOf: placed on a composite keyword, it expresses the dependence of the composite keyword on its two single keywords, and it points to each of them. It is a subProperty of the SKOS property broader.

These two extensions are probably best described by an example. The single keyword Born-Infeld model is expressed in the HEP taxonomy as:

<Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel">
  <prefLabel xml:lang="en">Born-Infeld model</prefLabel>
  <hiddenLabel xml:lang="en">Born-Infeld</hiddenLabel>
  <altLabel xml:lang="en">DBI</altLabel>
  <broader rdf:resource="http://cern.ch/thesauri/HEP.rdf#fieldtheoreticalmodel"/>
  <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelrelativistic"/>
  <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelnonlinear"/>
  <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelnonabelian"/>
  <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"/>
  <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelchiral"/>
</Concept>

The concept contains the usual SKOS tags that express the names and relations of a concept (prefLabel, broader, etc.). In addition, it contains five composite tags, which link to five different combinations of this keyword with other single keywords. For example, one of these points to the composite keyword Born-Infeld model: monopole, whose entry is:

<Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole">
  <prefLabel xml:lang="en">Born-Infeld model: monopole</prefLabel>
  <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel"/>
  <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#monopole"/>
</Concept>
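
The links between the two kinds of entry can be followed programmatically. The following is a minimal sketch, not part of BibClassify itself, that reads entries of this shape with Python's standard library and resolves a composite keyword to its two single keywords. Several things here are assumptions made for illustration: the enclosing rdf:RDF element and its namespace declarations, the trimmed-down Born-Infeld model entry, the monopole entry (which is not shown above), and the helper names local_name and load_concepts.

```python
import xml.etree.ElementTree as ET

# Standard RDF namespace, needed to read rdf:about and rdf:resource.
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
BASE = "http://cern.ch/thesauri/HEP.rdf#"

# Assumption: the wrapper element and its namespace declarations are
# illustrative. The monopole entry is hypothetical.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://www.w3.org/2004/02/skos/core#">
  <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel">
    <prefLabel xml:lang="en">Born-Infeld model</prefLabel>
    <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"/>
  </Concept>
  <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#monopole">
    <prefLabel xml:lang="en">monopole</prefLabel>
    <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"/>
  </Concept>
  <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole">
    <prefLabel xml:lang="en">Born-Infeld model: monopole</prefLabel>
    <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel"/>
    <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#monopole"/>
  </Concept>
</rdf:RDF>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def load_concepts(rdf_text):
    """Map each concept URI to its prefLabel and composite/compositeOf links."""
    concepts = {}
    for node in ET.fromstring(rdf_text):
        if local_name(node.tag) != "Concept":
            continue
        entry = {"prefLabel": None, "composite": [], "compositeOf": []}
        for child in node:
            name = local_name(child.tag)
            if name == "prefLabel":
                entry["prefLabel"] = child.text
            elif name in ("composite", "compositeOf"):
                entry[name].append(child.get(RDF + "resource"))
        concepts[node.get(RDF + "about")] = entry
    return concepts


concepts = load_concepts(SAMPLE)
pair = concepts[BASE + "Composite.Born-Infeldmodelmonopole"]
# Follow the compositeOf links back to the two single keywords.
components = [concepts[uri]["prefLabel"] for uri in pair["compositeOf"]]
print(pair["prefLabel"], "->", components)
# prints: Born-Infeld model: monopole -> ['Born-Infeld model', 'monopole']
```

Matching on local tag names rather than on fully qualified names keeps the sketch independent of the namespace in which composite and compositeOf are actually declared.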

The structure of single and composite keywords, and the associations between them expressed by the composite and compositeOf properties, should be clear from this example. This model allows keyword pairs to be extracted efficiently from fulltext, as explained in the BibClassify extraction guide.
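
To give a rough idea of why this model helps with pair extraction, here is a deliberately naive sketch. It is not the BibClassify algorithm (see the extraction guide for that): it only reports a composite keyword when both of its single keywords occur somewhere in the text, and it ignores how close together they are. The dictionaries SINGLE_KEYWORDS and COMPOSITES, the chiral entry, and the function names are hypothetical stand-ins for data that would normally be read from the taxonomy.

```python
import re

# Hypothetical in-memory view of a few taxonomy entries:
# single keyword id -> preferred label.
SINGLE_KEYWORDS = {
    "Born-Infeldmodel": "Born-Infeld model",
    "monopole": "monopole",
    "chiral": "chiral",
}

# Composite label -> ids of its two single keywords (the compositeOf links).
COMPOSITES = {
    "Born-Infeld model: monopole": ("Born-Infeldmodel", "monopole"),
    "Born-Infeld model: chiral": ("Born-Infeldmodel", "chiral"),
}


def find_single_keywords(text):
    """Return the ids of the single keywords whose label occurs in the text."""
    found = set()
    for key, label in SINGLE_KEYWORDS.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(key)
    return found


def find_composite_keywords(text):
    """Return composites whose two single keywords both occur in the text."""
    found = find_single_keywords(text)
    return sorted(
        label
        for label, parts in COMPOSITES.items()
        if all(part in found for part in parts)
    )


text = "We study monopole solutions of the Born-Infeld model in four dimensions."
print(find_composite_keywords(text))
# prints: ['Born-Infeld model: monopole']
```

Because each composite keyword lists exactly its two components, candidate pairs can be checked against the single keywords already found, rather than by searching the text for every possible combination.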

Finally, it is worth pointing out a couple of other syntax practices that might be specific only to the HEP taxonomy:

hiddenLabel: this lexical label (currently in an unstable state) is used primarily to define alternative forms of a keyword that are hidden from the user but still available to text parsing operations. In the HEP taxonomy, these include misspellings and wildcards (which must be expressed as legal regular expressions).
note: this label is currently reserved for describing a nostandalone condition - a property of those single keywords that can only appear as part of composite keywords (their standalone occurrences are discarded).
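
The effect of these two practices can be sketched as follows. This is an illustration under stated assumptions, not the BibClassify implementation: hidden labels are treated as ordinary Python regular expressions, the nostandalone condition is modelled as a boolean flag, and the KEYWORDS data (including the misspelling Born-Infield and the nostandalone marking on nonlinear) is invented for the example.

```python
import re

# Hypothetical keyword records. hiddenLabels are regular expressions that
# are matched against the text but never shown to the user.
KEYWORDS = [
    {
        "prefLabel": "Born-Infeld model",
        "hiddenLabels": [r"Born-Infeld", r"Born-Infield"],  # short form, misspelling
        "nostandalone": False,
    },
    {
        "prefLabel": "nonlinear",
        "hiddenLabels": [r"non-?linear\w*"],  # wildcard written as a regex
        "nostandalone": True,  # assumed: only meaningful inside composites
    },
]


def match_keywords(text):
    """Return the records whose preferred or hidden labels occur in the text."""
    matched = []
    for keyword in KEYWORDS:
        patterns = [re.escape(keyword["prefLabel"])] + keyword["hiddenLabels"]
        if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
            matched.append(keyword)
    return matched


def standalone_labels(matched):
    """Discard keywords flagged nostandalone; report the rest by prefLabel."""
    return [kw["prefLabel"] for kw in matched if not kw["nostandalone"]]


text = "A non-linear extension of Born-Infeld electrodynamics."
matched = match_keywords(text)
print([kw["prefLabel"] for kw in matched])
# prints: ['Born-Infeld model', 'nonlinear']
print(standalone_labels(matched))
# prints: ['Born-Infeld model']
```

In both cases the user only ever sees the preferred label: the hidden labels widen what is matched, while the nostandalone flag narrows what is reported as a single keyword, leaving the match available for building composite keywords.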