The code behind BibClassify: the extraction algorithm
This section provides a detailed account of the phrase matching techniques used by BibClassify to automatically extract significant terms from fulltext documents. While reading this guide, you are advised to refer to the original BibClassify code.
BibClassify extracts two types of keywords from a fulltext document: single (or main) keywords and composite keywords. Single keywords consist of one or more words ("scalar" or "field theory"). Composite keywords combine several single keywords ("field theory: scalar") and are reported only if those single keywords are found combined in the fulltext. All keywords are stored in an RDF/SKOS taxonomy or in a simple keyword file. When using the keyword file, only single keywords can be extracted.
The bulk of the extraction mechanism takes place inside the functions get_single_keywords and get_composite_keywords in bibclassify_keyword_analyzer.py.
This paragraph explains the code of bibclassify_ontology_handler.py.
BibClassify handles the taxonomy differently depending on whether it is running in standalone file mode (from the sources) or as an Invenio module. In both cases, the taxonomy is specified through the -k, --taxonomy option. In standalone file mode, the argument has to be a path when running in normal mode. In module mode, the argument refers to the ontology short name found in the clsMETHOD table (e.g. "HEP" for the taxonomy "HEP.rdf"). The ontology long name ("HEP.rdf") and its reference URL are also accepted. The reference URL is stored in the "location" column of the clsMETHOD table.
In standalone file mode, we just compare the date of modification of the taxonomy file and the date of creation of the cache file. If the cache is older than the ontology, we regenerate it.
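The freshness check described above can be sketched as follows. This is a minimal illustration rather than BibClassify's actual code: the function name cache_is_stale and its arguments are hypothetical, and the cache file's modification time stands in for its creation date.

```python
import os


def cache_is_stale(taxonomy_path, cache_path):
    """Return True if the cache file must be (re)generated.

    Hypothetical helper: the cache is considered stale when it does not
    exist yet or when it is older than the taxonomy file.
    """
    if not os.path.exists(cache_path):
        return True
    # Assumption: the mtime of the cache approximates its creation date.
    return os.path.getmtime(cache_path) < os.path.getmtime(taxonomy_path)
```

A caller would regenerate the cache whenever this function returns True and load the existing cache otherwise.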
In module mode, we first check the modification date of the reference ontology by performing an HTTP HEAD request. We compare this date with the date of the locally stored ontology and, if the remote copy is newer, download it. This ensures that BibClassify always uses the latest ontology available. Cache management is then the same as in standalone mode.
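The remote check described above could look roughly like the sketch below. It assumes the server reports the ontology date in the standard Last-Modified header; the function names fetch_last_modified and remote_is_newer are hypothetical and do not come from the BibClassify sources.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def fetch_last_modified(url):
    """Send an HTTP HEAD request and return the Last-Modified header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Last-Modified")


def remote_is_newer(last_modified_header, local_mtime):
    """Compare a Last-Modified header with a local mtime (seconds since epoch)."""
    if last_modified_header is None:
        # Assumption: without a date from the server, keep the local copy.
        return False
    remote = parsedate_to_datetime(last_modified_header)
    if remote.tzinfo is None:
        # Treat a date without timezone information as UTC.
        remote = remote.replace(tzinfo=timezone.utc)
    local = datetime.fromtimestamp(local_mtime, tz=timezone.utc)
    return remote > local
```

Separating the network call from the date comparison keeps the comparison easy to test without a server.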
To generate the cache file, the taxonomy is parsed and stored in memory using RDFLib.
The cache consists of dictionaries of SingleKeyword and CompositeKeyword objects. These objects contain a meaningful description of the keywords, together with compiled regular expressions that are used to find the keywords in the fulltext. These regular expressions are described in paragraph 4.
This paragraph discusses the way BibClassify manages the fulltext of records. The source code discussed can be found in bibclassify_text_extractor and bibclassify_text_normalizer.
The code of bibclassify_text_extractor.py will soon be updated and therefore the documentation for this module is pending.
The extraction of PDF documents in the field of HEP can lead to some inconsistencies in the document because of mathematical formulas and Greek letters. bibclassify_text_normalizer.py takes care of these problems by running a set of correcting regular expressions on the extracted fulltext. These regular expressions can be found in the configuration file of BibClassify.
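As an illustration of this approach, the sketch below applies a list of (pattern, replacement) pairs to a text in sequence. The two rules shown are hypothetical examples chosen for readability; the real correction rules are those defined in the BibClassify configuration file, and normalize_fulltext is not the actual function name.

```python
import re

# Hypothetical correction rules, not the ones shipped with BibClassify.
CORRECTIONS = [
    # Rejoin words that were hyphenated across a line break.
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),
    # Collapse runs of spaces and tabs into a single space.
    (re.compile(r"[ \t]+"), " "),
]


def normalize_fulltext(text, corrections=CORRECTIONS):
    """Apply each correcting regular expression to the text, in order."""
    for pattern, replacement in corrections:
        text = pattern.sub(replacement, text)
    return text
```

Because the rules are applied in order, a later rule sees the output of the earlier ones, so the order of the list matters.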
For each single and composite keyword, the taxonomy contains different labels:
For each of these labels, we compile and cache regular expressions. The way the regular expressions are built is described in the configuration file of BibClassify.
When searching for single keywords in a fulltext, we run the corresponding set of regular expressions on the text and store the number of matches and the position of the keywords in the text.
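A minimal sketch of this search step is given below. It assumes each keyword carries a list of compiled regular expressions, and it records the span of every match; the function name find_keyword and the sample pattern are hypothetical.

```python
import re


def find_keyword(patterns, fulltext):
    """Return the sorted (start, end) spans matched by a keyword's patterns.

    A set is used so that a position matched by two patterns of the same
    keyword is only counted once.
    """
    spans = set()
    for pattern in patterns:
        for match in pattern.finditer(fulltext):
            spans.add(match.span())
    return sorted(spans)


# Hypothetical pattern covering the singular and plural forms of one keyword.
patterns = [re.compile(r"\bfield theor(?:y|ies)\b", re.IGNORECASE)]
text = "Scalar field theory and other field theories."
spans = find_keyword(patterns, text)
occurrences = len(spans)
```

The number of matches is simply the length of the span list, and the spans themselves are what the composite keyword search needs later on.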
For each composite keyword, we first run the regular expressions corresponding to alternative and hidden labels. This is similar to the search for single keywords.
Then, for each composite keyword, we check whether all of its components were found in the fulltext. If so, we check the positions of the single keywords in the text. If the single keywords are adjacent, we have found a composite keyword. If not, we check whether the words between the single keywords are valid separators (defined in the configuration file of BibClassify), in which case the match also counts.
The result of this operation is a list of composite keywords with the total number of occurrences. Occurrences for all concerned single keywords are also attached to this list.
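The position check for a composite keyword with two components can be sketched as follows. This is a simplified illustration under stated assumptions: the separator list is hypothetical (the real one comes from the BibClassify configuration), the function name count_composite is invented, and only two components are handled.

```python
# Hypothetical separators; BibClassify reads the valid ones from its configuration.
VALID_SEPARATORS = {"of", "the", "in", "for"}


def count_composite(spans_a, spans_b, fulltext, separators=VALID_SEPARATORS):
    """Count occurrences where the two components are combined in the text.

    Two spans are combined if they are adjacent (only whitespace between
    them) or if every word between them is a valid separator.
    """
    count = 0
    for span_a in spans_a:
        for span_b in spans_b:
            first, second = sorted((span_a, span_b))
            if first[1] > second[0]:
                continue  # the two matches overlap
            between = fulltext[first[1]:second[0]].lower().split()
            if all(word in separators for word in between):
                count += 1
    return count
```

For a text such as "a field theory of the scalar kind", the components "field theory" and "scalar" are separated only by "of the", so the pair counts as one occurrence of the composite keyword.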
Before presenting the results to the user, some extra filtering occurs, primarily to refine the output keywords. The main post-processing actions performed on the results are:
The final results presented to the user consist of the top 20 (configurable) single keywords and the best composite keywords. The results may be presented in different formats (text output or MARCXML). Sample text output can be found in the BibClassify Admin Guide.
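Selecting the best keywords can be sketched as a sort followed by a cut-off. Ranking by raw occurrence count is an assumption made for this example; the text above does not specify how BibClassify scores keywords, and best_keywords is a hypothetical name.

```python
def best_keywords(occurrences, limit=20):
    """Return the `limit` most frequent keywords as (keyword, count) pairs.

    Assumption: keywords are ranked by occurrence count, with ties broken
    alphabetically so that the output is deterministic.
    """
    ranked = sorted(occurrences.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

The limit parameter plays the role of the configurable number of keywords mentioned above.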
BibClassify also extracts author keywords when the option '--detect-author-keywords' is set. It searches the fulltext for the string of keywords supplied by the authors, splits that string into individual keywords and outputs them.
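A rough sketch of such a detection is shown below. The pattern used to locate the keyword string (a line starting with "Keywords") and the separators used to split it (commas and semicolons) are assumptions made for this example; the actual rules used by BibClassify may differ, and detect_author_keywords is a hypothetical name.

```python
import re

# Assumption: author keywords appear on a line introduced by "Keywords" or "Key words".
KEYWORDS_LINE = re.compile(
    r"^\s*key\s?words?\s*[:.\-]?\s*(.+)$", re.IGNORECASE | re.MULTILINE
)


def detect_author_keywords(fulltext):
    """Find the author keyword string and split it into individual keywords."""
    match = KEYWORDS_LINE.search(fulltext)
    if match is None:
        return []
    # Assumption: keywords are separated by commas or semicolons.
    parts = re.split(r"[;,]", match.group(1))
    return [part.strip(" .") for part in parts if part.strip(" .")]
```

A document without a recognizable keyword line simply yields an empty list.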