BibAuthority Internals
There are two cases that need special attention when indexing bibliographic data that contains references to authority records. The first case is relatively simple: whenever a bibliographic record is indexed, its data must be enriched with data from the authority records it references. The second is more complex, for it requires detecting which bibliographic records should be re-indexed because authority records they reference have been updated within a given date range.
First of all, we need to say something about how INVENIO lets the admin index the data. INVENIO's indexer (BibIndex) is always run as a task that is executed by INVENIO's scheduler (BibSched). Typically, this is done either by scheduling a bibindex task manually from the command line, or as part of a periodic task (BibTask) run directly from BibSched, typically every 5 minutes. If it is run manually, the user has the option of specifying certain record IDs to be re-indexed, e.g. by specifying ranges of IDs or collections to be re-indexed. In this case, the selected records are re-indexed whether or not there were any modifications to the data. Alternatively, the user can specify a date range, in which case the indexer will search for all the record IDs that have been modified in the selected date range (by default, the date range covers all IDs modified since the last time the indexer was run) and update the index only for those records. As a third option, the user can specify specific types of indexes. INVENIO lets you search by different criteria (e.g. 'any field', 'title', 'author', 'abstract', 'keyword', 'journal', 'year', 'fulltext', …), and each of these criteria corresponds to a separate index, indexing only the data from the relevant MARC subfields. Normally, the indexer would update all index types for any given record ID, but with this third option, the user can limit the re-indexing to specific types of indexes if desired.
Note: In reality, INVENIO creates not just 1 but 6 different indexes per index type. 3 are forward indexes (mapping words, pairs or phrases to record IDs), 3 are reverse indexes (mapping record IDs to words, pairs or phrases). The word, pair and phrase indexes are used for optimizing the searching speed depending on whether the user searches for words, sub-phrases or entire phrases. These details are however not relevant for BibAuthority. It simply finds the values to be indexed and passes them on to the indexer, which indexes them as if they were data coming directly from the bibliographic record.
Once the indexer knows which record ID (and optionally, which index type) to re-index, including authority data is simply a question of checking whether the MARC subfields currently being indexed are under authority control (as specified in the BibAuthority configuration file). If they are, the indexer must apply the following (pseudo-)algorithm, which fetches the necessary data from the referenced authority records:
For each subfield and each record ID currently being re-indexed:
    If the subfield is under authority control (→ config file):
        Get the type of referenced authority record expected for this field
        For each authority record control number found in the corresponding 'XXX__0' subfields and matching the expected authority record type (control number prefix):
            Find the authority record ID (MARC field '001' control number) corresponding to the authority record control number (as contained in MARC field '035' of the authority record)
            For each authority record subfield marked as index relevant for the given $type (→ config file):
                Add the values of these subfields to the list of values to be returned and used for enriching the indexed strings.
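The steps above can be sketched in Python as follows. This is only an illustration working on in-memory dictionaries, not INVENIO's actual code: the names `CONTROLLED_SUBFIELDS`, `INDEX_RELEVANT_SUBFIELDS`, `AUTHORITY_RECORDS`, `find_authority_record_id` and `get_authority_values` are hypothetical, the lowercase type names and the ':' separator are assumptions, and storing the control number under '035__a' is an assumption as well (the text only names field '035').

```python
# Hypothetical stand-in for the BibAuthority configuration file.
TYPE_SEPARATOR = ":"
CONTROLLED_SUBFIELDS = {"700__a": "author", "700__u": "institute"}
INDEX_RELEVANT_SUBFIELDS = {
    "author": ["100__a", "400__a"],
    "institute": ["110__a", "410__a"],
}

# Toy authority records keyed by record ID (MARC field '001').
# Assumption: the control number is stored in subfield '035__a'.
AUTHORITY_RECORDS = {
    118: {
        "035__a": ["(SzGeCERN)abc1234"],
        "100__a": ["Ellis, John"],
        "400__a": ["Ellis, J. R."],
    },
    119: {
        "035__a": ["(SzGeCERN)xyz4321"],
        "110__a": ["CERN"],
        "410__a": ["European Organization for Nuclear Research"],
    },
}


def find_authority_record_id(control_number):
    """Return the ID of the authority record carrying this control number."""
    for record_id, record in AUTHORITY_RECORDS.items():
        if control_number in record.get("035__a", []):
            return record_id
    return None


def get_authority_values(bib_record, subfield):
    """Collect index-relevant authority values for one bibliographic subfield."""
    expected_type = CONTROLLED_SUBFIELDS.get(subfield)
    if expected_type is None:
        return []  # subfield is not under authority control
    prefix = expected_type + TYPE_SEPARATOR
    reference_subfield = subfield[:-1] + "0"  # e.g. '700__a' -> '700__0'
    values = []
    for reference in bib_record.get(reference_subfield, []):
        if not reference.startswith(prefix):
            continue  # reference points to another type of authority record
        authority_id = find_authority_record_id(reference[len(prefix):])
        if authority_id is None:
            continue  # dangling reference, nothing to add
        authority_record = AUTHORITY_RECORDS[authority_id]
        for authority_subfield in INDEX_RELEVANT_SUBFIELDS.get(expected_type, []):
            values.extend(authority_record.get(authority_subfield, []))
    return values


bib_record = {
    "700__a": ["Ellis, John"],
    "700__u": ["CERN"],
    "700__0": ["author:(SzGeCERN)abc1234", "institute:(SzGeCERN)xyz4321"],
}
author_values = get_authority_values(bib_record, "700__a")
institute_values = get_authority_values(bib_record, "700__u")
```

In this toy setup, indexing '700__a' also picks up the name variant stored in the author authority record, while indexing '700__u' picks up the full name of the institute.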
The strings collected with this algorithm are simply added to the strings already found by the indexer in the regular bibliographic record MARC data. Once all the strings are collected, the indexer goes on with the usual operation, parsing them 3 different times, once for phrases, once for word-pairs, once for words, which are used to populate the 6 forward and reverse index tables in the database.
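As a rough illustration of the three parsing passes, the sketch below splits one collected string into a phrase, word pairs and single words. The function name `tokenize_for_indexes` is hypothetical, and splitting on whitespace with lowercasing is an assumption made for brevity; INVENIO's real tokenizers are more elaborate.

```python
def tokenize_for_indexes(value):
    """Split one string into the terms used by the phrase, pair and word indexes."""
    words = value.lower().split()
    # Each word is paired with its right-hand neighbour.
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return {"phrase": [value], "pair": pairs, "word": words}


terms = tokenize_for_indexes("European Organization for Nuclear Research")
```

Strings coming from authority records would go through the same passes as strings found in the bibliographic record itself.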
When a bibindex task is created by date range, we are presented with a trickier situation which requires more complex treatment to work properly. As long as the bibindex task is configured to index by record ID, the simple algorithm described above is enough to properly index the authority data along with the data from bibliographic records. This is also true if we use the third option described above, specifying the particular index type to re-index with the bibindex task. However, if we launch a bibindex task based on a date range (by default the date range covers the time since bibindex was last run for each of the index types), bibindex would have no way of knowing that it must update the index for a specific bibliographic record if one of the authority records it references was modified in the specified date range. This would lead to incomplete indexes.
A first idea was to modify the time-stamp of the dependent bibliographic records as soon as an authority record is modified. Every MARC record in INVENIO has a 'modification_date' time-stamp which indicates to the indexer when this record was last modified. If we search for dependent bibliographic records every time we modify an authority record, and if we then update the 'modification_date' time-stamp for each of these dependent bibliographic records, then we can be sure that the indexer would find and re-index these bibliographic records as well when indexing by a specified date range. The problem with this approach is performance. If we update the time-stamp for the bibliographic record, this record will be re-indexed for all of the mentioned index-types ('author', 'abstract', 'fulltext', etc.), even though many of them may not cover MARC subfields that are under authority control, and hence re-indexing them because of a change in an authority record would be quite useless. In an INVENIO installation there would typically be 15-30 index-types. Imagine that you make a change to a 'journal' authority record and only 1 out of the 20+ index-types is for 'journal'. INVENIO would be re-indexing 20+ index types instead of only the 1 index type which is relevant to the type of the changed authority record.
There are two approaches that could solve this problem equally well. The first approach would require checking, for each authority record ID which is to be re-indexed, whether there are any dependent bibliographic records that need to be re-indexed as well. If done in the right manner, this approach would only re-index the necessary index types that can contain information from referenced authority records, and the user could specify the index type to be re-indexed and the right bibliographic records would still be found. The second approach works the other way around. Instead of waiting until we find a recently modified authority record, and then looking for dependent bibliographic records, we directly launch a search for bibliographic records containing links to recently updated authority records and add the record IDs found in this way to the list of record IDs that need to be re-indexed.
Of the two approaches, the second one was chosen based solely upon considerations of integration into existing INVENIO code. As indexing in INVENIO currently works, it is more natural and easily readable to apply the second method than the first.
According to the second method, the pseudo-algorithm for finding the bibliographic record IDs that need to be updated based upon recently modified authority records in a given date range looks like this:
For each index-type to re-index:
    For each subfield concerned by the index-type:
        If the subfield is under authority control (→ config file):
            Get the type of authority record associated with this field
            Get all of the record IDs for authority records updated in the specified date range.
            For each record ID:
                Get the authority record control numbers of this record ID
                For each authority record control number:
                    Search for and add the record IDs of bibliographic records containing this control number (with type in the prefix) in the 'XXX__0' field of the current subfield to the list of record IDs to be returned to the caller to be marked as needing re-indexing.
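A minimal Python sketch of this second pseudo-algorithm is given below. It again works on toy in-memory data rather than on INVENIO's database: `INDEX_SUBFIELDS`, `CONTROLLED_SUBFIELDS`, `AUTHORITY_RECORDS`, `BIB_RECORDS` and `get_dependent_bib_record_ids` are hypothetical names, the index-type-to-subfield mapping and the modification dates are made up for the example, and the linear scan over bibliographic records stands in for what would be a search query in practice.

```python
from datetime import datetime

TYPE_SEPARATOR = ":"
# Hypothetical configuration: subfields covered by each index type, and
# the authority record type controlling each subfield.
INDEX_SUBFIELDS = {
    "author": ["100__a", "700__a"],
    "affiliation": ["700__u"],
    "title": ["245__a"],
}
CONTROLLED_SUBFIELDS = {"100__a": "author", "700__a": "author", "700__u": "institute"}

# Toy authority records: modification date and control numbers.
AUTHORITY_RECORDS = {
    118: {"modified": datetime(2011, 5, 2), "control_numbers": ["(SzGeCERN)abc1234"]},
    119: {"modified": datetime(2011, 1, 10), "control_numbers": ["(SzGeCERN)xyz4321"]},
}

# Toy bibliographic records: only the $0 reference subfields are shown.
BIB_RECORDS = {
    1: {"700__0": ["author:(SzGeCERN)abc1234"]},
    2: {"700__0": ["institute:(SzGeCERN)xyz4321"]},
    3: {"700__0": ["author:(SzGeCERN)abc1234", "institute:(SzGeCERN)xyz4321"]},
}


def get_dependent_bib_record_ids(index_types, date_from, date_to):
    """Find bibliographic records that reference recently modified authority records."""
    record_ids = set()
    for index_type in index_types:
        for subfield in INDEX_SUBFIELDS.get(index_type, []):
            authority_type = CONTROLLED_SUBFIELDS.get(subfield)
            if authority_type is None:
                continue  # subfield is not under authority control
            reference_subfield = subfield[:-1] + "0"  # e.g. '700__u' -> '700__0'
            for authority in AUTHORITY_RECORDS.values():
                if not date_from <= authority["modified"] <= date_to:
                    continue  # not modified in the requested date range
                for control_number in authority["control_numbers"]:
                    reference = authority_type + TYPE_SEPARATOR + control_number
                    # Stand-in for a search on the 'XXX__0' subfield.
                    for bib_id, bib_record in BIB_RECORDS.items():
                        if reference in bib_record.get(reference_subfield, []):
                            record_ids.add(bib_id)
    return record_ids


may_start, may_end = datetime(2011, 5, 1), datetime(2011, 5, 31)
author_hits = get_dependent_bib_record_ids(["author"], may_start, may_end)
affiliation_hits = get_dependent_bib_record_ids(["affiliation"], may_start, may_end)
```

Because the search string carries the type prefix, a modified author authority record only triggers re-indexing for index types whose subfields are controlled by author records; in the toy data the 'affiliation' index finds nothing to do for the same date range.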
The record IDs returned in this way are added to the record IDs that need to be re-indexed (by date range) and then the rest of the indexing can run as usual.
The pseudo-algorithms described above were used as described in this document, but were not each implemented in a single function. In order for parts of them to be reusable, and for the various parts to be properly integrated into existing Python modules with similar functionality (e.g. auxiliary search functions were added to INVENIO's search_engine.py code), the pseudo-algorithms were split up into multiple nested function calls and integrated where they seemed to best fit the existing code base of INVENIO. In the case of the pseudo-algorithm described in “Updating the index by date range”, the very choice of the algorithm had already depended on how to best integrate it into the existing code for date-range related indexing.
In order to reference authority records, we use alphanumeric strings stored in the $0 subfields of fields that contain other, authority-controlled subfields as well. The format of these alphanumeric strings for INVENIO is in part determined by the MARC standard itself, which states that:
Subfield $0 contains the system control number of the related authority record, or a standard identifier such as an International Standard Name Identifier (ISNI). The control number or identifier is preceded by the appropriate MARC Organization code (for a related authority record) or the Standard Identifier source code (for a standard identifier scheme), enclosed in parentheses. See MARC Code List for Organizations for a listing of organization codes and Standard Identifier Source Codes for code systems for standard identifiers. Subfield $0 is repeatable for different control numbers or identifiers.
An example of such a string could be “(SzGeCERN)abc1234”, where “SzGeCERN” would be the MARC organization code, and “abc1234” would be the unique identifier for this authority record within the given organization.
Since it is possible for a single field (e.g. field '100') to have multiple $0 subfields for the same field entry, we need a way to specify which $0 subfield reference is associated with which other subfield of the same field entry.
For example, imagine that in bibliographic records both '700__a' ('other author' name) and '700__u' ('other author' affiliation) are under authority control. In this case we would have two '700__0' subfields. One of them would reference the author authority record (for the name), the other one would reference an institute authority record (for the affiliation). INVENIO needs some way to know which $0 subfield is associated with the $a subfield and which one with the $u subfield.
We have chosen to solve this in the following way. Every $0 subfield value will not only contain the authority record control number, but in addition will be prefixed by the type of authority record (e.g. 'AUTHOR', 'INSTITUTE', 'JOURNAL' or 'SUBJECT'), separated from the control number by a separator, e.g. ':' (configurable). A possible $0 subfield value could therefore be: “author:(SzGeCERN)abc1234”. This will allow INVENIO to know that the $0 subfield containing “author:(SzGeCERN)abc1234” is associated with the $a subfield (author's name), containing e.g. “Ellis, John”, whereas the $0 subfield containing “institute:(SzGeCERN)xyz4321” is associated with the $u subfield (author's affiliation/institute) of the same field entry, containing e.g. “CERN”.
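To make the format concrete, the following sketch splits such a $0 value into the authority record type, the MARC organization code and the identifier within that organization. The function name `parse_authority_reference` is hypothetical and the parsing rules are an assumption derived only from the examples above; the separator is passed as a parameter because the text states that it is configurable.

```python
import re

# Matches a control number such as '(SzGeCERN)abc1234'.
CONTROL_NUMBER_PATTERN = re.compile(r"^\((?P<org>[^)]+)\)(?P<local_id>.+)$")


def parse_authority_reference(value, separator=":"):
    """Split a $0 value into (type, organization code, local identifier)."""
    record_type, found, control_number = value.partition(separator)
    if not found:
        raise ValueError("missing type prefix in %r" % value)
    match = CONTROL_NUMBER_PATTERN.match(control_number)
    if match is None:
        raise ValueError("malformed control number in %r" % value)
    return record_type, match.group("org"), match.group("local_id")


author_reference = parse_authority_reference("author:(SzGeCERN)abc1234")
institute_reference = parse_authority_reference("institute:(SzGeCERN)xyz4321")
```

With the type extracted, the $0 value carrying 'author' can be matched to the $a subfield and the one carrying 'institute' to the $u subfield of the same field entry.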