The BibConvert utility enables you to convert metadata records from various metadata formats into another metadata format supported by the Invenio local database. It is designed to process XML harvested metadata records, converting them into MARC21 before they are uploaded into the database. However, BibConvert is flexible enough to deal with other structured metadata as well, according to your needs, and offers a way to insert what you want into the database.
BibConvert is suitable for tasks such as conversion of records received from multiple data sources, or conversion of records from another system that may support a different metadata format.
In order to cover a wider range of possible conversions, BibConvert has 2 different modes, each dealing with different types of data, and each using different configuration files.
The configuration files are located in the etc/bibconvert/config directory of your Invenio installation.
The -c option is used to specify which transformation stylesheet to apply to the piped XML:
$ bibconvert -coaidc2marcxml.xsl < sample.xml > /tmp/record.xml
If the stylesheet you want to use is installed in the etc/bibconvert/config directory of your Invenio installation, then you can just refer to it by its filename. Otherwise use the full path to the file.
OAI DublinCore into MARC21 and OAI MARC into MARC21 configurations will be provided as default configuration, ensuring the standard uploading sequence (incl. OAIHarvest and BibUpload utilities). Other configurations can be created according to your needs. The configuration file that has to be created for each data source is a text file with following structure:
### the configuration starts here
### Configuration of bibconvert templates
### source data :
=== data extraction configuration template ===
### here comes the data extraction configuration template
# entry example: AU---%A---MAX---;---
# extracts maximum available data by field from metadata record
# the values are found between specified tags
# in this case between the '%A' tag and other tags defined
# repetitive values are recognized by a semicolon separator
# resp. by multiple presence of '%A' tag
=== data source configuration template ===
### here comes the data source configuration template
# entry example: AU---<:FIRSTNAME:>-<:SURNAME:>
# describes the contents of extracted source data fields
# in this case, the field AU is described as having two distinct subfields
=== data target configuration template ===
### here comes the data target configuration template
# entry example: AU::CONF(AU,,0)---<datafield id="700" ind1="" ind2=""><subfield code="a"><:AU*::SURNAME::CAP():>, <:AU*::FIRSTNAME::ABR():></subfield></datafield>
# This section concerns the desired output, while the previous two focus on the data source structures.
# Each line corresponds to one output line, composed of given literals and values from extracted source data fields.
# In this example, the XML Marc21 output line is defined,
# containing re-formatted values of source fields SURNAME and FIRSTNAME
### the configuration ends here
Once a configuration has been prepared, BibConvert converts the source data file according to it in batch mode. BibConvert is fully compatible with the Uploader1.x configuration language. For more information, have a look at the BibConvert Configuration Guide section below.
For a fully functional demo, consider the following sample input data:
sample.dat -- sample bibliographic data to be converted and inputted into Invenio
sample.cfg -- sample configuration file, featuring knowledge base demo
To convert the above data into XML MARC, use the following command:
$ bibconvert -b'<collection>' -csample.cfg -e'</collection>' < sample.dat > /tmp/sample.xml
and see the XML MARC output file. You would then continue the upload procedure by calling BibUpload.
Other useful BibConvert configuration examples:
dcq.cfg -- Qualified Dublin Core in SGML to XML MARC example
dcq.dat -- corresponding data file, featuring collection identifiers demo
dcxml-to-marcxml.cfg -- OAI XML Dublin Core to XML MARC example
bibtex.cfg -- BibTeX to XML MARC example
- Create/edit "data extraction configuration template" section of the configuration file.
- Each line of this section stands for a definition of one source field:
name---keyword---terminating string---separator---
- Choose a (valid) name allowed by the system
- Enter keyword and terminating string, which are boundary tags for
the wanted value extraction
- In case the field is repetitive, enter the value separator
- "---" is a mandatory separator between all values, even zero-length ones
- MAX/MIN keywords can be used instead of terminating string
Example of a definition of author (repetitive) and title (non-repetitive) fields:
=== data extraction configuration template ===
### here comes the data extraction configuration template
AU---AU_---MAX---;---
TI---TI_---EOL------
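To make the line layout concrete, here is a minimal Python sketch that splits one extraction-template line on the "---" separator. The function name parse_extraction_line and the returned dictionary keys are hypothetical and chosen only for illustration; this is not how BibConvert itself reads the file.

```python
def parse_extraction_line(line):
    """Split one extraction-template line into its four named parts."""
    parts = line.split("---")
    # A well-formed line holds four values followed by a trailing "---",
    # so splitting yields five parts with an empty final element.
    if len(parts) != 5 or parts[4] != "":
        raise ValueError("expected name---keyword---terminator---separator---")
    name, keyword, terminator, separator = parts[:4]
    return {
        "name": name,
        "keyword": keyword,
        "terminator": terminator,
        # An empty separator is read here as a non-repetitive field.
        "separator": separator or None,
    }


author = parse_extraction_line("AU---AU_---MAX---;---")
title = parse_extraction_line("TI---TI_---EOL------")
print(author)
print(title)
```

The title line shows why "---" is needed even around zero-length values: without the empty separator slot the line would no longer split into the expected parts.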
- Create/edit "data source configuration template" section of the configuration file.
name---{CONST<:SUBFIELD:>[CONST]}}
- Enter only constants that appear systematically.
Example of a definition of author (repetitive) and title (non-repetitive) fields:
- Create/edit "data target configuration template" section of the configuration file.
CODE---CONST<:name::SUBFIELD::FUNCT():>CONST<:GENERATED_VALUE:>
- CODE stands for a tag for readability (optional)
Example of a definition of author (repetitive) and title (non-repetitive) codes:
Example of a definition of a book editors field in which the newlines are preserved
so that they can be processed by the JOINMULTILINES formatting function:
Every function requires a certain number of parameters to be entered in brackets. If an insufficient number of parameters is present, the function uses default values. Default values are chosen so as to keep the original value.
The configuration of templates is case sensitive.
The following functions are available:
ADD(prefix,suffix) - add prefix/suffix
Adds a prefix/postfix to the value. This function can be used to add the proper field name as a prefix of the value itself:
ADD(WAU=,) prefix for the first author (which may have been taken from the field AU2)
KB compares the input value to the entries of a kb_file and may replace it with another value. If the input value is not recognized, it is kept without any modification by default. This default can be overridden by a _DEFAULT_---default value entry in the kb_file.
The file specified in the parameter is a text file representing a table
of values that correspond to each other:
{input_value---output_value}
KB(file,1) searches the exact value passed.
Edge spaces are not considered.
Output value is not further formatted.
ABR shortens the words in the input value according to the parameters specified. By default, only the initial character is kept and the output value is terminated by a dot.
ABRX abbreviates exclusively the words in the input value that reach the specified length limit. No suffix is appended to words shorter than the specified limit.
CUT removes a string from the value (the reverse of the "ADD" function).
REP searches the input value for the string specified in the first parameter. All such strings are replaced with the string specified in the second parameter.
SUP suppresses all groups of characters belonging to the type specified in the first parameter, or replaces them with the string specified in the second parameter.
Recognized types:
SPACE .. invisible chars incl. NEWLINE
LIM limits the value to the required number of characters by cutting excess characters on either the Left or the Right.
WORDS keeps the number of words specified in the first parameter and cuts the excess on either the Left or the Right.
MINL removes from the sentence all words shorter than the limit specified in the parameter.
MAXL removes all words with more characters than the limit specified in the parameter. Words with length exactly n are kept.
MINLW deletes the entire value if it is shorter than the specified limit.
To raise the required length of an output line in the configuration itself, apply the function to the total value:
AU::MINLW(25)---CER <:SYSNO:> AU L <:SURNAME:>, <:NAME:>
EXP shortens the value by removing words, depending on whether they contain the specified string. For example, EXP(@,0) keeps only the email address from the value.
EXPW shortens the sentence by removing words containing the specified type of character.
Types supported in EXPW function:
ALPHA .. alphabetic
Note: SPACE is not handled as a keyword, since all space characters
are considered as word separators.
IF compares the value with the first parameter. If the result is TRUE, the input value is replaced with the second parameter, otherwise the input value is replaced with the third parameter.
RANGE confirms an entry if it falls into the from-to range specified in the parameters, border values included. The keyword MAX can be used as the upper limit.
This is useful in the case of the AU code, where the first entry has a different definition from the other entries:
AU::RANGE(1,1)---CER <:SYSNO:> AU2 L <:AU::SURNAME:>, <:AU::NAME:> ... takes the first name from the defined AU field
Currently, the following date values are generated:
DATE generates the current date in the format given as a parameter, where n is the number of digits required. The format has to be given according to the ANSI C notation, i.e. the string is composed of the following components:
%a abbreviated weekday name
WEEK(-4) returns 48, if current week is 52
w current weekday
The system number, if generated like this, contains a variable value changing every second. Since the system number is an identifier of the record, it has to stay the same for the entire record being processed. Unlike the function DATE, which simply generates a value in the given format, SYSNO keeps the value persistent throughout the entire record and excludes collisions with other records generated within a period of one week, at one second granularity. It is therefore not possible to use the DATE function for generating a system number instead.
The system number is unique within a range of one week only, according to the current definition.
- Each line of this section stands for a definition of one source field
- A constant of non-zero length has to be defined between two discrete subfields
- "---" is a mandatory separator between the name and the source field definition
=== data source configuration template ===
TI---<:TI:>
AU---<:FIRSTNAME:>-<:SURNAME:>
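One way to picture what a source template describes is to turn it into a regular expression: the constants become literal text and each <:SUBFIELD:> becomes a named group. The Python sketch below does this under that assumption; split_subfields is a hypothetical helper, not BibConvert's own parsing routine, and its matching rules may differ from the real tool.

```python
import re


def split_subfields(template, value):
    """Split value into named subfields as described by a source template."""
    # re.split with a capturing group alternates constants and subfield names:
    # "<:A:>-<:B:>" -> ["", "A", "-", "B", ""]
    parts = re.split(r"<:(\w+):>", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)          # constant between subfields
        else:
            pattern += "(?P<%s>.*?)" % part     # subfield placeholder
    match = re.fullmatch(pattern, value)
    return match.groupdict() if match else {}


print(split_subfields("<:FIRSTNAME:>-<:SURNAME:>", "John-Smith"))
print(split_subfields("<:TI:>", "A title"))
```

The sketch also shows why a constant of non-zero length is needed between two subfields: without the "-" there would be no way to tell where FIRSTNAME ends and SURNAME begins.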
3.3 Step 3 Definition of target record
This definition describes the layout of the target record that is created by the conversion,
together with the correspondence to the source fields defined in step 2.
- Each line of this section stands for an output line created by the conversion.
- <name> corresponds to the name defined in steps 1 and 2
- "::" is a mandatory separator between the name and the subfield definition
- optionally, you can apply the appropriate formatting function(s) and generated values
- "::" is a mandatory separator between the subfield definition and the function(s)
- "---" is a mandatory separator between the tag and the output code definition
- mark repetitive source fields with an asterisk (*)
AU::CONF(AU,,0)---<datafield id="700" ind1="" ind2=""><subfield code="a"><:AU*::AU:></subfield></datafield>
TI::CONF(TI,,0)---<datafield id="245" ind1="" ind2=""><subfield code="a"><:TI::TI::SUP(SPACE, ):></subfield></datafield>
- preserve newlines in a source field for later use by formatting
functions by marking them with "^"
AU---<datafield id="773" ind1=" " ind2=" "><:BOOKEDITOR^::BOOKEDITOR::JOINMULTILINES(<subfield code="a">,</subfield>):></datafield>
With a value such as:
Test
Case, A
The results may be:
<datafield tag="773" ind1="" ind2=""><subfield code="a">Test</subfield><subfield code="a">Case, A</subfield></datafield>
3.4 Formatting in BibConvert
3.4.1 Definition of formatting functions
Every field can be processed with a variety of functions that
partially or entirely change the original value.
There are three types of functions available that take as element either
single characters, words or the entire value of processed field.
KB(kb_file,[0-9]) - lookup in kb_file and replace value
ABR(x,suffix)/ABRW(x,suffix) - abbreviation with suffix addition
ABRX() - abbreviate exclusively words longer than given limit
CUT(prefix,postfix) - remove substring from side
REP(x,y) - replacement of characters
SUP(type) - suppression of characters of specified type
LIM(n,L/R)/LIMW(str,L/R) - restriction to n letters
WORDS(n,side) - restriction to n words from L/R
MINL(n)/MAXL(n) - replacement of words shorter/greater
than n
MINLW(n) - replacement of short values
EXP(str,1|0)/EXPW(type) - replacement of words from
value if containing spec. type/string
IF(value,valueT,valueF) - replace T/F value
UP/DOWN/CAP/SHAPE/NUM - upper case, lower case, capitals, shape, number
SPLIT(n,h,str,from)/SPLITW(sep,h,str,from) - split
into more lines
CONF(field,value,1/0)/CONFL(value,1/0) - confirm validity
of a field
RANGE(from,to) - confirm only entries in the specified
range
DEFP() - default print
IFDEFP(field,value,1/0) - IF condition is met, default print
JOINMULTILINES(prefix,suffix) - Join a multiline string into a single line
with each segment having prefix and suffix
ADD(prefix,postfix)
default: ADD(,) no addition
KB(kb_file) - kb_file search
default: KB(kb_file,1/0/R)
KB(file,0) searches the KB code inside the value passed.
KB(file,2) as 0 but not case sensitive
KB(file,R) replacements are applied on substrings/characters only.
BibConvert looks up the value in KB_file in one of the following modes:
===========================================================
1 - case sensitive / match (default)
2 - not case sensitive / search
3 - case sensitive / search
4 - not case sensitive / match
5 - case sensitive / search (in KB)
6 - not case sensitive / search (in KB)
7 - case sensitive / search (reciprocal)
8 - not case sensitive / search (reciprocal)
9 - replace by _DEFAULT_ only
R - not case sensitive / search (reciprocal) replace
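As an illustration of the default mode 1 (case sensitive match), the following Python sketch loads a knowledge base from "input_value---output_value" lines and looks a value up. The names load_kb and kb_lookup and the sample entries are hypothetical; only the line format, the _DEFAULT_ entry, the handling of unrecognized values and the ignoring of edge spaces are taken from this guide.

```python
def load_kb(text):
    """Read 'input_value---output_value' lines into a dictionary."""
    table = {}
    for line in text.splitlines():
        if "---" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("---", 1)
        table[key.strip()] = value.strip()
    return table


def kb_lookup(value, table):
    """Exact, case sensitive lookup; edge spaces are not considered."""
    key = value.strip()
    if key in table:
        return table[key]
    # Unrecognized values are kept unless a _DEFAULT_ entry overrides that.
    return table.get("_DEFAULT_", value)


# Hypothetical sample knowledge bases, for illustration only.
plain_kb = load_kb("eng---English\nfre---French")
kb_with_default = load_kb("eng---English\n_DEFAULT_---Unknown")

print(kb_lookup(" eng ", plain_kb))        # recognized value is replaced
print(kb_lookup("ger", plain_kb))          # unrecognized value is kept
print(kb_lookup("ger", kb_with_default))   # _DEFAULT_ entry takes over
```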
ABR(x,trm),ABRW(x,trm) - abbreviate term to x places
with(out) postfix
default: ABR(1,.)
default: ABRW(1,.)
ABRW takes entire value as one word.
example          input                output
ABR()            firstname_surname    f._s.
ABR(1,)          firstname_surname    f_s
ABR(10,COMMA)    firstname_surname    firstname,_surname,
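The ABR behaviour in the examples can be approximated with a few lines of Python. This is only a sketch: the function name abr is hypothetical, the underscore in the examples is read here as a visible stand-in for a space, and the COMMA keyword is passed as a literal "," rather than being interpreted.

```python
def abr(value, places=1, suffix="."):
    """Keep the first `places` characters of each word and append `suffix`."""
    return " ".join(word[:places] + suffix for word in value.split(" "))


print(abr("firstname surname"))           # defaults, like ABR()
print(abr("firstname surname", 1, ""))    # like ABR(1,)
print(abr("firstname surname", 10, ","))  # like ABR(10,COMMA)
```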
ABRX() - abbreviate exclusively words longer than given limit
default: ABRX(1,.)
CUT(prefix,postfix) - remove substring from side
default: CUT(,)
REP(x,y) - replace x with y
default: REP(,) no replacement
SUP(type,string) - suppress chars of certain type
default: SUP(,) type not recognized
ALPHA .. alphabetic
NALPHA .. not alphabetic
NUM .. numeric
NNUM .. not numeric
ALNUM .. alphanumeric
NALNUM .. non alphanumeric
LOWER .. lower case
UPPER .. upper case
PUNCT .. punctuation
NPUNCT .. not punctuation
example         input       output
SUP(SPACE,-)    sep_1999    sep-1999
SUP(NNUM)       sep_1999    1999
SUP(NUM)        sep_1999    sep_
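A minimal Python sketch of SUP, assuming each type maps to an ASCII character class and that whole groups of matching characters are replaced at once. The function name sup and the TYPE_PATTERNS table are hypothetical and cover only a subset of the recognized types; the underscore in the examples is again read as a space.

```python
import re

# Assumed ASCII character classes for a subset of the SUP types.
TYPE_PATTERNS = {
    "SPACE": r"\s+",
    "ALPHA": r"[A-Za-z]+",
    "NALPHA": r"[^A-Za-z]+",
    "NUM": r"[0-9]+",
    "NNUM": r"[^0-9]+",
    "LOWER": r"[a-z]+",
    "UPPER": r"[A-Z]+",
}


def sup(value, type_name, replacement=""):
    """Suppress (or replace) every group of characters of the given type."""
    pattern = TYPE_PATTERNS.get(type_name)
    if pattern is None:
        return value  # type not recognized: keep the value unchanged
    return re.sub(pattern, replacement, value)


print(sup("sep 1999", "SPACE", "-"))
print(sup("sep 1999", "NNUM"))
print(sup("sep 1999", "NUM"))
```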
LIM(n,side)/LIMW(str,side) - limit to n letters while trimming L/R side
default: LIM(0,) no change
default: LIMW(,R) no change
LIMW locates the first occurrence of the (str) string and cuts either the Left or the Right side.
example      input       output
LIM(4,L)     sep_1999    1999
LIM(4,R)     sep_1999    sep_
LIMW(_,R)    sep_1999    sep_
WORDS(n,side) - limit to n words while trimming L/R side
default: WORDS(0,R)
example       input       output
WORDS(1,R)    Sep 1999    Sep
WORDS(1,L)    Sep 1999    1999
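The LIM and WORDS examples follow the same rule: the side parameter names the side that is cut, so L keeps the end of the value and R keeps the beginning. The Python sketch below mirrors that reading; the names lim and words_limit are hypothetical, and treating a limit of 0 as "no change" is an assumption based on the stated defaults.

```python
def lim(value, n, side):
    """Limit value to n characters, cutting the excess on the given side."""
    if n <= 0 or side not in ("L", "R"):
        return value  # assumed: defaults leave the value unchanged
    return value[-n:] if side == "L" else value[:n]


def words_limit(value, n, side="R"):
    """Limit value to n words, cutting the excess on the given side."""
    words = value.split()
    if n <= 0:
        return value  # assumed: WORDS(0,R) leaves the value unchanged
    kept = words[-n:] if side == "L" else words[:n]
    return " ".join(kept)


print(lim("sep 1999", 4, "L"))
print(lim("sep 1999", 4, "R"))
print(words_limit("Sep 1999", 1, "R"))
print(words_limit("Sep 1999", 1, "L"))
```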
MINL(n) - exp. words shorter than n
default: MINL(1)
The words with length exactly n are kept.
example    input                 output
MINL(2)    History of Physics    History of Physics
MINL(3)    History of Physics    History Physics
MAXL(n) - exp. words longer than n
default: MAXL(0)
example    input                 output
MAXL(2)    History of Physics    of
MAXL(3)    History of Physics    of
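Both word-length filters can be sketched in Python as simple list filters. The names minl and maxl are hypothetical, and the sketch does not model the default parameter values; it only reproduces the rule that words of length exactly n are kept.

```python
def minl(value, n):
    """Drop words shorter than n characters; words of length n are kept."""
    return " ".join(word for word in value.split() if len(word) >= n)


def maxl(value, n):
    """Drop words longer than n characters; words of length n are kept."""
    return " ".join(word for word in value.split() if len(word) <= n)


title = "History of Physics"
print(minl(title, 2))
print(minl(title, 3))
print(maxl(title, 2))
print(maxl(title, 3))
```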
MINLW(n) - replacement of short values
default: MINLW(1) (no change)
This is used for the validation of created records, where we have 20 characters in the header.
The default validation is MINLW(21), i.e. the record entry will not be considered valid unless it contains at least 21 characters including the header. This default setting can be overridden by the -l command line option.
EXP(str,1|0) - exp./approve word containing specified string
default: EXP(,0) leave all value
The second parameter states whether the string approves the word (0) or disables it (1).
example     input                       output
EXP(@,0)    mail to: libdesk@cern.ch    libdesk@cern.ch
EXP(:,1)    mail to: libdesk@cern.ch    mail libdesk@cern.ch
EXP(@)      mail to: libdesk@cern.ch    libdesk@cern.ch
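The two EXP modes can be sketched in Python as follows. The name exp_words is hypothetical and the sketch assumes words are separated by whitespace; mode 0 keeps only the words containing the string, mode 1 drops them.

```python
def exp_words(value, needle="", mode=0):
    """Keep (mode 0) or drop (mode 1) the words that contain needle."""
    words = value.split()
    if mode == 0:
        kept = [word for word in words if needle in word]
    else:
        kept = [word for word in words if needle not in word]
    return " ".join(kept)


text = "mail to: libdesk@cern.ch"
print(exp_words(text, "@", 0))
print(exp_words(text, ":", 1))
print(exp_words(text, "@"))
```

With the default empty string every word "contains" it, so the whole value is left in place, which matches the stated default.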
EXPW(type) - exp. word from value if containing spec. type
default: EXPW type not recognized
NALPHA .. not alphabetic
NUM .. numeric
NNUM .. not numeric
ALNUM .. alphanumeric
NALNUM .. non alphanumeric
LOWER .. lower case
UPPER .. upper case
PUNCT .. punctuation
NPUNCT .. non punctuation
example       input       output
EXPW(NNUM)    sep_1999    1999
EXPW(NUM)     sep_1999    sep
IF(value,valueT,valueF) - replace T/F value
default: IF(,,)
In case the input value has to be kept, whatever it is, the keyword
ORIG can be used (usually in the place of the third parameter)
example                  input       output
IF(sep_1999,sep)         sep_1999    sep
IF(oct_1999,oct)         sep_1999    (empty)
IF(oct_1999,oct,ORIG)    sep_1999    sep_1999
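The IF logic, including the ORIG keyword, can be sketched in Python as below. The name bib_if is hypothetical and the comparison is assumed to be a plain equality test; the sketch follows the textual description of ORIG, which keeps the input value whatever it is.

```python
def bib_if(value, compare, value_true="", value_false=""):
    """Return value_true if value equals compare, otherwise value_false."""
    result = value_true if value == compare else value_false
    # The ORIG keyword asks for the input value to be kept as it is.
    return value if result == "ORIG" else result


print(repr(bib_if("sep_1999", "sep_1999", "sep")))
print(repr(bib_if("sep_1999", "oct_1999", "oct")))
print(repr(bib_if("sep_1999", "oct_1999", "oct", "ORIG")))
```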
UP - upper case
Convert all characters to upper case
DOWN - lower case
Convert all characters to lower case
CAP - make capitals
Convert the initial character of each word to upper case
and the rest of characters to lower case
SHAPE - format string
Suppresses all invalid spaces
NUM - number
If it contains at least one digit, convert it into a
number by suppressing other characters. Leading zeroes are deleted.
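A rough Python rendering of CAP, SHAPE and NUM, under a few assumptions: "invalid spaces" is taken to mean leading, trailing and repeated spaces, and a value consisting only of zeroes is returned as "0". The function names are hypothetical.

```python
import re


def cap(value):
    """Upper-case the first character of each word, lower-case the rest."""
    return " ".join(word[:1].upper() + word[1:].lower()
                    for word in value.split(" "))


def shape(value):
    """Assumed meaning of SHAPE: trim edges and collapse repeated spaces."""
    return " ".join(value.split())


def num(value):
    """Keep only the digits and drop leading zeroes, if any digit is present."""
    digits = re.sub(r"[^0-9]", "", value)
    if not digits:
        return value  # no digit at all: leave the value unchanged
    return digits.lstrip("0") or "0"  # assumed result for an all-zero value


print(cap("jOHN smith"))
print(shape("  History   of  Physics "))
print(num("vol. 007"))
```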
SPLIT(n,h,str,from)
Splits the input value into more lines, where each line contains
at most (n+h+length of str) characters, (n) being the number of characters
following the number of characters in the header, specified in (h). The
header repeats at the beginning of each line. An additional string can
be inserted as a separator between the header and the following value.
This string is specified by the third parameter (str). It is possible to
restrict the application of (str) so it does not appear on the first line
by entering "2" for (from)
SPLITW(sep,h,str,from)
Splits the input value into more lines by replacing the line
separator stated in (sep) with CR/LFs. Also, as in the case of the SPLIT
function, the first (h) characters are taken as a header and repeat at
the beginning of each line. An additional string can be inserted
as a separator between the header and the following value. This string
is specified by the third parameter (str). It is possible to restrict the
application of (str) so it does not appear on the first line by entering
"2" for (from)
CONF(field,value,1/0) - confirm validity of a
field
The input value is taken as it is, or refused depending on
the value of some other field. In case the other (field) contains
the string specified in (value), then the input value is confirmed (1)
or refused (0).
CONFL(str,1|0) - confirm validity of a field
The input value is confirmed if it contains (1)/misses(0)
the specified string (str)
RANGE(from,to) - confirm only entries in the specified
range
Left side function of target template configuration section to select the desired
entries from the repetitive field.
The range can only be continuous.
AU::RANGE(2,MAX)---CER <:SYSNO:> AU L <:AU::SURNAME:>, <:AU::NAME:> ... takes the rest of the names from the AU field
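The selection performed by RANGE can be pictured as a 1-based, inclusive slice over the entries of a repetitive field. In the Python sketch below, select_range and the sample author list are hypothetical; only the inclusive borders and the MAX keyword come from this guide.

```python
def select_range(entries, start, end):
    """Return the entries whose 1-based position lies in [start, end]."""
    upper = len(entries) if end == "MAX" else int(end)
    return entries[start - 1:upper]


authors = ["Smith, J", "Jones, A", "Brown, C"]  # sample data for illustration
print(select_range(authors, 1, 1))      # first author only
print(select_range(authors, 2, "MAX"))  # all remaining authors
```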
DEFP() - default print
The value is printed by default even if it does not contain any variable input from the source file.
IFDEFP(field,value,1/0) - IF condition is met, default print
The line is printed by default (even if it does not contain any variable input from the source file) IF
a condition is met that depends on the value of some other field. The condition is basically either that "field"
contains "value" (in which case the 3rd parameter should be set to 1), or that "field" does NOT contain "value"
(in which case the 3rd parameter should be set to 0).
For example, given the following line:
690C::REP(EOL,)::IFDEFP(comboYEL,BOOK,1)---<datafield tag="690" ind1="C" ind2=" "><subfield code="a">BOOK</subfield></datafield>
We want to print the line if the (field) "comboYEL" contains the (value) "BOOK", otherwise we don't want to print it.
Therefore, the 3rd parameter is set to "1". However, in the following line:
690C::REP(EOL,)::IFDEFP(comboYEL,BOOK,0)---<datafield tag="690" ind1="C" ind2=" "><subfield code="a">OTHER</subfield></datafield>
We want to print the line if the (field) "comboYEL" does NOT contain the (value) "BOOK", otherwise we don't want to
print it. Therefore, the 3rd parameter is set to "0".
This is achieved by using "IFDEFP". If the line had contained variables, the "CONF" function would have been used
instead.
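The condition behind the two IFDEFP lines can be written out as a small Python sketch. The function ifdefp and the sample field value are hypothetical, and "contains" is assumed to be a plain substring test.

```python
def ifdefp(line, field_value, needle, flag):
    """Return the constant line if the condition holds, otherwise None."""
    contains = needle in field_value
    # flag 1: print when the field contains the value;
    # flag 0: print when it does not.
    return line if contains == (flag == 1) else None


combo = "BOOK"  # assumed content of the comboYEL field
print(ifdefp("690C BOOK line", combo, "BOOK", 1))
print(ifdefp("690C OTHER line", combo, "BOOK", 0))
```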
JOINMULTILINES(prefix,suffix) - Join a multiline string into a single line
with each segment having prefix and suffix
Given a field value with newlines in it, split the value on the newlines (\n)
and wrap each resulting segment in prefix and suffix. E.g.:
For the field XX with the value:
Test
Case, A
And the function call:
<:XX^::XX::JOINMULTILINES(<subfield code="a">,</subfield>):>
The results would be:
<subfield code="a">Test</subfield><subfield code="a">Case, A</subfield>
Note the ^ in <:XX^::XX:
Without the ^ the newlines will be lost, because BibConvert removes them, so
you'll never see an effect from this function.
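The effect of JOINMULTILINES on the example value can be reproduced with a short Python sketch. The function name join_multilines is hypothetical; it only illustrates the split-and-wrap behaviour described above.

```python
def join_multilines(value, prefix, suffix):
    """Wrap every line of value in prefix/suffix and join into one line."""
    return "".join(prefix + line + suffix for line in value.split("\n"))


result = join_multilines("Test\nCase, A", '<subfield code="a">', "</subfield>")
print(result)
```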
3.4.2 Generated values
In the template configurations, values can be either taken from the source
or generated in the process itself. This is mainly useful for evaluating constant values.
DATE(format,n)
default: DATE(,10)
%A full weekday name
%b abbreviated month name
%B full month name
%c date and time representation
%d decimal day of month number (01-31)
%H hour (00-23)(24 hour format)
%I hour (01-12)(12 hour format)
%j day of year(001-366)
%m month (01-12)
%M minute (00-59)
%p local equivalent of a.m. or p.m.
%S second (00-59)
%U week number in year (00-53)(starting with
Sunday)
%V week number in year
%w weekday (0-6)(starting with Sunday)
%W week number in year (00-53)(starting with
Monday)
%x local date representation
%X local time representation
%y year (no century prefix)
%Y year (with century prefix)
%Z time zone name
%% %
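Python's strftime accepts the same ANSI C components, so a quick way to check a format string is to try it on a fixed date. The sketch below uses an arbitrary sample moment and only locale-independent components; it illustrates the notation and is not BibConvert's DATE implementation.

```python
from datetime import datetime

# Arbitrary sample moment: 5 September 1999, 14:03:07.
moment = datetime(1999, 9, 5, 14, 3, 7)

iso_like = moment.strftime("%Y-%m-%d")   # year with century, month, day
short_year = moment.strftime("%y")       # year without century
hour_24 = moment.strftime("%H")          # hour on the 24 hour clock
hour_12 = moment.strftime("%I")          # hour on the 12 hour clock
day_of_year = moment.strftime("%j")      # day of year, zero padded
literal = moment.strftime("%%")          # a literal percent sign

print(iso_like, short_year, hour_24, hour_12, day_of_year, literal)
```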
WEEK(diff)
Enters the two-digit number of the current week (%V) increased
by specified difference.
If the resulting number is negative, the returned value is zero (00).
Values are kept up to 99, three digit values are shortened from the
left.
WEEK .. current week
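The arithmetic of WEEK can be sketched in Python as follows. The function name week is hypothetical, and the current week is passed in as a parameter to keep the example deterministic; in practice it could be taken from datetime.date.today().isocalendar()[1].

```python
def week(current_week, diff=0):
    """Return the two-digit week number shifted by diff."""
    number = current_week + diff
    if number < 0:
        number = 0                    # negative results are returned as 00
    return ("%02d" % number)[-2:]     # three digit values lose their left digit


print(week(52, -4))
print(week(1, -4))
print(week(52, 60))
```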
SYSNO
Works the same as DATE, however the format of the resulting value is
fixed so it complies with the requirements of further record handling.
The format is 'whhmmss', where:
hh current hour
mm current minute
ss current second
OAI
Inserts an OAI identifier incremented by one for each record.
The starting value used in the first record of the batch job can be specified on the command line using the -o<starting_value> option.