In order to periodically harvest metadata from one or several repositories, it is possible to organize OAI sources through the OAI Harvest Admin Interface.
The interface allows the administrator to add new repositories as well as edit and delete existing ones.
Once the repositories are defined, run the oaiharvest command-line tool to start periodical harvesting of these repositories based on your settings.
(Note that you can alternatively use the oaiharvest command-line tool to harvest a repository manually, independently of the settings in the OAI Harvest admin interface, and save the result into a file. This lets you specify the OAI arguments for the harvesting (verb, dates, sets, metadata format, etc.). You would then need to run bibconvert and bibupload to integrate the harvested records into your repository. This method can be useful for experimenting with repositories or for writing custom harvesting scripts, but it is not recommended for periodical harvesting. Run oaiharvest -h for additional information.)
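To illustrate what the OAI arguments mentioned above (verb, dates, sets, metadata format) look like on the wire, here is a minimal Python sketch that assembles an OAI-PMH request URL. The helper name `build_oai_request` is hypothetical and is not part of oaiharvest; the argument names (`verb`, `metadataPrefix`, `set`, `from`, `until`, `identifier`) are those defined by the OAI-PMH protocol, and the base URL is the example source used in this guide.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, metadata_prefix=None, set_spec=None,
                      from_date=None, until_date=None, identifier=None):
    """Return a GET URL for an OAI-PMH request (hypothetical helper).

    Arguments left as None are omitted from the query string.
    """
    params = {"verb": verb}
    optional = {
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
        "from": from_date,
        "until": until_date,
        "identifier": identifier,
    }
    # Keep only the arguments that were actually supplied.
    params.update({k: v for k, v in optional.items() if v is not None})
    return base_url + "?" + urlencode(params)


url = build_oai_request(
    "http://cds.cern.ch/oai2d",
    "ListRecords",
    metadata_prefix="marcxml",
    from_date="2005-05-05",
    until_date="2005-05-30",
)
print(url)
```

This only builds the URL; it does not contact the repository, and it says nothing about how oaiharvest itself issues requests.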
The first step requires the administrator to enter the baseURL of the OAI repository. This is done for validation purposes - i.e. to check that the baseURL actually points to an OAI-compliant repository.
(Note: the validation simply sends an 'Identify' query to the baseURL and checks the reply for crucial tags such as OAI-PMH and Identify.)
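The kind of check described in the note can be sketched in Python as follows. This is an assumption-laden illustration, not the code used by the admin interface: the function name `looks_like_oai_identify` and the sample response are hypothetical, and the sketch assumes the reply uses the standard OAI-PMH 2.0 XML namespace.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 namespace (assumed for the reply being checked).
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def looks_like_oai_identify(xml_text):
    """Return True if xml_text has an OAI-PMH root with an Identify child."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed XML, so not an OAI reply
    if root.tag != "{%s}OAI-PMH" % OAI_NS:
        return False
    return root.find("{%s}Identify" % OAI_NS) is not None


# Hypothetical, shortened Identify reply used only for illustration.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2005-05-05T12:00:00Z</responseDate>
  <request verb="Identify">http://cds.cern.ch/oai2d</request>
  <Identify>
    <repositoryName>Example repository</repositoryName>
    <baseURL>http://cds.cern.ch/oai2d</baseURL>
    <protocolVersion>2.0</protocolVersion>
  </Identify>
</OAI-PMH>"""

print(looks_like_oai_identify(sample))
print(looks_like_oai_identify("<html><body>Not OAI</body></html>"))
```

In practice the XML would come from an HTTP request to the baseURL with `verb=Identify`; the network call is left out here so the sketch stays self-contained.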
Once the baseURL is validated, the administrator is required to fill in the following fields:
Once a source has been added to the database, it will be visible in the overview page, as shown below:
name | baseURL | metadataprefix | frequency | bibconvertfile | postprocess | actions |
---|---|---|---|---|---|---|
cds | http://cds.cern.ch/oai2d | marcxml | daily | NULL | h-u | edit / delete / test / history / harvest |
At this point it will be possible to edit the definition of this source by clicking on the appropriate action button. All the fields described in 2.2.1 can be modified (except for the Starting date).
No validation is performed at this stage, so please take extra care when editing important fields such as baseURL and metadataprefix.
OAI repositories can be removed from the database by clicking on the appropriate action button in the overview page.
This interface makes it possible to test OAI harvesting settings. To see the harvesting and postprocessing results, the administrator has to provide a record identifier (inside the harvested OAI source) and click the "test" button. Two new frames will then appear. The upper one contains the original OAI XML harvested from the source. The lower one contains the result of all the postprocessing activities, or an error message.
Viewing the harvesting history
This page allows the administrator to see which records have been recently harvested. All the data is shown in database insertion order. If more than 10 records were inserted during a day, only the first 10 are displayed. To work with the remaining records, the administrator has to go to the day details page, which is available through the "View next entries..." link.
The same page also allows reharvesting records that were harvested in the past. To do so, select the appropriate records by ticking the checkboxes on the right side of each entry and click the "reharvest selected records" button.
Harvesting particular records
This page makes it possible to harvest records manually. The administrator has to provide the identifier of the record inside the OAI source. The record is then harvested, converted and filtered according to the source settings, and scheduled to be uploaded into the database.
Once administrators have set up their desired OAI repositories in the database through the Admin Interface, they can invoke oaiharvest to start up periodical harvesting.
Oaiharvest usage
    oaiharvest [options]

    Manual single-shot harvesting mode:
      -o, --output            specify output file
      -v, --verb              OAI verb to be executed
      -m, --method            http method (default POST)
      -p, --metadataPrefix    metadata format
      -i, --identifier        OAI identifier
      -s, --set               OAI set(s). Whitespace-separated list
      -r, --resuptionToken    Resume previous harvest
      -f, --from              from date (datestamp)
      -u, --until             until date (datestamp)
      -c, --certificate       path to public certificate (in case of certificate-based harvesting)
      -k, --key               path to private key (in case of certificate-based harvesting)
      -l, --user              username (in case of password-protected harvesting)
      -w, --password          password (in case of password-protected harvesting)

    Automatic periodical harvesting mode:
      -r, --repository="repo A"[,"repo B"]   which repositories to harvest (default=all)
      -d, --dates=yyyy-mm-dd:yyyy-mm-dd      reharvest given dates only

    Scheduling options:
      -u, --user=USER         User name under which to submit this task.
      -t, --runtime=TIME      Time to execute the task. [default=now]
                              Examples: +15s, 5m, 3h, 2002-10-27 13:57:26.
      -s, --sleeptime=SLEEP   Sleeping frequency after which to repeat the task.
                              Examples: 30m, 2h, 1d. [default=no]
      -L --limit=LIMIT        Time limit when it is allowed to execute the task.
                              Examples: 22:00-03:00, Sunday 01:00-05:00.
                              Syntax: [Wee[kday]] [hh[:mm][-hh[:mm]]].
      -P, --priority=PRI      Task priority (0=default, 1=higher, etc).
      -N, --name=NAME         Task specific name (advanced option).

    General options:
      -h, --help              Print this help.
      -V, --version           Print version information.
      -v, --verbose=LEVEL     Verbose level (0=min, 1=default, 9=max).
          --profile=STATS     Print profile information. STATS is a comma-separated
                              list of desired output stats (calls, cumulative,
                              file, line, module, name, nfl, pcalls, stdname, time).
oaiharvest performs a number of operations on the repositories listed in the database. By default, oaiharvest considers all repositories, one by one (this is overridden when the --repository argument is passed).
oaiharvest behaves according to the arguments passed at the command line:
If the --dates argument is not passed, it checks whether an update from the repository is needed (Note: the update status is calculated based on the time of the last harvesting and the frequency chosen by the administrator).
If the --dates argument is passed, it simply harvests the metadata of the repository from/until the given dates. The last update date is left unchanged.
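The update check described above (time of last harvesting compared against the chosen frequency) can be sketched as follows. This is only an illustration under stated assumptions: the function name `needs_update` is hypothetical, the frequency is assumed to be expressed in hours (so "daily" would be 24), and a source that has never been harvested is assumed to always need an update. The actual oaiharvest logic may differ.

```python
from datetime import datetime, timedelta


def needs_update(last_harvest, frequency_hours, now):
    """Return True if at least frequency_hours have passed since last_harvest.

    last_harvest may be None for a source that was never harvested
    (assumption: such a source is always due).
    """
    if last_harvest is None:
        return True
    return now - last_harvest >= timedelta(hours=frequency_hours)


now = datetime(2005, 5, 6, 12, 0)
# Harvested exactly 24 hours ago with a daily frequency: an update is due.
print(needs_update(datetime(2005, 5, 5, 12, 0), 24, now))
# Harvested 12 hours ago with a daily frequency: no update is due yet.
print(needs_update(datetime(2005, 5, 6, 0, 0), 24, now))
```

Passing `now` explicitly rather than calling `datetime.now()` inside the function keeps the check deterministic and easy to test.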
In most cases, administrators will want oaiharvest to run in the background, i.e. run in sleep mode and wake up periodically (e.g. every 24 hours) to check whether updates are needed:
$ oaiharvest -s 24h
In other cases, administrators may want to perform periodical harvesting only on specific sources:
$ oaiharvest -r cds -s 12h
Alternatively, administrators may want to harvest from certain repositories between two specific dates. This is regarded as a one-off operation and does not affect the last update value of the source:
$ oaiharvest -r cds -d 2005-05-05:2005-05-30
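For custom scripts that wrap this command, it can help to validate the date range before passing it on. The following Python sketch parses a value in the `yyyy-mm-dd:yyyy-mm-dd` format accepted by --dates. The helper name `parse_date_range` is hypothetical, and the rule that the first date must not be later than the second is an assumption of this sketch, not documented oaiharvest behaviour.

```python
from datetime import date, datetime


def parse_date_range(value):
    """Parse 'yyyy-mm-dd:yyyy-mm-dd' into a (start, end) pair of dates.

    Raises ValueError if the format is wrong or start is after end.
    """
    start_text, sep, end_text = value.partition(":")
    if not sep:
        raise ValueError("expected yyyy-mm-dd:yyyy-mm-dd, got %r" % value)
    start = datetime.strptime(start_text, "%Y-%m-%d").date()
    end = datetime.strptime(end_text, "%Y-%m-%d").date()
    if start > end:
        # Assumption of this sketch: a reversed range is treated as an error.
        raise ValueError("start date %s is after end date %s" % (start, end))
    return start, end


start, end = parse_date_range("2005-05-05:2005-05-30")
print(start, end)
```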