Link-based Ranking with/without External Citations
PageRank is one of the most popular link-based ranking methods for WWW.
"PageRank is a link analysis algorithm [..] that assigns a numerical
weighting to each element of a hyperlinked set of documents, with the
purpose of "measuring" its relative importance within the set."[Wikipedia]
If the "hyperlinked set of documents", meaning the Citation Graph, is
almost complete, then the classical link-based ranking (a method inspired by
the PageRank algorithm) should be used. Otherwise, if the citation graph is
missing a lot of citations, we advise the use of the linked-based
ranking with external citations.
Algorithm:
1. Read the citation data from the database or from a file that
has the following format: x[tab]y where the paper with recid x
cites the paper with recid y. The citation data is stored in a
map key:value, where the 'key' is a record id and the 'value'
is the list of publications that cite publication 'key'.
2. Read the publication dates for each paper from the database
or from a file that has the following format: x[tab]y, where
x is a recid and y is the publication year. There are several
possibilities for retrieving the dates from the database:
i) using the 260__c MARC tag that contains only the publication
year (this is the option that we are using);
ii) using the 269__c MARC tag that contains the complete
publication date.
For the papers that do not have a publication date, we consider
the date of insertion in the database (961__x tag). If neither
the publication date nor the insertion date are available,
we consider an average date (computed with the existing dates).
3. Read the convergence threshold, check_point and damping_factor
parameters from the configuration file. (These are specific
parameters for the link-based ranking method).
4. There are two possibilities for computing the publications'
weights: either use the external citations, either not use the
external citations.
4.1. When using the external citations: read the necessary
parameters (everything that starts with "ext_") from
the configuration file.
4.2 When not using the external citations: -
5. Iteratively calculate the weight for each publication, until
it reaches a stable state.
6. Write the ranks to the database and to a file. The name of
the file and the number of ranks that should be outputted can
be set into the configuration file. If the name of the file is
not set, then the ranks are only written in the database.