BibAuthorID Internals
BibAuthorID
-----------
DEFINITIONS
Inspire: Publication system for high-energy physics
Hepnames: A collection of known authors including their scientific history in the field
Document: A document is, in a broader sense, a scientific artifact inside Inspire
(a publication, a pre-print, a picture, a data set, etc.)
Virtual Author (VA): The author as it appears on a document
Real Author (RA): The final entity which ideally is a real individual researcher
Feature: A feature of a VA or RA is a metric, the owner can be compared to other entities
THE ALGORITHM IN SHORT
Before any algorithm can run, one step of preparation is to be done:
- Read all names from the Inspire database and store them in new table
The algorithm itself is divided into several steps:
1. Clustering--finding potentially related authors
2. Matching--create RA entities by pairwise comparison of VAs within a cluster
3. Post-matching comparison--identifying identical RA entities through cross-cluster comparison
STEP I: CLUSTERING
[CODE]
FOR every last name in the database:
FOR every name with that last name:
FOR every documents associated with the name:
create new VA (record):
update cluster id of VA and create new cluster if necessary
mark as updated
mark as not connected
[/CODE]
STEP II: MATCHING
The matching algorithm effectively runs several times on different
conditions.
Run 1: run matching algorithm and let it find only updated VAs (the ones
that have the update flag set) that have a full name (i.e. at least
one first name).
Run 2: run matching algorithm and let it find the rest of the updated and
not-connected VAs.
Run 3: run matching algorithm once or more and let it find only orphaned
VAs (i.e. VAs that are neither updated nor connected)
[CODE]
IF mode == updated_fullname:
queue := find all updated VAs that have a full name
and sort them by the number of their features
ELSEIF mode == updated:
queue := find all updated VAs and sort them by their features
ELSEIF mode == orphaned:
queue := find all disconnected VAs and sort them by their features
FOR every qVA in queue:
cluster := find all VAs in the same cluster
other_RAs := find all RAs that have VAs attached from the same cluster
IF no other_RAs exist:
create new RA and copy all features from qVA
ELSE:
FOR every other_RA in other_RAs:
probabilities.add(compare qVA features with other_RA features)
IF all probabilities < adding threshold:
create new RA and copy all features from qVA
ELSEIF (number of probabilities > adding threshold) == 1:
add qVA to RA with highest probability
and copy features from qVA to that RA
ELSE:
mark qVA as not connected
mark qVA as not updated
continue with next qVA in queue
mark qVA as connected
mark qVA as not updated
[/CODE]
STEP II: MODULES
Up to this point, the algorithm is build in the fashion of a framework. The
framework provides all the methods needed to access features of RAs or VAs
and is extensible through the means of modules. A module's purpose is to
provide several functions to be able to compare features of a VA to the
features of a RA. Currently, four modules are implemented to determine the
correct attribution of a VA to a RA. In particular, these are created to
compare names, affiliations, paper-equality and co-authorship. The following
snippets shall show the overall functionality of each of the modules.
MODULE 1: NAME COMPARISON
[CODE]
clean names by removing special chars (.-_/\[]{}())
split names in last name, initials and names
build name combinations
FOR all possible name combinations on both sides:
initials_p := compare initials:
attribute weight to position of initial: pos/((1+n/2)*n)
add up weights of matching initials
names_p := compare names:
attribute weight to position of name: pos/((1+n/2)*n)
add up weights of matching names
IF names_p > 0.6:
initials_p_weight := 0.3
names_p_weight := 0.7
ELSEIF initials_p_weight == 1.0 and names_p_weight <= 0:
initials_p_weight := 0
names_p_weight := 0
ELSE:
initials_p_weight := 0.5
names_p_weight := 0.5
probabilities.add(names_p_weight * names_p +
initials_p_weight * initials_p)
RETURN MAX(probabilities)
[/CODE]
MODULE 2: AFFILIATION COMPARISON
[CODE]
IF no affiliation in common:
RETURN 0.0
FOR every common affiliation:
IF common affiliation == "Unknown":
common_affiliations.add(0)
ELSE:
common_affiliations.add(1)
date_difference := find date difference in month
IF date_difference > 600 (50 years):
date_probabilities.add(0)
ELSE:
date_probabilities.add(e^(-0.05 * date_difference ^ 0.7))
affiliation_p := AVERAGE(common_affiliations)
date_p = AVERAGE(date_probabilities)
RETURN (affiliation_p + date_p) / 2
[/CODE]
MODULE 3: COAUTHORSHIP COMPARISON
[CODE]
IF no coauthors on both sides:
RETURN 0.0
IF number of VA coauthors > 50:
create hash of sorted coauthor list.
IF RA holds same hash:
RETURN 1.0
ELSE:
RETURN 0.0
parity = find intersection of RA and VA coauthors
FOR every coauthor in parity:
if coauthor is a collaboration:
RETURN 1.0
IF number of VA coauthors > 0:
RETURN 1 - e^(-0.8 * len(parity)^0.7))
[/CODE]
MODULE 4: PAPER-EQUALITY TEST
[CODE]
IF qVA is on a paper that is already part of the RA:
RETURN impossible match
ELSE:
RETURN possible match
[/CODE]
STEP III: POST-MATCHING COMPARISON
[CODE]
FOR every RA:
compare features of RA with all features of all other RAs
IF compatablility with other RA > 0.75:
merge the two RAs into one RA
[/CODE]