Search Processing Stages
What the search engine does after it receives a user query? Let us
explain the steps it takes for a fictive complex query example in
which a user typed
author:ellis reportnumber:TH-2003-114 -muon +energ* title:"kaon decay"
and selected "Theses" and "Books" collections and the "arrival date"
between the 10th and the 20th of March 2003.
The search engine proceeds as follows.
1. Firstly we classify and break apart our search arguments into a)
basic search units and b+c) additional search options:
a) (p_1, f_1), (p_2, f_2), ..., (p_m, f_m) --- the list of basic
(pattern, field) searching units. The user-typed boolean
query is split into the (p_i, f_i) units according to the
non-significant whitespace. (A whitespace is considered to
be significant when it occurs within quoted expressions for
phrase searching, see Search Tips and user-level
documentation.) In our example, the list of basic search
units is: (author, ellis), (reportnumber, TH-2003-114),
(anyfield, muon), (anyfield, energ*), (title, "kaon decay").
b) c_1, c_2, ..., c_n --- the list of collections the user wanted
to search in. In our example: Theses, Books.
c) l_1, l_2, ..., l_o --- the list of additional limiting search
criteria, such as limit to certain arrival date, certain
language, or certain subject category. The user usually
selects such limits from within Advanced Search interface
and its selection boxes. In our example, the limit is on
the arrival date: (arrivaldate, 20030310->20030320).
The basic search units and additional search arguments are then
dealt with subsequently (from a to c with decreasing priority) in
the following searching stages.
2. For each (p_i, f_i), we verify that at least some hits can be found
regardless of c_j and l_k. In other words, we make sure that p_i
is a known indexed term in f_i. Note that p_i may contain
asterisks and may start and end by a single/double quotes, with the
following special meaning:
foo*bar -- asterisk is a wildcard character, meaning to match
any sequence of characters
"foo and bar" -- double quotes to denote exact phrase matching
'foo and bar' -- single quotes to denote partial phrase matching
The p_i (word or phrase) is then tested for existence in the f_i
(word or phrase) index.
2-1. If p_i was found in the f_i index, we retain the (p_i, f_i)
search unit. We also retain it in the case when p_i wasn't
found in the f_i index but when (p_i, f_i) is joined to the
previous unit or to the next unit by boolean operator OR.
2-2. If p_i wasn't found inside f_i, we then look whether p_i
contains some non-alphanumeric characters, such as a dash or a
slash. If this is so, we try to replace them with a boolean
AND query. In our example, the search unit (reportnumber,
TH-2003-114) would be replaced by a new boolean query for
(reportnumber,TH) and (reportnumber,2003) and
(reportnumber,114). If this new query succeeds, we retain
this new boolean query in place of the old search unit.
2-3. If the preceding step failed, we propose to the end user a
list of nearest indexed terms (words, phrases) around p_i
within f_i index and let the user choose one known indexed
term out of this list. In our example, the phrase "kaon
decay" cannot be found in the title index, so we'll propose
closest titles around "kaon decay ...". Let's suppose that
the user will choose "Kaon decays and the flavour problem" out
of this list.
After all the basic search units (p_i, f_i) have been treated we
may proceed to the following stage 3.
3. At this stage, all search units (p_i, f_i) are known to yield at
least some results. We now continue by trying boolean query as
specified by the user. The execution priority goes from left to
right, the known boolean operators are:
+ for set intersection
- for set difference
| for set union
In our example, we proceed by doing the set intersections of
(author, ellis), (reportnumber,TH), (reportnumber,2003),
(reportnumber,114), followed by set differentiation with (anyfield,
muon), followed by set intersection with (anyfield, energ*) and
(title, "Kaon decays and the flavour problem").
3-1. If this gives some hits, we proceed to stage 4.
3-2. If this does not give any hit, we display the number of hits
found for each search unit, advise the user to combine his
search terms differently, and we stop.
4. At this stage, the boolean query (p_1, f_1), (p_2, f_2), ... ,
(p_n, f_n) is known to yield some results. We now continue by
checking whether these results fall into collections c_j that the
user has chosen. This is done by performing a set intersection of
the results obtained so far with the collection universe for c_j.
4-1. If this gives some hits, we proceed to stage 5.
4-2. If this does not give any hit, we first try to look for the
query in any public collection.
4-2-1. If this gives some hits, we warn the user that no match
could have been found in his c_j choice but that there
are hits in other public collections. We propose a
link to get them and we stop.
4-2-2. If this does not give any hit, then there must have
been some hits in some of the restricted collections.
We display a warning that the restricted collections
must be explicitly selected before searching and we
stop.
5. At this stage, the boolean query (p_i, f_i) within c_j is known to
yield some results. We now proceed by checking additional search
limits l_k imposed by the user. This is done by subsequent set
intersections of the results obtained so far with the universe of
records matching limiting criteria (l_1, l_2, ..., l_o).
5-1. If this gives some hits, we proceed to stage 6.
5-2. If this does not give any hit for a certain l_k, we warn the
user that no match could have been found for his l_o choices
and proceed to stage 6 with the results obtained so far.
6. We are done and may display the results.