The Crawl Job consists of a web domain list and a set of configuration parameters. The result of the job execution, the Result Set, is fed into an index database for later use by the Company Semantic Profile (CSP) construction and enrichment procedures (Figure 1). Resource Employment features (Table 2) relate to two levels of the architecture (Figure 4): the Job Level Context (JLC) and the Domain Level Crawl (DLC).
| RE | Resource Employment |
|---|---|
| RE-max | maximization of CPU and I/O employment |
| RE-dlc | domain level multi-threading pattern |
| RE-jlc | job level multi-threading pattern |
| RE-lnk | balancing bandwidth pull by shuffling requests over multiple domains |
| RE-lim | configuration parameters to limit the crawl size |
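The RE-lnk feature can be illustrated with a minimal sketch. The text does not specify the shuffling strategy, so the round-robin interleaving below is an assumption, and the function name `interleave_by_domain` and the example domains are hypothetical. The idea shown is only that consecutive requests are spread over different domains, so that no single domain absorbs the whole bandwidth pull.

```python
from collections import deque
from typing import Dict, Iterator, List


def interleave_by_domain(pending: Dict[str, List[str]]) -> Iterator[str]:
    """Yield pending URLs so that consecutive requests hit different domains.

    Assumption: plain round-robin over per-domain queues; the actual
    crawler may use a different shuffling scheme.
    """
    # One FIFO queue per domain, skipping domains with nothing to load.
    queues = deque(
        (domain, deque(urls)) for domain, urls in pending.items() if urls
    )
    while queues:
        domain, urls = queues.popleft()
        yield urls.popleft()
        if urls:
            # Domain still has work: send it to the back of the rotation.
            queues.append((domain, urls))


pending = {
    "a.example": ["a1", "a2", "a3"],
    "b.example": ["b1"],
    "c.example": ["c1", "c2"],
}
order = list(interleave_by_domain(pending))
print(order)
```

When only one domain has pending requests, the rotation degenerates to sequential loading of that domain, so the balancing effect depends on several DLC instances being active at the same time.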
To address the DLC size limitation (RE-lim), we adopted several metrics (Table 3), categorized by their role as primary (pr), secondary (sc) or special (sp). Primary metrics are those expected to be triggered in a significant number of cases during proper crawler operation. Secondary metrics were introduced as a software failure handling measure; more specifically, they terminate crawling threads that are blocked by a yet unknown logic loop, a crawler trap, an unhandled exception or another crash-like event. Special metrics are used in particular research runs to limit the crawl size according to the purpose of the experiment. During research and development, a number of control rules were developed and used for certain experiments.
| Parameter | Description | Value | Unit |
|---|---|---|---|
| LT | Load Take per iteration limit | 1 | n of t |
| Imax | Iteration limit | 100 | n of i |
| TTmax | Total Targets limit | 10000 | n of t |
| PLmax | Total Page-loads limit | 500 | n of p |
| TDL | Domain crawl time limit | 15 | min |
| TAC | Inactivity time limit | 1 | min |
| TST | Targets count stability limit | 3 | n of i |
Table 3: Crawl size limitation parameters and the values used for the preliminary survey. Abbreviations: n, number; t, target/link; i, DLC iteration; p, loaded target, i.e. web page.
The iteration limit Imax and the page-load limit PLmax are simple termination criteria. Their values are to be adjusted in the final design to reduce unnecessary workload, because the most relevant pages of a web site should be crawled well before either limit is approached. As a reference value for Imax, it is worth mentioning that during the preliminary study crawl 32.25 iterations were required, on average, to reach the Targets count stability limit (TST), which is itself a DLC termination criterion. The TST is a special termination rule used to optimize the workload of deep scan crawls. It tracks the uninterrupted sequence of iterations in which the known Target count does not change. TTmax, if reached, would suggest that the domain was mistakenly included in the research sample, given the highest number of pages the survey detected. The time limit for a single DLC (TDL) and the DLC thread inactivity timeout (TAC), defined as the time period between the starts of two iterations, are protection mechanisms against otherwise undetected problems in the crawler source code, algorithm, research sample or network/uplink, or against some kind of crawl trap. When triggered, both TTmax and TDL leave a warning message with details in an incident log XML file for later investigation. LT controls the number of pages taken from the ranked frontier and loaded in the next iteration.
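The termination criteria above can be summarized in a short sketch. It is not the crawler's actual implementation: the class names `CrawlLimits` and `DlcState`, the function `termination_reason`, and the order in which the criteria are evaluated are all assumptions made for illustration. Only the parameter values (Table 3) and the semantics of each limit are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrawlLimits:
    """Table 3 values from the preliminary survey (field names are illustrative)."""
    lt: int = 1               # LT: targets taken from the frontier per iteration
    i_max: int = 100          # Imax: iteration limit
    tt_max: int = 10_000      # TTmax: total known targets limit
    pl_max: int = 500         # PLmax: total page-loads limit
    tdl_s: float = 15 * 60.0  # TDL: domain crawl time limit, in seconds
    tac_s: float = 1 * 60.0   # TAC: max time between iteration starts, in seconds
    tst: int = 3              # TST: iterations with unchanged target count


@dataclass
class DlcState:
    """Hypothetical per-domain counters, updated at the start of each iteration."""
    started_at: float
    last_iteration_at: float
    iterations: int = 0
    page_loads: int = 0
    known_targets: int = 0
    stable_iterations: int = 0

    def record_iteration(self, now: float, known_targets: int, page_loads: int) -> None:
        # TST tracks an uninterrupted run of iterations without a change
        # in the known target count; any change resets the run.
        if self.iterations > 0 and known_targets == self.known_targets:
            self.stable_iterations += 1
        else:
            self.stable_iterations = 0
        self.known_targets = known_targets
        self.page_loads = page_loads
        self.iterations += 1
        self.last_iteration_at = now


# Per the text, these two criteria also leave a warning in the incident log.
LOGGED_AS_INCIDENT = {"TTmax", "TDL"}


def termination_reason(state: DlcState, limits: CrawlLimits, now: float) -> Optional[str]:
    """Return the name of the first limit reached, or None to keep crawling.

    Assumption: protection mechanisms are checked before the simple criteria.
    """
    if now - state.started_at >= limits.tdl_s:
        return "TDL"
    if now - state.last_iteration_at >= limits.tac_s:
        return "TAC"
    if state.known_targets >= limits.tt_max:
        return "TTmax"
    if state.page_loads >= limits.pl_max:
        return "PLmax"
    if state.iterations >= limits.i_max:
        return "Imax"
    if state.stable_iterations >= limits.tst:
        return "TST"
    return None


limits = CrawlLimits()
state = DlcState(started_at=0.0, last_iteration_at=0.0)
state.record_iteration(now=1.0, known_targets=10, page_loads=1)
first = termination_reason(state, limits, now=2.0)
for step, t in enumerate((2.0, 3.0, 4.0), start=2):
    # Target count stays at 10, so the stability run grows each iteration.
    state.record_iteration(now=t, known_targets=10, page_loads=step)
print(first, termination_reason(state, limits, now=5.0))
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the check deterministic; in a running crawler the TAC and TDL checks would presumably be evaluated by a supervising context, since a blocked DLC thread cannot check its own inactivity.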