Crawl Job Execution

Excerpt from theoretical paper on imbWEM and Crawl Job execution

The Crawl Job consists of the web domain list and the configuration parameters. The result of the job execution, the Result Set, is fed into the index database for later use by the Company Semantic Profile (CSP) construction and enrichment (Figure 1) procedures. The Resource Employment features (Table 2) are related to two different levels of the architecture (Figure 4): the Job Level Context (JLC) and the Domain Level Crawl (DLC).
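To make the data flow concrete, the Crawl Job and its Result Set could be modeled as simple containers; the class and field names below are hypothetical illustrations, not taken from the imbWEM implementation:

```python
from dataclasses import dataclass, field

@dataclass
class CrawlJob:
    """A Crawl Job: the web domain list plus configuration parameters."""
    domains: list[str]                       # web domains to crawl
    config: dict = field(default_factory=dict)  # configuration parameters

@dataclass
class ResultSet:
    """Job execution output, fed into the index database
    for later CSP construction and enrichment."""
    records: list = field(default_factory=list)
```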

Figure 4: Relationship between the Job Level Context (JLC) and the Domain Level Crawl (DLC) scopes.

RE      Resource Employment
RE-max  maximization of CPU and I/O employment
RE-dlc  domain level multi-threading pattern
RE-jlc  job level multi-threading pattern
RE-lnk  balancing bandwidth pull by shuffling requests over multiple domains
RE-lim  configuration parameters to limit the crawl size

Table 2: Resource Employment (RE) features.

As a remedy for the DLC size limitation (RE-lim), we adopted several metrics (Table 3), categorized by their role into the primary (pr), secondary (sc) and special (sp) categories. The primary category contains the metrics expected to be triggered in a significant number of cases during proper crawler operation. The secondary metrics are introduced as a software-failure handling measure, more specifically: to terminate crawling threads that are blocked by a yet unknown logic loop, a fall into a crawler trap, an unhandled exception, or another crash-like event. The special metrics are used in particular research runs to limit the crawl size according to the experiment purpose. During research and development, a number of control rules were developed and used for certain experiments.
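The three-way categorization above could be represented as a simple enumeration; this is an illustrative sketch with hypothetical names, not code from the imbWEM source:

```python
from enum import Enum

class MetricCategory(Enum):
    """Role categories of the crawl size limitation metrics (Table 3)."""
    PRIMARY = "pr"    # expected to trigger often during proper crawler operation
    SECONDARY = "sc"  # failure handling: logic loops, crawler traps, crashes
    SPECIAL = "sp"    # experiment-specific crawl size limits for research runs
```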

Parameter  Description                     Value  Unit
LT         Load Take per iteration limit   1      n of t
Imax       Iteration limit                 100    n of i
TTmax      Total Targets limit             10000  n of t
PLmax      Total Page-loads limit          500    n of p
TDL        Domain crawl time limit         15     min
TAC        Inactivity time limit           1      min
TST        Targets count stability limit   3      n of i
Table 3: The set of crawl size limitation parameters and values used for the preliminary survey. Abbreviations: n, number; t, target/link; i, DLC iteration; p, loaded target, i.e. web page.
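As an illustration, the Table 3 limits can be grouped into a single configuration object; the class and attribute names below are hypothetical, chosen only to mirror the parameter abbreviations:

```python
from dataclasses import dataclass

@dataclass
class CrawlLimits:
    """Crawl size limitation parameters, with the Table 3 survey values."""
    LT: int = 1         # Load Take per iteration limit (n of targets)
    Imax: int = 100     # Iteration limit (n of DLC iterations)
    TTmax: int = 10000  # Total Targets limit (n of targets)
    PLmax: int = 500    # Total Page-loads limit (n of pages)
    TDL: float = 15.0   # Domain crawl time limit (minutes)
    TAC: float = 1.0    # Inactivity time limit (minutes)
    TST: int = 3        # Targets count stability limit (n of iterations)
```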

The Iteration limit (Imax) and Total Page-loads limit (PLmax) are simple termination criteria whose values are to be adjusted in the final design to reduce unnecessary workload, as the most relevant pages on a web site should be crawled well before either limit is approached. As a reference value for Imax, it is relevant to mention that 32.25 iterations were required, on average, to reach the Targets count stability limit (TST), a DLC termination criterion, during the preliminary study crawl. The TST is a special termination rule used to optimize the workload of deep scan crawls: it tracks an uninterrupted sequence of iterations in which the known Target count did not change. The TTmax, if reached, would suggest the domain was mistakenly included in the research sample, having in mind the highest number of pages the survey detected. The time limit for a single DLC (TDL) and the DLC thread inactivity timeout (TAC), defined as the time period between the starts of two consecutive iterations, are protection mechanisms against otherwise undetected problems in the crawler source code, the algorithm, the research sample, the network/uplink, or some kind of crawl trap. When triggered, both TTmax and TDL leave a warning message with details in an incident log XML file for later investigation. The LT parameter controls the number of pages to be taken from the ranked frontier and loaded in the next iteration.
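The termination criteria above might be evaluated together at the start of each DLC iteration, as in the following minimal sketch. All names and the dictionary-based state are hypothetical illustrations; the actual imbWEM implementation is not shown in this excerpt, and LT is omitted because it shapes the frontier take rather than terminating the crawl:

```python
import time

def should_terminate(state, limits):
    """Return the name of the first Table 3 criterion that fires, else None.

    `state` carries per-DLC counters and timestamps (seconds since epoch);
    `limits` holds the Table 3 values, with TDL and TAC given in minutes.
    """
    now = time.time()
    if state["iteration"] >= limits["Imax"]:
        return "Imax"    # iteration limit reached
    if state["total_targets"] >= limits["TTmax"]:
        return "TTmax"   # domain likely mis-sampled; log an incident
    if state["page_loads"] >= limits["PLmax"]:
        return "PLmax"   # total page-loads limit reached
    if now - state["dlc_start"] >= limits["TDL"] * 60:
        return "TDL"     # domain crawl time limit; log an incident
    if now - state["last_iteration_start"] >= limits["TAC"] * 60:
        return "TAC"     # inactivity timeout between iteration starts
    if state["stable_iterations"] >= limits["TST"]:
        return "TST"     # known Target count unchanged for TST iterations
    return None
```

Under this sketch, the crawler would increment `stable_iterations` whenever an iteration ends with the known Target count unchanged, and reset it otherwise.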
