The imbWEM defines the Web Exploration Model (WEM), which describes: the execution flow for a particular experiment (or crawl job), the DLC data and the crawler. The execution flow is defined with the Stage Control Model, which is an ordinal sequence of Stage Model instances. In this article we consider an implementation with a single instance of the Stage Model, having one or more Objective instances associated. The Objective used in the experiments has static criteria defining the crawl size limits, as given by the Crawl Job configuration parameters.
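For illustration, the relationship between the Stage Control Model, the Stage Model and the Objective may be sketched as below. The class and property names are only illustrative and do not correspond to the actual imbWEM API, and the limit properties are placeholders for the Crawl Job parameters.

```csharp
using System.Collections.Generic;

// Illustrative sketch only -- the names do not match the actual imbWEM classes.
public class ObjectiveSketch
{
    // Static criteria taken from the Crawl Job configuration parameters
    public int PageLoadLimit { get; set; }   // crawl size limit: pages per DLC
    public int LinkLimit { get; set; }       // crawl size limit: links to evaluate
}

public class StageModelSketch
{
    // One or more Objective instances associated with the stage
    public List<ObjectiveSketch> Objectives { get; } = new List<ObjectiveSketch>();
}

public class StageControlModelSketch
{
    // Ordinal sequence of Stage Model instances; the experiments use a single stage
    public List<StageModelSketch> Stages { get; } = new List<StageModelSketch>();
}
```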
Notice: the concept presented here is part of an article currently in the peer-review process. If you intend to cite this material, please be patient until the paper is accepted, so that you can cite a source of higher credibility.
The Crawling Context Model
The Crawling Context is defined as a complex data structure containing dynamic metrics, progressively populated collections and other temporary information used by the crawler during the inner crawl of a web site. It is a domain-level temporary data structure, in the sense that it is created and used within a single web site crawl session, performed by an instance of a particular crawler algorithm. A new instance of the Crawling Context is created just before the seed page (domain landing page) is loaded. At the end of a DLC session, before the instance is disposed, the metrics relevant for the evaluation are sent to the Reporting Engine.
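A minimal sketch of this life cycle is given below. It is only an approximation of the description above: the class, property and method names are hypothetical and do not correspond to the actual imbWEM implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the domain-level Crawling Context life cycle.
public class CrawlingContextSketch : IDisposable
{
    public string SeedUrl { get; }                                             // domain landing page
    public List<string> ActiveTargets { get; } = new List<string>();           // the AT collection (simplified)
    public Dictionary<string, double> Metrics { get; } = new Dictionary<string, double>();

    public CrawlingContextSketch(string seedUrl)
    {
        // A new instance is created just before the seed page is loaded
        SeedUrl = seedUrl;
    }

    public void Dispose()
    {
        // At the end of the DLC session the evaluation metrics are reported
        ReportingEngineStub.Send(SeedUrl, Metrics);
    }
}

public static class ReportingEngineStub
{
    public static void Send(string domain, IReadOnlyDictionary<string, double> metrics) =>
        Console.WriteLine($"[{domain}] reported {metrics.Count} metric(s)");
}
```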
A Target is defined as a data structure containing Link Instances and Link Vectors (Figure 1), together with its resolved URL. The Link Vector class provides aggregate access to the meta information of the Link Instance collection, such as: anchor text tokens, URL tokens and contextual information in the form of references to the page content blocks containing the Link Instances. Once a Page is loaded, the Content Processor component extracts Link Instances using a simple XPath query over the HTML DOM, transformed into an XmlDocument object with the Html Agility Pack library for .NET. The paths from the href attributes are resolved and filtered by the Path Resolver component. The component creates the Link Vectors by joining Link Instances pointing to the same address into a common Link Vector object. If a Link Vector points to a previously unknown address, a new instance of the Target class is created and assigned to the Active Targets (AT) collection. The AT collection is the most important part of the Crawling Context, as it contains the Targets yet to be evaluated by the Crawler Model and its frontier ranking modules.
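The extraction and grouping step could be approximated by the sketch below, which uses the Html Agility Pack for parsing. The LinkInstance and LinkVector types are simplified stand-ins, and the real Content Processor and Path Resolver components perform additional filtering not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;   // Html Agility Pack NuGet package

// Simplified stand-ins for the Link Instance and Link Vector structures
public record LinkInstance(string Href, string AnchorText);
public record LinkVector(Uri ResolvedUrl, IReadOnlyList<LinkInstance> Instances);

public static class LinkExtractionSketch
{
    public static List<LinkVector> Extract(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Simple XPath query over the HTML DOM
        var anchors = (IEnumerable<HtmlNode>)doc.DocumentNode.SelectNodes("//a[@href]")
                      ?? Enumerable.Empty<HtmlNode>();

        var vectors = new Dictionary<Uri, List<LinkInstance>>();
        foreach (var a in anchors)
        {
            var instance = new LinkInstance(a.GetAttributeValue("href", ""), a.InnerText.Trim());

            // Path resolution (simplified): relative hrefs are resolved against the page URL
            if (!Uri.TryCreate(baseUri, instance.Href, out var resolved)) continue;

            // Link Instances pointing to the same address are joined into a common Link Vector
            if (!vectors.TryGetValue(resolved, out var group))
                vectors[resolved] = group = new List<LinkInstance>();
            group.Add(instance);
        }

        return vectors.Select(kv => new LinkVector(kv.Key, kv.Value)).ToList();
    }
}
```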
The Crawler Model
The Crawler Model describes a particular crawler implementation, where the collection of the Frontier Modules is the most important property of the model.
The modules contain different instances of the Rule class, which is the atomic building block of the concept. As the subject of the evaluation, a Rule may be associated with a Target, a Page or an Objective. By the nature of their operation, the rules are divided into Passive, Active and Control rules. A rule instance produces the evaluation result according to its inner algorithm and class-specific configuration, which may take as input:
• the properties and the state information of the subject instance,
• any other domain-level information contained in the Crawling Context.
Passive rules are applied only once for each subject instance (Target or Page), since their inner algorithms use only immutable features and/or properties that are not influenced by changes in the Crawling Context. Active rules are adaptable, updating their evaluation result in each iteration according to the relevant changes in the Crawling Context. Their influence on the subject's rank, layer exclusion or inclusion is therefore mutable across crawl iterations.
The Passive and Active rules are used for subject ranking and distribution among the layers of the module, where rules having Target(s) as subject are run within the FRA loop. The Control rules, besides Target(s) and Page(s), may have an Objective as the subject. Their purpose is termination of the execution flow, exclusion of Target(s) from frontier consideration, or exclusion of Page(s) from the result set. The Objective control and Target control rules are usually executed at the end of each FRA iteration.
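The distinction between the three rule natures might be sketched as below, reusing the CrawlingContextSketch type from the earlier snippet; the rule classes and their scoring logic are invented for illustration and are not part of imbWEM.

```csharp
// Illustrative sketch of the Rule concept; the concrete rule classes below are hypothetical.
public enum RuleNature { Passive, Active, Control }

public abstract class RuleSketch<TSubject>
{
    public abstract RuleNature Nature { get; }

    // The evaluation result is produced from the subject's own state and,
    // optionally, domain-level information held by the Crawling Context.
    public abstract double Evaluate(TSubject subject, CrawlingContextSketch context);
}

// Passive: applied once per subject, depends only on immutable features (here: URL tokens).
public class UrlTokenRuleSketch : RuleSketch<string>
{
    public override RuleNature Nature => RuleNature.Passive;
    public override double Evaluate(string targetUrl, CrawlingContextSketch context) =>
        targetUrl.Contains("product") ? 1.0 : 0.0;
}

// Active: re-evaluated in every iteration, because the Crawling Context keeps changing.
public class FrontierPressureRuleSketch : RuleSketch<string>
{
    public override RuleNature Nature => RuleNature.Active;
    public override double Evaluate(string targetUrl, CrawlingContextSketch context) =>
        context.ActiveTargets.Count == 0 ? 0.0 : 1.0 / context.ActiveTargets.Count;
}

// Control: in this sketch a negative result stands for "exclude / terminate".
public class PageLimitControlRuleSketch : RuleSketch<string>
{
    public override RuleNature Nature => RuleNature.Control;
    public override double Evaluate(string targetUrl, CrawlingContextSketch context) =>
        context.Metrics.TryGetValue("pagesLoaded", out var n) && n >= 100 ? -1.0 : 0.0;
}
```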
The Frontier Modules are defined in three collections, one for each of the base classes they inherit (Figure 2); a minimal sketch of this grouping is given after the list:
• The Frontier Layer Module base class (API reference)
• The Frontier Ranking Module base class (API reference)
• The Frontier Control Module base class (API reference)
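As a rough sketch (the actual base classes are documented in the API reference linked above; the names below are abbreviated placeholders):

```csharp
using System.Collections.Generic;

// Placeholder base classes standing in for the three Frontier Module base classes
public abstract class FrontierLayerModuleSketch { }
public abstract class FrontierRankingModuleSketch { }
public abstract class FrontierControlModuleSketch { }

public class CrawlerModelSketch
{
    // Three collections, one per base class, evaluated in this order
    public List<FrontierLayerModuleSketch> LayerModules { get; } = new List<FrontierLayerModuleSketch>();
    public List<FrontierRankingModuleSketch> RankingModules { get; } = new List<FrontierRankingModuleSketch>();
    public List<FrontierControlModuleSketch> ControlModules { get; } = new List<FrontierControlModuleSketch>();
}
```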
The Targets of the AT collection are evaluated in the sequence illustrated in Figure 3. The modules of the Frontier Layer Module base class keep some Targets restrained in their Frontier Layers, effectively excluding them from the down-stream evaluation. The Frontier Ranking Modules may contain Control Target Rules to exclude some Targets from the process and send them back to the AT, or to dismiss them permanently. The Control Modules evaluate the rest of the ranked Targets with the same control options. The first LT Targets that passed through the control modules are sent to the Loader component, while the rest, if any, are fed back to the AT collection.
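One FRA iteration over the AT collection could therefore be summarized by the pseudo-implementation below. This is a deliberately flattened sketch: the module pipelines are reduced to delegates, LT is passed as loadTake, and the feedback and dismissal paths are simplified compared to the actual imbWEM control flow.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Flattened sketch of a single FRA iteration; not the actual imbWEM control flow.
public static class FraIterationSketch
{
    public static List<string> RunIteration(
        List<string> activeTargets,
        Func<List<string>, List<string>> layerModules,                       // returns the Targets released from the layers
        Func<List<string>, List<(string Url, double Score)>> rankingModules, // scores the released Targets
        Func<string, bool> controlModules,                                   // true = Target passes the control rules
        int loadTake)                                                        // the LT parameter
    {
        // 1. Layer modules: restrained Targets stay inside their Frontier Layers
        var released = layerModules(activeTargets);

        // 2. Ranking modules: compute relevance scores and order the Targets
        var ranked = rankingModules(released)
            .OrderByDescending(t => t.Score)
            .Select(t => t.Url)
            .ToList();

        // 3. Control modules: evaluate the ranked Targets
        var passed = ranked.Where(controlModules).ToList();

        // 4. The first LT Targets go to the Loader; the rest are fed back into the AT collection
        var toLoader = passed.Take(loadTake).ToList();
        var feedback = passed.Skip(loadTake).ToList();
        activeTargets.Clear();
        activeTargets.AddRange(feedback);

        return toLoader;
    }
}
```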
In the context of FRA, the Target model is expanded by the Evaluation Registry (ER), which keeps records of the result returned by each rule that evaluated the Target. The results from Passive and Active rules are kept separately, as the latter are cleared at the beginning of each iteration. A Target that was evaluated by a Passive rule will not be evaluated again in the next iteration, if the existence of an ER entry is confirmed by the rule signature. At the end of the Frontier Ranking Module evaluation, the ER computes the current relevance score by summing all Passive and all Active results, as shown in the illustrated example below (Figure 4).
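The bookkeeping described above could be reduced to the following sketch, in which the ER is keyed by a textual rule signature; the member names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the Evaluation Registry (ER) attached to a Target.
public class EvaluationRegistrySketch
{
    private readonly Dictionary<string, double> passiveResults = new Dictionary<string, double>();
    private readonly Dictionary<string, double> activeResults = new Dictionary<string, double>();

    // Lets a Passive rule skip a Target it has already evaluated in an earlier iteration
    public bool HasPassiveResult(string ruleSignature) => passiveResults.ContainsKey(ruleSignature);

    public void AddPassive(string ruleSignature, double result) => passiveResults[ruleSignature] = result;
    public void AddActive(string ruleSignature, double result) => activeResults[ruleSignature] = result;

    // Active results are cleared at the beginning of each FRA iteration
    public void ClearActive() => activeResults.Clear();

    // Current relevance score: the sum of all Passive and all Active results
    public double Score => passiveResults.Values.Sum() + activeResults.Values.Sum();
}
```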
The Frontier Layers
The sharp discrimination (binary score) of irrelevant links provides better focusing of the crawl, but as a consequence it prevents the crawler from passing through tunnels of irrelevancy in cases where the site has one or several of the following features: an intro page of various kinds, a pseudo-index page with an important announcement, company division selection, multiple brands on a single web presentation, product range selection and other tunnel structures. For this reason many authors prefer coefficient score values (a decimal number from 0 to 1), as they allow low-relevance links to stay long enough in the frontier, enabling the crawler to cross the irrelevancy tunnel. Such an approach trades efficiency for effectiveness. To address this issue, we extended the traditional frontier concept with frontier layers: the active Targets are distributed among separate stacks, called Frontier Layers, defined as an ordinal collection within an instance of the Frontier Module.
When Targets are passed into a Frontier Layer Module, they are distributed to the Layers first by the Passive rules and then by the Active ones. The rule evaluation result, for an instance of the Target class, may be imperative or neutral, i.e. it may assign the instance to the appropriate layer and prevent its further evaluation, or pass it through to the next rule in the sequence. Depending on the Frontier Layer Module instance settings, unassigned Targets may be sent to the deepest (n-th) layer or dismissed from any further consideration. After the distribution phase is finished, Targets are pulled only from the first non-empty layer, from the top to the bottom.
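The distribution and pulling behaviour might be sketched as follows, with a layer-assignment rule modelled as a delegate that either imperatively returns a layer index or stays neutral by returning null; the class is hypothetical and greatly simplified.

```csharp
using System.Collections.Generic;

// A rule either imperatively assigns the Target to a layer (returns the layer index)
// or stays neutral (returns null) and passes the Target on to the next rule.
public delegate int? LayerAssignmentRule(string targetUrl);

public class FrontierLayerModuleDemo
{
    private readonly List<Queue<string>> layers = new List<Queue<string>>();
    private readonly bool dismissUnassigned;

    public FrontierLayerModuleDemo(int layerCount, bool dismissUnassigned)
    {
        for (int i = 0; i < layerCount; i++) layers.Add(new Queue<string>());
        this.dismissUnassigned = dismissUnassigned;
    }

    public void Distribute(IEnumerable<string> targets,
                           IList<LayerAssignmentRule> passiveRules,
                           IList<LayerAssignmentRule> activeRules)
    {
        foreach (var target in targets)
        {
            int? layer = null;

            // Passive rules are applied first, then the Active ones
            foreach (var rule in passiveRules)
                if ((layer = rule(target)) != null) break;
            if (layer == null)
                foreach (var rule in activeRules)
                    if ((layer = rule(target)) != null) break;

            if (layer != null)
                layers[layer.Value].Enqueue(target);          // imperative assignment
            else if (!dismissUnassigned)
                layers[layers.Count - 1].Enqueue(target);     // unassigned -> deepest (n-th) layer
            // otherwise the unassigned Target is dismissed from any further consideration
        }
    }

    // Targets are pulled only from the first non-empty layer, from top to bottom
    public Queue<string> FirstNonEmptyLayer()
    {
        foreach (var layer in layers)
            if (layer.Count > 0) return layer;
        return null;
    }
}
```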
The three exemplary cases illustrated in Figure 5 may be formally defined as:
$$C_{i=0} \Longrightarrow L_0 \neq \emptyset \Longrightarrow M_{output} \in L_0$$
$$C_{i=1} \Longrightarrow L_0 = \emptyset \land L_1 \neq \emptyset \Longrightarrow M_{output} \in L_1$$
$$C_{i=n} \Longrightarrow L_0 \ldots L_{n-1} = \emptyset \land L_n \neq \emptyset \Longrightarrow M_{output} \in L_n$$
In the special case when the module is configured to dismiss unassigned Targets and all layers are empty, the complete collection of the input Targets is bypassed as the module output. There are several important implications of the proposed approach. Such frontier management enables FRA to implement both binary and fuzzy relevancy scores in their optimal contexts, where the binary logic constitutes the mechanism of target-to-layer distribution, keeping low-relevancy Targets out of the expensive real-number computation branch until, and if, they are required in function of the crawler's effectiveness. Our hypothesis is that the proposed layered frontier concept, combined with the complementary Target distribution mechanism, is able to preserve effectiveness while avoiding loss of efficiency.
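In terms of the FrontierLayerModuleDemo sketch above, the regular cases and the bypass special case could be expressed as a single output selection step; again, this is only an approximation of the behaviour described in the text.

```csharp
using System.Collections.Generic;

public static class FrontierLayerModuleDemoUsage
{
    public static IEnumerable<string> Output(FrontierLayerModuleDemo module, List<string> input)
    {
        var firstNonEmpty = module.FirstNonEmptyLayer();
        if (firstNonEmpty != null) return firstNonEmpty;   // regular cases: output taken from the first non-empty layer

        // Special case: no layer received a Target (dismiss-unassigned configuration),
        // so the complete input collection is bypassed as the module output.
        return input;
    }
}
```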