The Template Module

Frontier Ranking Algorithm – the Template Module

The heart of this module is a procedure that decomposes the page and detects the semantic role of each extracted content block. It is the only module in the stack that evaluates links using strictly information that stays immutable across DLC process iterations. Furthermore, the alternative ranking implementation assumes that a higher position in the navigation menu hierarchy implies greater significance.

Content graph slice (from blocks to tokens)

The Template Module inherits the Frontier Layer Module base class and declares three layers:

M02.L01: The Navigation (the surface layer)
M02.L02: The Information (the 1st layer)
M02.L03: The Reserve (the 2nd layer)

each of them associated with the role tag of the same name (“navigation”, “information”, “reserve”).

The first step is to build a tree representation of the page content from the list of XPaths of visible HTML leaf nodes that contain text. The tree is traversed from the root towards the leaves in an iterative procedure until the number of scoped nodes is greater than or equal to the desired number of blocks (T_bc), which in this case is 3, reflecting the number of layers in the module. Once the criterion is satisfied, the nodes scoped in the current iteration are ranked by their total number of descendants. Each of the first T_bc-1 nodes is processed into a content block, while the last block is created from the remaining nodes in the list.
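The decomposition step can be sketched in Python roughly as follows. This is a minimal illustration, not the module's actual code: the function name `decompose`, the representation of nodes as tuples of path segments, and the rule that a leaf stays in scope when it cannot be expanded are all assumptions made for the example.

```python
def decompose(leaf_xpaths, t_bc=3):
    """Split a page into at most t_bc content blocks (hypothetical helper).

    leaf_xpaths: XPaths of visible text-bearing leaf nodes.
    Returns a list of blocks, each a list of block root XPaths.
    """
    leaves = [tuple(p.strip("/").split("/")) for p in leaf_xpaths]

    def under(node):
        # Leaf paths located at or below the given node.
        return [leaf for leaf in leaves if leaf[:len(node)] == node]

    def children(node):
        depth = len(node) + 1
        return sorted({leaf[:depth] for leaf in under(node) if len(leaf) >= depth})

    def descendant_count(node):
        # Number of distinct tree nodes strictly below the given node.
        return len({leaf[:d] for leaf in under(node)
                    for d in range(len(node) + 1, len(leaf) + 1)})

    scoped = [()]  # the empty tuple stands for the document root
    while len(scoped) < t_bc:
        expanded = []
        for node in scoped:
            kids = children(node)
            expanded.extend(kids if kids else [node])  # assumption: leaves stay in scope
        if expanded == scoped:  # nothing left to expand
            break
        scoped = expanded

    # Rank by descendant count; the first t_bc-1 nodes become blocks,
    # the remaining nodes are merged into the last block.
    ranked = sorted(scoped, key=descendant_count, reverse=True)
    groups = [[node] for node in ranked[:t_bc - 1]]
    if ranked[t_bc - 1:]:
        groups.append(ranked[t_bc - 1:])
    return [["/" + "/".join(node) for node in group] for group in groups]


leaves = [
    "/html/body/nav/a[1]", "/html/body/nav/a[2]", "/html/body/nav/a[3]",
    "/html/body/main/p[1]", "/html/body/main/p[2]",
    "/html/body/footer/span", "/html/body/aside/p",
]
blocks = decompose(leaves)
```

In this example the descent stops at the four children of `body`, so the two largest subtrees (`nav` and `main`) become blocks of their own and `aside` and `footer` are merged into the third block.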

The second step is to compute Link-Block Frequency (lbf) for each block:

$lbf_i = { lb_i \over lb_{max} }$	(2)

where lb_i is the number of Link Instances in block i and lb_max is the highest number of Link Instances among the blocks. In most cases choosing the block with the highest lbf_i would be enough, but it would fail where the main section of the page contains a list of recommended links or a very long descriptive text with frequent in-line links. The third step follows the intuition of Zipf’s Law (Powers, 1998): we assume that blocks with a high volume of textual data have a content corpus entropy greater than blocks containing strictly navigation links. We compute the normalized Shannon entropy (Shannon, 1948, p. 11) of text token frequencies in each content block as:

$E'(b_j) = {{- \sum_{i=1}^{\|b_j\|} tf(x_i) \log (tf(x_i)) } \over { \log (\|b_j\|)}}$	(3)

where b_j is the collection of tokens from block j, x_i is a token in block j, and tf(x_i) is the frequency of term x_i normalized by the highest term frequency in the block.
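Equations (2) and (3) translate into a few lines of Python. The sketch below is only illustrative: the function names are hypothetical, and it assumes that the sum in equation (3) visits every token occurrence in the block (as the index range 1..|b_j| suggests) and that a block with fewer than two tokens has entropy 0.

```python
import math
from collections import Counter


def link_block_frequency(link_counts):
    """Equation (2): link count of each block divided by the largest count."""
    lb_max = max(link_counts, default=0)
    if lb_max == 0:
        return [0.0 for _ in link_counts]  # no links anywhere on the page
    return [lb / lb_max for lb in link_counts]


def normalized_entropy(tokens):
    """Equation (3) for one block, given its list of text tokens."""
    n = len(tokens)
    if n < 2:
        return 0.0  # assumption: log(1) = 0 would make the denominator zero
    counts = Counter(tokens)
    max_count = max(counts.values())
    total = 0.0
    for token in tokens:  # assumption: every token occurrence contributes
        tf = counts[token] / max_count  # frequency normalized by the maximum
        total -= tf * math.log(tf)
    return total / math.log(n)


lbf = link_block_frequency([12, 3, 0])
flat = normalized_entropy(["a", "a", "b", "b"])
mixed = normalized_entropy(["a", "a", "b"])
```

Because tf is normalized by the highest term frequency, a block in which every term occurs equally often has tf = 1 for all tokens and therefore an entropy of exactly 0, which is typical for short navigation menus.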

In the fourth step these two values describing each block are combined by division into the b^nav measure:

$b^{nav}_{j} = { {lbf_{j} \cdot \log ( \|b_{j}\|)} \over {k - \sum_{i=1}^{\|b_{j}\|} tf(x_i) \cdot \log (tf(x_i))} }$	(4)

where k is a predefined coefficient (k=0.001) that prevents division by zero, as the entropy may be 0 for a perfectly homogeneous distribution.
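A small sketch of the b^nav measure is shown below. It assumes that the denominator is k plus the unnormalized entropy of equation (3), that is k minus the sum of tf(x_i)·log(tf(x_i)), so that it always stays positive. The function name `navigation_score` and the handling of an empty block are assumptions for this example.

```python
import math
from collections import Counter


def navigation_score(lbf, tokens, k=0.001):
    """b^nav for one block: lbf divided by the normalized entropy, guarded by k."""
    counts = Counter(tokens)
    max_count = max(counts.values(), default=1)
    # Sum of tf * log(tf) over all token occurrences; never positive because tf <= 1.
    weighted = sum(
        (counts[t] / max_count) * math.log(counts[t] / max_count) for t in tokens
    )
    # Assumption: an empty block contributes log size 0 and therefore scores 0.
    log_size = math.log(len(tokens)) if tokens else 0.0
    return lbf * log_size / (k - weighted)


menu = navigation_score(1.0, ["home", "about", "contact", "news"])
article = navigation_score(0.25, ["a", "a", "b"])
```

The menu-like block has zero entropy and the highest link frequency, so its score is dominated by the 1/k term and ends up several orders of magnitude above the score of the article-like block.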

In the last step the semantic role of each block within the page is assigned by the following rules:

the block with the highest b^nav value receives the role tag “navigation”
the other two blocks are compared by E'(b_j); the one with the higher value receives the “information” role tag
the last block receives the default role tag “reserve”

In special cases:

when two blocks have the same b^nav value, the “information” tag is assigned to both
when all blocks have the same b^nav value, the “reserve” tag is assigned to all
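The role assignment, including the special cases, might look like the sketch below. The text does not say what happens to the third block when exactly two blocks tie, so the example assumes it becomes “navigation” if its b^nav is higher than the tied value and “reserve” otherwise. The function name `assign_roles` and the use of exact equality for ties are also assumptions.

```python
def assign_roles(b_nav, entropy):
    """Return one role tag per block for exactly three blocks.

    b_nav:   b^nav value of each block
    entropy: normalized entropy E' of each block
    """
    b_nav, entropy = list(b_nav), list(entropy)
    if len(b_nav) != 3 or len(entropy) != 3:
        raise ValueError("this sketch expects exactly three blocks")

    distinct = len(set(b_nav))
    if distinct == 1:
        # All blocks tie: every block gets the default tag.
        return ["reserve"] * 3
    if distinct == 2:
        # Exactly two blocks tie: both are tagged "information".
        roles = []
        for value in b_nav:
            if b_nav.count(value) == 2:
                roles.append("information")
            else:
                # Assumption: the remaining block is "navigation" only if it wins.
                roles.append("navigation" if value == max(b_nav) else "reserve")
        return roles

    # Regular case: highest b^nav is navigation, then the higher entropy
    # of the remaining two is information, the last one is reserve.
    nav = max(range(3), key=lambda i: b_nav[i])
    rest = [i for i in range(3) if i != nav]
    info = max(rest, key=lambda i: entropy[i])
    roles = ["reserve"] * 3
    roles[nav] = "navigation"
    roles[info] = "information"
    return roles


regular = assign_roles([1386.3, 0.79, 0.10], [0.0, 0.32, 0.05])
all_tied = assign_roles([2.0, 2.0, 2.0], [0.1, 0.2, 0.3])
pair_tied = assign_roles([5.0, 1.0, 1.0], [0.0, 0.3, 0.2])
```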

Once the role tags are assigned to the content blocks, the Targets are distributed into layers by a passive rule that compares the XPath of the link HTML node with the root XPath of each block. As a Target may have link instances in two or all content blocks, the tags are prioritized by the order of the layers, i.e. if a target has one instance in the “information” block and another in the “navigation” block, it is sent into the Navigation layer.
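The passive rule can be illustrated with the following sketch. The data shapes are assumptions: `block_roots` maps each block root XPath to its role tag, `targets` maps a target URL to the XPaths of its link instances, and a link that falls under no known block root is sent to the Reserve layer by default.

```python
LAYER_PRIORITY = ["navigation", "information", "reserve"]


def role_of(link_xpath, block_roots):
    """Role of the block whose root XPath is the longest prefix of the link XPath."""
    matches = [
        root for root in block_roots
        if link_xpath == root or link_xpath.startswith(root + "/")
    ]
    if not matches:
        return "reserve"  # assumption: unmatched links fall into the default layer
    return block_roots[max(matches, key=len)]


def distribute(targets, block_roots):
    """Send each target to the highest-priority layer among its link instances."""
    layers = {role: [] for role in LAYER_PRIORITY}
    for url, xpaths in targets.items():
        roles = {role_of(xpath, block_roots) for xpath in xpaths} or {"reserve"}
        best = min(roles, key=LAYER_PRIORITY.index)
        layers[best].append(url)
    return layers


block_roots = {
    "/html/body/nav": "navigation",
    "/html/body/main": "information",
    "/html/body/footer": "reserve",
}
targets = {
    "/about": ["/html/body/main/p[1]/a", "/html/body/nav/a[2]"],
    "/post": ["/html/body/main/p[2]/a"],
    "/terms": ["/html/body/footer/a"],
}
layers = distribute(targets, block_roots)
```

Here `/about` is linked from both the information and the navigation block, so it ends up in the Navigation layer, as the prioritization rule requires.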

In the alternative implementation, Template_rank, the module is extended with an Active rule that gives each link a relevancy score computed in inverse proportion to the XPath depth distance (number of path nodes) between the block root node XPath and the node containing the Link Instance.
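A possible form of the Active rule is sketched below. The exact scoring formula is not given in the text, so the example assumes score = 1 / (1 + d), where d is the number of path nodes between the block root and the link node; the added 1 is only there to avoid division by zero when the link node is the block root itself. The function names are hypothetical.

```python
def xpath_depth(xpath):
    """Number of path nodes in an absolute XPath such as /html/body/nav."""
    return len([segment for segment in xpath.split("/") if segment])


def template_rank_score(block_root, link_xpath):
    """Relevancy score that decreases with depth below the block root."""
    if link_xpath != block_root and not link_xpath.startswith(block_root + "/"):
        raise ValueError("link node is not located inside the block")
    distance = xpath_depth(link_xpath) - xpath_depth(block_root)
    return 1.0 / (1 + distance)  # assumption: 1 / (1 + d), see the note above


top_level = template_rank_score("/html/body/nav", "/html/body/nav/a[1]")
nested = template_rank_score("/html/body/nav", "/html/body/nav/ul/li[2]/ul/li/a")
```

A link placed directly under the block root scores higher than one buried in a nested sub-menu, which reflects the assumption that a higher position in the navigation menu hierarchy implies greater significance.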
