The heart of this module is procedure of page decomposition and detection of semantic role for each of extracted content blocks. This is the only module in the stack that evaluates links using strictly information immutable across the DLC process iterations. Furthermore, the alternative ranking implementation assumes that higher position in the navigation menu hierarchy implies greater significance
The Template Module inherits the Frontier Layer Module base class and declares three layers:
- M02.L01: The Navigation (the surface layer)
- M02.L02: The Information (the 1st layer)
- M02.L03: The Reserve (the 2nd layer)
each of them associated to role tag with the same name (“navigation”, “information”, “reserve”).
The first step in the process is to build graph tree representation of the page content from XPath list of visible HTML leaf nodes containing text. The graph is navigated from root to leafs in iterative procedure until the scoped node count is more or equal to the desired number of blocks (Tbc), which is in this case 3, reflecting the number of layers in the module. Once the criterion is satisfied we rank the nodes scoped in the current iteration by total number of descendants. Each of the first Tbc-1 nodes is processed into content block, while the last block is created from the rest of nodes in the list.
The second step is to compute Link-Block Frequency (lbf) for each block:
where lbi is number of Link Instances in the block i, lbmax is highest number of Link Instances among the blocks. In the most cases choosing the block with the highest lbfi would be enough but it would fail in cases where the main section of the page contains list with recommended links or very long descriptive text with frequent occurrence of in-line links. The third step follows intuition of Zipf’s Law (Powers, 1998), where we assume that textual blocks with high volume of textual data would have content corpusIn linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.... entropy greater then the blocks having strictly navigation links. We compute the normalized Shannon’s entropy (Shannon, 1948, p. 11) of text token frequencies in each content block as:
where bj is collection of tokens from the block j, xi is a token in the block j, tf(xi) is term i frequency normalized by the highest term frequency in the block.
In the fourth step these two values describing each block are combined by division into bnav measure:
where k is predefined coefficient (k=0.001) preventing division by zero as the entropy may have 0 value for perfectly homogeneous distribution.
In the last step the semantic role of a block within the page is assigned by two tests:
- block with the highest bnav value receives the role tag “navigation“
- the other two blocks are compared for E(bj), the one with higher value receives the the “information” role tag
- the last block receives the default role tag: “reserve”
In special cases:
- when two blocks have the same bnav value: the “information” tag is assigned to the both
- when all blocks have the same bnav value: the “reserve” tag is assigned to all
Once the role tags are assigned to each content block the Targets are distributed into layers by a passive rule by comparing the link HTML node XPath with root XPath of each block. As the Target may have link instances in two or all content blocks the tags are prioritized by order of layers i.e. if a target has instance in “information” and another in “navigation” blocks, it is sent into the Navigation layer.
In alternative implementation, the Templaterank, the module is extended with an Active rule giving the link relevancy score computed as inverse proportion of XPath depth distance (number of path nodes) between the block root node XPath and the node containing the Link Instance.