# Page decomposition

(DRAFT)

Page decomposition is currently implemented in several separate places within the imbNLP framework. This page covers the theoretical aspects of each page decomposition algorithm implemented.

## Tree Structure Compression

The Tree Structure Compression (TSC) is a content transformation step applied prior to natural language processing operations such as text tokenization and semantic entailment. HTML documents tend to have highly variable structural complexity, i.e. the depth distance from the document root to the block with relevant content varies widely from one web site to another. The TSC filters out invisible parts of the document and removes useless content nestings, effectively shortening the XPath of the end nodes (leaves) down to the minimum hierarchy required to preserve the visible textual content nodes and their semantically valuable relationships.

Steps within the TSC:

1. Collect all leaves, defined as nodes without children but with textual content between the node's opening and closing tags. This is implemented with XPathNavigator and the following XPath query:

`//*[text()][count(*)=0]`

2. The query result is filtered to remove metadata and client-side code (tags such as title, meta, style, script, link, and HTML comments). This tag-name based filtering may leave behind newly empty nodes, so an additional filtering step drops nodes whose content is an empty string or only white space.

3. The node hierarchy compression iterates through the leaf nodes and checks whether the current node has no siblings, i.e. whether the parent node has a single child. If the criterion is met, the parent node is removed and the child is nested into the grandparent node in the parent's place, effectively shortening the child node's XPath by one generation. This operation is repeated until no leaf meeting the criterion remains, or until the stack-overflow protection is triggered by exceeding the repetition limit.

The TSC process may be seen as a retraction of the graph leaves along the main tree branches. The resulting graph has an optimal, known number of generation levels, from which the first-generation branches are interpreted as contextual wrappers for the contained textual information.
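The steps above can be sketched roughly as follows. This is an illustrative Python sketch using the standard-library ElementTree rather than .NET's XPathNavigator, so the leaf collection is done by traversal instead of the XPath query; the repetition limit of 100 is an assumed value, and the manual parent map compensates for ElementTree's lack of `getparent()`:

```python
import xml.etree.ElementTree as ET

DROP_TAGS = {"title", "meta", "style", "script", "link"}

def collect_leaves(root):
    # Steps 1+2: the equivalent of //*[text()][count(*)=0], filtered for
    # meta/client-side tags and for empty or whitespace-only content.
    return [n for n in root.iter()
            if len(n) == 0                       # no child elements
            and (n.text or "").strip()           # has visible text
            and n.tag not in DROP_TAGS]

def compress(root, limit=100):
    # Step 3: retract each leaf along single-child chains toward the root.
    leaves = collect_leaves(root)
    parent = {c: p for p in root.iter() for c in p}
    for leaf in leaves:
        for _ in range(limit):                   # repetition-limit guard
            p = parent.get(leaf)
            gp = parent.get(p) if p is not None else None
            if gp is None or len(p) != 1:        # leaf has siblings: stop
                break
            i = list(gp).index(p)
            gp.remove(p)                         # drop the useless wrapper
            gp.insert(i, leaf)                   # lift leaf one generation up
            parent[leaf] = gp
    return leaves
```

For example, a paragraph wrapped in two nested `div` elements ends up directly under `body`, while leaves that have siblings stay where they are.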

## Semantic Page Decomposition

The semantic page decomposition algorithm detects the semantic role of each extracted content block. It was developed for the Template Module (Semantic Crawlers, imbWBI Web Business Intelligence libraries of the imbVeles Framework). The module is based on the Frontier Layer Module base class and declares three layers:

• M02.L01: The Navigation (the surface layer)

• M02.L02: The Information (the 1st layer)

• M02.L03: The Reserve (the 2nd layer)

each of them associated with a role tag of the same name (“navigation”, “information”, “reserve”).

The first step in the process is to build a graph (tree) representation of the page content from the XPath list of visible HTML leaf nodes containing text. The graph is navigated from root to leaves in an iterative procedure until the scoped node count is greater than or equal to the desired number of blocks (Tbc), which in this case is 3, reflecting the number of layers in the module. Once the criterion is satisfied, the nodes scoped in the current iteration are ranked by their total number of descendants. Each of the first Tbc−1 nodes is processed into its own content block, while the last block is created from the remaining nodes in the list.
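A minimal sketch of this first step, using a hypothetical `Node` class in place of the framework's content graph; the generation-by-generation descent, the descendant-count ranking, and the merge of the remaining nodes into the last block follow the description above:

```python
from dataclasses import dataclass, field

@dataclass
class Node:                       # minimal stand-in for a content-graph node
    name: str
    children: list = field(default_factory=list)

    def descendants(self):
        return len(self.children) + sum(c.descendants() for c in self.children)

def select_blocks(root, tbc=3):
    # Descend one generation at a time until at least tbc nodes are scoped.
    scoped = [root]
    while len(scoped) < tbc:
        step = [c for n in scoped for c in n.children]
        if not step:
            break
        scoped = step
    # Rank the scoped nodes by total descendant count, descending.
    ranked = sorted(scoped, key=Node.descendants, reverse=True)
    blocks = [[n] for n in ranked[:tbc - 1]]   # first Tbc-1 nodes: own blocks
    blocks.append(ranked[tbc - 1:])            # remaining nodes: last block
    return blocks
```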

The second step is to compute Link-Block Frequency (lbf) for each block:

 $lbf_i= \frac {lb_i} {lb_{max}}$ (12)

where $lb_i$ is the number of Link Instances in block i, and $lb_{max}$ is the highest number of Link Instances among the blocks. In most cases, choosing the block with the highest $lbf_i$ would be enough, but it fails when the main section of the page contains a list of recommended links, or a very long descriptive text with frequent in-line links. The third step follows the intuition of Zipf’s Law (Powers, 1998): we assume that blocks with a high volume of textual data have content corpus entropy greater than blocks containing strictly navigation links. We compute the normalized Shannon entropy (Shannon, 1948, p. 11) of the text token frequencies in each content block as:

 $E(b_j) = - \sum_{x_i \in b_j} tf(x_i) \log_2 tf(x_i)$ (13)

where $b_j$ is the collection of tokens from block j, $x_i$ is a token in block j, and $tf(x_i)$ is the frequency of term i normalized by the highest term frequency in the block.
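A sketch of equations (12) and (13), assuming log base 2 and the max-frequency normalization of $tf$ described above (the function names are illustrative):

```python
import math
from collections import Counter

def lbf(link_counts):
    # Eq. (12): link instances per block over the block maximum.
    lb_max = max(link_counts)
    return [lb / lb_max for lb in link_counts] if lb_max else [0.0] * len(link_counts)

def block_entropy(tokens):
    # Eq. (13): Shannon-style entropy over term frequencies normalized by
    # the highest term frequency in the block (log base 2 is an assumption).
    counts = Counter(tokens)
    tf_max = max(counts.values())
    return -sum((c / tf_max) * math.log2(c / tf_max) for c in counts.values())
```

Note that a block repeating a single token (a perfectly homogeneous distribution) yields an entropy of 0, which is why equation (14) below needs the coefficient k.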

In the fourth step these two values describing each block are combined, by division, into the $bnav$ measure:

 $bnav_j = \frac{lbf_j}{E(b_j) + k}$ (14)

where k is a predefined coefficient (k = 0.001) preventing division by zero, as the entropy may be 0 for a perfectly homogeneous distribution.

In the last step the semantic role of each block within the page is assigned by the following tests:

• the block with the highest $bnav$ value receives the role tag “navigation”

• the other two blocks are compared by $E(b_j)$; the one with the higher value receives the “information” role tag

• the last block receives the default role tag “reserve”

In special cases:

• when the two remaining blocks have the same $bnav$ value, the “information” tag is assigned to both

• when all blocks have the same $bnav$ value, the “reserve” tag is assigned to all
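The tests and special cases above can be put together for three blocks as follows (a sketch; reading the “two blocks with the same bnav” case as a tie between the two non-navigation blocks is an assumption):

```python
def assign_roles(bnav_vals, entropies):
    # bnav_vals, entropies: one value per block; three blocks expected.
    if len(set(bnav_vals)) == 1:
        return ["reserve"] * len(bnav_vals)      # all equal -> all "reserve"
    roles = {}
    nav = bnav_vals.index(max(bnav_vals))        # highest bnav -> "navigation"
    roles[nav] = "navigation"
    rest = [i for i in range(len(bnav_vals)) if i != nav]
    if bnav_vals[rest[0]] == bnav_vals[rest[1]]:
        for i in rest:                           # tie -> both "information"
            roles[i] = "information"
    else:
        info = max(rest, key=lambda i: entropies[i])  # higher E(b_j) wins
        for i in rest:
            roles[i] = "information" if i == info else "reserve"
    return [roles[i] for i in range(len(bnav_vals))]
```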

Once the role tags are assigned to the content blocks, the Targets are distributed into layers by a passive rule that compares the XPath of each link HTML node with the root XPath of each block. As a Target may have link instances in two or all three content blocks, the tags are prioritized by layer order, i.e. if a target has one instance in the “information” block and another in the “navigation” block, it is sent to the Navigation layer.
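The passive rule can be sketched as XPath prefix matching with layer-order priority; the block root XPaths in the usage example and the helper name are illustrative:

```python
LAYER_ORDER = ["navigation", "information", "reserve"]  # M02.L01 .. M02.L03

def layer_for_target(instance_xpaths, block_roots):
    # block_roots maps a role tag to the root XPath of that block; a link
    # instance belongs to a block when its XPath starts with the block root.
    for role in LAYER_ORDER:                     # higher layer wins
        root = block_roots[role]
        if any(xp.startswith(root) for xp in instance_xpaths):
            return role
    return "reserve"                             # default layer
```

With instances in both the information and navigation blocks, the target lands in the Navigation layer, as described above.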