# Composite Text Density

Using the textual information, we propose two measures for evaluating the textual importance of tags in web pages: Text Density and Composite Text Density. Once an HTML document is parsed and represented as a DOM tree, we calculate the text density for each node. A higher text density implies that the node is more likely to hold content text within the web page; for noise, the opposite applies. Afterward, we extend the Text Density to the Composite Text Density by adding statistical information about hyperlinks.

Song, D., Sun, F., & Liao, L. (2015). A hybrid approach for content extraction with text density and visual importance of DOM nodes, 75–96. http://doi.org/10.1007/s10115-013-0687-x

$$TD_i = \frac{C_i}{T_i}$$

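where $TD_i$ is the Text Density of node $i$ (an HTML tag), $C_i$ is its CharNumber (the number of all characters in its subtree), and $T_i$ is its TagNumber (the number of all tags in its subtree).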
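As an illustration of the definition above, the following is a minimal Python sketch that builds a simplified tag tree with the standard-library `html.parser` and computes $C_i$, $T_i$, and $TD_i$ per node. The `Node`, `TreeBuilder`, and `parse_html` names are hypothetical, and two choices are assumptions rather than details taken from the source: whitespace around text is stripped before counting characters, and $T_i$ counts descendant tags only (so a tag containing only text has $T_i = 0$, which is then treated as 1).

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed subset, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr", "source"}


class Node:
    """One tag in a simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_chars = 0  # characters of text directly inside this tag

    def char_number(self):
        """C_i: all characters in the subtree rooted at this node."""
        return self.own_chars + sum(c.char_number() for c in self.children)

    def tag_number(self):
        """T_i: all tags below this node (the node itself is not counted)."""
        return sum(1 + c.tag_number() for c in self.children)

    def text_density(self):
        """TD_i = C_i / T_i, with T_i treated as 1 when it is 0."""
        return self.char_number() / (self.tag_number() or 1)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Assumption: surrounding whitespace does not count as content.
        self.current.own_chars += len(data.strip())


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

For example, parsing `<div><p>Hello world</p><a href='#'>x</a></div>` gives the `<div>` 12 characters over 2 descendant tags, while the `<p>` alone has 11 characters and no descendant tags.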

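$$CTD_i = \frac{C_i}{T_i} \cdot \log_{\ln\left(\frac{C_i}{\neg LC_i} LC_i + \frac{LC_b}{C_b} C_i + e\right)} \left(\frac{C_i}{LC_i} \cdot \frac{T_i}{LT_i}\right)$$

Additional statistical information is used:

• $LC_i$ (LinkCharNumber) – number of all hyperlink characters in its subtree
• $LT_i$ (LinkTagNumber) – number of all hyperlink tags in its subtree

Here $\neg LC_i$ is the number of all non-hyperlink characters in the subtree of $i$, and $C_b$ and $LC_b$ are the CharNumber and LinkCharNumber of the `<body>` tag.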
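The sketch below shows how the hyperlink statistics could enter the computation. It assumes the Composite Text Density has the form given in the cited work: the text density $C_i / T_i$ multiplied by the logarithm of $\frac{C_i}{LC_i} \cdot \frac{T_i}{LT_i}$, taken to the base $\ln\left(\frac{C_i}{\neg LC_i} LC_i + \frac{LC_b}{C_b} C_i + e\right)$, where $\neg LC_i = C_i - LC_i$ and the subscript $b$ refers to the `<body>` tag. The function name and parameter names are hypothetical. Replacing every zero denominator by 1, returning 0 for a node without text, and falling back to the plain text density when the base equals 1 (a page without hyperlinks) are assumptions made to keep the sketch well defined.

```python
import math


def composite_text_density(c_i, t_i, lc_i, lt_i, c_b, lc_b):
    """Sketch of CTD_i under the assumptions stated in the lead-in.

    c_i, t_i   -- CharNumber and TagNumber of node i
    lc_i, lt_i -- LinkCharNumber and LinkTagNumber of node i
    c_b, lc_b  -- CharNumber and LinkCharNumber of the <body> tag
    """

    def nz(x):
        # Assumption: any denominator that is 0 is replaced by 1.
        return x if x != 0 else 1

    if c_i == 0:
        return 0.0  # assumption: a node without text has zero density

    text_density = c_i / nz(t_i)
    non_link_chars = nz(c_i - lc_i)

    # Base of the logarithm: grows with the share of link characters.
    base = math.log(c_i / non_link_chars * lc_i + lc_b / nz(c_b) * c_i + math.e)
    if math.isclose(base, 1.0):
        # Assumption: with no hyperlinks anywhere the base is ln(e) = 1 and
        # the logarithm is undefined, so fall back to the plain text density.
        return text_density

    argument = (c_i / nz(lc_i)) * (nz(t_i) / nz(lt_i))
    return text_density * math.log(argument, base)
```

With equal CharNumber and TagNumber, a node whose characters are mostly inside hyperlinks (for example a navigation bar) gets a larger base and a smaller argument, and therefore a much lower score than a node with no links.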

To combine textual and visual information for DOM nodes, the measure of visual importance is incorporated into the Composite Text Density, which is redefined as the Hybrid Text Density. If $i$ is a leaf node in the DOM tree, its CharNumber ($C_i$) is adjusted to the Hybrid CharNumber ($HC_i$) using the Visual Importance value as a weight: $HC_i = VI_i \cdot C_i$. For other tag nodes, the Hybrid CharNumber is defined as the sum of the Hybrid CharNumbers of all its children.
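The recursive rule for the Hybrid CharNumber can be sketched in a few lines of Python. The `DomNode` class and its field names are hypothetical, and the Visual Importance values are supplied by hand here; how they are derived is outside the scope of this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DomNode:
    char_number: int = 0            # C_i (only read for leaf nodes here)
    visual_importance: float = 1.0  # VI_i, expected to lie in [0, 1]
    children: List["DomNode"] = field(default_factory=list)


def hybrid_char_number(node: DomNode) -> float:
    """HC_i: VI_i * C_i for a leaf, otherwise the sum over the children."""
    if not node.children:
        return node.visual_importance * node.char_number
    return sum(hybrid_char_number(child) for child in node.children)
```

For a parent with two leaves, one with 100 characters and a weight of 0.5 and one with 40 characters and a weight of 0.25, the parent's Hybrid CharNumber is 50 + 10 = 60.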

Definition 3.4 If $i$ is a node in the DOM tree, then its Hybrid Text Density ($HTD_i$) is given by Eq. 4, in which all appearances of CharNumber in Eq. 2 (the Composite Text Density) are substituted by the Hybrid CharNumber. Specifically, $C_i$ and $C_b$ for node $i$ and the `<body>` tag are changed to $HC_i$ and $HC_b$, respectively, while the remainder of Eq. 4 is exactly the same as Eq. 2.
Note that previously, in the definitions of Text Density and Composite Text Density, $T_i$ is set to 1 when it is 0. In the above definition of the Hybrid Text Density, however, the Visual Importance value $VI$ lies in the interval $[0, 1]$, so the derived Hybrid CharNumber may be less than 1. To avoid the Hybrid Text Density being negative, an adaptation is made here: if $T_i$ is 0, it is set to $HC_i / C_i$, where $C_i$ and $HC_i$ are the initial and the Hybrid CharNumbers, respectively.
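The adjustment of $T_i$ for the Hybrid Text Density can be expressed as a small helper. This is only a sketch: the function name is hypothetical, and the two fallbacks to 1 (for a node with no characters, and for a node whose Hybrid CharNumber is 0) are assumptions added to avoid division by zero, not rules stated in the source.

```python
def hybrid_tag_number(t_i, c_i, hc_i):
    """Return the TagNumber to use in the Hybrid Text Density.

    t_i  -- TagNumber of node i
    c_i  -- initial CharNumber of node i
    hc_i -- Hybrid CharNumber of node i
    """
    if t_i != 0:
        return t_i
    if c_i == 0 or hc_i == 0:
        # Assumption: fall back to 1 so that later divisions stay defined.
        return 1
    # Rule from the text: a zero TagNumber is replaced by HC_i / C_i.
    return hc_i / c_i
```

For a leaf node, $HC_i / C_i$ equals its Visual Importance value, so the ratio $HC_i / T_i$ reduces to the original CharNumber $C_i$.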
With the Hybrid Text Density, the resulting density histogram for the FT page is shown in Fig. 6.

Unless otherwise specified, we use Text Density to refer to the initial Text Density, the Composite Text Density, and the Hybrid Text Density. These calculation methods are discussed further in Sect. 4.