The Diversity Module inherits from the Frontier Ranking Module base class and sorts Targets by the estimated semantic difference (the complement of semantic similarity) between each Target and the content crawled so far. The crawled content is represented by two collections:
- the Target Tokens Repository (TTR), a domain-level term frequency table that aggregates the TSTs of the Targets crawled so far;
- the Page Tokens Repository (PTR), a domain-level term frequency table that aggregates the content terms of the pages loaded so far.
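As an illustration only, a domain-level term frequency table of this kind could be sketched as below. The class name `TokenRepository` and its `add` method are hypothetical and not taken from the crawler's code base; the sketch only shows how terms from successive Targets or pages would be aggregated into one table.

```python
from collections import Counter


class TokenRepository:
    """Domain-level term frequency table (hypothetical name and API)."""

    def __init__(self):
        # Maps each term to the number of times it has been seen so far.
        self.frequencies = Counter()

    def add(self, terms):
        """Aggregate the terms of one crawled Target or one loaded page."""
        self.frequencies.update(terms)


# One instance would play the role of the TTR, another the role of the PTR.
ttr = TokenRepository()
ttr.add(["crawler", "frontier"])
ttr.add(["crawler", "ranking"])
```

After the two calls, `ttr.frequencies["crawler"]` is 2 and the other terms have a count of 1.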
The Semantic Diversity Score (TSDM) is calculated as follows:
$$TSDM = 1 - (k_t \cdot S_T + k_p \cdot S_P) \qquad (5)$$
where $S_T$ is the semantic similarity between the TST and the TTR, $S_P$ is the semantic similarity between the TST and the PTR, and $k_t$ and $k_p$ are the predefined weights (both set to 0.5) of the two similarity factors. Semantic similarity is computed as proposed by the Semantic Similarity Retrieval Model (SSRM) [page covering SSRM], equation 5 of that model, except for the way the query terms (the TST in our case) are expanded. We expand the query terms in two iterations; in each iteration the set is extended with all related Lexicon instances. In our SSRM implementation we follow one of the recommended models for Semantic Term Distance (STD). The STD is defined as the number of hops between two nodes of the Lexicon graph, and the Semantic Relevance (SR) between two terms is defined as inversely proportional to the Semantic Term Distance:
$$SR(t_1, t_2) = \frac{1}{STD(t_1, t_2)} \qquad (6)$$
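The hop-based distance and the two-iteration expansion can be sketched as follows. This is a minimal sketch under stated assumptions, not the module's implementation: the Lexicon is assumed to be an undirected graph stored as an adjacency dictionary, SR is taken as 1/STD, and the values returned for identical terms (1.0) and for unconnected terms (0.0) are choices made for this example. All function names are hypothetical.

```python
from collections import deque


def semantic_term_distance(lexicon, source, target):
    """Number of hops between two Lexicon nodes, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:  # breadth-first search yields the minimal hop count
        node, hops = queue.popleft()
        for neighbour in lexicon.get(node, ()):
            if neighbour == target:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


def semantic_relevance(lexicon, a, b):
    """SR as the inverse of STD (edge cases are assumptions of this sketch)."""
    std = semantic_term_distance(lexicon, a, b)
    if std is None:
        return 0.0  # assumption: unconnected terms have no relevance
    if std == 0:
        return 1.0  # assumption: a term is fully relevant to itself
    return 1.0 / std


def expand_terms(lexicon, terms, iterations=2):
    """Extend the term set with all directly related nodes, per iteration."""
    expanded = set(terms)
    frontier = set(terms)
    for _ in range(iterations):
        frontier = {n for t in frontier for n in lexicon.get(t, ())} - expanded
        expanded |= frontier
    return expanded


# Toy Lexicon: car - vehicle - truck - lorry
lexicon = {
    "car": ["vehicle"],
    "vehicle": ["car", "truck"],
    "truck": ["vehicle", "lorry"],
    "lorry": ["truck"],
}
```

In this toy graph, "car" and "truck" are two hops apart, so their SR is 0.5, and expanding `["car"]` in two iterations reaches "vehicle" and "truck" but not "lorry".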
The module contains a single Active Rule that performs the TSDM calculation and sorts the Targets so that the one with the highest TSDM is at the top of the list.
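The scoring and sorting step might look like the sketch below. It assumes that the diversity score is the complement of the weighted sum of the two similarities, which follows from the description of semantic difference as the complementary value of semantic similarity, and that the similarities are already computed and lie in [0, 1]. The function names and the input shape are hypothetical.

```python
def diversity_score(s_t, s_p, k_t=0.5, k_p=0.5):
    """Assumed form: complement of the weighted similarity to TTR and PTR."""
    return 1.0 - (k_t * s_t + k_p * s_p)


def rank_targets(similarities):
    """Sort Targets so that the most diverse one comes first.

    `similarities` maps a Target identifier to a (S_T, S_P) pair.
    """
    return sorted(
        similarities,
        key=lambda target: diversity_score(*similarities[target]),
        reverse=True,
    )


ranked = rank_targets({
    "a": (0.9, 0.8),  # very similar to crawled content, low diversity
    "b": (0.1, 0.2),  # dissimilar, high diversity
    "c": (0.5, 0.5),
})
```

Here `ranked` places "b" first and "a" last, since "b" is the least similar to what has already been crawled.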