Document similarity computation models – literature review

General remarks. The corpus (collection of documents) is represented as a u x v matrix (2D collection). x_ik – frequency (number of occurrences) of…
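
The matrix representation can be illustrated with a minimal sketch. The excerpt does not say which dimension holds documents and which holds terms, so the layout below (rows are documents, columns are terms) is an assumption. The whitespace tokenizer and the name `build_count_matrix` are also hypothetical and used only for illustration.

```python
from collections import Counter


def build_count_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][k] is the count of term k in document i.

    Assumption: rows are documents and columns are terms.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # Sorted vocabulary gives every term a stable column index.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = ["web crawler visits pages", "crawler ranks pages pages"]
vocabulary, matrix = build_count_matrix(docs)
for row in matrix:
    print(row)
```

Each row of the result is one document's term frequency vector, which is the form that the similarity and weighting computations operate on.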

The Diversity Module

The Diversity Module inherits from the Frontier Ranking Module base class and sorts targets according to the estimated semantic difference (the complement of semantic similarity) between the Target and the already crawled content. The crawled content is represented by two collections: the Target Tokens Repository (TTR), which is a domain-level term frequency table aggregating TSTs…
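
A rough sketch of ranking by semantic difference is given below. The excerpt does not name the similarity measure, so cosine similarity over term frequency tables is assumed here. The descending sort order (most different target first) and the names `semantic_difference` and `rank_by_diversity` are likewise assumptions, not the module's actual interface.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two term frequency tables (dict: term -> count)."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty table shares nothing with any other table
    return dot / (norm_a * norm_b)


def semantic_difference(target_terms, crawled_terms):
    """Complement of similarity: 0 means same direction, 1 means no shared terms."""
    return 1.0 - cosine_similarity(target_terms, crawled_terms)


def rank_by_diversity(targets, crawled_terms):
    """Order target names so that the most different target comes first (assumed)."""
    return sorted(
        targets,
        key=lambda name: semantic_difference(targets[name], crawled_terms),
        reverse=True,
    )


# Hypothetical data: one aggregated table for crawled content, two candidate targets.
crawled = {"crawler": 3, "frontier": 2}
targets = {
    "similar": {"crawler": 1, "frontier": 1},
    "novel": {"recipes": 2},
}
print(rank_by_diversity(targets, crawled))
```

With this data the target that shares no terms with the crawled content is ranked first, which matches the idea of preferring content that differs from what has already been collected.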

The Template Module

The heart of this module is the procedure that decomposes a page and detects the semantic role of each extracted content block. This is the only module in the stack that evaluates links using strictly the information that is immutable across the DLC process iterations. Furthermore, the alternative ranking implementation assumes that a higher position in the navigation menu hierarchy…

Web Crawlers – Literature review

The greatest algorithmic challenges of web crawling are estimating the relevance of the loaded page and of the discovered links. Usually, both play a crucial role in frontier scheduling. The earliest relevant works on page importance ranking are: • PageRank [1], which defines web page relevance as a function of the link-reference page relationship, where the sum of…

Term Weighting methods – literature overview

Term Frequency (TF). Where: fr_td = raw frequency of term t in document d. Document Frequency (DF): a global term weighting method in which terms occurring in more documents are considered more relevant. $DF_t = \frac{1}{n_d} \sum_{d=1}^{n} \begin{cases} 1, & t \in d \\ 0, & t \notin d \end{cases}$ Term Frequency – Inverse Document Frequency (TF-IDF): the most common weight computation schema applied with the Vector Space Model is the…
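
The three weighting methods can be sketched in a few lines. The excerpt is truncated before it gives the TF-IDF formula, so the common variant tf * log(1 / DF) is assumed here, with DF taken as the fraction of documents that contain the term. The function names are hypothetical.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Raw frequency of each term in one tokenized document."""
    return Counter(tokens)


def document_frequency(corpus):
    """Fraction of documents in which each term occurs at least once."""
    containing = Counter()
    for tokens in corpus:
        containing.update(set(tokens))  # count each document at most once per term
    return {term: count / len(corpus) for term, count in containing.items()}


def tf_idf(corpus):
    """TF-IDF weights per document, using tf * log(1 / DF) (assumed variant)."""
    df = document_frequency(corpus)
    weights = []
    for tokens in corpus:
        tf = term_frequency(tokens)
        weights.append({term: tf[term] * math.log(1.0 / df[term]) for term in tf})
    return weights


# Hypothetical two-document corpus, already tokenized.
corpus = [["web", "crawler", "web"], ["crawler", "frontier"]]
for doc_weights in tf_idf(corpus):
    print(doc_weights)
```

Note the contrast between the two global measures: DF alone rates "crawler" highest because it occurs in every document, while under this TF-IDF variant the same term receives a weight of zero because it does not distinguish one document from another.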