Category: Literature

Document similarity computation models – literature review

General remarks. The corpus (collection of documents) is represented as a u × v matrix (2D collection). xik – frequency (number of occurrences) of…
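
To make the matrix representation concrete, here is a minimal sketch that builds a matrix of raw occurrence counts from a toy corpus. The excerpt is truncated before it says which axis holds documents and which holds terms, so the layout here (rows are documents, columns are terms) is an assumption, as are the whitespace tokenizer, the function name `count_matrix` and the sample sentences.

```python
from collections import Counter


def count_matrix(docs):
    """Build a documents-by-terms matrix of raw occurrence counts.

    Assumption: rows are documents and columns are terms; the
    source excerpt does not state the orientation.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary gives a stable column order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # Counter returns 0 for terms absent from this document.
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


docs = ["the cat sat", "the cat saw the dog"]
vocab, matrix = count_matrix(docs)
for row in matrix:
    print(row)
```

Each cell holds how often the column's term occurs in the row's document, which is the role the excerpt gives to xik.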

Web Crawlers – Literature review

The greatest algorithmic challenges of web crawling are estimating the relevance of loaded pages and of discovered links. Usually, both play a crucial role in frontier scheduling. The earliest relevant works on page importance ranking are: • PageRank [1], which defines web page relevance as a function of the link-reference page relationship, where the sum of…
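
As an illustration of how relevance estimates can drive frontier scheduling, the sketch below keeps discovered URLs in a priority queue and always hands out the highest-scored one next. It is not taken from the reviewed literature: the `Frontier` class, the example URLs and the scores are hypothetical, and the scores stand in for whatever page or link relevance estimate a crawler actually uses.

```python
import heapq


class Frontier:
    """Hypothetical crawl frontier ordered by estimated relevance."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: earlier discoveries first

    def push(self, url, score):
        # Skip URLs that were already scheduled once.
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the best first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.push("http://example.com/a", 0.2)
frontier.push("http://example.com/b", 0.9)
frontier.push("http://example.com/c", 0.5)
frontier.push("http://example.com/b", 0.1)  # duplicate, ignored

order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

With these assumed scores the pages are visited in the order b, c, a.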

Term Weighting methods – literature overview

Term Frequency (TF). Where: fr_{t,d} = raw frequency of term t in document d. Document Frequency – DF. A global term weighting method in which terms occurring in more documents are considered more relevant. $DF_t = \frac{1}{n_d} \sum_{d=1}^{n} \mathbb{1}[t \in d]$, where the indicator is 1 if t ∈ d and 0 otherwise. Term Frequency – Inverse Document Frequency – TF-IDF. The most common weight computation schema applied with the Vector Space Model is the…
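
The two weights defined above can be sketched in a few lines. This is an illustration under assumptions, not the post's implementation: TF is taken to be the raw frequency itself, DF is normalized by the number of documents in the corpus (one reading of n_d), and the function names and toy corpus are made up for the example. TF-IDF is left out because the excerpt is cut off before its definition.

```python
def term_frequency(term, doc_tokens):
    """Raw frequency of `term` in one tokenized document."""
    return doc_tokens.count(term)


def document_frequency(term, corpus_tokens):
    """Share of documents that contain `term` at least once.

    Assumption: the normalizer is the number of documents in the corpus.
    """
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return containing / len(corpus_tokens)


corpus = [
    ["web", "crawler", "web"],
    ["term", "weight"],
    ["web", "page"],
]

tf_web = term_frequency("web", corpus[0])
df_web = document_frequency("web", corpus)
print(tf_web, df_web)
```

Here "web" occurs twice in the first document and appears in two of the three documents.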