Category: Literature

SM05: Enhanced feature selection for website classification

The goal of this study is to propose a heuristic upgrade to existing feature selection (FS) functions that would improve multi-class single-label classification by exploiting information already available in the training dataset. The proposed FIP (Flat Inverse Particularity) function is based on the assumption that features with small website-level frequency and high page-level frequency within a single website,…

Text document representation models – literature review

The three currently most popular document representation methods are: Vector Space Model (VSM), TF-IDF & cosine similarity, and Latent Semantic Indexing (LSI). The Vector Space Model document representation method doesn't capture the semantic relations between terms. The LSI method overcomes this limitation of VSM. LSI is an approach that uses a particular matrix transformation technique called Singular Value…
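As a rough illustration of the Vector Space Model combined with cosine similarity, the sketch below compares two texts by their raw term-frequency vectors. The whitespace tokenization and the function name `cosine_similarity` are assumptions made for this example, not part of the reviewed methods.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the raw term-frequency vectors of two texts."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())

    # Only terms present in both texts contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction in term space
    return dot / (norm_a * norm_b)


same = cosine_similarity("web page ranking", "web page ranking")
disjoint = cosine_similarity("web page", "term weight")
```

Texts with identical term counts score close to 1 and texts with no shared terms score 0. Because the vectors hold only surface terms, synonyms are treated as unrelated, which is the limitation the excerpt attributes to VSM.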

Document similarity computation models – literature review

General remarks: The corpus (collection of documents) representation is a u × v matrix (2D collection). x_ik: frequency (number of occurrences) of…
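A minimal sketch of such a corpus matrix follows. It assumes that rows correspond to documents and columns to terms, so that entry `matrix[i][k]` holds the number of occurrences of term k in document i; the excerpt is truncated before it states the orientation, so this layout and the helper name `build_frequency_matrix` are hypothetical.

```python
from collections import Counter


def build_frequency_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][k] counts term k in document i."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = ["web crawler visits web pages", "crawler ranks pages"]
vocabulary, matrix = build_frequency_matrix(docs)
# vocabulary is ['crawler', 'pages', 'ranks', 'visits', 'web']
# matrix is [[1, 1, 0, 1, 2], [1, 1, 1, 0, 0]]
```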

Web Crawlers – Literature review

The greatest algorithmic challenges of web crawling are estimating the relevance of loaded pages and of discovered links. Usually, both play a crucial role in frontier scheduling. The earliest relevant works on page importance ranking are: • PageRank [1], which defines web page relevance as a function of the link-reference page relationship where sum of…

Term Weighting methods – literature overview

Term Frequency (TF). Where: $fr_{td}$ = raw frequency of term t in document d. Document Frequency (DF): a global term weighting method, where terms occurring in more documents are considered more relevant. $DF_t = \frac{1}{n_d} \sum_{d=1}^{n} \begin{cases} 1, & t \in d \\ 0, & t \notin d \end{cases}$ Term Frequency – Inverse Document Frequency (TF-IDF): The most common weight computation schema applied with the Vector Space Model is the…
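The Document Frequency definition above can be sketched as follows, assuming that n_d denotes the total number of documents in the corpus and that each document is already a list of tokens. The function name `document_frequency` is hypothetical.

```python
def document_frequency(term, documents):
    """Fraction of documents that contain the term at least once."""
    if not documents:
        return 0.0  # avoid division by zero on an empty corpus
    # Each document contributes 1 if it contains the term, 0 otherwise.
    containing = sum(1 for tokens in documents if term in tokens)
    return containing / len(documents)


corpus = [
    ["web", "crawler"],
    ["web", "page", "rank"],
    ["term", "weight"],
    ["web", "term"],
]
df_web = document_frequency("web", corpus)    # 3 of 4 documents -> 0.75
df_rank = document_frequency("rank", corpus)  # 1 of 4 documents -> 0.25
```

Under this weighting "web" scores higher than "rank" simply because it appears in more documents, which is the behaviour the DF method describes.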