Author: Goran Grubic

Text document representation models – literature review

The three (currently) most popular document representation methods are:
+ Vector Space Model (VSM)
  + TF-IDF & cosine similarity
  + Latent Semantic Indexing (LSI), also called Latent Semantic Analysis (LSA)
  + Semantic Similarity Retrieval Model (SSRM)
+ Probabilistic topic model
  + Probabilistic Latent Semantic Indexing (PLSI)
  + Latent Dirichlet Allocation
+ Statistical language model
  + n-gram language models
Bag of words Set…
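The TF-IDF and cosine similarity pairing named in this excerpt can be sketched in a few lines. This is a minimal illustration, not code from the post: it assumes whitespace tokenization, raw term counts as TF, and the plain `log(N / df)` form of IDF, and the function names `tfidf_vectors` and `cosine_similarity` are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict term -> weight) per document.

    Assumptions for this sketch: whitespace tokenization, raw counts as TF,
    and IDF = log(N / df) without smoothing.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["the cat sat", "the dog sat down", "the cat saw the dog"]
vectors = tfidf_vectors(docs)
similarity = cosine_similarity(vectors[0], vectors[1])
```

Note that a term occurring in every document (here "the") gets weight zero under this IDF variant, so only the shared, less common terms contribute to the similarity.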

imbSCI 0.2.0 .NET Standard

The imbSCI Foundation libraries have been split into four NuGet packages and migrated to .NET Standard 2.0, with two additional targets (.NET 4.0 and .NET 4.5). As a result, the new NuGet packages (version numbers starting with 0.2.* and carrying the .Standard name suffix) are fully cross-platform, from Windows XP to mobile devices. The new NuGet packages include three target platforms: + .NET…

Document similarity computation models – literature review

General remarks: A corpus (collection of documents) is represented as a u x v matrix (2D collection), where x_ik is the frequency (number of occurrences) of term (word, n-gram, feature) i in document k, u is the number of terms (words, n-grams, features), and v is the number of documents. By features, we refer to: words, lemmas, terms, n-grams… Measures asserting only presence or absence…
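The matrix described in this excerpt can be illustrated with a short sketch. It assumes whitespace tokenization and uses words as the features; the function name `term_document_matrix` is hypothetical and not taken from the post.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][k] is the count of term i in document k.

    The matrix has u rows (terms) and v columns (documents).
    Assumption for this sketch: terms are lowercase whitespace-separated words.
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


docs = ["the cat sat", "the dog sat down", "the cat saw the dog"]
terms, x = term_document_matrix(docs)

# A presence/absence view of the same matrix: 1 if the term occurs, else 0.
presence = [[1 if value else 0 for value in row] for row in x]
```

Here `x[i][k]` plays the role of x_ik, `len(x)` is u, and `len(x[0])` is v.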

The Diversity Module

The Diversity Module inherits the Frontier Ranking Module base class and sorts targets according to the estimated semantic difference (the complement of semantic similarity) between the Target and the already crawled content. The crawled content is represented by two collections: the Target Tokens Repository (TTR), which is a domain-level term frequency table aggregating TSTs…
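The ranking idea in this excerpt, scoring each target by one minus its similarity to the crawled content, can be sketched as follows. This is only an illustration under stated assumptions: the excerpt does not say which similarity measure the module uses, so cosine similarity over raw term frequencies stands in for it, and only a single term frequency table is modeled. The names `rank_by_diversity`, `crawled`, and `targets`, and the example URLs, are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two term-frequency mappings; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_diversity(target_tokens, crawled_term_frequencies):
    """Sort targets by semantic difference (1 - similarity), most different first.

    Assumption: cosine similarity is a stand-in for the module's own measure.
    """
    scored = []
    for target, tokens in target_tokens.items():
        similarity = cosine_similarity(Counter(tokens), crawled_term_frequencies)
        scored.append((target, 1.0 - similarity))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical term frequency table for the content crawled so far.
crawled = Counter({"pump": 5, "valve": 3, "price": 1})
# Hypothetical tokens associated with each candidate target.
targets = {
    "/products/pumps": ["pump", "valve", "pump"],
    "/about/history": ["company", "history", "founder"],
    "/products/prices": ["price", "pump", "catalog"],
}
ranking = rank_by_diversity(targets, crawled)
```

With these inputs, the target that shares no terms with the crawled content ranks first and the one most similar to it ranks last.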

The Template Module

The heart of this module is the procedure of page decomposition and detection of the semantic role of each extracted content block. This is the only module in the stack that evaluates links using strictly information that is immutable across the DLC process iterations. Furthermore, the alternative ranking implementation assumes that a higher position in the navigation menu hierarchy…