imbVeles – Page 5 – Web Exploration, Load and Extraction Subsystem

Text document representation models – literature review

By Goran Grubic | Literature | document model, document representation

The three (currently) most popular document representation methods are: Vector Space Model (VSM), TF-IDF & Cosine similarity, and Latent Semantic Indexing (LSI). The Vector Space Model, as a document representation method, does not capture the semantic relations between terms. The LSI method overcomes this limitation of VSM. LSI is an approach that uses a particular matrix transformation technique called Singular Value…
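
To make the TF-IDF and cosine similarity combination more concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from the post or from imbSCI: the function names, the sample documents, and the smoothed IDF variant are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per tokenized document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF; this particular variant is an assumption, many others exist.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within the document
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, already tokenized.
docs = [
    "web crawler explores web pages".split(),
    "crawler ranks pages by relevance".split(),
    "singular value decomposition of a matrix".split(),
]
vocab, vectors = tf_idf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # the two documents share "crawler" and "pages"
sim_02 = cosine(vectors[0], vectors[2])  # the two documents share no terms
```

Because the third document shares no terms with the first, their similarity is exactly zero, even if a reader would consider the topics related. This is the kind of limitation the excerpt attributes to VSM and that LSI is meant to address.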

imbSCI 0.2.0 .NET Standard

By Goran Grubic | imbSCI | .NET Standard 2.0, imbSCI, NuGet

The imbSCI Foundation libraries have been separated into four NuGet packages and migrated to .NET Standard 2.0, with two additional targets (.NET 4.0 and .NET 4.5). The new NuGet packages (version numbers starting with 0.2.*, with the .Standard name suffix) are therefore fully cross-platform, from Windows XP to mobile devices. The new NuGet packages include three target platforms: .NET…

Document similarity computation models – literature review

By Goran Grubic | Literature | Common Features, Correlation, Cosine, Distinctive, Jaccard, LSA, LSI, Overlap, Ratio, VSM, weighted vector space model

General remarks. In linguistics, a corpus (plural corpora), or text corpus, is a large and structured set of texts (nowadays usually electronically stored and processed), used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory. The representation of a corpus (collection of documents) is a u x v matrix (2D collection), where x_ik is the frequency (number of occurrences) of…
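
The matrix view of a corpus can be sketched in a few lines of Python. The excerpt is cut off before it says which index refers to documents and which to terms, so the orientation used here (rows are documents, columns are terms) is an assumption, as are the function name and the toy corpus.

```python
from collections import Counter


def term_frequency_matrix(docs):
    """Build a frequency matrix: one row per document, one column per vocabulary term."""
    vocab = sorted({term for doc in docs for term in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # number of occurrences of each term in this document
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Hypothetical toy corpus of two tokenized documents.
docs = [
    ["web", "crawler", "web"],
    ["crawler", "frontier"],
]
vocab, X = term_frequency_matrix(docs)
# X[i][k] holds the number of occurrences of vocab[k] in docs[i].
```

Similarity measures such as the ones listed in the tags of this post can then be computed between rows of such a matrix.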

The Diversity Module

By Goran Grubic | imbWEM | crawler, Diversity, semantic, SSRM

The Diversity Module inherits the Frontier Ranking Module base class and sorts targets according to the estimated semantic difference (the complementary value of semantic similarity) between the Target and the already crawled content. The crawled content is represented by two collections: the Target Tokens Repository (TTR), which is a domain-level term frequency table aggregating TSTs…
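
The idea of ranking targets by semantic difference, taken as the complement of similarity, can be sketched as follows. This is not the imbWEM implementation: plain cosine similarity over term counts stands in for the module's actual semantic similarity measure, and every name and data value below is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency tables (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_diversity(targets, crawled_terms):
    """Sort targets so that the most different from the crawled content come first."""
    scored = []
    for url, tokens in targets.items():
        # Semantic difference is taken as the complement of similarity.
        difference = 1.0 - cosine_similarity(Counter(tokens), crawled_terms)
        scored.append((url, difference))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Hypothetical domain-level term frequency table of already crawled content.
crawled = Counter({"product": 5, "price": 3, "catalog": 2})
# Hypothetical targets, each described by the tokens associated with it.
targets = {
    "/products/new": ["product", "price", "catalog"],
    "/about/team": ["team", "history", "management"],
    "/products/contact": ["product", "contact"],
}
ranking = rank_by_diversity(targets, crawled)
```

In this toy run the target that shares no vocabulary with the crawled content is ranked first, which matches the stated goal of steering the crawler toward content that differs from what has already been loaded.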

The Template Module

By Goran Grubic | imbWEM | block, content, crawler, FRA, navigation, page decomposition, template

The heart of this module is the procedure of page decomposition and detection of the semantic role of each extracted content block. This is the only module in the stack that evaluates links using strictly information that is immutable across the DLC process iterations. Furthermore, the alternative ranking implementation assumes that higher position in the navigation menu hierarchy…

The Structure Module

By Goran Grubic | imbWEM | crawler, FRA, navigation, Structure module

It is designed on the premise that URL segments, read from left to right, may be interpreted as a parent-child hierarchy, in order to construct the factors of FRA influence based on the following assumptions, given in order of dominance: a link-path node closer to the root is more relevant than one positioned deeper in the graph; link-path…
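
The first assumption, that a link-path node closer to the root is more relevant, can be illustrated with a short Python sketch. It covers only that one assumption (the rest are cut off in the excerpt), and the scoring formula, function names, and example URLs are assumptions made for the illustration, not the module's actual factors.

```python
from urllib.parse import urlsplit


def path_segments(url):
    """Split the URL path into its non-empty segments, left to right (parent to child)."""
    return [segment for segment in urlsplit(url).path.split("/") if segment]


def depth_score(url):
    """Hypothetical score: 1.0 at the root, decreasing as the path gets deeper."""
    return 1.0 / (1 + len(path_segments(url)))


# Hypothetical links found on a crawled page.
links = [
    "https://example.com/products/pumps/model-a",
    "https://example.com/products",
    "https://example.com/",
    "https://example.com/about/team",
]
# Links closer to the root of the site are placed first.
ranked = sorted(links, key=depth_score, reverse=True)
```

A real ranking would combine this factor with the remaining assumptions in their stated order of dominance. The sketch only shows how the left-to-right segment hierarchy turns into a depth value.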