Document similarity computation models – literature review

This post is a collection of short notes on text document similarity computation models, in the context of text categorization / web page categorization. Its purpose is to serve as a continuously growing list of relevant models developed in the field.

General remarks

Corpus (collection of documents) representation is a u x v matrix (2D array).

  • xik – frequency (number of occurrences) of feature i in document k
  • u – number of features
  • v – number of documents

By features we refer to words, lemmas, terms, n-grams… Note that in the similarity formulas below the indexing is reversed: i and j denote documents, while k denotes a feature (a small sketch of this representation follows).
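As a minimal illustration of this representation (plain Python; the two toy documents are hypothetical), a corpus can be turned into a feature-by-document count matrix:

```python
from collections import Counter

# Hypothetical toy corpus: v = 2 documents
docs = ["the cat sat on the mat", "the dog sat on the log"]

# Vocabulary: u distinct features (here simply lowercase words)
vocab = sorted({w for d in docs for w in d.split()})

# u x v matrix: x[i][k] = frequency of feature i in document k
counts = [Counter(d.split()) for d in docs]
x = [[counts[k][w] for k in range(len(docs))] for w in vocab]

for w, row in zip(vocab, x):
    print(f"{w:>4}: {row}")
```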


Binary (Boolean) Similarity measures

Measures asserting only the presence or absence 1 of features in the documents.

tik is a Boolean value indicating whether feature k occurs in document i:

  • tik = 1 (true), if xik > 0
  • tik = 0 (false), if xik = 0

Base metrics from which the similarity measures between two documents i and j are computed (a small numeric sketch follows these definitions):

  • Count of common features:

a_{ij} = \sum_k {t_{ik} \cdot t_{jk}}

  • Counts (bij, cij) of distinctive features of documents i and j, respectively (features present in one document but missing from the other):

b_{ij} = \sum_k {t_{ik} (1 - t_{jk})}

c_{ij} = \sum_k { (1-t_{ik}) \cdot t_{jk}}

  • Count of features contained within neither document:

d_{ij} = \sum_k { (1-t_{ik}) \cdot (1 - t_{jk})}
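A minimal sketch of the four base counts (plain Python; the two count vectors are hypothetical):

```python
# Hypothetical raw count vectors for documents i and j over the same feature set
x_i = [3, 0, 1, 0, 2]
x_j = [1, 1, 0, 0, 4]

# Boolean presence indicators t_ik, t_jk
t_i = [1 if v > 0 else 0 for v in x_i]
t_j = [1 if v > 0 else 0 for v in x_j]

a = sum(ti * tj for ti, tj in zip(t_i, t_j))              # common features
b = sum(ti * (1 - tj) for ti, tj in zip(t_i, t_j))        # only in document i
c = sum((1 - ti) * tj for ti, tj in zip(t_i, t_j))        # only in document j
d = sum((1 - ti) * (1 - tj) for ti, tj in zip(t_i, t_j))  # in neither document

print(a, b, c, d)  # -> 2 1 1 1
```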


Common Features Model

Special case of the Contrast Model 2

s^{com}_{ij} = {a_{ij} \over {a_{ij} + b_{ij} + c_{ij} + d_{ij}}}

Ratio Model

s^{rat}_{ij} = {a_{ij} \over {a_{ij} + b_{ij} + c_{ij}}}

Distinctive Features

Special case of the Contrast Model 3

s^{dis}_{ij} = {{a_{ij} + b_{ij}} \over {a_{ij} + b_{ij} + c_{ij} + d_{ij}}}
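Continuing the sketch above, the three binary models follow directly from the counts a, b, c, d (plain Python; the values are the hypothetical ones computed earlier):

```python
def common_features(a, b, c, d):
    # s_com = a / (a + b + c + d)
    return a / (a + b + c + d)

def ratio_model(a, b, c):
    # s_rat = a / (a + b + c)
    return a / (a + b + c)

def distinctive_features(a, b, c, d):
    # s_dis = (a + b) / (a + b + c + d)
    return (a + b) / (a + b + c + d)

# Using a=2, b=1, c=1, d=1 from the previous sketch
print(common_features(2, 1, 1, 1))       # 0.4
print(ratio_model(2, 1, 1))              # 0.5
print(distinctive_features(2, 1, 1, 1))  # 0.6
```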


Count (Frequency) Similarity Models

Correlation model

S^{cor}_{ij} = {{\sum_k {x_{ik} \cdot x_{jk} }} \over {\sum_k {x^2_{ik}} + \sum_k {x^2_{jk}}}}

Jaccard model

S^{jac}_{ij} = {{\sum_k {x_{ik} \cdot x_{jk} }} \over {\sum_k {x^2_{ik}} + \sum_k {x^2_{jk}} - \sum_k {x_{ik} \cdot x_{jk} }}}

Cosine model

S^{cos}_{ij} = {{\sum_k {x_{ik} \cdot x_{jk} }} \over \sqrt {\sum_k {x^2_{ik}} \cdot \sum_k {x^2_{jk}}}}

Cosine similarity is also called angular separation similarity.

Overlap model

S^{ove}_{ij} = {{\sum_k {x_{ik} \cdot x_{jk} }} \over \min \left ( \sum_k {x^2_{ik}}, \sum_k {x^2_{jk}} \right )}
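A minimal sketch of the four count-based models (plain Python; the count vectors are hypothetical and the Correlation and Jaccard denominators follow the reconstructed forms written above):

```python
import math

# Hypothetical count vectors for two documents over the same feature set
x_i = [3, 0, 1, 0, 2]
x_j = [1, 1, 0, 0, 4]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq(u):
    return sum(a * a for a in u)

def correlation(u, v):
    return dot(u, v) / (sq(u) + sq(v))

def jaccard(u, v):
    return dot(u, v) / (sq(u) + sq(v) - dot(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(sq(u) * sq(v))

def overlap(u, v):
    return dot(u, v) / min(sq(u), sq(v))

print(correlation(x_i, x_j))  # ~0.344
print(jaccard(x_i, x_j))      # ~0.524
print(cosine(x_i, x_j))       # ~0.693
print(overlap(x_i, x_j))      # ~0.786
```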

 


Vector Space Model – VSM and Latent Semantic Analysis (Indexing)
Vector Space Model - VSM
The Vector Space Model (VSM) 4 is a statistical representation of the documents in a collection (corpus) and of the corpus itself. It defines documents as vectors of identifiers in an n-dimensional space, where each dimension corresponds to a distinct term and its “weight”. A term weight is a numeric value describing the significance of the term in the context of the document or of the whole document collection.

TF-IDF is a term-weighting model, commonly used together with the VSM and the Cosine similarity model.

The schema has three components (a minimal sketch follows the list):

  • TF – a local weighting function – significance of the feature within a document
  • IDF – a global weighting function – significance of the feature across the entire collection (of documents, corpus)
  • Cosine model – a document similarity computation method
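A minimal, framework-independent sketch of how the three components combine (plain Python; the toy documents are hypothetical and the IDF formula is one common variant – exact weighting details differ between TF-IDF implementations):

```python
import math
from collections import Counter

# Hypothetical toy corpus (three short documents)
docs = [
    "the cat sat on the mat",
    "the cat ate the mouse",
    "the dog barked at the mailman",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
N = len(docs)

# IDF – global weighting: log(N / document frequency)
df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
idf = {w: math.log(N / df[w]) for w in vocab}

# TF – local weighting: raw term frequency within the document
def tfidf_vector(doc):
    tf = Counter(doc)
    return [tf[w] * idf[w] for w in vocab]

# Cosine model – similarity between two weighted vectors
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = [tfidf_vector(doc) for doc in tokenized]
print(cosine(vectors[0], vectors[1]))  # > 0: the documents share the rarer term "cat"
print(cosine(vectors[0], vectors[2]))  # 0.0: they share only terms with idf = 0 ("the")
```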
imbNLP Implementation
TF-IDF is implemented, in several variations, through the imbNLP module. Check the pages listed below and the API documentation for more information.

TF-IDF, SSRM and cSSRM document similarity

The page covers theoretical and practical aspects of document content similarity computation using the imbNLP Framework. Let us first do a theoretical overview of the issue of document topical, or semantic, similarity. The Vector Space Model (VSM) [1] defines documents and queries as vectors of identifiers in n-dimensional space, where each algebraic dimension corresponds to a distinct term,…


Weighted Terms and Semantic Clouds

The Part Of Speech library of the imbNLP module contains several utility and data model classes supporting operations with weighted terms (or lemmas) and interconnected terms (Semantic Cloud). Lemma Term and Lemma Table: developed from the concept of TF-IDF, the Web Lemma Term and Web Lemma Table classes provide support for document semantic similarity computation. These classes, together…


Semantic Similarity Retrieval Model (SSRM)

One of the general issues with the TF-IDF schema comes from the term commonality requirement, i.e. the need for a non-empty set of mutual terms between the document and the query in order to compute any similarity ratio other than zero. The issue is particularly relevant in the context of languages with complex morphology, where the stemming process, supported by suitable morphosyntactic resources, becomes a critical factor for successful implementation of the schema. Stemming identifies the lemma form of each word extracted from the text and contracts the terms into a set of lemma vectors, where the frequencies of inflected forms are summed and attributed to the common lemma form. However, it is still a valid argument that two lemma sets without any overlap may be semantically related. One of the most popular approaches addressing this issue is the Semantic Similarity Retrieval Model (SSRM) 1. It is a semantically enhanced alternative to the VSM, providing a better approximation of human topical perception. In this model, after the query terms are expanded with synonyms derived from a lexical ontology, their initial weights are calculated in typical TF-IDF fashion.

In the first step of SSRM, the weight qi of each query term i is adjusted according to its relationships with other semantically similar terms j within the same query vector:

q_i = q_i + \sum^{j \neq i}_{sim(i,j) \geq t} q_j \cdot sim(i,j)

where t is a predefined threshold. Multiple related terms in the same query reinforce each other, while the weights of non-similar terms remain unchanged. For short queries specifying only a few terms, the weights are initialized to 1 and adjusted according to the formula above. In the second step, each query term is expanded up to three levels of the ontological hierarchy, in both the hyponym and the hypernym direction. The weight of each newly added term is computed as:

q'_i = \sum^{j \neq i}_{sim(i,j) \geq t} {1 \over n} q_j \cdot sim(i,j)

where n is the number of hyponyms of each expanded term j. If a term retrieved through expansion is already in the query set, the result is added to its existing weight.
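A minimal sketch of the first, re-weighting step (plain Python; the query weights, the pairwise similarity table and the threshold are hypothetical placeholders – in SSRM the term-to-term similarity would come from a lexical ontology, as discussed below):

```python
def adjust_query_weights(weights, term_sim, t=0.8):
    """Step 1 of SSRM: q_i <- q_i + sum over j != i with sim(i,j) >= t of q_j * sim(i,j)."""
    adjusted = {}
    for i, qi in weights.items():
        bonus = sum(qj * term_sim(i, j)
                    for j, qj in weights.items()
                    if j != i and term_sim(i, j) >= t)
        adjusted[i] = qi + bonus
    return adjusted

# Hypothetical short query: weights initialized to 1, as described above
query = {"car": 1.0, "automobile": 1.0, "engine": 1.0}
pair_sim = {("car", "automobile"): 0.9}  # hypothetical ontology-derived similarity

def term_sim(i, j):
    return pair_sim.get((i, j), pair_sim.get((j, i), 0.0))

print(adjust_query_weights(query, term_sim))
# -> {'car': 1.9, 'automobile': 1.9, 'engine': 1.0}
```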

The semantic similarity between two terms, sim(i,j), in a semantic ontology can be computed with several methods. These methods may be divided into three principal categories 2: structure-based, information content and feature-based measures. The structure-based measures, recommended by the author of SSRM, use a function of the number of edges between the two nodes. The similarity SimSSRM(q,d) between a query and a document is computed as:

Sim_{SSRM(q,d)} = { { \sum_i \sum_j q_i d_j sim_{(i,j)} } \over { \sum_i \sum_j q_i d_j } }

where qi is the weight of term i in the query and dj is the weight of term j in document d.
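A minimal sketch of this final similarity computation (plain Python; the term weights and the term-to-term similarity table are hypothetical – in SSRM they would come from the expanded, adjusted query and from an ontology-based measure):

```python
# Hypothetical term-to-term similarities (would normally come from a lexical ontology)
sim = {
    ("car", "automobile"): 0.9,
    ("engine", "motor"): 0.8,
}

def term_sim(i, j):
    if i == j:
        return 1.0
    return sim.get((i, j), sim.get((j, i), 0.0))

def ssrm_similarity(query_weights, doc_weights):
    """Sim_SSRM(q, d) = sum_i sum_j q_i * d_j * sim(i, j) / sum_i sum_j q_i * d_j"""
    num = sum(qi * dj * term_sim(i, j)
              for i, qi in query_weights.items()
              for j, dj in doc_weights.items())
    den = sum(qi * dj
              for qi in query_weights.values()
              for dj in doc_weights.values())
    return num / den if den else 0.0

# Hypothetical (already expanded and adjusted) query and document term weights;
# note: no literal term overlap, yet the similarity is non-zero
q = {"car": 1.0, "engine": 0.6}
d = {"automobile": 0.8, "motor": 0.5}
print(ssrm_similarity(q, d))  # ~0.46
```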


Latent Semantic Analysis (LSA)

Latent semantic analysis (LSA) 3 is another technique for vector representation of a document. It compresses the high-dimensional VSM representation using the singular value decomposition (SVD) matrix operation. LSA has four components:

  • a local weighting function – significance of the feature within a document
  • a global weighting function – significance of the feature across the entire collection (of documents, corpus)
  • number of retained (compressed) dimensions (i.e. features) during singular value decomposition (SVD)
  • a document similarity computation method
Singular value decomposition
SVD 4 is a matrix factorization operation that reduces the number of dimensions to a given dimensionality (e.g. 10, 50, 150 or 300 dimensions) by keeping the dimensions with the highest singular values and removing low-value (low significance, low weight) dimensions (noise reduction).

Steps in the LSA procedure:

  • The initial matrix is built from the weighted corpus, combining the local and global weighting functions.
  • The SVD is then applied to the initial matrix, producing an n x d reduced (or compressed) version.
  • In the final step, a similarity model is applied – usually the Cosine model (a short sketch of the whole procedure follows the formula below):

S^{cos}_{ij} = {{\sum_k {n_{ik} \cdot n_{jk} }} \over \sqrt {\sum_k {n^2_{ik}} \cdot \sum_k {n^2_{jk}}}}
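A minimal sketch of the LSA pipeline using numpy's SVD (the small weighted term-document matrix is hypothetical and the local/global weighting step is assumed to have been applied already):

```python
import numpy as np

# Hypothetical weighted term-document matrix (u terms x v documents),
# i.e. the output of the local/global weighting step
X = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.8, 1.0],
])

k = 2  # number of retained (compressed) dimensions

# Singular value decomposition and rank-k truncation
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T  # each row: a document in k-dim space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(doc_vectors[0], doc_vectors[1]))  # documents 0 and 1 share terms -> high
print(cosine(doc_vectors[0], doc_vectors[2]))  # documents 0 and 2 do not -> low
```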

 

Without SVD, LSA (or LSI) breaks down to a weighted vector space model.


Latent Dirichlet Allocation (LDA)

LDA is a generative probabilistic topic model. It represents documents as random mixtures of topics over the latent topic space, where each topic is characterized by a distribution over a dictionary of words. LDA and its extensions are ineffective when used with short documents (texts). The issues come from ineffective word relation induction and difficulties with distinguishing ambiguous…

  1. A. Hliaoutakis, G. Varelas, E. Voutsakis, E.G.M. Petrakis, E. Milios, Information Retrieval by Semantic Similarity, Int. J. Semant. Web Inf. Syst. 2 (2006) 55–73. doi:10.4018/jswis.2006070104.
  2. T. Slimani, Description and Evaluation of Semantic Similarity Measures Approaches, Int. J. Comput. Appl. 80 (2013) 25–33. doi:10.5120/13897-1851.
  3. P.W. Foltz, Latent semantic analysis for text-based research, Behav. Res. Methods, Instruments, Comput. 28 (1996) 197–202. doi:10.3758/BF03204765.
  4. Singular Value Decomposition (SVD), (n.d.) 1–36.