## General remarks

In linguistics, a corpus (plural *corpora*) is a large and structured set of texts, nowadays usually electronically stored and processed, used for statistical analysis and hypothesis testing, for checking occurrences, or for validating linguistic rules within a specific language territory. A corpus (collection of documents) is represented as a *u* x *v* matrix (2D collection), where:

- *x*_{ik} – frequency (number of occurrences) of term (word, n-gram, feature) *i* in document *k*
- *u* – number of terms (words, n-grams, features)
- *v* – number of documents
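The term-document matrix described above can be sketched as follows. This is a minimal illustration with a toy corpus, not the framework's own representation; all names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: v = 3 documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Vocabulary of u distinct terms, in a fixed order.
terms = sorted({w for d in docs for w in d.split()})

# u x v matrix: x[i][k] = frequency of term i in document k.
x = [[Counter(d.split())[t] for d in docs] for t in terms]

u, v = len(x), len(x[0])
print(u, v)  # number of terms, number of documents
```

In practice the rows would be lemmas or n-grams rather than raw tokens, but the matrix shape is the same.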

By **features**, we refer to: words, lemmas, terms, n-grams…

### Binary (Boolean) Similarity measures

Measures asserting only the presence or absence of features in the documents.

*t*_{ik} is a boolean value that describes whether feature *k* exists in document *i*:

- *t*_{ik} = 1 (true), if *x*_{ik} > 0
- *t*_{ik} = 0 (false), if *x*_{ik} = 0

Base metrics from which the measures are computed for similarity between two documents *i* and *j*:

- Count of common features: *a*_{ij} = Σ_{k} *t*_{ik} · *t*_{jk}
- Counts (*b*_{ij}, *c*_{ij}) of distinctive features of documents *i* and *j* (features present in one document but not in the other, respectively): *b*_{ij} = Σ_{k} *t*_{ik} · (1 − *t*_{jk}), *c*_{ij} = Σ_{k} (1 − *t*_{ik}) · *t*_{jk}
- Count of features contained within neither document: *d*_{ij} = Σ_{k} (1 − *t*_{ik}) · (1 − *t*_{jk})
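The base metrics can be sketched as below, assuming two boolean feature vectors of equal length. The helper name is hypothetical, for illustration only.

```python
# Boolean presence vectors ti, tj for two documents over the same feature set.
def base_counts(ti, tj):
    a = sum(1 for p, q in zip(ti, tj) if p and q)          # common features
    b = sum(1 for p, q in zip(ti, tj) if p and not q)      # only in document i
    c = sum(1 for p, q in zip(ti, tj) if not p and q)      # only in document j
    d = sum(1 for p, q in zip(ti, tj) if not p and not q)  # in neither document
    return a, b, c, d

ti = [1, 1, 0, 1, 0]
tj = [1, 0, 1, 1, 0]
print(base_counts(ti, tj))  # (2, 1, 1, 1)
```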

#### Common Features Model

Special case of the Contrast Model.

#### Ratio Model

#### Distinctive Features

Special case of the Contrast Model.

#### Correlation model

#### Jaccard model

#### Cosine model

Cosine similarity is also called angular separation similarity.

#### Overlap model
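Using the base counts *a*, *b*, *c* defined above, the Jaccard, Cosine (angular separation), and Overlap measures can be sketched as follows. This is a minimal illustration under their standard definitions, not the framework's implementation.

```python
import math

def counts(ti, tj):
    # a: common features, b: only in document i, c: only in document j
    a = sum(1 for p, q in zip(ti, tj) if p and q)
    b = sum(1 for p, q in zip(ti, tj) if p and not q)
    c = sum(1 for p, q in zip(ti, tj) if not p and q)
    return a, b, c

def jaccard(ti, tj):
    # common features over all features present in either document
    a, b, c = counts(ti, tj)
    return a / (a + b + c) if a + b + c else 0.0

def cosine(ti, tj):
    # angular separation form: a / sqrt((a + b) * (a + c))
    a, b, c = counts(ti, tj)
    denom = math.sqrt((a + b) * (a + c))
    return a / denom if denom else 0.0

def overlap(ti, tj):
    # common features over the smaller feature set
    a, b, c = counts(ti, tj)
    m = min(a + b, a + c)
    return a / m if m else 0.0

ti = [1, 1, 0, 1, 0]
tj = [1, 0, 1, 1, 0]
print(jaccard(ti, tj))  # 0.5
```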

### Vector Space Model

The Vector Space Model (VSM) is a statistical representation of the documents in a collection and of the corpus itself. It defines documents as vectors of identifiers in *n*-dimensional space, where each dimension corresponds to a distinct term and its *“weight”*. A term weight is a numeric value describing the significance of the term in the context of the document or of the document collection.

The TF-IDF is a term-weighting model, commonly used with the VSM and the Cosine similarity model. It has three components:

- TF – a local weighting function – significance of the feature within a document
- IDF – a global weighting function – significance of the feature across the entire collection (corpus) of documents
- Cosine model – a document similarity computation method
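The three components can be sketched together as below. This is a minimal TF-IDF with cosine similarity, assuming raw term frequency for TF and log(N/df) for IDF; it is not the framework's implementation, and all names are hypothetical.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for d in tokenized for w in d})
N = len(docs)

# IDF (global weight): log(N / document frequency of the term)
df = {t: sum(t in d for d in tokenized) for t in vocab}
idf = {t: math.log(N / df[t]) for t in vocab}

def tfidf(doc):
    # TF (local weight) * IDF, over a fixed vocabulary order
    tf = Counter(doc)
    return [tf[t] * idf[t] for t in vocab]

def cos(u, v):
    # Cosine model: dot product over the product of vector norms
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

v0, v1 = tfidf(tokenized[0]), tfidf(tokenized[1])
print(cos(v0, v1))
```

Note that a term occurring in every document (here, "the") gets IDF 0 and contributes nothing to the similarity.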

## TF-IDF, SSRM and cSSRM document similarity

The page covers theoretical and practical aspects of document content similarity computation using the imbNLP Framework. Let us first give a theoretical overview of document topical, or semantic, similarity. The Vector Space Model (VSM) [1] defines documents and queries as vectors of identifiers in n-dimensional space, where each algebraic dimension corresponds to a distinct term, …

## Weighted Terms and Semantic Clouds

The Part Of Speech library of the imbNLP module contains several utility and data model classes supporting operations with weighted terms (or lemmas) and interconnected terms (Semantic Cloud). Developed from the concept of TF-IDF, the Web Lemma Term and Web Lemma Table classes provide support for document semantic similarity computation. These classes, together …

One of the general issues with the TF-IDF schema comes from its term-commonality requirement, i.e. the need for a non-empty set of mutual terms between the document and the query in order to compute a similarity ratio other than zero. The issue is particularly relevant in the context of languages with complex morphology, where the stemming process, supported by suitable morphosyntactic resources, becomes a critical factor for successful implementation of the schema. Stemming identifies the lemma form of each word extracted from the text and contracts the terms into a set of lemma vectors, where the frequencies of inflected forms are summed and attributed to the common lemma form. However, it is still a valid argument that two lemma sets without any overlap may be semantically related. One of the most popular approaches addressing this issue is the Semantic Similarity Retrieval Model (SSRM) [1], a semantically enhanced alternative to the VSM that provides a better approximation of human topical perception. In this model, after the query terms are expanded with synonyms derived from a lexical ontology, their initial weights are calculated in typical TF-IDF fashion.

In the first step of SSRM, the weight *q*_{i} of each query term *i* is adjusted according to its relationships with other semantically similar terms *j* within the same query vector:

$$ q_i' = q_i + \sum_{\substack{j \neq i \\ sim(i,j) \geq t}} q_j \cdot sim(i,j) $$

where *t* is a predefined threshold. Multiple related terms in the same query reinforce each other, while the weights of non-similar terms remain unchanged. For short queries, specifying only a few terms, the weights are initialized to 1 and adjusted according to the formula above. In the second step, each query term is expanded up to three levels of the ontological hierarchy, in both the hyponym and the hypernym direction. The weight of the newly added terms is computed as:

$$ q_i' = \sum_{sim(i,j) \geq t} \frac{1}{n} \, q_j \cdot sim(i,j) $$

where *n* is the number of hyponyms of each expanded term *j*. In case a term retrieved through expansion was already in the query set, the result is summed with the existing weight.
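The first-step re-weighting can be sketched as below. The term-to-term similarity function is assumed to be supplied externally (e.g. an edge-counting measure over a lexical ontology); the similarity values and helper names here are toy assumptions, not the SSRM reference implementation.

```python
# Step 1 of SSRM: reinforce weights of semantically similar query terms.
def reweight(q, sim, t=0.8):
    # q: {term: initial weight}; pairs below threshold t leave weights unchanged
    return {
        i: qi + sum(q[j] * sim(i, j)
                    for j in q if j != i and sim(i, j) >= t)
        for i, qi in q.items()
    }

# Toy symmetric similarity table (assumed values, for illustration only).
toy_sim = {("car", "automobile"): 0.9}
def sim(i, j):
    if i == j:
        return 1.0
    return toy_sim.get((i, j), toy_sim.get((j, i), 0.0))

q = {"car": 1.0, "automobile": 1.0, "road": 1.0}
print(reweight(q, sim))  # car/automobile reinforce each other; road unchanged
```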

Semantic similarity between two terms, *sim(i,j)*, in a semantic ontology can be computed with several methods, which may be divided into three principal categories [2]: *structure-based*, *information content*, and *feature-based* measures. The structure-based measures, recommended by the author of SSRM, use a function of the number of edges between the two nodes. The similarity *Sim*_{SSRM}(*q*, *d*) between a query and a document is computed as:

$$ Sim_{SSRM}(q,d) = \frac{\sum_i \sum_j q_i \, d_j \, sim(i,j)}{\sum_i \sum_j q_i \, d_j} $$

where *q*_{i} is the weight of term *i* in the query and *d*_{j} is the weight of term *j* in document *d*.
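The double sum above can be sketched directly. As before, the term-to-term similarity function and its values are toy assumptions; the point is that the result is non-zero even when the query and document share no terms.

```python
def ssrm_similarity(q, d, sim):
    # q, d: {term: weight}; sim(i, j): term-to-term semantic similarity in [0, 1]
    num = sum(qi * dj * sim(i, j) for i, qi in q.items() for j, dj in d.items())
    den = sum(qi * dj for i, qi in q.items() for j, dj in d.items())
    return num / den if den else 0.0

# Toy ontology-based similarity (assumed values, for illustration only).
pairs = {("car", "automobile"): 0.9, ("car", "vehicle"): 0.7}
def sim(i, j):
    if i == j:
        return 1.0
    return pairs.get((i, j), pairs.get((j, i), 0.0))

q = {"car": 1.0}
d = {"automobile": 2.0, "engine": 1.0}
print(ssrm_similarity(q, d, sim))  # non-zero despite no shared terms
```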

Latent semantic analysis (LSA) [3] is another technique for vector representation of a document. It compresses the high-dimensional VSM using the singular value decomposition (SVD) matrix operation. The LSA has four components:

- a local weighting function – significance of the feature within a document
- a global weighting function – significance of the feature across the entire collection (of documents, corpusIn linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory....)
- number of retained (compressed) dimensions (i.e. features) during singular value decomposition (SVD)
- a document similarity computation method

The SVD [4] is a matrix transformation operation that reduces the number of dimensions to a given dimensionality (e.g. 10, 50, 150, or 300 dimensions) by keeping the matrix dimensions with the highest values and removing low-value (low-significance, low-weight) features (noise reduction).

Steps in the LSA procedure:

- The initial matrix is built from the weighted corpus, combining local and global weighting functions.
- The SVD is then applied to the initial matrix, producing an *n* x *d* reduced (or compressed) version.
- In the final step, a similarity model is applied – usually the Cosine model.

Without SVD, the LSA (or LSI) breaks down to the weighted vector space model.
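The three steps above can be sketched with NumPy's SVD routine. The weighted matrix here is a toy stand-in for a TF-IDF-weighted corpus, and the retained dimensionality *k* is an arbitrary choice for illustration.

```python
import numpy as np

# Toy weighted term-document matrix (u terms x v documents); in practice this
# would come from combining local and global weighting, such as TF-IDF.
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

k = 2  # number of retained (compressed) dimensions
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Documents in the compressed k-dimensional latent space.
docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T  # shape (v, k)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Final step: compare two documents in the reduced space with the Cosine model.
print(cosine(docs_k[0], docs_k[1]))
```

Keeping only the top-*k* singular values discards the low-weight dimensions, which is exactly the noise-reduction effect described above.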

LDA is a generative probabilistic topic model. It represents documents as random mixtures of topics over the latent topic space, where each topic is characterized by a distribution over a dictionary of words. LDA and its extensions are ineffective when used with short documents (texts); the issues come from ineffective word-relation induction and difficulties with distinguishing ambiguous…

- [1] A. Hliaoutakis, G. Varelas, E. Voutsakis, E.G.M. Petrakis, E. Milios, Information Retrieval by Semantic Similarity, Int. J. Semant. Web Inf. Syst. 2 (2006) 55–73. doi:10.4018/jswis.2006070104.
- [2] T. Slimani, Description and Evaluation of Semantic Similarity Measures Approaches, Int. J. Comput. Appl. 80 (2013) 25–33. doi:10.5120/13897-1851.
- [3] P.W. Foltz, Latent semantic analysis for text-based research, Behav. Res. Methods, Instruments, Comput. 28 (1996) 197–202. doi:10.3758/BF03204765.
- [4] Singular Value Decomposition (SVD), (n.d.) 1–36. doi:10.1109/TCOM.1976.1093309.