Word (String) similarity measures

The imbNLP.PartOfSpeech.analysis namespace provides several utility classes and extension methods for assessing the similarity of words (or any other strings). The measures implemented at this point (v 0.1.2.31) are based on decomposing strings into overlapping n-grams (bi-grams are the most commonly used). After the words are decomposed, a set-intersection metric is computed using one of the following methods:

Jaccard Index (API Documentation)
Dice Coefficient (Brew & McKelvie, 1996) (API Documentation)
Continual Overlap Ratio – COR (API Documentation)
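For reference, the textbook set-based forms of the first two measures can be sketched as follows. Here $A$ and $B$ denote the bi-gram sets of the two words; this notation is an assumption introduced for illustration, and the library's handling of repeated bi-grams within a word may differ from the plain set definitions.

```latex
J(A,B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
D(A,B) = \frac{2\,|A \cap B|}{|A| + |B|}
```

For example, "sledeće" and "sledeći" each decompose into 6 bi-grams, 5 of which are shared, so $J = 5/7 \approx 0.714$ and $D = 10/12 \approx 0.833$. These agree with the values listed for that pair in the reports further down.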

Brew, C., & McKelvie, D. (1996). Word-pair extraction for lexicography. In Proceedings of the 2nd international conference on new methods in language processing (pp. 45–55).

I came up with the COR (although I am sure it is already known in the NLP literature, under some other name or in a very similar form) in an effort to get more semantically sensible results in the context of the Cloud Weaver component. Applied to Semantic Clouds constructed from the complete research sample (BEC), the last of the three measures produced the best results: the COR was found to yield the most meaningful pairs. Detailed test results are available in the “Particular aspects of the system” folder at the BEC Mendeley Data repository.

The COR introduces the criterion of bi-gram ordinal continuity, i.e. only common bi-grams that follow the same order of occurrence are counted. The algorithm takes the first bi-gram of word A and searches for the first matching bi-gram in word B. Once a match is found, it counts how long (in terms of bi-gram count) the common bi-gram sequence is. On the first mismatch, the counting loop breaks. The ratio is computed by dividing the length of the counted common sequence by the number of bi-grams in the longer word. For each pair, the procedure is performed in both directions, A→B and B→A, and the greater of the two values is returned as the Continual Overlap Ratio (COR).
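The procedure described above can be sketched in a few lines of C#. This is a minimal illustration written from the prose description only, not the imbNLP implementation; the class and method names (CorSketch, Overlapping, CommonRun, Cor) are hypothetical, and the library may well behave differently for some word pairs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the COR description; not the imbNLP source.
public static class CorSketch
{
    // Overlapping n-grams of a word, in order of occurrence.
    public static List<string> Overlapping(string word, int n = 2)
    {
        var grams = new List<string>();
        for (int i = 0; i + n <= word.Length; i++)
            grams.Add(word.Substring(i, n));
        return grams;
    }

    // Takes the first n-gram of a, finds its first match in b, then counts
    // how long the two sequences stay identical. Stops at the first mismatch.
    private static int CommonRun(List<string> a, List<string> b)
    {
        if (a.Count == 0) return 0;
        int start = b.IndexOf(a[0]);
        if (start < 0) return 0;

        int run = 0;
        while (run < a.Count && start + run < b.Count && a[run] == b[start + run])
            run++;
        return run;
    }

    // Greater of the two directional runs, divided by the n-gram count
    // of the longer word.
    public static double Cor(string wordA, string wordB, int n = 2)
    {
        List<string> a = Overlapping(wordA, n);
        List<string> b = Overlapping(wordB, n);
        int longer = Math.Max(a.Count, b.Count);
        if (longer == 0) return 0.0;

        int best = Math.Max(CommonRun(a, b), CommonRun(b, a));
        return (double)best / longer;
    }
}
```

Under these assumptions, CorSketch.Cor("potreba", "potreban") evaluates to 6/7, since "potreba" contributes a run of 6 bi-grams and "potreban" has 7 bi-grams in total.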

#	Word A	Translation	Word B	Translation	Comment	COR
1	unutrašnje	inner	unutrašnji	inner	Different gender	0.889
2	privredan	economic	privreda	economy	Adjective and noun	0.875
3	poslednje	the last	poslednji	the last	Different gender	0.875
4	kontrola	control	kontrolan	control	Noun and adjective	0.875
5	potreba	need	potreban	needed	Noun and adjective	0.857

Example: Top 5 results of word similarity computed with COR for a Semantic Cloud, in the context of the Cloud Weaver component of the BEC (Business Entity Classification) system, part of imbWBI.
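The values in the table can be checked by hand. In the formula below, $c_{A \to B}$ is the length of the common bi-gram run counted from A to B and $|G_A|$ is the number of bi-grams in word A; these symbols are notation assumed here for illustration, not names used by the library.

```latex
\mathrm{COR}(A,B) = \frac{\max\left(c_{A \to B},\; c_{B \to A}\right)}{\max\left(|G_A|,\; |G_B|\right)}
```

For "privredan" (8 bi-grams) and "privreda" (7 bi-grams) the common run is 7 bi-grams long, giving $7/8 = 0.875$. For "unutrašnje" and "unutrašnji" (9 bi-grams each) the run is 8 bi-grams long, giving $8/9 \approx 0.889$.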

The initial test of the CW component demonstrated the high effectiveness of the algorithm in linking terms that were not reduced to a common lemma by the Lemma Table construction process. The ability to establish semantic relationships between nodes that have no existing links defined in the cloud seems to have great potential to improve classification, by allowing greater semantic term expansion.

The n-grams may also be created in a non-overlapping (ordinal) manner:
[rashladni] (overlap, N=2) => , ra, as, sh, hl, la, ad, dn, ni
[rashladni] (ordinal, N=2) => , ra, sh, la, dn, i
[konstrukcija] (overlap, N=2) => , ko, on, ns, st, tr, ru, uk, kc, ci, ij, ja
[konstrukcija] (ordinal, N=2) => , ko, ns, tr, uk, ci, ja

The mode used depends on the nGramsModeEnum value specified when the getNGrams method is called.
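The difference between the two modes shown in the listing can be illustrated with a short sketch. The enum and method below (SplitMode, NGramSketch.Split) are hypothetical stand-ins written for this example; they are not the library's nGramsModeEnum or getNGrams, and the leading empty entry visible in the listing is not reproduced.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the two split modes; not the imbNLP types.
public enum SplitMode { Overlap, Ordinal }

public static class NGramSketch
{
    public static List<string> Split(string word, int n, SplitMode mode)
    {
        var grams = new List<string>();
        // Overlap mode advances one character at a time, ordinal mode a whole n-gram.
        int step = mode == SplitMode.Overlap ? 1 : n;

        for (int i = 0; i < word.Length; i += step)
        {
            int length = Math.Min(n, word.Length - i);
            // Overlap mode keeps only full-length n-grams; ordinal mode keeps
            // the shorter remainder, as in "rashladni" -> ..., dn, i.
            if (mode == SplitMode.Overlap && length < n) break;
            grams.Add(word.Substring(i, length));
        }
        return grams;
    }
}
```

With these assumptions, splitting "rashladni" with N=2 gives ra, as, sh, hl, la, ad, dn, ni in overlap mode and ra, sh, la, dn, i in ordinal mode, the same sequences as in the listing above.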

Besides calling the extension methods directly, the measures are available through the utility class wordSimilarityComponent. Using an instance of the class, we are able to serialize the configuration of the similarity computation. Take a look at the code example from UnitTextWordAnalysis.cs in the imbNLP.TestUnit project:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using imbNLP.PartOfSpeech.analysis;
using imbSCI.Core.files.folders;

namespace imbNLP.TestUnit
{
    [TestClass]
    public class UnitTestWordAnalysis
    {
        [TestMethod]
        public void TestNGramsAndSimilarity()
        {
            folderNode folder = new folderNode();
            folder = folder.Add("NLP\\WordAnalysis", "Word analysis", "Folder with results of word analysis tests");

            String[] words = new String[] { "ormar", "orman", "rashladni", "konstrukcija", "elektroinstalacija", "elektromotor", "motorno", "građevina", "građevinski", "metalni", "metalno", "metal", "aluminijum", "aluminijumski", "zgrada", "kotao", "kotlovski", "kotlarnica", "peć", "dimnjak", "cevovodi", "vod", "linija", "stanica",
            "elektrana", "elektrogradnja", "izgradnja", "gradjevinsko", "grejanje", "grejno", "gorivo", "goriva", "pelet", "panel", "polica", "stolica", "bakarni", "bronzani",
            "centrala", "obezbeđenje", "klimatizacija", "klimatizacioni", "ventilacija", "ventilacioni", "gorionik", "vatra", "voda", "cev", "proizvod", "proizvodni", "laser", "proizvodnja", "lasersko", "sečenje", "plazma", "merdevine", "čunak", "štednjak", "radijator", "elektro", "induktivno", "transformator", "transformatorska", "dalekovod", "elektrovod", "mašina", "šinski", "voz", "nadzemno", "visokogradnja", "podzemno", "transport", "prevoz", "izolacija","plastika", "guma", "štender",
            "vitrina", "zamrzivač", "protivpožarna", "zaštita", "prodajna", "kontaktirajte", "kontakt", "kontakti", "telefon", "svetlo", "rasveta", "javna", "kompanija", "firma", "preduzeće", "društvo", "izvoz", "sto", "radni", "snaga", "napon", "krovni", "krov", "konstrukcioni", "konstruisanje", "tehničko", "tehnika", "zaposleni", "radnici", "reference", "kupci", "prodajni", "prodaja", "razvojni", "razvoj", "industrijski", "snabdevanje", "kućni", "nameštaj", "kancelarijski", "prostor", "podno", "pekara", "hleb", "pica", "peći", "pećnica", "žardinjera", "ograda", "čelična", "čelik", "galanterija", "stepenice", "nadvožnjak", "pešački", "saobraćajni", "znak", "tabla", "bilbord", "reklamni", "redni", "fluid", "hlađenje", "zagrevanje", "sagorevanje", "čvrsto", "pirolitički", "parni", "dim", "pepeo", "dopremanje", "čišćenje", "održavanje", "inoks", "inoksni", "inoksa", "razmenjivač", "toplote"};

            // Write the overlapping and the ordinal bi-gram decomposition of every word
            StringBuilder sb = new StringBuilder();

            foreach (String word in words)
            {
                sb.AppendLine(wordAnalysisTools.getNGramsDescriptiveLine(word, 2, nGramsModeEnum.overlap));
                sb.AppendLine(wordAnalysisTools.getNGramsDescriptiveLine(word, 2, nGramsModeEnum.ordinal));
            }

            String sbp = folder.pathFor("ngrams.txt", imbSCI.Data.enums.getWritableFileMode.autoRenameThis, "ngrams");
            File.WriteAllText(sbp, sb.ToString());

            // Configure the component: overlapping bi-grams, similarity threshold 0.6
            wordSimilarityComponent component = new wordSimilarityComponent();
            component.N = 2;
            component.gramConstruction = nGramsModeEnum.overlap;
            component.treshold = 0.6;

            // Run 1: Dice Coefficient
            component.equation = nGramsSimilarityEquationEnum.DiceCoefficient;

            var result01 = component.GetResult(words);

            String p = folder.pathFor("result01.txt", imbSCI.Data.enums.getWritableFileMode.autoRenameThis, "TestNGrams", false);
            File.WriteAllText(p, result01.ToString());

            // Run 2: Jaccard Index, same settings
            component.equation = nGramsSimilarityEquationEnum.JaccardIndex;

            var result02 = component.GetResult(words);

            p = folder.pathFor("result02.txt", imbSCI.Data.enums.getWritableFileMode.autoRenameThis, "TestNGrams", false);
            File.WriteAllText(p, result02.ToString());

            // Run 3: Continual Overlap Ratio (COR), same settings
            component.equation = nGramsSimilarityEquationEnum.continualOverlapRatio;

            var result03 = component.GetResult(words);

            p = folder.pathFor("result03.txt", imbSCI.Data.enums.getWritableFileMode.autoRenameThis, "TestNGrams", false);
            File.WriteAllText(p, result03.ToString());
        }
    }
}

Check the results of this unit test (four runs, each with some additional words included):

Below are raw copies of the text file reports for the Semantic Cloud of one category (“constructions”) from the BEC (Business Entity Classification) research:

Dice Coefficient

# Semantic Cloud Weaver report for [constructions]
 > [205] cloud nodes
 > [0.62927] initial link-per-node ratio
 > [0.84878] link-per-node ratio after WordSimilarity used
 -------------------------------------------------------- 
# N-Gram size: 2
# Split mode: overlap
# Equation: DiceCoefficient
# Treshold: 0.600
----------------------------

00001 : sledeće : sledeći 0.83333
00002 : priprema : oprema 0.83333
00003 : proces : procesni 0.83333
00004 : postepen : stepen 0.83333
00005 : proizvodnja : proizvod 0.82353
00006 : ograda : zgrada 0.80000
00007 : industrijski : industrija 0.80000
00008 : zavarivanje : zavarivač 0.77778
00009 : montaža : montažni 0.76923
00010 : zaštita : zaštitni 0.76923
00011 : stopa : topao 0.75000
00012 : namena : cena 0.75000
00013 : prav : pravni 0.75000
00014 : forma : norma 0.75000
00015 : namena : primena 0.72727
00016 : nagrada : zgrada 0.72727
00017 : nagrada : ograda 0.72727
00018 : putni : putnički 0.72727
00019 : paletni : paleta 0.72727
00020 : generacija : operacija 0.70588
00021 : kontinualan : kontinuiran 0.70000
00022 : promena : primena 0.66667
00023 : stran : trajan 0.66667
00024 : platforma : forma 0.66667
00025 : radni : rashladni 0.66667
00026 : nacionalan : profesionalan 0.66667
00027 : priprema : privremen 0.66667
00028 : namena : elemenat 0.66667
00029 : baza : faza 0.66667
00030 : privremen : savremen 0.66667
00031 : rad : radni 0.66667
00032 : sajam : sam 0.66667
00033 : klasičan : plastičan 0.66667
00034 : skladište : sklad 0.66667
00035 : antikorozivan : dekorativan 0.63636
00036 : pokrivanje : zavarivanje 0.63158
00037 : metalurgija : metalurški 0.63158
00038 : bravarski : bravarija 0.62500
00039 : privredan : privremen 0.62500
00040 : pokrivanje : pokrivni 0.62500
00041 : automatski : autorski 0.62500
00042 : objekat : projekat 0.61538
00043 : konačan : značajan 0.61538
00044 : detalj : paleta 0.60000
00045 : grana : nagrada 0.60000

Jaccard Index

# Semantic Cloud Weaver report for [constructions]
 > [205] cloud nodes
 > [0.62927] initial link-per-node ratio
 > [0.69268] link-per-node ratio after WordSimilarity used
 -------------------------------------------------------- 
# N-Gram size: 2
# Split mode: overlap
# Equation: JaccardIndex
# Treshold: 0.600
----------------------------

00001 : proces : procesni 0.71429
00002 : postepen : stepen 0.71429
00003 : sledeće : sledeći 0.71429
00004 : proizvodnja : proizvod 0.70000
00005 : industrijski : industrija 0.66667
00006 : ograda : zgrada 0.66667
00007 : zavarivanje : zavarivač 0.63636
00008 : zaštita : zaštitni 0.62500
00009 : montaža : montažni 0.62500
00010 : priprema : oprema 0.62500
00011 : prav : pravni 0.60000
00012 : stopa : topao 0.60000
00013 : forma : norma 0.60000

COR

# Semantic Cloud Weaver report for [constructions]
 > [205] cloud nodes
 > [0.62927] initial link-per-node ratio
 > [0.69268] link-per-node ratio after WordSimilarity used
 -------------------------------------------------------- 
# N-Gram size: 2
# Split mode: overlap
# Equation: continualOverlapRatio
# Treshold: 0.600
----------------------------

00001 : sledeće : sledeći 0.83333
00002 : industrijski : industrija 0.72727
00003 : zaštita : zaštitni 0.71429
00004 : proces : procesni 0.71429
00005 : montaža : montažni 0.71429
00006 : proizvodnja : proizvod 0.70000
00007 : zavarivanje : zavarivač 0.70000
00008 : paletni : paleta 0.66667
00009 : bravarski : bravarija 0.62500
00010 : privredan : privremen 0.62500
00011 : prav : pravni 0.60000
00012 : metalurgija : metalurški 0.60000
00013 : kontinualan : kontinuiran 0.60000
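
The link-per-node ratios in the three report headers are consistent with the number of pairs each report lists, assuming every listed pair adds exactly one new link to the cloud (an assumption made here, not something the reports state). The initial ratio of 0.62927 over 205 nodes corresponds to about 129 links, so:

```latex
\frac{129 + 45}{205} \approx 0.84878 \quad \text{(Dice Coefficient, 45 pairs)},
\qquad
\frac{129 + 13}{205} \approx 0.69268 \quad \text{(Jaccard Index and COR, 13 pairs each)}
```

At the same threshold of 0.6, the Dice Coefficient therefore accepts more than three times as many pairs as the other two measures.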
