The imbNLP.PartOfSpeech.analysis namespace provides several utility classes and extension methods for assessing the similarity of words (or any other strings). The measures implemented at this point (v 0.1.2.31) are based on decomposing strings into overlapping n-grams (bi-grams are the most commonly used). After the words are decomposed, a set intersection metric is computed using one of the following methods:
- Jaccard Index (API Documentation)
- Dice Coefficient (Brew & McKelvie, 1996) (API Documentation)
- Continual Overlap Ratio – COR (API Documentation)
Brew, C., & McKelvie, D. (1996). Word-pair extraction for lexicography. In Proceedings of the 2nd international conference on new methods in language processing (pp. 45–55).
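The two set-based measures listed above can be sketched as follows. This is a minimal illustration that assumes distinct overlapping bi-grams and the textbook definitions of the Jaccard Index and the Dice Coefficient; the class and method names are hypothetical and are not part of the imbNLP API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only; NOT the imbNLP implementation.
public static class NGramSetSimilaritySketch
{
    // Hypothetical helper: the set of distinct overlapping n-grams of a word.
    public static HashSet<string> DistinctNGrams(string word, int n = 2)
    {
        var grams = new HashSet<string>();
        for (int i = 0; i + n <= word.Length; i++)
        {
            grams.Add(word.Substring(i, n));
        }
        return grams;
    }

    // Jaccard Index: |A ∩ B| / |A ∪ B|
    public static double Jaccard(string a, string b, int n = 2)
    {
        var gramsA = DistinctNGrams(a, n);
        var gramsB = DistinctNGrams(b, n);
        int common = gramsA.Intersect(gramsB).Count();
        int union = gramsA.Count + gramsB.Count - common;
        return union == 0 ? 0.0 : (double)common / union;
    }

    // Dice Coefficient: 2 * |A ∩ B| / (|A| + |B|)
    public static double Dice(string a, string b, int n = 2)
    {
        var gramsA = DistinctNGrams(a, n);
        var gramsB = DistinctNGrams(b, n);
        int common = gramsA.Intersect(gramsB).Count();
        int total = gramsA.Count + gramsB.Count;
        return total == 0 ? 0.0 : 2.0 * common / total;
    }
}
```

For the pair sledeće / sledeći, each word has six bi-grams and five are shared, so the sketch gives a Dice Coefficient of 10/12 and a Jaccard Index of 5/7. Words with repeated bi-grams (for example namena, where "na" occurs twice) may score differently in the library, which may count repeats rather than distinct bi-grams.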
I came up with the COR in an effort to get more semantically sensible results in the context of the Cloud Weaver component (although I’m sure it is already known in the NLP literature, under some other name or in a very similar form). For this particular purpose, with Semantic Clouds constructed from the complete research sample (BEC), it produced the best results of the three measures: the COR was found to produce the most meaningful pairs. Detailed test results are available in the “Particular aspects of the system” folder at the BEC Mendeley Data repository.
The COR introduces a criterion of bi-gram ordinal continuity, i.e. only common bi-grams that follow the same order of occurrence are counted. The algorithm takes the first bi-gram of word A and searches for the first matching bi-gram in word B. Once a match is found, it counts how long the common bi-gram sequence is (in terms of bi-gram count). On the first mismatch, the counting loop breaks. The ratio is computed by dividing the length of the counted common sequence by the number of bi-grams in the longer word. For each pair, the procedure is performed in both directions, A→B and B→A, and the greater of the two values is returned as the Continual Overlap Ratio (COR).
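The procedure described above can be sketched as follows. This is a minimal reading of the prose description, not the library's code: the class and method names are hypothetical, and the actual imbNLP implementation may differ in details, so its scores may not match this sketch for every pair.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; follows the prose description of COR,
// NOT the actual imbNLP implementation.
public static class ContinualOverlapSketch
{
    // Hypothetical helper: overlapping bi-grams in order of occurrence.
    public static List<string> Bigrams(string word)
    {
        var grams = new List<string>();
        for (int i = 0; i + 2 <= word.Length; i++)
        {
            grams.Add(word.Substring(i, 2));
        }
        return grams;
    }

    // Finds the first bi-gram of 'source' in 'target' and counts how many
    // consecutive bi-grams match from there; stops at the first mismatch.
    public static int CommonRunLength(List<string> source, List<string> target)
    {
        if (source.Count == 0) return 0;
        int start = target.IndexOf(source[0]);
        if (start < 0) return 0;

        int run = 0;
        while (run < source.Count
               && start + run < target.Count
               && source[run] == target[start + run])
        {
            run++;
        }
        return run;
    }

    // COR: the greater of the two directional run lengths, divided by
    // the number of bi-grams in the longer word.
    public static double Cor(string a, string b)
    {
        var gramsA = Bigrams(a);
        var gramsB = Bigrams(b);
        int longer = Math.Max(gramsA.Count, gramsB.Count);
        if (longer == 0) return 0.0;

        double aToB = (double)CommonRunLength(gramsA, gramsB) / longer;
        double bToA = (double)CommonRunLength(gramsB, gramsA) / longer;
        return Math.Max(aToB, bToA);
    }
}
```

For the pair unutrašnje / unutrašnji, each word has nine bi-grams and the first eight form a common run, so the sketch gives 8/9 ≈ 0.889. For ograda / zgrada, the first bi-gram of neither word occurs in the other, so the sketch gives 0 even though four bi-grams are shared.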
# | Word A | Translation | Word B | Translation | Comment | COR
1 | unutrašnje | inner | unutrašnji | inner | Different gender | 0.889
2 | privredan | economic | privreda | economy | Adjective and noun | 0.875
3 | poslednje | the last | poslednji | the last | Different gender | 0.875
4 | kontrola | control | kontrolan | control | Noun and adjective | 0.875
5 | potreba | need | potreban | needed | Noun and adjective | 0.857
Example: Top 5 results of word similarity computed with COR for a Semantic Cloud, in the context of the Cloud Weaver component of the BEC system. BEC (Business Entity Classification) is an implementation of the Industry Term Model (a business category / industry description model) for classifying Business Entities by processing web site content. Part of imbWBI....
The initial test of the CW component demonstrated the high effectiveness of the algorithm in linking terms that were not stemmed into lemmas by the Lemma Table construction process. The ability to establish a semantic relationship between nodes that have no existing links defined in the cloud seemed to have great potential to improve classification by allowing greater semantic term expansion.
The n-grams may also be created in a non-overlapping manner, depending on the nGramsModeEnum specified when the getNGrams method is called:
[rashladni] (overlap, N=2) => , ra, as, sh, hl, la, ad, dn, ni
[rashladni] (ordinal, N=2) => , ra, sh, la, dn, i
[konstrukcija] (overlap, N=2) => , ko, on, ns, st, tr, ru, uk, kc, ci, ij, ja
[konstrukcija] (ordinal, N=2) => , ko, ns, tr, uk, ci, ja
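The difference between the two split modes can be sketched as follows. This is a minimal illustration, assuming that the overlap mode slides a window one character at a time and that the ordinal mode cuts the word into consecutive chunks, with a shorter last chunk when the length is not a multiple of N. The enum and class names are hypothetical stand-ins, not the library's nGramsModeEnum or getNGrams, and the sketch does not reproduce the empty leading item seen in the listing above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the two split modes; not the imbNLP enum.
public enum SplitModeSketch { Overlap, Ordinal }

// Illustrative sketch only; NOT the imbNLP implementation.
public static class NGramSplitSketch
{
    public static List<string> Split(string word, int n, SplitModeSketch mode)
    {
        if (n <= 0) throw new ArgumentOutOfRangeException(nameof(n));

        var grams = new List<string>();
        if (mode == SplitModeSketch.Overlap)
        {
            // Sliding window: step of one character, full-length n-grams only.
            for (int i = 0; i + n <= word.Length; i++)
            {
                grams.Add(word.Substring(i, n));
            }
        }
        else
        {
            // Consecutive chunks: step of n characters, last chunk may be shorter.
            for (int i = 0; i < word.Length; i += n)
            {
                grams.Add(word.Substring(i, Math.Min(n, word.Length - i)));
            }
        }
        return grams;
    }
}
```

Calling `NGramSplitSketch.Split("rashladni", 2, SplitModeSketch.Ordinal)` yields ra, sh, la, dn, i, while the overlap mode yields ra, as, sh, hl, la, ad, dn, ni.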
Besides calling the extension methods directly, the measures are available through the utility class wordSimilarityComponent. Using an instance of the class, we are able to serialize the configuration of the similarity computation. Take a look at the code example from UnitTextWordAnalysis.cs in the imbNLP.TestUnit project:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using imbNLP.PartOfSpeech.analysis;
using imbSCI.Core.files.folders;

namespace imbNLP.TestUnit
{
    [TestClass]
    public class UnitTestWordAnalysis
    {
        [TestMethod]
        public void TestNGramsAndSimilarity()
        {
            // Output folder for the n-gram listing and the similarity reports
            folderNode folder = new folderNode();
            folder = folder.Add("NLP\\WordAnalysis", "Word analysis", "Folder with results of word analysis tests");

            String[] words = new String[] {
                "ormar", "orman", "rashladni", "konstrukcija", "elektroinstalacija", "elektromotor", "motorno",
                "građevina", "građevinski", "metalni", "metalno", "metal", "aluminijum", "aluminijumski",
                "zgrada", "kotao", "kotlovski", "kotlarnica", "peć", "dimnjak", "cevovodi", "vod", "linija",
                "stanica", "elektrana", "elektrogradnja", "izgradnja", "gradjevinsko", "grejanje", "grejno",
                "gorivo", "goriva", "pelet", "panel", "polica", "stolica", "bakarni", "bronzani", "centrala",
                "obezbeđenje", "klimatizacija", "klimatizacioni", "ventilacija", "ventilacioni", "gorionik",
                "vatra", "voda", "cev", "proizvod", "proizvodni", "laser", "proizvodnja", "lasersko", "sečenje",
                "plazma", "merdevine", "čunak", "štednjak", "radijator", "elektro", "induktivno",
                "transformator", "transformatorska", "dalekovod", "elektrovod", "mašina", "šinski", "voz",
                "nadzemno", "visokogradnja", "podzemno", "transport", "prevoz", "izolacija", "plastika", "guma",
                "štender", "vitrina", "zamrzivač", "protivpožarna", "zaštita", "prodajna", "kontaktirajte",
                "kontakt", "kontakti", "telefon", "svetlo", "rasveta", "javna", "kompanija", "firma",
                "preduzeće", "društvo", "izvoz", "sto", "radni", "snaga", "napon", "krovni", "krov",
                "konstrukcioni", "konstruisanje", "tehničko", "tehnika", "zaposleni", "radnici", "reference",
                "kupci", "prodajni", "prodaja", "razvojni", "razvoj", "industrijski", "snabdevanje", "kućni",
                "nameštaj", "kancelarijski", "prostor", "podno", "pekara", "hleb", "pica", "peći", "pećnica",
                "žardinjera", "ograda", "čelična", "čelik", "galanterija", "stepenice", "nadvožnjak", "pešački",
                "saobraćajni", "znak", "tabla", "bilbord", "reklamni", "redni", "fluid", "hlađenje",
                "zagrevanje", "sagorevanje", "čvrsto", "pirolitički", "parni", "dim", "pepeo", "dopremanje",
                "čišćenje", "održavanje", "inoks", "inoksni", "inoksa", "razmenjivač", "toplote"
            };

            // Write the bi-gram decomposition of every word, in both split modes
            StringBuilder sb = new StringBuilder();
            foreach (String word in words)
            {
                sb.AppendLine(wordAnalysisTools.getNGramsDescriptiveLine(word, 2, nGramsModeEnum.overlap));
                sb.AppendLine(wordAnalysisTools.getNGramsDescriptiveLine(word, 2, nGramsModeEnum.ordinal));
            }
            String sbp = folder.pathFor("ngrams.txt", imbSCI.Data.enums.getWritableFileMode.autoRenameThis, "ngrams");
            File.WriteAllText(sbp, sb.ToString());

            // Shared configuration: overlapping bi-grams, threshold 0.6
            wordSimilarityComponent component = new wordSimilarityComponent();
            component.N = 2;
            component.gramConstruction = nGramsModeEnum.overlap;
            component.treshold = 0.6;

            // Run 1: Dice Coefficient
            component.equation = nGramsSimilarityEquationEnum.DiceCoefficient;
            var result01 = component.GetResult(words);
            String p = folder.pathFor("result01.txt", imbSCI.Data.enums.getWritableFileMode.autoRenameThis, "TestNGrams", false);
            File.WriteAllText(p, result01.ToString());

            // Run 2: Jaccard Index
            component.equation = nGramsSimilarityEquationEnum.JaccardIndex;
            var result02 = component.GetResult(words);
            p = folder.pathFor("result02.txt", imbSCI.Data.enums.getWritableFileMode.autoRenameThis, "TestNGrams", false);
            File.WriteAllText(p, result02.ToString());

            // Run 3: Continual Overlap Ratio
            component.equation = nGramsSimilarityEquationEnum.continualOverlapRatio;
            var result03 = component.GetResult(words);
            p = folder.pathFor("result03.txt", imbSCI.Data.enums.getWritableFileMode.autoRenameThis, "TestNGrams", false);
            File.WriteAllText(p, result03.ToString());
        }
    }
}
Check the results of this unit test (four runs, each with some additional words included):
wordSimilarityTest - TestUnit results of imbNLP word similarity measures
Below are raw copies of the text file reports for the Semantic Cloud of one category (“constructions”) from the BEC research:
Dice Coefficient
# Semantic Cloud Weaver report for [constructions]
> [205] cloud nodes
> [0.62927] initial link-per-node ratio
> [0.84878] link-per-node ratio after WordSimilarity used
--------------------------------------------------------
# N-Gram size: 2
# Split mode: overlap
# Equation: DiceCoefficient
# Treshold: 0.600
----------------------------
00001 : sledeće : sledeći 0.83333
00002 : priprema : oprema 0.83333
00003 : proces : procesni 0.83333
00004 : postepen : stepen 0.83333
00005 : proizvodnja : proizvod 0.82353
00006 : ograda : zgrada 0.80000
00007 : industrijski : industrija 0.80000
00008 : zavarivanje : zavarivač 0.77778
00009 : montaža : montažni 0.76923
00010 : zaštita : zaštitni 0.76923
00011 : stopa : topao 0.75000
00012 : namena : cena 0.75000
00013 : prav : pravni 0.75000
00014 : forma : norma 0.75000
00015 : namena : primena 0.72727
00016 : nagrada : zgrada 0.72727
00017 : nagrada : ograda 0.72727
00018 : putni : putnički 0.72727
00019 : paletni : paleta 0.72727
00020 : generacija : operacija 0.70588
00021 : kontinualan : kontinuiran 0.70000
00022 : promena : primena 0.66667
00023 : stran : trajan 0.66667
00024 : platforma : forma 0.66667
00025 : radni : rashladni 0.66667
00026 : nacionalan : profesionalan 0.66667
00027 : priprema : privremen 0.66667
00028 : namena : elemenat 0.66667
00029 : baza : faza 0.66667
00030 : privremen : savremen 0.66667
00031 : rad : radni 0.66667
00032 : sajam : sam 0.66667
00033 : klasičan : plastičan 0.66667
00034 : skladište : sklad 0.66667
00035 : antikorozivan : dekorativan 0.63636
00036 : pokrivanje : zavarivanje 0.63158
00037 : metalurgija : metalurški 0.63158
00038 : bravarski : bravarija 0.62500
00039 : privredan : privremen 0.62500
00040 : pokrivanje : pokrivni 0.62500
00041 : automatski : autorski 0.62500
00042 : objekat : projekat 0.61538
00043 : konačan : značajan 0.61538
00044 : detalj : paleta 0.60000
00045 : grana : nagrada 0.60000
Jaccard Index
# Semantic Cloud Weaver report for [constructions]
> [205] cloud nodes
> [0.62927] initial link-per-node ratio
> [0.69268] link-per-node ratio after WordSimilarity used
--------------------------------------------------------
# N-Gram size: 2
# Split mode: overlap
# Equation: JaccardIndex
# Treshold: 0.600
----------------------------
00001 : proces : procesni 0.71429
00002 : postepen : stepen 0.71429
00003 : sledeće : sledeći 0.71429
00004 : proizvodnja : proizvod 0.70000
00005 : industrijski : industrija 0.66667
00006 : ograda : zgrada 0.66667
00007 : zavarivanje : zavarivač 0.63636
00008 : zaštita : zaštitni 0.62500
00009 : montaža : montažni 0.62500
00010 : priprema : oprema 0.62500
00011 : prav : pravni 0.60000
00012 : stopa : topao 0.60000
00013 : forma : norma 0.60000
COR
# Semantic Cloud Weaver report for [constructions]
> [205] cloud nodes
> [0.62927] initial link-per-node ratio
> [0.69268] link-per-node ratio after WordSimilarity used
--------------------------------------------------------
# N-Gram size: 2
# Split mode: overlap
# Equation: continualOverlapRatio
# Treshold: 0.600
----------------------------
00001 : sledeće : sledeći 0.83333
00002 : industrijski : industrija 0.72727
00003 : zaštita : zaštitni 0.71429
00004 : proces : procesni 0.71429
00005 : montaža : montažni 0.71429
00006 : proizvodnja : proizvod 0.70000
00007 : zavarivanje : zavarivač 0.70000
00008 : paletni : paleta 0.66667
00009 : bravarski : bravarija 0.62500
00010 : privredan : privremen 0.62500
00011 : prav : pravni 0.60000
00012 : metalurgija : metalurški 0.60000
00013 : kontinualan : kontinuiran 0.60000
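The link-per-node ratios in the three reports can be cross-checked with a little arithmetic. The sketch below assumes that the ratio is simply the number of links divided by the number of nodes, and that each reported pair adds exactly one link. Both are assumptions that happen to be consistent with the reported numbers, not statements about how Cloud Weaver computes them; the class and method names are hypothetical.

```csharp
using System;

// Illustrative arithmetic only; NOT part of imbNLP or Cloud Weaver.
public static class LinkRatioSketch
{
    // Number of links implied by a reported link-per-node ratio,
    // rounded to the nearest whole link.
    public static int ImpliedLinks(int nodes, double ratio)
    {
        return (int)Math.Round(nodes * ratio);
    }

    // Ratio after adding one link per reported word pair (assumption).
    public static double RatioAfter(int nodes, int initialLinks, int addedPairs)
    {
        return (double)(initialLinks + addedPairs) / nodes;
    }
}
```

With 205 nodes, the initial ratio of 0.62927 implies 129 links. Adding the 45 Dice Coefficient pairs gives 174 / 205 ≈ 0.84878, and adding the 13 pairs found by the Jaccard Index or by COR gives 142 / 205 ≈ 0.69268, matching the values in the report headers.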