Weighted Terms and Semantic Clouds

The Part Of Speech library of the imbNLP module contains several utility and data model classes supporting operations with weighted terms (or lemmas) and interconnected terms (Semantic Cloud).

Lemma Term and Lemma Table

Developed from the concept of TF-IDF, the Web Lemma Term and Web Lemma Table classes provide support for document semantic similarity computation. These classes, together with other utility classes in the namespace, are used for both the Lemma Table and the Phrase Table in the Case and Category representation model of BEC, the Business Entity Classification system. BEC is an implementation of the Industry Term Model (a business category / industry description model) for classifying Business Entities by processing web site content, and is part of imbWBI. Below is an exemplary excerpt from a BEC Category Lemma Table:

LABEL         FREQUENCIES                                                                FACTORS
Lemma form    Abs. frequency  Weight   Document Set Freq  Document Freq  Term Frequency  IDF factor
(unit)        n               w        n                  n              w               r
konstrukcija  282             0.51777  10                 75             1.00000         0.95037
čelik         283             1.00000  5                  24             0.87832         2.08980
čeličan       245             0.44908  10                 73             0.84336         0.97740
kontakt       190             0.31795  10                 97             0.84196         0.69315
proizvod      157             0.38172  9                  61             0.60559         1.15698
hala          150             0.57497  7                  33             0.59580         1.77135
referenca     115             0.37536  5                  49             0.50070         1.37604

An Excel report generated from a BEC Category Lemma Table is available for download (Lemma Table, example from BEC).

Below is a description of the Web Lemma Term properties, as given in the report.

Name               Description                                                    Group        Letter  Unit
Lemma form         Lemma form of the entry                                        LABEL
Abs. frequency     Number of lemma forms detected in the set                      FREQUENCIES  aTF     n
Weight             Final weight applied to the term                               FREQUENCIES  TF-IDF  w
Nominal form       Alias of the Lemma form                                        LEMMA
Derived words      Inflection forms assigned to the lemma term                    INFLECTIONS
Document Set Freq  Number of Document Sets (web sites) containing the lemma form  FREQUENCIES  DSF     n
Document Freq      Number of Documents (web pages) containing the lemma form      FREQUENCIES  DF      n
Term Frequency     Normalized weight of the lemma form                            FREQUENCIES  TF      w
IDF factor         The IDF factor applied to normalized lemma form frequency      FACTORS      IDF     r
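
The report does not spell out how the Weight column is derived. The figures in the excerpt are, however, consistent with a simple reading: multiply Term Frequency by the IDF factor and rescale so that the largest product equals 1. The Python sketch below recomputes the weights under that assumption. The function name normalized_tf_idf is illustrative and not part of imbNLP, and the formula is an inference from the published numbers, not a documented one.

```python
# Rows copied from the Category Lemma Table excerpt:
# (lemma, reported weight, term frequency, IDF factor)
rows = [
    ("konstrukcija", 0.51777, 1.00000, 0.95037),
    ("čelik",        1.00000, 0.87832, 2.08980),
    ("čeličan",      0.44908, 0.84336, 0.97740),
    ("kontakt",      0.31795, 0.84196, 0.69315),
    ("proizvod",     0.38172, 0.60559, 1.15698),
    ("hala",         0.57497, 0.59580, 1.77135),
    ("referenca",    0.37536, 0.50070, 1.37604),
]


def normalized_tf_idf(rows):
    """Return {lemma: TF * IDF / max(TF * IDF)} (assumed weighting)."""
    raw = {lemma: tf * idf for lemma, _, tf, idf in rows}
    top = max(raw.values())
    return {lemma: value / top for lemma, value in raw.items()}


weights = normalized_tf_idf(rows)
for lemma, reported, _, _ in rows:
    print(f"{lemma:<13} reported={reported:.5f} recomputed={weights[lemma]:.5f}")
```

For all seven rows the recomputed value agrees with the reported weight to within rounding of the five printed decimals.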

The imbVeles framework has several alternative TF-IDF and similar frequency-counting classes.

Lemma Table constructor

For Lemma Table and Phrase Table construction, two constructor classes are implemented: wlfConstructorTFIDF and chunkConstructorTF. Both consume the content graph (pipelineTaskSubjectContentToken and derived classes) produced by a Content Processing Pipeline model and the Pipeline Machine host class. The set of unique lemmas is a complex data structure that maintains the relationship between each lemma and every token in the analyzed web pages. This feature allows HTML-tag-related weighting factors to be applied during lemma frequency counting, producing the initial HTML-weighted frequency (Fi) for term i.

HTML tags are separated into three groups, each with a corresponding weight factor. The content of the page title meta tag is in the same group (H) as all heading levels H1-H6. Alternative description text for images, defined through the alt attribute of image tags, is interpreted together with normal anchor text found as the node value of link tags (L). Text extracted from the rest of the visible HTML tags is assigned to the last group (T).
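
The grouping described above can be illustrated with a small Python sketch. It is not the wlfConstructorTFIDF implementation: the weight values in GROUP_WEIGHTS are hypothetical (the text does not state them), the helper names are made up for illustration, and it assumes that Fi is simply the sum of the group weights over all occurrences of lemma i.

```python
from collections import defaultdict

# Hypothetical weight factors per tag group; the actual values are not given in the text.
GROUP_WEIGHTS = {"H": 3.0, "L": 2.0, "T": 1.0}

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def tag_group(tag, attribute=None):
    """Map an HTML tag (and optional attribute) to group H, L, or T."""
    tag = tag.lower()
    if tag == "title" or tag in HEADING_TAGS:
        return "H"  # page title and headings H1-H6
    if tag == "a" or (tag == "img" and attribute == "alt"):
        return "L"  # anchor text and image alt text
    return "T"      # any other visible text


def html_weighted_frequency(occurrences):
    """Sum group weights per lemma; occurrences are (lemma, tag, attribute) tuples."""
    freq = defaultdict(float)
    for lemma, tag, attribute in occurrences:
        freq[lemma] += GROUP_WEIGHTS[tag_group(tag, attribute)]
    return dict(freq)


occurrences = [
    ("čelik", "title", None),
    ("čelik", "p", None),
    ("hala", "img", "alt"),
    ("hala", "a", None),
    ("kontakt", "li", None),
]
print(html_weighted_frequency(occurrences))
```

With these illustrative weights, a lemma that appears once in the title and once in a paragraph ends up with the same Fi as a lemma that appears twice in link-group text.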

Initial tests showed an important weakness of the typical IDF computation, related to a modern web design concept: several web sites in the research sample had a single-page design, optimized for touch-screen devices such as cell phones and tablets. Treating such web sites as single pages would undermine the relevance of the terms found there and would conflict with the user's perception of such dynamically presented content. Whatever approach is taken to counting such a page, the typical IDF computation returns zero weight for the complete set of extracted terms. Another drawback, in the context of the classification problem, is that words found on every page in the set are not necessarily unwanted or semantically insignificant. To overcome these issues, we introduced the Document Frequency Correction (DFC) factor (Table 12), initially set to 1.1. DFC is a multiplier applied to the Number of Documents variable in the TF-cIDF term weight model: if DFC = 0, the model is TF; if DFC = 1, it is typical TF-IDF. The factor reduces the influence of IDF by multiplying the number of documents in the IDF equation.
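
The exact TF-cIDF equation is not reproduced here, so the following Python sketch rests on an assumption: that the correction enters as ln(DFC * N / DF), where N is the number of documents and DF the document frequency of the term. The function name corrected_idf is illustrative. This assumed form is only defined for DFC > 0, so it does not cover the DFC = 0 case mentioned above.

```python
import math


def corrected_idf(n_documents, document_freq, dfc=1.1):
    """IDF with the document count multiplied by DFC (assumed form: ln(DFC * N / DF))."""
    return math.log(dfc * n_documents / document_freq)


# A single-page site: the only page contains every extracted term, so N = DF = 1.
print(corrected_idf(1, 1, dfc=1.0))  # typical IDF
print(corrected_idf(1, 1, dfc=1.1))  # corrected IDF

# Document Freq and IDF factor pairs from the Category Lemma Table excerpt,
# compared with ln(194 / DF).
for df, reported in [(75, 0.95037), (24, 2.08980), (97, 0.69315)]:
    print(df, reported, round(math.log(194 / df), 5))
```

With DFC = 1 a term found on every page gets an IDF of zero, while DFC = 1.1 leaves it a small positive value. The IDF column of the Category Lemma Table excerpt is reproduced by ln(194 / DF), which is consistent with this form for DFC * N = 194. The separate values of DFC and N for that table are not stated.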

Semantic Cloud

In its default application (as defined in the Business Entity Classification system), the Semantic Cloud is a non-directed graph data structure describing a non-hierarchical network of semantically related lemma terms. The lemmas are represented as weighted nodes that have a single type of relationship with other nodes in the cloud. The relationships are declared through non-directed link instances, each connecting two nodes. At the level of the actual class implementation (lemmaSemanticCloud), however, both nodes and links support category (or type) assignment as an Int32 value, and links support a weight defined as a Double value, in the same way as nodes. Furthermore, the direction of the relationship is also an inherent property of a link. These features are inherited from the base classes of the imbSCI.Graph.FreeGraph namespace.

Zoomed-in semantic cloud, an example from the BEC research, showing the biggest semantic cloud.

The class supports overlap query, term expansion, cloud merge, and load and save operations.
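
To make these operations more concrete, here is a minimal Python sketch of an undirected cloud with weighted nodes and links. It is not the lemmaSemanticCloud API. The class and method names are hypothetical, and the semantics are assumptions: overlap returns the lemmas shared by two clouds, expansion returns the given terms plus their direct neighbors, and merge keeps the larger weight when both clouds contain the same node or link. Load and save are omitted.

```python
class SemanticCloud:
    """Undirected graph of weighted lemma nodes joined by weighted links."""

    def __init__(self):
        self.nodes = {}  # lemma -> weight
        self.links = {}  # frozenset({lemma_a, lemma_b}) -> weight

    def add_node(self, lemma, weight=1.0):
        self.nodes[lemma] = weight

    def add_link(self, a, b, weight=1.0):
        if a == b or a not in self.nodes or b not in self.nodes:
            raise ValueError("a link needs two distinct existing nodes")
        self.links[frozenset((a, b))] = weight

    def overlap(self, other):
        """Lemmas present in both clouds."""
        return set(self.nodes) & set(other.nodes)

    def expand(self, terms):
        """Terms found in the cloud plus their direct neighbors."""
        known = {term for term in terms if term in self.nodes}
        expanded = set(known)
        for pair in self.links:
            if pair & known:
                expanded |= pair
        return expanded

    def merge(self, other):
        """New cloud with the union of nodes and links; the larger weight wins."""
        merged = SemanticCloud()
        for source in (self, other):
            for lemma, weight in source.nodes.items():
                merged.nodes[lemma] = max(weight, merged.nodes.get(lemma, weight))
            for pair, weight in source.links.items():
                merged.links[pair] = max(weight, merged.links.get(pair, weight))
        return merged


steel = SemanticCloud()
steel.add_node("čelik", 1.0)
steel.add_node("konstrukcija", 0.52)
steel.add_link("čelik", "konstrukcija")

halls = SemanticCloud()
halls.add_node("hala", 0.57)
halls.add_node("konstrukcija", 0.40)
halls.add_link("hala", "konstrukcija")

print(steel.overlap(halls))
print(sorted(steel.merge(halls).expand(["hala"])))
```

Storing each link under a frozenset of its two endpoints keeps the graph undirected: the pair (a, b) and the pair (b, a) map to the same key.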

Semantic Cloud Constructor

The Semantic Cloud Constructor, implemented as part of the BEC research, constructs the cloud using the Lemma Table and the Phrase Table. The initial node weight is inherited from the Lemma Table of the Category and is subject to sequential modification by the Cloud Matrix component. Lemmas that were categorized (in the Term Categorization process) as at least Reserve Category are introduced into the preliminary version of the Semantic Cloud. The links between the nodes are induced from the Phrase Table, following a simple co-occurrence rule: all lemmas found in the same chunk are considered inter-related semantic neighbors.
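
The co-occurrence rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Semantic Cloud Constructor itself: build_cloud is a hypothetical helper, the chunks are invented sample data, lemma_weights stands in for the lemmas already admitted by Term Categorization, and the later Cloud Matrix weight modification is not modeled.

```python
from itertools import combinations


def build_cloud(lemma_weights, chunks):
    """Return (nodes, links): nodes inherit Lemma Table weights,
    links join every pair of admitted lemmas that share a chunk."""
    nodes = dict(lemma_weights)
    links = set()
    for chunk in chunks:
        # Assumption: lemmas that were not admitted to the cloud are skipped.
        members = sorted({lemma for lemma in chunk if lemma in nodes})
        for a, b in combinations(members, 2):
            links.add(frozenset((a, b)))  # undirected link
    return nodes, links


# Weights taken from the Category Lemma Table excerpt; chunks are made up.
lemma_weights = {"čelik": 1.00000, "konstrukcija": 0.51777, "hala": 0.57497}
chunks = [
    ["čelik", "konstrukcija"],
    ["čelik", "hala", "montaža"],  # "montaža" is not in the lemma table
]

nodes, links = build_cloud(lemma_weights, chunks)
print(nodes)
print(sorted(sorted(pair) for pair in links))
```

In this sample, "hala" and "konstrukcija" never share a chunk, so no link is induced between them even though both are connected to "čelik".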

 

 
