The Part Of Speech library of the imbNLP module contains several utility and data model classes that support operations with weighted terms (or lemmas) and interconnected terms (Semantic Cloud).
Lemma Term and Lemma Table
Developed from the concept of TF-IDF, the Web Lemma Term and Web Lemma Table classes support computation of semantic similarity between documents. These classes, together with other utility classes in the namespace, are used for both the Lemma Table and the Phrase Table in the Case and Category representation model of the Business Entity Classification (BEC) system, an implementation of the Industry Term Model (business category / industry description model) for Business Entities classification by processing web site content. Part of imbWBI.... Below is an exemplary excerpt from a BEC Category Lemma Table:
|Lemma form||Abs. frequency||Weight||Document Set Freq||Document Freq||Term Frequency||IDF factor|
Download the Excel report generated from a BEC Category Lemma Table: Lemma Table, example from BEC
Below are descriptions of the Web Lemma Term properties, as given in the report.
|Lemma form||Lemma form of the entry||LABEL|
|Abs. frequency||Number of lemma forms detected in the set||FREQUENCIES||aTF||n|
|Weight||Final weight applied to the term||FREQUENCIES||TF-IDF||w|
|Nominal form||Alias of the Lemma form||LEMMA|
|Derived words||Inflected forms (word inflection, i.e. application of morphology) assigned to the lemma term||INFLECTIONS|
|Document Set Freq||Number of Document Sets (web sites) containing the lemma form||FREQUENCIES||DSF||n|
|Document Freq||Number of Documents (web pages) containing the lemma form||FREQUENCIES||DF||n|
|Term Frequency||Normalized weight of the lemma form||FREQUENCIES||TF||w|
|IDF factor||The IDF factor applied to normalized lemma form frequency||FACTORS||IDF||r|
The imbVeles framework has several alternative TF-IDF and similar frequency counting classes:
- weightTable<TWeightTableTerm>: Universal weight table, for SVM/TF-IDF and SSRM similarity computation and Semantic Lexicon based term expansion. The class has embedded term weighting computation
- weightTableSet<TWeightTableTerm, TWeightTable>: parent object of the weightTable, representing a set of Documents, facilitates IDF computation
- instanceCountCollection<T>: Generic instance count collection
- numericSampleStatistics:instanceCountCollection<Int32> : Derived counter collection, for Int32 statistics
- enumCounter<T>: counter collection, for enumeration types
Lemma Table constructor
For Lemma Table and Phrase Table construction, two constructor classes are implemented: wlfConstructorTFIDF and chunkConstructorTF. Both consume the content graph (pipelineTaskSubjectContentToken and derived classes) produced by a Content Processing Pipeline model and the Pipeline Machine host class. The set of unique lemmas is a complex data structure that maintains a relationship with each token in the analyzed web pages. This feature allows HTML-tag related weighting factors to be applied during lemma frequency counting, producing the initial HTML-weighted frequency (Fi) for term i. HTML tags are separated into three groups, each with a corresponding weight factor. The content of the page title meta tag is in the same group (H) as all heading levels H1-H6. Alternative description text for images, defined through the alt attribute of image tags, is grouped together with normal anchor text, found as the node value of link tags (group L). Text extracted from the rest of the visible HTML tags falls into the last group (T).
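The HTML-weighted counting described above can be illustrated with a minimal Python sketch. It is not the wlfConstructorTFIDF implementation: the function names, the tag-to-group mapping details, and the weight values (3.0, 2.0, 1.0) are assumptions chosen only for illustration, since the actual factors are not stated here.

```python
from collections import defaultdict

# Hypothetical weight factors per tag group; the real values are not given in this text.
GROUP_WEIGHTS = {"H": 3.0, "L": 2.0, "T": 1.0}

# Grouping as described: title and H1-H6 -> H; image alt text and anchor text -> L;
# text from all other visible tags -> T.
HEADING_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}
LINK_TAGS = {"a", "img"}


def tag_group(tag):
    """Map an HTML tag name to one of the three weighting groups."""
    tag = tag.lower()
    if tag in HEADING_TAGS:
        return "H"
    if tag in LINK_TAGS:
        return "L"
    return "T"


def html_weighted_frequency(tokens, weights=GROUP_WEIGHTS):
    """Sum group weights per lemma.

    tokens: iterable of (lemma, html_tag) pairs, one per token occurrence.
    Returns a dict mapping each lemma to its HTML-weighted frequency F_i.
    """
    freq = defaultdict(float)
    for lemma, tag in tokens:
        freq[lemma] += weights[tag_group(tag)]
    return dict(freq)


tokens = [
    ("steel", "title"),
    ("steel", "p"),
    ("pipe", "a"),
    ("pipe", "img"),
    ("steel", "h2"),
]
weighted = html_weighted_frequency(tokens)
```

With these assumed weights, a lemma that appears in the title and a heading contributes more to Fi than the same number of occurrences in body text.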
Initial tests showed an important weakness of typical IDF computation, related to a modern web design concept: several web sites in the research sample had a single-page design, optimized for touch-screen devices like cell phones and tablets. Treating such web sites as single pages would undermine the relevance of the terms found there and would conflict with the user's perception of such dynamically presented content. Whatever approach is taken to counting such a page, typical IDF computation returns zero weight for the complete set of extracted terms. Another drawback, in the context of the classification problem, is that words found on every page in the set are not necessarily unwanted or semantically insignificant. To overcome these issues, we introduced the Document Frequency Correction (DFC) factor (Table 12), initially set to 1.1. DFC is a multiplier applied to the Number of Documents variable in the TF-cIDF term weight model: if DFC = 0, the model is plain TF; if DFC = 1, it is typical TF-IDF. The factor reduces the influence of IDF by multiplying the number of documents in the IDF equation.
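The effect of the correction on a single-page site can be shown with a small Python sketch. The exact TF-cIDF equation is not given in this section, so the form log(DFC * N / DF) is an assumption based on the statement that the factor multiplies the number of documents. Treating DFC = 0 as a special case that falls back to plain TF is also an assumption, made only to match the described behavior. The function names are hypothetical.

```python
import math


def corrected_idf(n_docs, doc_freq, dfc=1.1):
    """Assumed corrected IDF: log(DFC * N / DF).

    dfc == 0 is handled as a special case (factor 1.0), so the weight
    reduces to plain TF, as the text describes for DFC = 0.
    """
    if n_docs <= 0 or doc_freq <= 0:
        raise ValueError("n_docs and doc_freq must be positive")
    if dfc == 0:
        return 1.0
    return math.log(dfc * n_docs / doc_freq)


def tf_cidf(tf, n_docs, doc_freq, dfc=1.1):
    """Term weight under the assumed TF-cIDF form."""
    return tf * corrected_idf(n_docs, doc_freq, dfc)


# Single-page web site: one document, and every term occurs in it.
single_page_plain = tf_cidf(0.5, n_docs=1, doc_freq=1, dfc=1.0)
single_page_corrected = tf_cidf(0.5, n_docs=1, doc_freq=1)
```

Under this reading, typical TF-IDF (DFC = 1) gives the term a weight of zero because log(1) = 0, while the default DFC of 1.1 leaves a small positive weight.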
In its default application (as defined in the Business Entity Classification system), the Semantic Cloud is a non-directed graph data structure describing a non-hierarchical network of semantically related lemma terms. The lemmas are represented as weighted nodes that have a single type of relationship with other nodes in the cloud. The relationships are declared through non-directed link instances, each connecting two nodes. At the level of the actual class implementation (lemmaSemanticCloud), however, both nodes and links support category (or type) assignment as an Int32 value, and links support a weight defined as a Double value, in the same way as nodes. Furthermore, the direction of the relationship is also an inherent property of a link. These features are inherited from the base classes of the imbSCI.Graph.FreeGraph namespace.
The class supports overlap query, term expansion, cloud merge, and load and save operations.
Semantic Cloud Constructor
The Semantic Cloud Constructor, implemented as part of the Business Entity Classification system research, constructs the cloud using the Lemma Table and the Phrase Table. The initial node weight is inherited from the Lemma Table of the Category and is subject to sequential modification by the Cloud Matrix component. Lemmas that were categorized (in the Term Categorization process) as at least Reserve Category are introduced into the preliminary version of the Semantic Cloud. The links between the nodes are induced from the Phrase Table, following a simple co-occurrence rule: all lemmas found in the same chunk are considered inter-related semantic neighbors.
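The co-occurrence rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the lemmaSemanticCloud API: the function name `build_cloud`, the plain dictionary representation, and the use of the co-occurrence count as link weight are all hypothetical choices.

```python
from itertools import combinations


def build_cloud(lemma_weights, chunks):
    """Build a non-directed cloud from lemma weights and phrase chunks.

    lemma_weights: dict of admitted lemmas and their initial node weights
                   (inherited from the Lemma Table in the text's terms).
    chunks: iterable of lemma lists, one list per chunk of the Phrase Table.
    Returns (nodes, links); links maps an alphabetically ordered lemma pair
    to the number of chunks in which both lemmas co-occur (assumed weight).
    """
    nodes = dict(lemma_weights)
    links = {}
    for chunk in chunks:
        # Keep only lemmas admitted to the cloud; sorting gives each
        # non-directed pair a single canonical key.
        members = sorted({lemma for lemma in chunk if lemma in nodes})
        for a, b in combinations(members, 2):
            links[(a, b)] = links.get((a, b), 0) + 1
    return nodes, links


lemma_weights = {"steel": 0.8, "pipe": 0.5, "weld": 0.3}
chunks = [
    ["steel", "pipe"],
    ["steel", "pipe", "weld"],
    ["pipe", "offer"],  # "offer" is not an admitted lemma, so no link is created
]
nodes, links = build_cloud(lemma_weights, chunks)
```

Storing each pair under a sorted key keeps the links non-directed, which matches the default application described earlier, even though the underlying classes also support directed links.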