Natural Language Processing module covers:
- Token-level operations
- Word (or any string) similarity computation with several methods (Dice Coefficient, Jaccard Index of n-gram sets, Continual Overlap Ratio)
- Content decomposition pipeline
- Content elements tagging with Regex pattern rules
- Multi-faceted content element rendering, for Regex pattern based flagging
- Phrase / Chunk extraction
- Part-of-Speech tagging
- Querying various types of dictionaries
- Hunspell dictionary (spellchecker and hyphenation)
- Apertium bilingual dictionaries (word translation)
- Unitex morphosyntactic & inflectional dictionary
- MULTEXT-East v5 morphosyntactic & inflectional dictionary
- Custom (Excel spreadsheet driven) replacement, tagging and named entity (word type) dictionaries
- Ontology extraction
- Semantic Cloud (non-hierarchical) construction and term expansion
- Semantic Lexicon (hierarchical) construction and term expansion
- Vector Space Model (TF-IDF) and Semantic Similarity Retrieval Model similarity computation
- TF-IDF and Lemma Table extraction from web content or other text documents
- Text transliteration (conversion between Cyrillic and Latin scripts is implemented by default)
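
To illustrate the string similarity methods named in the list, here is a minimal Python sketch of the Dice Coefficient and the Jaccard Index computed over character n-gram sets. It is not the module's own API: the function names (`char_ngrams`, `dice_coefficient`, `jaccard_index`), the default n-gram size of 2, and the lowercasing step are all assumptions made for the example. The Continual Overlap Ratio is left out because the list does not define it.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams of `text`, lowercased.

    Hypothetical helper: strings shorter than `n` yield a single
    element (the string itself), and an empty string yields an empty set.
    """
    text = text.lower()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_index(a: str, b: str, n: int = 2) -> float:
    """Jaccard Index: |X & Y| / |X | Y| over the two n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    union = x | y
    # Two empty inputs are treated as identical (assumption).
    return len(x & y) / len(union) if union else 1.0


def dice_coefficient(a: str, b: str, n: int = 2) -> float:
    """Dice Coefficient: 2 * |X & Y| / (|X| + |Y|) over the two n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    total = len(x) + len(y)
    # Two empty inputs are treated as identical (assumption).
    return 2 * len(x & y) / total if total else 1.0


if __name__ == "__main__":
    # "night" and "nacht" share exactly one bigram ("ht") out of four each.
    print(dice_coefficient("night", "nacht"))  # 0.25
    print(jaccard_index("night", "nacht"))     # 1/7
```

Both measures range from 0 (no shared n-grams) to 1 (identical n-gram sets). For the same pair of strings the Dice Coefficient is never lower than the Jaccard Index, since it counts the shared n-grams twice.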