imbNLP: Introduction

The Natural Language Processing module covers:

  • Token level operations
    • Word (or any string) similarity computation with several methods (Dice Coefficient, Jaccard Index of n-gram sets, Continual Overlap Ratio)
  • Content decomposition pipeline
    • Content element tagging with Regex pattern rules
    • Multi-faceted content element rendering, for Regex pattern-based flagging
    • Phrase / Chunk extraction
    • Part-of-Speech tagging
  • Querying various types of dictionaries
    • Hunspell dictionary (spellchecker and hyphenation)
    • Apertium bilingual dictionaries (word translation)
    • Unitex morphosyntactic & inflectional dictionary
    • MULTEXT-East v5 morphosyntactic & inflectional dictionary
    • Custom (Excel spreadsheet driven) replacement, tagging and named entity dictionaries
  • Ontology extraction
    • Semantic Cloud (non-hierarchical) construction and term expansion
    • Semantic Lexicon (hierarchical) construction and term expansion
  • Vector Space Model (TF-IDF) and Semantic Similarity Retrieval Model similarity computation
    • TF-IDF and Lemma Table extraction from web content or other text documents
  • Text transliteration (conversion between Cyrillic and Latin scripts is implemented by default)
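
To make the token-level similarity item in the list above more concrete, here is a minimal sketch of the Dice Coefficient and Jaccard Index computed over character n-gram sets. It is written in Python purely for illustration and is not the imbNLP API: the function names (char_ngrams, dice, jaccard), the default n-gram size of 2, and the lower-casing step are all assumptions made for this example.

```python
def char_ngrams(word: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams of a word (lower-cased).

    Hypothetical helper: words shorter than n yield a single
    n-gram containing the whole word, and an empty word yields
    an empty set.
    """
    w = word.lower()
    if len(w) < n:
        return {w} if w else set()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def dice(a: str, b: str, n: int = 2) -> float:
    """Dice Coefficient: 2 * |X & Y| / (|X| + |Y|)."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 1.0  # two empty strings are treated as identical
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard Index: |X & Y| / |X | Y|."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 1.0  # two empty strings are treated as identical
    return len(x & y) / len(x | y)


if __name__ == "__main__":
    # "night" and "nacht" share exactly one bigram ("ht").
    print(dice("night", "nacht"))
    print(jaccard("night", "nacht"))
```

Both measures work on the same n-gram sets. Dice weights the shared n-grams twice, so for a partially overlapping pair it gives a higher score than Jaccard.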

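The TF-IDF item in the list above can be sketched in the same spirit. The following Python example computes TF-IDF weights for a small tokenized corpus and compares two documents by cosine similarity. It assumes the common weighting tf = count / document length and idf = ln(N / df); the weighting actually used by imbNLP is not stated here, and the names tf_idf and cosine are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute a sparse TF-IDF vector (term -> weight) per document.

    Assumed weighting: tf = count / len(doc), idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    corpus = [
        ["semantic", "cloud", "term"],
        ["semantic", "lexicon", "term"],
        ["hunspell", "dictionary"],
    ]
    vectors = tf_idf(corpus)
    print(cosine(vectors[0], vectors[1]))
```

With this idf definition, a term that occurs in every document gets weight zero, so it contributes nothing to the similarity between documents.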