Console applications that use the imbNLP library (such as the imbWBI Console Tool) and/or imbNLP console plugins may keep several Excel (.xlsx) spreadsheets in the [application path]/resources/ directory subtree. These spreadsheets declare sets of meta information about languages, lexical resources and system knowledge, which are consumed by the NLP / Knowledge Extraction procedures of the framework. This page enumerates and describes the most important ones.
List of Hunspell supported languages
Contains the list of languages supported through the basicLanguage class of the imbNLP.Core namespace, coupled with Hunspell dictionary files in the [application path]/resources/lexical/hunspell subdirectory. Several Frontier Ranking Rules of imbWEM use the crawler_url_needles column to anticipate the language of the content that a newly discovered URL points to.
File location:
- [application path]/resources/lexical/hunspel_list.xlsx
Column | Example | Description |
basicLanguageEnum | german | Name of Enum member |
file_prefix | de_DE_frami | Prefix of Hunspell dictionary files |
iso2code | de | Language ISO 2-letter code |
countryCode | DE | Country ISO 2-letter code |
englishName | German | Language name, in English |
nativeName | Deutsch | Language name, in the native language |
crawler_url_needles | de,ger,nem,deutsch,german,deu,nemacki,nemački | URL needles that suggest the language of targeted page/content |
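The way a crawler_url_needles value could be matched against a newly discovered URL can be sketched as follows. This is a minimal Python illustration, not the imbWEM implementation: the function names, the token-based matching strategy and the "english" row are assumptions made for the example; only the "german" needles come from the table above.

```python
import re
from urllib.parse import urlparse

# Rows mimic the basicLanguageEnum and crawler_url_needles columns.
LANGUAGE_NEEDLES = {
    "german": "de,ger,nem,deutsch,german,deu,nemacki,nemački",  # from the table
    "english": "en,eng,english",  # hypothetical row, for illustration only
}


def url_tokens(url):
    """Split host, path and query of a URL into a set of lowercase tokens."""
    parts = urlparse(url.lower())
    text = " ".join([parts.netloc, parts.path, parts.query])
    # Split on every run of non-word characters and underscores.
    return {token for token in re.split(r"[\W_]+", text) if token}


def guess_languages(url, needles=LANGUAGE_NEEDLES):
    """Return language names whose needles occur as URL tokens, best match first."""
    tokens = url_tokens(url)
    scores = {}
    for language, needle_list in needles.items():
        hits = tokens & {needle.strip() for needle in needle_list.split(",")}
        if hits:
            scores[language] = len(hits)
    return sorted(scores, key=scores.get, reverse=True)
```

Matching whole tokens rather than substrings avoids false hits such as "de" inside "index"; a real ranking rule may weight or combine the evidence differently.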
Morphosyntactic tags interpretation tables
The Business Entity Classification research uses the srLex 1.2 lexicon (Ljubešić et al., 2016) for the Serbian language, specified in the MULTEXT-East V5 morphosyntactic tag-set format. To make the framework easier for other researchers to use, the interpretation table for the tag-set format is specified in an external Excel spreadsheet file. Virtually any morphosyntactic dictionary can be consumed by the framework after a simple modification of the template spreadsheet file (Template for MSD tag-set interpretation specification).
File location:
- [application path]/resources/lexical/[MSD resource type name]/[language code]_[MSD resource type]_conversion.xlsx
File location in the particular case of the BEC (Business Entity Classification) Console Tool, for the interpretation table of the MULTEXT-East v5 tag-set used with srLex 1.2:
- [application path]/resources/lexical/multext/sr_multitext_conversion.xlsx
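As a rough sketch, the path template above can be filled in like this. The function and parameter names are hypothetical, and "/opt/app" in the usage line stands in for an arbitrary application path. The directory name and the resource type in the file name are kept as separate arguments because they differ in the example above ("multext" vs. "multitext").

```python
from pathlib import PurePosixPath


def conversion_table_path(app_path, resource_dir, language_code, resource_type):
    """Build the expected location of an MSD tag-set interpretation table."""
    file_name = f"{language_code}_{resource_type}_conversion.xlsx"
    return PurePosixPath(app_path) / "resources" / "lexical" / resource_dir / file_name


# Hypothetical application path; the remaining parts follow the srLex example.
path = conversion_table_path("/opt/app", "multext", "sr", "multitext")
```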
The Excel file contains two sheets (the order of the sheets is relevant, not their names):
- “translation”, with the following columns:
  - character in the tag-set format
  - corresponding Enum type and member name, from the imbNLP.PartOfSpeech.flags namespace
  - description string (optional)
Example:
I | pos_type.INT | Interjection |
Y | pos_type.ABB | Abbreviation |
X | pos_type.RES | Residual |
Z | pos_type.PUNCT | Punctuation |
c | pos_nounType.common | common |
p | pos_nounType.proper | proper |
m | pos_gender.m | masculine |
f | pos_gender.f | feminine |
n | pos_gender.n | neuter |
s | pos_number.s | singular |
p | pos_number.p | plural |
- “format”, with the following columns:
  - comma-separated, ordered list of Enum type names (from the imbNLP.PartOfSpeech.flags namespace) of which the tag-set string is composed
  - corresponding word category, represented by pos_type.[member name], that uses the specified tag-set string format
Example:
pos_nounType, pos_gender, pos_number, pos_gramaticalCase, pos_animatness | pos_type.N |
pos_verbType, pos_verbform, pos_person, pos_number, pos_gender, pos_negation | pos_type.V |
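How the two sheets could work together to interpret a tag can be sketched in Python as follows. This is an illustration under stated assumptions, not the imbNLP implementation: the function names are hypothetical, the rows are typed in by hand instead of being read from the .xlsx file, and the "N | pos_type.N" translation row is assumed because it does not appear in the example above. The sketch relies on the MULTEXT-East convention that the first character of a tag is the word category and that a hyphen marks a non-applicable position.

```python
# Rows of the "translation" sheet: (tag character, Enum type.member, description).
TRANSLATION_ROWS = [
    ("N", "pos_type.N", "Noun"),  # assumed row, not shown in the example above
    ("c", "pos_nounType.common", "common"),
    ("p", "pos_nounType.proper", "proper"),
    ("m", "pos_gender.m", "masculine"),
    ("f", "pos_gender.f", "feminine"),
    ("n", "pos_gender.n", "neuter"),
    ("s", "pos_number.s", "singular"),
    ("p", "pos_number.p", "plural"),
]

# Rows of the "format" sheet: (ordered Enum type names, word category).
FORMAT_ROWS = [
    ("pos_nounType, pos_gender, pos_number, pos_gramaticalCase, pos_animatness",
     "pos_type.N"),
]


def build_lookup(rows):
    """Index translation rows by (Enum type name, tag character)."""
    lookup = {}
    for char, member, _description in rows:
        enum_type = member.split(".", 1)[0]
        lookup[(enum_type, char)] = member
    return lookup


def build_formats(rows):
    """Map each word category to its ordered list of Enum type names."""
    return {category: [name.strip() for name in types.split(",")]
            for types, category in rows}


def decode_tag(tag, lookup, formats):
    """Translate a tag such as 'Ncms' into a list of Enum members, or None."""
    if not tag:
        return None
    category = lookup.get(("pos_type", tag[0]))
    if category is None or category not in formats:
        return None
    flags = [category]
    # zip stops at the shorter sequence, so shortened tags are accepted.
    for enum_type, char in zip(formats[category], tag[1:]):
        if char == "-":  # non-applicable position
            continue
        member = lookup.get((enum_type, char))
        if member is not None:  # characters without a translation row are skipped
            flags.append(member)
    return flags


LOOKUP = build_lookup(TRANSLATION_ROWS)
FORMATS = build_formats(FORMAT_ROWS)
```

Keying the lookup by Enum type as well as character matters because the same character can appear more than once in the "translation" sheet: "p" means proper in the noun type position and plural in the number position, so decode_tag("Npfp", LOOKUP, FORMATS) resolves the two occurrences differently.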