Language resources specifications

Console applications that use the imbNLP library (such as the imbWBI Console Tool) and/or imbNLP console plugins may keep several Excel (.xlsx) spreadsheets in the [application path]/resources/ directory subtree. These spreadsheets declare different sets of meta information about languages, lexical resources and system knowledge, which are consumed by the NLP / Knowledge Extraction procedures of the framework. This page enumerates and describes the most important ones.

 

List of Hunspell supported languages

Contains the list of languages supported through the basicLanguage class of the imbNLP.Core namespace, coupled with Hunspell dictionary files in the [application path]/resources/lexical/hunspell subdirectory. Several Frontier Ranking Rules of imbWEM use the crawler_url_needles column to anticipate the language of the content that a newly discovered URL points to.

File location:

  • [application path]/resources/lexical/hunspel_list.xlsx

Hunspell list file: column explanation

Column | Example | Description
basicLanguageEnum | german | Name of the Enum member
file_prefix | de_DE_frami | Prefix of the Hunspell dictionary files
iso2code | de | ISO 2-letter language code
countryCode | DE | ISO 2-letter country code
englishName | German | Language name, in English
nativeName | Deutsch | Language name, in the native language
crawler_url_needles | de,ger,nem,deutsch,german,deu,nemacki,nemački | URL needles that suggest the language of the targeted page/content
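
The page does not describe how the Frontier Ranking Rules match needles against a URL. As a minimal sketch, assuming that a needle counts as a hit only when it equals a whole alphanumeric token of the URL (so that "de" does not fire on "models"), the lookup could look like this in Python. The function names url_tokens and guess_language are hypothetical, and the single row is the German example from the table above, held in memory rather than read from the spreadsheet.

```python
import re
from urllib.parse import urlparse

# One row of hunspel_list.xlsx, copied from the example column of the table.
LANGUAGE_ROWS = [
    {
        "basicLanguageEnum": "german",
        "crawler_url_needles": "de,ger,nem,deutsch,german,deu,nemacki,nemački",
    },
]


def url_tokens(url):
    """Split host, path and query of a URL into lowercase alphanumeric tokens."""
    parts = urlparse(url.lower())
    text = " ".join([parts.netloc, parts.path, parts.query])
    # [^\W_]+ matches runs of letters and digits, so "_" also acts as a separator.
    return set(re.findall(r"[^\W_]+", text))


def guess_language(url, rows=LANGUAGE_ROWS):
    """Return the basicLanguageEnum name with the most needle hits, or None."""
    tokens = url_tokens(url)
    best_name, best_hits = None, 0
    for row in rows:
        needles = {
            needle.strip().lower()
            for needle in row["crawler_url_needles"].split(",")
            if needle.strip()
        }
        hits = len(needles & tokens)  # whole-token matches only (assumption)
        if hits > best_hits:
            best_name, best_hits = row["basicLanguageEnum"], hits
    return best_name


print(guess_language("https://example.com/de/kontakt.html"))
print(guess_language("https://example.com/models/"))
```

Whole-token matching is a design choice of this sketch: short needles such as "de" would otherwise match as substrings of unrelated words.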

 

Morphosyntactic tags interpretation tables

The Business Entity Classification research uses the srLex 1.2 lexicon (Ljubešić et al., 2016) for the Serbian language, specified in the MULTEXT-East V5 morphosyntactic tag-set format. To make the framework easier for other researchers to use, the interpretation table for the tag-set format is specified in an external Excel spreadsheet file. Virtually any morphosyntactic dictionary can be consumed by the framework after a simple modification of the template spreadsheet file (Template for MSD tag-set interpretation specification).

File location:

  • [application path]/resources/lexical/[MSD resource type name]/[language code]_[MSD resource type]_conversion.xlsx

File location in the particular case of the BEC / imbWBI Console Tool, for the interpretation table of the MULTEXT-East v5 tag-set used with srLex 1.2:

  • [application path]/resources/lexical/multext/sr_multitext_conversion.xlsx

The Excel file contains two sheets (the order of the sheets is significant, not their names):

  • “translation”, with one row per tag character and the following columns:
    • character in the tag-set format
    • corresponding Enum type and member name, from the imbNLP.PartOfSpeech.flags namespace
    • description string (optional)

Example:

I | pos_type.INT | Interjection
Y | pos_type.ABB | Abbreviation
X | pos_type.RES | Residual
Z | pos_type.PUNCT | Punctuation
c | pos_nounType.common | common
p | pos_nounType.proper | proper
m | pos_gender.m | masculine
f | pos_gender.f | feminine
n | pos_gender.n | neuter
s | pos_number.s | singular
p | pos_number.p | plural
  • “format”, with one row per word category and the following columns:
    • comma-separated, ordered list of Enum type names (from the imbNLP.PartOfSpeech.flags namespace) that make up the tag-set string
    • corresponding word category, represented by pos_type.[member name], that uses the specified tag-set string format

Example:

pos_nounType, pos_gender, pos_number, pos_gramaticalCase, pos_animatness | pos_type.N
pos_verbType, pos_verbform, pos_person, pos_number, pos_gender, pos_negation | pos_type.V
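
The two sheets work together: the same character can appear more than once in “translation” (for example, p is both pos_nounType.proper and pos_number.p), and the “format” sheet says which Enum type applies at each position of the tag. The following Python sketch illustrates that positional lookup on a few rows from the examples above. It is not the imbNLP implementation, whose internals are not described here. The helper names build_tables and decode_tag are hypothetical, the row N → pos_type.N is an assumption (implied by pos_type.N in the “format” example but not listed in the “translation” example), and the handling of "-" follows the MULTEXT-East convention for a non-applicable attribute.

```python
# Rows of the "translation" sheet: (character, "EnumType.member").
# The "N" row is an ASSUMPTION; the other rows are copied from the example.
TRANSLATION_ROWS = [
    ("N", "pos_type.N"),
    ("I", "pos_type.INT"),
    ("c", "pos_nounType.common"),
    ("p", "pos_nounType.proper"),
    ("m", "pos_gender.m"),
    ("f", "pos_gender.f"),
    ("n", "pos_gender.n"),
    ("s", "pos_number.s"),
    ("p", "pos_number.p"),
]

# Rows of the "format" sheet: (ordered Enum type names, word category).
FORMAT_ROWS = [
    ("pos_nounType, pos_gender, pos_number, pos_gramaticalCase, pos_animatness",
     "pos_type.N"),
]


def build_tables(translation_rows, format_rows):
    """Index both sheets for positional decoding."""
    # Key by (Enum type, character): one character may belong to several types.
    members = {}
    for char, qualified in translation_rows:
        enum_type = qualified.split(".", 1)[0]
        members[(enum_type, char)] = qualified
    formats = {
        category: [name.strip() for name in type_list.split(",")]
        for type_list, category in format_rows
    }
    return members, formats


def decode_tag(tag, members, formats):
    """Translate a positional tag such as "Ncms" into Enum member names."""
    category = members[("pos_type", tag[0])]
    decoded = [category]
    # Character i of the tag body is read as the i-th Enum type of the format.
    for enum_type, char in zip(formats.get(category, []), tag[1:]):
        if char == "-":  # non-applicable attribute, nothing to record
            continue
        decoded.append(members[(enum_type, char)])
    return decoded


MEMBERS, FORMATS = build_tables(TRANSLATION_ROWS, FORMAT_ROWS)
print(decode_tag("Ncms", MEMBERS, FORMATS))
print(decode_tag("Npfp", MEMBERS, FORMATS))
```

In the second call the two p characters resolve to different members (pos_nounType.proper and pos_number.p) because the lookup key includes the Enum type given by the position. A character that is missing from the table raises a KeyError in this sketch.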

 

Attachments

  • Default hunspel_list.xlsx file
    File size: 7 KB
  • Template for MULTEXT-East v5 MSD tag-set
    Template file for MSD tag-set interpretation. Contains the specification for the MULTEXT-East v5 MSD tag-set.
    File size: 9 KB