The Language Module uses a heuristic table of language identifier tokens to recognize the most probable language of the linked page content. If no clear language identifier is recognized, it uses Hunspell dictionaries for Serbian (Goran Rakić, Igor Nestorović, 2013) and English (US) (Kevin, 2017) with a TF-IDF term weighting scheme to estimate the language.
The Language Module inherits from the Frontier Layer Module base class and declares three layers:
• M01.L01: The Primary (the surface layer)
• M01.L02: The Secondary (the 1st layer)
• M01.L03: The Reserve (the 2nd layer)
The Targets are distributed into layers by nine Passive rules followed by two Active rules, all of which test the Target Semantic Terms (TST) collection. The collection is built from both the URL path (Table 1) and the anchor text tokens, using several transformation and filtering operations:
• only the part of the URL relative to the domain name is considered
• tokens are extracted by a regular expression matching alphanumeric chunks
• purely numerical tokens are excluded
• numeric characters in alphanumeric tokens are trimmed
• all tokens are converted to lower case
URL | Tokens |
pekarska-pec-sa-komorom-elp64fk128/ | pekarska, pec, sa, komorom, elp, fk |
daikin-klima-uredjaj-inverter-ftxb35c-rxb35c-cena-akcija | daikin, klima, uredjaj, inverter, ftx, rxb, c, cena, akcija |
sr/cinkovanje/sta-je-toplo-cinkovanje/#branding | sr, cinkovanje, sta, je, toplo, cinkovanje, branding |
Table 1: Examples of token extraction from URL path.
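The extraction steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the module's actual code: the function names `tokenize` and `build_tst` are hypothetical, digit runs are assumed to act as separators inside a chunk (which matches the first row of Table 1), and the URL fragment is assumed to be tokenized together with the path (as the third row suggests).

```python
import re
from urllib.parse import urlsplit

# Runs of letters and digits (Unicode-aware, underscore excluded).
ALNUM_CHUNK = re.compile(r"[^\W_]+")
# Runs of digits, used to strip numeric characters from a chunk.
DIGIT_RUN = re.compile(r"\d+")


def tokenize(text):
    """Split text into lower-case tokens with all digits removed."""
    tokens = []
    for chunk in ALNUM_CHUNK.findall(text):
        # Assumption: digits split a chunk ("elp64fk128" -> "elp", "fk").
        # Purely numeric chunks yield only empty parts and are dropped.
        for part in DIGIT_RUN.split(chunk):
            if part:
                tokens.append(part.lower())
    return tokens


def build_tst(url, anchor_text=""):
    """Build the token collection from a URL and its anchor text."""
    parts = urlsplit(url)
    # Only the part relative to the domain name is used (path and fragment).
    relative = " ".join(p for p in (parts.path, parts.fragment) if p)
    return tokenize(relative) + tokenize(anchor_text)


print(build_tst("pekarska-pec-sa-komorom-elp64fk128/"))
```

Duplicates are kept in this sketch, since the third row of Table 1 lists "cinkovanje" twice.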
The first two rules in the evaluation sequence check whether an identification needle (Table 2) for the Serbian or English language is contained in the TST collection, and distribute the Targets to the corresponding layer.
| Serbian language | English language |
Layer | Primary | Secondary |
ISO 639 alpha-2 | sr | en |
ISO 639 alpha-3 | srp | eng |
ISO 3166-1 alpha-2 | rs | gr |
ISO 3166-1 alpha-3 | srb | gbr |
TLD | rs | uk |
Serbian name | srpski | engleski |
English name | Serbian | English |
Additional | yu | – |
Table 2: Needles used to test the TST for Serbian and English language identities. For the Serbian language, the additional “yu” identifier (for Yugoslavia, the country of which the Republic of Serbia is a successor) was introduced after it was observed in crawl logs during debugging sessions.
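As an illustration of these two rules, the sketch below tests a TST collection against the needles of Table 2 (lower-cased, as the TST tokens are). The function name `first_two_rules` and the string layer labels are hypothetical, and evaluating the Serbian rule before the English one is an assumption based on the column order of the table.

```python
# Needles copied from Table 2, lower-cased to match the TST tokens.
SERBIAN_NEEDLES = {"sr", "srp", "rs", "srb", "srpski", "serbian", "yu"}
ENGLISH_NEEDLES = {"en", "eng", "gr", "gbr", "uk", "engleski", "english"}


def first_two_rules(tst):
    """Return the layer chosen by the first two Passive rules, or None."""
    tokens = set(tst)
    # Rule 1 (assumed to run first): any Serbian needle -> Primary layer.
    if tokens & SERBIAN_NEEDLES:
        return "Primary"
    # Rule 2: any English needle -> Secondary layer.
    if tokens & ENGLISH_NEEDLES:
        return "Secondary"
    # No needle found: the Target is left for the following rules.
    return None


print(first_two_rules(["sr", "cinkovanje", "sta", "je", "toplo"]))
```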
The next seven Passive rules send Targets to the Reserve layer if a needle of any other language covered by the module (Table 3) is detected.
Language | Language identification needles |
German | de, ger, nem, deutsch, german, deu, nemacki, nemački |
Russian | ru, rus, russky, ruski, russian |
Italian | it, ita, italian, italiano, italijanski |
French | fr, fra, french, français, francuski |
Hungarian | hu, hun, hungarian, magyar, mađarski, madjarski |
Spanish | es, esp, spanish, espanol, španski, spanski |
Slovenian | si, sl, slovenian, slo, slovenački, slovenacki, slovenski |
Table 3: Needles used to test the TST for the other seven language identities.
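The seven Reserve rules can be sketched as one lookup over the needles of Table 3. The names `OTHER_LANGUAGE_NEEDLES` and `reserve_rule_match` are hypothetical, and the rule order is assumed to follow the row order of the table; the function reports which language triggered the rule, in which case the Target would go to the Reserve layer.

```python
# Needles copied from Table 3; dict order mirrors the table rows.
OTHER_LANGUAGE_NEEDLES = {
    "German": {"de", "ger", "nem", "deutsch", "german", "deu",
               "nemacki", "nemački"},
    "Russian": {"ru", "rus", "russky", "ruski", "russian"},
    "Italian": {"it", "ita", "italian", "italiano", "italijanski"},
    "French": {"fr", "fra", "french", "français", "francuski"},
    "Hungarian": {"hu", "hun", "hungarian", "magyar", "mađarski",
                  "madjarski"},
    "Spanish": {"es", "esp", "spanish", "espanol", "španski", "spanski"},
    "Slovenian": {"si", "sl", "slovenian", "slo", "slovenački",
                  "slovenacki", "slovenski"},
}


def reserve_rule_match(tst):
    """Return the first other language whose needle occurs in the TST."""
    tokens = set(tst)
    for language, needles in OTHER_LANGUAGE_NEEDLES.items():
        if tokens & needles:
            return language  # the Target would be sent to the Reserve layer
    return None


print(reserve_rule_match(["de", "produkte", "klima"]))
```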
The last two rules are designed to resolve the Targets without an obvious language identification. In the context of the TF-IDF model, all Targets the module received for evaluation (including the ones resolved by the first two rules) are treated as documents. The Target language weight (TLW) is calculated as follows:
where Wi is the TF-IDF term weight, l has the value 1 if the token is recognized by the Hunspell dictionary and -1 otherwise, and n is the number of tokens in the TST. If the Target language weight is above 0, the Target is considered positive in the language estimation test. All unassigned Targets are distributed to the Reserve layer.
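The two Active rules can be sketched as follows. Several things here are assumptions rather than details from the text: the TLW is taken to be the signed sum of the term weights over the n tokens, TLW = sum of Wi * l for i = 1..n (dividing by n would not change the sign test); the Serbian test is assumed to run before the English one and to map to the Primary and Secondary layers as in Table 2; the Hunspell lookup is replaced by plain Python sets; and the TF-IDF weights are passed in precomputed. All function and parameter names are hypothetical.

```python
def target_language_weight(weighted_tokens, dictionary):
    """Signed sum of TF-IDF weights: +W if the token is known, -W if not.

    weighted_tokens: list of (token, tfidf_weight) pairs for one Target.
    dictionary: set of known words, standing in for a Hunspell lookup.
    """
    total = 0.0
    for token, weight in weighted_tokens:
        sign = 1 if token in dictionary else -1
        total += sign * weight
    return total


def active_rules(weighted_tokens, serbian_words, english_words):
    """Apply the two Active rules, then fall back to the Reserve layer."""
    if target_language_weight(weighted_tokens, serbian_words) > 0:
        return "Primary"
    if target_language_weight(weighted_tokens, english_words) > 0:
        return "Secondary"
    return "Reserve"


# Toy data: the weights and word lists are made up for illustration.
tokens = [("toplo", 0.5), ("cinkovanje", 0.8), ("branding", 0.3)]
print(active_rules(tokens, {"toplo", "cinkovanje"}, {"branding"}))
```

A Target whose weight is exactly 0 for both languages, including one with an empty TST, falls through to the Reserve layer in this sketch, since the test requires a weight above 0.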