The Language Module

The Language Module uses a heuristic table of language-identifier tokens to recognize the most probable language of the linked page content. If no clear language identifier is recognized, it falls back on Hunspell dictionaries for Serbian (Goran Rakić, Igor Nestorović, 2013) and English (US) (Kevin, 2017), combined with a TF-IDF term-weighting scheme, to estimate the language.

The Language Module inherits the Frontier Layer Module base class and declares three layers:
• M01.L01: The Primary (the surface layer)
• M01.L02: The Secondary (the 1st layer)
• M01.L03: The Reserve (the 2nd layer)

Figure: Execution flow of the Frontier Ranking Algorithm – the crawler design with all four modules (SM-LTSD)

The Targets are distributed into layers by nine Passive rules followed by two Active rules, all of which test the Target Semantic Terms (TST) collection. The collection is built from both the URL path (Table 1) and the anchor text tokens, using several transformation and filtering operations:
• only the part of the URL relative to the domain name is considered
• tokens are extracted by a regex matching alphanumeric chunks
• purely numeric tokens are excluded
• numeric characters are removed from alphanumeric tokens, which leaves the letter-only parts as separate tokens (in Table 1, elp64fk128 yields elp and fk)
• all tokens are converted to lower case
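
The steps above can be sketched in Python as follows. This is a minimal illustration, not the module's actual code: the function names `extract_tokens` and `build_tst` are hypothetical, the two regular expressions are one possible reading of "alphanumeric chunks", and the sketch assumes that the query string and fragment count as part of the URL relative to the domain, and that digits inside a token act as separators (as the examples in Table 1 suggest).

```python
import re
from urllib.parse import urlsplit

# Assumed patterns: Unicode-aware runs of letters/digits and of letters only.
ALNUM_CHUNK = re.compile(r"[^\W_]+")
LETTER_RUN = re.compile(r"[^\W\d_]+")


def extract_tokens(text: str) -> list[str]:
    """Turn free text (URL path or anchor text) into lower-case tokens."""
    tokens = []
    for chunk in ALNUM_CHUNK.findall(text):
        if chunk.isdecimal():
            # Purely numeric tokens are excluded.
            continue
        # Digits inside a chunk are dropped; the letter runs around them
        # become separate tokens (assumption based on Table 1).
        tokens.extend(run.lower() for run in LETTER_RUN.findall(chunk))
    return tokens


def build_tst(url: str, anchor_text: str = "") -> list[str]:
    """Build the TST collection from a link URL and its anchor text."""
    parts = urlsplit(url)
    # Only the part relative to the domain is used; the host name is ignored.
    relative = " ".join((parts.path, parts.query, parts.fragment))
    return extract_tokens(relative) + extract_tokens(anchor_text)
```

Under these assumptions, `build_tst("pekarska-pec-sa-komorom-elp64fk128/")` yields the tokens pekarska, pec, sa, komorom, elp, fk, and duplicates are kept, as in the cinkovanje example of Table 1.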

URL | Tokens
pekarska-pec-sa-komorom-elp64fk128/ | pekarska, pec, sa, komorom, elp, fk
daikin-klima-uredjaj-inverter-ftxb35c-rxb35c-cena-akcija | daikin, klima, uredjaj, inverter, ftx, rxb, c, cena, akcija
sr/cinkovanje/sta-je-toplo-cinkovanje/#branding | sr, cinkovanje, sta, je, toplo, cinkovanje, branding

Table 1: Examples of token extraction from URL path.

The first two rules in the evaluation sequence check whether the TST collection contains an identification needle (Table 2) for the Serbian or the English language, and distribute the Targets to the corresponding layer.

Identifier | Serbian language | English language
Layer | Primary | Secondary
ISO 639 alpha-2 | sr | en
ISO 639 alpha-3 | srp | eng
ISO 3166-1 alpha-2 | rs | gb
ISO 3166-1 alpha-3 | srb | gbr
TLD | rs | uk
Serbian name | srpski | engleski
English name | Serbian | English
Additional | yu |

Table 2: Needles used to test the TST for Serbian and English language identities. For Serbian, the additional “yu” identifier (for Yugoslavia, the country of which the Republic of Serbia is a successor) was introduced after it was observed in crawl logs during debug sessions.

The next seven Passive rules send Targets to the Reserve layer if a needle of any other language covered by the module (Table 3) is detected.

Language | Language identification needles
German | de, ger, nem, deutsch, german, deu, nemacki, nemački
Russian | ru, rus, russky, ruski, russian
Italian | it, ita, italian, italiano, italijanski
French | fr, fra, french, français, francuski
Hungarian | hu, hun, hungarian, magyar, mađarski, madjarski
Spanish | es, esp, spanish, espanol, španski, spanski
Slovenian | si, sl, slovenian, slo, slovenački, slovenacki, slovenski

Table 3: Needles used to test TST for other seven language identities.
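
The nine Passive rules can be pictured as an ordered sequence of set-membership tests against the TST. The sketch below is an illustration only: the function name `apply_passive_rules`, the use of plain strings as layer names, and the abridged needle sets (a subset of Tables 2 and 3) are assumptions made for the example, not the module's real data structures.

```python
from typing import Iterable, Optional

# Abridged needle sets, taken from Tables 2 and 3 for illustration only.
SERBIAN_NEEDLES = {"sr", "srp", "srb", "srpski", "serbian", "yu"}
ENGLISH_NEEDLES = {"en", "eng", "gbr", "engleski", "english"}
OTHER_NEEDLES = {
    "German": {"de", "ger", "deutsch", "german"},
    "Russian": {"ru", "rus", "ruski", "russian"},
    "Italian": {"it", "ita", "italiano", "italian"},
    "French": {"fr", "fra", "french", "francuski"},
    "Hungarian": {"hu", "hun", "magyar", "hungarian"},
    "Spanish": {"es", "esp", "spanish", "spanski"},
    "Slovenian": {"si", "sl", "slo", "slovenian"},
}


def apply_passive_rules(tst: Iterable[str]) -> Optional[str]:
    """Return the layer chosen by the Passive rules, or None if none fires."""
    tokens = set(tst)
    # Rules 1 and 2: Serbian goes to Primary, English to Secondary.
    if tokens & SERBIAN_NEEDLES:
        return "Primary"
    if tokens & ENGLISH_NEEDLES:
        return "Secondary"
    # Rules 3 to 9: any other covered language sends the Target to Reserve.
    for needles in OTHER_NEEDLES.values():
        if tokens & needles:
            return "Reserve"
    # No needle found: the Target is left for the Active rules.
    return None
```

Because the rules are evaluated in order, a TST that contains both a Serbian and an English needle ends up in the Primary layer in this sketch.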

The last two rules are designed to resolve the Targets that lack an obvious language identification. In the context of the TF-IDF model, all Targets the module received for evaluation (including the ones resolved by the first two rules) are treated as documents. The Target language weight (TLW) is calculated as follows:

T_{LW} = \sum_{i=1}^{n} w_i \cdot l_i

where w_i is the TF-IDF weight of the i-th token, l_i is 1 if the token is recognized by the Hunspell dictionary and -1 otherwise, and n is the number of tokens in the TST. If the Target language weight is above 0, the Target is considered positive in the language estimation test. All Targets that remain unassigned are distributed to the Reserve layer.
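
The weight sum above can be sketched as follows. Several details are assumptions, since the text does not fix them: the TF-IDF variant (relative term frequency times a smoothed logarithmic IDF), the hypothetical function names, and the mapping of the two Active rules to layers (Serbian to Primary, English to Secondary, by analogy with the first two Passive rules). The dictionary check is passed in as a plain callable standing in for a Hunspell lookup; no real Hunspell binding is used here.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def tfidf_weights(documents: Sequence[Sequence[str]]) -> list[dict[str, float]]:
    """Per-document TF-IDF weights; every Target's TST is one document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Assumed variant: relative frequency times smoothed log IDF.
            term: (count / len(doc)) * math.log(1 + n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def target_language_weight(tst: Sequence[str],
                           weights: dict[str, float],
                           is_known: Callable[[str], bool]) -> float:
    """Sum the term weights, signed +1 for dictionary words and -1 otherwise."""
    return sum(weights[t] * (1 if is_known(t) else -1) for t in tst)


def apply_active_rules(tst: Sequence[str],
                       weights: dict[str, float],
                       serbian_known: Callable[[str], bool],
                       english_known: Callable[[str], bool]) -> str:
    """Assumed layer mapping for the two Active rules, Reserve as fallback."""
    if target_language_weight(tst, weights, serbian_known) > 0:
        return "Primary"
    if target_language_weight(tst, weights, english_known) > 0:
        return "Secondary"
    return "Reserve"
```

For a quick experiment, a Python set can play the role of the dictionary, for example `serbian_known = {"toplo", "cinkovanje"}.__contains__`. A Target whose weighted dictionary hits outweigh its misses then gets a positive weight, and one with only unknown tokens falls through to the Reserve layer.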
