The imbNLP.Data.evaluate namespace contains two content language evaluation mechanisms. Both use Hunspell dictionaries to assess the set of highest-frequency terms in the evaluated content:
- textEvaluator
  - evaluates a string input against two specified languages (dictionaries) and returns a textEvaluation object with scores for A, B and not-A-B language association
- multiLanguageEvaluator
  - more sophisticated language detection that uses an unlimited number of dictionaries and discriminatory tactics, designed for precision when linguistically close languages are involved
Description of the multiLanguageEvaluator algorithm
Page language detection is performed by a multi-language detection algorithm that uses Hunspell dictionaries for English (Kevin, 2017), Serbian (Goran Rakić, Igor Nestorović, 2013), Slovenian (Erjavec, Košir, & Peterlin, 2006), Russian (Lebedev, 2008), German (Baumann, 2017) and Italian (Volta & Mura, 2007) to determine an unambiguous language association for the tested words.
| Parameter | Description | Initial | Final |
|---|---|---|---|
| TLEN | Token minimum length (characters) | 4 | 4 |
| TVT | Single language match token test limit | 10 | 50 |
| TTL | Total token test limit | 30 | 150 |
Table 15: Multi-language detection algorithm configuration parameters: the initial values and the final values updated after the procedure results evaluation.
Before the language detection iteration loop, a number of content preprocessing operations are performed:
- Visible textual content is extracted from the HTML DOM
- The text is split into tokens by word separators (tabs, spaces, new lines and punctuation characters)
- HTML entities are decoded into Unicode characters
- Numerical tokens are excluded
- Clean, letter-only words are extracted from alphanumerical tokens
- Tokens shorter than TLEN are excluded
- Tokens found in the domain name are excluded
- All tokens are transformed to lower case
- A frequency table is created and a sorted set of distinct words is extracted
- The top TTL words from the sorted set are fed into WSET to be tested against the dictionaries in the language detection iteration
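The preprocessing steps above can be sketched roughly as follows. This is a minimal Python illustration, not the library's implementation: the function name `build_wset` is hypothetical, the input is assumed to be visible text already extracted from the DOM, entities are decoded before splitting (so that punctuation splitting does not break entities such as `&amp;`), and "found in the domain name" is interpreted as a substring check.

```python
import html
import re
from collections import Counter

TLEN = 4    # token minimum length (characters), Table 15
TTL = 150   # total token test limit, final value from Table 15


def build_wset(visible_text, domain_name, tlen=TLEN, ttl=TTL):
    """Return up to `ttl` distinct words, most frequent first (hypothetical helper)."""
    # Decode HTML entities into Unicode characters (done first; an assumption).
    text = html.unescape(visible_text)
    # Split on whitespace and punctuation (any run of non-word characters).
    raw_tokens = re.split(r"[\W_]+", text)
    domain = domain_name.lower()

    words = []
    for token in raw_tokens:
        if not token or token.isdigit():
            continue  # drop empty and purely numerical tokens
        # Keep only the letters of alphanumerical tokens, lower-cased.
        word = "".join(ch for ch in token if ch.isalpha()).lower()
        if len(word) < tlen:
            continue  # shorter than TLEN
        if word in domain:
            continue  # token occurs in the domain name (substring check, an assumption)
        words.append(word)

    # Frequency table, then the top-`ttl` distinct words by frequency.
    frequency = Counter(words)
    return [word for word, _count in frequency.most_common(ttl)]
```

Calling `build_wset(page_text, "example.com")` would yield the WSET list consumed by the detection iteration.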
| Language evaluation result set | Description |
|---|---|
| WSLM | Single language match |
| WMLM | Multiple language match |
| WNLM | No language match |
| WSET | Words to be tested |
Table 16: Collections populated during language detection iterations
The detection iteration starts with the most frequent word in WSET and continues through the set until the termination criterion is met or the complete set has been evaluated. For each word it runs an inner iteration in which the word is tested for a match against each of the dictionaries. On the first match, the word is temporarily associated with the corresponding language identification tag. If a second match occurs, the inner iteration is broken, the association tag is canceled and the word is assigned to WMLM. If the inner iteration finishes without any match, the word is assigned to WNLM; otherwise it is assigned to WSLM, keeping its language identification tag. The detection iteration terminates when the number of words in WSLM reaches the TVT limit.
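A minimal Python sketch of this iteration is given below. It is an illustration under stated assumptions rather than the library's code: the function name `run_detection` is hypothetical, and each Hunspell dictionary is replaced by a plain set of known words, so `word in known_words` stands in for a Hunspell spell check.

```python
def run_detection(wset, dictionaries, tvt=50):
    """Classify words from `wset` into WSLM, WMLM and WNLM.

    `dictionaries` maps a language tag to a set of known words
    (a stand-in for Hunspell lookups, an assumption of this sketch).
    """
    wslm = {}   # single language match: word -> language tag
    wmlm = []   # multiple language match
    wnlm = []   # no language match

    for word in wset:  # wset is ordered by descending frequency
        tag = None
        multiple = False
        for language, known_words in dictionaries.items():
            if word in known_words:
                if tag is None:
                    tag = language   # first match: temporary association
                else:
                    multiple = True  # second match: cancel the association
                    break
        if multiple:
            wmlm.append(word)
        elif tag is None:
            wnlm.append(word)
        else:
            wslm[word] = tag
            if len(wslm) >= tvt:
                break  # termination criterion: WSLM reached the TVT limit

    return wslm, wmlm, wnlm
```

Breaking out of the inner loop on the second match means a word shared by closely related languages costs at most two successful lookups, and it never contributes to any language's count.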
In the final stage, the language whose identification tag is attached to the highest number of words in WSLM is declared the language of the page. Initially, the configuration parameters were set to minimize the resource footprint. However, a review of the procedure execution logs exposed a significant number of false negatives, caused by content quality issues, the widespread use of original spelling for adopted technical terms of foreign origin, and popular product branding practices in the domain. To ensure high reliability of the procedure, the final configuration granted a five times greater test scope, at the cost of slower crawler iteration execution.
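The final stage amounts to counting language tags in WSLM and picking the most frequent one. The sketch below assumes WSLM is a mapping from word to language tag; the function name `declare_language`, the `None` result for an empty WSLM and the tie-breaking behavior (the first tag encountered wins) are assumptions, since the source does not specify them.

```python
from collections import Counter


def declare_language(wslm):
    """Return the language tag with the most words in WSLM, or None if WSLM is empty."""
    if not wslm:
        return None  # no single-language matches; behavior here is an assumption
    tag_counts = Counter(wslm.values())
    # most_common(1) returns a list with one (tag, count) pair.
    return tag_counts.most_common(1)[0][0]
```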