Crawl Job Execution

The Crawl Job consists of the web domain list and the configuration parameters. The result of the job execution, the Result Set, is fed into the index database for later use by the Company Semantic Profile (CSP) construction and enrichment procedures (Figure 1). The Resource Employment features (Table 2) relate to two different levels of the architecture (Figure 4): the Job Level Context (JLC) and the Domain Level Crawl (DLC).
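As an illustration only, the job structure described above can be sketched as a pair of plain data containers and a job-level loop that delegates to a per-domain crawl. All names here (`CrawlJob`, `ResultSet`, `run_job`, `crawl_domain`, and the `page_limit` parameter) are hypothetical and are not taken from the system itself; the sketch only assumes that a job holds a domain list plus configuration and that each domain crawl yields a collection of pages.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CrawlJob:
    """Hypothetical container: a web domain list plus configuration parameters."""
    domains: list[str]
    config: dict[str, object] = field(default_factory=dict)


@dataclass
class ResultSet:
    """Hypothetical container for the job output, keyed by domain."""
    pages_by_domain: dict[str, list[str]] = field(default_factory=dict)


def run_job(
    job: CrawlJob,
    crawl_domain: Callable[[str, dict[str, object]], list[str]],
) -> ResultSet:
    """Job-level loop: run one domain-level crawl per domain and collect results.

    `crawl_domain` stands in for the domain-level crawl; it is an assumed
    callable that returns the pages collected for a single domain.
    """
    result = ResultSet()
    for domain in job.domains:
        result.pages_by_domain[domain] = crawl_domain(domain, job.config)
    return result
```

In this sketch the outer loop plays the role of the job-level context and the injected callable plays the role of the domain-level crawl, so the two levels can be tested separately. Handing the Result Set to the index database is deliberately left out.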

Web Crawlers – Literature review

The greatest algorithmic challenges in web crawling are estimating the relevance of loaded pages and of discovered links. Both usually play a crucial role in frontier scheduling. The earliest relevant works on page importance ranking are: • PageRank [1], which defines web page relevance as a function of the link-reference page relationship, where the sum of…