Web Crawlers – Literature review

The greatest algorithmic challenges of web crawling are estimating the relevance of a loaded page and of the links discovered on it. Both usually play a crucial role in frontier scheduling.

The earliest relevant works on page importance ranking are:
• the PageRank [1]

which defines web page relevance as a function of the link-reference relationships between pages, with the sum of all page ranks equal to one (a minimal sketch follows after this list), and

• the Hyperlink-induced Topic Search (HITS) [2]

which defines web page relevance through separate functions for two distinct page roles on the early Web: hubs (link directories) and authorities (pages containing the author's own content), where hub relevance is a function of the out-bound link count and authority relevance is a function of the in-bound link count.
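
As a point of reference, the following is a minimal power-iteration sketch of the PageRank idea. It is our illustration only: the damping factor and the handling of dangling pages are conventional simplifications, not the exact formulation of [1].

def pagerank(links, damping=0.85, iterations=50):
    """Minimal power-iteration PageRank sketch.

    `links` maps every page to the list of pages it links to; every link
    target is assumed to appear as a key as well. Ranks are normalized so
    that they (approximately) sum to one, as in the original definition.
    """
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # dangling page; a full implementation would redistribute its rank
            share = rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank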

The general-purpose link ranking algorithms described above could be applied in our scenario due to their inherent topical neutrality, but such complete linguistic blindness would reduce crawl precision to the share of relevant pages in the complete web site content. With an equal number of relevant and irrelevant pages, equally linked from the landing page, both the PageRank and HITS algorithms would hardly outperform a random or Breadth-First driven crawler. Nevertheless, these baseline algorithms are covered here for their role in benchmark evaluation and their historical importance to the research field. While general-purpose crawlers use non-topical factors to assess link relevance, focused crawlers are driven by a topical preference given in the form of query terms, a semantic query, or other data structures.

A similarity quantification method is a key enabler in the development of focused crawlers based on non-binary relevance evaluation. The Vector Space Model (VSM) [3] is the fundamental work in this context. It defines documents and queries as vectors of identifiers in an n-dimensional space, where each algebraic dimension corresponds to a distinct term, which has a non-zero weight when it is part of the considered document. The most common weighting scheme applied with the VSM is Term Frequency – Inverse Document Frequency (TF-IDF). It combines a normalized term frequency with a statistic describing the information content of the term in the context of the scoped document collection: the Inverse Document Frequency [4]. While many variations of TF-IDF and of document-query similarity computation have emerged over the last decades, at several points of the solution we use the most widely adopted ones:

w_{di} = tf_i \cdot idf_i = \frac{F_i}{F_{max}} \cdot \log\left(\frac{N}{N_i}\right) \qquad (1)

where w_{di} is the weight of term i in document d, F_i is the frequency of term i, F_{max} is the frequency of the most frequent term in the document, N is the total number of documents, and N_i is the number of documents containing term i.

Sim_{(d,t)} = \frac{\sum_{i=1}^{n} w_{di} \, w_{ti}}{\sqrt{\sum_{i=1}^{n} w_{di}^2 \cdot \sum_{j=1}^{n} w_{tj}^2}} \qquad (2)

where Sim_{(d,t)} is the cosine similarity between document d and topic t, w_{di} and w_{ti} are the weights of term i in the context of the document and the topic respectively, and n is the number of terms common to document d and topic t.
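
For concreteness, a minimal Python sketch of weighting (1) and similarity (2) could look as follows. It assumes already tokenized documents; the function and variable names are ours, chosen for illustration only.

import math
from collections import Counter

def tf_idf_weights(doc_tokens, corpus_tokens):
    """Term weights w_di = (F_i / F_max) * log(N / N_i), following formula (1)."""
    freq = Counter(doc_tokens)
    f_max = max(freq.values())
    n_docs = len(corpus_tokens)
    weights = {}
    for term, f_i in freq.items():
        # the document itself is assumed to be part of corpus_tokens, so n_i >= 1
        n_i = sum(1 for doc in corpus_tokens if term in doc)
        weights[term] = (f_i / f_max) * math.log(n_docs / n_i)
    return weights

def cosine_similarity(doc_weights, topic_weights):
    """Cosine similarity of formula (2); the sums run over the n common terms."""
    common = set(doc_weights) & set(topic_weights)
    if not common:
        return 0.0
    numerator = sum(doc_weights[t] * topic_weights[t] for t in common)
    denominator = math.sqrt(sum(doc_weights[t] ** 2 for t in common) *
                            sum(topic_weights[t] ** 2 for t in common))
    return numerator / denominator if denominator else 0.0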

The first focused crawler, Fish Search [5], combined dynamic crawl depth limitation with binary, regular-expression-based relevance evaluation of link anchor text to heuristically drive the focused crawl. One of its most popular derivations, Shark-Search [6], introduced a decimal similarity score, a decaying score distribution from parent to child nodes, and contextual enrichment with link meta-information. The score is calculated as cosine similarity (VSM) and propagated down the descendant chain with a predefined decay factor d. Fish Search is unable to cross an irrelevancy tunnel, while the success of Shark-Search depends on the chosen decay factor and the particular site structure. To avoid such dead ends, other factors were introduced, such as assessing the semantic role of a link through page content decomposition. One of the most popular page segmentation algorithms is Vision-based Page Segmentation (VIPS) [7]. VIPS became the benchmark in the field, but visual analysis, which includes content rendering, does not fit well with the heuristic premise of focused crawling optimization; therefore, lightweight approaches have dominated the last decade. In one such approach [8], decomposition is performed by simply stacking text tokens until a sentence period or one of the predefined HTML tags is reached. After a document is segmented into a set of sub-documents, these are evaluated and fetched as separate nodes. Another solution, the Content Block + Best First [9] crawler, uses decomposition based on HTML markup cues to split the page into semantically homogeneous blocks. BlockWeb [10] builds a directed graph representation of content blocks, combining DOM transformation with visual analysis of the page rendered by Mozilla Gecko.
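
To illustrate the decay mechanism described above, the sketch below propagates a parent's relevance down to its child links. It is a simplification of Shark-Search rather than the authors' exact formulation; the parameters decay and gamma are illustrative.

def shark_child_score(parent_similarity, parent_inherited, anchor_similarity,
                      decay=0.5, gamma=0.8):
    """Simplified Shark-Search-style potential score for a child link.

    A child inherits the parent's own similarity when the parent was relevant,
    otherwise the parent's already decayed score; the inherited part is
    attenuated by the decay factor on every hop, so scores fade along an
    irrelevant descendant chain. The link's own anchor-text evidence is then
    blended with the inherited context.
    """
    inherited = decay * (parent_similarity if parent_similarity > 0 else parent_inherited)
    potential = gamma * anchor_similarity + (1 - gamma) * inherited
    return potential, inherited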

Besides page structural awareness, many authors took a broader scope into consideration by introducing an internal PageRank as one of the factors. An experimental evaluation [11] of popular frontier relevance functions found high efficiency of: a) PageRank, b) link-to-query similarity (anchor text and URL tokens), and c) link distance from a known relevant page. Thus, mixing query-based and general relevance factors seems a natural evolutionary step. The Fractional PageRank (FPR) [12] crawler family combines several VSM similarity-based assessments with a modified PageRank algorithm that prioritizes frontier links by considering only some of them and by skipping relevance score propagation toward already loaded pages. Another algorithm with a similar blend is a focused HITS derivation, PCTHITS [13], which mixes three score factors: Web Page Topic Similarity (VSM), Common Reference Degree (cross-links), and Trust Degree. The Learning Crawler [14] also combines HITS with the VSM, but instead of query terms it is driven by exemplary pages representing a single topical category. The pages are crawled to construct a Term Frequency – Inverse Document Frequency Definition Semantic table. The table plays the role of the driving query in each subsequent focused crawl, returning a new set of the most relevant hubs and authorities. In contrast to such a one-time learning approach, the Treasure Graph (T-Graph) [15] dynamically constructs a topical directed graph in each iteration. The graph drives the frontier ranking through VSM cosine similarity comparison against anchor text (and related content). Besides generic data structures such as string arrays, tables, and directed graphs, custom data structures are also used to drive crawlers built for specific application domains. The Focused Crawler for Events (FCE) [16] collects web pages related to an event of interest, represented as an Event Model data structure with topic (a set of keywords), date, and location properties. Page relevance is a function of three criteria: matching the topic, having a publication date close to the event date, and matching the location attributes.

The most significant limitation of the algorithms discussed so far comes from literal token string comparison for topical matching. If two collections of terms have no exact matches, it still does not mean they are semantically unrelated, especially in applications involving languages with complex morphology. The Semantic Similarity Retrieval Model (SSRM) [17] was proposed as a semantically enhanced alternative to the VSM, providing a better approximation of human topical perception. After terms are expanded with synonyms derived from a lexical ontology, their weights are calculated with the typical TF-IDF formula (1). In the first step, the weight q_i of each query term i is adjusted based on its relationships with other semantically similar terms j within the same query vector:

q_i = q_i + \sum_{j \neq i,\; sim(i,j) \geq t} q_j \cdot sim(i,j) \qquad (3)

where t is a predefined threshold (t = 0.8). Multiple related terms in the same query reinforce each other, while the weights of non-similar terms remain unchanged. For short queries specifying only a few terms, the weights are initialized to 1 and adjusted according to the formula above. The second step is query term expansion by up to three levels of the ontological hierarchy, in both the hyponym and the hypernym direction. The weight of a newly added term is computed as:

q'_i = \sum_{j \neq i,\; sim(i,j) \geq t} \frac{1}{n} \, q_j \cdot sim(i,j) \qquad (4)

where n is the number of hyponyms of each expanded term j. If a term returned by the expansion was already in the query collection, the result of (4) is added to its existing weight. The document similarity between query q and document d is calculated as:

Sim_{SSRM}(q,d) = \frac{\sum_i \sum_j q_i \, d_j \, sim(i,j)}{\sum_i \sum_j q_i \, d_j} \qquad (5)

where i and j are terms in the query and in the document, respectively. Query term weights q_i are expanded and re-weighted according to the steps described above, while document term weights d_j are defined by TF-IDF (1). The semantic similarity (or relevance) sim(i,j) between two terms may be computed with several methods, among which the edge counting (inverse proportion of ontological distance) and information content approaches are recommended. The SSRM and other semantic information retrieval methods have been adopted by many web crawling algorithms, which are then called semantic crawlers. Their typical features are term expansion using a general or domain-specific hierarchical taxonomy (ontology) and the application of semantic similarity as a frontier factor. One such crawler is the multi-threaded Semantic Focused Crawler (SFC) [18]. The framework introduces the Dynamic Semantic Relevance (DSR) to assess crawled pages under the assumption that a highly relevant page contains relevant links. The driving query is expanded with all parent nodes and a few levels of child nodes of the semantic lexicon, and the edge counting approach is adopted for the semantic relevance. The DSR is defined as the sum of products between the page concept frequencies (FC_j) and their semantic relevance. Another study [19] pointed out that algorithms using only semantic similarity tend to assign a non-zero score to completely irrelevant links. To solve this issue, an improved solution was forged: the Semantic Similarity Vector Space Model (SSVSM), which combines the SSRM [17] with the VSM. The link priority score is computed by linear integration of full-text-to-topic similarity and anchor-text-to-topic similarity, both computed with the VSM cosine formula (2).
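
A compact sketch of the SSRM steps (3) and (5) might look as follows; the ontology-based expansion of step (4) is omitted for brevity. Here term_sim stands for any term-to-term semantic similarity function (for example, edge counting over an ontology); it and the remaining names are our own illustrative choices, not an interface prescribed by [17].

def reinforce_query_weights(query_weights, term_sim, threshold=0.8):
    """Step one, formula (3): mutually similar query terms reinforce each other."""
    adjusted = dict(query_weights)
    for i, q_i in query_weights.items():
        boost = sum(q_j * term_sim(i, j)
                    for j, q_j in query_weights.items()
                    if j != i and term_sim(i, j) >= threshold)
        adjusted[i] = q_i + boost
    return adjusted

def ssrm_similarity(query_weights, doc_weights, term_sim):
    """Formula (5): weighted semantic similarity over all query/document term pairs."""
    numerator = sum(q_i * d_j * term_sim(i, j)
                    for i, q_i in query_weights.items()
                    for j, d_j in doc_weights.items())
    denominator = sum(q_i * d_j
                      for i, q_i in query_weights.items()
                      for j, d_j in doc_weights.items())
    return numerator / denominator if denominator else 0.0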

These are the most relevant works in each of the focused crawler subgroups [20]: classic focused crawlers, i.e., variations of Best-First based on page content, anchor text, or both; learning crawlers, which use learning algorithms to navigate the crawl; and semantic crawlers, which work with term synonym sets or a semantic lexicon (ontology).


[1] T.H. Haveliwala, Efficient Computation of PageRank, Stanford Univ. (1999). doi:10.1145/971701.50239.
[2] J.M. Kleinberg, Authoritative sources in a hyperlinked environment, J. ACM. 46 (1999) 604–632. doi:10.1145/324133.324140.
[3] G. Salton, A. Wong, C. Yang S., A vector space model for automatic indexing, Commun. ACM. 18 (1975) 613–620. doi:10.1145/361219.361220.
[4] K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, J. Doc. 28 (1972) 11–21. doi:10.1108/eb026526.
[5] P. De Bra, G.-J. Houben, Y. Kornatzky, R. Post, Information Retrieval in Distributed Hypertexts, in: Intell. Multimed. Inf. Retr. Syst. Manag. – Vol. 1, Le Centre de Hautes Etudes Internationales d'Informatique Documentaire, Paris, France, 1994: pp. 481–491. http://dl.acm.org/citation.cfm?id=2856823.2856864.
[6] M. Hersovici, M. Jacovi, Y.S. Maarek, D. Pelleg, M. Shtalhaim, S. Ur, The shark-search algorithm. An application: tailored Web site mapping, Comput. Networks ISDN Syst. 30 (1998) 317–326. doi:10.1016/S0169-7552(98)00038-5.
[7] D. Cai, S. Yu, J.R. Wen, W.Y. Ma, VIPS: a vision-based page segmentation algorithm, Microsoft Res. Asia, Beijing (2003) 1–29. MSR-TR-2003-79.
[8] J. Yang, J. Kang, J. Choi, A Focused Crawler with Document Segmentation, Text. (2005) 94–101. doi:10.1007/11508069_13.
[9] N. Luo, W. Zuo, F. Yuan, C. Zhang, A New Method for Focused Crawler Cross Tunnel, First Int. Conf. Rough Sets Knowl. Technol. (2006) 632–637. doi:10.1007/11795131_92.
[10] E. Bruno, N. Faessel, H. Glotin, J. Le Maitre, M. Scholl, J. Le, M. Michel, Indexing and querying segmented web pages: The BlockWeb Model, World Wide Web. 14 (2011) 623–649. doi:10.1007/s11280-011-0124-6.
[11] J. Cho, H. Garcia-Molina, L. Page, Reprint of: Efficient crawling through URL ordering, Comput. Networks. 56 (2012) 3849–3858. doi:10.1016/j.comnet.2012.10.006.
[12] M.H. Alam, J.W. Ha, S.K. Lee, Novel approaches to crawling important pages early, Knowl. Inf. Syst. 33 (2012) 707–734. doi:10.1007/s10115-012-0535-4.
[13] Y. Du, X. Tian, W. Liu, M. Wang, W. Song, Y. Fan, A novel page ranking algorithm based on triadic closure and hyperlink-induced topic search, Intell. Data Anal. 19 (2015) 1131–1149. doi:10.3233/IDA-150762.
[14] M. Kumar, R. Vig, Learnable Focused Meta Crawling Through Web, Procedia Technol. 6 (2012) 606–611. doi:10.1016/j.protcy.2012.10.073.
[15] A. Seyfi, A. Patel, A focused crawler combinatory link and content model based on T-Graph principles, Comput. Stand. Interfaces 43 (2016) 1–11. doi:10.1016/j.csi.2015.07.001.
[16] M.M.G. Farag, S. Lee, E.A. Fox, Focused crawler for events, Int. J. Digit. Libr. (2017) 1–17. doi:10.1007/s00799-016-0207-1.
[17] A. Hliaoutakis, G. Varelas, E. Voutsakis, E.G.M. Petrakis, E. Milios, Information Retrieval by Semantic Similarity, Int. J. Semant. Web Inf. Syst. 2 (2006) 55–73. doi:10.4018/jswis.2006070104.
[18] P. Bedi, A. Thukral, H. Banati, A. Behl, V. Mendiratta, A Multi-Threaded Semantic Focused Crawler, J. Comput. Sci. Technol. 27 (2012) 1233–1242. doi:10.1007/s11390-012-1299-8.
[19] Y. Du, W. Liu, X. Lv, G. Peng, An improved focused crawler based on Semantic Similarity Vector Space Model, Appl. Soft Comput. J. 36 (2015) 392–407. doi:10.1016/j.asoc.2015.07.026.
[20] S. Batsakis, E.G.M. Petrakis, E. Milios, Improving the performance of focused web crawlers, Data Knowl. Eng. 68 (2009) 1001–1013. doi:10.1016/j.datak.2009.04.002.
