Content of the article:
- Research related to the issue of parallel crawling
- Description of supported execution patterns
- Related configuration parameters
Regarding parallel execution aspects, we found RCrawler [1] to be the most valuable reference. It is a package for the R statistical computing environment [2] containing a general-purpose breadth-first inner crawler and content scraper equipped with duplicate [3] and near-duplicate [4] content detection. The MD5 algorithm [5] is utilized for detecting complete duplicates of page content, and SimHash [6], a perceptual hashing algorithm, for near-duplicate cases. Experimental evaluation of parallelism was performed in 1, 2, 4 and 8 thread scenarios, demonstrating a significant increase in crawled pages per time unit.
The work pool handler takes n URLs from the frontier to create the pool, where n is a function of the available CPU cores. The pool is processed in parallel: a separate thread is created for each of the URLs. Once all threads are completed, the page content is processed, and newly discovered URLs are normalized, filtered and fed to the frontier queue. A similar parallel execution pattern, in which the frontier ranking operation is performed after all threads of the current iteration are completed, is proposed by other focused designs [7], [8], [9]. We adopted this pattern as one of the two supported by our framework. The second runs each domain crawl in a separate thread, similarly to some other parallel crawl solutions [10], [11].
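To make the work-pool iteration concrete, the following is a minimal sketch in Python; the helper names fetch_page, extract_links, normalize and is_allowed are illustrative assumptions, not the framework's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_iteration(frontier, n, fetch_page, extract_links, normalize, is_allowed):
    """One work-pool iteration: take n URLs from the frontier, retrieve them in
    parallel, then process the results and refill the frontier sequentially."""
    pool = [frontier.pop(0) for _ in range(min(n, len(frontier)))]
    if not pool:
        return []

    # One thread per URL in the pool; wait until all of them are completed.
    with ThreadPoolExecutor(max_workers=len(pool)) as executor:
        pages = list(executor.map(fetch_page, pool))

    # Only after every retrieval thread has finished: extract, normalize,
    # filter and feed newly discovered URLs back to the frontier queue.
    for page in pages:
        if page is None:                      # failed or empty load
            continue
        for url in extract_links(page):
            url = normalize(url)
            if is_allowed(url) and url not in frontier:
                frontier.append(url)
    return pages
```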
Here we discuss the two parallel execution patterns supported by the proposed architecture, together with the related required features and design goals. The first pattern is implemented on the lower level and assumes parallel page loads within the same DLC, while the second provides the ability to execute complete DLC threads in parallel, within the JLC scope. Error handling (RF2 err) and crawler trap evasion (RF2 dlc-trap) were partially covered above; with respect to the scope of the paper, we will not go into further detail here. The algorithm's suitability for parallel execution (RF2 dlc-mt) and cross-domain balancing of bandwidth pull (RF2 jlc-mt) are highly inter-related topics that largely shaped the solution design. Parallel processing of inner pages means issuing parallel HTTP requests to a single domain, which conflicts with politeness rules and may lead to access denial by the hosting server or a DoS-like temporary access failure. The compromise comes with the implementation of the second requirement (RF2 jlc-mt), which will be discussed later.
| Group 2 | Optimal Resource Employment (DG-Rda) |
| --- | --- |
| RF2 dlc-mt | algorithm adaptation for parallelism |
| RF2 jlc-mt | making requests across multiple domains simultaneously to balance bandwidth pull |
| RF2 err | error and crash-like states handling |
| RF2 dlc-trap | crawler trap evasion |
Regarding the issue of algorithmic compatibility with parallel (multi-threaded) execution (i.e. utilization of all available CPU cores instead of just one), we would like to underline the general conflict between quality (DG-Qda) and resource employment maximization (DG-Rda) arising from parallelism. The vast majority of non-parallel focused crawlers take only the best-ranked link to be loaded in the next iteration. This implies that after every single page load, the frontier ranking is performed, together with all additional processing and calculation required by the particular ranking algorithm. In our view, the two most important questions of a parallel crawling implementation are: a) when is the frontier updated and the next ranking performed? and b) how many parallel threads should run? (Figure 1, n=?). If the frontier is updated at the end of each thread for the purpose of maximum execution time reduction, as shown on the flowchart (Figure 1), the crawler becomes unpredictable: two runs with the same setup would likely lead to significantly different results. A single occurrence of a slower response would drastically change the crawl path. In a scenario where the frontier links ranked first (L1) and third (L3) were loaded and processed before the link ranked second (L2), the ranking algorithm consistency is broken: a link found on the third crawled page could take a position in the next frontier iteration that would otherwise, had the crawl been performed in a single thread, be impossible. The problem may be formalized as a complementary relationship between two coefficients: Crawling Data Quality (CDqk) and Crawler Ranking Algorithm Corruption (CRAck). The CRA corruption emerges when links of the loaded target L3 are extracted and ranked before L2 is loaded and processed.
To avoid repetitive and disordered calls to the ranking procedure (CRA), the proposed execution pattern excludes it from the parallel retrieval and extraction procedure. The domain level crawling (Figure 2) iteration waits for all n page retrieval threads to complete before the CRA is performed, using data collected from all threads. This is a trade-off in favor of DG-Rda, against the crawling precision (DG-Qda): on one side, we make only one call to the CRA instead of n repeated calls at the end of each thread; on the other side, the collected crawling data is deliberately corrupted because intermediate re-composition of the Target Pool is prevented.
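A minimal sketch of this DLC iteration pattern follows; rank_frontier stands for the CRA, and crawl_iteration for the pool procedure sketched above (shown here with a simplified signature), so the ranking runs exactly once per iteration:

```python
def domain_level_crawl(seed_urls, n, iteration_limit, rank_frontier, crawl_iteration):
    """DLC loop in which the CRA (rank_frontier) is called once per iteration,
    only after all n retrieval threads of that iteration have completed."""
    frontier = list(seed_urls)
    for _ in range(iteration_limit):
        if not frontier:
            break
        # Single CRA call per iteration, using data collected from all
        # threads of the previous iteration -- not after each page load.
        frontier = rank_frontier(frontier)
        crawl_iteration(frontier, n)   # parallel retrieval and extraction
    return frontier
```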
To determine the optimum number of parallel threads an experimental evaluation is required, but before that, a range of values to evaluate has to be adopted. RCrawler [12] gives an unclear range, describing nmax as "similar" to the number of CPU cores; in their evaluation context nmax is 4. Such a designation of nmax is generally recommended for procedures that are entirely or almost entirely computational in nature; in our case, however, the execution is greatly influenced by I/O delays. Therefore, we will evaluate thread counts between 100% and 200% of the number of CPU cores, where the optimum should bring a greater rate of computing resource exploitation, resulting in a reduction of the execution time met with an acceptable CRAck. On the DLC flowchart (Figure 2), n is implicitly set equal to LT; while for analytical purposes these two may be considered separately, we failed to find any reasonable scenario in which an LT greater than n could be beneficial.
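As a simple illustration of the evaluated range, assuming the CPU core count reported by the standard library as the baseline:

```python
import os

cores = os.cpu_count() or 1
# Candidate thread counts between 100% and 200% of the CPU core count,
# since page retrieval is dominated by I/O delays rather than computation.
candidate_thread_counts = list(range(cores, 2 * cores + 1))
print(candidate_thread_counts)   # e.g. [4, 5, 6, 7, 8] on a 4-core machine
```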
The other parallel execution pattern supported by the architecture runs complete DLCs in separate threads, as shown on the flowchart (Figure 3). In a scenario where the crawl job contains "enough" domains (Sdc-min) and the DLC is single-threaded, i.e. n=1, only the technical characteristics of the platform running the crawler designate the optimal number of threads (TCopt), as no information from the job level (JLC) influences the CRA. For this pattern we define the optimal TC solely as the one that results in the highest reduction of the execution time. This approach aligns well with the interests of both DG-Rda and DG-Qda: no CRAck is produced on the DLC because of the single thread per target (n=1), the computing resources are employed at a greater rate, and domain request shuffling (RF2 jlc-mt) is implicitly respected.
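A minimal sketch of this second pattern follows; domain_level_crawl is assumed to be a callable running one complete DLC with n=1, and tc_opt corresponds to TCopt:

```python
from concurrent.futures import ThreadPoolExecutor

def job_level_crawl(domains, tc_opt, domain_level_crawl):
    """JLC pattern: each domain crawl (DLC) runs in its own thread with n=1,
    so no CRAck is produced and requests are spread across domains."""
    with ThreadPoolExecutor(max_workers=tc_opt) as executor:
        futures = {executor.submit(domain_level_crawl, domain, 1): domain
                   for domain in domains}
        # Collect the result of every completed domain crawl.
        return {domain: future.result() for future, domain in futures.items()}
```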
Both parallel execution patterns may be turned off independently by setting the n/LT and/or TCopt values to 1, or used together by assigning other values to the Crawler Job configuration parameters mentioned.
| Parameter | | Value | Unit |
| --- | --- | --- | --- |
| LT | Load Take per iteration | 1 | n of t |
| Imax | Iteration limit | 100 | n of i |
| TTmax | Total Targets limit | 10000 | n of t |
| PLmax | Total Page-loads limit | 500 | n of p |
| TDL-max | Domain crawl time limit | 15 | min |
| TLL-max | Single load time limit | 1 | min |
| TST-max | Targets count stability limit | 3 | n of i |
Table 1: A set of crawl size limitation parameters and values used for the preliminary survey. Abbreviations: n for number; t for target/link; i for DLC iteration; p for target loaded, i.e. web page.
The LT parameter controls the number of pages to be taken from the ranked frontier and loaded in the next iteration. A common feature of most crawlers described in the literature is to take only the first link, i.e. the one with the best relevance score. The iteration limit Imax and PLmax are simple termination criteria, whose values in the final design are to be adjusted according to the preliminary survey findings, to reduce unnecessary workload, as the most relevant pages on a web site should be crawled before reaching any value near these limits. The TTmax, if reached, would suggest the domain was mistakenly included in the research sample, considering the highest number of pages detected by the survey. The time limit for the DLC, TDL-max, has the role of protecting the JLC execution from otherwise undetected problems in the crawler source code, algorithm, research sample, or network/uplinks, or from a crawler trap. When triggered, both TTmax and TDL-max leave a warning message with all details in an XML log file for later investigation. The TST-max is a special termination rule used to prevent unnecessary workload during the preliminary survey.
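For illustration, the crawl size limitation parameters from Table 1 could be grouped into a single configuration object; the following sketch uses illustrative field names, not the framework's actual configuration API:

```python
from dataclasses import dataclass

@dataclass
class CrawlLimits:
    """Crawl size limitation parameters with the Table 1 (preliminary survey) values."""
    LT: int = 1            # Load Take per iteration (targets)
    I_max: int = 100       # Iteration limit (DLC iterations)
    TT_max: int = 10000    # Total Targets limit (targets)
    PL_max: int = 500      # Total Page-loads limit (pages)
    TDL_max: int = 15      # Domain crawl time limit (minutes)
    TLL_max: int = 1       # Single load time limit (minutes)
    TST_max: int = 3       # Targets count stability limit (iterations)

limits = CrawlLimits()
```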
- [1] S. Khalil and M. Fakir, "RCrawler: An R package for parallel web crawling and scraping," SoftwareX, vol. 6, pp. 98–106, 2017.
- [2] R Core Team, "R: A Language and Environment for Statistical Computing," R Foundation for Statistical Computing, Vienna, Austria, 2017.
- [3] R. Rivest, "RFC 1321: The MD5 Message-Digest Algorithm," IETF, 1992.
- [4] A. Recitation, "Rolling Hash (Rabin-Karp Algorithm) Notes on Strings 'Numerical' Example," 2011.
- [5] R. Rivest, "RFC 1321: The MD5 Message-Digest Algorithm," IETF, 1992.
- [6] A. Recitation, "Rolling Hash (Rabin-Karp Algorithm) Notes on Strings 'Numerical' Example," 2011.
- [7] D. Yadav, A. K. Sharma, and J. P. Gupta, "Parallel crawler architecture and web page change detection," WSEAS Trans. Comput., vol. 7, no. 7, pp. 929–940, 2008.
- [8] R. Nath and N. Kumar, "A Novel Parallel Domain Focused Crawler for Reduction in Load on the Network," Int. J. Comput. Eng. Res., vol. 2, no. 7, pp. 2250–3005, 2012.
- [9] N. Pappas, G. Katsimpras, and E. Stamatatos, "An agent-based focused crawling framework for topic- and genre-related web document discovery," Proc. Int. Conf. Tools with Artif. Intell. (ICTAI), vol. 1, pp. 508–515, 2012.
- [10] V. Papavassiliou and P. Prokopidis, "A modular open-source focused crawler for mining monolingual and bilingual corpora from the web," pp. 43–51, 2013.
- [11] P. Sirohi, A. Goel, and N. Singhal, "Architecture of a Parallel Focused Crawler for Online Social Networks," vol. 8491, pp. 3–6, 2013.
- [12] S. Khalil and M. Fakir, "RCrawler: An R package for parallel web crawling and scraping," SoftwareX, vol. 6, pp. 98–106, 2017.