The features of the Frontier Optimization group (List 1), embedded in the Path Resolver component, ensure that only relevant and unique links are fed into the frontier. The Path Resolver component relies on the Top Level Domain Index and the Page Source Hash Index. Its measures for preventing the load of duplicate and unwanted content are applied to all newly extracted links, which undergo rule-based filtration (FO-flt), path resolution (FO-res), and subsequent unification of multiple URLs predicted to lead to the same content (FO-uni). Two or more paths are equalized, i.e. considered the same, in cases where they:
- differ only in whether the default sub-domain (www) is specified, like: http://www.ibc.rs = http://ibc.rs
- differ only in the HTTP or HTTPS protocol prefix, like: https://www.unimet.rs/ = http://www.unimet.rs
- point to a directory with or without the default page filename, like: /en/index.html and /en/
- are the same after all anchors are removed from the URL, like: /sr/index.html#site and /sr/index.html
- are absolute and relative paths that point to the same place once resolved, like: /en/products/../index.html and /en/index.html, or /en/about/./contact.html and /en/about/contact.html, or http://www.eef.rs/sr/contact.html and contact.html, where the page containing the second path is located inside the /sr directory
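The equalization rules listed above can be illustrated with a minimal Python sketch. It is not the Path Resolver implementation: the function name `canonical_key`, the `DEFAULT_PAGES` set, and the decision to leave the query string in the key are assumptions made for illustration only. Two links are treated as equal when they produce the same key.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

# Assumed default page filenames; the text above only shows index.html.
DEFAULT_PAGES = {"index.html", "index.htm", "index.php"}


def canonical_key(link: str, base_url: str) -> str:
    """Return a comparison key for a link extracted from the page at base_url."""
    # Resolve relative links against the page they were extracted from.
    parts = urlsplit(urljoin(base_url, link))

    # The scheme is left out of the key, so HTTP and HTTPS variants match.
    # The default sub-domain (www) is dropped from the host.
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    # Collapse "." and ".." segments; this also drops a trailing slash.
    path = "/" + posixpath.normpath(parts.path or "/").lstrip("/")

    # Treat a default page filename as its parent directory.
    head, tail = posixpath.split(path)
    if tail.lower() in DEFAULT_PAGES:
        path = head

    # The fragment (anchor) is never included in the key.
    query = "?" + parts.query if parts.query else ""
    return host + path + query


base = "http://www.eef.rs/sr/index.html"
same = canonical_key("contact.html", base) == canonical_key(
    "http://www.eef.rs/sr/contact.html", base
)
```

Because the trailing slash is dropped, this sketch also equalizes /en and /en/, which is a simplification and may be broader than the rules above intend.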
- FO-uni: unification of obvious URL variants
- FO-res: URL resolution and canonical transformation
- FO-dpl: detection of pages with duplicate content
- FO-flt: file extension and MIME type filter
- FO-mda: multi-domain site architecture compatibility
List 1: Features embedded in the Path Resolver component that optimize frontier links
The Path Resolver component has to support all valid and obviously invalid forms of both absolute and relative URLs, with respect to the location of the page from which the link was extracted and the DLC contextual information. Obviously invalid URL forms are those that can be fixed using the DLC contextual information, such as double or triple back-slashes in a directory path. To seamlessly resolve paths on sites with a multi-domain architecture, the component has to recognize host name variations, using the Top Level Domain index table to equalize obvious domain pairs like termomont.co.rs and termomont.rs, or euromodul.rs and euromodul.hr. The URL collection is also filtered to remove special link types, such as mailto links, JavaScript calls, and on-page anchors, as well as links whose filename extensions match the predefined blacklist. When all preventive measures fail and duplicate content is loaded, the Duplication Checker component excludes all loaded data from the further processing chain and disposes of the associated resources. Duplicate content evaluation uses the domain-level Page Source Hash index to compare newly loaded pages with those already crawled.
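The filtering, domain equalization, and duplicate check described above could look roughly like the following Python sketch. All names and table contents here are hypothetical: the blacklist entries, the `KNOWN_SUFFIXES` tuple standing in for the Top Level Domain index, and the use of SHA-256 for the Page Source Hash index are assumptions, since the text does not specify them.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit

# Hypothetical table contents; the text does not list the real ones.
BLOCKED_SCHEMES = {"mailto", "javascript"}
BLOCKED_EXTENSIONS = {".jpg", ".png", ".pdf", ".zip"}
KNOWN_SUFFIXES = ("co.rs", "rs", "hr")  # stand-in for the Top Level Domain index


def is_crawlable(link: str) -> bool:
    """Reject special link types and links with blacklisted extensions."""
    link = link.strip()
    if not link or link.startswith("#"):
        return False  # empty link or on-page anchor
    parts = urlsplit(link)
    if parts.scheme in BLOCKED_SCHEMES:  # urlsplit lowercases the scheme
        return False
    extension = posixpath.splitext(parts.path)[1].lower()
    return extension not in BLOCKED_EXTENSIONS


def site_name(host: str) -> str:
    """Strip 'www.' and the longest known suffix so domain variants compare equal."""
    host = host.lower()
    if host.startswith("www."):
        host = host[4:]
    for suffix in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        if host.endswith("." + suffix):
            return host[: -len(suffix) - 1]
    return host


class DuplicationChecker:
    """Domain-level index of page source hashes (SHA-256 is an assumption)."""

    def __init__(self) -> None:
        self._seen = set()

    def is_duplicate(self, page_source: str) -> bool:
        digest = hashlib.sha256(page_source.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```

In this sketch a crawler would call `is_crawlable` on each extracted link, compare `site_name` values to decide whether two hosts belong to the same site, and keep one `DuplicationChecker` instance per crawled domain. An exact hash only catches byte-identical page sources; near-duplicates would need a different technique.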