It is designed with premise that URL segments from left to right may be interpreted as parent-child hierarchy structure in order to construct factors of FRA influence based on the following assumptions, given in order of dominance:
- link-path node closer to the root is more relevant than the one positioned deeper in the graph
- link-path leaf is more relevant then sibling link path node
- link-path node with greater number of child nodes is more relevant then the one with less.
where the link-path node represents a path folder, link-path leaf represents actual Target having filename in path or being without any minor URL segments (file path, query pairs) targeting a folder’s default index.
The Structure Module inherits the Frontier Layer Module base class and declares three layers:
- M03.L01: The Primary (the surface layer)
- M03.L02: The Secondary (the 1st layer)
- M03.L03: The Reserve (the 2nd layer)
Quantification is facilitated by construction of link-path structural directed graph from a collection of URLs instances. URL of each Link Instance is broken into segments guiding the tree builder trough existing graph nodes or instructing it to build new node for unmatched segment. The graph nodes are defined by name, score and children collection. The score is risen for one point each time the tree builder walks trough it i.e. the node score represents cumulative frequency of children path match against link detected within the known content. In case of Target having two or more Link Instances the node score is risen only once per unique target.
The module use tree graphs to drive the distribution:
- GT made of the current Targets
- GC made of the crawled URLs
- initial node score set to 2
- GB is a bonus factor score graph, initially emptyand
- GD built as union of these two with recalculated node score:
- GT score is divided by GC score of the matching node
- if intersecting nodes between GB i GD exist, GD score is increased with the score from the matching GB node
In the special case:
- when the GC graph is empty GD is clone of GT.
The GT and GD graphs are temporary instances used in scope of one crawling iteration, while GB and GC are preserved in the domain level crawling context.
The Targets are distributed into layers by active rules evaluation:
GD node with the highest score (GDtop) is selected
if it has the leafs – all are sent to the Primary
if it has the child nodes with leafs – all leafs from children are sent to the Secondary
the rest of targets goes into the Reserve
if GDtop is without leafs but has children nodes with leafs and/or with grandchildren nodes – the score redistribution is performed:
the score of GDtop is equally propagated among the first line children by writing proper positive score into GB graph
the score of GDtop is neutralized by writing negative score of the same amount to the matching GB node
In the Structure rank implementation the module is extended with another active rule providing the link relevance score adopted from the score of GD node with matching URL.