The content processing pipeline framework is a multi-threaded, modular, multi-purpose set of tools for content decomposition, tagging, transformation and information extraction. In the current state of implementation, a pipeline model can be designed only through code. However, the goal is to provide a graphical interface for declarative modeling and an XML-based model file format that can also be edited manually. Although the framework was developed primarily for web content processing, it can be used for any type of parallelised workflow operation.
The main components of the framework are:
Pipeline machine: executes a pipeline model using an initial set of tasks and model-specific data, and controls multi-threading and other general aspects of pipeline model execution.
Pipeline model: a directed graph of interconnected procedural units, called pipeline nodes.
Model execution context: a complex data structure containing the various results of a single execution call. It is the product of the execution of a pipeline model by the pipeline machine.
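The division of labour between machine, model and execution context can be illustrated with a minimal Python sketch. It is not the framework's API: the names `PipelineMachine`, `ExecutionContext`, `run` and `add` are hypothetical, and the model is reduced to a plain callable instead of a graph of nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class ExecutionContext:
    """Collects the results of a single execution call (hypothetical structure)."""

    def __init__(self):
        self.results = []
        self._lock = Lock()

    def add(self, item):
        # Worker threads share the context, so guard the append.
        with self._lock:
            self.results.append(item)


class PipelineMachine:
    """Runs a model over an initial set of tasks on a pool of worker threads."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, model, tasks):
        # A fresh context is created for every execution call.
        context = ExecutionContext()
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # list() forces completion and re-raises any worker exception.
            list(pool.map(lambda task: context.add(model(task)), tasks))
        return context


# Usage: the "model" here is just str.upper, standing in for a node graph.
machine = PipelineMachine(workers=2)
context = machine.run(str.upper, ["a", "b", "c"])
```

The order of `context.results` depends on thread scheduling, so callers should not rely on it.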
The elements of the framework, at the model and execution layers, are:
Pipeline task: the atomic unit of work to be executed within a model. Task instances flow from the start node through the pipeline, each carrying its own instance of the subject being processed.
Pipeline task subject: a unit of content associated with a pipeline task.
Pipeline node: the atomic unit of the processing operation. Once a pipeline task enters a node, the node performs its algorithm on the subject carried by the task and directs the task to another node, depending on the current node's type and settings.
Pipeline bin: a storage unit that collects processed subjects. It is declared by the model, but the real instance of each bin resides in the pipeline execution context.
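How the elements listed above relate to each other can be sketched in Python. This is an illustration under assumptions, not the framework's actual classes: `Subject`, `Task`, `Node`, `Bin`, `process` and `accept` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Subject:
    """Unit of content carried by a task."""
    content: str
    tags: List[str] = field(default_factory=list)


@dataclass
class Task:
    """Atomic unit of work; it carries exactly one subject."""
    subject: Subject


@dataclass
class Bin:
    """Storage unit that collects processed subjects."""
    name: str
    items: List[Subject] = field(default_factory=list)

    def accept(self, task: Task) -> None:
        self.items.append(task.subject)


@dataclass
class Node:
    """Processing unit: applies its operation, then names the next stop."""
    name: str
    operation: Callable[[Subject], None]
    next_stop: Optional[object] = None  # another Node, a Bin, or None

    def process(self, task: Task):
        self.operation(task.subject)
        return self.next_stop


# Usage: a single node tags the subject and forwards the task to a bin.
output = Bin("Output")
tagger = Node("tagger", lambda s: s.tags.append("seen"), next_stop=output)
task = Task(Subject("hello world"))
target = tagger.process(task)
target.accept(task)
```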
The base class of the pipeline model has one start node and two bins: Trash and Output. Once the model and content are sent to the pipeline machine for execution, the newly created pipeline execution context hooks the bin nodes to the corresponding collections it contains.
Tasks flow from the start node through the model. If a task reaches the end of a pipeline branch without any node having directed it and its subject to the Output bin, the task is discarded and its subject is sent to the Trash bin.
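The routing rule described above, including the fallback to the Trash bin, can be sketched as follows. The sketch is single-threaded for clarity, and it assumes a simplified representation in which each node is a function that returns the next node, the `OUTPUT` marker, or `None` at the end of a branch. `run_model`, `start_node` and `keep_long` are hypothetical names.

```python
from collections import deque

OUTPUT, TRASH = "Output", "Trash"


def run_model(start, subjects):
    """Push one task per subject through the model and return the filled bins."""
    bins = {OUTPUT: [], TRASH: []}
    queue = deque((start, subject) for subject in subjects)
    while queue:
        node, subject = queue.popleft()
        nxt = node(subject)
        if nxt is None:
            # End of branch without reaching Output: the task is discarded
            # and its subject goes to Trash.
            bins[TRASH].append(subject)
        elif nxt == OUTPUT:
            bins[OUTPUT].append(subject)
        else:
            queue.append((nxt, subject))
    return bins


def keep_long(subject):
    # Example node: only subjects longer than three characters are kept.
    return OUTPUT if len(subject) > 3 else None


def start_node(subject):
    # The start node simply forwards every task to the filter node.
    return keep_long


result = run_model(start_node, ["pipeline", "ok"])
# result == {"Output": ["pipeline"], "Trash": ["ok"]}
```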
There are three core base classes of pipeline nodes:
The data structure used for content representation is a directed tree graph whose root node is the mining content repository of a category. The graph follows this hierarchy: Repository → Web Site → Web Page → Content Block → Token Stream → Token. Phrases (chunks) are injected between the Token Stream and its Tokens wherever they are matched by an algorithm applied later.
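A minimal Python sketch of such a tree, including the later injection of a phrase between a token stream and its tokens, might look like this. `ContentNode`, `add` and `inject_phrase` are hypothetical names, and the phrase boundaries are passed in by hand, not produced by a matching algorithm.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentNode:
    """One node of the content tree; `level` names its place in the hierarchy."""
    level: str
    name: str
    children: List["ContentNode"] = field(default_factory=list)

    def add(self, level: str, name: str) -> "ContentNode":
        child = ContentNode(level, name)
        self.children.append(child)
        return child


def inject_phrase(stream: ContentNode, start: int, end: int, name: str) -> ContentNode:
    """Wrap the tokens stream.children[start:end] in a new Phrase node."""
    phrase = ContentNode("Phrase", name, stream.children[start:end])
    stream.children[start:end] = [phrase]
    return phrase


# Build Repository -> Web Site -> Web Page -> Content Block -> Token Stream -> Token.
repo = ContentNode("Repository", "example category")
site = repo.add("Web Site", "example.com")
page = site.add("Web Page", "/about")
block = page.add("Content Block", "main text")
stream = block.add("Token Stream", "sentence 1")
for word in ["New", "York", "is", "big"]:
    stream.add("Token", word)

# A chunk matched later is inserted between the stream and its tokens.
phrase = inject_phrase(stream, 0, 2, "New York")
```

After the injection, the stream holds the phrase followed by the two remaining tokens, and the phrase holds the tokens it wraps.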