Content processing pipeline

The content processing pipeline framework is a multi-threaded, modular, and multi-functional set of tools for content decomposition, tagging, transformation, and information extraction. In the current state of the implementation, a pipeline model can be designed only through code. However, the goal is to provide a graphical interface for declarative modeling, along with an XML-based model file format, so that models can also be modified manually. Although the framework is developed primarily for web content processing, it can be used for any type of parallelised workflow.


The main components of the framework are:

  • pipelineMachine
    Executes a pipeline model using an initial set of tasks and model-specific data.
  • pipelineMachineSettings
    Controls multi-threading and other general aspects of pipeline model execution.
  • pipelineModel
    A directed graph of interconnected procedural units, called pipeline nodes.
  • pipelineModelExecutionContext
    A complex data structure containing the various results of a single execution call. It is the product of executing a pipeline model on the pipeline machine.
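The relationship between these four components can be sketched as follows. This is a minimal illustration only: the class names come from the list above, but every method, field, and signature here is a hypothetical assumption, not the framework's real API.

```python
from dataclasses import dataclass, field

@dataclass
class pipelineMachineSettings:
    max_threads: int = 4          # stands in for the multi-threading controls

@dataclass
class pipelineModelExecutionContext:
    # collections that will receive the results of one execution call
    output: list = field(default_factory=list)
    trash: list = field(default_factory=list)

class pipelineModel:
    """Directed graph of pipeline nodes (graph traversal omitted here)."""
    def run(self, task, context):
        # placeholder: a real model would route the task through its nodes
        context.output.append(task)

class pipelineMachine:
    """Executes a model over an initial set of tasks, producing a context."""
    def __init__(self, settings):
        self.settings = settings
    def execute(self, model, tasks):
        context = pipelineModelExecutionContext()
        for task in tasks:
            model.run(task, context)
        return context

machine = pipelineMachine(pipelineMachineSettings(max_threads=2))
context = machine.execute(pipelineModel(), ["page-1", "page-2"])
```

Under these assumptions, the execution context is created by the machine per call and collects the results, matching the description above.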

Elements of the framework, at the model and execution layers, are:

  • pipelineTask
    The atomic unit of work to be executed within a model. Task instances flow from the start node through the pipeline, each carrying its own instance of the subject being processed.
  • pipelineTaskSubject
    A unit of content, associated with a pipeline task.
  • pipelineNode
    The atomic unit of a processing operation. Once a pipeline task enters the node, the node performs its algorithm on the subject carried by the task and directs the task to another node, depending on the current node's type and settings.
  • pipelineBin
    A storage unit that collects processed subjects. A bin is declared by the model, but the real instance of each bin resides in the pipeline execution context.
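How these execution-layer elements fit together can be shown with a small sketch. Again, only the class names are taken from the text; the fields and methods are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class pipelineTaskSubject:
    content: str                        # the unit of content being processed

@dataclass
class pipelineTask:
    subject: pipelineTaskSubject        # each task carries its own subject

class pipelineBin:
    """Declared by the model; real instances live in the execution context."""
    def __init__(self, name):
        self.name = name
        self.subjects = []
    def collect(self, task):
        self.subjects.append(task.subject)

class pipelineNode:
    """Atomic processing operation; forwards the task onward when done."""
    def __init__(self, next_element):
        self.next_element = next_element
    def process(self, task):
        # a real node would run its algorithm on task.subject here,
        # then choose the next element based on its type and settings
        self.next_element.collect(task)

output = pipelineBin("Output")
node = pipelineNode(output)
node.process(pipelineTask(pipelineTaskSubject("a content block")))
```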

The base pipeline model class has one starting node and two bins: Trash and Output. Once a model and content are sent to the pipeline machine for execution, the newly created pipeline execution context hooks the bin nodes to the corresponding collections it contains.

Tasks flow from the start node through the model. If no node has directed a task and its subject to the Output bin by the time the task reaches the end of a pipeline branch, the task is discarded and its subject is sent to the Trash bin.
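This default-to-Trash rule can be captured in a few lines. The helper below is a hypothetical sketch, assuming each node on a branch is a function that returns True when it directs the subject to Output:

```python
def run_branch(subjects, branch_nodes, output, trash):
    """Push each subject through a chain of nodes. If a node claims the
    subject for Output, it is collected there; otherwise the subject
    falls off the end of the branch and lands in Trash."""
    for subject in subjects:
        for node in branch_nodes:
            if node(subject):
                output.append(subject)
                break
        else:
            trash.append(subject)   # end of branch reached: task discarded

output, trash = [], []
is_relevant = lambda s: "news" in s            # hypothetical test node
run_branch(["news article", "ad banner"], [is_relevant], output, trash)
```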

There are three core base classes of pipeline nodes:

  • Task node class
    Splits a content element (subject) into a new set of subjects, and creates a task instance for each.
  • Test node class
    Performs a specific evaluation of a content element (subject). Based on the result of the evaluation, it forwards the task to the related node.
  • Transformation node class
    Performs changes on a content element's (subject's) data. Changes may be unconditional or conditional.

The data structure used for content representation is a directed tree, with the mining content repository of a category as the root node. The tree follows the hierarchy: Repository → Web site → Web page → Content block → Token stream → Token, where between a token stream and its tokens, phrase (chunk) nodes are injected wherever they are matched by an algorithm applied later.
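The hierarchy and the phrase-injection step can be sketched with a simple tree node class. The class and level names below are illustrative, not the framework's real types:

```python
class contentNode:
    """One node of the content tree; levels follow the hierarchy above."""
    def __init__(self, level, value=None):
        self.level = level
        self.value = value
        self.children = []
    def add(self, child):
        self.children.append(child)
        return child

repo = contentNode("Repository")
site = repo.add(contentNode("WebSite", "example.com"))
page = site.add(contentNode("WebPage", "/index.html"))
block = page.add(contentNode("ContentBlock"))
stream = block.add(contentNode("TokenStream"))
for word in ["New", "York", "weather"]:
    stream.add(contentNode("Token", word))

# Phrase injection: when a chunking algorithm matches "New York", a Phrase
# node is inserted between the token stream and the matched tokens.
phrase = contentNode("Phrase", "New York")
phrase.children = stream.children[:2]
stream.children = [phrase] + stream.children[2:]
```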

Figure: a conceptual sketch of a web information extraction pipeline model.
