imbSCI: Data Annotation

Keeping all model parameters, experiment factors and evaluation measures properly aligned through all stages and layers of the research life cycle is of great importance. If you have a model parameter called, e.g., Popularity measure, letter-named in the article as Pm, you will have fewer headaches if it follows the same nomenclature across all layers: the manuscript, the source code of your research tool, the column headers of the report spreadsheets, and all supplementary data and files backing your findings and conclusions. It is even better if the description sentence, the unit-of-measurement label and the nomenclature of the parent category/group are synchronized too, so that what you write once in the code shows up in e.g. Sandcastle API documentation, IntelliSense tooltips, the reports and so on.

Following the idea of such manuscript ≡ code ≡ report ≡ data coupling, a Declarative Coding data annotation toolkit is embedded at the very core of the imbSCI package. The main class of the toolkit is imbAttribute, in the imbSCI.Core library. Combined with the .NET Framework ComponentModel and DataAnnotations attributes and a number of specific imbSCI enumeration types, it lets you define data column headings, aggregated data views, value rendering and styling properties, as well as the heading and description at class level. I found that this functionality saves me a vast amount of time and energy, as it provides centralized control over data-model description and report representation. Use it together with a set of code snippets that ease property declaration and annotation and keep the XML comments synchronized as well. The snippets of this group start with the prefix _imbSci, followed by a type/role name (e.g. _imbSciBool, _imbSciCount, _imbSciRatio, _imbSciString…).
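To make the idea concrete, here is a minimal sketch of how annotations written once on a property can be read back and turned into report column headings. It uses only the standard DisplayNameAttribute and DescriptionAttribute with reflection; the CrawlRecord and AnnotationReader names are hypothetical, and this is not how imbSCI itself is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Reflection;

// Hypothetical data-model class, annotated only with standard .NET attributes.
public class CrawlRecord
{
    [DisplayName("Pm")]
    [Description("Popularity measure of the page")]
    public double PopularityMeasure { get; set; }

    // No annotations here: the reader falls back to the property name.
    public int PageCount { get; set; }
}

// Hypothetical helper, not part of imbSCI.
public static class AnnotationReader
{
    // Maps each column heading to its description for the public properties of T.
    public static Dictionary<string, string> DescribeColumns<T>()
    {
        var columns = new Dictionary<string, string>();
        foreach (PropertyInfo property in typeof(T).GetProperties())
        {
            var displayName = property.GetCustomAttribute<DisplayNameAttribute>();
            var description = property.GetCustomAttribute<DescriptionAttribute>();

            // Heading: the DisplayName if declared, otherwise the property name.
            string heading = displayName != null ? displayName.DisplayName : property.Name;
            string text = description != null ? description.Description : "";
            columns[heading] = text;
        }
        return columns;
    }
}
```

Calling AnnotationReader.DescribeColumns<CrawlRecord>() returns one entry headed Pm with the description text, and one entry headed PageCount with an empty description, so the heading and description live in a single place in the code.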

The most important annotation fields – in the imbSCI and imbACE libraries.

The figure shown above highlights the most important aspects controlled by the Declarative Coding toolkit. The descriptive annotation (orange) is only one part of the puzzle: depending on the output format used for reporting, the Rendering (green) instructions may become worth your attention. For instance, if the report output format is one of the feature-rich HTML/JS/CSS report templates and you have a URL in a property of your data-model, setting the “Is Link” flag will render the URL as a fully functional hyperlink.

Screenshot of table, rendered with proper hyperlink (column: Report), using Structural Reporting workflow to generate a static HTML web site report on a crawl experiment.

Before diving into the other groups, let’s look at a few source code examples, so you get a practical idea of the concept.

Example: Declaration of a property using the _imbSciRatio snippet

/// <summary> Ratio of relevant versus irrelevant pages, currently in the index datatable </summary>
[imb(imbAttributeName.measure_letter, "p/n")]
[imb(imbAttributeName.measure_setUnit, "%")]
[Description("Ratio of relevant versus irrelevant pages, currently in the index datatable")]
// [imb(imbAttributeName.reporting_valueformat, "F2")]
// [imb(imbAttributeName.measure_important)][imb(imbAttributeName.reporting_escapeoff)]
public Double RelevantContentRatio { get; set; } = 0;

Screenshot from LibreOffice Calc: Spreadsheet report showing the property declared above

The CategoryAttribute, DisplayNameAttribute and DescriptionAttribute are self-explanatory, and their effects are observable on the screenshot. The snippet adds a few annotation proposals that are commented out by default. As you can see, the RelevantContentRatio (H) column contains an unformatted Double value, since the line declaring the F2 format (a .NET standard numeric format string) is left commented out. By the way, F2 would render the value in the first row as 0.93, and we don’t want that, as the percentage sign is our preferred unit of measure.

// [imb(imbAttributeName.reporting_valueformat, "F2")]


LibreOffice screenshot, Direct Reporting workflow

[DisplayName("Pot. Prec. Change (avg)")]
[Description("What difference the frontier layers made from MT_ipp to MT_opp. Positive value means the pot. precision  is increased.")]
[imb(imbAttributeName.measure_setUnit, "%")]
[imb(imbAttributeName.reporting_valueformat, "P2")]
[imb(dataPointAggregationAspect.overlapMultiTable, dataPointAggregationType.avg)]
[imb(imbAttributeName.measure_important, dataPointImportance.important)]
public Double potentialPrecissionChangeAvg { get; set; } = 0;

In the second code block, we enabled the proper percentage formatting P2 and declared the property as dataPointImportance.important, which resulted in a bold text style for the complete column. In this block you may also notice the dataPointAggregationType.avg enum value, used to define how the cell value is calculated in the dataPointAggregationAspect.overlapMultiTable aggregation scenario. To learn more about data annotation and Excel spreadsheet reporting, check the posts in the Reporting and Data Manipulation category, especially those on the Direct Reporting workflow and DataTable aggregation operations, as they are relevant for the Data Handling (♦ blue) group.
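The difference between the F2 and P2 format strings can be checked directly with the standard .NET Double.ToString call. The sketch below is independent of imbSCI; FormatDemo is a hypothetical helper name, and the invariant culture is used only so that the result does not depend on the regional settings of the machine.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of imbSCI.
public static class FormatDemo
{
    // Renders a value with a .NET standard numeric format string.
    public static string Render(double value, string format)
    {
        return value.ToString(format, CultureInfo.InvariantCulture);
    }
}
```

FormatDemo.Render(0.9312, "F2") keeps the ratio as a fixed-point number with two decimals (0.93), while FormatDemo.Render(0.9312, "P2") multiplies the value by 100 and appends the percent sign (93.12 followed by the culture's percent symbol). That is why P2 is the better fit for a column whose unit is declared as %.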

The File Data Structure group (violet) is closely related to a very useful feature in the imbACE.Core library. It provides a high-level API for saving and loading your data-model classes, or in fact any class compatible with default XML serialization (check this post). Common sense tells us that saving an object into a single XML, JSON or binary file makes more sense than splitting it into several files (in several different formats) and spreading them across a directory tree (that reflects its structure).

It does indeed, if the output is meant only for software interpretation and the data it contains is not something you review manually for research and development needs. But the adventure of Coding for Science comes from the fact that you are developing software and trying to confirm your research hypothesis in the same, parallel workflow. A group of imbACE.Core classes around the IFileDataStructure interface helps you split object data into several files of plain text, XML and other formats. At the same time, following the instructions defined with the Data Annotation attributes, it automatically generates the required directory hierarchy, sets the filenames and writes “read me” markdown files describing the content of each branch and the files it contains. Optionally, it can create additional documentation/reports for selected properties and objects contained in the host class (the one that implements the IFileDataStructure interface). This self-documenting feature is particularly nice, bearing in mind that e.g. Mendeley Data demands such file lists and descriptions for archives published via their service.

Mendeley Data FAQ: To publish your research data with Mendeley, you have to provide a “read me” file with a list and description of all files in the archive.

Example: Use of class and property annotation attributes for IFileDataStructure functionality

/// <summary>
/// Repository holding Mining Context corpus for a set of web site
/// </summary>
/// <seealso cref="aceCommonTypes.files.fileDataStructure.IFileDataStructure" />
[DisplayName("Web site content repository")]
[Description(@"All crawled pages from a domain, approved for data mining / knowledge extraction, are contained in this repository.
The repository also covers general objects&data about the web site.")]
[fileStructure(nameof(, fileStructureMode.subdirectory,
    fileDataFilenameMode.propertyValue, fileDataPropertyOptions.textDescription)]
public class imbMCWebSite : fileDataStructure, IFileDataStructure
{
    /// <summary>
    /// Gets or sets the domain information.
    /// </summary>
    /// <value>
    /// The domain information.
    /// </value>
    [fileData(fileDataFilenameMode.memberInfoName, fileDataPropertyMode.XML)]
    public domainAnalysis domainInfo { get; set; }

    // ... remaining members omitted
}

The code above is part of the actual implementation of the imbMCWebSite class (part of imbMiningContext, an imbWEM library that manages the Mining Context repositories, more about it). Above the class declaration you may observe the fileStructureAttribute, one of the two attribute classes managing the IFileDataStructure behaviour. The second is fileDataAttribute, applied to the domainInfo property.

The class-level attribute states that:

  • an imbMCWebSite instance should be saved in its own subdirectory (fileStructureMode.subdirectory),
  • the subdirectory should be named by value (fileDataFilenameMode.propertyValue) of the name property ( nameof( ) of this instance,
  • and that this instance should be documented with a textual (markdown) description (fileDataPropertyOptions.textDescription), stored inside its subdirectory.

When fileStructureMode.subdirectory is used, the object itself is serialized as XML and placed inside the subdirectory under a filename reflecting the class name, in this case: imbMCWebSite.xml.

Regarding the domainInfo property: it carries the XmlIgnore attribute (not shown in the excerpt above), which prevents the property from being saved in the object’s own XML file. That is because the fileData attribute states that the domainAnalysis instance, if one is assigned to the property, should be saved separately, under the name of the property (fileDataFilenameMode.memberInfoName) and using XML serialization (fileDataPropertyMode.XML).
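The mechanics described above (host object in its own subdirectory, an XmlIgnore-marked property stored in a separate file named after the member) can be sketched with the standard XmlSerializer alone. This is only an illustration under stated assumptions: SiteRecord, DomainInfo and SplitStorage are hypothetical names, the file layout is hard-coded instead of being driven by attributes, and nothing here is the imbACE.Core implementation.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for a complex property type.
public class DomainInfo
{
    public string Host { get; set; }
}

// Hypothetical stand-in for the host class.
public class SiteRecord
{
    public string Name { get; set; }

    // Excluded from SiteRecord.xml; stored in its own file instead.
    [XmlIgnore]
    public DomainInfo Domain { get; set; }
}

// Hypothetical helper, not part of imbACE.
public static class SplitStorage
{
    private static void WriteXml<T>(T value, string path)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (FileStream stream = File.Create(path))
        {
            serializer.Serialize(stream, value);
        }
    }

    private static T ReadXml<T>(string path)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (FileStream stream = File.OpenRead(path))
        {
            return (T)serializer.Deserialize(stream);
        }
    }

    // Saves the record into <root>/<Name>/ and returns that directory path.
    public static string Save(SiteRecord record, string root)
    {
        string directory = Path.Combine(root, record.Name);
        Directory.CreateDirectory(directory);

        // Host object: file named after the class.
        WriteXml(record, Path.Combine(directory, nameof(SiteRecord) + ".xml"));

        // Ignored property: separate file named after the member.
        if (record.Domain != null)
        {
            WriteXml(record.Domain, Path.Combine(directory, nameof(SiteRecord.Domain) + ".xml"));
        }
        return directory;
    }

    // Loads the host object, then re-attaches the separately stored property.
    public static SiteRecord Load(string directory)
    {
        SiteRecord record = ReadXml<SiteRecord>(Path.Combine(directory, nameof(SiteRecord) + ".xml"));
        string domainPath = Path.Combine(directory, nameof(SiteRecord.Domain) + ".xml");
        if (File.Exists(domainPath))
        {
            record.Domain = ReadXml<DomainInfo>(domainPath);
        }
        return record;
    }
}
```

The point of the split is reviewability: each part of the object ends up in a small file that can be opened and inspected on its own, while Load puts the pieces back together.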

Example: Loading object instance of class implementing IFileDataStructure interface

imbMCRepository instance = null;

String path = folder.pathFor("\\" + repo);

if (Directory.Exists(path))
{
    // Repository directory found: load the existing instance
    instance = repo.LoadDataStructure<imbMCRepository>(folder, output);
    instance.loger.log("Repository loaded ".add(log_msg, ". "));
}
else
{
    // Nothing at the path: create a new repository with a timestamped description
    String descriptionForNew = "MC Repository created [" + DateTime.Now.ToLongDateString() + " " + DateTime.Now.ToLongTimeString() + "]. " + log_msg;
    instance = new imbMCRepository(repo, descriptionForNew, folder);
    instance.loger.log("Repository created ".add(log_msg, ". "));
}

The code demonstrates the typical pattern for loading a File Data Structure instance, or creating a new instance if none was found at the specified path. The code below shows how to save a File Data Structure object (activeRepository) into the desired directory (folder), how to get the final path where the object was saved (filepath), and how to stay informed in case something goes wrong, by passing an optional log builder instance (output).

String filepath = activeRepository.SaveDataStructure(folder, output);

To learn more on IFileDataStructure functionality: File Data Structure and its API.


The last group shown on the diagram (blue) provides additional descriptive information, meant as application instructions for the user of your tool or the reviewer of your research reports. The help content defined with these attributes is incorporated as the introduction of an auto-generated help/instructions file. Help file generation is available for any imbACE.Service Console implementation, using the extension methods of the commandTreeReportTools class in the imbACE.Services library.

Screenshot of markdown preview on auto-generated help.txt file for imbAnalyticConsole

Besides the Declarative Coding pattern explained so far, the same idea of attribute-driven, self-explanatory coding appears throughout the imbSCI and imbACE libraries. One such case is the declaration pattern for an ACE Script / ACE Console method (also called aceOperation), which is supported by the _aceMethodOperation snippet. The snippet creates a skeleton method with a set of attributes documenting the method as well as its parameters. The code of a simple analyticConsole command is shown below.

/// <summary>Crawl Job Engine controls the parallel execution of the Crawl Job.</summary>
/// <remarks><para>This command sets the most important parameters of the Crawl Job execution</para></remarks>
/// <param name="TC_max">Maximum number of parallel DLC executing in the same moment</param>
/// <param name="Tdl_max">Maximum minutes allowed for single DLC to run</param>
/// <param name="Tll_max">Maximum minutes of single iteration allowed for a DLC before its termination</param>
/// <param name="Tcjl_max">Maximum minutes for the complete Crawl Job execution</param>
/// <seealso cref="aceOperationSetExecutorBase"/>
[Display(GroupName = "define", Name = "CJEngineSetup", ShortName = "CJES", Description = @"Crawl Job Engine controls the parallel execution of the Crawl Job. 
    Tdl_max defines max. minutes per one domain level crawl, Tll_max per single link load and TC_max defines number of parallel domain loads.")]
[aceMenuItem(aceMenuItemAttributeRole.ExpandedHelp, "This command sets the most important parameters of the Crawl Job execution. For Tdl_max and Tll_max value -1 means limit is off, for TC_max value -1 means auto management.")]
public void aceOperation_defineCrawlJobEngineSettings(
    [Description("Maximum number of parallel DLC executing in the same moment")] Int32 TC_max = 8,
    [Description("Maximum minutes allowed for single DLC to run")] Int32 Tdl_max = 50,
    [Description("Maximum minutes of single iteration allowed for a DLC before its termination")] Int32 Tll_max = 20,
    [Description("Maximum minutes for the complete Crawl Job execution")] Int32 Tcjl_max = 100)
{
    // Copy the command arguments into a fresh settings object
    state.crawlerJobEngineSettings = new crawlerDomainTaskMachineSettings();
    state.crawlerJobEngineSettings.TC_max = TC_max;
    state.crawlerJobEngineSettings.Tdl_max = Tdl_max;
    state.crawlerJobEngineSettings.Tll_max = Tll_max;
    state.crawlerJobEngineSettings.Tcjl_max = Tcjl_max;
}

Calling this aceOperation from the command line, the command console text input or an ACE Script:

CJEngineSetup TC_max=8;Tdl_max=120;Tll_max=50;Tcjl_max=100;
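To show how such a call line maps onto named method parameters, here is a minimal sketch that splits a line of the form Name key=value;key=value; into the command name and its integer arguments. CommandLineSketch is a hypothetical helper written for illustration; the actual ACE Script parser is not shown in this post and may behave differently (for example regarding quoting, types other than Int32, or short names such as CJES).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not the ACE Script parser.
public static class CommandLineSketch
{
    // Splits "Name key=value;key=value;" into the command name and integer arguments.
    public static Dictionary<string, int> Parse(string line, out string commandName)
    {
        string trimmed = line.Trim();
        int firstSpace = trimmed.IndexOf(' ');

        // Everything before the first space is the command name.
        commandName = firstSpace < 0 ? trimmed : trimmed.Substring(0, firstSpace);
        string rest = firstSpace < 0 ? "" : trimmed.Substring(firstSpace + 1);

        var arguments = new Dictionary<string, int>();
        foreach (string pair in rest.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] parts = pair.Split('=');
            if (parts.Length != 2)
            {
                throw new FormatException("Expected key=value but found: " + pair);
            }
            arguments[parts[0].Trim()] = int.Parse(parts[1].Trim(), CultureInfo.InvariantCulture);
        }
        return arguments;
    }
}
```

Parsing the line CJEngineSetup TC_max=8;Tdl_max=120;Tll_max=50; with this sketch gives the command name CJEngineSetup and three named values, which a dispatcher could then match against the parameter names declared on the method.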


