How to: crawl web content of a category in the project?

If you have no repository, make sure that the following option is set to false in the [application]/config/application.xml file: false

Industry Term Model is the working title for the Web Classification algorithm, and it refers to a particular namespace within imbWBI. The namespace contains a few classes that connect different parts of the imbWBI.Core, imbNLP.PartOfSpeech and imbWEM.Core libraries to perform classification of business entities (in practice, their web sites) using natural language processing, ontology construction and...

The following commands generate crawl scripts for the categories specified within the IndustryTermModel project named “itm01”:

// loads the project "itm01" 
itm.Open "itm01";
// here we use explicit ACE syntax
itm.CrawlScript name="constructions";clearRepo=false;debug=true;autorun=true;
itm.CrawlScript name="cooling";clearRepo=false;debug=true;autorun=true;
itm.CrawlScript name="energetics";clearRepo=false;debug=true;autorun=true;

// then a bit of implicit :)
itm.CrawlScript "heating";
itm.CrawlScript "furniture";

// saves the project, which is actually not needed, but why not
itm.Save;

// quits the console
Quit
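
When a project has many categories, the command list above can be produced programmatically instead of typed by hand. The following is a minimal Python sketch, not part of imbWBI: the function name `crawl_script_commands` and its parameters are hypothetical, and only the command text it emits (`itm.Open`, `itm.CrawlScript` with `name`, `clearRepo`, `debug` and `autorun`, `itm.Save`, `Quit`) is taken from the example above.

```python
def crawl_script_commands(project, categories, clear_repo=False, debug=True, autorun=True):
    """Build ACE console lines: open a project, request one crawl script per category.

    Hypothetical helper for illustration; it only formats text in the
    explicit ACE syntax shown in the example above.
    """
    def flag(value):
        # The example writes booleans in lowercase for itm.CrawlScript.
        return "true" if value else "false"

    lines = [f'itm.Open "{project}";']
    for name in categories:
        lines.append(
            f'itm.CrawlScript name="{name}";clearRepo={flag(clear_repo)};'
            f'debug={flag(debug)};autorun={flag(autorun)};'
        )
    lines.append("itm.Save;")
    lines.append("Quit")
    return lines


if __name__ == "__main__":
    categories = ["constructions", "cooling", "energetics", "heating", "furniture"]
    print("\n".join(crawl_script_commands("itm01", categories)))
```

The printed lines can then be pasted into the console or saved as a script file; how a script file is fed to the console is not covered here.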

Each CrawlScript call generates a crawling script that calls the imbWEM plugin. The generated scripts are saved in the imbWBIToolState directory. If you set autorun=true, the generated script is executed as well.
The content of one such crawling script:

// This is auto-generated script to build MC Repository for Industry Term Model Project
// Date 12/31/2017
// Defining job
wem.Job "MCRepo for constructions";"Building MCRepo for ITMP itm01";true;"";1;
// Loading web domains
wem.SampleFile "G:\imbWBI_Test\projects\imbWBIToolState_jobs\imbWBIToolState\constructions_crawl.txt",false,"Domains of constructions",true,0,-1,True;
// Creates new instance of built-in crawler
wem.Crawler classname="SM_LTS";LT_t=1;I_max=50;PL_max=15;PS_c=10;instanceNameSufix="_MC";primLanguage="serbian";secLanguage="english";
// Configuring Crawl Job Engine
wem.CrawlJobEngineSettings TC_max=2;Tdl_max=20;Tll_max=50;Tcjl_max=120;
// Opens new session with the Index Engine
wem.OpenSession experimentSession="itm01_constructions";IndexID="itm01";useJobSettings=false;crawlFolderNameTemplate="*";
// Opens new session with the Mining Context manager
mcm.Open repo="constructions_const"; log_msg="MCRepo construction for constructions"; debug=True;
// Adds plugin
wem.plugin plugin_classname="reportPlugIn_CrawlToMC";
// Runs the crawl job
wem.Run;
// Closes the currently opened Mining Context session
mcm.Close log_msg="Ending MCRepo construction for constructions"; doReport=true; debug=True;

A text file with the crawl targets will appear in the same location.
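
To check which settings ended up in a generated script, the explicit `name=value;` lines shown above can be read with a few lines of Python. This is only a rough sketch and not the ACE interpreter: the function name `parse_explicit_ace` is hypothetical, and it assumes that values contain no semicolons and that positional (implicit) arguments can be ignored.

```python
def parse_explicit_ace(line):
    """Split 'prefix.Command key=value;key=value;' into (command, params).

    Illustrative reader only: assumes no ';' inside values and skips
    positional (implicit) arguments that have no 'key=' part.
    """
    command, _, rest = line.strip().partition(" ")
    params = {}
    for part in rest.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # empty pieces and positional arguments have no '='
            params[key.strip()] = value.strip().strip('"')
    return command.rstrip(";"), params


if __name__ == "__main__":
    line = "wem.CrawlJobEngineSettings TC_max=2;Tdl_max=20;Tll_max=50;Tcjl_max=120;"
    command, params = parse_explicit_ace(line)
    print(command, params)
```

All values are returned as strings; converting them to numbers or booleans is left out because the script itself mixes spellings such as `true` and `True`.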

 
