Document representation

Bag of words

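The bag-of-words representation describes a collection of documents as a document-term matrix: one row per document, one column per vocabulary term, and each entry the number of times that term occurs in that document. It ignores word order and stores only these per-document word counts.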
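
As a rough illustration of the idea above, the following sketch builds a count matrix from a toy corpus. The function name bag_of_words, the lowercase whitespace tokenisation, and the example sentences are assumptions made for this illustration and do not come from the source.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix) for a list of raw text documents.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; a real pipeline would likely use a proper tokenizer.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # One column per distinct term, in sorted order for reproducibility.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = ["the cat sat", "the dog sat on the cat", "sat cat the"]
vocab, matrix = bag_of_words(docs)
for row in matrix:
    print(row)
```

The first and third toy documents contain the same words in a different order, so they map to identical rows, which is exactly the loss of word order described above.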


[Topic Model] Perplexity is a standard performance measure used to evaluate models of text data. It measures a model’s ability to generalise and predict new documents: the perplexity is an indication of the number of equally likely words that can occur at an arbitrary position in a document. A lower perplexity therefore indicates better generalisation. We calculate…
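
The excerpt is cut off before its formula, so the sketch below does not reproduce the authors' calculation. It uses the common textbook definition instead: perplexity is the exponential of the negative mean log-probability that the model assigns to each held-out token. The function name perplexity and the toy inputs are assumptions for this illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of held-out tokens.

    Assumption: this is the standard definition
    exp(-(1/N) * sum(log p(w_i))), not necessarily the exact formula
    used in the excerpted paper.
    """
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / n)


# A model that spreads its probability uniformly over 50 words.
uniform_model = [math.log(1 / 50)] * 200
# A model that assigns probability 0.5 to every observed token.
sharper_model = [math.log(0.5)] * 200

print(perplexity(uniform_model))
print(perplexity(sharper_model))
```

Under this definition a model that is uniform over 50 words has a perplexity of about 50, which matches the reading of perplexity as the number of equally likely words at each position; the sharper model scores lower.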


A termset (also known as an itemset [16] or word combination feature [28]) is assumed to occur in a given document if all of its members are present, regardless of their order and position. Selecting a discriminative set of n-termsets is a crucial but very challenging task since all groups of n terms can…
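
The occurrence rule stated above can be sketched as a subset test. The function name termset_occurs and the sample tokens are hypothetical and chosen only for this illustration; how termsets are selected is not covered here.

```python
def termset_occurs(termset, document_tokens):
    """Return True if every term of the termset appears in the document.

    Order and position are ignored, so both inputs are treated as sets.
    """
    return set(termset) <= set(document_tokens)


tokens = "interest rates rise as the central bank meets".split()

print(termset_occurs({"bank", "interest"}, tokens))
print(termset_occurs({"bank", "river"}, tokens))
```

The first 2-termset occurs even though its terms are far apart and in reverse order in the document; the second does not, because one member is missing.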