Text document representation models – literature review

This post is a collection of short notes on text document representation models, in the context of text categorization / web page categorization. Its purpose is to serve as a continuously growing list of relevant models developed in the field.

Main document representation models

Document representation methods

The three (currently) most popular document representation methods are [1]:

  • Vector Space Model – VSM
    • TF-IDF & Cosine similarity (see the sketch after this list)
    • Latent Semantic Indexing (LSI), also called Latent Semantic Analysis (LSA)
    • Semantic Similarity Retrieval Model (SSRM)
  • Probabilistic topic model
    • Probabilistic Latent Semantic Indexing (PLSI)
    • Latent Dirichlet Allocation (LDA)
  • Statistical language model
    • n-gram language models
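
As a concrete illustration of the VSM entries above, here is a minimal sketch using scikit-learn (a library choice assumed here, not prescribed by [1]): it builds TF-IDF vectors, compares documents by cosine similarity, and reduces the TF-IDF matrix with truncated SVD, which is the core operation of LSI.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock markets fell sharply today",
]

# VSM: each document becomes a TF-IDF weighted term vector
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Cosine similarity between document vectors
print(cosine_similarity(X[0], X[1]))   # related documents -> higher score
print(cosine_similarity(X[0], X[2]))   # unrelated documents -> near zero

# LSI: SVD projects the term-document matrix into a latent semantic space
lsi = TruncatedSVD(n_components=2)
X_lsi = lsi.fit_transform(X)
print(X_lsi.shape)   # (3, 2) - three documents, two latent dimensions
```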

Bag of words [2]

The bag (multiset) of words extracted from the document, i.e. the per-document word counts. Word order is lost.

Bag of n-grams [3]

Keeps a record of word order within a short context; suffers from high dimensionality (produces a lot of n-grams) and poor generalization.
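
Both representations can be sketched in a few lines of plain Python (the helper below is illustrative, not from the cited paper): with n = 1 it yields the plain bag of words, with n > 1 a bag of n-grams that keeps short-range word order at the cost of a much larger feature space.

```python
from collections import Counter

def bag_of_ngrams(text, n=1):
    """n=1: bag of words; n>1: bag of n-grams (keeps short-range word order)."""
    tokens = text.lower().split()
    # slide an n-token window over the text
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams)

print(bag_of_ngrams("the cat sat on the mat"))        # word counts
print(bag_of_ngrams("the cat sat on the mat", n=2))   # bigram counts
```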


Continuous Bag of Words (CBOW) [4]

  • it is a probabilistic bag-of-words model
  • uses a continuous distributed representation of the context
  • predicts the current word based on the context
  • the neural network's output layer produces a probability distribution over the complete vocabulary for the current word position
  • the order of words in the history does not influence the projection

Continuous Skip-gram model (SG) [5]

  • similar to CBOW
  • predicts the surrounding context words based on the current word

Parameters for the CBOW and Skip-gram models

  • C – maximum distance of the context words from the current word (window size)
  • R – number of words actually used from the history and from the future, sampled from the range [1, C] for each training word
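
A minimal training sketch with the gensim library (an assumed tool, version 4.x; the papers [4,5] ship their own implementation). The window argument plays the role of C above, and sg switches between the two architectures.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
]

# sg=0 -> CBOW (predict the current word from its context)
cbow = Word2Vec(sentences, vector_size=50, window=5, sg=0, min_count=1)

# sg=1 -> Skip-gram (predict the context from the current word)
skipgram = Word2Vec(sentences, vector_size=50, window=5, sg=1, min_count=1)

print(cbow.wv["cat"].shape)               # (50,) - the word embedding
print(skipgram.wv.most_similar("cat"))    # nearest words in vector space
```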

Paragraph vector [6]

An unsupervised framework that learns continuous distributed vector representations for pieces of text. It is inspired by word vectors (word embeddings), which are used to predict the next word in a sentence.
Paragraph vectors are used as features for machine learning techniques; their key advantage is that they are learnt from unlabeled data.
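
A minimal sketch with gensim's Doc2Vec (gensim ≥ 4.0 assumed), an implementation of the paragraph vector model; library choice and parameter values are assumptions, not taken from [6].

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc0"]),
    TaggedDocument(words=["stock", "markets", "fell", "sharply"], tags=["doc1"]),
]

# Each document gets its own trainable vector alongside the word vectors
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# The learnt paragraph vector can be fed to a classifier as a feature vector
print(model.dv["doc0"].shape)                        # (50,)

# Infer a vector for an unseen piece of text
print(model.infer_vector(["a", "cat", "on", "a", "rug"]).shape)
```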


Topic modeling of short texts

Strategies:

a) Aggregation of short texts into pseudo-documents

  • aggregate tweets containing the same word
  • aggregate tweets based on hashtags
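
For example, a minimal sketch of hashtag-based aggregation (the helper and the tweet format are hypothetical):

```python
from collections import defaultdict

def aggregate_by_hashtag(tweets):
    """Merge tweets sharing a hashtag into one pseudo-document per tag."""
    pseudo_docs = defaultdict(list)
    for text in tweets:
        for token in text.split():
            if token.startswith("#"):
                pseudo_docs[token.lower()].append(text)
    # one long pseudo-document per hashtag, usable by a standard topic model
    return {tag: " ".join(texts) for tag, texts in pseudo_docs.items()}

tweets = ["new phone looks great #tech", "chips are getting faster #tech"]
print(aggregate_by_hashtag(tweets))
```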

b) Adding strong assumptions

  • each short text is a mixture of unigrams sampled from only one topic
  • Biterm topic model (BTM) – turns the whole corpus into a biterm set, where a biterm is formed by any two distinct words that co-occur in a short context (see the sketch below)
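
A minimal sketch of the biterm extraction step only (window size and tokenization are assumptions; BTM's actual inference is not shown):

```python
from itertools import combinations

def extract_biterms(tokens, window=3):
    """All unordered pairs of distinct words that co-occur in a short context."""
    biterms = set()
    for i in range(len(tokens)):
        context = tokens[i:i + window]
        for a, b in combinations(context, 2):
            if a != b:
                biterms.add(tuple(sorted((a, b))))
    return biterms

print(extract_biterms("apple releases new phone today".split()))
```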


Latent concept topic model (LCTM)


Joint modeling of normal documents and short texts (user communication)


Master-Slave Topic Model (MSTM)

Extended Master-Slave Topic Model (ESTM)

Co-occurring topic model (COTM) [7]

  • developed for short texts (user comments) and informal language

References

  1. K.N. Singh, H.M. Devi, Document Representation Techniques and Their Effect on the Document Clustering and Classification: A Review, Int. J. Adv. Res. Comput. Sci. 8 (2017) 1780–1784.
  2. Z.S. Harris, Distributional Structure, WORD 10 (1954) 146–162. doi:10.1080/00437956.1954.11659520.
  3. Z.S. Harris, Distributional Structure, WORD 10 (1954) 146–162. doi:10.1080/00437956.1954.11659520.
  4. T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781 (2013).
  5. T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781 (2013).
  6. Q.V. Le, T. Mikolov, Distributed Representations of Sentences and Documents, in: Proc. 31st Int. Conf. on Machine Learning (ICML), PMLR 32 (2014) 1188–1196.
  7. Y. Yang, F. Wang, J. Zhang, J. Xu, P.S. Yu, A Topic Model for Co-occurring Normal Documents and Short Texts, World Wide Web 21 (2018) 487–513. doi:10.1007/s11280-017-0467-8.