Term Weighting methods – literature overview

Term weighting methods used in text classification

Term Frequency (TF)

TF_{t,d}=\frac{fr_{t,d}}{\sqrt{\sum^n_{t=1}fr_{t,d}^2}}

Where:

  • fr_{t,d} = raw frequency of term t in document d
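As a minimal sketch of this normalization (assuming documents are represented as term → raw-count dictionaries; `tf_weights` is an illustrative name, not from any library):

```python
import math

def tf_weights(doc_counts):
    """Cosine-normalized TF: each raw count divided by the Euclidean
    norm of all raw counts in the document, per the formula above."""
    norm = math.sqrt(sum(c * c for c in doc_counts.values()))
    return {t: c / norm for t, c in doc_counts.items()}

# counts 3 and 4 have Euclidean norm 5, so the weights are 0.6 and 0.8
print(tf_weights({"cat": 3, "dog": 4}))
```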

Document Frequency – DF [1]

Global term weighting method, where terms occurring in more documents are considered as more relevant.

DF_t = \frac{\sum^{n_d}_{d=1} \mathbb{1}(tf_{t,d} > 0)}{n_d}

where n_d is the total number of documents and \mathbb{1}(\cdot) is the indicator function.
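The same definition in a short sketch (hypothetical helper, assuming each document is a term → count dictionary):

```python
def document_frequency(docs, term):
    """DF_t: fraction of documents containing the term at least once."""
    hits = sum(1 for d in docs if d.get(term, 0) > 0)
    return hits / len(docs)

docs = [{"cat": 1}, {"dog": 2}, {"cat": 3}]
print(document_frequency(docs, "cat"))  # "cat" occurs in 2 of 3 documents
```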

Term Frequency – Inverse Document Frequency – TF-IDF [2]

The most common weight computation scheme applied with the Vector Space Model is the Term Frequency – Inverse Document Frequency (TF-IDF):

TF-IDF_{t,d} = TF_{t,d} \cdot IDF_t

Where:

  • TF is the normalized term frequency (raw count divided by the maximum term count) of term t in document d
  • IDF is the inverse document frequency of term t, computed with a logarithm

IDF_{t}= log(\frac{N}{DF_t}) + 1
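Putting the two factors together as a sketch (function names are illustrative; `n_docs` is the corpus size N):

```python
import math

def idf(df, n_docs):
    """Smoothed IDF as defined above: log(N / DF_t) + 1."""
    return math.log(n_docs / df) + 1

def tf_idf(tf, df, n_docs):
    """TF-IDF weight for one term in one document."""
    return tf * idf(df, n_docs)

# a term present in every document still gets weight tf * 1
print(tf_idf(0.5, 10, 10))
```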


 


Binary independence model (BIM)

The document vector has binary dimensions (term found or not found in the document). Performs no better than simple IDF-based weighting.
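A one-line sketch of the binary representation (assuming a fixed vocabulary list; the function name is illustrative):

```python
def binary_vector(doc_counts, vocab):
    """BIM representation: 1 if the term occurs in the document, else 0."""
    return [1 if doc_counts.get(t, 0) > 0 else 0 for t in vocab]

print(binary_vector({"cat": 2}, ["cat", "dog"]))  # [1, 0]
```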

BM25

A weighting model based on IDF and document-length normalization.
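The overview does not spell out the formula, so the sketch below uses the widely cited Okapi formulation with Lucene-style log smoothing; k1 and b are the usual free parameters and the defaults shown are common choices, not values from this document:

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """One term's BM25 contribution: IDF times a saturating,
    length-normalized TF. k1 controls TF saturation, b controls
    how strongly document length is penalized."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm = tf + k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / norm
```

Longer-than-average documents score lower through the `norm` term, which is exactly the document-length dependence mentioned above.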

Divergence From Randomness (DFR) [3]

Normalization effect – NED(α) [4]

  • can solve the “out-of-range” problem.

NE_D(\alpha) = Var \left ( \frac{T_d}{T_{d,max}} \right ), \quad d \in D

where D is the set of documents containing at least one of the query terms, d is a bin in D, and T_{d,max} is the maximum T_d among all the bins in D, which is the T_d of the bin with the smallest average document length (the smallest bin length), since T_d = tfn × tf is a decreasing function of document length.


Glasgow weight [5]

Terms receive greater weight in shorter documents than in longer ones. Like TF-IDF, it uses the global IDF measure.

w_{t,d} = \frac{log(tf_{t,d}+1)}{log(len_d)} \times \left ( log \left ( \frac{N}{DF_t} \right )+1 \right )
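A direct transcription of the formula as a sketch (illustrative function; `doc_len` must be greater than 1 so the denominator log is positive):

```python
import math

def glasgow_weight(tf, doc_len, df, n_docs):
    """Glasgow weight: log-dampened TF normalized by log document
    length, times the global log(N/DF_t) + 1 IDF factor."""
    return (math.log(tf + 1) / math.log(doc_len)) * (math.log(n_docs / df) + 1)

# the same tf counts for more in a 10-term document than in a 1000-term one
print(glasgow_weight(3, 10, 5, 100) > glasgow_weight(3, 1000, 5, 100))  # True
```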


Entropy [6]

Gives higher weight to terms that are more frequent overall but occur in fewer documents.

 

 


Information Gain – IG [7]

Based on raw term frequency and the chi-square (χ²) statistic, testing whether the occurrence of a term and a specific class are independent.

 

Mutual Information – MI [8]


Term Frequency – Relevance Frequency – tf.rf [9]

tf.rf_{t,d} = tf_{t,d} \cdot log \left ( 2 + \frac{a}{max(1,c)} \right )

where:

  • tf = term frequency
  • rf = relevance frequency
  • t = term
  • d = document
  • a = number of documents in the positive category containing the term
  • c = number of documents in the negative category containing the term

Introduces a relevance score computed from positive and negative training examples.
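A sketch of the tf.rf formula above (the function name is illustrative):

```python
import math

def tf_rf(tf, a, c):
    """tf.rf: term frequency scaled by the relevance factor
    log(2 + a / max(1, c)), so terms concentrated in the positive
    category get boosted."""
    return tf * math.log(2 + a / max(1, c))

# a term found in 8 positive and 2 negative documents: rf = log(2 + 4)
print(tf_rf(2, 8, 2))
```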


Methods that consider missing/absent terms:

Balanced Term Weighting Scheme [10]

Modifies the TF-IDF method by counting a single occurrence per document for missing terms.

Aspect Bernoulli (AB) model [11]

Extends the previous approach by distinguishing between “true absences” and “false absences”. It is based on the assumption that terms are not independent in their appearance in a document: if a term is present in a document, others will not occur in the same document.


Modified TF-IDF [12]

Extends the traditional TF, IDF and TF-IDF methods by including a factor that represents the terms missing from the document. Two modified components of TF-IDF are proposed:

  • Modified TF – mTF [2]

 

mTF_{t,d} = \frac{tf_{t,d} \times log \left ( \frac{\sqrt{T_c}}{T_t} \right )}{log \left [ \left ( \sum^n_{t=1} tf_{t,d}^2 \right ) \times \left ( \frac{len_d^2}{\sqrt{T_c}}\right ) \right ]}

Where:

  • len_d = length of the document, measured as the number of distinct terms
  • T_t = total count of term t in all documents
  • T_c = total token count of the corpus

T_t = \sum^D_{d=1} tf_{t,d}, \quad tf_{t,d} > 0

T_c = \sum^D_{d=1} \sum_t tf_{t,d}
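The mTF formula can be transcribed directly as a sketch (illustrative function; `sum_sq_tf` is the Σ tf² term, `term_total` is T_t, `corpus_total` is T_c, and the arguments must keep both logarithms well defined):

```python
import math

def m_tf(tf, sum_sq_tf, doc_len, term_total, corpus_total):
    """Modified TF per the formula above: TF scaled by a rarity
    factor log(sqrt(T_c)/T_t), normalized by a length- and
    corpus-size-dependent denominator."""
    num = tf * math.log(math.sqrt(corpus_total) / term_total)
    den = math.log(sum_sq_tf * (doc_len ** 2 / math.sqrt(corpus_total)))
    return num / den
```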

  • Modified IDF – mIDF

mIDF_t = log \left [ \frac{N}{1 / ((N-DF_t)+1)} \right ]

in simplified form:

mIDF_t = log \left ( N^2 - N \cdot DF_t + N \right )
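Both forms can be checked against each other in a quick sketch (function names are illustrative):

```python
import math

def m_idf(df, n_docs):
    """Modified IDF, simplified form: log(N^2 - N*DF_t + N)."""
    return math.log(n_docs ** 2 - n_docs * df + n_docs)

def m_idf_unsimplified(df, n_docs):
    """Original form: log(N / (1 / ((N - DF_t) + 1)))."""
    return math.log(n_docs / (1 / ((n_docs - df) + 1)))

# the two forms agree, and the weight grows as DF_t shrinks
print(math.isclose(m_idf(3, 10), m_idf_unsimplified(3, 10)))  # True
```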

 

The modified mTF and mIDF were evaluated in three combinations (mTF-IDF, TF-mIDF, mTF-mIDF), of which mTF-IDF achieved the highest performance (on two corpora).

 

mTFIDF_{t,d} = mTF_{t,d} \cdot IDF_t

TFmIDF_{t,d} = TF_{t,d} \cdot mIDF_t

mTFmIDF_{t,d} = mTF_{t,d} \cdot mIDF_t


Complex term weighting methods

Term Frequency – Inverse Gravity Moment (TF-IGM) [13]

IGM is a global weighting factor measuring the inter-class distribution concentration of a term for a known set of classes.

 


Notes

General categorization:

  • Supervised methods
    • Use categorical (class) information
  • Unsupervised methods
    • Are corpus-wide
  1. C.-M. Chen, H.-M. Lee, Y.-J. Chang, Two novel feature selection approaches for web page classification, Expert Syst. Appl. 36 (2009) 260–272. doi:10.1016/j.eswa.2007.09.008.
  2. G. Salton, A. Wong, C. Yang S., A vector space model for automatic indexing, Commun. ACM. 18 (1975) 613–620. doi:10.1145/361219.361220.
  3. B. He, I. Ounis, Term frequency normalisation tuning for BM25 and DFR models, Proc. 27th Eur. Conf. Inf. Retr. (2005) 200–214. doi:10.1007/b107096.
  4. B. He, I. Ounis, Term frequency normalisation tuning for BM25 and DFR models, Proc. 27th Eur. Conf. Inf. Retr. (2005) 200–214. doi:10.1007/b107096.
  5. Z. Sam Lee, M. Maarof, A. Selamat, S.M. Shamsuddin, Text Content Analysis For Illicit Web Pages By Using Neural Networks, J. Teknol. 50 (2009).
  6. A. Selamat, S. Omatu, Web page feature selection and classification using neural networks, Inf. Sci. (Ny). 158 (2004) 69–88. doi:10.1016/j.ins.2003.03.003.
  7. T. Mori, Information Gain Ratio As Term Weight: The Case of Summarization of IR Results, in: Proc. 19th Int. Conf. Comput. Linguist. – Vol. 1, Association for Computational Linguistics, Stroudsburg, PA, USA, 2002: pp. 1–7. doi:10.3115/1072228.1072246.
  8. Y. Yang, J.O. Pedersen, A comparative study on feature selection in text categorization, in: D.H. Fisher (Ed.), Proc. {ICML}-97, 14th Int. Conf. Mach. Learn., Morgan Kaufmann Publishers, San Francisco, US, Nashville, US, 1997: pp. 412–420. citeseer.nj.nec.com/yang97comparative.html.
  9. M. Lan, C.L. Tan, J. Su, Y. Lu, Supervised and traditional term weighting methods for automatic text categorization, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009) 721–735.
  10. D. Do, G. Ballard, P. Tillmann, Technical Report, (2015). doi:10.1016/S0007-8506(07)90004-9.
  11. E. Bingham, A. Kabán, M. Fortelius, The aspect Bernoulli model: Multiple causes of presences and absences, Pattern Anal. Appl. 12 (2009) 55–78. doi:10.1007/s10044-007-0096-4.
  12. T. Sabbah, A. Selamat, M.H. Selamat, F.S. Al-Anzi, E.H. Viedma, O. Krejcar, H. Fujita, Modified frequency-based term weighting schemes for text classification, Appl. Soft Comput. J. 58 (2017) 193–206. doi:10.1016/j.asoc.2017.04.069.
  13. K. Chen, Z. Zhang, J. Long, H. Zhang, Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Syst. Appl. 66 (2016) 245–260. doi:10.1016/j.eswa.2016.09.009.