### Term Frequency (TF)

Where:

*fr*_{td} = raw frequency of term *t* in document *d*

### Document Frequency – DF ^{1}

$DF_{t}=\sum_{d=1}^{n}\left\{\begin{array}{ll}1& t\in d\\ 0& t\notin d\end{array}\right.$

Global term weighting method, where terms occurring in more documents are considered more relevant.
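As a minimal sketch of the document-frequency count described above, the snippet below treats each document as a list of tokens. The function name `document_frequency` and the toy corpus are illustrative assumptions, not part of the cited work.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        # set() makes a term count once per document, however often it repeats
        df.update(set(tokens))
    return df


# Hypothetical toy corpus: each document is already tokenized
docs = [
    ["web", "page", "web"],
    ["page", "rank"],
    ["web", "search"],
]
df = document_frequency(docs)
```

Here `df["web"]` is 2 even though "web" appears three times in the corpus, because DF counts documents rather than occurrences.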

### Term Frequency – Inverse Document Frequency – TF-IDF ^{2}

The most common weight computation scheme used with the Vector Space Model is Term Frequency – Inverse Document Frequency (TF-IDF).

Where:

- *TF* is the normalized term frequency (raw / max) of term *t* in document *d*
- *IDF* is the inverse document frequency of term *t*, computed with log()
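The two components can be sketched as follows. This is one possible reading, assuming TF is the raw count divided by the largest raw count in the same document and IDF is the natural logarithm of (number of documents / document frequency) with no smoothing. The function name `tf_idf` and the toy corpus are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency of each term

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        max_count = max(counts.values()) if counts else 1
        weights.append({
            # normalized TF (raw / max) times IDF (log of N / DF)
            term: (count / max_count) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [["web", "page", "web"], ["page", "rank"], ["web", "search"]]
weights = tf_idf(docs)
```

A term that occurs in every document gets IDF = log(1) = 0 under this unsmoothed variant, so its weight vanishes regardless of how often it appears.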

### Binary independence model (BIM)

Each vector has binary dimensions (term found or not found in the document). Performs no better than simple IDF-based weighting.

### BM25

An IDF- and document-length-based weighting model.
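To illustrate how IDF and document length interact, here is a sketch of the widely used Okapi BM25 scoring function. The parameter defaults `k1=1.2` and `b=0.75` and the non-negative IDF variant `log((N - n + 0.5) / (n + 0.5) + 1)` are assumptions chosen for the example; other variants exist.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs) / n_docs
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        # Documents longer than average get a larger normalization factor
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query:
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    ["web", "page", "web"],
    ["page", "rank"],
    ["web", "search", "engine", "index"],
]
scores = bm25_scores(["web"], docs)
```

With this toy corpus the first document outranks the third: it contains the query term more often and is shorter than the longest document.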

### Divergence From Randomness (DFR) ^{3}

## Normalization effect – N_{ED}(α) ^{4}

- can solve the “out-of-range” problem.

where:

- *D* is the set of documents containing at least one of the query terms.
- *d* is a bin in *D*.
- *T*_{d,max} is the maximum *T*_{d} among all the bins in *D*, which is the *T*_{d} of the bin with the smallest average document length (the smallest bin length), since *T*_{d} = tfn x tf is a decreasing function of document length.

### Glasgow weight ^{5}

Terms have greater value in shorter documents than in longer ones. Like TF-IDF, it uses a global IDF measure.

### Entropy ^{6}

Gives higher weight to terms that are more frequent but occur in fewer documents.

### Information Gain – IG ^{7}

Based on raw term frequency and chi-square (χ^{2}), testing whether the occurrence of a term and a specific class are independent.
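The chi-square independence test mentioned here can be sketched for a single term and class using a 2×2 contingency table. The sketch covers only the chi-square statistic, not a complete weighting scheme, and the argument names (`a`, `b`, `c`, `d`) are assumptions introduced for this example.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A zero marginal means the test is undefined; treat it as no evidence
        return 0.0
    return n * (a * d - c * b) ** 2 / denom
```

A value of 0 means the term is distributed across the class and its complement exactly as independence would predict; larger values indicate a stronger association between term and class.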

### Mutual Information – MI ^{8}

### Term Frequency – Relevance Frequency – tf.rf ^{9}

where:

- tf = term frequency
- rf = relevance frequency
- t = term
- d = document
- a = positive examples
- c = negative examples

Introduces a relevance score computed from positive and negative training data.
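A minimal sketch of the weight, assuming the commonly cited form rf = log2(2 + a / max(1, c)), where `a` and `c` are taken as the numbers of positive and negative training documents that contain the term. The function names are illustrative.

```python
import math


def relevance_frequency(a, c):
    """rf = log2(2 + a / max(1, c)); max(1, c) avoids division by zero."""
    return math.log2(2 + a / max(1, c))


def tf_rf(tf, a, c):
    """Combine the local term frequency with the relevance frequency."""
    return tf * relevance_frequency(a, c)
```

A term seen in no positive documents gets rf = log2(2) = 1, so its weight falls back to the plain term frequency, while terms concentrated in the positive class are boosted.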

### Balanced Term Weighting Scheme ^{10}

Modifies the TF-IDF method by counting a single occurrence per document for missing terms.

### Aspect Bernoulli (AB) model ^{11}

Extends the previous approach by distinguishing between “true absences” and “false absences”. It is based on the assumption that terms are not independent in their appearance in a document: if a term is present in the document, others will not occur in the same document.

### Modified TF-IDF ^{12}

Extends the traditional TF, IDF and TF-IDF methods by including a factor in the calculation that represents the terms missing from the document. Two modified components of TF-IDF are proposed:

- Modified TF – mTF [2]

Where:

- *len*_{d} = length of the document, measured as the number of distinct terms
- *T*_{t} = total count of a term in all documents
- *T*_{c} = total token count of the corpus

- Modified IDF – mIDF

in simplified form:

Modified mTF and mIDF are evaluated in three combinations (mTF-IDF, TF-mIDF, mTF-mIDF), where mTF-IDF had the highest performance (for two corpora).

### Term Frequency – Inverse Gravity Moment (TF-IGM) ^{13}

IGM is a global weighting factor measuring the inter-class distribution concentration of a term for a known set of classes.
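The idea of inter-class concentration can be sketched as follows, assuming the commonly cited form: the term's class frequencies are sorted in descending order, IGM is the largest frequency divided by the rank-weighted sum of all of them, and the final weight is tf × (1 + λ × igm). The default `lam=7.0` and the function names are assumptions made for this example.

```python
def igm(class_freqs):
    """Inverse gravity moment of a term's per-class frequencies."""
    ranked = sorted(class_freqs, reverse=True)
    # Gravity moment: each frequency weighted by its rank (1 = most frequent)
    moment = sum(freq * rank for rank, freq in enumerate(ranked, start=1))
    if moment == 0:
        return 0.0
    return ranked[0] / moment


def tf_igm(tf, class_freqs, lam=7.0):
    """Local term frequency scaled by the global IGM factor."""
    return tf * (1 + lam * igm(class_freqs))
```

A term that occurs in only one class reaches the maximum IGM of 1, while a term spread evenly over three classes gets 1/6, so class-specific terms receive the larger weight.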

Term Generality Index – TGI

General categorization:

- Supervised methods
  - Use categorical information
- Unsupervised methods
  - Are corpus-wide

1. C.-M. Chen, H.-M. Lee, Y.-J. Chang, Two novel feature selection approaches for web page classification, Expert Syst. Appl. 36 (2009) 260–272. doi:10.1016/j.eswa.2007.09.008.
2. G. Salton, A. Wong, C.S. Yang, A vector space model for automatic indexing, Commun. ACM. 18 (1975) 613–620. doi:10.1145/361219.361220.
3. B. He, I. Ounis, Term frequency normalisation tuning for BM25 and DFR models, Proc. 27th Eur. Conf. Inf. Retr. (2005) 200–214. doi:10.1007/b107096.
4. B. He, I. Ounis, Term frequency normalisation tuning for BM25 and DFR models, Proc. 27th Eur. Conf. Inf. Retr. (2005) 200–214. doi:10.1007/b107096.
5. Z. Sam Lee, M. Maarof, A. Selamat, S.M. Shamsuddin, Text Content Analysis For Illicit Web Pages By Using Neural Networks, J. Teknol. 50 (2009).
6. A. Selamat, S. Omatu, Web page feature selection and classification using neural networks, Inf. Sci. (Ny). 158 (2004) 69–88. doi:10.1016/j.ins.2003.03.003.
7. T. Mori, Information Gain Ratio As Term Weight: The Case of Summarization of IR Results, in: Proc. 19th Int. Conf. Comput. Linguist. – Vol. 1, Association for Computational Linguistics, Stroudsburg, PA, USA, 2002: pp. 1–7. doi:10.3115/1072228.1072246.
8. Y. Yang, J.O. Pedersen, A comparative study on feature selection in text categorization, in: D.H. Fisher (Ed.), Proc. ICML-97, 14th Int. Conf. Mach. Learn., Morgan Kaufmann Publishers, San Francisco, US, Nashville, US, 1997: pp. 412–420. citeseer.nj.nec.com/yang97comparative.html.
9. M. Lan, C.L. Tan, J. Su, Y. Lu, Supervised and traditional term weighting methods for automatic text categorization, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009) 721–735.
10. D. Do, G. Ballard, P. Tillmann, Technical Report, (2015). doi:10.1016/S0007-8506(07)90004-9.
11. E. Bingham, A. Kabán, M. Fortelius, The aspect Bernoulli model: Multiple causes of presences and absences, Pattern Anal. Appl. 12 (2009) 55–78. doi:10.1007/s10044-007-0096-4.
12. T. Sabbah, A. Selamat, M.H. Selamat, F.S. Al-Anzi, E.H. Viedma, O. Krejcar, H. Fujita, Modified frequency-based term weighting schemes for text classification, Appl. Soft Comput. J. 58 (2017) 193–206. doi:10.1016/j.asoc.2017.04.069.
13. K. Chen, Z. Zhang, J. Long, H. Zhang, Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Syst. Appl. 66 (2016) 245–260. doi:10.1016/j.eswa.2016.09.009.