Understanding the significance of a word in a text is crucial for analyzing and interpreting large volumes of data. This is where the term frequency-inverse document frequency (TF-IDF) technique in Natural Language Processing (NLP) comes into play. By overcoming the limitations of the traditional bag of words approach, TF-IDF enhances text classification and bolsters machine learning models’ ability to comprehend and analyze textual information effectively. This article will show you how to build a TF-IDF model from scratch in Python and how to compute it numerically.
It also shows how to use TfidfVectorizer from scikit-learn to convert text documents into a TF-IDF matrix.
Before diving into the calculations and code, it’s essential to understand the key terms:
Term frequency (TF) measures how often a term occurs in a document. A term’s weight in a document is directly proportional to its frequency of occurrence. The TF formula is:
TF(t, d) = (number of times t appears in d) / (total number of terms in d)
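As a minimal sketch of this definition, the snippet below computes TF for a single document, assuming TF is the term's count divided by the document length. The `tokenize` and `term_frequency` helpers are illustrative names, and the simple lowercase, letters-only tokenizer is an assumption rather than a prescribed preprocessing step.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency(text):
    """Return {term: count / total number of terms} for one document."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("The sun in the sky is bright.")
print(tf["the"])  # "the" accounts for 2 of the 7 tokens
print(tf["sky"])  # "sky" accounts for 1 of the 7 tokens
```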
Document frequency (DF) measures how common a term is across the corpus. Whereas TF counts the occurrences of a term within a single document, DF counts the number of documents that contain the term at least once. The DF formula is:
DF(t) = number of documents containing t
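The following sketch counts document frequency over the four sample sentences used in this article's worked example. The `document_frequency` helper and the regex tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

docs = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]


def document_frequency(documents):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for text in documents:
        # A set ensures that repeated words count only once per document.
        df.update(set(re.findall(r"[a-z]+", text.lower())))
    return df


df = document_frequency(docs)
print(df["the"], df["sun"], df["sky"], df["blue"])  # 4 3 2 1
```

Note that "sun" appears twice in the fourth sentence but still contributes only 1 to its document frequency.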
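Inverse document frequency (IDF) measures how informative a term is. TF gives every term the same weight regardless of how common it is across the corpus, whereas IDF scales up rare terms and scales down common ones (such as stop words). The IDF formula is:
IDF(t) = log(N / (DF(t) + 1))
where N is the total number of documents, DF(t) is the number of documents containing the term t, and log is the natural logarithm. The +1 in the denominator avoids division by zero for unseen terms; as a side effect, a term that appears in every document gets a slightly negative IDF.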
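A short sketch of the IDF step, assuming the natural logarithm and the +1 in the denominator that the worked tables in this article use. The function name and the hard-coded document frequencies are illustrative.

```python
import math


def inverse_document_frequency(df, n_docs):
    """Return {term: ln(N / (DF(t) + 1))} for every term in df."""
    return {term: math.log(n_docs / (count + 1)) for term, count in df.items()}


# Document frequencies of a few terms in a four-document corpus.
df = {"the": 4, "sky": 2, "is": 3, "blue": 1}
idf = inverse_document_frequency(df, n_docs=4)

for term, value in idf.items():
    print(f"{term}: {value:.3f}")
```

A term found in all four documents ends up with a negative weight, and a term found in three of four gets exactly zero, which is a property of this particular smoothing choice.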
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document in a collection or corpus. It combines the importance of a term within a document (TF) with the term’s rarity across the corpus (IDF). The formula is:
TF-IDF(t, d) = TF(t, d) × IDF(t)
Let’s break down the numerical calculation of TF-IDF for four sample documents, starting with the term frequency of each term:
Document 1: “The sky is blue.”
Term | Count | TF |
the | 1 | 1/4 |
sky | 1 | 1/4 |
is | 1 | 1/4 |
blue | 1 | 1/4 |
Document 2: “The sun is bright today.”
Term | Count | TF |
the | 1 | 1/5 |
sun | 1 | 1/5 |
is | 1 | 1/5 |
bright | 1 | 1/5 |
today | 1 | 1/5 |
Document 3: “The sun in the sky is bright.”
Term | Count | TF |
the | 2 | 2/7 |
sun | 1 | 1/7 |
in | 1 | 1/7 |
sky | 1 | 1/7 |
is | 1 | 1/7 |
bright | 1 | 1/7 |
Document 4: “We can see the shining sun, the bright sun.”
Term | Count | TF |
we | 1 | 1/9 |
can | 1 | 1/9 |
see | 1 | 1/9 |
the | 2 | 2/9 |
shining | 1 | 1/9 |
sun | 2 | 2/9 |
bright | 1 | 1/9 |
Next, compute the document frequency and IDF of each term, using N = 4 and the natural logarithm:
Term | DF | IDF |
the | 4 | log(4/(4+1)) = log(0.8) ≈ -0.223 |
sky | 2 | log(4/(2+1)) = log(1.333) ≈ 0.288 |
is | 3 | log(4/(3+1)) = log(1) = 0 |
blue | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
sun | 3 | log(4/(3+1)) = log(1) = 0 |
bright | 3 | log(4/(3+1)) = log(1) = 0 |
today | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
in | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
we | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
can | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
see | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
shining | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
Now, let’s calculate the TF-IDF values for each term in each document.
Document 1: “The sky is blue.”
Term | TF | IDF | TF-IDF |
the | 0.25 | -0.223 | 0.25 * -0.223 ≈ -0.056 |
sky | 0.25 | 0.288 | 0.25 * 0.288 ≈ 0.072 |
is | 0.25 | 0 | 0.25 * 0 = 0 |
blue | 0.25 | 0.693 | 0.25 * 0.693 ≈ 0.173 |
Document 2: “The sun is bright today.”
Term | TF | IDF | TF-IDF |
the | 0.2 | -0.223 | 0.2 * -0.223 ≈ -0.045 |
sun | 0.2 | 0 | 0.2 * 0 = 0 |
is | 0.2 | 0 | 0.2 * 0 = 0 |
bright | 0.2 | 0 | 0.2 * 0 = 0 |
today | 0.2 | 0.693 | 0.2 * 0.693 ≈ 0.139 |
Document 3: “The sun in the sky is bright.”
Term | TF | IDF | TF-IDF |
the | 0.286 | -0.223 | 0.286 * -0.223 ≈ -0.064 |
sun | 0.143 | 0 | 0.143 * 0 = 0 |
in | 0.143 | 0.693 | 0.143 * 0.693 ≈ 0.099 |
sky | 0.143 | 0.288 | 0.143 * 0.288 ≈ 0.041 |
is | 0.143 | 0 | 0.143 * 0 = 0 |
bright | 0.143 | 0 | 0.143 * 0 = 0 |
Document 4: “We can see the shining sun, the bright sun.”
Term | TF | IDF | TF-IDF |
we | 0.111 | 0.693 | 0.111 * 0.693 ≈ 0.077 |
can | 0.111 | 0.693 | 0.111 * 0.693 ≈ 0.077 |
see | 0.111 | 0.693 | 0.111 * 0.693 ≈ 0.077 |
the | 0.222 | -0.223 | 0.222 * -0.223 ≈ -0.050 |
shining | 0.111 | 0.693 | 0.111 * 0.693 ≈ 0.077 |
sun | 0.222 | 0 | 0.222 * 0 = 0 |
bright | 0.111 | 0 | 0.111 * 0 = 0 |
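The hand calculation above can be reproduced with a small from-scratch implementation. This is a sketch under the same assumptions as the tables (TF as count divided by document length, IDF as the natural log of N / (DF + 1)); the function names and the regex tokenizer are illustrative choices, not a reference implementation.

```python
import math
import re
from collections import Counter

docs = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_from_scratch(documents):
    """Return one {term: TF-IDF score} dict per document."""
    tokenized = [tokenize(text) for text in documents]
    n_docs = len(tokenized)
    # DF: number of documents that contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # IDF with the +1 in the denominator, using the natural logarithm.
    idf = {term: math.log(n_docs / (count + 1)) for term, count in df.items()}
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return scores


scores = tfidf_from_scratch(docs)
for number, doc_scores in enumerate(scores, start=1):
    print(f"Document {number}")
    for term, value in doc_scores.items():
        print(f"  {term:<8} {value: .3f}")
```

Because the code works with unrounded intermediate values, a printed score can differ in the last digit from a product of pre-rounded TF and IDF values.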
Now let’s apply TF-IDF with scikit-learn’s TfidfVectorizer, using the built-in 20 Newsgroups dataset.
Ensure you have scikit-learn and pandas installed:
pip install scikit-learn pandas
Import the necessary libraries:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
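Fetch the training split of the 20 Newsgroups dataset (it is downloaded on first use):
newsgroups = fetch_20newsgroups(subset="train")
Initialize the TfidfVectorizer, removing English stop words and keeping the 1,000 most frequent terms:
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)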
Convert the text documents to a TF-IDF matrix:
tfidf_matrix = vectorizer.fit_transform(newsgroups.data)
Convert the matrix to a DataFrame for better readability:
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
df_tfidf.head()
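The numbers produced by TfidfVectorizer will not match a hand calculation based on log(N / (DF + 1)). With its default settings, scikit-learn uses raw counts for TF, computes IDF as ln((1 + N) / (1 + DF)) + 1, and then scales each row to unit Euclidean length. The sketch below checks this on the four sample sentences instead of 20 Newsgroups, so that it runs without a download.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
matrix = vectorizer.fit_transform(docs)

n_docs = len(docs)
for term, df in [("the", 4), ("sun", 3), ("blue", 1)]:
    learned = vectorizer.idf_[vectorizer.vocabulary_[term]]
    expected = np.log((1 + n_docs) / (1 + df)) + 1
    print(f"{term}: learned={learned:.3f} expected={expected:.3f}")

# Each row of the matrix has Euclidean norm 1 after L2 normalization.
print(np.linalg.norm(matrix.toarray(), axis=1))
```

Under this variant, IDF is never below 1, so a term that appears in every document is down-weighted rather than pushed to zero or a negative value.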
By using the 20 Newsgroups dataset and TfidfVectorizer, you can convert a large collection of text documents into a TF-IDF matrix. This matrix numerically represents the importance of each term in each document, facilitating various NLP tasks such as text classification, clustering, and more advanced text analysis. The TfidfVectorizer from scikit-learn provides an efficient and straightforward way to achieve this transformation.
Ans. Taking the log dampens the raw ratio of N to DF, which can grow very large for rare terms in a large corpus. This keeps IDF values in a manageable range, so rare terms are still weighted more heavily than common ones without dominating the score.
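To make the dampening effect concrete, the sketch below compares the raw ratio with its logarithm. The corpus size of one million documents and the chosen DF values are hypothetical numbers picked for illustration.

```python
import math

n_docs = 1_000_000  # hypothetical corpus size, chosen for illustration

rows = []
for df in (1, 100, 10_000, 999_999):
    ratio = n_docs / (df + 1)  # raw, undamped ratio
    rows.append((df, ratio, math.log(ratio)))

for df, ratio, idf in rows:
    print(f"DF={df:>9,}  N/(DF+1)={ratio:>9,.0f}  IDF={idf:6.2f}")
```

The raw ratio spans six orders of magnitude, while the logged values stay within a narrow range of roughly 0 to 13.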
Ans. Yes, TF-IDF can be used for large datasets. However, efficient implementation and adequate computational resources are required to handle the large matrix computations involved.
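One practical point for large corpora is that TfidfVectorizer returns a SciPy sparse matrix, which stores only the non-zero entries. The sketch below uses the four sample sentences to show the idea; on a real corpus with a large vocabulary, calling `toarray()` on the full matrix may exhaust memory, so it is usually safer to keep the result sparse.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

matrix = TfidfVectorizer().fit_transform(docs)

print(sparse.issparse(matrix))  # the result is a SciPy sparse matrix
print(matrix.shape)             # (documents, vocabulary size)

# Only non-zero cells are stored; compare with the full dense size.
dense_cells = matrix.shape[0] * matrix.shape[1]
print(f"{matrix.nnz} of {dense_cells} cells are stored")
```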
Ans. A key limitation of TF-IDF is that it ignores word order and context: it treats each term independently and can therefore miss the meaning of phrases and the relationships between words.
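A small illustration of this limitation, using two made-up sentences that contain the same words in a different order. With single-word features the two vectors are identical; adding word pairs through the `ngram_range` parameter is one possible way to recover some local order, shown here only as an example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical sentences with the same words and opposite meanings.
docs = ["the dog bites the man", "the man bites the dog"]

# Single-word features: word order is lost, so the vectors coincide.
unigrams = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(unigrams[0], unigrams[1]))  # True

# Adding two-word features ("dog bites" vs. "man bites") separates them.
bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs).toarray()
print(np.allclose(bigrams[0], bigrams[1]))  # False
```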
Ans. TF-IDF is used in various applications, including:
1. Search engines to rank documents based on relevance to a query
2. Text classification to identify the most significant words for categorizing documents
3. Clustering to group similar documents based on key terms
4. Text summarization to extract important sentences from a document
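As a sketch of the first application, the snippet below ranks the four sample sentences against a query by cosine similarity between TF-IDF vectors. The query string "blue sky" is a made-up example, and cosine similarity is one common scoring choice among several.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Project the query into the same vocabulary learned from the documents.
query_vector = vectorizer.transform(["blue sky"])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]  # document indices, best match first

for index in ranking:
    print(f"{scores[index]:.3f}  {docs[index]}")
```

The first sentence ranks highest because it contains both query terms, the third follows because it contains "sky", and the other two score zero.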