Text Feature Extraction (2/3): TF-IDF Model

Shachi Kaul
Published in Geek Culture
6 min read · Jun 29, 2021


In Natural Language Processing, any text-based problem needs to be converted into a form that can be modeled. A simple text can be converted into features with various techniques such as Bag of Words (BOW), TF-IDF, or word embeddings. In the last blog of the Text Feature Extraction series, we studied CountVectorizer from scratch and its use case in text classification. We saw there that it has a major drawback: a lack of semantic meaning. CountVectorizer only counts word occurrences to create features, so it doesn’t take sentence structure and word order into account. It also results in a large sparse matrix. This is where TF-IDF comes into the picture.

This blog covers the TF-IDF technique in detail. So, let’s get started.

Aim

The TF-IDF technique surfaces the relevant terms of a text: the terms through which the whole context can be understood without reading the entire text.

Intuition

  • A word occurring many times within a doc implies it is important to that doc (TF).
  • But if it also appears too frequently across multiple docs, it may not be relevant (IDF). Such words are typically stopwords such as the, this, etc.

Theory & Concept

  • TF-IDF (Term Frequency-Inverse Document Frequency) depicts the importance of a word. It is computed as the product of TF and IDF. Before that, let’s understand the two terms individually.
  • Term Frequency (TF):
    - It demonstrates the importance of a word to a doc, with the intuition that the more often a term appears in the doc, the higher its importance.
equation-1:  TF(t, d) = (number of times term t appears in doc d) / (total number of terms in doc d)
  • Inverse Document Frequency (IDF):
    - Shows how relevant a term actually is. A term that appears frequently is not necessarily relevant, e.g. stopwords (the, that, of, etc.). Stopwords do not reveal the context and should carry little weight; IDF works in such a way that it suppresses them.
    - It penalizes words appearing frequently across docs
    - The IDF score is higher for relevant terms and lower for stopwords
    - It uses the natural logarithmic function, i.e. log base e
equation-2:  IDF(t) = ln(N / n), where N is the number of docs in the corpus and n is the number of docs containing term t
  • In a nutshell, the TF-IDF value relates to a single doc, while the IDF part depends on the whole corpus
  • Computing TF-IDF manually with the standard formula differs from Sklearn’s TF-IDF.
    Difference: the TF term remains the same while the IDF term differs. Let’s dive in!
    Standard TF-IDF
    The standard notation is shown below, where N is the number of docs in the corpus and n is the number of docs containing term t.
equation-3:  TF-IDF(t, d) = TF(t, d) * IDF(t) = TF(t, d) * ln(N / n)

Sklearn TF-IDF
By definition, TF-IDF should follow the above formula, but Sklearn computes it a bit differently. With their default settings, TfidfVectorizer and TfidfTransformer differ from the standard as follows:
- 1 is added to the numerator (the doc count N)
- 1 is added to the denominator (the doc frequency) to prevent zero division
- 1 is added to the whole log term, so that terms occurring in every doc (IDF = 0) are not entirely ignored
- The resulting TF * IDF vectors are then normalized with the L2 (Euclidean) norm

equation-4:  IDF_sklearn(t) = ln((1 + N) / (1 + n)) + 1, and each doc’s TF-IDF vector v is then L2-normalized: v_norm = v / ||v||

Curious for more? Visit this StackOverflow link and a blog with in-depth detail.
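
To make the difference concrete, here is a minimal sketch (not from the original post; it assumes scikit-learn and NumPy are installed and uses the same toy corpus as the Process section below) that computes the smoothed IDF by hand and checks it against the idf_ values learned by TfidfVectorizer:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["she is a wanderlust", "she is lovely"]

vectorizer = TfidfVectorizer()   # smooth_idf=True and norm='l2' by default
vectorizer.fit(corpus)

N = len(corpus)
for term, col in vectorizer.vocabulary_.items():
    # document frequency: number of docs containing this term
    df = sum(term in doc.split() for doc in corpus)
    # sklearn's smoothed IDF: ln((1 + N) / (1 + df)) + 1
    manual_idf = np.log((1 + N) / (1 + df)) + 1
    print(term, round(manual_idf, 4), round(vectorizer.idf_[col], 4))

Both columns agree: “she” and “is” occur in every doc and get IDF 1.0, while “lovely” and “wanderlust” get ln(3/2) + 1 ≈ 1.41.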

Process

Let’s cut to the chase and study the steps to build a TF-IDF model. We will be following the standard notation from equation-3.

corpus = ["She is a wanderlust", "She is lovely"]

  1. Compute TF: Refer to equation-1.

TF_doc2(“is”) = 1/3
TF_doc2(“lovely”) = 1/3

  2. Compute IDF: Refer to equation-2.

IDF(“is”) = ln(2/2) = 0
IDF(“lovely”) = ln(2/1) ≈ 0.693
As can be clearly seen, the weight of “is” is lower than that of “lovely”, so “lovely” appears more relevant.

  3. Compute TF-IDF: Multiply the TF and IDF of each word (equation-3):

TF-IDF(“is”) = TF * IDF = (1/3) * 0 = 0
TF-IDF(“lovely”) = (1/3) * 0.693 ≈ 0.231

The results show that the word “is” is irrelevant while “lovely” holds some importance; reading just the word “lovely” already distinguishes the sentence.

Result summary as follows:

table-1: Result summary for doc2 (“She is lovely”)

term       TF      IDF      TF-IDF
is         1/3     0.000    0.000
lovely     1/3     0.693    0.231

Implementation

To understand the TF-IDF model, let’s first implement it manually and then with Sklearn.

  1. Manually
  • Let’s create our corpus of sentences and convert them into lowercase so that “She” and “she” are not treated as different words.
corpus = ["She is a wanderlust", "She is lovely"]
# Convert into lowercase
corpus = list(map(str.lower, corpus))
  • Create a bag of words for each doc in the corpus
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
count_occurrences = cv.fit_transform(corpus)

# For doc1
bagOfWords_1 = dict.fromkeys(cv.get_feature_names())
for ind, key in enumerate(bagOfWords_1):
    bagOfWords_1[key] = count_occurrences.toarray()[0][ind]
bagOfWords_1
Out[2]:
{'is': 1, 'lovely': 0, 'she': 1, 'wanderlust': 1}

# For doc2
bagOfWords_2 = dict.fromkeys(cv.get_feature_names())
for ind, key in enumerate(bagOfWords_2):
    bagOfWords_2[key] = count_occurrences.toarray()[1][ind]
bagOfWords_2
Out[3]:
{'is': 1, 'lovely': 1, 'she': 1, 'wanderlust': 0}
  • Compute TF
def compute_tf(bow, doc):
    # TF = count of the term in the doc / total number of terms in the doc (equation-1)
    tf_dict = {}
    doc_count = len(doc)
    for word, count in bow.items():
        tf_dict[word] = count / doc_count
    return tf_dict

tf_doc1 = compute_tf(bagOfWords_1, corpus[0].split(' '))
tf_doc2 = compute_tf(bagOfWords_2, corpus[1].split(' '))
tf_doc1
Out[4]:
{'is': 0.25, 'lovely': 0.0, 'she': 0.25, 'wanderlust': 0.25}
  • Compute IDF
import math

def compute_idf(docs):
    # IDF = ln(N / n): N docs in the corpus, n docs containing the term (equation-2)
    N = len(docs)
    idfDict = dict.fromkeys(docs[0].keys(), 0)
    # count in how many docs each word appears
    for doc in docs:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    for word, val in idfDict.items():
        # standard notation
        idfDict[word] = math.log(N / float(val))
        # sklearn notation
        # idfDict[word] = math.log((N + 1) / (val + 1)) + 1
    return idfDict

idfs = compute_idf([bagOfWords_1, bagOfWords_2])
idfs
Out[5]:
{'is': 0.0,
 'lovely': 0.6931471805599453,
 'she': 0.0,
 'wanderlust': 0.6931471805599453}
  • Compute TF * IDF
def compute_tfidf(tf, idf):
    # TF-IDF = TF * IDF for each word (equation-3)
    tfidf = {}
    for word, tfVal in tf.items():
        tfidf[word] = tfVal * idf[word]
    return tfidf

tfidf_doc1 = compute_tfidf(tf_doc1, idfs)
tfidf_doc2 = compute_tfidf(tf_doc2, idfs)
tfidf_doc1
Out[6]:
{'is': 0.0, 'lovely': 0.0, 'she': 0.0, 'wanderlust': 0.17328679513998632}

  2. Sklearn
The above from-scratch steps can be done in just a few lines of code. Scikit-learn provides a class called TfidfVectorizer which computes the TF-IDF weights.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(corpus).todense()
vectors
"""
matrix([[0.50154891, 0.        , 0.50154891, 0.70490949],
        [0.50154891, 0.70490949, 0.50154891, 0.        ]])
"""

Drawbacks

  • Context understanding
    Bag of Words and TF-IDF techniques lack an understanding of context. TF-IDF can tell which terms matter in a doc, but it ignores word order and semantics.
  • Large vocabulary
    With a large vocabulary, the feature matrix becomes so voluminous (and sparse) that it strains memory and computation time; see the sketch after this list.
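
As a rough, purely illustrative sketch of the second point (the corpus below is made up, not from the post): even a modest synthetic corpus produces a wide, mostly-zero feature matrix, which is why sklearn returns a scipy sparse matrix rather than a dense one.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["doc %d talks about topic %d" % (i, i) for i in range(1000)]   # hypothetical corpus
X = TfidfVectorizer().fit_transform(docs)       # returns a scipy sparse matrix
print(X.shape)                                  # (1000, ~1000 features)
print(X.nnz / (X.shape[0] * X.shape[1]))        # tiny fraction of non-zero entries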

To overcome the above cons, a newer feature extraction technique called Word Embedding was introduced. Let’s learn about it in Text Feature Extraction (3/3): Word Embedding Model.

Check out my GitHub repo summarizing all the code demonstrated here.

Also, feel free to explore how to develop a TF-IDF model to predict movie review sentiment via this GitHub repository.


Feel free to follow this author if you liked the blog; more interesting ML/AI-related content is on the way.
Thanks,
Happy Learning! 😄

You can get in touch via LinkedIn.

