What is TF-IDF?

TF-IDF stands for “Term Frequency, Inverse Document Frequency.” It’s a way to score the importance of words (or “terms”) in a document based on how frequently they appear in that document and across a collection of documents.

Intuitively: if a word appears frequently in a document, it’s important, so give the word a high score. But if a word appears in many documents, it’s not a unique identifier, so give the word a low score. Therefore, common words like “the” and “for,” which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.
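In formula form (this is the classic textbook definition; scikit-learn’s defaults add smoothing and normalization, so its exact numbers will differ slightly), the score of a term t in a document d is:

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t))

Here tf(t, d) is the number of times t appears in d, N is the total number of documents, and df(t) is the number of documents that contain t.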

Scikit-learn provides two methods to get to our end result (a tf-idf weight matrix).
One is a two-part process: the CountVectorizer class counts how many times each term shows up in each document, and the TfidfTransformer class then generates the weight matrix.
The other does both steps in a single TfidfVectorizer class, as in the sketch below.
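Here is a minimal sketch of the one-step approach with TfidfVectorizer (the documents are made-up placeholders, not taken from the original notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus used purely for illustration
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

# TfidfVectorizer counts terms and computes the tf-idf weights in one step
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # learned vocabulary (get_feature_names() on older scikit-learn)
print(tfidf_matrix.toarray())              # one row of tf-idf weights per document
```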

After setting up our CountVectorizer, we follow the general fit/transform convention of scikit-learn, starting with fitting.
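As a minimal sketch of that two-part flow (again with placeholder documents; the complete version is in the notebook linked below):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Same toy corpus as above, purely for illustration
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

# Step 1: fit learns the vocabulary, transform produces the raw term-count matrix
count_vectorizer = CountVectorizer()
count_vectorizer.fit(docs)
counts = count_vectorizer.transform(docs)

# Step 2: TfidfTransformer turns those counts into the tf-idf weight matrix
tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(counts)

print(tfidf_matrix.toarray())
```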

You can find the complete notebook related to this post on my GitHub (link below); feel free to ask any questions about this post:

https://github.com/abhibisht89/tfidf_example
