Representing Text (BoW & TF-IDF)
Subject: Natural language processing (VU-CSC 322)
Introduction: Why Represent Text as Numbers?
Computers don’t understand words like humans do. They understand numbers, not language.
So before a machine can work with text (like analyzing WhatsApp chats, tweets, or student feedback), we must convert text into numbers. This process is called:
Text Vectorization (or Text Representation)
Example
Imagine giving a computer these two sentences:
- “I love jollof rice”
- “I hate jollof rice”
To a human, these mean opposite things.
To a computer, both are just strings unless we convert them into numbers.
Another example: if you say “I love pizza”, the computer needs to turn that into something like [1, 1, 1], one count for each of the words “I”, “love”, and “pizza”.
Bag-of-Words (BoW)
Bag-of-Words is the simplest way to represent text. It counts how many times each word appears in a sentence, ignoring grammar and order.
Example
“I love jollof rice”
“I love fried rice”
“I hate burnt rice”
Step 1: Create Vocabulary
List all unique words:
[I, love, jollof, rice, fried, hate, burnt]
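Step 2: Count Words
Count how many times each vocabulary word appears in each sentence:
Sentence | I | love | jollof | rice | fried | hate | burnt
“I love jollof rice” | 1 | 1 | 1 | 1 | 0 | 0 | 0
“I love fried rice” | 1 | 1 | 0 | 1 | 1 | 0 | 0
“I hate burnt rice” | 1 | 0 | 0 | 1 | 0 | 1 | 1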
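The two steps above can be sketched in plain Python, without any library. This is only an illustration: the variable names `vocabulary` and `bow_vectors` are made up for this sketch, and words are split on spaces with no cleaning.

```python
# Manual Bag-of-Words for the three example sentences.
sentences = [
    "I love jollof rice",
    "I love fried rice",
    "I hate burnt rice",
]

# Step 1: build the vocabulary in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Step 2: one count vector per sentence, one position per vocabulary word.
bow_vectors = [
    [sentence.split().count(word) for word in vocabulary]
    for sentence in sentences
]

print(vocabulary)
for sentence, vector in zip(sentences, bow_vectors):
    print(vector, sentence)
```

Each printed list is the vector for one sentence, with positions in the same order as the vocabulary.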
Key Idea
- Each sentence becomes a vector (list of numbers).
- Order of words does NOT matter.
Advantages
- It is very simple
- It is easy to understand and implement
Disadvantages
- Ignores meaning and word order
- Common words like “I” may dominate
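To see what ignoring word order costs, here is a small sketch. The two sentences and the helper `bag_of_words` are invented for this illustration; they are not part of the lesson's data.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count lowercase words, discarding their order."""
    return Counter(sentence.lower().split())

a = "I love rice but hate beans"
b = "I hate rice but love beans"

# Both sentences contain exactly the same words, so their bags are identical.
same_bag = bag_of_words(a) == bag_of_words(b)
print(same_bag)  # True
```

The two sentences mean opposite things, yet BoW gives them the same representation.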
TF-IDF (Term Frequency – Inverse Document Frequency)
TF-IDF improves BoW by giving importance to meaningful words and reducing the weight of common words.
Idea Behind TF-IDF
- Words like “rice” appear in all sentences, so they are less important.
- Words like “jollof” or “burnt” appear in only one sentence, so they are more informative.
Two Parts
1. Term Frequency (TF): how often a word appears in a sentence. For example, in a classroom a word like "student" is used frequently, so a high count on its own does not make it useful.
2. Inverse Document Frequency (IDF): how rare the word is across all sentences. In a classroom a word like "scholarship" is rarely used, so it is more meaningful when it does appear.
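The two parts can be combined as TF × IDF. The sketch below uses one common textbook form, IDF = log(N / df), where N is the number of sentences and df is the number of sentences containing the word. This is an assumption made for illustration: libraries such as scikit-learn use a smoothed and normalised variant, so these numbers will not match the scikit-learn output shown later in this topic.

```python
import math

sentences = [
    "I love jollof rice",
    "I love fried rice",
    "I hate burnt rice",
]
docs = [s.lower().split() for s in sentences]

def tf(word, doc):
    # Term frequency: raw count of the word in one sentence.
    return doc.count(word)

def idf(word, docs):
    # Inverse document frequency: log(N / number of sentences with the word).
    # Assumes the word occurs in at least one sentence (otherwise df is 0).
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# Scores for three words in the first sentence.
for word in ["rice", "love", "jollof"]:
    print(word, round(tf_idf(word, docs[0], docs), 3))
```

In this plain form, “rice” scores exactly 0 because it appears in every sentence, while “jollof” scores highest because it appears in only one.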
Therefore, TF-IDF gives:
- Important (rare) words a higher weight
- Common words a lower weight
4. Hands-on with Python (scikit-learn)
Step 1: Install Library
pip install scikit-learn
Step 2: Bag-of-Words Example
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love jollof rice",
    "I love fried rice",
    "I hate burnt rice",
]
# Learn the vocabulary and count each word in each sentence
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # vocabulary, in column order
print(X.toarray())  # one row of counts per sentence
Output: BoW Matrix
['burnt' 'fried' 'hate' 'jollof' 'love' 'rice']
[[0 0 0 1 1 1]
 [0 1 0 0 1 1]
 [1 0 1 0 0 1]]
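Notice that “I” is in the manual vocabulary but not in this output. By default, CountVectorizer lowercases the text and keeps only tokens of two or more characters (its documented default `token_pattern` is `r"(?u)\b\w\w+\b"`). The sketch below imitates that behaviour with Python's `re` module; the helper `tokenize` is written for this illustration and is not the library's own code.

```python
import re

# Assumption: this mirrors CountVectorizer's documented defaults
# (lowercase=True and the token pattern below).
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(sentence):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(DEFAULT_TOKEN_PATTERN, sentence.lower())

tokens = tokenize("I love jollof rice")
print(tokens)  # ['love', 'jollof', 'rice']
```

If one-letter words should be kept, a pattern such as `r"(?u)\b\w+\b"` can be passed through the `token_pattern` parameter of CountVectorizer.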
Step 3: TF-IDF Example
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "I love jollof rice",
    "I love fried rice",
    "I hate burnt rice",
]
# Learn the vocabulary and compute a TF-IDF weight for each word
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # vocabulary, in column order
print(X.toarray())  # one row of weights per sentence
Output: TF-IDF Matrix
['burnt' 'fried' 'hate' 'jollof' 'love' 'rice']
[[0.         0.         0.         0.72033345 0.54783215 0.42544054]
 [0.         0.72033345 0.         0.         0.54783215 0.42544054]
 [0.65249088 0.         0.65249088 0.         0.         0.38537163]]
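To see where these decimals come from, the sketch below recomputes them by hand. It assumes TfidfVectorizer's documented defaults: a smoothed IDF of ln((1 + N) / (1 + df)) + 1 and each row scaled to length 1 (L2 normalisation). The helper names are made up for this illustration.

```python
import math

sentences = [
    "I love jollof rice",
    "I love fried rice",
    "I hate burnt rice",
]
# One-letter words are left out to match the feature names printed above.
docs = [[w for w in s.lower().split() if len(w) > 1] for s in sentences]
vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

def smoothed_idf(word):
    # Assumed default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    raw = [doc.count(w) * smoothed_idf(w) for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 length of the row
    return [round(v / norm, 4) for v in raw]

rows = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in rows:
    print(row)
```

The normalisation step explains why “rice” gets a slightly different value in sentence 3 than in sentences 1 and 2: its IDF is the same, but each row is divided by its own length.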
Observation
- Values are no longer just 0 and 1
- Important words have higher decimal values
The first line of the output lists all the unique words found across your sentences.
The vectorizer:
- Converts all text to lowercase
- Removes duplicates
- Sorts words alphabetically
- Drops one-letter words such as “I” by default, which is why “i” is not in the list
So each word becomes a column (feature) in the TF-IDF matrix.
Each row = one sentence, each column = one word.

Remember
a. High value → Important word in that sentence
b. Zero → Word does not appear in that sentence
Words unique to a sentence have higher scores
"jollof" → only in sentence 1 → 0.7203
"fried" → only in sentence 2 → 0.7203
"burnt", "hate" → only in sentence 3 → 0.6525
These words are very important because they distinguish the sentences.
Words shared across sentences have lower scores
"love" appears in sentences 1 and 2 → lower score (0.5478)
"rice" appears in ALL sentences → lowest score (about 0.43 in sentences 1 and 2, about 0.39 in sentence 3)
This is because TF-IDF reduces the importance of common words.
TF-IDF gives high scores to rare, meaningful words and low scores to common words.
5. Classroom Activity (Very Important)
Activity 1: Given the following sentences:
“I like Python programming”
“Python is difficult”
“I like coding”
- List vocabulary
- Create BoW table manually
Task
1. Collect 5–10 sentences (WhatsApp chats, tweets, or news headlines)
2. Clean the text
3. Convert to: Bag-of-Words AND TF-IDF
4. Compare results
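For step 2 of the task, here is one possible way to clean chat or tweet text before vectorizing it. What counts as "clean" is a choice: the rules below (lowercase, remove links, @mentions, and punctuation) are assumptions for this sketch, and the function `clean_text` and the sample message are invented.

```python
import re

def clean_text(text):
    """Hypothetical cleaning step: lowercase, drop links, mentions, punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    # Keep only ASCII letters, digits, and spaces (this also drops emoji
    # and accented letters, which may not suit every dataset).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())              # collapse repeated whitespace

raw = "LOVE this jollof!!! @chef_ade see https://example.com #Yum"
cleaned = clean_text(raw)
print(cleaned)  # love this jollof see yum
```

The cleaned sentences can then be passed to CountVectorizer and TfidfVectorizer in the same way as the examples in the hands-on section.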
By:
Vision University