Technology. Education. Art (TEA) tealearn.org


Text Preprocessing Basics

Subject: Natural language processing (VU-CSC 322)

When we work with raw text data, it often comes in a messy form: sentences with punctuation, mixed cases, numbers, and common words that don’t add much meaning. Text preprocessing is the process of cleaning and preparing this raw text so that machine learning models can understand it better. Without preprocessing, models may struggle to capture meaningful patterns because the input is inconsistent or noisy.
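To make the idea of "messy" input concrete, here is a minimal sketch of a cleaning function that uses only the Python standard library. The function name normalize and the sample string are made up for illustration, and the regular expression assumes plain ASCII English text; real projects usually need more careful rules.

```python
import re

def normalize(text: str) -> str:
    """Lowercase the text, drop digits and punctuation, and collapse whitespace."""
    text = text.lower()
    # Replace everything that is not a lowercase ASCII letter or whitespace
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace into single spaces and trim the ends
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample with mixed case, numbers, and extra punctuation
raw = "  NLP is FUN!!! In 2024, we cleaned 3 datasets...  "
print(normalize(raw))
```

Whether numbers and punctuation should really be discarded depends on the task, so treat this as one possible cleaning policy rather than a rule.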

The first step is tokenization, which means splitting text into smaller units (tokens). These tokens are usually words, but they can also be sentences or even characters depending on the task. For example, the sentence:

"Natural Language Processing is fun and powerful!"

can be tokenized into:
["Natural", "Language", "Processing", "is", "fun", "and", "powerful", "!"]
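The three granularities mentioned above (words, sentences, characters) can be sketched with the standard library alone. This is a simplified stand-in for a real tokenizer: the regular expressions are assumptions chosen for illustration, and the second sample sentence is made up to show sentence splitting.

```python
import re

text = "Natural Language Processing is fun and powerful!"

# Word-level: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character, including spaces
char_tokens = list(text)

# Sentence-level: naive split after ., ! or ? followed by whitespace
sentence_tokens = re.split(r"(?<=[.!?])\s+", "NLP is fun. It is also powerful!")

print(word_tokens)
print(char_tokens[:7])
print(sentence_tokens)
```

A naive sentence split like this breaks on abbreviations such as "Dr.", which is one reason library tokenizers are preferred in practice.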

Next, we often remove stopwords. Stopwords are common words like “is,” “and,” and “the” that occur frequently but carry little meaning for tasks like classification. Removing them reduces noise and helps models focus on the more informative words. For instance, after removing stopwords, the same sentence becomes:

["Natural", "Language", "Processing", "fun", "powerful", "!"]

Note that the punctuation mark "!" is still present: it is not a stopword, so it has to be filtered out in a separate step if it is not wanted.

This cleaned version is much easier for algorithms to process because it highlights the meaningful content.
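Stopword removal is essentially a membership test against a word list. The sketch below uses a tiny hand-picked set (STOPWORDS and remove_stopwords are hypothetical names, not part of any library); NLTK and spaCy ship much longer lists.

```python
# A tiny, hand-picked stopword set for illustration only
STOPWORDS = {"is", "and", "the", "a", "an", "of", "to", "in"}

def remove_stopwords(tokens):
    """Keep tokens whose lowercase form is not in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["Natural", "Language", "Processing", "is", "fun", "and", "powerful", "!"]
print(remove_stopwords(tokens))
```

Comparing on the lowercase form means "The" and "the" are treated alike. Punctuation such as "!" passes through unchanged because it is not in the stopword set.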

In practice, libraries like NLTK and spaCy make these steps straightforward. NLTK provides simple functions for tokenization and stopword removal, while spaCy offers more advanced tools that also capture linguistic features like part-of-speech tags and lemmas.

spaCy is a modern, open-source Python library designed specifically for Natural Language Processing (NLP). Unlike older libraries such as NLTK, which are more educational and research-focused, spaCy was built with production use in mind, meaning it is fast, efficient, and ready to handle real-world applications like chatbots, search engines, and text analytics systems.

Tokenization & Stopword Removal with spaCy


import spacy
# Load English model
nlp = spacy.load("en_core_web_sm")

# Sample text
doc = nlp("Natural Language Processing is fun and powerful!")

# Tokenization
tokens = [token.text for token in doc]
print("spaCy Tokens:", tokens)

# Stopword removal
filtered_tokens = [token.text for token in doc if not token.is_stop]
print("Filtered Tokens:", filtered_tokens)

OUTPUT
spaCy Tokens: ['Natural', 'Language', 'Processing', 'is', 'fun', 'and', 'powerful', '!']
Filtered Tokens: ['Natural', 'Language', 'Processing', 'fun', 'powerful', '!']

Code explanation

• import spacy: Brings the spaCy library into your Python program.
• nlp = spacy.load("en_core_web_sm"): Loads a small, pre-trained English language model that knows how to tokenize, tag, parse, and recognize entities.
• doc = nlp("Natural Language Processing is fun and powerful!"): Passes the text through the spaCy pipeline, creating a Doc object that stores tokens and linguistic annotations.
• tokens = [token.text for token in doc]: Extracts the raw text of each token (word or punctuation) from the Doc.
• print("spaCy Tokens:", tokens): Displays the list of tokens produced by spaCy’s tokenizer.
• filtered_tokens = [token.text for token in doc if not token.is_stop]: Creates a list of tokens that are not stopwords (removes common words like “is” and “and”).
• print("Filtered Tokens:", filtered_tokens): Displays the cleaned list of tokens after stopword removal.
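The filtered list in the output above still contains "!", because punctuation is not a stopword. A possible extension, sketched here, also checks token.is_punct and lowercases the result. It assumes spaCy is installed and uses spacy.blank("en"), a tokenizer-only English pipeline, since is_stop and is_punct do not need the trained model; the variable name content_tokens is made up for this example.

```python
import spacy

# A blank English pipeline: tokenizer only, no model download needed
nlp = spacy.blank("en")
doc = nlp("Natural Language Processing is fun and powerful!")

# Drop stopwords and punctuation, then lowercase what remains
content_tokens = [
    token.text.lower()
    for token in doc
    if not token.is_stop and not token.is_punct
]
print("Content Tokens:", content_tokens)
```

Lowercasing is optional; it helps when "Fun" and "fun" should count as the same word, but it can hurt tasks that rely on capitalization, such as spotting names.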

Lemmatization

Lemmatization is the process of reducing words to their base or dictionary form (called a lemma).

Note: Stopword removal and lemmatization are both text preprocessing techniques, but they serve different purposes. Stopword removal eliminates very common words such as “is,” “and,” or “the” that usually add little meaning to an analysis, which reduces noise and helps models focus on the more informative words. Lemmatization, on the other hand, transforms words into their base or dictionary form (lemma), so that variations like “running,” “ran,” and “runs” are all standardized to “run.” In short, stopword removal filters words out entirely, while lemmatization keeps words but normalizes them to a consistent form.

For example:
“running” → “run”
“better” → “good”
“studies” → “study”

This makes text cleaner and more consistent, helping models recognize that different word forms actually represent the same concept.
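The effect of mapping several word forms to one lemma can be illustrated with a toy lookup table. This is not how spaCy lemmatizes internally; LEMMA_TABLE and toy_lemmatize are hypothetical names, and the table only covers the examples listed above.

```python
from collections import Counter

# Hypothetical, hand-written lookup table; real lemmatizers use far larger
# dictionaries plus part-of-speech information
LEMMA_TABLE = {
    "running": "run", "ran": "run", "runs": "run",
    "better": "good",
    "studies": "study",
}

def toy_lemmatize(token: str) -> str:
    """Return the table's lemma for a token, or the lowercased token itself."""
    word = token.lower()
    return LEMMA_TABLE.get(word, word)

tokens = ["Running", "ran", "runs", "studies", "better", "run"]
lemmas = [toy_lemmatize(t) for t in tokens]
print(lemmas)
print(Counter(lemmas))
```

After lemmatization the four forms of "run" are counted as one item, which is the consistency benefit described above.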


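# Lemmatization on a new sample text (reuses the nlp object loaded earlier)
doc = nlp("Benjamin Onuorah is a fine artist and a software developer. He lives in Lagos, Nigeria.")
print("spaCy Tokens:", [token.text for token in doc])
print("Filtered Tokens:", [token.text for token in doc if not token.is_stop])
lemmas = [token.lemma_ for token in doc if not token.is_stop]
print("Lemmatized Tokens:", lemmas)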

OUTPUT
spaCy Tokens: ['Benjamin', 'Onuorah', 'is', 'a', 'fine', 'artist', 'and', 'a', 'software', 'developer', '.', 'He', 'lives', 'in', 'Lagos', ',', 'Nigeria', '.']
Filtered Tokens: ['Benjamin', 'Onuorah', 'fine', 'artist', 'software', 'developer', '.', 'lives', 'Lagos', ',', 'Nigeria', '.']
Lemmatized Tokens: ['Benjamin', 'Onuorah', 'fine', 'artist', 'software', 'developer', '.', 'live', 'Lagos', ',', 'Nigeria', '.']

Notice the difference: “lives” has been reduced to its lemma → “live”.
Proper nouns like “Benjamin” and “Onuorah” remain unchanged because their lemma is the same as the word itself.
Many words don’t change because they’re already in their base form (“artist”, “developer”, “Nigeria”).
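To see exactly which tokens lemmatization changed, the two lists from the output above can be compared pair by pair. The lists are copied from that output, so this small sketch runs without spaCy; the variable name changed is chosen for illustration.

```python
# Lists copied from the output shown above
filtered = ['Benjamin', 'Onuorah', 'fine', 'artist', 'software', 'developer',
            '.', 'lives', 'Lagos', ',', 'Nigeria', '.']
lemmas = ['Benjamin', 'Onuorah', 'fine', 'artist', 'software', 'developer',
          '.', 'live', 'Lagos', ',', 'Nigeria', '.']

# Pair each token with its lemma and keep only the pairs that differ
changed = [(word, lemma) for word, lemma in zip(filtered, lemmas) if word != lemma]
print(changed)
```

Only one pair differs here, ('lives', 'live'), which matches the observation that most tokens in this sentence were already in their base form.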

By: Vision University

