Tokenization, Stemming and Lemmatization

What is Tokenization?

Tokenization is the process of breaking a large chunk of text into smaller pieces: a paragraph into sentences, a sentence into words, or a word into characters. Tokenization can be done easily with the NLTK library.

Tokenization

Code

# import the sentence and word tokenizers
from nltk.tokenize import sent_tokenize, word_tokenize

# the tokenizers need the "punkt" models; if missing, run:
# import nltk; nltk.download('punkt')

text = ("Natural language processing (NLP) is a field "
        "of computer science, artificial intelligence "
        "and computational linguistics concerned with "
        "the interactions between computers and human "
        "(natural) languages, and, in particular, "
        "concerned with programming computers to "
        "fruitfully process large natural language "
        "corpora. Challenges in natural language "
        "processing frequently involve natural "
        "language understanding, natural language "
        "generation (frequently from formal, machine"
        "-readable logical forms), connecting language "
        "and machine perception, managing human-"
        "computer dialog systems, or some combination "
        "thereof.")

print(sent_tokenize(text))
print(word_tokenize(text))
# code taken from https://www.geeksforgeeks.org/tokenize-text-using-nltk-python/
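The three granularities mentioned above (sentences, words, characters) can also be sketched without NLTK, using plain Python string operations. This is only a rough approximation: `str.split` does not handle punctuation and abbreviations the way `sent_tokenize` and `word_tokenize` do.

```python
# a rough sketch of the three tokenization granularities using
# only plain Python; str.split is a crude stand-in for a real
# tokenizer and ignores punctuation rules
text = "NLP is fun. It is also useful."

# paragraph -> sentences (naive split on ". ")
sentences = [s for s in text.split(". ") if s]
print(sentences)      # ['NLP is fun', 'It is also useful.']

# sentence -> words (split on whitespace)
words = sentences[0].split()
print(words)          # ['NLP', 'is', 'fun']

# word -> characters
chars = list(words[0])
print(chars)          # ['N', 'L', 'P']
```

This shows why dedicated tokenizers exist: the naive sentence split already mishandles the trailing period, which `sent_tokenize` would treat correctly.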

What is Stemming?

The process of converting words to their stem is called stemming. The stem is the base form of a word, and it may not itself be a meaningful word in the language.

For example, "chang" and "fina" are stems that are not valid English words.

Stemming

One problem with stemming is that the resulting words may have no meaning. Lemmatization was introduced to resolve this issue.

Code

# import these modules
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# choose some words to be stemmed
words = ["program", "programs", "programer", "programing", "programers"]
for w in words:
    print(w, " : ", ps.stem(w))
# code taken from https://www.geeksforgeeks.org/python-stemming-words-with-nltk/
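To see how a stem can end up as a non-word, here is a minimal suffix-stripping sketch. This is a toy illustration, not the Porter algorithm used above, which applies a much more careful sequence of rules.

```python
# a toy suffix stripper: chops common endings off a word; this is
# NOT the Porter algorithm, just an illustration of why stemming
# can produce non-words
def toy_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        # only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("changing"))  # chang  -- not a real English word
print(toy_stem("jumped"))    # jump
print(toy_stem("programs"))  # program
```

Because the stripping is purely mechanical, the output is whatever string is left over, with no guarantee that it appears in any dictionary.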

What is Lemmatization?

Lemmatization is a technique used to reduce words to a normalized form, called the lemma. The transformation uses a dictionary to map the different variants of a word back to its root form.

For example, lemmas such as "final" and "history" are valid dictionary words.

Lemmatization

Lemmatization takes more time than stemming because every word is looked up in the dictionary. For some words, stemming and lemmatization produce the same result.

Code

# import these modules
from nltk.stem import WordNetLemmatizer

# the lemmatizer needs the WordNet data; if missing, run:
# import nltk; nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# "a" denotes adjective in the "pos" argument
print("better :", lemmatizer.lemmatize("better", pos="a"))
# code taken from https://www.geeksforgeeks.org/python-lemmatization-with-nltk/
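The dictionary lookup that lemmatization relies on can be sketched with a plain Python dict. This is a toy illustration with hand-written entries; the WordNet dictionary that `WordNetLemmatizer` consults is of course far larger.

```python
# a toy lemma dictionary: maps word variants back to their root
# form, the way a real lemmatizer consults WordNet; the entries
# here are a tiny hand-written illustration
LEMMAS = {
    "rocks": "rock",
    "corpora": "corpus",
    "better": "good",
    "running": "run",
}

def toy_lemmatize(word):
    # fall back to the word itself when it is not in the dictionary
    return LEMMAS.get(word, word)

print(toy_lemmatize("corpora"))  # corpus
print(toy_lemmatize("better"))   # good
print(toy_lemmatize("hello"))    # hello (unknown word, returned as-is)
```

This also shows why lemmatization is slower than stemming: every word costs a dictionary lookup, whereas a stemmer just applies string rules.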

The END