Tokenization, Stemming and Lemmatization

What is Tokenization?

Tokenization is the process of breaking a large chunk of text into smaller pieces: a paragraph into sentences, a sentence into words, or a word into characters. Tokenization can be done easily with the NLTK library.

Tokenization

Code

# import the sentence and word tokenizers
from nltk.tokenize import sent_tokenize, word_tokenize

# the tokenizers need the "punkt" models; if missing, run:
# import nltk; nltk.download('punkt')

text = ("Natural language processing (NLP) is a field "
        "of computer science, artificial intelligence "
        "and computational linguistics concerned with "
        "the interactions between computers and human "
        "(natural) languages, and, in particular, "
        "concerned with programming computers to "
        "fruitfully process large natural language "
        "corpora. Challenges in natural language "
        "processing frequently involve natural "
        "language understanding, natural language "
        "generation (frequently from formal, machine"
        "-readable logical forms), connecting language "
        "and machine perception, managing human-"
        "computer dialog systems, or some combination "
        "thereof.")

print(sent_tokenize(text))
print(word_tokenize(text))
# code taken from https://www.geeksforgeeks.org/tokenize-text-using-nltk-python/
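The three granularities mentioned above (sentences, words, characters) can also be sketched without NLTK, using plain Python string operations. This is only a rough approximation: `str.split` does not handle punctuation and abbreviations the way `sent_tokenize` and `word_tokenize` do.

```python
# a rough sketch of the three tokenization granularities using
# only plain Python; str.split is a crude stand-in for a real
# tokenizer and ignores punctuation rules
text = "NLP is fun. It is also useful."

# paragraph -> sentences (naive split on ". ")
sentences = [s for s in text.split(". ") if s]
print(sentences)      # ['NLP is fun', 'It is also useful.']

# sentence -> words (split on whitespace)
words = sentences[0].split()
print(words)          # ['NLP', 'is', 'fun']

# word -> characters
chars = list(words[0])
print(chars)          # ['N', 'L', 'P']
```

This shows why dedicated tokenizers exist: the naive sentence split already mishandles the trailing period, which `sent_tokenize` would treat correctly.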

What is Stemming?

The process of converting words to their stem is called stemming. The stem is the base form of a word, and it may not itself be a meaningful word in the language.

For example, "chang" and "fina" are stems that are not valid English words.

Stemming

One problem with stemming is that the resulting words may have no meaning. Lemmatization was introduced to resolve this issue.

Code

# import these modules
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# choose some words to be stemmed
words = ["program", "programs", "programer", "programing", "programers"]
for w in words:
    print(w, " : ", ps.stem(w))
# code taken from https://www.geeksforgeeks.org/python-stemming-words-with-nltk/
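To see how a stem can end up as a non-word, here is a minimal suffix-stripping sketch. This is a toy illustration, not the Porter algorithm used above, which applies a much more careful sequence of rules.

```python
# a toy suffix stripper: chops common endings off a word; this is
# NOT the Porter algorithm, just an illustration of why stemming
# can produce non-words
def toy_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        # only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("changing"))  # chang  -- not a real English word
print(toy_stem("jumped"))    # jump
print(toy_stem("programs"))  # program
```

Because the stripping is purely mechanical, the output is whatever string is left over, with no guarantee that it appears in any dictionary.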

What is Lemmatization?

Lemmatization is a technique used to reduce words to a normalized form, called the lemma. The transformation uses a dictionary to map the different variants of a word back to its root form.

For example, lemmas such as "final" and "history" are valid dictionary words.

Lemmatization

Lemmatization takes more time than stemming because every word is looked up in the dictionary. For some words, stemming and lemmatization produce the same result.

Code

# import these modules
from nltk.stem import WordNetLemmatizer

# the lemmatizer needs the WordNet data; if missing, run:
# import nltk; nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# "a" denotes adjective in the "pos" argument
print("better :", lemmatizer.lemmatize("better", pos="a"))
# code taken from https://www.geeksforgeeks.org/python-lemmatization-with-nltk/
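The dictionary lookup that lemmatization relies on can be sketched with a plain Python dict. This is a toy illustration with hand-written entries; the WordNet dictionary that `WordNetLemmatizer` consults is of course far larger.

```python
# a toy lemma dictionary: maps word variants back to their root
# form, the way a real lemmatizer consults WordNet; the entries
# here are a tiny hand-written illustration
LEMMAS = {
    "rocks": "rock",
    "corpora": "corpus",
    "better": "good",
    "running": "run",
}

def toy_lemmatize(word):
    # fall back to the word itself when it is not in the dictionary
    return LEMMAS.get(word, word)

print(toy_lemmatize("corpora"))  # corpus
print(toy_lemmatize("better"))   # good
print(toy_lemmatize("hello"))    # hello (unknown word, returned as-is)
```

This also shows why lemmatization is slower than stemming: every word costs a dictionary lookup, whereas a stemmer just applies string rules.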

The END