Text Vectorization
Lemmatization
The steps for text preprocessing:
- Tokenization: you'll need to divide the text into tokens.
- Lemmatization: you'll need to reduce the words to their root forms.
You can use these libraries for both tokenization and lemmatization:
- Natural Language Toolkit (NLTK)
- spaCy
Import the tokenization function and create a lemmatization object:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# first run only: import nltk; nltk.download('punkt'); nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
Pass the text "All models are wrong, but some are useful" as separate tokens to the lemmatize()
function:
1text = "All models are wrong, but some are useful."23tokens = word_tokenize(text.lower())45lemmas = [lemmatizer.lemmatize(token) for token in tokens]
The word_tokenize()
function splits a text into tokens, and the lemmatize()
function returns the lemma for a token passed to it. Because we're interested in lemmatizing a sentence, the result is usually presented as a list of lemmatized tokens.
['all', 'model', 'be', 'wrong', ',', 'but', 'some', 'be', 'useful', '.']
We've also converted the text to lowercase because the NLTK lemmatizer expects lowercase input. Note that WordNetLemmatizer treats every token as a noun by default; to reduce verb forms such as "are" to "be", as in the output above, pass the part of speech to lemmatize() (for example, lemmatizer.lemmatize(token, pos='v')).
Use the join() function to convert the list of processed tokens back into a text line, separating the elements with a space:
1" ".join(lemmas)
We get:
'all model be wrong , but some be useful .'
A set of texts is collectively called a corpus. A machine learning algorithm treats each text in the corpus as a separate record, identified by its position (index) in the corpus.
corpus = data['review']
Regular expressions
A regular expression is an instrument for finding complex patterns in texts.
Python has a built-in module for working with regular expressions: re (short for "regular expressions").
import re
Take a look at the re.sub()
function. It finds all the parts of the text that match the given pattern and then substitutes them with the chosen text.
# pattern — the pattern to search for
# substitution — what each pattern match should be substituted with
# text — the text which the function scans for pattern matches
re.sub(pattern, substitution, text)
We only need to keep letters and spaces in these lemmatized review texts, so let's write a regular expression to find them.
The expression starts with r, followed by square brackets inside quotation marks:
r'[]'
All the characters that the pattern should match are listed inside the square brackets, without spaces, and they can be placed in any order. Let's find the letters from "a" to "z." If they can be both lowercase and uppercase, the expression is written as follows:
# a range of letters is indicated by a hyphen:
# a-z = abcdefghijklmnopqrstuvwxyz
r'[a-zA-Z]'
Let's take one of the reviews from the dataset. We need to keep Latin letters, spaces, and apostrophes and replace every other character with a space. To match the characters that are not listed in the pattern, put a caret (^) in front of the sequence, then call re.sub():
# review text
text = """
I liked this show from the first episode I saw, which was the "Rhapsody in Blue" episode (for those that don't know what that is, the Zan going insane and becoming pau lvl 10 ep). Best visuals and special effects I've seen on a television series, nothing like it anywhere.
"""
re.sub(r'[^a-zA-Z\']', ' ', text)
1" I liked this show from the first episode I saw which was the Rhapsody in Blue episode for those that don't know what that is the Zan going insane and becoming pau lvl ep Best visuals and special effects I've seen on a television series nothing like it anywhere "
Let's get rid of the extra spaces since they hinder the analysis. We can eliminate them by using the combination of the join() and split() functions.
If we call split()
without any arguments, it splits the text by spaces or groups of spaces:
1text = " I liked this show "2text.split()
The result is a list without spaces:
['I', 'liked', 'this', 'show']
Using the join()
method, combine these elements into a line with spaces:
1" ".join(['I', 'liked', 'this', 'show'])
So we get a line with no extra spaces:
'I liked this show'
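Putting these steps together, here's a minimal sketch of a cleanup helper that combines the regular expression with split() and join(). The function name clear_text is just an illustrative choice, and the commented line assumes the reviews are stored in data['review'] as above.

import re

def clear_text(text):
    # keep only Latin letters and apostrophes; replace everything else with a space
    cleaned = re.sub(r"[^a-zA-Z']", ' ', text)
    # split() + join() collapses runs of spaces into single spaces
    return ' '.join(cleaned.split())

print(clear_text(" I   liked  this show "))
# 'I liked this show'

# applied to the whole corpus:
# data['review'] = data['review'].apply(clear_text)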
Bag-of-words and n-gram
One common technique for converting text is called the bag-of-words model. It transforms texts into vectors without considering word order, and that's why it's called a bag.
Let's take a famous proverb:
For want of a nail the shoe was lost.
For want of a shoe the horse was lost.
For want of a horse the rider was lost.
If we get rid of uppercase letters and lemmatize it with spaCy, we get this:
for want of a nail the shoe be lose
for want of a shoe the horse be lose
for want of a horse the rider be lose
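Here's a minimal sketch of how that spaCy lemmatization could look. It assumes the small English model en_core_web_sm has been installed with python -m spacy download en_core_web_sm; the exact lemmas can vary slightly between model versions.

import spacy

# load the small English model; the parser and NER aren't needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

text = "For want of a nail the shoe was lost."
doc = nlp(text.lower())

# collect the lemma of every non-punctuation token
lemmas = [token.lemma_ for token in doc if not token.is_punct]
print(' '.join(lemmas))
# roughly: for want of a nail the shoe be lose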
Let's count how many times each word occurs:
- "for", "want", "of", "a", "the", "be", "lose" - 3
- "shoe", "horse" - 2
- "nail", "rider" - 1
Here's the vector for this text, with one element per unique word in the order listed above:
[3, 3, 3, 3, 3, 3, 3, 2, 2, 1, 1]
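As a quick sanity check, you can reproduce these counts with Python's collections.Counter; this only illustrates the counting step, not how a real vectorizer is implemented.

from collections import Counter

lemmatized_text = (
    "for want of a nail the shoe be lose "
    "for want of a shoe the horse be lose "
    "for want of a horse the rider be lose"
)

# count how many times each word occurs in the whole text
print(Counter(lemmatized_text.split()))
# Counter({'for': 3, 'want': 3, 'of': 3, 'a': 3, 'the': 3, 'be': 3, 'lose': 3,
#          'shoe': 2, 'horse': 2, 'nail': 1, 'rider': 1})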
If there are several texts, then the bag-of-words transforms them into a matrix.
The bag-of-words counts every unique word. But the word order and connections between words are not taken into account.
Look at this lemmatized text, for example:
Peter travel from Tucson to Vegas
Here's the list of words: "Peter," "travel," "from," "Tucson," "to," "Vegas." So where does Peter go? To answer the question, let's look at the phrases, or n-grams.
An n-gram is a sequence of several words. N indicates the number of elements, and it can be any value. For instance, if N=1, we have separate words, or unigrams. If N=2, we have two-word phrases, or bigrams. N=3 produces trigrams. You get the idea, right?
Let's find all the trigrams for the sentence "Sunset raged like a beautiful bonfire."
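Here's a minimal sketch of one way to list them, using the ngrams() helper from nltk.util:

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

text = "Sunset raged like a beautiful bonfire."
tokens = word_tokenize(text.lower())

# drop punctuation, then build every sequence of three consecutive words
words = [token for token in tokens if token.isalpha()]
print(list(ngrams(words, 3)))
# [('sunset', 'raged', 'like'), ('raged', 'like', 'a'), ('like', 'a', 'beautiful'), ('a', 'beautiful', 'bonfire')]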
N-grams are similar to bag-of-words because they can also be converted into vectors. For the text about Peter, the bigrams are "Peter travel," "travel from," "from Tucson," "Tucson to," and "to Vegas," and each of them occurs once, so the bigram vector is [1, 1, 1, 1, 1].
Creating a bag-of-words
To convert a text corpus into a bag-of-words, use the CountVectorizer() class from the sklearn.feature_extraction.text module.
Import the class:
from sklearn.feature_extraction.text import CountVectorizer
Create a counter:
count_vect = CountVectorizer()
Pass the text corpus to the counter by calling the fit_transform() method. The counter extracts the unique words from the corpus and counts how many times each one appears in each text. Single-character tokens (such as "a") are ignored by default.
# bow = bag of words
bow = count_vect.fit_transform(corpus)
This method returns a matrix where rows represent texts and the columns display unique words from the corpus. The numbers at their intersections represent how many times a given word appears in the text.
Let's use the corpus from the previous lesson:
corpus = [
    'for want of a nail the shoe was lost',
    'for want of a shoe the horse was lost',
    'for want of a horse the rider was lost',
    'for want of a rider the message was lost',
    'for want of a message the battle was lost',
    'for want of a battle the kingdom was lost',
    'and all for the want of a horseshoe nail'
]
Let's create a bag-of-words from this corpus. Use the shape attribute to find out the size of the resulting matrix:
bow.shape
(7, 16)
The result is 7 texts and 16 unique words.
Here's our bag-of-words as an array:
print(bow.toarray())
[[0 0 0 1 0 0 0 1 0 1 1 0 1 1 1 1]
 [0 0 0 1 1 0 0 1 0 0 1 0 1 1 1 1]
 [0 0 0 1 1 0 0 1 0 0 1 1 0 1 1 1]
 [0 0 0 1 0 0 0 1 1 0 1 1 0 1 1 1]
 [0 0 1 1 0 0 0 1 1 0 1 0 0 1 1 1]
 [0 0 1 1 0 0 1 1 0 0 1 0 0 1 1 1]
 [1 1 0 1 0 1 0 0 0 1 1 0 0 1 1 0]]
The list of unique words in the bag is called a vocabulary. It's stored in the counter and can be accessed by calling the get_feature_names() method (renamed get_feature_names_out() in newer versions of scikit-learn):
count_vect.get_feature_names()
Here's the vocabulary for our example:
['all',
 'and',
 'battle',
 'for',
 'horse',
 'horseshoe',
 'kingdom',
 'lost',
 'message',
 'nail',
 'of',
 'rider',
 'shoe',
 'the',
 'want',
 'was']
CountVectorizer()
is also used for n-gram calculations. Specify the n-gram size with the ngram_range
argument to make it count the phrases.
If we need to find two-word phrases, we should specify the range this way:
count_vect = CountVectorizer(ngram_range=(2, 2))
The counter works with phrases, just like it does with words.
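For example, here's a minimal sketch of counting bigrams in the proverb corpus defined above; the exact set of feature names depends on the corpus you pass in.

from sklearn.feature_extraction.text import CountVectorizer

# count two-word phrases instead of single words
count_vect = CountVectorizer(ngram_range=(2, 2))
bow = count_vect.fit_transform(corpus)

print(bow.shape)
# each column now corresponds to a bigram such as 'for want' or 'want of'
# (in older scikit-learn versions, use get_feature_names() instead)
print(count_vect.get_feature_names_out()[:5])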
Usually you can drop conjunctions and prepositions without losing the meaning of the sentence. A smaller and cleaner bag-of-words makes it easier to find the words that are most important for text classification.
To get a cleaner bag-of-words, remove the stop words.
Look at the stopwords
package from the nltk.corpus
module:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
Call the stopwords.words()
function and use 'english'
as an argument to get a set of stop words for English.
stop_words = set(stopwords.words('english'))
Pass the stop word list to the CountVectorizer()
when you create the counter.
count_vect = CountVectorizer(stop_words=stop_words)
Now the counter knows which words should be excluded from the bag-of-words.
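Putting the last few steps together, here's a minimal sketch of building a bag-of-words without stop words for the proverb corpus above. Note that scikit-learn may warn that some NLTK stop words (like "don't") don't match its own tokenization; that warning is harmless here.

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# the NLTK list for English contains words like 'for', 'of', 'the', 'was'
stop_words = set(stopwords.words('english'))

count_vect = CountVectorizer(stop_words=stop_words)
bow = count_vect.fit_transform(corpus)

# only the meaningful words are left in the vocabulary
print(count_vect.get_feature_names_out())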
TF-IDF
The importance of a given word is determined by its TF-IDF value (Term Frequency - Inverse Document Frequency). TF is the number of occurrences of a word in a text, and IDF lowers the weight of words that appear in many texts of the corpus.
The formula for TF-IDF:
TF-IDF = TF × IDF
How you calculate TF:
TF = t / n
In the formula, t (term) is the number of occurrences of the word in the text, and n is the total number of words in the text.
IDF's role in the formula is to reduce the weight of the words that occur most frequently across the other texts of the corpus. IDF depends on the total number of texts in the corpus (D) and the number of texts in which the word occurs (d):
IDF = log(D / d)
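As a rough worked example (using the natural logarithm here; the exact base and any smoothing vary between implementations), suppose the corpus has D = 7 texts, the word occurs in d = 2 of them, and it appears t = 1 time in a text of n = 9 words:

import math

t, n = 1, 9   # word occurrences and total words in the text
D, d = 7, 2   # texts in the corpus and texts containing the word

tf = t / n                  # 0.111...
idf = math.log(D / d)       # about 1.253
print(round(tf * idf, 3))   # about 0.139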
TF-IDF in sklearn
The TfidfVectorizer()
class can be found in the sklearn.feature_extraction.text
module.
from sklearn.feature_extraction.text import TfidfVectorizer
Create a counter and define stop words, just like we did with CountVectorizer()
:
stop_words = set(stopwords.words('english'))
count_tf_idf = TfidfVectorizer(stop_words=stop_words)
Call the fit_transform()
function to calculate the TF-IDF for the text corpus:
tf_idf = count_tf_idf.fit_transform(corpus)
We can calculate n-grams by passing the ngram_range
argument to TfidfVectorizer()
.
If the data is split into train and test sets, call fit() (or fit_transform()) only on the training set, then transform() on the test set. Otherwise, information from the test set will leak into the features and the evaluation will be biased.
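Here's a minimal sketch of that workflow, assuming the texts have already been split into train_corpus and test_corpus (hypothetical variable names):

from sklearn.feature_extraction.text import TfidfVectorizer

count_tf_idf = TfidfVectorizer(stop_words=stop_words)

# learn the vocabulary and IDF values from the training texts only
tf_idf_train = count_tf_idf.fit_transform(train_corpus)

# reuse the fitted vocabulary and IDF values on the test texts
tf_idf_test = count_tf_idf.transform(test_corpus)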
Sentiment analysis
Sentiment analysis identifies emotionally charged texts. To determine the sentiment (tonality) of a text, we can use its TF-IDF values as features.
Sentiment analysis is set up as a classification task: a positive text is labeled "1", and a negative text is labeled "0".
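Here's a minimal end-to-end sketch, assuming the review texts live in data['review'] and the 0/1 labels in a hypothetical data['sentiment'] column; logistic regression is just one reasonable choice of classifier.

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

stop_words = set(stopwords.words('english'))

# split the texts and labels before fitting the vectorizer
train_corpus, test_corpus, y_train, y_test = train_test_split(
    data['review'], data['sentiment'], test_size=0.25, random_state=12345
)

count_tf_idf = TfidfVectorizer(stop_words=stop_words)
X_train = count_tf_idf.fit_transform(train_corpus)
X_test = count_tf_idf.transform(test_corpus)

model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the test set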