Text Vectorization
Glossary
Bag-of-words: a simple model that converts texts into vectors without taking into account the order of words in the texts or their importance.
Corpus: a set of texts, usually thematically related to each other.
Lemmatization: reducing a word to its dictionary form, or lemma.
N-gram: a sequence of N tokens.
Regular expression: a sequence of characters that defines a search pattern. One pattern is usually used to find different occurrences that conform to it; for example, a single pattern can describe different phone numbers, email addresses, and so on.
Sentiment Analysis: an NLP task used to identify the tonality of a text (document).
TF-IDF: a simple model to convert texts into vectors without considering word order, but taking into account how important each word is (see the formula after this glossary).
Tokenization: dividing text into tokens: separate phrases, words, and symbols.
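A common formulation of the TF-IDF weight (scikit-learn's TfidfVectorizer uses a smoothed and normalized variant of the same idea) is:

\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}

where tf(t, d) is the number of occurrences of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t.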
Practice
# NLTK: getting a list of lemmatized words for a text

# you may have to download the wordnet and punkt files for the first time
# import nltk
# nltk.download('wordnet')  # https://wordnet.princeton.edu/
# nltk.download('punkt')    # needed by word_tokenize

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

text = "All models are wrong, but some are useful."

tokens = word_tokenize(text.lower())
lemmas = [lemmatizer.lemmatize(token) for token in tokens]

# printing as a single line
print(" ".join(lemmas))
# spaCy: getting a list of lemmatized words for a text

# you may have to download a spaCy model for the first time with
# python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

text = "All models are wrong, but some are useful."

doc = nlp(text.lower())

lemmas = [token.lemma_ for token in doc]

print(" ".join(lemmas))
# pattern - check for the syntax at https://docs.python.org/3.7/library/re.html
# replacement - what each pattern match should be substituted with
# text - the text which the function scans for pattern matches

import re

re.sub(pattern, replacement, text)
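For example, a substitution that is often handy before vectorization keeps only Latin letters and spaces; the pattern and the sample text below are just an illustration:

import re

text = "All models are wrong, but some are useful!"

# replace everything except lowercase Latin letters and spaces with a space
cleaned = re.sub(r'[^a-z ]', ' ', text.lower())

# collapse the repeated spaces left after the substitution
cleaned = " ".join(cleaned.split())

print(cleaned)  # all models are wrong but some are useful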
# getting stop words for English

# you may have to download the stop word list for the first time
# import nltk
# nltk.download('stopwords')

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# constructing a bag-of-words, a simple model for vectorizing texts
# stop_words - list of stop words to exclude
# corpus - a list of texts to vectorize

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

stop_words = set(stopwords.words('english'))
count_vect = CountVectorizer(stop_words=stop_words)

bow = count_vect.fit_transform(corpus)

# vocabulary of unique words
# (in scikit-learn < 1.0 use get_feature_names() instead)
words = count_vect.get_feature_names_out()

# printing a bag-of-words array
print(bow.toarray())
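A minimal end-to-end sketch on a made-up two-document corpus (assuming scikit-learn >= 1.0 for get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

count_vect = CountVectorizer()
bow = count_vect.fit_transform(corpus)

print(count_vect.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']

print(bow.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]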
# constructing a bag-of-words of n-grams (without removing stop words)
# min_n - min value for n
# max_n - max value for n
# corpus - a list of texts to vectorize

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(ngram_range=(min_n, max_n))

bow = count_vect.fit_transform(corpus)

# vocabulary of n-grams
words = count_vect.get_feature_names_out()

# printing a bag-of-words array
print(bow.toarray())
# Constructing TF-IDF for a corpus

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words('english'))

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words)
tfidf = tfidf_vectorizer.fit_transform(corpus)
# Constructing TF-IDF for n-grams of a corpus
# min_n - min value for n
# max_n - max value for n

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words('english'))

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, ngram_range=(min_n, max_n))
tfidf = tfidf_vectorizer.fit_transform(corpus)
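Neither TF-IDF snippet above prints the result; a small sketch of how the fitted matrix can be inspected (the toy corpus is made up, and get_feature_names_out() assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = tfidf_vectorizer.fit_transform(corpus)

# feature names are unigrams and bigrams
print(tfidf_vectorizer.get_feature_names_out())

# tfidf is a sparse matrix of shape (number of documents, number of features)
print(tfidf.shape)
print(tfidf.toarray().round(2))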