Text Vectorization
Glossary
Bag-of-words: a simple model that converts texts into vectors without taking into account the order of words in the texts or their importance.
Corpus: a set of texts, usually thematically related to each other.
Lemmatization: reducing a word to its dictionary form, or lemma.
N-gram: a sequence of N tokens.
Regular expression: a sequence of characters that defines a search pattern. One pattern is usually used to find different occurrences that conform to it; for example, a single pattern can describe different phone numbers, email addresses, and so on.
Sentiment Analysis: an NLP task used to identify the tonality of a text (document).
TF-IDF: a simple model to convert texts into vectors without considering word order, but taking into account how important each word is (see the formula after this glossary).
Tokenization: dividing text into tokens: separate phrases, words, and symbols.
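A common formulation of the TF-IDF weight (scikit-learn's TfidfVectorizer uses a smoothed and normalized variant of the same idea) is:

\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}

where tf(t, d) is the number of occurrences of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t.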
Practice
# NLTK: getting a list of lemmatized words for a text

# you may have to download the wordnet and punkt files for the first time
# import nltk
# nltk.download('wordnet')  # https://wordnet.princeton.edu/
# nltk.download('punkt')    # needed by word_tokenize

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

text = "All models are wrong, but some are useful."

tokens = word_tokenize(text.lower())
lemmas = [lemmatizer.lemmatize(token) for token in tokens]

# printing as a single line
print(" ".join(lemmas))
# spaCy: getting a list of lemmatized words for a text

# you may have to download a spaCy model for the first time with
# python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

text = "All models are wrong, but some are useful."

doc = nlp(text.lower())

lemmas = [token.lemma_ for token in doc]

print(" ".join(lemmas))
# pattern - check for the syntax at https://docs.python.org/3.7/library/re.html
# replacement - what each pattern match should be substituted with
# text - the text which the function scans for pattern matches

import re

re.sub(pattern, replacement, text)
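For example, a substitution that is often handy before vectorization keeps only Latin letters and spaces; the pattern and the sample text below are just an illustration:

import re

text = "All models are wrong, but some are useful!"

# replace everything except lowercase Latin letters and spaces with a space
cleaned = re.sub(r'[^a-z ]', ' ', text.lower())

# collapse the repeated spaces left after the substitution
cleaned = " ".join(cleaned.split())

print(cleaned)  # all models are wrong but some are useful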
# getting stop words for English

# you may have to download the stop word list for the first time
# import nltk
# nltk.download('stopwords')

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# constructing a bag-of-words, a simple model for vectorizing texts
# stop_words - list of stop words to exclude
# corpus - a list of texts to vectorize

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

stop_words = set(stopwords.words('english'))
count_vect = CountVectorizer(stop_words=stop_words)

bow = count_vect.fit_transform(corpus)

# vocabulary of unique words
# (in scikit-learn < 1.0 use get_feature_names() instead)
words = count_vect.get_feature_names_out()

# printing a bag-of-words array
print(bow.toarray())
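A minimal end-to-end sketch on a made-up two-document corpus (assuming scikit-learn >= 1.0 for get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

count_vect = CountVectorizer()
bow = count_vect.fit_transform(corpus)

print(count_vect.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']

print(bow.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]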
# constructing a bag-of-words of n-grams (without removing stop words)
# min_n - min value for n
# max_n - max value for n
# corpus - a list of texts to vectorize

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(ngram_range=(min_n, max_n))

bow = count_vect.fit_transform(corpus)

# vocabulary of n-grams
words = count_vect.get_feature_names_out()

# printing a bag-of-words array
print(bow.toarray())
# Constructing TF-IDF for a corpus

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words('english'))

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words)
tfidf = tfidf_vectorizer.fit_transform(corpus)
# Constructing TF-IDF for n-grams of a corpus
# min_n - min value for n
# max_n - max value for n

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words('english'))

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, ngram_range=(min_n, max_n))
tfidf = tfidf_vectorizer.fit_transform(corpus)
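Neither TF-IDF snippet above prints the result; a small sketch of how the fitted matrix can be inspected (the toy corpus is made up, and get_feature_names_out() assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = tfidf_vectorizer.fit_transform(corpus)

# feature names are unigrams and bigrams
print(tfidf_vectorizer.get_feature_names_out())

# tfidf is a sparse matrix of shape (number of documents, number of features)
print(tfidf.shape)
print(tfidf.toarray().round(2))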