An introduction to text processing for machine learning applications

Text processing is one of the most common tasks in machine learning applications. These applications deal with huge amounts of text and involve a lot of work on the back end to transform that text into something an algorithm can digest. We will briefly discuss the steps involved in this process.

Author: Goran Sukovic, PhD in Mathematics, Faculty of Natural Sciences and Mathematics, University of Montenegro


The goal of an information extraction system is to find and link the relevant information while ignoring what is extraneous and irrelevant. Research and development in this area have been largely motivated by the Message Understanding Conferences (MUC). According to the MUC community, a generic information extraction system can be represented as a pipeline of components: pre-processing modules and filters, tools for syntactic and semantic analysis (linguistic components), and post-processing modules.

Among machine learning applications, text processing is one of the most common. Examples include language translation, sentiment analysis (determining positive, negative, or neutral sentiment towards products or topics in a text corpus), data extraction from documents, and spam filtering (detecting unwanted and unsolicited messages). All of these applications deal with huge amounts of text, and much of the back-end work consists of transforming that text into something an algorithm can digest. We discuss the main steps of this process below.

The first step is data pre-processing. Pre-processing usually includes tokenization, removing stop words, stemming, and lemmatization. Tokenization is the process of splitting sentences into words and removing unnecessary punctuation and tags. The next step is removing stop words: frequent words that carry no specific semantics, such as "a", "the", "is", etc. The goal of stemming is to reduce a word to its root by dropping unnecessary characters, usually a suffix (in other words, removing inflection). Lemmatization is another approach to removing inflection; it determines the part of speech and uses a detailed database of the language. The goal of both stemming and lemmatization, according to the Stanford NLP group, is "to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form". For example, we should conclude that are, is, and am are forms of the verb be, and that car, cars, car's, and cars' are forms of car. The following table shows an example of the difference between stemming and lemmatization.

Word | Stemmed form | Lemmatized form
studies | studi | study
studying | study | study



If you want to know more about stemming and lemmatization, check What is the difference between stemming and lemmatization? or Stemming and lemmatization. Note that not all of the aforementioned steps are mandatory; which ones apply depends on the application. For example, spam filtering may follow all the steps, while language translation may not.
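As a small illustration of these pre-processing steps, here is a sketch using the NLTK library (one possible choice; spaCy or other toolkits would work equally well). It assumes the standard NLTK resources ("punkt", "stopwords", "wordnet") have already been downloaded, and the input sentence is just an invented example.

```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources:
# import nltk; nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

text = "The studies are showing that cars' engines run better."

# Tokenization: split the sentence into words, keep alphabetic tokens only.
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

# Stop-word removal: drop frequent words with no specific semantics.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# Compare stemming and lemmatization on the remaining tokens.
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for t in tokens:
    print(t, "->", stemmer.stem(t), "/", lemmatizer.lemmatize(t))
```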

The second step is feature extraction. In most text processing tasks, the words of the text represent categorical (discrete) features, and we have to encode such data in a way that machine learning algorithms can use. Feature extraction in text processing is the process of mapping text data to real-valued vectors (or tuples). There are a number of ways to accomplish this.

One of the simplest and most popular techniques for representing text numerically is Bag of Words. First, we create a vocabulary: a list of the unique words in the text corpus. Then we can represent each sentence or document as a binary vector, where each vocabulary word is marked 1 if it is present in the sentence and 0 if it is absent. Another representation counts the number of times each word appears in a document and weights it, for example with the technique known as Term Frequency-Inverse Document Frequency (TF-IDF).

Term Frequency (TF) is the quotient (number of times term "t" appears in a document) / (number of terms in the document). Inverse Document Frequency (IDF) is log(N/n), where N is the number of documents and n is the number of documents in which term "t" appears. The main idea is that the IDF of a rare word is high, whereas the IDF of a frequently used word is likely to be very low. Finally, the TF-IDF value of a term is TF * IDF.

For example, consider a document containing 100 words in which the word "dog" appears three times. The term frequency (TF) for "dog" is then 3 / 100 = 0.03. Suppose our text corpus contains 10 million documents and the word "dog" appears in 0.01% of them (1,000 documents). The inverse document frequency (IDF, using a base-10 logarithm) is then log(10,000,000 / 1,000) = 4. Thus, the TF-IDF is the product of these quantities: 0.03 * 4 = 0.12.
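The following sketch reproduces these ideas in plain Python: a minimal binary Bag of Words on a tiny invented corpus, and the TF-IDF calculation from the "dog" example, following the definitions above. (Libraries such as scikit-learn provide ready-made vectorizers, but their IDF formula includes extra smoothing terms, so the values differ slightly.)

```python
import math

# --- Bag of Words: binary presence/absence vectors over a toy corpus ---
corpus = ["the dog barks", "the cat sleeps"]
vocabulary = sorted({w for doc in corpus for w in doc.split()})
for doc in corpus:
    words = set(doc.split())
    vector = [1 if w in words else 0 for w in vocabulary]
    print(doc, "->", vector)

# --- TF-IDF, using the same hypothetical numbers as the "dog" example ---
def tf(term_count, doc_length):
    # Term Frequency: occurrences of the term divided by document length.
    return term_count / doc_length

def idf(num_documents, docs_containing_term):
    # Inverse Document Frequency, base-10 logarithm as in the example above.
    return math.log10(num_documents / docs_containing_term)

tf_dog = tf(term_count=3, doc_length=100)                        # 0.03
idf_dog = idf(num_documents=10_000_000, docs_containing_term=1_000)  # 4.0
print(tf_dog * idf_dog)                                          # 0.12
```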



For many machine learning algorithms, especially in natural language processing (NLP), it is very important to maintain the context of words. Various approaches can be used to solve this problem; the most popular is word embedding. A word embedding is a representation of text in which words with similar meanings have similar representations. You can imagine building a coordinate system in which related words are placed closer together, based on a corpus of relationships.

Some of the well-known models of word embedding are Word2vec and GloVe (Global Vectors for Word Representation).

The main idea behind the famous Word2vec is to create a vector space from a large corpus of text. Each unique word in the corpus is assigned a corresponding vector in that space, and words that share common contexts in the corpus are located close to one another. We can perform vector operations to deduce semantic relationships between words. Let vec(word) denote the vector assigned to a word. One famous example is the following: vec("king") – vec("man") + vec("woman") ≈ vec("queen"). If you want more technical details, check this post from Stitch Fix with a great animated gif that explains the vectorization.
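As a sketch of how this looks in code, the snippet below uses the gensim library and its downloader module; the model name "word2vec-google-news-300" refers to the pre-trained Google News vectors commonly distributed with gensim (the download is large, so this is illustrative rather than something to run casually).

```python
import gensim.downloader as api

# Load pre-trained word2vec vectors (Google News, 300 dimensions).
wv = api.load("word2vec-google-news-300")

# vec("king") - vec("man") + vec("woman") should land near vec("queen").
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected: [('queen', <similarity score>)]
```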

For more details about Word2vec refer to The amazing power of word vectors or The Illustrated Word2vec.

Global Vectors for Word Representation, or GloVe, is an extension of the word2vec method for efficiently learning word vectors. Using statistics across the whole text corpus, GloVe constructs an explicit word co-occurrence matrix, and the resulting model generally gives better word embeddings.

We will provide an example of a co-occurrence matrix from [Pennington et al., 2014]. First, we introduce some notation. Let the matrix of word-word co-occurrence counts be denoted by X, whose entry X_ij tabulates the number of times word j occurs in the context of word i. Let X_i = Σ_k X_ik be the number of times any word appears in the context of word i. Finally, let P_ij = P(j | i) = X_ij / X_i be the probability that word j appears in the context of word i.

The following table shows co-occurrence probabilities for the target words "ice" and "steam" with selected context words, from a 6-billion-token corpus. Only in the ratio does noise from non-discriminative words like "water" and "fashion" cancel out, so that large values (much greater than 1) correlate well with properties specific to "ice", and small values (much less than 1) correlate well with properties specific to "steam".

Probability and ratio | x = solid | x = gas | x = water | x = fashion
P(x | ice) | 1.9 × 10^−4 | 6.6 × 10^−5 | 3.0 × 10^−3 | 1.7 × 10^−5
P(x | steam) | 2.2 × 10^−5 | 7.8 × 10^−4 | 2.2 × 10^−3 | 1.8 × 10^−5
P(x | ice) / P(x | steam) | 8.9 | 8.5 × 10^−2 | 1.36 | 0.96


Consider a word strongly related to "ice" but not to "steam", such as "solid". P(solid | ice) will be relatively high, and P(solid | steam) will be relatively low, so the ratio P(solid | ice) / P(solid | steam) will be large. If we take a word such as "gas" that is related to "steam" but not to "ice", the ratio P(gas | ice) / P(gas | steam) will instead be small. For a word related to both "ice" and "steam", such as "water", we expect the ratio to be close to one, and we would also expect a ratio close to one for words related to neither, such as "fashion".
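To make the notation concrete, here is a minimal sketch (using NumPy and a tiny invented two-sentence corpus) that builds a word-word co-occurrence matrix with a symmetric context window and computes P(j | i) = X_ij / X_i. Real GloVe training works on corpora of billions of tokens and then fits word vectors to these statistics, which is beyond the scope of this sketch.

```python
import numpy as np

# Tiny invented corpus, purely for illustration.
corpus = [
    "ice is a solid form of water".split(),
    "steam is a gas formed from water".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}
window = 2  # symmetric context window size

# X[i, j] counts how often word j appears within the window around word i.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for pos, word in enumerate(sent):
        for ctx in sent[max(0, pos - window):pos + window + 1]:
            if ctx != word:  # skip co-occurrence of a word with itself
                X[index[word], index[ctx]] += 1

# P(j | i) = X_ij / X_i: probability of seeing word j in the context of word i.
P = X / X.sum(axis=1, keepdims=True)
print(P[index["ice"], index["solid"]])
```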


The vectorized documents are then fed to a machine learning algorithm specific to the domain. Depending on the domain and the data available, we can develop various machine learning models. Classical approaches like Naive Bayes or Support Vector Machines are well suited for spam filtering and have been widely used. For sentiment analysis and language translation, deep learning techniques give better results. For many data extraction and text classification problems, however, classical machine learning approaches give similar results with much quicker training times than deep learning methods. We will discuss some of these machine learning algorithms in the near future.
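As a closing sketch, the snippet below wires a TF-IDF vectorizer to a Naive Bayes classifier with scikit-learn. The three training messages and their spam/ham labels are invented placeholders, not real data, and note that scikit-learn's TF-IDF uses a smoothed IDF, a slight variation on the formula given earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy messages and labels (1 = spam, 0 = not spam), for illustration only.
messages = [
    "win a free prize now",
    "meeting rescheduled to friday",
    "free offer claim your prize today",
]
labels = [1, 0, 1]

# TF-IDF feature extraction followed by a Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["claim your free prize", "see you at the meeting"]))
```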

Thanks for reading this article. If you like it, please recommend and share it.
