Text mining is the process of exploring and analyzing large amounts of unstructured text data aided by software that can identify concepts, patterns, topics, keywords and other attributes in the data.
One of the first steps in the text mining process is to organize and structure the data in some fashion so it can be subjected to both qualitative and quantitative analysis.
Doing so involves the use of natural language processing (NLP) technology: language is broken down into smaller elements so that a machine can learn their relationships and how they work together. In this way, the computer can ascertain the meaning behind a sentence.
The image shows the basic text mining pipeline. Textual data contains many insignificant yet highly common words: “to”, “for”, “the”, “of” and so on. Stopword removal is the process of filtering out these words so that the analysis can focus on the longer, more meaningful terms.
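Stopword removal can be sketched in a few lines of Python. The stopword set below is a small hand-picked example for illustration; real pipelines typically draw on curated lists such as those shipped with NLTK or spaCy.

```python
# Small illustrative stopword set; production systems use much larger,
# curated lists (e.g. from NLTK or spaCy).
STOPWORDS = {"to", "for", "the", "of", "a", "an", "and", "in", "is"}

def remove_stopwords(words):
    """Keep only the words that are not in the stopword set."""
    return [w for w in words if w.lower() not in STOPWORDS]

words = ["the", "process", "of", "mining", "text", "for", "insight"]
print(remove_stopwords(words))  # ['process', 'mining', 'text', 'insight']
```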
Tokenization treats the spaces between words as word boundaries and ‘cuts’ the text string at those points. Once the string has been broken into individual words, you can determine word frequency in the document and begin identifying relationships between words.
Stemming is the process of reducing inflected words to their word stem, base or root form. A more sophisticated approach to determining the stem of a word is lemmatization: this involves first determining the part of speech of a word and then applying different normalization rules for each part of speech.
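The difference is easiest to see with a toy suffix-stripping stemmer. The suffix list below is illustrative only; practical systems use algorithmic stemmers such as the Porter stemmer (available in NLTK) or dictionary-backed lemmatizers (e.g. WordNet, spaCy).

```python
# Toy stemmer: strip the first matching suffix, keeping a stem of at
# least three characters. Note the crude output ("runn", not "run") --
# this is exactly the kind of error lemmatization avoids by consulting
# a vocabulary and the word's part of speech.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("running", "jumped", "cats"):
    print(w, "->", naive_stem(w))  # running -> runn, jumped -> jump, cats -> cat
```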
After data preprocessing, several advanced analytics models can be applied to the data. Sentiment analysis is a widely used text mining application that can track customer sentiment about a company. Other common text mining uses include classifying website content, screening job candidates based on the wording in their resumes, and cluster analysis to classify words into groups.
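As a closing illustration, sentiment analysis in its simplest form can be lexicon-based: count positive and negative words in the preprocessed token list. The word lists here are hypothetical stand-ins; practical systems use large sentiment lexicons (such as VADER's) or trained classifiers.

```python
# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "poor", "slow"}

def sentiment_score(tokens):
    """Return (#positive - #negative) / #tokens, in the range [-1, 1]."""
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return score / len(tokens) if tokens else 0.0

review = "great service but slow shipping".split()
print(sentiment_score(review))  # one positive, one negative over 5 tokens -> 0.0
```

A score near +1 suggests strongly positive text, near -1 strongly negative, and near 0 mixed or neutral text.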