Oct 05, 2016

3 Simple Solutions to Bring Structure to your Text Data

Posted by David Eldersveld

If a consumer provides feedback in an online review, in a survey, through social media, or by other means, are you in a position to capture that data? If so, how do you determine whether it is valuable? Is it potentially time-sensitive? Can you use insights about the data to adjust business processes or improve customer relations? By employing text analytics, you can help answer questions such as these using a wealth of data that does not initially fit into a tidy, structured format.

Getting unstructured data into a format that is more effective for analysis is a critical first step toward finding value that could help your business. It is no longer difficult or costly to retain large volumes of text data, and it is no longer time-intensive to analyze all of the text that you encounter from customers, prospects, or other sources. The ability to decipher and classify text rapidly might even provide tangible results by helping to prevent customer churn, cut costs, or add revenue.

Text analytics helps build more structure and metadata around initially unstructured text. By adding structure, it is possible to derive further value.

What follows are three simple ways to add structure to what starts as a string of text. These solutions provide the building blocks for more detailed analysis, classification, and automation later on. In a future post, we will consider how these three basic techniques to bring structure to text data can be easily employed in a generic analytics solution, but for now, let’s lay the foundation.

1. Isolate Key Words

Whether you want to analyze a short phrase or an entire document, individual words make up the structure of your text. In the context of natural language processing or text analytics, you may also encounter the term "token" or hear about "tokenizing" your data. For a simple definition, think of tokens as words. Tokenizing is simply the process of splitting a body of text into individual words. From a language as well as an analytical standpoint, some words carry more importance than others, and it is helpful not only to isolate individual words but also to determine which of them act as key words. More advanced classification has to start somewhere, and finding the key words or phrases within sentences puts you on the path to discovering more through subsequent techniques such as word matching, frequency counts, and other types of analysis.
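
As a rough illustration of tokenizing and frequency counts, here is a minimal Python sketch. The stop-word list, the sample review, and the function names are all assumptions made for this example; a real project would likely use a fuller stop-word list or a dedicated NLP library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "but", "to", "of", "i", "my"}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def key_words(text, top_n=3):
    """Return the top_n most frequent tokens that are not stop words."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(tokens).most_common(top_n)


# Hypothetical customer review used only for demonstration.
review = "The battery is great, but the battery charger was slow and the charger cable broke."
print(key_words(review, top_n=2))
```

Frequency alone is a crude measure of importance, but it shows how a raw string becomes a small table of words and counts that can feed later analysis.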

2. Determine Topics

Another way to add structure to your text involves categorizing it by its subject matter. Depending on your data source, you may already know the general content. For example, you might be able to assume that a product review or targeted survey contains opinions pertaining to that product or survey topic. In many cases, however, the subject of a customer interaction may be a mystery until it is read. For instance, contact through social media could be particularly vague when a user tweets about your company. Until you review the text, you might not know why the customer is contacting you or whether it is a compliment, a complaint, or something else. Once you have a topic or topics, you can better categorize your data for storage or analysis.
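
One simple way to approximate topic assignment is to match words against a hand-built keyword list per topic. The sketch below assumes a hypothetical set of topics and keywords; it is not how any particular service determines topics, and statistical topic models take a different approach.

```python
import re

# Hypothetical topics and keywords, chosen only for illustration.
TOPIC_KEYWORDS = {
    "billing": {"invoice", "charge", "charged", "refund", "price"},
    "shipping": {"delivery", "shipping", "late", "package"},
    "support": {"help", "agent", "support", "call"},
}


def assign_topics(text):
    """Return every topic whose keyword set overlaps the words in the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    matched = [topic for topic, words in TOPIC_KEYWORDS.items() if tokens & words]
    # Fall back to a placeholder when nothing matches, so the text can be reviewed manually.
    return matched or ["unknown"]


print(assign_topics("My package arrived late and I was charged twice"))
print(assign_topics("Love it!"))
```

A message can land in more than one topic, and anything labeled "unknown" is a candidate for manual review or for extending the keyword lists.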

3. Measure Sentiment

Another common way to add value to your text involves gauging the tone. Sentiment analysis has been popular and widely accessible for a few years, and there are many solutions for measuring sentiment. Outputs will sometimes appear as a predetermined classification such as positive or negative, but if possible, it is ideal to use raw sentiment scores in a numeric format. With a score, you can make better comparisons as well as determine what is positive or negative based on your own criteria and assumptions. You can also obtain an understanding of the distribution of sentiment and see if there are any noticeable outliers.
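
To show why raw scores are more flexible than fixed labels, the following Python sketch applies custom thresholds and flags outliers. It assumes scores on a 0 to 1 scale; the threshold values, the z-score cutoff, and the sample scores are arbitrary choices for illustration.

```python
from statistics import mean, stdev


def classify(score, negative_below=0.4, positive_above=0.6):
    """Label a raw score using thresholds you choose yourself (assumed 0-1 scale)."""
    if score < negative_below:
        return "negative"
    if score > positive_above:
        return "positive"
    return "neutral"


def outliers(scores, z=2.0):
    """Return scores lying more than z sample standard deviations from the mean."""
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:
        return []
    return [s for s in scores if abs(s - mu) / sigma > z]


# Hypothetical scores for ten pieces of feedback.
scores = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.0]
print([classify(s) for s in scores])
print(outliers(scores))
```

With only a positive or negative label, neither the adjustable thresholds nor the outlier check would be possible.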

When working with sentiment analysis, however, be aware that there are different methodologies that can contribute to widely different numeric scores. Some techniques rely on matching the words in your text to a pre-scored list of words. This lexicon-based approach is fast but heavily depends on the judgment of the word list creators. Other techniques incorporate machine learning and provide scores based on a training data set rather than simple word matches. Scales differ as well, so initially be wary when using a sentiment score without first knowing more about it. A score of “1” might be considered highly positive using a technique that only provides a range from 0 to 1. A separate method might provide scores ranging from 0 to 5, and that 1 suddenly is not as positive. For any solution, consistently measuring sentiment is key.
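
The lexicon-based approach and the scale problem can both be sketched in a few lines of Python. The word list and its scores below are invented for this example, and averaging matched words is only one of several possible scoring rules.

```python
import re

# Hypothetical pre-scored word list on a -1 to 1 scale (invented for illustration).
LEXICON = {"great": 1.0, "love": 1.0, "good": 0.5, "slow": -0.5, "bad": -1.0, "broke": -1.0}


def lexicon_score(text):
    """Average the scores of the words found in the lexicon; 0.0 if none match."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def rescale(score, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map a score from one range onto another so scales can be compared."""
    return new_min + (score - old_min) * (new_max - new_min) / (old_max - old_min)


print(lexicon_score("great but slow"))   # score on the lexicon's -1 to 1 scale
print(rescale(1, 0, 1))                  # a "1" on a 0-1 scale
print(rescale(1, 0, 5))                  # the same "1" on a 0-5 scale
```

Mapping every method's output onto one common range before comparing results is one way to keep measurement consistent across tools.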

Getting Started

Fortunately, you do not have to be an expert in natural language processing to add more structure around your text data. One tool that we have employed at BlueGranite is Microsoft’s Text Analytics service, which is part of its suite of intelligence application programming interfaces (APIs) called Cognitive Services. Not all of the APIs are valuable for a data and analytics project, but some of the language-oriented APIs can be easily leveraged as part of an analytics solution. Using Microsoft’s Text Analytics service, any developer can send text to the API and receive a list of key phrases, topics, and sentiment in return. In addition, we use technologies such as R, Azure Machine Learning, and Azure HDInsight for text analytics.

Regardless of the tools used to help with text analytics, the goal is greater insight into what was previously indecipherable – at least without an abundance of manual work. Depending on your data source, you might have incidental details about your text, such as what date and time it was written, the author’s location, and other attributes. Even with those attributes, you miss out on what the text ultimately says and the true value it might have. A basic understanding of your data through text analytics provides the key to unlocking that value.

To learn more about what BlueGranite can do with your text data, contact us today!

About The Author

David Eldersveld

David is a Senior Consultant and Microsoft MVP who has employed skills in technology development, data integration, data analysis, and systems analysis for over ten years. David enjoys building BI and advanced analytics solutions with technologies such as SQL Server, Microsoft R, and Power BI. He is active in various technical communities. In addition to blogging for BlueGranite, he also writes at dataveld.wordpress.com.