Text analytics: The new data science frontier
The massive growth of unstructured data, coupled with the enormous potential value to be found in mining this information, has led to the rise of one of the newest data science arenas: text analytics.
Text analytics is a specialist field within the data science realm, and one that is taking on increasing importance: unstructured text is currently the biggest data source produced by people, which means new data points are being created constantly.
In fact, notes Melissa Jantjies, senior associate systems engineer at SAS, this human-generated text is growing at an incredible rate. She points to current statistics showing that, globally, humans produce some 470 000 tweets, 510 000 posts, 2.4 million searches and 16 million texts every minute.
"The same is true for organisations, of course, and rich textual data can be collected from every part of the business. This can include everything from online channels to call centre notes, and encompass news stories, blogs, online forums, consumer reviews, social media, live chat and more. It can also take into account more confidential data like medical records, contracts and claims," she says.
"Therefore, the real challenge is not a lack of data, but rather how to actually gather and use it. Remember that language is generally a messy thing, because even within a single language, different people have very different ways of speaking, pronouncing and using words. Therefore, unearthing the data's full potential can be difficult with such complex data sources."
Jantjies says the main challenges data scientists face include inconsistent formats and large data volumes, along with the fact that text data often comes not only from multiple sources, but in multiple languages too. Add to this misspellings, slang and abbreviations, and the heavy dependence of text on interpretation and context, and there is a great deal to consider when analysing text data.
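To make the misspellings, slang and abbreviations problem concrete, the minimal Python sketch below shows the kind of normalisation step a data scientist might run before any analysis. The lookup table and example text are invented for illustration and do not reflect any SAS product.

```python
import re

# Hypothetical lookup table mapping slang and abbreviations to
# canonical forms; a real deployment would use a much larger,
# domain-specific dictionary or a trained spelling-correction model.
CANONICAL = {
    "u": "you",
    "gr8": "great",
    "thx": "thanks",
    "pls": "please",
    "acct": "account",
}

def normalise(text: str) -> str:
    """Lower-case the text, strip punctuation and expand known slang tokens."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(CANONICAL.get(tok, tok) for tok in tokens)

print(normalise("Thx 4 the gr8 service, u rock!"))
# -> "thanks 4 the great service you rock"
```

Note that the digit "4" survives untouched: a simple dictionary only catches the variants it already knows about, which is exactly why text normalisation at scale is hard.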
"Naturally, artificial intelligence (AI) is critical here, as a manual review would be both inconsistent and far too time-consuming, while a sampling approach could easily lead to the scientist missing out on important information that would be valuable to the bigger picture. Therefore, the only way to handle this is via text analytics and natural language processing (NLP), a branch of AI that helps computers understand, interpret and manipulate human language.
"NLP is vital to text analytics because it enables a range of data-related functions, including categorisation, classification, topic detection, clustering and profiling, parsing and information extraction and even automatic summarisation."
She indicates the process flow here would begin with unstructured text being fed into the NLP system, with machine learning used to operate at scale. In conjunction with this, the scientist should also apply rule-based human interventions for the sake of context – a human is, for example, far better placed to identify something like sarcasm. With all of this in place, topic discovery, extraction and categorisation become simpler.
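A minimal sketch of such a hybrid flow, again in Python with scikit-learn, might look like the following: a statistical classifier does the scalable work, while a hypothetical rule layer routes likely sarcasm to a human reviewer. All names, patterns and training examples are invented for illustration, and a real system would train on thousands of labelled records.

```python
import re
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Machine-learning layer: a toy sentiment classifier.
train_texts = [
    "great service, very happy",
    "excellent support, thank you",
    "terrible experience, never again",
    "awful product, waste of money",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Rule-based layer: hypothetical phrasings that often signal sarcasm,
# which a purely statistical model tends to misread as positive.
SARCASM_HINTS = [r"oh great", r"just what i needed", r"thanks a lot"]

def classify(text: str) -> str:
    """Apply the rule layer first, then fall back to the ML model."""
    if any(re.search(p, text.lower()) for p in SARCASM_HINTS):
        return "flag for human review (possible sarcasm)"
    return model.predict([text])[0]

print(classify("Excellent support, thank you"))        # -> positive
print(classify("Oh great, another delayed delivery"))  # -> flagged for review
```

The design point is that the rules do not replace the model; they catch the context-heavy cases where, as Jantjies notes, a human is better placed to judge.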
"The end result of all this is the ability to identify emerging trends, implement predictive analytics and gain easily summarised operational insights. The key factor here is to always remember that it is imperative to adopt a hybrid approach to text analytics – in other words, never leave all the work to the AI, as context is so critical here that human interventions are an essential part of the process.
"Ultimately, for data scientists to succeed in the text analytics field, they require a solution that offers a modern, flexible and end-to-end text analytics framework; one that combines text mining, contextual extraction, categorisation, sentiment analysis and search capabilities," she concludes.