Thoughts from the 6th annual Text Analytics Summit

I just returned from the 6th annual Text Analytics Summit in Boston. It was an enjoyable conference, as usual. Larger players such as SAP and IBM had booths at the show alongside pure-play vendors Clarabridge, Attensity, Lexalytics, and Provalis Research. This was good to see, and it underscores the fact that platform players acknowledge text analytics as an important piece of the information management story. Additionally, more analysts were at the conference this year, another sign that the text analytics market is becoming more mainstream. And, most importantly, there were various end users in attendance, and they were looking at using text analytics for different applications (more about that in a second).

Since a large part of the text analytics market is currently being driven by social media and voice of the customer/customer experience management applications, there was a lot of talk about these topics, as expected. Even so, some universal, application-agnostic themes emerged. Interesting nuggets include:

  • The value of quantifying success. I found it encouraging that a number of the talks addressed a topic near and dear to my heart: quantifying the value of a technology. For example, the IBM folks, when describing their Voice of the Customer solution, specifically laid out attributes that could be used to quantify success for call center related applications (e.g., handle time per agent, first call resolution). The user panel in the Clarabridge presentation focused part of its discussion on how companies measure the value of text analytics for Customer Experience Management. Panelists discussed replacing manual processes, identifying the proper issue, and other attributes (some easy to quantify, some not so easy). Daniel Ziv, from Verint, even cited some work from Forrester that tries to measure the value of loyalty in his presentation on the future of interaction analytics.
  • Data Integration. On the technology panel, all of the participants (Lexalytics, IBM, SPSS/IBM, Clarabridge, Attensity) were quick to point out that while social media is an important source of data, it is not the only source. In many instances, it is important to integrate this data with internal data to get the best read on a problem, a customer, and so on. This is obvious but underscores two points. First, these vendors need to differentiate themselves from the 150+ listening posts and social media analysis SaaS vendors that exclusively utilize social media and are clouding the market. Second, integrating data from multiple sources is a must-have for many companies. In fact, there was a whole panel discussion on data quality issues in text analytics. While the structured data world has been dealing with quality and integration issues for years, the quality of unstructured text, aside from the work of companies dealing with the quality of data in ECM systems, is still an area that needs to be addressed.
  • Home Grown. I found it interesting that at least one presenter and several end users I spoke to stated that they have built, or will build, home grown solutions. Why? One reason was that a little could go a long way. For example, Gerand Britton from Constantine Cannon LLP explained that the biggest bang for the buck in eDiscovery was performing near-duplicate clustering of documents. This means putting functionality in place that can recognize that an email and a reply that merely acknowledges receiving it are essentially the same document, so that a cluster like this is reviewed by one person rather than two or three. In order to put this together, the company used some SPSS technology and homegrown functionality. Another reason for home grown is that companies feel their problem is unique. A number of attendees I spoke to mentioned either that they had built their own tools or that their problem would require too much customization, so they could instead hire university people to help build specific algorithms.
  • Growing Pains. There was a lot of discussion on two topics related to this. First, a number of companies and attendees spoke about a new “class” of knowledge worker. As companies move away from manually coding documents to automating extraction of concepts, entities, etc., the kind of analysis that will be needed to derive insight will no doubt be different. What will this person look like? Second, a number of discussions sprang up around how vendors are being given a hard time about figures such as 85% accuracy in classifying, for example, sentiment. One hypothesis given for this was that it is a lot easier to read comments and decide what the sentiment should be than to read the output of a statistical analysis.
  • Feature vs. Solution? Text analytics is being used in many, many ways. These range from building full-blown solutions around problem areas that require the technology to embedding it as part of a search engine or URL shortener. Most people agreed that the functionality would become more pervasive as time goes on. People will ultimately use applications that deploy the technology and not even know that it is there. And, I believe, it is quite possible that many of the customer voice/customer experience solutions will simply become part of the broader CRM landscape over time.
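
To make the near-duplicate clustering idea from the Home Grown item a bit more concrete, here is a minimal sketch in Python. It is not the SPSS-plus-homegrown system described at the Summit. It assumes one simple approach (overlapping word "shingles" compared with Jaccard similarity, then grouped with union-find), and the function names, the 3-word shingle size, and the 0.5 threshold are illustrative choices of mine.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(docs, threshold=0.5, k=3):
    """Group document indices whose shingle sets overlap at or above threshold."""
    sets = [shingles(d, k) for d in docs]
    parent = list(range(len(docs)))  # union-find forest, one node per document

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical example: an email, a reply that quotes it, and an unrelated note.
emails = [
    "Please review the attached merger agreement before Friday's call with outside counsel.",
    "Got it, thanks. > Please review the attached merger agreement before Friday's call with outside counsel.",
    "The office holiday party is moved to next Thursday at noon.",
]
print(cluster_near_duplicates(emails))  # [[0, 1], [2]]
```

Comparing every pair of documents is fine for a handful of emails but grows quadratically. Larger collections typically rely on approximate techniques such as MinHash with locality-sensitive hashing to avoid comparing everything against everything.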

I felt that the most interesting presentation of the Summit was a panel discussion on the semantic web.  I am going to write about that conversation separately and will post it in the next few days.

What’s a semantic model and why should we care?

When most people think “analytical application” they think “classic BI” or “predictive modeling.” That’s no longer accurate. The very nature of analytical applications is changing. Text analytics brings unstructured information into the equation. Visualization changes the way that we look at data and information.

The reality is that companies are starting to build applications that require many types of information: structured, unstructured, images (e.g., from sensors and satellites), audio, video, and more. This mix of information may involve many layers of complexity and interconnected relationships, and it won’t easily fit into a structured database or data warehouse. The underlying knowledge about the particular problem area may be evolving as well.

Let me give you some examples.

Consider portfolio modeling in the financial services sector. It is not enough to simply analyze past performance; to manage a portfolio, it is also necessary to look at external indicators, such as political events in other countries. Political unrest in an area of operation can directly impact a company’s stock price. Currency issues may impede business opportunities. Not only does the portfolio manager need to access a wide variety of data, but the data needs to interconnect meaningfully. So there needs to be an underlying infrastructure that caters for dynamic changes to interrelated knowledge relevant to the portfolio.

In law enforcement, it may not be enough just to have records of criminals and crimes. Other information types, such as location (geospatial) data of crime scenes and surrounding areas, can provide useful information. An appreciation of context is also necessary, for example, to know that a term such as “arrest” means something different in law enforcement than it does in medicine. And of course, intelligence information is always changing.

Semantic knowledge modeling can account for discrete data such as these, in addition to qualitative influences, to answer larger questions about perpetrators, motives and patterns of behavior: What do this suspect’s relationships with these locations tell me about this seemingly unrelated event? Are these property crimes part of an organized effort?

What is semantic knowledge modeling?

Simply put, a knowledge model is a way to abstract disparate data and information. Knowledge modeling is about describing what data means and where it fits. It allows us to understand and abstract knowledge. Consequently it helps us to understand how different pieces of information relate to each other.

A semantic model is one kind of knowledge model. The semantic model consists of a network of concepts and the relationships between those concepts. A concept is a particular idea or topic with which the user is concerned. Using the financial services example above, a concept might be “political unrest.” Or, in the law enforcement example, a concept might be a robbery. The concepts and relationships together are often known as an ontology: the semantic model that describes knowledge.

As knowledge changes, the semantic model can change too. For example, robberies may have occurred at various banks. As the number of robberies changes, the model can be updated. Other qualitative data related to those banks, or those robberies, can be fed into a continuously updated model: demographic or population shifts in a bank’s neighborhood, a change in bank ownership, details of the circumstances of a particular robbery. This enriches the knowledge about patterns and influences.

Semantic models enable users to ask questions of the information in a natural way, to identify patterns and trends in it, and to discover relationships between disparate pieces of information.
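
For readers who like to see things in code, a network of concepts and relationships can be sketched very simply as a set of (subject, relation, object) statements. This is only a toy illustration in Python under my own assumptions, not a real ontology language or any vendor's product. The class name, the relation names, and the sample facts are all hypothetical.

```python
class SemanticModel:
    """A toy network of concepts joined by named relationships (hypothetical)."""

    def __init__(self):
        self.triples = set()  # each entry is (subject, relation, object)

    def relate(self, subject, relation, obj):
        """Add one relationship between two concepts."""
        self.triples.add((subject, relation, obj))

    def query(self, subject=None, relation=None, obj=None):
        """Return matching triples in sorted order; None matches anything."""
        return sorted(
            t for t in self.triples
            if subject in (None, t[0])
            and relation in (None, t[1])
            and obj in (None, t[2])
        )

    def connected(self, a, b):
        """True if a and b are linked through any chain of relationships."""
        neighbours = {}
        for s, _, o in self.triples:
            neighbours.setdefault(s, set()).add(o)
            neighbours.setdefault(o, set()).add(s)
        seen, frontier = {a}, [a]
        while frontier:
            node = frontier.pop()
            if node == b:
                return True
            for n in neighbours.get(node, ()):
                if n not in seen:
                    seen.add(n)
                    frontier.append(n)
        return False


# Made-up facts, loosely following the law enforcement example.
model = SemanticModel()
model.relate("Robbery 1", "is_a", "robbery")
model.relate("Robbery 1", "occurred_at", "First Bank")
model.relate("Robbery 2", "occurred_at", "Second Bank")
model.relate("Suspect A", "seen_near", "First Bank")

print(model.query(relation="occurred_at"))
print(model.connected("Suspect A", "Robbery 1"))  # True, via First Bank
print(model.connected("Suspect A", "Robbery 2"))  # False, no link yet

# As knowledge changes, the model changes too.
model.relate("Suspect A", "seen_near", "Second Bank")
print(model.connected("Suspect A", "Robbery 2"))  # True
```

The point of the sketch is that new kinds of facts can be added without redesigning a schema, and that a question such as "is this suspect connected to that event?" becomes a walk over the relationships.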

The Thetus Solution

Thetus, a Portland, Oregon-based software company, has developed the infrastructure to support these kinds of applications. Its flagship product is called Thetus Publisher. It is infrastructure software for developing semantic models. The Publisher enables the modeling of complex problems represented in different or disparate data sets. The product set consists of three core components:

  • Model/Ontology Management – This component enables users to build ontologies or to import them. The knowledge model provides the layer of abstraction required for users to interact with the information in a natural way. The model is populated with known concepts, facts and relationships, and it reveals what data means and where it fits in the model.
  • Lineage – The knowledge model tracks and records the history or “lineage” of the information, which is important in many analytical applications. Moreover, changes in the model are tracked to enable understanding of changes over time. This helps answer questions such as “when did we learn this?” and “why was this decision made?”
  • Workflow – The Thetus workflow engine enables the integration of various analytics including entity extraction, link analysis and geotagging. Workflow is automated based on rules and conditions defined in the model.
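
The lineage idea in particular is easy to picture with a small example. The sketch below is my own illustration in Python of an append-only change log that can answer "when did we learn this?" and "why?". It is not Thetus Publisher's API or data model; every name in it (`Change`, `LineageLog`, the fields, the sample facts) is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Change:
    fact: tuple    # (subject, relation, object)
    action: str    # "assert" or "retract"
    source: str    # where the information came from
    reason: str    # why the change was made
    at: datetime   # when the change was recorded


class LineageLog:
    """Append-only history of changes to a knowledge model (hypothetical)."""

    def __init__(self):
        self._changes = []

    def record(self, fact, action, source, reason, at=None):
        """Append a change; defaults to the current UTC time."""
        at = at or datetime.now(timezone.utc)
        self._changes.append(Change(fact, action, source, reason, at))

    def history(self, fact):
        """All recorded changes for one fact, in the order they were logged."""
        return [c for c in self._changes if c.fact == fact]

    def first_learned(self, fact):
        """Earliest time the fact was asserted, or None if it never was."""
        times = [c.at for c in self._changes
                 if c.fact == fact and c.action == "assert"]
        return min(times) if times else None

    def facts_as_of(self, when):
        """The set of facts that held at the given moment."""
        current = set()
        for c in sorted(self._changes, key=lambda c: c.at):
            if c.at > when:
                break
            if c.action == "assert":
                current.add(c.fact)
            else:
                current.discard(c.fact)
        return current


# Made-up usage: a sighting is recorded and later withdrawn.
log = LineageLog()
sighting = ("Suspect A", "seen_near", "First Bank")
log.record(sighting, "assert", "witness report", "initial tip",
           at=datetime(2010, 5, 1, tzinfo=timezone.utc))
log.record(sighting, "retract", "camera footage", "witness was mistaken",
           at=datetime(2010, 5, 10, tzinfo=timezone.utc))

print(log.first_learned(sighting))
print([c.reason for c in log.history(sighting)])
```

Because nothing is ever overwritten, the log can reconstruct what was believed at any earlier date, which is what makes questions about past decisions answerable.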

Thetus is actively building applications in the defense, intelligence, energy, environmental services and law enforcement verticals. These are problem spaces characterized by data sources that are disparate and distributed, and a knowledge base that is evolving. However, the same technology is relevant to other business verticals as well. While many companies are still struggling to analyze their structured data, there is nonetheless room to apply innovative approaches to analytic applications. And, although this technology is in the early adopter stage for many markets, there has been investment in semantic technology on the web and in other industries that may help to push it ahead. Hurwitz & Associates plans to keep watching.

