Five Best Practices for Text Analytics

It’s been a while since I updated my blog and a lot has changed.  In January, I made the move to TDWI as Research Director for Advanced Analytics.  I’m excited to be there, although I miss Hurwitz & Associates.   One of the last projects I worked on while at Hurwitz & Associates was the Victory Index for Text Analytics.  Click here for more information on the Victory Index.  

As part of my research for the Victory Index, I spent a lot of time talking to companies about how they’re using text analytics.  By far, one of the biggest use cases for text analytics centers on understanding customer feedback and behavior.  Some companies are using internal data such as call center notes, emails, or survey verbatims to gather feedback and understand behavior, others are using social media, and still others are using both.  

What are these end users saying about how to be successful with text analytics?  Aside from the important best practices around defining the right problem, getting the right people, and dealing with infrastructure issues, I’ve also heard the following:

Best Practice #1 – Managing expectations among senior leadership.   A number of the end-users I speak with say that their management often thinks that text analytics will work almost out of the box and this can establish unrealistic expectations. Some of these executives seem to envision a big funnel where reams of unstructured text enter and concepts, themes, entities, and insights pop out at the other end.  Managing expectations is a balancing act.  On the one hand, executive management may not want to hear the details about how long it is going to take you to build a taxonomy or integrate data.  On the other hand, it is important to get wins under your belt quickly to establish credibility in the technology because no one wants to wait years to see some results.  That said, it is still important to establish a reasonable set of goals and prioritize them and to communicate them to everyone.  End users find that getting senior management involved and keeping them informed with well-defined plans on a realistic first project can be very helpful in handling expectations. 


For more, visit my TDWI blog.



Thoughts from the 6th annual Text Analytics Summit

I just returned from the 6th annual Text Analytics Summit in Boston.  It was an enjoyable conference, as usual.  Larger players such as SAP and IBM both had booths at the show alongside pure play vendors Clarabridge, Attensity, Lexalytics, and Provalis Research.  This was good to see and it underscores the fact that platform players acknowledge text analytics as an important piece of the information management story.   Additionally, more analysts were at the conference this year, another sign that the text analytics market is becoming more mainstream.   And, most importantly, there were various end-users in attendance and they were looking at using text analytics for different applications (more about that in a second).

Since a large part of the text analytics market is currently being driven by social media and voice of the customer/customer experience management related applications, there was a lot of talk about this topic, as expected.  Despite this, there were some universal themes that emerged which are application agnostic. Interesting nuggets include:

  • The value of quantifying success. I found it encouraging that a number of the talks addressed a topic near and dear to my heart:  quantifying the value of a technology.  For example, the IBM folks, when describing their Voice of the Customer solution, specifically laid out attributes that could be used to quantify success for call center related applications (e.g. handle time per agent, first call resolution). The user panel in the Clarabridge presentation actually focused part of the discussion on how companies measure the value of text analytics for Customer Experience Management.   Panelists discussed replacing manual processes, identifying the proper issue, and other attributes (some easy to quantify, some not so easy to quantify).  Daniel Ziv from Verint even cited some work from Forrester that tries to measure the value of loyalty in his presentation on the future of interaction analytics.
  • Data Integration. On the technology panel, all of the participants (Lexalytics, IBM, SPSS/IBM, Clarabridge, Attensity) were quick to point out that while social media is an important source of data, it is not the only source.   In many instances, it is important to integrate this data with internal data to get the best read on a problem/customer/etc.  This is obvious but underscores two points.  First, these vendors need to differentiate themselves from the 150+ listening posts and social media analysis SaaS vendors that exclusively utilize social media and are clouding the market.  Second, integrating data from multiple sources is a must have for many companies.  In fact, there was a whole panel discussion on data quality issues in text analytics.  While the structured data world has been dealing with quality and integration issues for years, aside from companies dealing with the quality of data in ECM systems, this is still an area that needs to be addressed.
  • Home Grown. I found it interesting that at least one presentation and several end-users I spoke to stated that they have built or will build home grown solutions.  Why? One reason was that a little could go a long way.  For example, Gerand Britton from Constantine Cannon LLP explained that the biggest bang for the buck in eDiscovery was performing near duplicate clustering of documents.  This means putting functionality in place that can recognize that an email sent to another person, who then replies that he or she received it, is essentially the same document as the reply, and that a cluster like this should be reviewed by one person rather than two or three.  In order to put this together, the company used some SPSS technology and homegrown functionality.  Another reason for home grown is that companies feel their problem is unique.  A number of attendees I spoke to mentioned that they had either built their own tools or that their problem would require too much customization, and that they could hire university researchers to help build specific algorithms.
  • Growing Pains.  There was a lot of discussion on two topics related to this.  First, a number of companies and attendees spoke about a new “class” of knowledge worker.  As companies move away from manually coding documents to automating extraction of concepts, entities, etc., the kind of analysis that will be needed to derive insight will no doubt be different.  What will this person look like?   Second, a number of discussions sprang up around how vendors are being given a hard time about figures such as 85% accuracy in classifying, for example, sentiment.  One hypothesis given for this was that it is a lot easier to read comments and decide what the sentiment should be than to read the output of a statistical analysis.
  • Feature vs. Solution?  Text analytics is being used in many, many ways.   These range from building full-blown solutions around problem areas that require the technology to embedding it as part of a search engine or URL shortener.   Most people agreed that the functionality would become more pervasive as time goes on.  People will ultimately use applications that deploy the technology and not even know that it is there.  And, I believe, it is quite possible that many of the customer voice/customer experience solutions will simply become part of the broader CRM landscape through time.
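
To make the near duplicate clustering idea from the Home Grown item a bit more concrete, here is a minimal Python sketch. It is not the approach Constantine Cannon or SPSS used (the talk did not go into that level of detail); the word-shingle representation, the containment measure, the 0.8 threshold, and all function names are my own assumptions for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in a lowercased text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(a, b):
    """Share of the smaller shingle set that also appears in the larger one."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller


def cluster_near_duplicates(docs, threshold=0.8, k=3):
    """Group document indices whose containment meets the threshold (single-link)."""
    sets = [shingles(d, k) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if containment(sets[i], sets[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


emails = [
    "Please review the attached merger agreement before Friday and send comments to legal.",
    "Got it, thanks. > Please review the attached merger agreement before Friday and send comments to legal.",
    "Lunch menu for the cafeteria this week includes soup and sandwiches.",
]
print(cluster_near_duplicates(emails))
```

Containment is used here rather than plain set overlap because a short reply that quotes a long original should still land in the same cluster. The pairwise comparison is quadratic, so a real eDiscovery corpus would need something smarter, but the grouping logic is the same.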

I felt that the most interesting presentation of the Summit was a panel discussion on the semantic web.  I am going to write about that conversation separately and will post it in the next few days.

Real Time Text Analytics

I’ve recently noticed a small buzz building about the notion of “real time” text analytics.  In fact, I’ve come across several vendors talking about it in relation to customer experience and financial trading.  The idea is that these companies analyze a lot of  unstructured data quickly and provide real time  information to the people who need it.  Of course, real time can mean something different depending on the context.  It might mean continuously monitoring customer feedback from multiple sources to improve customer retention.  This could mean analyzing information on an hourly basis.  Or it might mean millisecond response time analysis in the case of monitoring current events to use for trading purposes.   In the first example, millisecond response time may not be necessary.  In the case of financial trading and other activities, it can make a difference.


One vendor that offers this kind of analytical power in a SaaS model is Psydex.  Robin Bloor and I recently had the opportunity to speak to Rob Usey and Don Simpson about Psydex.    Robin has also written about this company in his blog.


Psydex analyzes huge amounts of unstructured feeds to assess the impact of news events.  The company takes in and analyzes feeds from various news sources like Thomson Reuters, Dow Jones, Associated Press, Business Wire as well as social networking sites like Twitter to extract useful information.   It can even pull in TV news feeds and text messaging.  Latency is less than 20 milliseconds to query decades of content.    The secret sauce is the company’s ability to organize streaming content in-memory and around time.  The goal is when an event hits, rather than taking hours or minutes to get information to the person who needs to know, it takes seconds. 


How it works


Psydex organizes information around semantic topics.  These topics are built using rules that represent events, people, themes, places, and so on.  For example oil might be a topic.  The topic oil might include oil, crude oil and the price of oil. Another topic, such as Oil Problem, might incorporate this topic as well as any information relating to spills, explosions, etc. 
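
As a rough illustration of how rule-based topics like these could be expressed, here is a small Python sketch. Psydex's actual rule language is proprietary and was not shown to me in this form; the dictionary layout, the `requires` field, and the function names are hypothetical, and only the oil and Oil Problem examples come from the briefing.

```python
import re

# Hypothetical rule format: a topic is a list of phrases plus optional required sub-topics.
TOPICS = {
    "oil": {"phrases": ["oil", "crude oil", "price of oil"], "requires": []},
    "oil problem": {"phrases": ["spill", "explosion"], "requires": ["oil"]},
}


def phrase_hit(phrase, text):
    """True if the phrase appears in the text on word boundaries (case-insensitive)."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def matches(topic, text, topics=TOPICS):
    """A topic matches when one of its phrases hits and every required sub-topic matches."""
    rule = topics[topic]
    if not any(phrase_hit(p, text) for p in rule["phrases"]):
        return False
    return all(matches(sub, text, topics) for sub in rule["requires"])


def tag(text, topics=TOPICS):
    """Return the names of all topics that match the text, in rule order."""
    return [name for name in topics if matches(name, text, topics)]


print(tag("Crude oil prices jump after refinery explosion"))
print(tag("Gas explosion reported downtown"))
```

The second headline mentions an explosion but not oil, so it is not tagged as an Oil Problem. Exact word-boundary matching is deliberately simple here; it would miss variants such as "spills", which a real rule set would have to cover.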


The company uses a proprietary grid-based indexing scheme for organizing content in memory with topic models stored separately. These topics are then analyzed for trends and patterns.   Specifically, Psydex uses all of its information to establish a baseline for normal topic noise levels.  The company tracks these topics and can detect when a statistically significant deviation occurs.  The screen shot below illustrates this idea.  This shot shows what a Psydex user might see when a plane crash hits the news wires.  In this case, it is about the plane crash in Japan earlier this week.


In this example, one semantic topic is plane crash.  The topic was built using a rule that includes phrases such as plane is down, plane crashed, plane crash, jet has crashed, helicopter crashed, helicopter crash, plane down, jet crashed, jet crash, plane went down, just crashed, plane has crashed, airplane has crashed, helicopter has crashed, jet has crashed, airliner has crashed, plane just crashed, plan, crashes, flight+crashes, flight+crashed – you get the idea.   There is also another topic called Japan.




The view is a five day hourly view.  Immediately you see a spike for Japan and then two other spikes beginning at 1800 hours.   You can also see the associated content stating that a plane went down in flames.  This content is coming from multiple sources.  You can also see that the noise level is up significantly for the Plane Crash topic.   The user interface also allows you to see potential related topics. In this case, Japan is a related topic.  These are topics that were found in the same articles and/or time slots.  The spikes for the two topics together are shown in blue. 



While this plot shows hourly spikes, the company can take this time interval down to the second.   Psydex sent over a log from yesterday that showed the story about the FedEx crash hit the wire at 18:21:47.  Psydex signals were generated at 18:21:48. 
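
The baseline-and-deviation idea described in this section can be sketched in a few lines of Python. This is only my illustration of the general technique, assuming a trailing-window mean and standard deviation with a z-score threshold; Psydex did not disclose its statistical test, and the window size, threshold, and function name are all hypothetical.

```python
from statistics import mean, stdev


def detect_spikes(counts, window=24, z_threshold=3.0):
    """Return indices whose count exceeds the trailing-window baseline
    by at least z_threshold standard deviations."""
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        # Floor sigma so a perfectly flat baseline does not divide by zero.
        sigma = max(stdev(baseline), 1.0)
        if (counts[i] - mu) / sigma >= z_threshold:
            spikes.append(i)
    return spikes


# Hourly mention counts for one topic: a quiet day, then a burst in the last hour.
quiet_day = [2, 3, 2, 4, 3, 2, 3, 2, 3, 4, 2, 3] * 2
hourly_mentions = quiet_day + [3, 40]
print(detect_spikes(hourly_mentions))
```

One design consequence worth noting: once a spike enters the trailing window it inflates the baseline, so follow-on hours need a larger jump to register. Whether that is desirable depends on whether you want to flag only the onset of an event or its whole duration.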


Real time BI and Real Time Text Analytics


There has been a lot of hype about real time BI (dealing with structured data) over the past few years and its use in operational systems.  And, then came complex event processing (again, structured data).   And, now real time text analytics.  You can imagine some good use cases for analyzing news-related text in real time.  Trading is obviously a use case that might require some really fast response time.  Government applications might be another.  Psydex would argue that another use case might be brand management because companies might want to use some piece of news or chatter to update their online advertising campaign.  Or, be the first to know if something negative is being said about your company.    Of course, there are other scenarios in which continuously analyzing text other than news feeds as part of an operational process might be useful, as well.








Leximancer – Concept Maps for Text Analytics and the Customer Insight Portal

I had the opportunity last week to learn more about Leximancer, a text analytics company with its roots in Australia. Leximancer recently moved some of its operations to the U.S. and I had a very interesting conversation with CEO Neil Hartley about the company.

Leximancer was founded in 2005. Its technology is based on seven years of research by Dr. Andrew Smith of the University of Queensland. The product employs a mainly statistical approach to text analytics. You simply feed the software unstructured text and the corpus is analyzed and a concept map is generated. These concept maps display main concepts, their relationship with other concepts, and emergent themes.

Concept Maps

Here is an example of a concept map that was created from customer survey data from two different quarters:

This concept map displays some key themes associated with Q1 and Q2 survey responses (File Q1, File Q2). In Q1 (on the left), the major theme was “slow” and in Q2 the major theme was “experience”. These themes suggest that there was a problem with customer service in Q1. Other themes that are clustered around these themes provide more insight, as do the other concepts that make up the themes (e.g. slow and difficult associated with the slow theme) and the pathways between these concepts. For example, in Q1, the concepts such as slow, difficult, reporting, system, tools indicate that there may have been a problem with some of the company’s customer service. During Q2 the concepts better, quality, team appear to indicate that things are improving. However, it would be important to dive into the actual text associated with these concepts and pathways to determine if this is actually the case. Leximancer lets the user do this quite easily.

What is interesting about the approach is that it does not require any real set up. You simply submit your documents and generate a concept map. The simplified process goes something like this:

The documents are submitted -> Stop words (e.g. a, the) are omitted -> A keyword count is performed -> The corpus is then broken up into segments and co-occurrence of words determined such that the resultant concepts represent a thesaurus of words that travel through the text together.
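
The simplified process above can be sketched roughly in Python. To be clear, this is not Leximancer's algorithm, which does considerably more to turn co-occurrence into concepts and themes; the stop word list, the two-sentence segment size, and the function names are assumptions made purely for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "was", "were", "and", "to", "of", "in", "it"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def cooccurrence(documents, segment_size=2):
    """Count keywords, then count word pairs that share a segment
    of `segment_size` consecutive sentences."""
    keyword_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
        for start in range(0, len(sentences), segment_size):
            segment = " ".join(sentences[start:start + segment_size])
            words = tokenize(segment)
            keyword_counts.update(words)
            # Each unordered pair is counted once per segment.
            for pair in combinations(sorted(set(words)), 2):
                pair_counts[pair] += 1
    return keyword_counts, pair_counts


survey = [
    "The system is slow. Reporting was difficult.",
    "The system was slow and the tools were difficult.",
]
keywords, pairs = cooccurrence(survey)
print(keywords.most_common(3))
print(pairs[("slow", "system")])
```

Words that keep showing up in the same segments (here, slow, system, and difficult) are the ones that "travel together", and they are the raw material a concept map would be built from.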

Of course, some thought needs to go into the analysis that you want to perform so that you are feeding the system relevant information and getting useful concepts back.

Customer Insight Portal

Neil was nice enough to provide me with an account to Leximancer’s Customer Insight Portal – a SaaS offering. The portal is very easy to use. You simply login and then tell the system the files you would like to analyze. You can upload internal documents or specify the URL(s) you would like to mine. Once the analysis is complete, you can then drill in and out of the concepts and highlight the pathways between concepts.

I decided to explore the news about the financial crisis. I input two popular financial websites into the insight portal and got out a concept map that looks like this. Note that this is a piece of the concept map. You can see various themes – crisis, voters, falls, credit and so on. Associated with each theme are a number of concepts. For example, the economic crisis theme has concepts such as confidence, stock, unemployment, banking, economy and so on associated with it. The falls theme has information associated with the Dow as well as concepts around jobs and seats. I was interested to understand the seats concept and its relationship to the economic crisis, so I highlighted the path.

In a separate window (not shown here) all of the articles related to the concept path are highlighted. It then became obvious from the articles that, given the financial crisis, the Democrats stand to gain more seats in the Senate and lock up a 60 seat filibuster proof majority.

Customer Insight allows businesses to aggregate customer feedback and analyze it in order to get to the root cause. This feedback can be from surveys, blogs, forums, and so on. Leximancer also offers the Lexbox – a feedback widget that companies can insert on their own websites to use as an additional source of information.

Market Strategy

Leximancer has about 200 customers, mostly in the educational and government space. Police departments, for example, are using the software to connect people and events as well as using it to perform social network profiling. Leximancer is also beginning to branch out to other verticals. It is looking to pursue a predominately OEM strategy, which is a good idea. Some of the vendors it will partner with will probably use the concept maps directly (depending who their audience is). Others will take the output from the maps and use it in another way. I plan to do some further analysis using the customer insight portal and will provide my additional feedback then.

Syndicating Text Analytics

Over the past several weeks, I’ve been briefed by a number of text analytics vendors and companies in partnership with text analytics vendors about syndicated services that make use of text analytics. Of course, syndicated services such as brand monitoring and news services that make use of this technology to some degree have been around for a while. But, how about some of the newer services?


An interesting example of this is illumin8, which is being offered by Elsevier, in partnership with Netbase. The service is targeted at R&D knowledge workers looking to solve technical and business problems. According to Elsevier, knowledge workers spend more time per week trying to discover relevant content relating to a particular problem area than analyzing that information (5.5 hours/week accessing vs. 4.7 hours/week analyzing). These workers are usually using a Google-like search engine. I think everyone can agree that the Google-like search engine is not ideal for research purposes, so I won’t belabor the point here. In the case of the R&D knowledge worker, the goal is often to gather information relating to a particular problem, find products that solve that problem, and understand the approach used to solve it.

Elsevier has aggregated 5 billion business sources, 3 million full text articles, 33 million scientific records, and 21 million patents as the source of information for this service. Using the Netbase semantic index, Elsevier crawls through the information and extracts solutions that solve a problem and the approaches used to address a specific issue. In this way, the service can help R&D teams identify the following:

  • Solutions that exist to solve a problem
  • New applications and processes that might exist to help solve a problem
  • Information about what competitors are doing in the particular problem space
  • What the experts are saying about a particular problem area

Here is a screen shot of what an end-user might see using this service. In this example, the user is interested in solving the problem of fuel efficiency in boats. He or she wants to see what products and approaches are out on the market to address this problem and what companies are providing these solutions. The user enters the topic (boats) and the benefit (fuel efficiency) in the search box and gets back information that is organized in a logical way. In this example, you can see that the query returns information about products that address the problem as well as the companies that make the products, organizations that deal with energy, as well as approaches to solving the problem (drag, stroke, etc). These are ranked. Users can then drill down on any of these areas to get snippets (and full text) associated with the areas they are interested in analyzing.

During the demo, I asked to see what would happen if we input “text analytics” as the problem space in the search box. I was actually impressed that what was returned was a good set of information about the players, organizations dealing with text analytics and other information about it. The service is not inexpensive, but it does cull a lot of information.

Syndicated Services

I believe that the number of syndicated services using text analytics will continue to grow. We’re certainly seeing action in the brand monitoring space on this front. Vendors are also getting into the act. Expert System, for example, has its own service that is targeted at the auto industry. I believe that other vendors may get into the act if they determine that the financial benefits of offering syndicated services (as opposed to SaaS offerings) make sense.

My two cents on the 2008 Text Analytics Summit

I add my thoughts to others who have blogged on the Summit. See the Seth Grimes posting for all of the other comments.

  • I too agree that it was great to see a large number of end users at the Summit, this year. I was especially interested in the fact that a number of them were in the investigation phase. And, what were many investigating? You got it – Voice of the Customer and the closely intertwined area of sentiment analysis.
  • VoC was a major theme. In fact, I was overwhelmed by the number of talks in the area. I thought that the presentation about what Gaylord Hotels is doing with text analytics and VoC was extremely interesting. Tony Bodoh took the audience through a journey beginning with the fact that before text analytics it would take the company weeks to even see customer comments. He said that the working pilot the company did in conjunction with Clarabridge took only 10 weeks. He reported benefits in process improvements, value-oriented marketing, and facility improvements. He even told us about reticular activating systems! Go look that one up.
  • I also met a number of people from new start-ups in the sentiment space, each taking a slightly different angle. I wonder if sentiment analysis will become a confusing space in the near future.
  • There was some discussion at the Summit about text analytics and Web 2.0. I would have liked to hear more about this, as text analytics will be important in Web 2.0.
  • And speaking of important, another interesting concept was brought up several times – the idea that text analytics will morph to become part of something bigger. I don’t want to say component, although others were.
  • I was hoping to hear more about text analytics and content management. At one point, during the expert panel, I had the chance to ask the audience if anyone was deploying their text analytics in conjunction with their content management systems. A handful responded affirmatively. Unfortunately, I didn’t catch up with them. If you’re reading this and are deploying text analytics with your ECM system, I’d like to hear from you.

All in all though, I was impressed with the Summit.

Text Analytics and the Predictive Enterprise

I recently had the chance to get an update on what SPSS is up to in text analytics.  It was an interesting conversation for several reasons:


  • First, it highlighted an important point about text analytics – which we know but is worth repeating – which is that the analysis of unstructured data can be more useful, in many scenarios, when accompanied by structured data.
  • Second, it got me thinking more about social media/network analysis, which prompted the question in the “four questions about innovations in analysis” blog I recently posted.



A few words of background.   SPSS’s goal is to help its customers analyze everything about data associated with people (behavior, attitudes, and so on) to help an organization understand anyone it interacts with.  In fact, Olivier Jouve, VP of Corporate Development at SPSS, was quite clear that SPSS is not a BI company.  Rather, SPSS software helps to enable what SPSS refers to as the “Predictive Enterprise”.   The Predictive Enterprise makes use of analytics (not simply reports) to help manage multiple dimensions across the enterprise including customer intimacy, product placement, and even operational issues such as fraud. 


SPSS offers a suite of text-mining products that is based on 25 years of research in the application of natural language processing (NLP) technologies. In 2002, SPSS bought LexiQuest™, a linguistics-based text-mining company, intending to combine LexiQuest’s extraction capabilities with SPSS’s data-mining capabilities in order to strengthen the company’s position in predictive analytics. All of SPSS’s text-analytics products now share this same core linguistic functionality.



It’s not just about text


While the market for text analytics has moved out of the early adopter stage, depending on what type of analysis you’re trying to accomplish, it often is not just about the text.




For example, consider the following churn scenario:  A telecommunications company is concerned about churn.  The company realizes that it has a wealth of information at its disposal to help predict churn.  On the structured data side it has collected demographic information, usage information, trouble ticket, and product information about each of its customers.  On the unstructured side, it also has collected call center notes, emails, and customer satisfaction surveys.  The company decides to invest in text analytics software that can sift through its call center notes, emails, and survey notes. At the end of the exercise, the company has some great insight into customer complaints that it can certainly act on.  However, it has not exactly gotten the information it might need to solve the churn problem.  In order to do this, it is probably more useful to marry the unstructured information from the call centers and surveys and emails to an actual customer and all of the structured information about that customer.  This way, using some predictive modeling the company can train its system to zero in on those customers that are likely to drop its service and make the right decisions to help retain them.  
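
The "marrying" step in the churn scenario can be sketched in Python as a join on a customer identifier. This is an illustration under my own assumptions, not SPSS functionality: the field names, the complaint term list, and the simple term-count feature are all hypothetical, and a real deployment would use proper concept extraction rather than substring counts.

```python
from collections import defaultdict

# Hypothetical complaint vocabulary; real text analytics would extract concepts instead.
COMPLAINT_TERMS = ("dropped call", "billing", "cancel", "slow")


def text_features(notes):
    """Count complaint-term mentions per customer from (customer_id, note) pairs."""
    counts = defaultdict(int)
    for customer_id, note in notes:
        lowered = note.lower()
        counts[customer_id] += sum(lowered.count(term) for term in COMPLAINT_TERMS)
    return counts


def build_rows(customers, notes):
    """Join structured customer records with text-derived features on customer_id."""
    complaints = text_features(notes)
    return [
        {**record, "complaint_mentions": complaints.get(record["customer_id"], 0)}
        for record in customers
    ]


customers = [
    {"customer_id": 1, "monthly_minutes": 420, "trouble_tickets": 3},
    {"customer_id": 2, "monthly_minutes": 900, "trouble_tickets": 0},
]
call_notes = [
    (1, "Customer reports another dropped call and wants to cancel."),
    (1, "Billing dispute, second this month."),
    (2, "Asked about upgrade options."),
]
for row in build_rows(customers, call_notes):
    print(row)
```

Each output row now carries both the structured fields and a signal derived from the unstructured notes, which is the shape of data a predictive churn model would be trained on.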


According to SPSS many of its customers have seen upwards of a 50% reduction in churn by combining data mining with text mining.


Social media is becoming an important source of information for companies


What about other forms of media such as blogs, message threads, etc.?  SPSS is also moving into social network/media analysis because as Olivier said, “The number of people participating in Web 2.0 activities is growing rapidly across all age groups, and businesses are using the direct influence they have traditionally had over customers’ decisions about their products.  Peer to Peer networks are now a trusted source of insight and information.”   This is quite true.  Our recent Hurwitz & Associates survey confirmed that companies do plan to make use of the information found in various kinds of social networks, even if they don’t think they are making use of text analytics.  One interesting point on this front is that blogs, message boards, etc. do provide a great source of information of customer sentiment, opinions, etc. The challenge will be mapping this kind of information back to the other information that a company keeps about its customers, and making sense of the behaviors.  I’ll look forward to hearing more about what SPSS is doing to help solve this problem.


