2009 Text Analytics Survey

 Several weeks ago, in preparation for the Text Analytics Summit, I deployed a short survey about the state of text analytics.  I supplemented the end-user survey with vendor interviews.   Here are some of the end-user findings. 

First, a few words about the survey itself and who responded to the survey.

  • I wanted to make the survey short and sweet.  I was interested in companies’ plans for text analytics and whether the economy was affecting these plans. 
  • Let me say up front that given the topic of the survey and our list, I would categorize most of the respondents as fairly analytical and technology savvy.  Approximately 50 companies responded to the survey – split evenly between those companies that were deploying the technology and those that were not (note that this is a self-selecting sample and does not imply that 50% of companies are currently using text analytics).  The respondents represent a good mix across a number of verticals including computing, telecommunications, education, pharmaceuticals, financial services, government, and CPG.  There were also a few market researchers in the mix.  Likewise, there was a mix of companies of various sizes. 
  • Here’s my caveat:  I would not view the respondents as a scientific sample and I would view these results as qualitative.  That said, many of the results paralleled results from previous surveys.  So, while the results are unscientific, in terms of a random sample and size, I believe they probably do reflect what many companies are doing in this space.


Kinds of applications, implementation schemes

I asked those respondents that were deploying text analytics what kinds of applications they were using it for.  The results were not surprising.  The top three responses – Voice of the Customer (VoC), Competitive Intelligence, and eDiscovery – were also in the top three the last time I asked the question. Additionally, many of the respondents were deploying more than one type of application (e.g., VoC and quality analysis).  This was a pattern that also emerged in a study I did on text analytics back in 2007. Once a company gains value from one implementation, it then sees the wider value of the technology (and realizes that it has a vast amount of information that needs to be analyzed).

I asked those companies that were planning to deploy the technology which applications they were considering.  In this case, VoC and Competitive Intelligence were again the top two.  Brand Management and Product R&D were tied for third.  This is not surprising.  Companies are quite concerned with customer feedback and any issues that impact customer retention.  Companies want to understand what competitors are up to and how their brand is being perceived in the market.  Likewise, they are also trying to get smarter about how they develop products and services to be more cost effective and more market focused.



How Text Analytics is being deployed

I also wanted to find out how companies were deploying the technology. In particular, we’ve heard a lot this past year about organizations utilizing text analytics in a Software as a Service (SaaS) model.  This model has become particularly attractive in the Voice of the Customer/Competitive Intelligence/Brand Management area for several reasons.  For one thing, this kind of analysis might involve some sort of external information source such as news feeds and blog postings.  Even product R&D would draw from external sources such as trade journals, news about competitive products, and patent files.  Additionally, the SaaS model generally has a different price point than enterprise solutions.

In fact, SaaS was the model of choice for implementing the technology.  The SaaS model does offer the flexibility and price point that many companies are looking for, especially in some of the above-mentioned areas.  However, that is not to say that companies are not deploying text analytics in other ways (note the values on the X axis).  Interestingly, companies are starting to deploy text analytics in conjunction with their content management systems.  I think we will see more of this as the technology continues to become more mainstream. 


Just as an FYI, all of the companies that had deployed text analytics stated that the implementations either met or exceeded their expectations.  And, close to 60% stated that text analytics had actually exceeded expectations. 

 What about those companies that aren’t deploying the technology?

Equally important to understanding the market are those companies that are not deploying text analytics.  I asked those companies if they had any plans to utilize the technology.  Eleven percent stated that plans had been put on hold due to funding constraints.  Twenty-eight percent stated that they had no plans to implement the technology.  Another 28% stated that they planned to implement the technology in the next year and 33% said they planned to implement it in the next few years. 

Reasons cited for not implementing the technology included not understanding enough about text analytics to implement it.  Other companies just never considered implementing it, or had other analytic projects on the radar.

What about the economy?

There have been numerous articles written about whether certain technologies are recession proof, with various BI-related technology vendors stating/hoping/praying that their technology falls into this category.  And certainly, companies do feel the need to gain insight about operational efficiency, their customers, the market, and the competition with perhaps a greater urgency than in the past.  This has helped keep business analytics vendors moving forward in this economy.

The 11% number is relatively small.  However, I wonder what part of the 61% that said that they would be deploying it in the future, might actually fall into the hold category.  When I asked text analytics vendors (mostly private companies) whether the economy was impacting the sales cycle, they pretty much said the same thing.  Existing customers were not dropping projects (there is too much value there, as supported by this survey).  However, sales cycles are longer (companies are not necessarily rushing) and potential clients may be looking for creative financing and contracting options. 

I am participating in an analyst panel at the Text Analytics Summit in June.  I have more to say about the topic, as, I am sure, do the other analysts who will be participating.

The Three C’s – conceptual search, clustering, and categorization


I recently had the opportunity to speak with Richard Turner, VP of Marketing, at Content Analyst.  Content Analyst was originally part of SAIC and spun out about 5 years ago.  The company provides content management, eDiscovery, and content workflow solutions to its clients – primarily as OEM solutions.

The tool set is called CAAT (Content Analyst Analytic Technology).  It includes five core features:


  • Concept search: uses concepts rather than key words to search through documents.
  • Dynamic Clustering: classifies documents into clusters.
  • Categorization: classifies documents into user-defined categories.
  • Summarization: identifies conceptually definitive sentences in documents.
  • Foreign Language: identifies the language in a document and can work across any language.

Let me just briefly touch upon concept search, dynamic clustering, and categorization.  Concept search is interesting because when most people think search, they think keyword search.  However, keywords may not give you what you’re looking for.  For example, I might be interested in finding documents that deal with banks, yet a relevant document might not state the word bank explicitly.  Rather, words like finance, money, and so on might occur in the document.  So, if you don’t insert the right word into the search engine, you will not get back all of the relevant documents.  Concept search allows you to search on the concept (not the keyword) “bank,” so you get back documents related to the concept even if they don’t contain the exact word.  CAAT learns the concept of bank in a given set of documents from co-occurring words like “money,” “exchange rate,” etc.  It also learns that the word bank (as in financial institution) is not the same as the bank on the side of a river because of other terms in the document (such as money, transfer, or ATM).
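Content Analyst’s implementation is proprietary, but the gap between keyword and concept search can be sketched with a naive co-occurrence expansion (the corpus, documents, and function names below are all illustrative, and this toy version does not disambiguate the two senses of “bank” the way LSI can):

```python
corpus = {
    "doc1": "the bank raised its exchange rate and money transfer fees",
    "doc2": "wire the money through the atm for a small transfer fee",
    "doc3": "we fished from the muddy bank of the river at dawn",
}

def tokens(text):
    return set(text.lower().split())

def keyword_search(word, corpus):
    """Plain keyword matching: a document must contain the literal word."""
    return [name for name, text in corpus.items() if word in tokens(text)]

def expand(word, corpus):
    """Expand a query word with every term that co-occurs with it."""
    concept = {word}
    for text in corpus.values():
        if word in tokens(text):
            concept |= tokens(text)
    return concept

def concept_search(word, corpus):
    """Match any document that shares vocabulary with the expanded concept."""
    concept = expand(word, corpus)
    return [name for name, text in corpus.items() if tokens(text) & concept]

print(keyword_search("bank", corpus))  # ['doc1', 'doc3'] -- doc2 is missed
print(concept_search("bank", corpus))  # ['doc1', 'doc2', 'doc3'] -- doc2 found via "money"/"transfer"
```

A real engine weights the expanded terms rather than treating them all equally (otherwise stopwords like “the” would match everything), which is essentially what the vector-space machinery described below does.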

Dynamic clustering enables the user to organize documents into groups, called clusters, based on their content.  You can also categorize documents by taking examples that fall into a certain cluster and training the system to recognize similar documents that belong in the same category.  You literally tag a document as belonging to a certain category and then give the system other, similar documents to train on.  In eDiscovery applications, this can dramatically cut down the amount of time needed to find the right documents.  In the past, this was done manually, which obviously could be very time intensive. 
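CAAT’s training algorithm isn’t public, but the tag-then-train workflow can be sketched with a simple nearest-centroid classifier over bag-of-words vectors: hand-tagged example documents define a centroid per category, and a new document is assigned to the most similar centroid (all categories and documents here are invented for illustration):

```python
from collections import Counter
import math

def tf_vector(text):
    """Bag-of-words term frequencies (grammar and word order discarded)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(tagged_examples):
    """Build one centroid vector per category from hand-tagged examples."""
    centroids = {}
    for category, docs in tagged_examples.items():
        centroid = Counter()
        for doc in docs:
            centroid.update(tf_vector(doc))
        centroids[category] = centroid
    return centroids

def categorize(doc, centroids):
    """Assign the document to its most similar category centroid."""
    vec = tf_vector(doc)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

tagged = {
    "finance": ["bank money transfer atm account",
                "exchange rate money deposit interest"],
    "nature":  ["river bank water fishing shore",
                "river flood water bank erosion"],
}
centroids = train(tagged)
print(categorize("wire the money to my account at the bank", centroids))  # finance
```

Tagging a handful of examples per category and letting the system generalize is what replaces the manual, document-by-document review in an eDiscovery workflow.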

How do they do it?

The company uses a technique called Latent Semantic Indexing (LSI), along with other patented technology, to help it accomplish all of this.  The upshot is that LSI uses a vector representation of the information found in the documents to analyze the term space.  Essentially, it strips out the grammar, then counts and weights the occurrence of each term (e.g., how often a word appears on a page or in a document).  It does this across all of the documents, and then collapses the resulting term-document matrix using a dimensionality-reduction technique patented at Bell Labs.  The lower a term’s score in this reduced space, the greater its conceptual distance from a given document.  Since the approach is mathematical, there is no need to put together a dictionary or thesaurus.  And it’s this mathematical approach that makes the system language independent.
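CAAT’s exact weighting scheme and patented optimizations aren’t public, but the core LSI step can be sketched in pure Python: build a term-document matrix, extract the top singular components (here via power iteration, a stand-in for the full SVD used in practice), and compare documents in the collapsed “concept” space. The terms, documents, and matrix are all illustrative:

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def top_singular(A, iters=200):
    """Dominant singular triplet (u, sigma, v) of A via power iteration on A^T A."""
    At = [list(col) for col in zip(*A)]
    v = [1.0 - 0.1 * j for j in range(len(At))]   # fixed, non-degenerate start
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        nw = math.sqrt(sum(x * x for x in w))
        v = [x / nw for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    return [x / sigma for x in Av], sigma, v

def deflate(A, u, sigma, v):
    """Subtract the rank-1 component sigma * u * v^T to expose the next triplet."""
    return [[A[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Term-document matrix: rows are terms, columns are documents.
# d1 = bank/money/atm, d2 = money/atm, d3 = bank/river/water, d4 = river/water
A = [[1, 0, 1, 0],   # bank
     [1, 1, 0, 0],   # money
     [1, 1, 0, 0],   # atm
     [0, 0, 1, 1],   # river
     [0, 0, 1, 1]]   # water

u1, s1, v1 = top_singular(A)
u2, s2, v2 = top_singular(deflate(A, u1, s1, v1))

# Collapse each document down to two "concept" coordinates.
docs = [(s1 * v1[j], s2 * v2[j]) for j in range(4)]
print(cosine(docs[0], docs[1]))  # d1 vs d2 (both financial): high similarity
print(cosine(docs[1], docs[3]))  # d2 vs d4 (unrelated concepts): low similarity
```

Note that nothing in the computation looks at what the words mean – only at their co-occurrence counts – which is why the approach needs no dictionary and works regardless of language.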

Some people have argued this technique can’t scale because the matrix would be too large and it would be hard to keep it in memory.  However, when I asked the folks at Content Analyst about this, they told me that they have been working on the problem and that CAAT contains a number of features to optimize memory and throughput.  The company regularly works with litigation clients who might get 1-2 TB of information from opposing counsel and are using CAAT for clustering, categorization, and search.  The company also works with organizations that have created indexes of 45+ million (>8 TB) documents.  That’s a lot of data!

Conceptual Search and Classification and Content Management

Aside from eDiscovery, Content Analyst is also being used in applications such as improving search in media and publishing and, of course, government applications.  The company is also looking into other application areas.

Concept search is definitely a big step up from keyword search and is important for any kind of document that might be stored in a content management system.  Automatic classification and clustering would also be huge (as would summarization and foreign language recognition).  This could move Content Analyst into other areas including analyzing claims, medical records, and customer feedback.  Content management vendors such as IBM and EMC are definitely moving in the direction of providing more intelligence in their content management products.  This makes perfect sense, since a lot of unstructured information is stored in these systems.  Hopefully, other content management vendors will catch up soon.

