Five Best Practices for Text Analytics

It’s been a while since I updated my blog and a lot has changed.  In January, I made the move to TDWI as Research Director for Advanced Analytics.  I’m excited to be there, although I miss Hurwitz & Associates.   One of the last projects I worked on while at Hurwitz & Associates was the Victory Index for Text Analytics.  Click here for more information on the Victory Index.  

As part of my research for the Victory Index, I spent a lot of time talking to companies about how they’re using text analytics.  By far, one of the biggest use cases for text analytics centers on understanding customer feedback and behavior.  Some companies are using internal data such as call center notes, emails, or survey verbatims to gather feedback and understand behavior; others are using social media; and still others are using both.

What are these end users saying about how to be successful with text analytics?  Aside from the important best practices around defining the right problem, getting the right people, and dealing with infrastructure issues, I’ve also heard the following:

Best Practice #1 - Managing expectations among senior leadership.   A number of the end-users I speak with say that their management often thinks that text analytics will work almost out of the box and this can establish unrealistic expectations. Some of these executives seem to envision a big funnel where reams of unstructured text enter and concepts, themes, entities, and insights pop out at the other end.  Managing expectations is a balancing act.  On the one hand, executive management may not want to hear the details about how long it is going to take you to build a taxonomy or integrate data.  On the other hand, it is important to get wins under your belt quickly to establish credibility in the technology because no one wants to wait years to see some results.  That said, it is still important to establish a reasonable set of goals and prioritize them and to communicate them to everyone.  End users find that getting senior management involved and keeping them informed with well-defined plans on a realistic first project can be very helpful in handling expectations. 


For more, visit my TDWI blog.



Four Findings from the Hurwitz & Associates Advanced Analytics Survey

Hurwitz & Associates conducted an online survey on advanced analytics in January 2011. Over 160 companies across a range of industries and company size participated in the survey. The goal of the survey was to understand how companies are using advanced analytics today and what their plans are for the future. Specific topics included:

- Motivation for advanced analytics
- Use cases for advanced analytics
- Kinds of users of advanced analytics
- Challenges with advanced analytics
- Benefits of the technology
- Experiences with BI and advanced analytics
- Plans for using advanced analytics

What is advanced analytics?
Advanced analytics provides algorithms for complex analysis of either structured or unstructured data. It includes sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data mining techniques. Among its many use cases, it can be deployed for finding patterns in data, prediction, optimization, forecasting, and complex event processing. Examples include predicting churn, identifying fraud, market basket analysis, and analyzing social media for brand management. Advanced analytics does not include database query and reporting or OLAP cubes.

Many early adopters of this technology have used predictive analytics as part of their marketing efforts. However, the diversity of use cases for predictive analytics is growing. In addition to marketing-related analytics in areas such as market basket analysis, promotional mix, consumer behavior analysis, brand loyalty, and churn analysis, companies are using the technology in new and innovative ways. For example, newer industry use cases are emerging, including reliability assessment (i.e., predicting failure in machines), situational awareness, behavior (defense), investment analysis, fraud identification (insurance, finance), predicting disabilities from claims (insurance), and finding patterns in health-related data (medical).

The two charts below illustrate several key findings from the survey on how companies use advanced analytics and who within the organization is using this technology.

• Figure 1 indicates that the top uses for advanced analytics include finding patterns in data and building predictive models.

• Figure 2 illustrates that users of advanced analytics in many organizations have expanded from statisticians and other highly technical staff to include business analysts and other business users. Many vendors anticipated this shift to business users and enhanced their offerings by adding, for example, new user interfaces that suggest or dictate which model should be used for a given set of data.

Other highlights include:

• Survey participants have seen a huge business benefit from advanced analytics. In fact, over 40% of the respondents who had implemented advanced analytics believed it had increased their company’s top-line revenue. Only 2% of respondents stated that advanced analytics provided little or no value to their company.
• Regardless of company size, the vast majority of respondents expected the number of users of advanced analytics in their companies to increase over the next six to 12 months. In fact, over 50% of respondents currently using the technology expected the number of users to increase over this time period.

The final report will be published in March 2011. Stay tuned!

Thoughts from the 6th annual Text Analytics Summit

I just returned from the 6th annual Text Analytics Summit in Boston.  It was an enjoyable conference, as usual.  Larger players such as SAP and IBM both had booths at the show alongside pure play vendors Clarabridge, Attensity, Lexalytics, and Provalis Research.  This was good to see and it underscores the fact that platform players acknowledge text analytics as an important piece of the information management story.   Additionally, more analysts were at the conference this year, another sign that the text analytics market is becoming more mainstream.   And, most importantly, there were various end-users in attendance and they were looking at using text analytics for different applications (more about that in a second).

Since a large part of the text analytics market is currently being driven by social media and voice of the customer/customer experience management related applications, there was a lot of talk about this topic, as expected.  Despite this, there were some universal themes that emerged which are application agnostic. Interesting nuggets include:

  • The value of quantifying success. I found it encouraging that a number of the talks addressed a topic near and dear to my heart:  quantifying the value of a technology.  For example, the IBM folks, when describing their Voice of the Customer solution, specifically laid out attributes that could be used to quantify success for call center related applications (e.g. handle time per agent, first call resolution). The user panel in the Clarabridge presentation actually focused part of the discussion on how companies measure the value of text analytics for Customer Experience Management.   Panelists discussed replacing manual processes, identifying the proper issue, and other attributes (some easy to quantify, some not so easy to quantify).  Daniel Ziv, from Verint, even cited some work from Forrester that tries to measure the value of loyalty in his presentation on the future of interaction analytics.
  • Data Integration. On the technology panel, all of the participants (Lexalytics, IBM, SPSS/IBM, Clarabridge, Attensity) were quick to point out that while social media is an important source of data, it is not the only source.   In many instances, it is important to integrate this data with internal data to get the best read on a problem/customer/etc.  This is obvious but underscores two points.  First, these vendors need to differentiate themselves from the 150+ listening posts and social media analysis SaaS vendors that exclusively utilize social media and are clouding the market.  Second, integrating data from multiple sources is a must have for many companies.  In fact, there was a whole panel discussion on data quality issues in text analytics.  While the structured data world has been dealing with quality and integration issues for years, aside from companies dealing with the quality of data in ECM systems, this is still an area that needs to be addressed.
  • Home Grown. I found it interesting that at least one presentation and several end-users I spoke to stated that they have built/will build home grown solutions.  Why? One reason was that a little could go a long way.  For example, Gerand Britton from Constantine Cannon LLP explained that the biggest bang for the buck in eDiscovery was performing near duplicate clustering of documents.  This means putting functionality in place that can recognize that an email and the reply from a recipient who simply confirms receiving it are essentially the same document, so that a cluster like this is reviewed by one person rather than two or three.  In order to put this together, the company used some SPSS technology and homegrown functionality.  Another reason for home grown is that companies feel their problem is unique.  A number of attendees I spoke to mentioned that they had either built their own tools or that their problem would require too much customization and they could hire university people to help build specific algorithms.
  • Growing Pains.  There was a lot of discussion on two topics related to this.  First, a number of companies and attendees spoke about a new “class” of knowledge worker.  As companies move away from manually coding documents to automating extraction of concepts, entities, etc., the kind of analysis that will be needed to derive insight will no doubt be different.  What will this person look like?   Second, a number of discussions sprang up around how vendors are being given a hard time about figures such as 85% accuracy in classifying, for example, sentiment.  One hypothesis given for this was that it is a lot easier to read comments and decide what the sentiment should be than to read the output of a statistical analysis.
  • Feature vs. Solution?  Text analytics is being used in many, many ways.   This ranges from building full-blown solutions around problem areas that require the technology to embedding it as part of a search engine or URL shortener.   Most people agreed that the functionality would become more pervasive as time goes on.  People will ultimately use applications that deploy the technology and not even know that it is there.  And, I believe, it is quite possible that many of the customer voice/customer experience solutions will simply become part of the broader CRM landscape through time.
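
To make the near duplicate clustering idea from the Home Grown item more concrete, here is a minimal sketch in Python. It is not the SPSS-based setup described at the Summit; the word-shingle overlap measure, the 0.8 threshold, the greedy clustering, and the sample emails are all assumptions made for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in a lowercased text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(a, b):
    """Share of the smaller shingle set that also appears in the larger one."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def cluster_near_duplicates(docs, threshold=0.8):
    """Greedy single-pass clustering (an illustrative choice): a document
    joins the first cluster whose representative it overlaps with at or
    above the threshold, otherwise it starts a new cluster."""
    clusters = []  # list of (representative_shingles, [document indices])
    for idx, text in enumerate(docs):
        doc_shingles = shingles(text)
        for representative, members in clusters:
            if containment(doc_shingles, representative) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((doc_shingles, [idx]))
    return [members for _, members in clusters]


# Hypothetical emails: the second one is a reply that quotes the first.
emails = [
    "Please find attached the signed supply agreement for the Boston warehouse.",
    "Got it, thanks. > Please find attached the signed supply agreement for the Boston warehouse.",
    "Lunch on Thursday? The new place near the office looks good.",
]
clusters = cluster_near_duplicates(emails)
print(clusters)
```

Each inner list is a group of documents that could go to a single reviewer. Containment is used here rather than plain set similarity because a short reply wrapped around a long quoted message should still count as the same document.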

I felt that the most interesting presentation of the Summit was a panel discussion on the semantic web.  I am going to write about that conversation separately and will post it in the next few days.

What about Analytics in Social Media monitoring?

I was speaking to a client the other day.  This company was very excited about tracking its brand using one of the many listening posts out on the market.  As I sat listening to him, I couldn’t help but think that a) it was nice that his company could get its feet wet in social media monitoring using a tool like this and b) that they might be getting a false sense of security because the reality is that these social media tracking tools provide a fairly rudimentary analysis about brand/product mentions, sentiment, and influencers.  For those of you not familiar with listening posts here’s a quick primer.

Listening Post Primer

Listening posts monitor the “chatter” that is occurring on the Internet in blogs, message boards, tweets, etc.  They basically:

  • Aggregate content from across many,  many Internet sources.
  • Track the number of mentions of a topic (brand or some other term) over time and source of mention.
  • Provide users with positive or negative sentiment associated with topic (often you can’t change this, if it is incorrect).
  • Provide some sort of Influencer information.
  • Possibly provide a word cloud that lets you know what other words are associated with your topic.
  • Provide you with the ability to look at the content associated with your topic.

They typically charge by the topic.  Since these listening posts mostly use a search paradigm (with ways to aggregate words into a search topic) they don’t really allow  you to “discover” any information or insight that you may not have been aware of unless you happen to stumble across it while reading posts or put a lot of time into manually mining this information.  Some services allow the user to draw on historical data.  There are more than 100 listening posts on the market.
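
The search paradigm described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not how any particular listening post works; the brand name "Acme", the sample posts, and the function name `count_mentions` are made up.

```python
from collections import Counter
from datetime import date

# Hypothetical sample posts; "Acme" is a made-up brand used only for illustration.
posts = [
    {"day": date(2010, 5, 3), "source": "blog", "text": "The new Acme phone is great."},
    {"day": date(2010, 5, 3), "source": "twitter", "text": "acme support kept me on hold for an hour"},
    {"day": date(2010, 5, 3), "source": "twitter", "text": "Anyone tried the Acme phone yet?"},
    {"day": date(2010, 5, 4), "source": "forum", "text": "My handset keeps dropping calls."},
    {"day": date(2010, 5, 4), "source": "blog", "text": "Acme announces a price cut."},
]


def count_mentions(posts, topic_terms):
    """Count posts per (day, source) that contain any of the topic's search terms."""
    terms = [term.lower() for term in topic_terms]
    counts = Counter()
    for post in posts:
        text = post["text"].lower()
        if any(term in text for term in terms):
            counts[(post["day"], post["source"])] += 1
    return counts


mentions = count_mentions(posts, ["acme"])
for (day, source), n in sorted(mentions.items()):
    print(day.isoformat(), source, n)
```

Note that the forum post about dropped calls never shows up in the counts because it does not contain the search term. That is the "discovery" gap: a pure keyword topic only finds what you already knew to look for.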

I certainly don’t want to minimize what these providers are offering.  Organizations that are just starting out analyzing social media will certainly derive huge benefit from these services.  Many are also quite easy to use and the price point is reasonable. My point is that there is more that can be done to derive more useful insight from social media.  More advanced systems typically make use of text analytics software.   Text analytics utilizes techniques that originate in computational linguistics, statistics, and other computer science disciplines to actually analyze the unstructured text.

Adding Text Analytics to the Mix

Although still in the early phases, social media monitoring is moving to social media analysis and understanding as text analytics vendors apply their technology to this problem.  The space is heating up as evidenced by these three recent announcements:

  • Attensity buys Biz360. The other week, Attensity announced its intention to purchase Biz360, a leading listening post. In April 2009, Attensity combined with two European companies that focus on semantic business applications to form Attensity Group (formerly Attensity Corporation).  Attensity has sophisticated technology which makes use of “exhaustive extraction” techniques (as well as nine other techniques) to analyze unstructured data. Its flagship technology automatically extracts facts from parsed text (who did what to whom, when, where, under what conditions) and organizes this information.  With the addition of Biz360 and its earlier acquisitions, the Biz360 listening post will feed all Attensity products.  Additionally, the Biz360 SaaS platform will be expanded to include deeper semantic capabilities for analysis, sentiment, response and knowledge management utilizing Attensity IP.  This service will be called Attensity 360.  The service will provide listening and deep analysis capabilities.  On top of this, extracted knowledge will be automatically routed to the group in the enterprise that needs the information.  For example, legal insights about people, places, events, topics, and sentiment will be automatically routed to legal, customer service insights to customer service, and so on. These groups can then act on the information.  Attensity refers to this as the “open enterprise.” The idea is an end-to-end listen-analyze-respond-act process for enterprises to act on the insight they can get from the solution.
  • SAS announces its social media analytics software. SAS purchased text analytics vendor Teragram last year.  In April, SAS announced SAS® Social Media Analytics which, “Analyzes online conversations to drive insight, improve customer interaction, and drive performance.”  The product provides deep unstructured data analysis capabilities around both internal and external sources of information (it has partnerships with external content aggregators, if needed) for brand, media, PR, and customer related information.  SAS has coupled this with the ability to perform advanced analytics such as predictive forecasting and correlation on this unstructured data.  For example, the SAS product enables companies to forecast the number of mentions, given a history of mentions, or to understand whether sentiment during a certain time period was more negative, say, than in a previous time period.  It also enables users to analyze sentiment at a granular level and to change sentiment (and learn from this), if it is not correct.  It can deal with sentiment in 13 languages and supports 30 languages.
  • Newer social media analysis services such as NetBase are announced. NetBase is currently in limited release of its first consumer insight discovery product called ConsumerBase.  It has eight  patents pending around its deep parsing  and semantic modeling technology.  It combines deep analytics with a content aggregation service and a reporting capability.  The product provides analysis around likes/dislikes, emotions, reasons why, and behaviors.  For example, whereas a listening post might interpret the sentence, “Listerine kills germs because it hurts” as either a negative or neutral statement, the NetBase technology uses a semantic data model to understand not only that this is a positive statement, but also the reason it is positive.
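
As a rough illustration of what "forecasting the number of mentions, given a history of mentions" and comparing sentiment across time periods can mean, here is a minimal Python sketch. It is not how the SAS product works; the straight-line trend, the function names, and the sample numbers are assumptions chosen for simplicity.

```python
def linear_trend_forecast(history, steps_ahead=1):
    """Fit a least-squares straight line to the mention counts and extend it.

    This is a deliberately naive model used only for illustration; real
    forecasting tools typically account for seasonality and uncertainty.
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


def negative_share(labels):
    """Fraction of sentiment labels in a period that are 'negative'."""
    return labels.count("negative") / len(labels) if labels else 0.0


# Hypothetical weekly mention counts and sentiment labels.
weekly_mentions = [10, 12, 14, 16]
next_week = linear_trend_forecast(weekly_mentions)

last_month = ["positive", "neutral", "negative", "positive"]
this_month = ["negative", "positive", "negative", "neutral"]
more_negative = negative_share(this_month) > negative_share(last_month)
print(next_week, more_negative)
```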

Each of these products and services is slightly different.  For example, Attensity’s approach is to listen, analyze, relate (it to the business), and act (route, respond, reuse), which it calls its LARA methodology.   The SAS solution is part of its broader three Is strategy: Insight- Interaction- Improve.  NetBase is looking to provide an end-to-end service that helps companies to understand the reasons behind emotions, behaviors, likes and dislikes.   And, these are not the only games in town. Other social media analysis services announced in the last year (or earlier) include those from other text analytics vendors such as IBM, Clarabridge, and Lexalytics. And, to be fair, some of the listening posts are beginning to put this capability into their services.

This market is still in its early adoption phase, as companies try to put plans together around social media, including utilizing it for their own marketing purposes as well as analyzing it for reasons including and beyond marketing. It will be extremely important for users to determine what their needs and price points are and plan accordingly.

The Importance of multi-language support in advanced search and text analytics

I had an interesting briefing with the Basis Technology team the other week.  They updated me on the latest release of their technology called Rosette 7.   In case you’re not familiar with Basis Technology, its software is the multilingual engine that is embedded in some of the biggest Internet search engines out there, including Google, Bing, and Yahoo.  Enterprises and the government also utilize it.  But the company is not just about keyword search.  Its technology also enables the extraction of entities (about 18 different kinds) such as organizations, names, and places.  What does this mean?  It means that the software can discover these kinds of entities across massive amounts of data and perform context sensitive discovery in many different languages.

An Example

Here’s a simple example.  Say you’re in the Canadian consulate and you want to understand what is being said about Canada across the world.   You type “Canada” into your search engine and get back a listing of documents.  How do you make sense of this?  Using Basis Technology entity extraction (an enhancement to search and a basic component of text analytics), you could actually perform faceted (i.e. guided) navigation across multiple languages.  This is illustrated in the figure below.  Here, the user typed “Canada” into the search engine and got back 89 documents.  In the main pane of the browser, you can see arrows highlighting the word Canada in a number of different languages, so you know that it is included in these documents.  On the left hand side of the screen is the guided navigation pane.  For example, you can see that there are 15 documents that contain a reference to Obama and another 6 that contain a reference to Barack Obama.  This is not necessarily a co-occurrence in a sentence, just in the document.  So, any of these articles would contain a reference to both Obama and Canada.  This would help you determine what Obama might have said about Canada, or what the connection is between Canada and the BBC (under organization).  This idea is not necessarily new, but the strong multilingual capabilities make it compelling for global organizations.

If you have eagle eyes, you will notice that the search on Canada returned 89 documents, but the entity “Canada” only returned 61 documents.  This illustrates what entity extraction is all about.  When the search for Canada was run on the Rosette Name Indexer tab (see upper right hand corner of the screen shot), the query searched for Canada against all automatically extracted “Canada” entities that existed in all of the documents.  This includes all persons, locations, and organizations that have similar names, such as “Canada Post” and “Canada Life,” which are organizations, not the country itself. Therefore, in the 28 other documents (89 minus 61), the Canada variant refers to an organization or some other entity.
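
The difference between a keyword hit and an entity hit, and the way facet counts are built, can be sketched as follows. This is a toy illustration in Python, not the Rosette API: the documents and their entity labels are hand-written stand-ins for what an entity extractor would produce, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical documents with hand-labeled entities standing in for the
# output of an entity extractor.
docs = [
    {"text": "Obama discussed trade with Canada.",
     "entities": [("Obama", "PERSON"), ("Canada", "LOCATION")]},
    {"text": "Canada Post raised stamp prices.",
     "entities": [("Canada Post", "ORGANIZATION")]},
    {"text": "The BBC reported on Canada and Barack Obama.",
     "entities": [("BBC", "ORGANIZATION"), ("Canada", "LOCATION"),
                  ("Barack Obama", "PERSON")]},
    {"text": "Canada Life opened a new office.",
     "entities": [("Canada Life", "ORGANIZATION")]},
]


def keyword_search(docs, term):
    """Plain keyword search: every document whose text contains the term."""
    return [d for d in docs if term.lower() in d["text"].lower()]


def facet_counts(docs):
    """Count, per entity, how many documents mention it at least once."""
    counts = Counter()
    for d in docs:
        for entity in set(d["entities"]):
            counts[entity] += 1
    return counts


hits = keyword_search(docs, "Canada")
facets = facet_counts(hits)
print(len(hits), "keyword hits;",
      facets[("Canada", "LOCATION")], "mention the country as an entity")
for (name, kind), n in facets.most_common():
    print(kind, name, n)
```

In this toy data all four documents match the keyword, but only two mention Canada the country; the other two match because of "Canada Post" and "Canada Life". The counts are per document, which mirrors the point above that a facet reflects co-occurrence in a document, not in a sentence.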

Use Cases

There are obviously a number of different use cases where the ability to extract entities across languages can be important.  Here are three:

  • Watch lists.  With the ability to extract entities, such as people, in multiple languages, this kind of technology is good for government or financial watch lists.  Basis can resolve matches and translate names in 9 different languages. This includes resolving multiple spelling variations of foreign names. It also enables organizations to match names of people, places, and organizations against entries in a multilingual database.
  • Legal discovery.  Basis Technology can identify 55 different languages.  Companies would use this technology, for example, to identify multiple languages within a document and then route the documents appropriately.  Additionally, Basis can extract entities in 15 different languages (and search in 21), so the technology could be used to process many documents and extract the entities associated with them to find the right set of documents needed in legal discovery.
  • Brand image, competitive intelligence.   The technology can be used to extract company names across multiple languages.  The software can also be used against disparate data sources, such as internal document management systems as well as external sources such as the Internet.  This means that it could cull the Internet to extract company name (and variations on the name) in multiple languages.  I would expect this technology to be used by “listening posts” and other “Voice of the Customer” services in the near future.
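
To give a feel for the watch list use case, here is a minimal Python sketch of matching a name against list entries while tolerating spelling variations. It uses only the standard library (`unicodedata` and `difflib.SequenceMatcher`) and is far simpler than a real multilingual name matcher: it does no transliteration or translation, and the 0.8 threshold, the sample names, and the function names are assumptions for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def match_watch_list(name, watch_list, threshold=0.8):
    """Return (score, entry) pairs at or above the threshold, best first.

    The score is difflib's character-level similarity ratio, an
    illustrative stand-in for a purpose-built name matching model.
    """
    target = normalize(name)
    scored = [
        (SequenceMatcher(None, target, normalize(entry)).ratio(), entry)
        for entry in watch_list
    ]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# Hypothetical watch list entries and queries.
watch_list = ["Muhammad Ali", "Jose Garcia", "Maria Lopez"]
accent_matches = match_watch_list("José García", watch_list)
variant_matches = match_watch_list("Mohammed Ali", watch_list)
print(accent_matches)
print(variant_matches)
```

Character similarity works within one script only. Matching a name written in Arabic or Cyrillic against a Latin-script list would need the transliteration and translation capabilities described in the watch list item.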

While this technology is not a text analytics platform, it does provide an important piece of core functionality needed in a global economy.  Look for more announcements from the company in 2010 around enhanced search in additional languages.

Five Predictions for Advanced Analytics in 2010

With 2010 now upon us, I wanted to take the opportunity to talk about five advanced analytics technology trends that will take flight this year.  Some of these are up in the clouds, some down to earth.

  • Text Analytics:  Analyzing unstructured text will continue to be a hot area for companies. Vendors in this space have weathered the economic crisis well and the technology is positioned to do even better once a recovery begins.  Social media analysis really took off in 2009 and a number of text analytics vendors, such as Attensity and Clarabridge, have already partnered with online providers to offer this service. Those that haven’t will do so this year.  Additionally, numerous “listening post” services, dealing with brand image and voice of the customer have also sprung up. However, while voice of the customer has been a hot area and will continue to be, I think other application areas such as competitive intelligence will also gain momentum.  There is a lot of data out on the Internet that can be used to gain insight about markets, trends, and competitors.
  • Predictive Analytics Model Building:  In 2009, there was a lot of buzz about predictive analytics.  For example, IBM bought SPSS and other vendors, such as SAS and Megaputer, also beefed up offerings.  A newish development that will continue to gain steam is predictive analytics in the cloud.  For example, vendors Aha! software and Clario are providing predictive capabilities to users in a cloud-based model.  While different in approach, they both speak to the trend that predictive analytics will be hot in 2010.
  • Operationalizing Predictive Analytics:  While not every company can or may want to build a predictive model, there are certainly a lot of uses for operationalizing predictive models as part of a business process.  Forward looking companies are already using this as part of the call center process, in fraud analysis, and churn analysis, to name a few use cases.  The momentum will continue to build making advanced analytics more pervasive.
  • Advanced Analytics in the Cloud:  Speaking of putting predictive models in the cloud, business analytics in general will continue to move to the cloud for mid-market companies and others that deem it valuable.  Companies such as QlikTech introduced a cloud-based service in 2009.  There are also a number of pure play SaaS vendors out there, like GoodData and others, that provide cloud-based services in this space.  Expect to hear more about this in 2010.
  • Analyzing complex data streams:  A number of forward-looking companies with large amounts of real-time data (such as RFID or financial data) are already investing in analyzing these data streams.   Some are using the on-demand capacity of a cloud-based model to do this.  Expect this trend to continue in 2010.

Text Analytics Meets Publishing

I’ve been writing about text analytics for a number of years now. Many of my blogs have included survey findings and vendor offerings in the space.  I’ve also provided a number of use cases for text analytics, many of which have revolved around voice of the customer, market intelligence, e-discovery, and fraud.  While these are all extremely valuable, there are a number of other very beneficial use cases for the technology and I believe it is important to put them out there, too.

Last week, I spoke with Daniel Mayer, a product marketing manager at TEMIS, about the publishing landscape and how text analytics can be used in both the editorial and the new product development parts of the publishing business.  It’s an interesting and significant use of the technology.

First a little background.  I don’t believe that it comes as a surprise to anyone that publishing, as we used to know it, has changed dramatically.  Mainstream newspapers and magazines have given way to desktop publishing and the Internet as economics have changed the game.  Chris Anderson wrote about this back in 2004, in Wired, in an article he called “The Long Tail” (it has since become a book).  Some of the results include:

  • Increased Competition.  There are more entrants, more content and more choice on the Internet and much of it is free.
  • Mass market vs. narrow market.  Additionally, whereas the successful newspapers and magazines of the past targeted a general audience, the Internet economically enables more narrow appeal publications.  
  • Social, Real time.  Social network sites, like Twitter, are fast becoming an important source of real time news.

All of this has caused mainstream publishers to rethink their strategies in order to survive.  In particular, publishers realize that content needs to be richer, interactive, timely, and relevant.

Consider the following example.  A plane crashes over a large river, close to an airport.  The editor in charge of the story wants to write about the crash itself, and also wants to include historical information about the causes of plane crashes (e.g. time of year, time of day, equipment malfunction, pilot error, etc., based on other plane crashes over the past 40 years) to enrich the story.  Traditionally, publishers have annotated documents with key words and dates.   Typically, this was a manual process and not all documents were thoroughly tagged.  Past annotations might not meet current expectations. Even if the documents were tagged, they might have been tagged only at a high level (e.g. plane crash), so that the editor is overwhelmed with information.   This means that it might be very difficult for her to find similar stories, much less analyze what happened in other relevant crashes.

Using text analytics, all historical documents could be culled for relevant entities, concepts, and relationships to create a much more enriched annotation scheme.  Information about the plane crash such as location, type of planes involved, dates, times, and causes could be extracted from the text.  This information would be stored as enriched metadata about the articles and used when needed.  The Luxid Platform offered by TEMIS would also suggest topics close to the given topic.  What does this do? 

  • It improves the productivity of the editor.  The editor has a complete set of information that he or she can easily navigate.  Additionally, if text analytics can extract relationships such as cause, these can be analyzed and used to enrich a story.
  • It provides new opportunities for publishers.  For example, Luxid would enable the publisher to provide the consumer with links to similar articles or set up alerts when new, similar content is created, as well as tools to better navigate data or analyze it (this might be used by fee based subscription services).  It also enables publishers to create targeted microsites and topical pages, which might be of interest to consumers.

Under many current schemes, advertisers pay online publishers.  Enhancing navigation means more visits, more page views, and a more focused audience, which can lead to more advertising revenue for the publisher.  Publishers, in some cases, are trying to go even further, by transforming readers into sales leads and receiving a commission from sales. There are other models that publishers are exploring, as well.  Additionally, text analytics could enable publishers to re-package content, on the fly (called content repurposing), which might lead to additional revenue opportunities such as selling content to brand sponsors that might resell it.  The possibilities are numerous.

I am interested in other compelling use cases for the technology.

A different spin on analyzing content – Infosphere Content Assessment

IBM made a number of announcements last week at IOD regarding new products/offerings to help companies analyze content.  One was Cognos Content Analytics, which enables organizations to analyze unstructured data alongside structured data.  It also looks like IBM may be releasing a “voice of the customer” type service to help companies understand what is being said about them in the “cloud” (i.e. blogs, message boards, and the like).  Stay tuned on that front, it is currently being “previewed”.

I was particularly interested in a new product called IBM Infosphere Content Assessment, because I thought it was an interesting use of text analytics technology.  The product uses content analytics (IBM’s term for text analytics) to analyze “content in the wild”.  This means that a user can take the software, run it over servers that might contain terabytes (or even petabytes) of data to understand what is being stored on servers.  Here are some of the potential use cases for this kind of product:

  • Decommission data.  Once you understand the data that is on a server, you might choose to decommission it, thereby freeing up storage space.
  • Records enablement.  Infosphere Content Assessment can also be used to identify which records need to go into a records management system for a record retention program.
  • E-Discovery.  Of course, this technology could also be used in litigation, investigation, and audit.  It can analyze unstructured content on servers which can help to discover information that may be used in legal matters or information that needs to meet certain audit requirements for compliance.

The reality is that the majority of companies don’t formally manage their content.  It is simply stored on file servers.  The IBM product team’s view is that companies can “acknowledge the chaos”, but use the software to understand what is there and gain control over the content.  I had not seen a product positioned quite this way before and I thought it was a good use of the content analysis software that IBM has developed.

If anyone else knows of software like this, please let me know.

Threats to the American Justice System – Can Enterprise Content Management Help?

I was at the EMC writer’s conference this past Friday, speaking on Text Analytics and ECM.  The idea behind the conference is very cool.  EMC brings together writers and bloggers from all over the world to discuss topics relevant to content management.  All of the sessions were great.  We discussed Cloud, Web 2.0, SharePoint, Text Analytics, and e-Discovery.

 I want to focus here on the e-Discovery discussion, since e-Discovery has been showing up on my top text analytics applications list for several years.  There is a growing number of vendors looking to address this problem (although not all of them may be making use of text analytics yet), including large companies like EMC, IBM, Digital Iron Mountain, and Microsoft, and smaller providers such as Zylab.

 Ralph Losey gave the presentation. He is a defense lawyer, by training, but over the years has focused on e-Discovery.  Losey has written a number of books on the topic and he writes a blog called e-Discovery Team.  An interesting fellow!

 His point was that “The failure of American business to adopt ECM is destroying the American system of justice.”  Why?  His argument went something like this:

  • You can’t find the truth if you can’t find the evidence.  As the amount of digital data explodes, it is harder to find the information companies need to defend themselves.  This is because the events surrounding the case might have occurred a year or more in the past, and the data is buried in files or email.  I don’t think anyone will argue with this fact. 
  • According to Losey, most trial lawyers are Luddites, implying that they don’t get technology.  Lawyers aren’t trained in technology, so they are not going to push for ECM systems, since they might not even know what they are.  And corporate America is putting off decisions to purchase ECM systems that could actually help organize some of the content and make it more findable.
  • Meanwhile, the cost of litigation is skyrocketing.  Since it is so expensive, many companies don’t go to court and they look to private arbitration.  Why spend $2M in e-Discovery when you can settle for $3M?  Losey pointed to one example, in the Fannie Mae securities litigation (2009), where it cost $6M (or 9% of the annual budget of the Office of Federal Housing Enterprise Oversight) to comply with ONE subpoena. This involved about 660,000 emails. 
  • According to Losey, it costs about $5 to process one computer file for e-Discovery.  This is because the file needs to be reviewed for relevance, privilege, and confidentiality. 
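
As a back-of-the-envelope check, the two cost figures above can be put side by side. This is only a rough sketch in Python; the function name is my own, and the comparison assumes that "one email" and "one computer file" are roughly comparable units, which may not hold. The Fannie Mae figure also presumably covers more than per-file review.

```python
def cost_per_item(total_cost, item_count):
    """Average cost per item, in the same currency as total_cost."""
    return total_cost / item_count


# Figures quoted in the bullets above.
SUBPOENA_COST = 6_000_000   # cost to comply with one subpoena
EMAIL_COUNT = 660_000       # approximate number of emails involved
REVIEW_COST_PER_FILE = 5    # Losey's estimate per computer file

fannie_mae_per_email = cost_per_item(SUBPOENA_COST, EMAIL_COUNT)
review_only_estimate = REVIEW_COST_PER_FILE * EMAIL_COUNT

print(f"Implied cost per email: ${fannie_mae_per_email:.2f}")
print(f"Review at $5 per file:  ${review_only_estimate:,}")
```

On these numbers, the subpoena works out to roughly $9 per email, and review at $5 per file would account for about $3.3M of the $6M.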

 Can the American justice system be saved? 

 So, can e-Discovery tools be used to help save the justice system as we know it?  Here are a few points to ponder:

  • Losey seems to believe that the e-Discovery process may be hard to automate since it requires a skilled eye to determine whether an email (or any file for that matter) is admissible in court. 
  • I’m not even sure how much corporate email is actually being stored in content management systems – even when companies have content management systems. It’s a massive amount of data.
  •  And, then there is the issue of how much email companies will want to save to begin with.  Some will store it all because they want a record.  Others seem to be moving away from email altogether.  For example, one person in the group told us that his Bank of America financial advisor can no longer communicate with him via email!  This opens up a whole different can of worms, which is not worth going into here. 
  • Then there is the issue of changing vocabularies between different departments in companies, people not using certain phrases once they get media attention, etc. etc.


Before jumping to any conclusions, let’s look at what vendors can do.  According to EMC, the email overload problem can be addressed.  The first thing to do is to de-duplicate emails that could be stored in a content management system.  Think about it.  You get an email and 20 people are copied on it.  Or, you forward someone an email and they don’t necessarily delete it.  These emails pile up.  De-duplicating emails would go a long way toward reducing the amount of content in the ECM.  Then there is the matter of classifying these emails.  That could be done.  Some of this classification would be straightforward.  The system might also be trained to look for emails that might be privileged and classify them accordingly, but this would no doubt still require human intervention.  Of course, terminology will change as well, and people will have to stay on top of this.
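
To make the de-duplication idea a little more concrete, here is a minimal Python sketch. It is my own illustration, not a description of how EMC or any ECM product does it. The function names `fingerprint` and `deduplicate` are hypothetical, and the rule it uses (same sender, subject, and body means same message) is an assumption chosen for simplicity.

```python
import hashlib
from email import message_from_string


def fingerprint(raw_email: str) -> str:
    """Hash the sender, subject, and whitespace-normalized body of a message."""
    msg = message_from_string(raw_email)
    payload = msg.get_payload()
    # Multipart messages return a list; this sketch only handles plain-text bodies.
    body = payload if isinstance(payload, str) else ""
    parts = [
        (msg.get("From") or "").strip().lower(),
        (msg.get("Subject") or "").strip().lower(),
        " ".join(body.split()),
    ]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def deduplicate(raw_emails):
    """Keep the first copy of each distinct message, preserving order."""
    seen = set()
    unique = []
    for raw in raw_emails:
        key = fingerprint(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

Because the recipient is left out of the fingerprint, the 20 copies of an email sent to 20 people collapse into one. A forwarded copy would not be caught by this sketch, since forwarding usually changes the sender and subject; handling that case would need fuzzier matching on the body.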


The upshot is that there are certainly hurdles to overcome to put advanced classification and text analytics in place to help in e-Discovery.  However, as the amount of digital information keeps piling up, something has to be done.  In this case, the value certainly would seem to outweigh the cost of business as usual.

Text Analytics Summit 2009

I just got back from the Text Analytics Summit and it was a very good conference.  I’ve been attending the Summit for the last three years and it has gotten better every year.  This year, it seemed like there were a lot more end users and the conference had more of a business-oriented approach than in previous years.  Don’t get me wrong; there were still technical discussions, but I liked the balance.


A major theme this year, as in previous years, was Voice of the Customer applications.  That is to be expected, in some ways, because it is still a hot application area and most of the vendors at the conference (including Attensity, Clarabridge, Lexalytics, SAS, and SPSS) focus on it in one form or another.  This year, there was a lot of discussion about using social media (meaning blogs, Twitter, and even social networks) for text analytics and VoC kinds of applications.  The issue of sentiment analysis was discussed at length since it is a hard problem.  Sarcasm, irony, the element of surprise, and dealing with sentiment at the feature level were all discussed.  I was glad to hear it, because it is very important.  SAS also made an announcement about some of its new features around sentiment analysis.  I’ll blog about that in a few days.


Although there was a heavy focus on the VoC type applications, we did hear from Ernst & Young on fraud applications.  This was interesting because it showed how human expertise, in terms of understanding certain phrases that might appear in fraud, might be used to help automate fraud detection.  Biogen Inc also presented on its use of text analytics in life sciences and biomedical research.  We also heard what Monster and Facebook are doing with text analytics, which was quite interesting.  I would have liked to hear more about what is happening with text analytics in media and publishing and in e-Discovery.  It would have also been useful to hear more about how text analytics is being incorporated into a broader range of applications.  I’m seeing (and Sue Feldman, from IDC, noted this too) a large number of services springing up that use text analytics.  This spans everything from new product innovation to providing real-time insight to traders.  As these services, along with the SaaS model, continue to explode, it would be useful to hear more about them next year.




Other observations

Here are some other observations on topics that I found interesting.


  • Bringing people into the equation.  While text analytics is very useful technology, it needs people to make it work.  The technology itself is not Nirvana.  In fact, it can be most useful when a person works together with the technology to make it zing.  While people who use the technology obviously know this (there is work that has to be done by people to make text analytics work), I think that people beginning the process need to be aware of this too, for many reasons.  Not only are people necessary to make the technology work, the cultural component is also critical, as it is in the adoption of any new technology.  Having said this, there was discussion on the end user panel about how companies were making use of the SaaS model (or at least services), since doing the work in-house wasn’t working out for IT (not quite sure why; either they didn’t have the time or didn’t have the skills).
  • Managing expectations.  This came up on some of the panels and in a few talks.  There were two interesting comments worth noting.  First, Chris Jones, from Intuit, said that some people believe that text analytics will tell you what to do, so expectations need to be set properly.  In other words, people need to understand that text analytics will uncover issues and even the root cause of the issues, but it is up to a company to figure out what to do with that information.  Second, there was an interesting discussion around the notion of the 85% accuracy that text analytics might provide.  The end user panel was quite lively on this topic.  I was especially taken with comments from Chris Bowmann, a former school superintendent of the Lafourche Parish School Board, and how he had used text analytics to try to help keep kids in school.  He used the technology to cull through disciplinary records to see what patterns were emerging.  Very interesting.  Yes, as he pointed out, text analytics may not be 100% accurate, but think of what 85% can get you!
  • Search needs to incorporate more text analytics.  There were two good search talks on the agenda.  Usama Fayyad, CEO of Open Insights, spoke about text analytics and web advertising, as well as how text analytics might be used to help search “get things done” (e.g., book a trip).  The other speaker on the topic was Daniel Tunkelang from Endeca, who talked about text analytics and exploratory search.  There were also a number of comments from people in the audience about services like Wolfram Alpha.
  • Content Management.  I was happy to see more about enterprise content management this year and see more people in the audience who were interested in it.  There was even a talk about it from Lou Jordano from EMC. 

 I think anyone who attended the conference would agree that text analytics has definitely hit the mainstream.

