EMC and Big Data- Observations from EMC World 2011

I attended EMC’s User Conference last week in Las Vegas. The theme of the event was Big Data meets the Cloud. So, what’s going on with Big Data and EMC? Does this new strategy make sense?

EMC acquired Greenplum in 2010. At the time EMC described Greenplum as a “shared nothing, massively parallel processing (MPP) data warehousing system.” In other words, it could handle petabytes of data. While the term data warehouse denotes a fairly static data store, EMC executives at the user conference characterized big data as a high volume of disparate data, both structured and unstructured, that is growing fast and may be processed in real time. Big data is becoming increasingly important to the enterprise not just because of the need to store this data but also because of the need to analyze it. Greenplum has some of its own analytical capabilities, but the company recently formed a partnership with SAS to provide more oomph to its analytical arsenal. At the conference, EMC also announced that it has now included Hadoop as part of its Greenplum infrastructure to handle unstructured information.

Given EMC’s strength in data storage and content management, it is logical for EMC to move into the big data arena. However, I am left with some unanswered questions. These include questions related to how EMC will make storage, content management, data management, and data analysis all fit together.

• Data Management. How will data management issues be handled (e.g., quality, loading, etc.)? EMC has a partnership with Informatica and SAS has data management capabilities, but how will all of these components work together?
• Analytics. What analytics solutions will emerge from the partnership with SAS? This is important since EMC is not necessarily known for analytics. SAS is a leader in analytics and can make a great partner for EMC. But its partnership with EMC is not exclusive. Additionally, EMC made a point of the fact that 90% of most enterprises’ data is unstructured. EMC has incorporated Hadoop into Greenplum, ostensibly to deal with unstructured data. EMC executives mentioned that the open source community has even begun developing analytics around Hadoop. EMC Documentum also has some text analytics capabilities as part of Center Stage. SAS also has text analytics capabilities. How will all of these different components converge into a plan?
• Storage and content management. How do the storage and content management parts of the business fit into the big data roadmap? It was not clear from the discussions at the meeting how EMC plans to integrate its storage platforms into an overall big data analysis strategy. In the short term we may not see a cohesive strategy emerge.

EMC is taking on the right issues by focusing on customers’ needs to manage big data. However, it is a complicated area and I don’t expect EMC to have all of the answers today. The market is still nascent. Rather, it seems to me that EMC is putting its stake in the ground around big data. This will be an important stake for the future.

What is Networked Content and Why Should We Care?

This is the first in a series of blogs about text analytics and content management. This one uses an interview format.

I recently had an interesting conversation with Daniel Mayer, from TEMIS regarding his new paper, the Networked Content Manifesto. I just finished reading it and found it to be insightful in terms of what he had to say about how enriched content might be used today and into the future.

So what is networked content? According to the Manifesto, networked content “creates a network of semantic links between documents that enable new forms of navigation and improves retrieval from a collection of documents.” It uses text analytics techniques to extract semantic metadata from documents. This metadata can be used to link documents together across the organization, thus providing a rich source of connected content for use by an entire company. Picture 50 thousand documents linked together across a company by enriched metadata that includes people, places, things, facts, or concepts and you can start to visualize what this might look like.
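To make the linking idea concrete, here is a minimal sketch in Python. It assumes a toy dictionary-based entity extractor standing in for a real text analytics engine. The names `KNOWN_ENTITIES`, `extract_entities`, and `link_documents`, as well as the sample documents, are hypothetical and are not taken from the Manifesto or from any TEMIS product.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical vocabulary of known entities; a real engine would rely on
# linguistic analysis rather than a fixed word list.
KNOWN_ENTITIES = {"boston", "emc", "greenplum", "hadoop"}


def extract_entities(text):
    """Return the set of known entities mentioned in the text."""
    words = {w.strip(".,;:!?()").lower() for w in text.split()}
    return words & KNOWN_ENTITIES


def link_documents(docs):
    """Map each pair of document ids to the entities the two documents share."""
    entity_to_docs = defaultdict(set)
    for doc_id, text in docs.items():
        for entity in extract_entities(text):
            entity_to_docs[entity].add(doc_id)

    links = defaultdict(set)
    for entity, doc_ids in entity_to_docs.items():
        # Every pair of documents mentioning the same entity gets a link.
        for a, b in combinations(sorted(doc_ids), 2):
            links[(a, b)].add(entity)
    return dict(links)


docs = {
    "memo": "EMC acquired Greenplum in 2010.",
    "blog": "Greenplum now includes Hadoop.",
    "trip": "Notes from a visit to Boston.",
}
links = link_documents(docs)
print(links)
```

In this toy collection only the memo and the blog post end up linked, because they are the only two documents that mention the same entity. Scaling the same idea to tens of thousands of documents is where a proper extraction engine and index would come in.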

Here is an excerpt of my conversation with Daniel:

FH: So, what is the value of networked content?

DM: Semantic metadata creates a richer index than was previously possible using techniques such as manual tagging. There are five benefits that semantic metadata provides. The first two benefits are that it makes content more findable and easier to explore. You can’t find what you don’t know how to query. In many cases people don’t know how they should be searching. Any advanced search engine with facets is a simple example of how you can leverage metadata to enhance information access by enabling exploration. The third benefit is that networked content can boost insight into a subject of interest by revealing context and placing it into perspective. Context is revealed by showing what else there is around your precise search – for example related documents. Perspective is typically reached through analytics. That is, attaining a high level of insight into what can be found in a large number of documents, like articles or call center notes. The final two benefits are more future looking. The first of these benefits is something we call “proactive delivery”. Up to now, people mostly access information by using search engines to return documents associated with a certain topic. For example, I might ask, “What are all of the restaurants in Boston?” But by leveraging information about your past behavior, your location, or your profile, I can proactively send you alerts about relevant restaurants you might be interested in. This is done by some advanced portals today, and the same principle can be applied to virtually any form of content. The last benefit is tight integration with workflow applications. Today, people are used to searching Google or other search engines which require a dedicated interface. If you are writing a report and need to go to the web to look for more information, this interferes with your workflow. But instead, it is possible to pipe content directly to your workflow so that you don’t need to interrupt your work to access it.
For example, we can foresee how in the near future, when typing a report in a word processing application such as MS Word, you will be able to receive, right in the interface, bits of information related contextually to what you are typing. As a chemist, you might receive suggestions of scientific articles based on the metadata extracted from the text you are typing. Likewise, content management interfaces in the future will be enriched with widgets that provide related documents and analytics.

FH: How is networked content different from other kinds of advanced classification systems provided by content management vendors today?

DM: Networked Content is ultimately a vision for how content can be better managed and distributed by leveraging semantic content enrichment. This vision is underpinned by an entire technical ecosystem, of which the Content Management System is only one element. Our White Paper illustrates how text analytics engines such as the Luxid® Content Enrichment Platform are a key part of this emerging ecosystem.

Making a blanket comparison is difficult, but generally speaking, Networked Content can leverage a level of granularity and domain specificity that the classification systems you are referring to don’t generally support.

FH: Do you need a taxonomy or ontology to make this work?

DM: I’d like to make sure we use caution when we use these terms. A taxonomy or ontology can be helpful, certainly. If a customer wants to improve navigation in content and already has an enterprise taxonomy, it will undoubtedly help by providing guidance and structure. However, in most cases it is not sufficient in and of itself to perform content enrichment. To do this you need to build an actual engine that is able to process text and identify within it some characteristics that will trigger the assignment of metadata (either by extracting concepts from the text itself or by judging the text as a whole). In the news domain, for example, the standard IPTC taxonomy is used to categorize news articles into topic areas such as economy, politics, or sports, and into subcategories like economy/economic policy or economy/macroeconomics. You can think of this as a file cabinet where you ultimately want to file every article. The IPTC taxonomy tells you the structure the file cabinet should have, but it doesn’t do the filing for you. For that, you need to build the metadata extraction engine. That’s where we come in. We provide a platform that includes standard extraction engines, which we call Skill Cartridges®, as well as the full development environment to customize them, extend their coverage, and develop new ones from the ground up if needed.
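The file cabinet analogy can be illustrated with a short Python sketch. The taxonomy below reuses the category names mentioned in the interview (they are not actual IPTC codes), while the trigger terms and the `categorize` function are hypothetical stand-ins for a real extraction engine, which would be far more sophisticated than keyword matching.

```python
# The taxonomy: the structure of the file cabinet. It names the drawers
# but says nothing about which article belongs in which drawer.
TAXONOMY = [
    "economy/economic policy",
    "economy/macroeconomics",
    "politics",
    "sports",
]

# The "engine": hypothetical trigger terms that cause a category to be
# assigned. These word lists are purely illustrative.
TRIGGERS = {
    "economy/economic policy": {"tax", "budget", "stimulus"},
    "economy/macroeconomics": {"inflation", "gdp", "unemployment"},
    "politics": {"election", "senate", "parliament"},
    "sports": {"match", "league", "tournament"},
}


def categorize(article):
    """Return every taxonomy category whose trigger terms appear in the article."""
    words = {w.strip(".,;:!?()").lower() for w in article.split()}
    return [category for category in TAXONOMY if words & TRIGGERS[category]]


print(categorize("Inflation and unemployment rose ahead of the election."))
```

The point of the separation is that `TAXONOMY` alone files nothing. Only once trigger logic exists does an article land in a drawer, and a single article can land in more than one.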

FH: I know that TEMIS is heavily into the publishing industry and you cite publishing examples in the Manifesto. What other use cases do you see?

DM: The Life Sciences industry (especially Pharma and Crop Science) has been an early adopter of this technology for applications such as scientific discovery, IP management, knowledge management, and pharmacovigilance. These are typical use cases for all research-intensive sectors. Another group of common use cases for this technology in the private sector is what we call Market Intelligence: understanding your competitors and complementors (Competitive Intelligence), your customers (Voice of the Customer), and/or what is being said about you (Sentiment Analysis). You can think of all of these as departmental applications in the sense that they primarily serve the needs of one department: R&D, Marketing, Strategy, etc.

Furthermore, we believe there is an ongoing trend for the Enterprise to adopt Networked Content transversally, beyond departmental applications, as a basic service of its core information system. There, content enrichment can act as the glue between content management, search, and BI, and can bring productivity gains and boost insight throughout the organization. This is what has led us to deploy within EMC Documentum and Microsoft SharePoint 2010. In the future all the departmental applications will become even more ubiquitous thanks to such deployments.

FH: How does Networked Content relate to the Semantic Web?

DM: They are very much related. The Semantic Web has been primarily concerned with how information that is available on the Web should be intelligently structured to facilitate access and manipulation by machines. Networked Content is focused on corporate – or private – content and how it can be connected with other content, either private, or public.

Five vendors committed to content analytics for ECM

In 2007, Hurwitz & Associates fielded one of the first market studies on text analytics. At that time, text analytics was considered to be more of a natural extension to a business intelligence system than a content management system. However, in that study, we asked respondents who were planning to use the software whether they were planning to deploy it in conjunction with their content management systems. It turns out that a majority of respondents (62%) intended to use text analytics software in this manner. Text analytics, of course, is a natural extension to content management, and we have seen the market evolve to the point where several vendors have included text analytics as part of their offerings to enrich content management solutions.

Over the next few months, I am going to do a deeper dive into solutions that are at the intersection of text analytics and content management; three from content management vendors EMC, IBM, and OpenText as well as solutions from text analytics vendor TEMIS and analytics vendor SAS. Each of these vendors is actively offering solutions that provide insight into content stored in enterprise content management systems. Many of the solutions described below also go beyond providing insight for content stored in enterprise content management systems to include insight over other content both internal and external to an organization. A number of solutions also integrate structured data with unstructured information.

EMC: EMC refers to its content analytics capability as Content Intelligence Services (CIS). CIS supports entity extraction as well as categorization. It enables advanced search and discovery over a range of platforms including ECM systems such as EMC’s Documentum, Microsoft SharePoint, and others.

IBM: IBM offers a number of products with text analytics capabilities. Its goal is to provide rapid and deep insight into unstructured data. The IBM Content Analytics solution provides integration into IBM ECM (FileNet) solutions such as IBM Case Manager, its big data solutions (Netezza) and integration technologies (DataStage). It also integrates securely with other ECM solutions such as SharePoint, Livelink, Documentum and others.

OpenText: OpenText acquired text analytics vendor Nstein in 2010 in order to invest in semantic technology and expand its semantic coverage. Nstein semantic services are now integrated with OpenText’s ECM suite. This includes automated content categorization and classification as well as enhanced search and navigation. The company will soon be releasing additional analytics capabilities to support content discovery. Content Analytics services can also be integrated into other ECM systems.

SAS: SAS Institute provides a number of products for unstructured information access and discovery as part of its vision for the semantically integrated enterprise. These include SAS Enterprise Content Categorization, SAS Ontology Management (both for improving document relevance) and SAS Sentiment Analysis and SAS Text Miner for knowledge discovery. The products integrate with structured information; with Microsoft SharePoint, FAST ESP, Endeca, EMC Documentum; as well as with both Teradata and Greenplum.

TEMIS: TEMIS recently released its Networked Content Manifesto, which describes its vision of a network of semantic links connecting documents to enable new forms of navigation and retrieval from a collection of documents. It uses text analytics techniques to extract semantic metadata from documents that can then link documents together. Content Management systems form one part of this linked ecosystem. TEMIS integrates into ECM systems including EMC Documentum and Centerstage, Microsoft SharePoint 2010 and MarkLogic.

A different spin on analyzing content – Infosphere Content Assessment

IBM made a number of announcements last week at IOD regarding new products/offerings to help companies analyze content.  One was Cognos Content Analytics, which enables organizations to analyze unstructured data alongside structured data.  It also looks like IBM may be releasing a “voice of the customer” type service to help companies understand what is being said about them in the “cloud” (i.e. blogs, message boards, and the like).  Stay tuned on that front; it is currently being “previewed”.

I was particularly interested in a new product called IBM Infosphere Content Assessment, because I thought it was an interesting use of text analytics technology.  The product uses content analytics (IBM’s term for text analytics) to analyze “content in the wild”.  This means that a user can take the software, run it over servers that might contain terabytes (or even petabytes) of data to understand what is being stored on servers.  Here are some of the potential use cases for this kind of product:

  • Decommission data.  Once you understand the data that is on a server, you might choose to decommission it, thereby freeing up storage space
  • Records enablement.   Infosphere Content Assessment can also be used to identify what records need to go into a records management system for a record retention program
  • E-Discovery.  Of course, this technology could also be used in litigation, investigation, and audit.  It can analyze unstructured content on servers which can help to discover information that may be used in legal matters or information that needs to meet certain audit requirements for compliance.

The reality is that the majority of companies don’t formally manage their content.  It is simply stored on file servers.  The IBM product team’s view is that companies can “acknowledge the chaos”, but use the software to understand what is there and gain control over the content.  I had not seen a product positioned quite this way before and I thought it was a good use of the content analysis software that IBM has developed.

If anyone else knows of software like this, please let me know.

Threats to the American Justice System – Can Enterprise Content Management Help?

I was at the EMC writer’s conference this past Friday, speaking on Text Analytics and ECM.  The idea behind the conference is very cool.  EMC brings together writers and bloggers from all over the world to discuss topics relevant to content management.  All of the sessions were great.  We discussed Cloud, Web 2.0, SharePoint, Text Analytics, and e-Discovery.

 I want to focus here on the e-Discovery discussion, since e-Discovery has been showing up on my top text analytics applications list for several years.  There are a growing number of vendors looking to address this problem (although not all of them may be making use of text analytics yet) including large companies like EMC, IBM, Digital Iron Mountain, Microsoft and smaller providers such as Zylab.

 Ralph Losey gave the presentation. He is a defense lawyer, by training, but over the years has focused on e-Discovery.  Losey has written a number of books on the topic and he writes a blog called e-Discovery Team.  An interesting fellow!

 His point was that “The failure of American business to adopt ECM is destroying the American system of justice.”  Why?  His argument went something like this:

  • You can’t find the truth if you can’t find the evidence.  As the amount of digital data explodes, it is harder to find the information companies need to defend themselves.  This is because the events surrounding the case might have occurred a year or more in the past, and the data is buried in files or email.  I don’t think anyone will argue with this fact. 
  • According to Losey, most trial lawyers are Luddites, meaning that they don’t get technology.  Lawyers aren’t trained this way, so they are not going to push for ECM systems; they might not even know what they are.  And corporate America is putting off decisions to purchase ECM systems that could actually help organize some of the content and make it more findable.
  • Meanwhile, the cost of litigation is skyrocketing.  Since it is so expensive, many companies don’t go to court and they look to private arbitration.  Why spend $2M in e-Discovery when you can settle for $3M?  Losey pointed to one example, in the Fannie Mae securities litigation (2009), where it cost $6M (or 9% of the annual budget of the Office of Federal Housing Enterprise Oversight) to comply with ONE subpoena. This involved about 660,000 emails. 
  • According to Losey, it costs about $5 to process one computer file for e-Discovery.  This is because the file needs to be reviewed for relevance, privilege, and confidentiality. 
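The figures Losey cited can be put side by side with a little arithmetic. The sketch below uses only the numbers quoted above ($6M, about 660,000 emails, and about $5 per file). The variable names are mine, and the comparison is rough, since the subpoena cost may well include work beyond per-file review.

```python
# Figures quoted in the talk.
subpoena_cost = 6_000_000      # dollars spent complying with one subpoena
emails_reviewed = 660_000      # approximate number of emails involved
per_file_estimate = 5          # Losey's rough cost to process one file

# Implied cost per email in the Fannie Mae example.
cost_per_email = subpoena_cost / emails_reviewed

# What the same volume would cost at the $5-per-file rule of thumb.
estimate_total = per_file_estimate * emails_reviewed

print(f"Implied cost per email: ${cost_per_email:.2f}")
print(f"Cost at $5 per file:    ${estimate_total:,}")
```

By this rough reckoning the subpoena worked out to roughly $9 per email, nearly double the $5 rule of thumb, which at that volume would come to $3.3M.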

 Can the American justice system be saved? 

 So, can e-Discovery tools be used to help save the justice system as we know it?  Here are a few points to ponder:

  • Losey seems to believe that the e-Discovery process may be hard to automate since it requires a skilled eye to determine whether an email (or any file for that matter) is admissible in court. 
  • I’m not even sure how much corporate email is actually being stored in content management systems – even when companies have content management systems. It’s a massive amount of data.
  •  And, then there is the issue of how much email companies will want to save to begin with.  Some will store it all because they want a record.  Others seem to be moving away from email altogether.  For example, one person in the group told us that his Bank of America financial advisor can no longer communicate with him via email!  This opens up a whole different can of worms, which is not worth going into here. 
  • Then there is the issue of changing vocabularies between different departments in companies, people not using certain phrases once they get media attention, etc. etc.


Before jumping to any conclusions let’s look at what vendors can do.  According to EMC, the email overload problem can be addressed.  The first thing to do is to de-duplicate emails that could be stored in a content management system.  Think about it.  You get an email and 20 people are copied on it. Or, you forward someone an email and they don’t necessarily delete it. These emails pile up.  De-duplicating emails would go a long way toward reducing the amount of content in the ECM.  Then there is the matter of classifying these emails.  That could be done.  Some of this classification would be straightforward.  The system might also be trained to look for emails that might be privileged and classify them accordingly, but this would no doubt still require human intervention.  Of course, terminology will change as well, and people will have to stay on top of this.
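As a rough illustration of the de-duplication step, here is a minimal Python sketch. It assumes each email is a simple dictionary with `to`, `subject`, and `body` fields and treats two emails as duplicates when their normalized subject and body match. The function names and the data layout are hypothetical; this is not how EMC's products are known to work.

```python
import hashlib


def fingerprint(email):
    """Hash the normalized subject and body.

    Recipients are deliberately ignored so that copies of one message
    sent to different people collapse to the same fingerprint.
    """
    text = email["subject"] + "\n" + email["body"]
    normalized = " ".join(text.lower().split())  # fold case and whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(emails):
    """Keep the first copy of each distinct message, preserving order."""
    seen = set()
    unique = []
    for email in emails:
        key = fingerprint(email)
        if key not in seen:
            seen.add(key)
            unique.append(email)
    return unique


mailbox = [
    {"to": "alice", "subject": "Q3 report", "body": "Please review the attached."},
    {"to": "bob", "subject": "Q3 report", "body": "Please review the  attached."},
    {"to": "carol", "subject": "Lunch", "body": "Noon on Friday?"},
]
unique = deduplicate(mailbox)
print(len(unique))
```

Exact-match hashing like this only catches identical copies. A forwarded message with a "Fwd:" prefix or an added note would need fuzzier matching, which is part of why human oversight remains in the picture.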


The upshot is that there are certainly hurdles to overcome to put advanced classification and text analytics in place to help in e-Discovery.  However, as the amount of digital information keeps piling up, something has to be done.  In this case, the value certainly would seem to outweigh the cost of business as usual.

Text Analytics Summit 2009

I just got back from the Text Analytics Summit and it was a very good conference.  I’ve been attending the Summit for the last three years and it has gotten better every year.  This year, it seemed like there were a lot more end users and the conference had more of a business oriented approach than in previous years.  Don’t get me wrong- there were still technical discussions, but I liked the balance.


A major theme this year, as in previous years, was Voice of the Customer applications.  That is to be expected, in some ways, because it is still a hot application area and most of the vendors at the conference (including Attensity, Clarabridge, Lexalytics, SAS, and SPSS) focus on it in one form or another.  This year, there was a lot of discussion about using social media (meaning blogs, Twitter, and even social networks) for text analytics and VoC kinds of applications.  The issue of sentiment analysis was discussed at length since it is a hard problem.  Sarcasm, irony, the element of surprise, and dealing with sentiment at the feature level were all discussed.  I was glad to hear it, because it is very important.  SAS also made an announcement about some of its new features around sentiment analysis.  I’ll blog about that in a few days.


Although there was a heavy focus on the VoC type applications, we did hear from Ernst & Young on fraud applications.  This was interesting because it showed how human expertise, in terms of understanding certain phrases that might appear in fraud, might be used to help automate fraud detection.  Biogen Inc also presented on its use of text analytics in life sciences and biomedical research.  We also heard what Monster and Facebook are doing with text analytics, which was quite interesting.  I would have liked to hear more about what is happening with text analytics in media and publishing and e-Discovery.  It would have also been useful to hear more about how text analytics is being incorporated into a broader range of applications.  I’m seeing (and Sue Feldman, from IDC, noted this too) a large number of services springing up that use text analytics.  This spans everything from new product innovation to providing real time insight to traders.  As these services, along with the SaaS model continue to explode, it would be useful to hear more about them next year.




Other observations

Here are some other observations on topics that I found interesting.


  • Bringing people into the equation.  While text analytics is very useful technology, it needs people to make it work.  The technology itself is not Nirvana.  In fact, it can be most useful when a person works together with the technology to make it zing.  While people who use the technology obviously know this (there is work that has to be done by people to make text analytics work), I think that people beginning the process need to be aware of this too, for many reasons.  Not only are people necessary to make the technology work, the cultural component is also critical, as it is in the adoption of any new technology.  Having said this, there was discussion on the end user panel about how companies were making use of the SaaS model (or at least services), since it wasn’t working out for IT (not quite sure why – either they didn’t have the time or didn’t have the skills).
  • Managing expectations. This came up on some of the panels and in a few talks.  There were two interesting comments worth noting.  First, Chris Jones, from Intuit said that some people believe that text analytics will tell you what to do, so expectations need to be set properly.  In other words, people need to understand that text analytics will uncover issues and even the root cause of the issues, but it is up to a company to figure out what to do with that information.   Second, there was an interesting discussion around the notion of the 85% accuracy that text analytics might provide.  The end user panel was quite lively on this topic. I was especially taken with comments from Chris Bowmann, a former school superintendent of the Lafourche Parish School Board, and how he had used text analytics to try to help keep kids in school. He used the technology to cull through disciplinary records to see what patterns were emerging.  Very interesting.  Yes, as he pointed out, text analytics may not be 100% accurate, but think of what 85% can get you!
  • Search needs to incorporate more text analytics. There were two good search talks on the agenda.  Usama Fayyad, CEO of Open Insights, spoke about text analytics and web advertising as well as how text analytics might be used to help search “get things done” (e.g., book a trip).  The other speaker on the topic was Daniel Tunkelang from Endeca, who talked about text analytics and exploratory search. There were also a number of comments from people in the audience about services like Wolfram Alpha.
  • Content Management.  I was happy to see more about enterprise content management this year and see more people in the audience who were interested in it.  There was even a talk about it from Lou Jordano from EMC. 

 I think anyone who attended the conference would agree that text analytics has definitely hit the main stream.

2009 Text Analytics Survey

 Several weeks ago, in preparation for the Text Analytics Summit, I deployed a short survey about the state of text analytics.  I supplemented the end-user survey with vendor interviews.   Here are some of the end-user findings. 

First, a few words about the survey itself and who responded to the survey.

  • I wanted to make the survey short and sweet.  I was interested in companies’ plans for text analytics and whether the economy was affecting these plans.
  • Let me say up front that given the topic of the survey and our list, I would categorize most of the respondents as fairly analytical and technology savvy.  Approximately 50 companies responded to the survey – split evenly between those companies that were deploying the technology and those that were not (note that this is a self selecting sample and does not imply that 50% of companies are currently using text analytics).  The respondents represent a good mix across a number of verticals including computing, telecommunications, education, pharmaceuticals, financial services, government, and CPG.  There were also a few market researchers in the mix.  Likewise, there was a mix of companies of various sizes. 
  • Here’s my caveat:  I would not view the respondents as a scientific sample and I would view these results as qualitative.  That said, many of the results paralleled results from previous surveys.  So, while the results are unscientific, in terms of a random sample and size, I believe they probably do reflect what many companies are doing in this space.


Kinds of applications, implementation schemes

I asked those respondents that were deploying text analytics what kinds of applications they were using it for.  The results were not surprising.  The top three responses (Voice of the Customer (VoC), Competitive Intelligence, and eDiscovery) were also in the top three the last time I asked the question. Additionally, many of the respondents were deploying more than one type of application (e.g., VoC and quality analysis).  This was a pattern that also emerged in a study I did on text analytics back in 2007. Once a company gains value from one implementation, it then sees the wider value of the technology (and realizes that it has a vast amount of information that needs to be analyzed).

 I asked those companies that were planning to deploy the technology, the top applications being considered.  In this case, VoC and Competitive Intelligence were again in the top two.  Brand Management and Product R&D were tied for third.  This is not surprising.  Companies are quite concerned with customer feedback and any issues that impact customer retention.  Companies want to understand what competitors are up to and how their brand is being perceived in the market.  Likewise, they are also trying to get smarter about how they develop products and services to be more cost effective and more market focused.



How Text Analytics is being deployed

I also wanted to find out how companies were deploying the technology. In particular, we’ve heard a lot this past year about organizations utilizing text analytics in a Software as a Service (SaaS) model.  This model has become particularly attractive in the Voice of the Customer/Competitive Intelligence/Brand Management area for several reasons.  For one thing, this kind of analysis might involve some sort of external information source such as news feeds and blog postings.  Even product R&D would draw from external sources such as trade journals, news about competitive products, and patent files.  Additionally, the SaaS model generally has a different price point than enterprise solutions.

In fact, SaaS was the model of choice for implementing the technology.  The SaaS model does offer the flexibility and price point that many companies are looking for, especially in some of the above-mentioned areas.  However, that is not to say that companies are not deploying text analytics in other ways (note the values on the X axis).  Interestingly, companies are starting to deploy text analytics in conjunction with their content management systems.  I think we will see more of this as the technology continues to become more mainstream.


Just as an FYI, all of the companies that had deployed text analytics stated that the implementations either met or exceeded their expectations.  And, close to 60% stated that text analytics had actually exceeded expectations. 

 What about those companies that aren’t deploying the technology?

Equally important to understanding the market are those companies that are not deploying text analytics.  I asked those companies if they had any plans to utilize the technology.  Eleven percent stated that plans had been put on hold due to funding constraints.  Twenty-eight percent stated that they had no plans to implement the technology.  Another 28% stated that they planned to implement the technology in the next year and 33% said they planned to implement it in the next few years. 

Reasons cited for not implementing the technology included not understanding enough about text analytics to implement it.  Other companies just never considered implementing it, or had other analytic projects on the radar.

What about the economy?

There have been numerous articles written about whether certain technologies are recession proof, with various BI related technology vendors stating/hoping/praying that their technology falls into this category. And certainly, companies do feel the need to gain insight about operational efficiency, their customers, the market, and the competition with perhaps greater urgency than in the past. This has helped keep business analytics vendors moving forward in this economy.

The 11% number is relatively small. However, I wonder what part of the 61% who said they would be deploying it in the future (the 28% planning for the next year plus the 33% planning for the next few years) might actually fall into the hold category. When I asked text analytics vendors (mostly private companies) whether the economy was impacting the sales cycle, they pretty much said the same thing. Existing customers were not dropping projects (there is too much value there, as supported by this survey). However, sales cycles are longer (companies are not necessarily rushing) and potential clients may be looking for creative financing and contracting options.

I am participating in an analyst panel at the Text Analytics Summit in June. I have more to say about the topic, as, I am sure, do the other analysts who will be participating.

The Three C’s – conceptual search, clustering, and categorization


I recently had the opportunity to speak with Richard Turner, VP of Marketing at Content Analyst. Content Analyst was originally part of SAIC and spun out about 5 years ago. The company provides content management, eDiscovery, and content workflow solutions to its clients, primarily as an OEM solution.

The tool set is called CAAT (Content Analyst Analytic Technology).  It includes five core features:


  • Concept search: uses concepts rather than key words to search through documents.
  • Dynamic Clustering: classifies documents into clusters.
  • Categorization: classifies documents into user-defined categories.
  • Summarization: identifies conceptually definitive sentences in documents.
  • Foreign Language: identifies the language in a document and can work across any language.

Let me just briefly touch upon concept search and dynamic clustering and categorization. Concept search is interesting because when most people think search, they think key word search. However, key words may not give you what you’re looking for. For example, I might be interested in finding documents that deal with banks. However, the document might not state the word bank explicitly. Rather, words like finance, money, and so on might occur in the document. So, if you don’t insert the right word into the search engine, you will not get back all relevant documents. Concept search allows you to search on the concept (not keyword) “bank” so you get back documents related to the concept even if they don’t contain the exact word. CAAT learns the word bank in a given set of documents from words like “money”, “exchange rate”, etc. It also learns that the word bank (as in financial institution) is not the same as the bank on the side of a river because of other terms in the document (such as money, transfer, or ATM).

Dynamic clustering enables the user to organize documents into categories based on content called clusters.  You can also categorize documents by using examples that fall into a certain cluster and then train the system to recognize similar documents that could fall into the same category.  You literally tag the document as belonging to a certain category and then give the system examples of other documents that are similar to this to train on.  In eDiscovery applications, this can help dramatically cut down the amount of time needed to find the right documents.  In the past, this was done manually, which obviously could be very time intensive. 

How do they do it?

The company uses a technique called Latent Semantic Indexing (LSI), along with other patented technology, to help it accomplish all of this. Here is a good link that explains LSI. The upshot is that LSI uses a vector representation of the information found in the documents to analyze the term space in a document. Essentially, it removes the grammar, then counts and weights (e.g. how often a word appears on a page or in a document, etc.) the occurrence of the terms in the document. It does this across all of the documents, and actually collapses the matrix using a technique patented at Bell Labs. The more negative a term, the greater its distance from a page. Since the approach is mathematical, there is no need to put together a dictionary or thesaurus. And, it’s this mathematical approach that makes the system language independent.
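To make the count-then-collapse idea a bit more concrete, here is a minimal sketch of textbook LSI in Python. It is not CAAT's implementation, which the company did not describe in detail. It assumes NumPy is installed, uses four made-up one-line documents, skips the weighting step (raw term counts only), and collapses the term-document matrix with a plain truncated SVD down to two dimensions. All names in it are illustrative.

```python
import numpy as np

# Hypothetical mini-corpus: two "financial bank" documents, two "river bank" ones.
docs = [
    "bank money transfer atm",
    "money exchange rate finance",
    "river bank water fishing",
    "river water boat",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def to_vector(text):
    """Raw term counts over the corpus vocabulary (no weighting in this sketch)."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

# Term-document matrix: one row per term, one column per document.
A = np.column_stack([to_vector(d) for d in docs])

# "Collapse" the matrix: keep only the k strongest dimensions of the SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]

def project(v):
    """Map a term-count vector into the k-dimensional concept space."""
    return U_k.T @ v

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def lsi_scores(query):
    """Cosine similarity between the query and every document in concept space."""
    q = project(to_vector(query))
    return [cosine(q, project(A[:, j])) for j in range(A.shape[1])]

# A keyword search for "finance" only finds the document containing the word.
keyword_hits = [i for i, d in enumerate(docs) if "finance" in d.split()]

# The concept-space search also scores document 0 highly, because it shares
# "money" with the document that mentions "finance".
scores = lsi_scores("finance")
for doc, score in zip(docs, scores):
    print(f"{score:+.2f}  {doc}")
```

In this toy corpus the first document never mentions "finance", yet it scores well above the two river documents, which is the behavior the bank example above describes. A real system would add term weighting and far more documents.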

Some people have argued this technique can’t scale because the matrix would be too large and it would be hard to keep this in-memory. However, when I asked the folks at Content Analyst about this they told me that they have been working on the problem and that CAAT contains a number of features to optimize memory and throughput. The company regularly works with litigation clients who might get 1-2 TB of information from opposing counsel and they are using CAAT for clustering, categorization, and search. The company also works with organizations that have created indexes of 45+ million (>8 TB) documents. That’s a lot of data!

Conceptual Search and Classification and Content Management

Aside from eDiscovery, Content Analyst is also being used in applications such as improving search in media and publishing and, of course, government applications. The company is also looking into other application areas.

Concept search is definitely a big step up from keyword search and is important for any kind of document that might be stored in a content management system. Automatic classification and clustering would also be huge (as would summarization and foreign language recognition). This could move Content Analyst into other areas including analyzing claims, medical records, and customer feedback. Content management vendors such as IBM and EMC are definitely moving in the direction of providing more intelligence in their content management products. This makes perfect sense, since a lot of unstructured information is stored in these systems. Hopefully, other content management vendors will catch up soon.

Text Analytics meets Enterprise Content Management ‘Round 2– IBM Content Analyzer

Why? How? These are key questions that business people ask a lot. Questions such as, “Why did our customer retention rate plummet?” or “How come our product quality has declined?” or even “How did we end up in this mess?” often cannot be answered using structured data alone. However, there is a lot of unstructured data out there in content management systems that is ripe for this kind of analysis. Claims, case files, contracts, call center notes, and various forms of correspondence are all prime sources of insight.

This past year, I have become quite interested in what content management vendors are doing about text analytics. This is, in part, due to some research we had done at Hurwitz & Associates, which indicated that companies were planning to deploy their text analytics solutions in conjunction with content management systems. Many BI vendors have already incorporated text analytics into their BI platforms, yet earlier this year there didn’t seem to be much action on this front on the part of the ECM vendors.

Now, several content management providers are stepping up to the plate with offerings in this space. One of these vendors is IBM. IBM’s Content Analyzer, formerly IBM OmniFind Analytics Edition, uses linguistic understanding and trend analysis to allow users to search, mine and analyze the combined information from their unstructured content and structured data. Content Analyzer consists of two pieces: a backend linguistics component and a visualization and analysis text mining user interface. Content Analyzer also has a direct integration with FileNet P8, which means Content Analyzer understands FileNet formats and that the integrity of any information from the FileNet system is maintained as it moves into Content Analyzer.

Last week, Rashmi Vital, the offer manager for content analytics, gave me a demo of the product. It has come a long way since I wrote about what IBM was doing in the space back in February. The demo she showed me utilized data from the National Highway Traffic Safety Administration (NHTSA) database, which logs consumer complaints about vehicles. The information includes structured information (for example, the incident date and whether there was a crash) and unstructured information, which is the free-text description written by consumers. Rashmi showed me how the text mining capabilities of Content Analyzer can be used for the early detection of quality issues.

Let’s suppose that management wants to understand quality problems with cars, trucks, and SUVs, before they become a major safety issue (the auto industry doesn’t need any more trouble than it already has) and they want to understand what specific component of the car is causing the complaint. The user decides to explore a data source of customer complaints. In this example, we are using the NHTSA data, but he or she could obviously get more information from his or her own warranty claims, reports, and technician notes stored in the content management system.

The user wants to gather all of the incident reports associated with fire. Since fire isn’t a structured data field, the first thing the user can do is either simply search on fire, or set up a small dictionary of terms with words and phrases that would also represent fire. These might include words like flame, blaze, spark, burst, and so on. Content Analyzer crawls the documents and puts them in an index. Content Analyzer’s index contains the NLP analytic results of the corpus of documents and the document itself because often the analyst wants to see the source.
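As a rough illustration of the small-dictionary idea (not of how Content Analyzer's linguistic component actually works, which I did not see under the hood), here is a sketch in Python. The term list comes from the paragraph above; the complaint texts, the function name, and the crude plural handling are all assumptions made for the example.

```python
import re

# Dictionary of terms that should count as a mention of fire.
FIRE_TERMS = {"fire", "flame", "blaze", "spark", "burst"}

def mentions_fire(description, terms=FIRE_TERMS):
    """Return True if any word of the free-text description is a dictionary term.

    Plurals are handled crudely by also checking the word without trailing "s".
    """
    words = re.findall(r"[a-z]+", description.lower())
    return any(w in terms or w.rstrip("s") in terms for w in words)

# Hypothetical complaint descriptions, standing in for the free-text field.
complaints = [
    "Sparks came from the wiring harness under the dash",
    "Brake pedal felt soft on the highway",
    "Engine compartment was in flames within minutes",
]

fire_reports = [c for c in complaints if mentions_fire(c)]
print(fire_reports)
```

A plain search on the word "fire" would have missed both of the matching complaints here, which is the reason for building the dictionary in the first place.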

Here is a screen shot of the IBM Content Analyzer’s visualization tool called Text Miner. Text Miner provides facilities for real-time statistical analysis on the index for a source dataset. It allows the users to analyze the processed data by organizing the data into categories, applying search conditions, and further drilling down to analyze patterns over time or correlations.


You can see that the search on fire returned about 200,000 documents (there were over 500,000 to start). The user can then correlate fire with specific problems. In this example, the user decides to correlate it to a structured field called “vehicle component”. The vehicle component that most highly correlated to fire (and with a high number of hits) is the electrical system wiring. The user can continue to drill down on this information, in order to determine what make and model of car had this problem, and once he or she has distilled the analysis to a manageable number, examine the actual description of the problem to understand the root cause.


Correlation analysis has another benefit because sometimes it can help us see highly unusual trends that we would not otherwise consider. Suppose we take the same criteria as above and sort by correlation value (see next figure). It is not surprising to see components like electrical system or fuel system listed since we assume this is a normal place for potential fires to start. However, just below those components you can see a high correlation between fire and Vehicle Speed Control: Cruise Control Component. Perhaps this may not be an area an analyst would consider a potential fire hazard. The high correlation value would be a signal to an analyst to investigate further by drilling down into the component, Vehicle Speed Control: Cruise Control, and into the descriptions that customers submitted. The following view looks at the results of analyzing the phrases related to the current analytic criteria. Being able to drill down to see the actual incident report description allows the analyst to see the issue in its entirety. This is good stuff.
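I don't know the exact correlation measure Content Analyzer computes, so the following Python sketch substitutes a simple stand-in: lift, the share of a component's complaints that mention fire divided by the share of all complaints that mention fire. The records and the function name are hypothetical.

```python
from collections import Counter

def component_lift(records):
    """records: iterable of (component, has_fire) pairs. Returns {component: lift}.

    Lift above 1.0 means fire shows up more often for that component than
    it does across all complaints.
    """
    records = list(records)
    total = len(records)
    fire_total = sum(1 for _, fire in records if fire)
    if total == 0 or fire_total == 0:
        return {}
    base_rate = fire_total / total
    per_component = Counter(c for c, _ in records)
    fire_per_component = Counter(c for c, fire in records if fire)
    return {c: (fire_per_component[c] / n) / base_rate
            for c, n in per_component.items()}

# Hypothetical complaints: (structured "vehicle component" field, fire mentioned?)
records = (
    [("electrical system: wiring", True)] * 3
    + [("electrical system: wiring", False)] * 1
    + [("cruise control", True)] * 2
    + [("brakes", False)] * 4
)

lift = component_lift(records)
for component, value in sorted(lift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{value:.2f}  {component}")
```

In this made-up data the wiring has the most fire complaints, but sorting by lift puts cruise control first because every one of its complaints mentions fire. That is the same kind of surprise the sorted view surfaced in the demo.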


Stay tuned as I plan to showcase what other content management vendors are doing in this space.

I’m interested in your company’s plans for text analytics and content management. Please answer my poll, below:


A different way to search?

I recently had an interesting conversation about classification and search with James Zubok, CFO of Brainware, Inc. Brainware is a Virginia based company that was once part of SER Systems AG, a former German ECM company.  Brainware provides products that help companies extract information from unstructured and semi-structured documents, such as invoices, order forms, contracts, etc. without using templates.  The company also offers some interesting classification and search technology and this is what our conversation focused on.


We discussed two different, but interrelated, technologies that Brainware has developed: one, a search engine based on n-grams, and another, a classification engine that uses neural networks. Brainware offers both enterprise and desktop editions of each. I received a demo of the desktop version of the products and now have both running on my laptop.


A Search example

On the desktop search side, the product, called Globalbrain Personal Edition, differs from many other search products on the market in that it does not make use of keyword search. Rather, its searches are natural language based, using a patented n-gram approach. When indexing a word, the word is parsed into three-letter parts and then a vector is created. For example, the word sample would be parsed as sam, amp, mpl, etc. According to Brainware, this three-letter snippet approach makes the search engine language independent. The capability provided by Brainware lets users search, not simply on key words, but on whole paragraphs. For example, I have many documents (in various formats) on my desktop that deal with all of the companies I speak with. Say, I want to find some documents relating to specific challenges companies faced in deploying their text analytics solutions. Rather than simply inputting “text analytics” and “challenges”, I can type in a phrase or even a paragraph with the wording I’m looking for. This returns a much more targeted set of documents.
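Brainware's n-gram method is patented and I haven't seen its internals, so the following is only a generic sketch of the idea in Python: break every word into overlapping three-letter snippets, count them into a vector, and rank documents by cosine similarity to the query's vector. The function names and sample documents are made up for illustration.

```python
from collections import Counter
from math import sqrt

def trigrams(text):
    """Return the overlapping three-letter snippets of each word in the text."""
    grams = []
    for word in text.lower().split():
        grams.extend(word[i:i + 3] for i in range(len(word) - 2))
    return grams

def cosine(a, b):
    """Cosine similarity between two snippet-count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query, documents):
    """Score every document against the query, best match first."""
    q = Counter(trigrams(query))
    scored = [(cosine(q, Counter(trigrams(d))), d) for d in documents]
    return sorted(scored, reverse=True)

print(trigrams("sample"))

# Hypothetical document titles standing in for files on a desktop.
documents = [
    "Deployment challenges for text analytics projects",
    "Quarterly pricing for storage hardware",
    "Visualization dashboards for sales teams",
]
query = "challenges companies faced deploying text analytics"
for score, doc in rank(query, documents):
    print(f"{score:.2f}  {doc}")
```

Note that "deploying" in the query still contributes to the match with "Deployment" because the two words share snippets such as dep, epl, and plo. Nothing in the code knows anything about English, which is the intuition behind the language independence claim.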


A Classification example

On the desktop classification front, the product is very easy to use.  I simply loaded the software which provided me a user interface where I could develop classes and then train my system to automatically classify documents based on a few training examples. As I mentioned, I have many documents on my desktop that deal with various technology areas and I might want to classify them in an intelligent manner for some research I’m planning.  So, I created several classes: text analytics, visualization, and MDM. I simply created these classes and then dragged documents that I felt fell into each category onto those classes.  I trained the system on these examples. 


Brainware provides a visual interface that lets me view how “good” the learned set is via a series of points in three-dimensional space.  The closer together the points (from the same class) are on the plot, the better the classification will be.  Also, the more separate the various class points are, the better the classification.  In my classification test, the visualization and the MDM documents were tightly clustered, while the text analytics information was not.  In any event, I then ran the classifier over the rest of my documents (supplying a few parameters) and the system automatically classified what it could.  It also gave me a list of documents that it couldn’t classify, but suggested the appropriate categories. I could then just drag those documents into the appropriate categories and run the classifier again. I should add that it did a good job of suggesting the right class for the documents it put in the unclassified category.
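For readers who want a feel for the train-then-classify loop, here is a deliberately simple sketch in Python. Brainware's engine uses neural networks; this sketch swaps in a much plainer technique, a nearest-centroid classifier over word counts, and the 0.3 confidence threshold, the training snippets, and all names are assumptions for the example. Documents that score below the threshold are left unclassified, but the best-guess class is still reported as a suggestion.

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    """Bag-of-words count vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(examples):
    """examples: {class name: [training texts]}. Returns one summed vector per class."""
    centroids = {}
    for label, texts in examples.items():
        total = Counter()
        for text in texts:
            total.update(vectorize(text))
        centroids[label] = total
    return centroids

def classify(text, centroids, threshold=0.3):
    """Return (status, best class, score); status is 'unclassified' below threshold."""
    v = vectorize(text)
    scores = {label: cosine(v, c) for label, c in centroids.items()}
    best = max(scores, key=scores.get)
    status = "classified" if scores[best] >= threshold else "unclassified"
    return status, best, scores[best]

# Hypothetical training examples "dragged" onto each class.
examples = {
    "text analytics": ["sentiment extraction from text documents",
                       "text mining of customer feedback"],
    "visualization": ["interactive charts and dashboards",
                      "charts that plot sales trends"],
    "MDM": ["master data management for customer records",
            "golden record master data hub"],
}
centroids = train(examples)

print(classify("text mining for documents", centroids))
print(classify("customer sales update for the quarter", centroids))
```

The second document falls below the threshold, so it lands in the unclassified pile with a suggested class attached, much like the list of suggestions I got from the product. Dragging it into a class would amount to adding it to `examples` and calling `train` again.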



Brainware on an enterprise level


The enterprise edition of the product combines the search and classification capabilities and lets users search and classify over 400 different document types. 


Now, Brainware isn’t planning to compete with Google, Yahoo!, Fast, etc.  Rather, the company sees its search as a complement to these inverted index approaches.  The idea would be to embed its search into other applications that deal with archiving, document management, or e-discovery, to name a few.  The classification piece could also be embedded into the appropriate applications.  I asked if the company was in discussions with content management providers and service providers that store emails and documents.  It would seem to me that this software would be a natural complement to some of these systems. My understanding is that the company is looking for partnerships in the area.  Brainware currently has a partnership with Attensity, a text analytics provider, to help classify and search documents as part of the text analytics process.


I’m interested to see what will develop with this company.




