EMC and Big Data: Observations from EMC World 2011

I attended EMC’s User Conference last week in Las Vegas. The theme of the event was Big Data meets the Cloud. So, what’s going on with Big Data and EMC? Does this new strategy make sense?

EMC acquired Greenplum in 2010. At the time, EMC described Greenplum as a “shared nothing, massively parallel processing (MPP) data warehousing system.” In other words, it could handle petabytes of data. While the term data warehouse denotes a fairly static data store, at the user conference EMC executives characterized big data as a high volume of disparate data, structured and unstructured, that is growing fast and may need to be processed in real time. Big data is becoming increasingly important to the enterprise not just because of the need to store this data but also because of the need to analyze it. Greenplum has some of its own analytical capabilities, but the company recently formed a partnership with SAS to provide more oomph to its analytical arsenal. At the conference, EMC also announced that it has now included Hadoop as part of its Greenplum infrastructure to handle unstructured information.

Given EMC’s strength in data storage and content management, it is logical for EMC to move into the big data arena. However, I am left with some unanswered questions. These include questions related to how EMC will make storage, content management, data management, and data analysis all fit together.

• Data Management. How will data management issues be handled (i.e. quality, loading, etc.)? EMC has a partnership with Informatica and SAS has data management capabilities, but how will all of these components work together?
• Analytics. What analytics solutions will emerge from the partnership with SAS? This is important since EMC is not necessarily known for analytics. SAS is a leader in analytics and can make a great partner for EMC, but its partnership with EMC is not exclusive. Additionally, EMC made a point of the fact that 90% of most enterprises’ data is unstructured. EMC has incorporated Hadoop into Greenplum, ostensibly to deal with unstructured data. EMC executives mentioned that the open source community has even begun developing analytics around Hadoop. EMC Documentum also has some text analytics capabilities as part of Center Stage. SAS has text analytics capabilities as well. How will all of these different components converge into a plan?
• Storage and content management. How do the storage and content management parts of the business fit into the big data roadmap? It was not clear from the discussions at the meeting how EMC plans to integrate its storage platforms into an overall big data analysis strategy. In the short term we may not see a cohesive strategy emerge.

EMC is taking on the right issues by focusing on customers’ needs to manage big data. However, it is a complicated area and I don’t expect EMC to have all of the answers today. The market is still nascent. Rather, it seems to me that EMC is putting its stake in the ground around big data. This will be an important stake for the future.

What is Networked Content and Why Should We Care?

This is the first in a series of blogs about text analytics and content management. This one uses an interview format.

I recently had an interesting conversation with Daniel Mayer of TEMIS regarding his new paper, the Networked Content Manifesto. I just finished reading it and found it insightful in terms of what he had to say about how enriched content might be used today and into the future.

So what is networked content? According to the Manifesto, networked content “creates a network of semantic links between documents that enable new forms of navigation and improves retrieval from a collection of documents.” It uses text analytics techniques to extract semantic metadata from documents. This metadata can be used to link documents together across the organization, thus providing a rich source of connected content for use by an entire company. Picture 50 thousand documents linked together across a company by enriched metadata that includes people, places, things, facts, or concepts and you can start to visualize what this might look like.
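The linking mechanism itself is simple to picture in code. Below is a minimal sketch, assuming a toy corpus: the document ids, entity names, and the extraction step are all invented for illustration (a real system would use an NLP pipeline to produce the metadata).

```python
from collections import defaultdict

# Toy corpus: document id -> semantic metadata (entities) that a text
# analytics engine might have extracted. The extraction step is elided;
# these hand-written tags stand in for its output.
doc_metadata = {
    "doc1": {"Boston", "Acme Corp"},
    "doc2": {"Acme Corp", "merger"},
    "doc3": {"Boston", "restaurants"},
}

# Inverted index: entity -> set of documents mentioning it.
entity_index = defaultdict(set)
for doc, entities in doc_metadata.items():
    for entity in entities:
        entity_index[entity].add(doc)

def related_documents(doc_id):
    """Documents linked to doc_id through at least one shared entity."""
    related = set()
    for entity in doc_metadata[doc_id]:
        related |= entity_index[entity]
    related.discard(doc_id)
    return related

print(sorted(related_documents("doc1")))
# doc2 shares "Acme Corp"; doc3 shares "Boston"
```

Scaled from three documents to fifty thousand, the same inverted index is what turns a pile of files into a navigable network.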

Here is an excerpt of my conversation with Daniel:

FH: So, what is the value of networked content?

DM: Semantic metadata creates a richer index than was previously possible using techniques such as manual tagging. There are five benefits that semantic metadata provides. The first two benefits are that it makes content more findable and easier to explore. You can’t find what you don’t know how to query. In many cases people don’t know how they should be searching. Any advanced search engine with facets is a simple example of how you can leverage metadata to enhance information access by enabling exploration. The third benefit is that networked content can boost insight into a subject of interest by revealing context and placing it into perspective. Context is revealed by showing what else there is around your precise search – for example, related documents. Perspective is typically reached through analytics. That is, attaining a high level of insight into what can be found in a large collection of documents, like articles or call center notes. The final two benefits are more future looking. The first of these is something we call “proactive delivery”. Up to now, people have mostly accessed information by using search engines to return documents associated with a certain topic. For example, I might ask, “What are all of the restaurants in Boston?” But by leveraging information about your past behavior, your location, or your profile, I can proactively send you alerts about relevant restaurants you might be interested in. This is done by some advanced portals today, and the same principle can be applied to virtually any form of content. The last benefit is tight integration with workflow applications. Today, people are used to searching Google or other search engines, which require a dedicated interface. If you are writing a report and need to go to the web to look for more information, this interferes with your workflow. But instead, it is possible to pipe content directly to your workflow so that you don’t need to interrupt your work to access it.
For example, we can foresee how in the near future, when typing a report in a word processing application such as MS Word, right in the interface you will be able to receive bits of information related contextually to what you are typing. As a chemist, you might receive suggestions of scientific articles based on the metadata extracted from the text you are typing. Likewise, content management interfaces in the future will be enriched with widgets that provide related documents and analytics.

FH: How is networked content different from other kinds of advanced classification systems provided by content management vendors today?

DM: Networked Content is ultimately a vision for how content can be better managed and distributed by leveraging semantic content enrichment. This vision is underpinned by an entire technical ecosystem, of which the Content Management System is only one element. Our White Paper illustrates how text analytics engines such as the Luxid® Content Enrichment Platform are a key part of this emerging ecosystem.

Making a blanket comparison is difficult, but generally speaking, Networked Content can leverage a level of granularity and domain specificity that the classification systems you are referring to don’t generally support.

FH: Do you need a taxonomy or ontology to make this work?

DM: I’d like to make sure we use caution when we use these terms. A taxonomy or ontology can be helpful, certainly. If a customer wants to improve navigation in content and already has an enterprise taxonomy, it will undoubtedly help by providing guidance and structure. However, in most cases it is not sufficient in and of itself to perform content enrichment. To do this you need to build an actual engine that is able to process text and identify within it some characteristics that will trigger the assignment of metadata (either by extracting concepts from the text itself or by judging the text as a whole). In the news domain, for example, the standard IPTC taxonomy is used to categorize news articles into topic areas such as economy, politics, or sports, and into subcategories like economy/economic policy or economy/macroeconomics. You can think of this as a file cabinet where you ultimately want to file every article. What the IPTC taxonomy does is tell you the structure the file cabinet should have. But it doesn’t do the filing for you. For that, you need to build the metadata extraction engine. That’s where we come in. We provide a platform that includes standard extraction engines – which we call Skill Cartridges® – as well as the full development environment to customize them, extend their coverage, and develop new ones from the ground up if needed.
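The file-cabinet analogy can be sketched in a few lines: the taxonomy supplies the structure, and an engine does the filing. The engine below is a deliberately naive keyword matcher, nothing like a production extraction engine, and the category names and trigger terms are invented for illustration (they merely echo IPTC-style topic paths).

```python
# The taxonomy supplies the "file cabinet": a set of nodes to file into.
# The trigger terms are a toy stand-in for a real extraction engine's rules.
taxonomy_rules = {
    "economy/economic policy": {"interest rate", "stimulus", "tariff"},
    "economy/macroeconomics": {"gdp", "inflation", "unemployment"},
    "sport": {"match", "tournament", "goal"},
}

def categorize(article_text):
    """Return the taxonomy nodes whose trigger terms appear in the article."""
    text = article_text.lower()
    return [node for node, terms in taxonomy_rules.items()
            if any(term in text for term in terms)]

print(categorize("GDP growth slowed as inflation rose"))
# -> ['economy/macroeconomics']
```

The point of the analogy survives even in this toy: the dictionary keys (the cabinet) do nothing on their own; all the filing is done by the matching logic, which is the part that has to be built and maintained.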

FH: I know that TEMIS is heavily into the publishing industry and you cite publishing examples in the Manifesto. What other use cases do you see?

DM: The Life Sciences industry (especially Pharma and Crop Science) has been an early adopter of this technology for applications such as scientific discovery, IP management, knowledge management, and pharmacovigilance. These are typical use cases for all research-intensive sectors. Another group of common use cases for this technology in the private sector is what we call Market Intelligence: understanding your competitors and complementors (Competitive Intelligence), your customers (Voice of the Customer), and/or what is being said about you (Sentiment Analysis). You can think of all of these as departmental applications in the sense that they primarily serve the needs of one department: R&D, Marketing, Strategy, etc.

Furthermore, we believe there is an ongoing trend for the Enterprise to adopt Networked Content transversally, beyond departmental applications, as a basic service of its core information system. There, content enrichment can act as the glue between content management, search, and BI, and can bring productivity gains and boost insight throughout the organization. This is what has led us to deploy within EMC Documentum and Microsoft SharePoint 2010. In the future, all the departmental applications will become even more ubiquitous thanks to such deployments.

FH: How does Networked Content relate to the Semantic Web?

DM: They are very much related. The Semantic Web has been primarily concerned with how information that is available on the Web should be intelligently structured to facilitate access and manipulation by machines. Networked Content is focused on corporate – or private – content and how it can be connected with other content, either private, or public.

Five vendors committed to content analytics for ECM

In 2007, Hurwitz & Associates fielded one of the first market studies on text analytics. At that time, text analytics was considered more of a natural extension to a business intelligence system than to a content management system. However, in that study we asked respondents who were planning to use the software whether they intended to deploy it in conjunction with their content management systems. It turns out that a majority of respondents (62%) intended to use text analytics software in this manner. Text analytics, of course, is a natural extension to content management, and we have seen the market evolve to the point where several vendors have included text analytics as part of their offerings to enrich content management solutions.

Over the next few months, I am going to do a deeper dive into solutions that sit at the intersection of text analytics and content management: three from content management vendors EMC, IBM, and OpenText, as well as solutions from text analytics vendor TEMIS and analytics vendor SAS. Each of these vendors is actively offering solutions that provide insight into content stored in enterprise content management systems. Many of the solutions described below also extend beyond enterprise content management systems to provide insight into other content, both internal and external to an organization. A number of solutions also integrate structured data with unstructured information.

EMC: EMC refers to its content analytics capability as Content Intelligence Services (CIS). CIS supports entity extraction as well as categorization. It enables advanced search and discovery over a range of platforms including ECM systems such as EMC’s Documentum, Microsoft SharePoint, and others.

IBM: IBM offers a number of products with text analytics capabilities. Its goal is to provide rapid and deep insight into unstructured data. The IBM Content Analytics solution provides integration into IBM ECM (FileNet) solutions such as IBM Case Manager, its big data solutions (Netezza) and integration technologies (DataStage). It also integrates securely with other ECM solutions such as SharePoint, Livelink, Documentum and others.

OpenText: OpenText acquired text analytics vendor Nstein in 2010 in order to invest in semantic technology and expand its semantic coverage. Nstein semantic services are now integrated with OpenText’s ECM suite. This includes automated content categorization and classification as well as enhanced search and navigation. The company will soon be releasing additional analytics capabilities to support content discovery. Content Analytics services can also be integrated into other ECM systems.

SAS: SAS Institute provides a number of products for unstructured information access and discovery as part of its vision for the semantically integrated enterprise. These include SAS Enterprise Content Categorization and SAS Ontology Management (both for improving document relevance), and SAS Sentiment Analysis and SAS Text Miner for knowledge discovery. The products integrate with structured information; with Microsoft SharePoint, FAST ESP, Endeca, and EMC Documentum; as well as with both Teradata and Greenplum.

TEMIS: TEMIS recently released its Networked Content Manifesto, which describes its vision of a network of semantic links connecting documents to enable new forms of navigation and retrieval from a collection of documents. It uses text analytics techniques to extract semantic metadata from documents that can then link documents together. Content Management systems form one part of this linked ecosystem. TEMIS integrates into ECM systems including EMC Documentum and Centerstage, Microsoft SharePoint 2010 and MarkLogic.

A different spin on analyzing content – InfoSphere Content Assessment

IBM made a number of announcements last week at IOD regarding new products/offerings to help companies analyze content.  One was Cognos Content Analytics, which enables organizations to analyze unstructured data alongside structured data.  It also looks like IBM may be releasing a “voice of the customer” type service to help companies understand what is being said about them in the “cloud” (i.e. blogs, message boards, and the like).  Stay tuned on that front, it is currently being “previewed”.

I was particularly interested in a new product called IBM InfoSphere Content Assessment, because I thought it was an interesting use of text analytics technology.  The product uses content analytics (IBM’s term for text analytics) to analyze “content in the wild”.  This means that a user can take the software and run it over servers that might contain terabytes (or even petabytes) of data to understand what is being stored on them.  Here are some of the potential use cases for this kind of product:

  • Decommission data.  Once you understand the data that is on a server, you might choose to decommission it, thereby freeing up storage space.
  • Records enablement.  InfoSphere Content Assessment can also be used to identify which records need to go into a records management system for a record retention program.
  • E-Discovery.  Of course, this technology could also be used in litigation, investigation, and audit.  It can analyze unstructured content on servers, which can help to discover information that may be used in legal matters or information that needs to meet certain audit requirements for compliance.
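A crude first pass at understanding “content in the wild” can be sketched as a file-tree inventory. This is only a sketch of the general idea, not IBM’s product logic (which analyzes the content itself, not just file names and sizes): walk the servers, tally what is there per file type, and use that as the starting point for decommissioning or records decisions.

```python
import os
from collections import Counter

def survey_content(root):
    """Walk a file tree and tally file counts and total bytes per
    extension: a first, crude step toward understanding what is
    actually being stored on a server."""
    counts, sizes = Counter(), Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
            try:
                sizes[ext] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return counts, sizes
```

Even this much tells you where the bulk of the terabytes live; a content-analytics product then goes further and classifies what is inside the files.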

The reality is that the majority of companies don’t formally manage their content.  It is simply stored on file servers.  The IBM product team’s view is that companies can “acknowledge the chaos”, but use the software to understand what is there and gain control over the content.  I had not seen a product positioned quite this way before and I thought it was a good use of the content analysis software that IBM has developed.

If anyone else knows of software like this, please let me know.

Threats to the American Justice System – Can Enterprise Content Management Help?

I was at the EMC writers’ conference this past Friday, speaking on Text Analytics and ECM.  The idea behind the conference is very cool.  EMC brings together writers and bloggers from all over the world to discuss topics relevant to content management.  All of the sessions were great.  We discussed Cloud, Web 2.0, SharePoint, Text Analytics, and e-Discovery.

 I want to focus here on the e-Discovery discussion, since e-Discovery has been showing up on my top text analytics applications list for several years.  There are a growing number of vendors looking to address this problem (although not all of them may be making use of text analytics yet), including large companies like EMC, IBM, Iron Mountain, and Microsoft, and smaller providers such as ZyLAB.

 Ralph Losey gave the presentation. He is a defense lawyer, by training, but over the years has focused on e-Discovery.  Losey has written a number of books on the topic and he writes a blog called e-Discovery Team.  An interesting fellow!

 His point was that “The failure of American business to adopt ECM is destroying the American system of justice.”  Why?  His argument went something like this:

  • You can’t find the truth if you can’t find the evidence.  As the amount of digital data explodes, it is harder to find the information companies need to defend themselves.  This is because the events surrounding the case might have occurred a year or more in the past, and the data is buried in files or email.  I don’t think anyone will argue with this fact. 
  • According to Losey, most trial lawyers are luddites, implying that they don’t get technology.  Lawyers aren’t trained this way so they are not going to push for ECM systems, since they might not even know what they are.  And corporate America is putting off decisions to purchase ECM systems that could actually help organize some of the content and make it more findable.
  • Meanwhile, the cost of litigation is skyrocketing.  Since it is so expensive, many companies don’t go to court and instead look to private arbitration.  Why spend $2M in e-Discovery when you can settle for $3M?  Losey pointed to one example, the Fannie Mae securities litigation (2009), where it cost $6M (or 9% of the annual budget of the Office of Federal Housing Enterprise Oversight) to comply with ONE subpoena.  This involved about 660,000 emails.
  • According to Losey, it costs about $5 to process one computer file for e-Discovery.  This is because the file needs to be reviewed for relevance, privilege, and confidentiality. 

 Can the American justice system be saved? 

 So, can e-Discovery tools be used to help save the justice system as we know it?  Here are a few points to ponder:

  • Losey seems to believe that the e-Discovery process may be hard to automate since it requires a skilled eye to determine whether an email (or any file for that matter) is admissible in court. 
  • I’m not even sure how much corporate email is actually being stored in content management systems – even when companies have content management systems. It’s a massive amount of data.
  •  And, then there is the issue of how much email companies will want to save to begin with.  Some will store it all because they want a record.  Others seem to be moving away from email altogether.  For example, one person in the group told us that his Bank of America financial advisor can no longer communicate with him via email!  This opens up a whole different can of worms, which is not worth going into here. 
  • Then there is the issue of changing vocabularies between different departments in companies, people not using certain phrases once they get media attention, etc. etc.


Before jumping to any conclusions, let’s look at what vendors can do.  According to EMC, the email overload problem can be addressed.  The first thing to do is to de-duplicate the emails that would be stored in a content management system.  Think about it.  You get an email and 20 people are copied on it.  Or, you forward someone an email and they don’t necessarily delete it.  These emails pile up.  De-duplicating emails would go a long way in reducing the amount of content in the ECM.  Then there is the matter of classifying these emails.  That could be done.  Some of this classification would be straightforward.  And the system could be trained to look for emails that might be privileged and classify them accordingly, though this would no doubt still require human intervention to help with the process.  Of course, terminology will change as well, and people will have to stay on top of this.
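The de-duplication step is easy to sketch: fingerprint a normalized version of each message body and keep only the first copy. This is a toy illustration, not EMC’s method; the normalization below (collapsing whitespace and case) is a crude stand-in for what a real archiving system would do, such as stripping quoting, headers, and attachments before comparing.

```python
import hashlib

def dedupe_emails(emails):
    """Keep one copy of each distinct email body.
    Collapsing whitespace and case is a toy normalization; real
    systems strip quoted text and headers before fingerprinting."""
    seen, kept = set(), []
    for email in emails:
        fingerprint = hashlib.sha256(
            " ".join(email.lower().split()).encode("utf-8")
        ).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(email)
    return kept

inbox = ["Q3 numbers attached.", "q3  numbers attached.", "Lunch at noon?"]
print(len(dedupe_emails(inbox)))  # the two Q3 copies collapse into one
```

The hard part, as the discussion above suggests, is not the fingerprinting but deciding what counts as “the same” email once forwards and reply chains enter the picture.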


The upshot is that there are certainly hurdles to overcome to put advanced classification and text analytics in place to help in e-Discovery.  However, as the amount of digital information keeps piling up, something has to be done.  In this case, the value certainly would seem to outweigh the cost of business as usual.

Text Analytics Summit 2009

I just got back from the Text Analytics Summit and it was a very good conference.  I’ve been attending the Summit for the last three years and it has gotten better every year.  This year, it seemed like there were a lot more end users and the conference had more of a business oriented approach than in previous years.  Don’t get me wrong- there were still technical discussions, but I liked the balance.


A major theme this year, as in previous years, was Voice of the Customer applications.  That is to be expected, in some ways, because it is still a hot application area and most of the vendors at the conference (including Attensity, Clarabridge, Lexalytics, SAS, and SPSS) focus on it in one form or another.  This year, there was a lot of discussion about using social media for text analytics and VoC kinds of applications – social media meaning blogs, Twitter, and even social networks.  The issue of sentiment analysis was discussed at length since it is a hard problem.  Sarcasm, irony, the element of surprise, and dealing with sentiment at the feature level were all discussed.  I was glad to hear it, because these issues are very important.  SAS also made an announcement about some of its new features around sentiment analysis.  I’ll blog about that in a few days.


Although there was a heavy focus on the VoC type applications, we did hear from Ernst & Young on fraud applications.  This was interesting because it showed how human expertise, in terms of understanding certain phrases that might appear in fraud, might be used to help automate fraud detection.  Biogen Inc also presented on its use of text analytics in life sciences and biomedical research.  We also heard what Monster and Facebook are doing with text analytics, which was quite interesting.  I would have liked to hear more about what is happening with text analytics in media and publishing and e-Discovery.  It would have also been useful to hear more about how text analytics is being incorporated into a broader range of applications.  I’m seeing (and Sue Feldman, from IDC, noted this too) a large number of services springing up that use text analytics.  This spans everything from new product innovation to providing real time insight to traders.  As these services, along with the SaaS model continue to explode, it would be useful to hear more about them next year.




Other observations

Here are some other observations on topics that I found interesting.


  • Bringing people into the equation.  While text analytics is very useful technology, it needs people to make it work.  The technology itself is not Nirvana.  In fact, it is most useful when a person works together with the technology to make it zing.  While people who use the technology obviously know this (there is work that has to be done by people to make text analytics work), I think that people beginning the process need to be aware of it too, for many reasons.  Not only are people necessary to make the technology work; the cultural component is also critical, as it is in the adoption of any new technology.  Having said this, there was discussion on the end user panel about how companies were making use of the SaaS model (or at least services), since it wasn’t working out for IT (not quite sure why – either they didn’t have the time or didn’t have the skills).
  • Managing expectations.  This came up on some of the panels and in a few talks.  There were two interesting comments worth noting.  First, Chris Jones from Intuit said that some people believe that text analytics will tell you what to do, so expectations need to be set properly.  In other words, people need to understand that text analytics will uncover issues and even the root cause of the issues, but it is up to a company to figure out what to do with that information.  Second, there was an interesting discussion around the notion of the 85% accuracy that text analytics might provide.  The end user panel was quite lively on this topic.  I was especially taken with comments from Chris Bowmann, a former school superintendent of the Lafourche Parish School Board, about how he had used text analytics to try to help keep kids in school.  He used the technology to cull through disciplinary records to see what patterns were emerging.  Very interesting.  Yes, as he pointed out, text analytics may not be 100% accurate, but think of what 85% can get you!
  • Search needs to incorporate more text analytics.  There were two good search talks on the agenda.  Usama Fayyad, CEO of Open Insights, spoke about text analytics and web advertising, as well as how text analytics might be used to help search “get things done” (i.e., book a trip).  The other speaker on the topic was Daniel Tunkelang from Endeca, who talked about text analytics and exploratory search.  There were a number of comments from people in the audience as well about services like Wolfram Alpha.
  • Content Management.  I was happy to see more about enterprise content management this year and see more people in the audience who were interested in it.  There was even a talk about it from Lou Jordano from EMC. 

 I think anyone who attended the conference would agree that text analytics has definitely hit the mainstream.

2009 Text Analytics Survey

 Several weeks ago, in preparation for the Text Analytics Summit, I deployed a short survey about the state of text analytics.  I supplemented the end-user survey with vendor interviews.   Here are some of the end-user findings. 

First, a few words about the survey itself and who responded to the survey.

  • I wanted to make the survey short and sweet.  I was interested in companies’ plans for text analytics and whether the economy was affecting these plans.
  • Let me say up front that given the topic of the survey and our list, I would categorize most of the respondents as fairly analytical and technology savvy.  Approximately 50 companies responded to the survey, split evenly between those that were deploying the technology and those that were not (note that this is a self-selecting sample and does not imply that 50% of companies are currently using text analytics).  The respondents represent a good mix across a number of verticals including computing, telecommunications, education, pharmaceuticals, financial services, government, and CPG.  There were also a few market researchers in the mix.  Likewise, there was a mix of companies of various sizes.
  • Here’s my caveat:  I would not view the respondents as a scientific sample and I would view these results as qualitative.  That said, many of the results paralleled results from previous surveys.  So, while the results are unscientific, in terms of a random sample and size, I believe they probably do reflect what many companies are doing in this space.


Kinds of applications, implementation schemes

I asked those respondents that were deploying text analytics what kinds of applications they were using it for.  The results were not surprising.  The top three responses – Voice of the Customer (VoC), Competitive Intelligence, and eDiscovery – were also in the top three the last time I asked the question.  Additionally, many of the respondents were deploying more than one type of application (i.e., VoC and quality analysis).  This was a pattern that also emerged in a study I did on text analytics back in 2007.  Once a company gains value from one implementation, it then sees the wider value of the technology (and realizes that it has a vast amount of information that needs to be analyzed).

 I asked those companies that were planning to deploy the technology which applications they were considering.  In this case, VoC and Competitive Intelligence were again the top two.  Brand Management and Product R&D were tied for third.  This is not surprising.  Companies are quite concerned with customer feedback and any issues that impact customer retention.  Companies want to understand what competitors are up to and how their brand is being perceived in the market.  Likewise, they are also trying to get smarter about how they develop products and services to be more cost effective and more market focused.



How Text Analytics is being deployed

I also wanted to find out how companies were deploying the technology.  In particular, we’ve heard a lot this past year about organizations utilizing text analytics in a Software as a Service (SaaS) model.  This model has become particularly attractive in the Voice of the Customer/Competitive Intelligence/Brand Management area for several reasons.  For one thing, this kind of analysis might involve some sort of external information source such as news feeds and blog postings.  Even product R&D would draw from external sources such as trade journals, news about competitive products, and patent files.  Additionally, the SaaS model generally has a different price point than enterprise solutions.

In fact, SaaS was the model of choice for implementing the technology.  The SaaS model does offer the flexibility and price point that many companies are looking for, especially in some of the above-mentioned areas.  However, that is not to say that companies are not deploying text analytics in other ways (note the values on the X axis).  Interestingly, companies are starting to deploy text analytics in conjunction with their content management systems.  I think we will see more of this as the technology continues to become more mainstream.


Just as an FYI, all of the companies that had deployed text analytics stated that the implementations either met or exceeded their expectations.  And, close to 60% stated that text analytics had actually exceeded expectations. 

 What about those companies that aren’t deploying the technology?

Equally important to understanding the market are those companies that are not deploying text analytics.  I asked those companies if they had any plans to utilize the technology.  Eleven percent stated that plans had been put on hold due to funding constraints.  Twenty-eight percent stated that they had no plans to implement the technology.  Another 28% stated that they planned to implement the technology in the next year and 33% said they planned to implement it in the next few years. 

Reasons cited for not implementing the technology included not understanding enough about text analytics to implement it.  Other companies just never considered implementing it, or had other analytic projects on the radar.

What about the economy?

There have been numerous articles written about whether certain technologies are recession proof, with various BI-related technology vendors stating/hoping/praying that their technology falls into this category.  And certainly, companies do feel the need to gain insight about operational efficiency, their customers, the market, and the competition with perhaps a greater urgency than in the past.  This has helped keep business analytics vendors moving forward in this economy.

The 11% number is relatively small.  However, I wonder what part of the 61% that said they would be deploying it in the future might actually fall into the hold category.  When I asked text analytics vendors (mostly private companies) whether the economy was impacting the sales cycle, they pretty much said the same thing.  Existing customers were not dropping projects (there is too much value there, as supported by this survey).  However, sales cycles are longer (companies are not necessarily rushing) and potential clients may be looking for creative financing and contracting options.

I am participating in an analyst panel at the Text Analytics Summit in June.  I have more to say about the topic, as I am sure, do the other analysts who will be participating.

