Big Data’s Future/Big Data’s Past

I just listened to an interesting IBM Google hangout about big data called Visions of Big Data's Future. You can watch it here. There were some great experts on the line, including James Kobielus (IBM), Thomas Deutsch (IBM), and Edd Dumbill (Silicon Valley Data Science).

The moderator, David Pittman, asked a fantastic question: "What's taking longer than you expect in big data?" It brought me back to 1992 (ok, I'm dating myself), when I used to work at AT&T Bell Laboratories. At that time, I was working in what might today be called an analytics Center of Excellence. The group was composed of all kinds of quantitative scientists (economists, statisticians, physicists) as well as computer scientists and other IT-like people. I think the group was called something like the Marketing Models, Systems, and Analysis department.

I had been working with members of Bell Labs Research to take some of the machine learning algorithms they were developing and apply them to our marketing data for analytics like churn analysis. At that time, I proposed the formation of a group that would consist of market analysts and developers working together with researchers and some computer scientists. The idea was to provide continuous innovation around analysis. I found the proposal today (I'm still sneezing from the dust). Here is a sentence from it:

[Image: excerpt from the 1992 proposal]

Managing and analyzing large amounts of data? At that point we were even thinking about call detail records. The proposal goes on to say, "Specifically the group will utilize two software technologies that will help to extract knowledge from databases: data mining and data archeology." The data archeology piece referred to:

[Image: the data archeology description from the 1992 proposal]

This exploration of the data is similar to what is termed discovery today. Here's a link to the paper that came out of this work. Interestingly, around this time I also remember going to talk to some people who were developing NLP algorithms for analyzing text. I remember thinking that the "why" behind customer churn could be found in those call center notes.

I thought about this when I heard the moderator's question, not because the group I was proposing would have been ahead of its time – let's face it, AT&T was way ahead of its time with its Center of Excellence in analysis in the first place – but because it has taken so long to get from there to here, and we're not even here or there yet.

Are you ready for IBM Watson?

This week marks the one-year anniversary of the IBM Watson computer system succeeding at Jeopardy!. Since then, there has been a lot of interest in Watson. Companies want one of those.

But what exactly is Watson and what makes it unique?  What does it mean to have a Watson?  And, how is commercial Watson different from Jeopardy Watson?

What is Watson and why is it unique?

Watson is a new class of analytic solution

Watson is a set of technologies that processes and analyzes massive amounts of both structured and unstructured data in a unique way. One statistic given at the recent IOD conference is that Watson can process and analyze information from 200 million books in three seconds. While Watson is very advanced, it uses commercially available technologies together with some "secret sauce" technologies that IBM Research has either enhanced or developed. It combines software technologies from big data, content and predictive analytics, and industry-specific software to make it work.

Watson includes several core pieces of technology that make it unique

So what is this secret sauce?  Watson understands natural language, generates and evaluates hypotheses, and adapts and learns.

First, Watson uses Natural Language Processing (NLP). NLP is a very broad and complex field that has developed over the last ten to twenty years. The goal of NLP is to derive meaning from text. NLP generally makes use of linguistic concepts such as grammatical structures and parts of speech. It breaks apart sentences and extracts information such as entities, concepts, and relationships. IBM is using a set of annotators to extract information like symptoms, age, location, and so on.
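As a toy illustration of what an annotator does, the sketch below pulls a couple of entity types out of free text with regular expressions. The labels and patterns here are invented for illustration; IBM's actual annotators are far more sophisticated than keyword matching.

```python
import re

# Toy annotators: each maps an entity label to a regex. These patterns
# are assumptions made up for this example, not IBM's real annotators.
ANNOTATORS = {
    "AGE": re.compile(r"\b\d{1,3}[- ]year[- ]old\b"),
    "SYMPTOM": re.compile(r"\b(fatigue|palpitations|hair loss|muscle weakness)\b"),
}

def extract(text):
    """Return (label, matched text) pairs found in the text."""
    found = []
    for label, pattern in ANNOTATORS.items():
        for match in pattern.finditer(text.lower()):
            found.append((label, match.group(0)))
    return found

note = "A 42-year-old patient reports fatigue and muscle weakness."
print(extract(note))
```

Running this over a clinical-style sentence yields structured pairs like `("SYMPTOM", "fatigue")` that downstream analytics can consume.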

So, NLP by itself is not new. What is new is that Watson processes vast amounts of this unstructured data quickly, using an architecture designed for the task.

Second, Watson works by generating hypotheses, which are potential answers to a question. It is trained by feeding question and answer (Q/A) data into the system. In other words, it is shown representative questions and learns from the supplied answers. This is called evidence-based learning. The goal is to generate a model that can produce a confidence score (think logistic regression with a bunch of attributes). Watson starts with a generic statistical model, then looks at the first Q/A pair and uses it to tweak the coefficients. As it gains more evidence, it continues to tweak the coefficients until it can "say" that confidence is high. Training Watson is key, since what is really happening is that the trainers are building statistical models that are scored. At the end of the training, Watson has a system of feature vectors and models, so that eventually it can use the model to probabilistically score the answers. The key here is something that Jeopardy! did not showcase: Watson is not deterministic (i.e., rule-based). It is probabilistic, and that makes it dynamic.
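The "logistic regression with a bunch of attributes" idea can be sketched in a few lines: score evidence features through a sigmoid, then nudge the coefficients toward each supplied answer. This is a generic gradient-descent sketch with invented feature vectors and labels, not Watson's actual training pipeline.

```python
import math

def confidence(weights, features):
    """Logistic score: map weighted evidence features to a 0-1 confidence."""
    z = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train_step(weights, features, label, lr=0.5):
    """One gradient step: tweak coefficients toward the supplied answer
    (label 1 = this candidate was the right answer)."""
    error = label - confidence(weights, features)
    return [w + lr * error * f for w, f in zip(weights, features)]

# Start from a generic model and tweak it on representative Q/A evidence.
weights = [0.0, 0.0, 0.0]
training = [([1.0, 0.2, 0.0], 1), ([0.1, 0.9, 1.0], 0), ([0.9, 0.3, 0.1], 1)]
for _ in range(200):
    for features, label in training:
        weights = train_step(weights, features, label)

# After training, confidence is high for evidence resembling right answers.
print(round(confidence(weights, [1.0, 0.2, 0.0]), 2))
```

The point of the sketch is the shape of the process: the model is not a set of hand-written rules but coefficients that shift as more Q/A evidence arrives.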

When Watson generates a hypothesis, it then scores the hypothesis based on the evidence. Its goal is to get the right answer for the right reason. (So, theoretically, if there are 5 symptoms that must be positive for a certain disease and 4 that must be negative, and Watson only has 4 of the 9 pieces of information, it could ask for more.) The hypothesis with the highest score is presented. By the end of the analysis, Watson knows when it is confident in the answer and when it doesn't know the answer.

Here's an example. Suppose you are not feeling well. Specifically, you might have heart palpitations, fatigue, hair loss, and muscle weakness. You decide to go see a doctor to determine if there is something wrong with your thyroid or if it is something else. If your doctor has access to a Watson system, he could use it to help advise him regarding your diagnosis. In this case, Watson would already have ingested and curated all of the information in books and journals associated with thyroid disease. It would also have in its data banks the diagnoses and related information from prior cases, drawn from the electronic medical records of other patients at the hospital and other doctors in the practice. Based on the first set of symptoms you report, it would generate hypotheses along with associated probabilities (i.e., 60% hyperthyroidism, 40% anxiety, etc.). It might then ask for more information. As it is fed this information, such as patient history, Watson would continue to refine its hypotheses along with the probability of each being correct. After it has been given all of the information, iterated through it, and presented the diagnosis with the highest confidence level, the physician would use this information to help him make the diagnosis and develop a treatment plan. If Watson doesn't know the answer, it will state that it does not have an answer or does not have enough information to provide one.
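The refinement step in this example behaves much like a Bayesian update: each new symptom reweights the competing hypotheses. The diagnoses, priors, and likelihood numbers below are invented for illustration; Watson's actual evidence model is far richer.

```python
# Start with competing hypotheses and even priors (invented numbers).
priors = {"hyperthyroidism": 0.5, "anxiety": 0.5}

# P(symptom observed | diagnosis) -- assumed values for the sketch.
likelihood = {
    "palpitations": {"hyperthyroidism": 0.8, "anxiety": 0.7},
    "hair loss": {"hyperthyroidism": 0.6, "anxiety": 0.1},
}

def update(beliefs, symptom):
    """One Bayesian update: multiply by the likelihood, then renormalize."""
    posterior = {d: p * likelihood[symptom][d] for d, p in beliefs.items()}
    total = sum(posterior.values())
    return {d: p / total for d, p in posterior.items()}

beliefs = priors
for symptom in ["palpitations", "hair loss"]:
    beliefs = update(beliefs, symptom)
    print(symptom, {d: round(p, 2) for d, p in beliefs.items()})
```

Palpitations alone barely separate the two hypotheses, but hair loss (common in thyroid disease, rare in anxiety under these assumed numbers) pushes the confidence strongly toward one diagnosis, which is exactly the "refine as evidence arrives" behavior described above.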

IBM likens the process of training a Watson to teaching a child how to learn. A child can read a book to learn. However, he can also learn from a teacher who asks questions about that text and reinforces the answers.

Can I buy a Watson?

Watson will be offered in the cloud in an "as a service" model. Since Watson is in its own class, let's call this Watson as a Service (WaaS). Because Watson's knowledge is essentially built in tiers, the idea is that IBM will provide the basic core knowledge in a particular WaaS solution space – say, the corpus about a particular subject, like diabetes – and then different users could build on this.

For example, in September IBM announced an agreement to create the first commercial applications of Watson with WellPoint – a health benefits company. Under the agreement, WellPoint will develop and launch Watson-based solutions to help improve patient care. IBM will develop the base Watson healthcare technology on which WellPoint’s solution will run.  Last month, Cedars-Sinai signed on with WellPoint to help develop an oncology solution using Watson.  Cedars-Sinai’s oncology experts will help develop recommendations on appropriate clinical content for the WellPoint health care solutions. They will assist in the evaluation and testing of these tools.  In fact, these oncologists will “enter hypothetical patient scenarios, evaluate the proposed treatment options generated by IBM Watson, and provide guidance on how to improve the content and utility of the treatment options provided to the physicians.”  Wow.

Moving forward, picture potentially large numbers of core knowledge bases that are trained and available for particular companies to build upon.  This would be available in a public cloud model and potentially a private one as well, but with IBM involvement.  This might include Watsons for law or financial planning or even politics (just kidding) – any area where there is a huge corpus of information that people need to wrap their arms around in order to make better decisions.

IBM is now working with its partners to figure out what the user interface for these Watsons-as-a-Service might look like. Will Watson ask the questions? Can end users, say doctors, put in their own information and have Watson use it? This remains to be seen.

Ready for Watson?

In the meantime, IBM recently rolled out its "Ready for Watson" program. The idea is that a move to Watson might not be a linear progression. It depends on the business problem that companies are looking to solve. So IBM has tagged certain of its products as "ready" to be incorporated into a Watson solution. IBM Content and Predictive Analytics for Healthcare is one example of this. It combines IBM's content analytics and predictive analytics solutions, which are components of Watson. Therefore, if a company used this solution, it could migrate it to a Watson-as-a-Service deployment down the road.

So happy anniversary, IBM Watson! You have many people excited and some people a little bit scared. For myself, I am excited to see where Watson is on its first anniversary and am looking forward to seeing what progress it has made by its second.

Four Vendor Views on Big Data and Big Data Analytics: IBM

Next in my discussion of big data providers is IBM. Big data plays right into IBM's portfolio of solutions in the information management space. It also dovetails very nicely with the company's Smarter Planet strategy. Smarter Planet holds the vision of the world as a more interconnected, instrumented, and intelligent place. IBM's Smarter Cities and Smarter Industries are all part of its solutions portfolio. For companies to be successful in this type of environment, a new emphasis on big data and big data analytics is required.

Here’s a quick look at how IBM is positioning around big data, some of its product offerings, and use cases for big data analytics.

IBM

According to IBM, big data has three characteristics.  These are volume, velocity, and variety.   IBM is talking about large volumes of both structured and unstructured data.  This can include audio and video together with text and traditional structured data.  It can be gathered and analyzed in real time.

IBM has both hardware and software products to support both big data and big data analytics.  These products include:

  • InfoSphere Streams – a platform that can be used to perform deep analysis of massive volumes of relational and non-relational data types with sub-millisecond response times. Cognos Real-time Monitoring can also be used with InfoSphere Streams for dashboarding capabilities.
  • InfoSphere BigInsights – a product that consists of IBM research technologies on top of open source Apache Hadoop. BigInsights provides core installation, development tools, web-based UIs, connectors for integration, integrated text analytics, and BigSheets for end-user visualization.
  • IBM Netezza – a high-capacity appliance that allows companies to analyze petabytes of data in minutes.
  • Cognos Consumer Insights – leverages BigInsights and text analytics capabilities to perform social media sentiment analysis.
  • IBM SPSS – IBM's predictive and advanced analytics platform that can read data from various data sources such as Netezza and be integrated with InfoSphere Streams to perform advanced analysis.
  • IBM Content Analytics – uses text analytics to analyze unstructured data. This can sit on top of InfoSphere BigInsights.

At the Information on Demand (IOD) conference a few months ago, IBM and its customers presented many use cases around big data and big data analytics. Here is what some of the early adopters are doing:

  • Engineering:  Analyzing hourly wind data, radiation, heat and 78 other attributes to determine where to locate the next wind power plant.
  • Business:
    • Analyzing social media data, for example to understand what fans are saying about a sports game in real time.
    • Analyzing customer activity at a zoo to understand guest spending habits, likes and dislikes.
  • Analyzing healthcare data:
    • Analyzing streams of data from medical devices in neonatal units.
    • Healthcare predictive analytics. One hospital is using a product called Content and Predictive Analytics to identify early hospital discharges that would result in re-admittance to the hospital.

IBM is working with its clients and prospects to implement big data initiatives.  These initiatives generally involve a services component given the range of product offerings IBM has in the space and the newness of the market.  IBM is making significant investments in tools, integrated analytic accelerators, and solution accelerators to reduce deployment time and cost to deploy these kinds of solutions.

At IBM, big data is about the “the art of the possible.”   According to the company, price points on products that may have been too expensive five years ago are coming down.  IBM is a good example of a vendor that is both working with customers to push the envelope in terms of what is possible with big data and, at the same time, educating the market about big data.   The company believes that big data can change the way companies do business.  It’s still early in the game, but IBM has a well-articulated vision around big data.  And, the solutions its clients discussed were big, bold, and very exciting.  The company is certainly a leader in this space.

Five vendors committed to content analytics for ECM

In 2007, Hurwitz & Associates fielded one of the first market studies on text analytics. At that time, text analytics was considered to be more of a natural extension to a business intelligence system than a content management system. However, in that study, we asked respondents who were planning to use the software whether they were planning to deploy it in conjunction with their content management systems. It turns out that a majority of respondents (62%) intended to use text analytics software in this manner. Text analytics, of course, is the natural extension to content management, and we have seen the market evolve to the point where several vendors have included text analytics as part of their offerings to enrich content management solutions.

Over the next few months, I am going to do a deeper dive into solutions that are at the intersection of text analytics and content management; three from content management vendors EMC, IBM, and OpenText as well as solutions from text analytics vendor TEMIS and analytics vendor SAS. Each of these vendors is actively offering solutions that provide insight into content stored in enterprise content management systems. Many of the solutions described below also go beyond providing insight for content stored in enterprise content management systems to include insight over other content both internal and external to an organization. A number of solutions also integrate structured data with unstructured information.

EMC: EMC refers to its content analytics capability as Content Intelligence Services (CIS). CIS supports entity extraction as well as categorization. It enables advanced search and discovery over a range of platforms including ECM systems such as EMC’s Documentum, Microsoft SharePoint, and others.

IBM: IBM offers a number of products with text analytics capabilities. Its goal is to provide rapid and deep insight into unstructured data. The IBM Content Analytics solution provides integration into IBM ECM (FileNet) solutions such as IBM Case Manager, its big data solutions (Netezza) and integration technologies (DataStage). It also integrates securely with other ECM solutions such as SharePoint, Livelink, Documentum and others.

OpenText: OpenText acquired text analytics vendor Nstein in 2010 in order to invest in semantic technology and expand its semantic coverage. Nstein semantic services are now integrated with OpenText’s ECM suite. This includes automated content categorization and classification as well as enhanced search and navigation. The company will soon be releasing additional analytics capabilities to support content discovery. Content Analytics services can also be integrated into other ECM systems.

SAS: SAS Institute provides a number of products for unstructured information access and discovery as part of its vision for the semantically integrated enterprise. These include SAS Enterprise Content Categorization and SAS Ontology Management (both for improving document relevance), as well as SAS Sentiment Analysis and SAS Text Miner for knowledge discovery. The products integrate with structured information; with Microsoft SharePoint, FAST ESP, Endeca, and EMC Documentum; and with both Teradata and Greenplum.

TEMIS: TEMIS recently released its Networked Content Manifesto, which describes its vision of a network of semantic links connecting documents to enable new forms of navigation and retrieval from a collection of documents. It uses text analytics techniques to extract semantic metadata from documents that can then link documents together. Content Management systems form one part of this linked ecosystem. TEMIS integrates into ECM systems including EMC Documentum and Centerstage, Microsoft SharePoint 2010 and MarkLogic.

Advanced Analytics and the skills needed to make it happen: Takeaways from IBM IOD

Advanced analytics was a big topic at the IBM IOD conference last week. As part of this, predictive analytics was again an important piece of the story, along with other advanced analytics capabilities IBM has developed or is in the process of developing to support optimization. These include BigInsights (for big data), stream analytics, content/text analytics, and, of course, the latest release of Cognos.

One especially interesting topic discussed at the conference was the skills required to make advanced analytics a reality. I have been writing and thinking a lot about this subject, so I was very happy to hear IBM address it head on during the second-day keynote. This keynote included a customer panel and another speaker, Dr. Atul Gawande, and both offered some excellent insights. The panel included Scott Friesen (Best Buy), Scott Futren (Gwinnett County Public Schools), Srinivas Koushik (Nationwide), and Greg Christopher (Nestle). Here are some of the interrelated nuggets from the discussions:

• Ability to deliver vs. the ability to absorb. One panelist made the point that a lot of new insights are being delivered to organizations. In the future, it may become difficult for people to absorb all of this information (and this will require new skills too).
• Analysis and interpretation. People will need to know how to analyze and how to interpret the results of an analysis. As Dr. Gawande pointed out, “Having knowledge is not the same as using knowledge effectively.”
• The right information. One of the panelists mentioned that putting analytics tools in the hands of line people might be too much for them, and instead the company is focusing on giving these employees the right information.
• Leaders need to have capabilities too. If executives are accustomed to using spreadsheets and relying on their gut instincts, then they will also need to learn how to make use of analytics.
• Cultural changes. From call center agents using the results of predictive models to workers on the line seeing reports to business analysts using more sophisticated models, change is coming. This change means people will be changing the way that they work. How this change is handled will require special thought by organizations.

IBM executives also made a point of discussing the critical skills required for analytics. These included strategy development, developing user interfaces, enterprise integration, modeling, and dealing with structured and unstructured data. IBM has, of course, made a huge investment in these skills. GBS executives emphasized the 8,500 employees in its Global Business Services Business Analytics and Optimization group. Executives also pointed to the fact that the company has thousands of partners in this space and that 1 in 3 IBMers will attend analytics training. So, IBM is prepared to help companies in their journey into business analytics.

Are companies there yet? I think that it is going to take organizations time to develop some of these skills (and some they should probably outsource). Sure, analytics has been around a long time. And sure, vendors are making their products easier to use and that is going to help end users become more effective. Even if we’re just talking about a lot of business people making use of analytic software (as opposed to operationalizing it in a business process), the reality is that analytics requires a certain mindset. Additionally, unless someone understands the context of the information he or she is dealing with, it doesn’t matter how user friendly the platform is – they can still get it wrong. People using analytics will need to think critically about data, understand their data, and understand context. They will also need to know what questions to ask.

I whole-heartedly believe it is worth the investment of time and energy to make analytics happen.

Please note:

As luck would have it, I am currently fielding a study on advanced analytics! I am interested in understanding what your company's plans are for advanced analytics. If you're not planning to use advanced analytics, I'd like to know why. If you're already using advanced analytics, I'd like to understand your experience.

If you participate in this survey I would be happy to send you a report of our findings. Simply provide your email address at the end of the survey! Here’s the link:

Click here to take survey

What about Analytics in Social Media monitoring?

I was speaking to a client the other day. This company was very excited about tracking its brand using one of the many listening posts on the market. As I sat listening to him, I couldn't help but think that a) it was nice that his company could get its feet wet in social media monitoring using a tool like this, and b) the company might be getting a false sense of security, because the reality is that these social media tracking tools provide a fairly rudimentary analysis of brand/product mentions, sentiment, and influencers. For those of you not familiar with listening posts, here's a quick primer.

Listening Post Primer

Listening posts monitor the “chatter” that is occurring on the Internet in blogs, message boards, tweets, etc.  They basically:

  • Aggregate content from across many, many Internet sources.
  • Track the number of mentions of a topic (brand or some other term) over time and by source of mention.
  • Provide users with the positive or negative sentiment associated with a topic (often you can't change this if it is incorrect).
  • Provide some sort of influencer information.
  • Possibly provide a word cloud that lets you know what other words are associated with your topic.
  • Provide you with the ability to look at the content associated with your topic.

They typically charge by the topic. Since these listening posts mostly use a search paradigm (with ways to aggregate words into a search topic), they don't really allow you to "discover" information or insight that you may not have been aware of, unless you happen to stumble across it while reading posts or put a lot of time into manually mining the information. Some services allow the user to draw on historical data. There are more than 100 listening posts on the market.
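The bullets above boil down to a counting loop: gather posts, count topic mentions per source, and tag each mention with a crude sentiment. Here is a minimal sketch of that core loop, with invented posts and keyword lists standing in for a real aggregation feed:

```python
from collections import Counter

# Crude keyword lexicons -- invented for this sketch; real listening
# posts use larger lexicons and more sophisticated sentiment rules.
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "broken", "awful"}

posts = [
    {"source": "twitter", "text": "I love the new AcmePhone"},
    {"source": "blog", "text": "AcmePhone battery is broken again"},
    {"source": "twitter", "text": "AcmePhone is great"},
]

def monitor(posts, topic):
    """Count mentions of a topic per source and tally crude sentiment."""
    mentions, sentiment = Counter(), Counter()
    for post in posts:
        words = set(post["text"].lower().split())
        if topic in words:
            mentions[post["source"]] += 1
            if words & POSITIVE:
                sentiment["positive"] += 1
            elif words & NEGATIVE:
                sentiment["negative"] += 1
            else:
                sentiment["neutral"] += 1
    return mentions, sentiment

mentions, sentiment = monitor(posts, "acmephone")
print(mentions, sentiment)
```

Notice that nothing here "discovers" anything: the loop can only count what you already told it to search for, which is exactly the limitation described above.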

I certainly don’t want to minimize what these providers are offering.  Organizations that are just starting out analyzing social media will certainly derive huge benefit from these services.  Many are also quite easy to use and the price point is reasonable. My point is that there is more that can be done to derive more useful insight from social media.  More advanced systems typically make use of text analytics software.   Text analytics utilizes techniques that originate in computational linguistics, statistics, and other computer science disciplines to actually analyze the unstructured text.

Adding Text Analytics to the Mix

Although still in the early phases, social media monitoring is moving to social media analysis and understanding as text analytics vendors apply their technology to this problem.  The space is heating up as evidenced by these three recent announcements:

  • Attensity buys Biz360. The other week, Attensity announced its intention to purchase Biz360, a leading listening post. In April 2009, Attensity combined with two European companies that focus on semantic business applications to form Attensity Group (formerly Attensity Corporation). Attensity has sophisticated technology which makes use of "exhaustive extraction" techniques (as well as nine other techniques) to analyze unstructured data. Its flagship technology automatically extracts facts from parsed text (who did what to whom, when, where, under what conditions) and organizes this information. With the addition of Biz360 and its earlier acquisitions, the Biz360 listening post will feed all Attensity products. Additionally, the Biz360 SaaS platform will be expanded to include deeper semantic capabilities for analysis, sentiment, response, and knowledge management utilizing Attensity IP. This service will be called Attensity 360. The service will provide listening and deep analysis capabilities. On top of this, extracted knowledge will be automatically routed to the group in the enterprise that needs the information. For example, legal insights about people, places, events, topics, and sentiment will be automatically routed to legal, customer service insights to customer service, and so on. These groups can then act on the information. Attensity refers to this as the "open enterprise." The idea is an end-to-end listen-analyze-respond-act process for enterprises to act on the insight they can get from the solution.
  • SAS announces its social media analytics software. SAS purchased text analytics vendor Teragram last year. In April, SAS announced SAS® Social Media Analytics which, "Analyzes online conversations to drive insight, improve customer interaction, and drive performance." The product provides deep unstructured data analysis capabilities around both internal and external sources of information (it has partnerships with external content aggregators, if needed) for brand, media, PR, and customer-related information. SAS has coupled this with the ability to perform advanced analytics, such as predictive forecasting and correlation, on this unstructured data. For example, the SAS product enables companies to forecast the number of mentions, given a history of mentions, or to understand whether sentiment during a certain time period was more negative than, say, a previous time period. It also enables users to analyze sentiment at a granular level and to change sentiment (and learn from this) if it is not correct. It can deal with sentiment in 13 languages and supports 30 languages.
  • Newer social media analysis services such as NetBase are announced. NetBase is currently in limited release of its first consumer insight discovery product, called ConsumerBase. It has eight patents pending around its deep parsing and semantic modeling technology. It combines deep analytics with a content aggregation service and a reporting capability. The product provides analysis around likes/dislikes, emotions, reasons why, and behaviors. For example, whereas a listening post might interpret the sentence, "Listerine kills germs because it hurts" as either a negative or neutral statement, the NetBase technology uses a semantic data model to understand not only that this is a positive statement, but also the reason it is positive.
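The Listerine example shows why pure keyword counting fails: "kills" and "hurts" read as negative words, yet the sentence describes a product benefit. The sketch below contrasts a naive keyword scorer with a single invented semantic rule ("kills <bad thing>" is a benefit). It illustrates the idea only; it is not NetBase's actual model.

```python
# Invented word lists for the sketch.
NEGATIVE_WORDS = {"kills", "hurts"}
POSITIVE_WORDS = {"love", "effective"}
BAD_THINGS = {"germs", "bacteria", "odor"}

def keyword_sentiment(text):
    """Naive listening-post style scoring: count sentiment words."""
    words = set(text.lower().split())
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def semantic_sentiment(text):
    """One toy semantic rule: 'kills <bad thing>' is good for the product."""
    words = text.lower().split()
    for i, word in enumerate(words[:-1]):
        if word == "kills" and words[i + 1] in BAD_THINGS:
            return "positive"  # killing a bad thing is a benefit
    return keyword_sentiment(text)

sentence = "Listerine kills germs because it hurts"
print(keyword_sentiment(sentence))   # the naive reading
print(semantic_sentiment(sentence))  # the semantic reading
```

The keyword scorer calls the sentence negative; the semantic rule recognizes the benefit. Scaling that single hand-written rule up to general language is precisely the hard part that deep parsing and semantic modeling aim to solve.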

Each of these products and services is slightly different. For example, Attensity's approach is to listen, analyze, relate (it to the business), and act (route, respond, reuse), which it calls its LARA methodology. The SAS solution is part of its broader three Is strategy: Insight, Interaction, Improve. NetBase is looking to provide an end-to-end service that helps companies understand the reasons behind emotions, behaviors, likes, and dislikes. And these are not the only games in town. Other social media analysis services announced in the last year (or earlier) include those from other text analytics vendors such as IBM, Clarabridge, and Lexalytics. And, to be fair, some of the listening posts are beginning to put this capability into their services.

This market is still in its early adoption phase, as companies try to put plans together around social media, including utilizing it for their own marketing purposes as well as analyzing it for reasons including and beyond marketing. It will be extremely important for users to determine what their needs and price points are and plan accordingly.

A different spin on analyzing content – Infosphere Content Assessment

IBM made a number of announcements last week at IOD regarding new products and offerings to help companies analyze content. One was Cognos Content Analytics, which enables organizations to analyze unstructured data alongside structured data. It also looks like IBM may be releasing a "voice of the customer" type of service to help companies understand what is being said about them in the "cloud" (i.e., blogs, message boards, and the like). Stay tuned on that front; it is currently being "previewed".

I was particularly interested in a new product called IBM InfoSphere Content Assessment, because I thought it was an interesting use of text analytics technology. The product uses content analytics (IBM's term for text analytics) to analyze "content in the wild". This means that a user can take the software and run it over servers that might contain terabytes (or even petabytes) of data to understand what is stored there. Here are some of the potential use cases for this kind of product:

  • Decommission data. Once you understand the data that is on a server, you might choose to decommission it, thereby freeing up storage space.
  • Records enablement. InfoSphere Content Assessment can also be used to identify which records need to go into a records management system for a record retention program.
  • E-Discovery. Of course, this technology could also be used in litigation, investigation, and audit. It can analyze unstructured content on servers, which can help to discover information that may be used in legal matters or information that needs to meet certain audit requirements for compliance.

The reality is that the majority of companies don’t formally manage their content.  It is simply stored on file servers.  The IBM product team’s view is that companies can “acknowledge the chaos”, but use the software to understand what is there and gain control over the content.  I had not seen a product positioned quite this way before and I thought it was a good use of the content analysis software that IBM has developed.

If anyone else knows of software like this, please let me know.
