Two Weeks and Counting to Big Data for Dummies

I am excited to announce that I’m a co-author of Big Data for Dummies, which will be released in mid-April 2013. Here’s the synopsis from Wiley:

Find the right big data solution for your business or organization

Big data management is one of the major challenges facing business, industry, and not-for-profit organizations. Data sets such as customer transactions for a mega-retailer, weather patterns monitored by meteorologists, or social network activity can quickly outpace the capacity of traditional data management tools. If you need to develop or manage big data solutions, you’ll appreciate how these four experts define, explain, and guide you through this new and often confusing concept. You’ll learn what it is, why it matters, and how to choose and implement solutions that work.

  • Effectively managing big data is an issue of growing importance to businesses, not-for-profit organizations, government, and IT professionals
  • Authors are experts in information management, big data, and a variety of solutions
  • Explains big data in detail and discusses how to select and implement a solution, security concerns to consider, data storage and presentation issues, analytics, and much more
  • Provides essential information in a no-nonsense, easy-to-understand style that is empowering


Big Data For Dummies cuts through the confusion and helps you take charge of big data solutions for your organization.

Five Challenges for Text Analytics

While text analytics is considered a “must have” technology by the majority of companies that use it, challenges abound. So I’ve learned from the many companies I’ve talked to as I prepare Hurwitz & Associates’ Victory Index for Text Analytics, a tool that assesses not just the technical capability of the technology but also its ability to provide tangible value to the business (look for the results of the Victory Index in about a month). Here are the top five challenges: http://bit.ly/Tuk8DB.  Interestingly, most of them have nothing to do with the technology itself.

Are you ready for IBM Watson?

This week marks the one-year anniversary of the IBM Watson computer system winning at Jeopardy!. Since then, Watson has attracted a great deal of interest. Companies want one of their own.

But what exactly is Watson and what makes it unique?  What does it mean to have a Watson?  And, how is commercial Watson different from Jeopardy Watson?

What is Watson and why is it unique?

Watson is a new class of analytic solution

Watson is a set of technologies that processes and analyzes massive amounts of both structured and unstructured data in a unique way. One statistic given at the recent IOD conference is that Watson can process and analyze the information in 200 million books in three seconds. While Watson is very advanced, it uses technologies that are commercially available, along with some “secret sauce” technologies that IBM Research has either enhanced or developed. It combines software technologies from big data, content and predictive analytics, and industry-specific software to make it work.

Watson includes several core pieces of technology that make it unique

So what is this secret sauce?  Watson understands natural language, generates and evaluates hypotheses, and adapts and learns.

First, Watson uses Natural Language Processing (NLP). NLP is a broad and complex field that has developed over the last ten to twenty years. The goal of NLP is to derive meaning from text. NLP generally makes use of linguistic concepts such as grammatical structure and parts of speech. It breaks apart sentences and extracts information such as entities, concepts, and relationships. IBM is using a set of annotators to extract information like symptoms, age, location, and so on.
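
To make the idea concrete, here is a toy sketch in Python of what an annotator might do: pull out entities like age and symptoms using a regular expression and a small hand-built lexicon. The patterns, entity names, and lexicon are my own illustrations, not IBM's actual annotators.

```python
import re

# Toy annotator: extracts a few entity types from clinical-style text.
# The patterns and entity names are illustrative inventions, not
# IBM's actual Watson annotators.
def annotate(text):
    entities = []
    # Age expressed as "<number>-year-old" or "age <number>"
    for m in re.finditer(r"\b(\d{1,3})[- ]year[- ]old\b|\bage (\d{1,3})\b", text):
        entities.append(("AGE", m.group(1) or m.group(2)))
    # Symptoms drawn from a small hand-built dictionary
    symptom_lexicon = ["fatigue", "hair loss", "heart palpitations", "muscle weakness"]
    for symptom in symptom_lexicon:
        if symptom in text.lower():
            entities.append(("SYMPTOM", symptom))
    return entities

text = "A 45-year-old patient reports fatigue and heart palpitations."
print(annotate(text))
```

Real annotators use parsing, dictionaries, and machine-learned models rather than simple string matching, but the input/output shape is the same: raw text in, typed entities out.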

NLP by itself is not new. What is new is that Watson processes vast amounts of this unstructured data quickly, using an architecture designed for the task.

Second, Watson works by generating hypotheses, which are potential answers to a question. It is trained by feeding question and answer (Q/A) data into the system; in other words, it is shown representative questions and learns from the supplied answers. This is called evidence-based learning. The goal is to produce a model that can generate a confidence score (think logistic regression with a bunch of attributes). Watson starts with a generic statistical model, looks at the first Q/A pair, and uses it to tweak the coefficients. As it gains more evidence, it continues to adjust the coefficients until it can “say” that confidence is high. Training Watson is key, because what is really happening is that the trainers are building statistical models that are scored. At the end of training, Watson has a system of feature vectors and models, so that it can use the model to probabilistically score candidate answers. The key here is something that Jeopardy! did not showcase: Watson is not deterministic (i.e., rule-based). It is probabilistic, and that makes it dynamic.
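
For intuition, here is a minimal sketch of that evidence-based training idea: a logistic regression over a few evidence features, trained on Q/A pairs, that produces a confidence score. The feature names and training data are invented for illustration; Watson's real models are far richer. This just shows the "tweak the coefficients as evidence arrives" loop.

```python
import math

# Minimal sketch of evidence-based training: logistic regression over
# invented evidence features. Not IBM's actual model.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, lr=0.5, epochs=200):
    n = len(examples[0][0])
    w = [0.0] * n                          # start from a generic (zero) model
    for _ in range(epochs):
        for features, label in examples:   # each Q/A pair tweaks the coefficients
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, features)))
            for i in range(n):
                w[i] += lr * (label - p) * features[i]
    return w

# Hypothetical features: [passage-match score, entity-type match, source reliability]
training = [
    ([0.9, 1.0, 0.8], 1),   # correct candidate answer
    ([0.2, 0.0, 0.5], 0),   # wrong candidate answer
    ([0.8, 1.0, 0.6], 1),
    ([0.3, 0.0, 0.9], 0),
]
w = train(training)
# Score a new candidate answer; strong evidence yields high confidence
confidence = sigmoid(sum(wi * xi for wi, xi in zip(w, [0.85, 1.0, 0.7])))
print(round(confidence, 2))
```

The point is the shape of the process: a generic model, coefficients nudged by each supplied answer, and a probabilistic confidence out the other end.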

When Watson generates a hypothesis, it then scores that hypothesis based on the evidence. Its goal is to get the right answer for the right reason. (So, theoretically, if there are 5 symptoms that must be positive for a certain disease and 4 that must be negative, and Watson only has 4 of the 9 pieces of information, it could ask for more.) The hypothesis with the highest score is presented. By the end of the analysis, Watson knows both when it is confident of the answer and when it is not.
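
The score-and-abstain behavior can be sketched in a few lines. The hypotheses, scores, and threshold below are made up for illustration:

```python
# Pick the highest-scoring hypothesis, but abstain when confidence is low.
# Scores and threshold are invented for the sketch.
def best_hypothesis(scored, threshold=0.5):
    answer, confidence = max(scored.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        return None, confidence        # "I don't know the answer"
    return answer, confidence

scored = {"hyperthyroidism": 0.62, "anxiety": 0.31, "anemia": 0.07}
print(best_hypothesis(scored))                  # ('hyperthyroidism', 0.62)
print(best_hypothesis({"a": 0.3, "b": 0.2}))    # (None, 0.3)
```
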

Here’s an example. Suppose you go to see your doctor because you are not feeling well. Specifically, you might have heart palpitations, fatigue, hair loss, and muscle weakness. You want to find out whether there is something wrong with your thyroid or whether it is something else. If your doctor has access to a Watson system, he could use it to help advise him on your diagnosis. In this case, Watson would already have ingested and curated all of the information in the books and journals associated with thyroid disease. It would also have the diagnoses and related information from prior cases, drawn from the electronic medical records of other patients at the hospital and other doctors in the practice. Based on the first set of symptoms you report, it would generate a hypothesis along with associated probabilities (e.g., 60% hyperthyroidism, 40% anxiety). It might then ask for more information. As it is fed that information, such as patient history, Watson would continue to refine its hypotheses along with the probability of each being correct. Once it has all of the information, it iterates through it and presents the diagnosis with the highest confidence level, which the physician would use to assist in making the diagnosis and developing a treatment plan. If Watson doesn’t know the answer, it will state that it does not have an answer or doesn’t have enough information to provide one.
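
One simple way to picture the "refine the hypothesis as information arrives" step is a Bayesian update: each new piece of evidence reweights the candidate diagnoses. The priors and likelihoods below are invented numbers for the sketch, not medical data, and this is my simplification rather than Watson's actual algorithm.

```python
# Toy Bayesian refinement of diagnosis probabilities as a new symptom
# arrives. All numbers are invented for illustration.
def update(priors, likelihoods):
    posterior = {d: priors[d] * likelihoods[d] for d in priors}
    total = sum(posterior.values())
    return {d: p / total for d, p in posterior.items()}

priors = {"hyperthyroidism": 0.6, "anxiety": 0.4}
# P(new symptom "muscle weakness" | diagnosis), made up for the sketch
likelihoods = {"hyperthyroidism": 0.7, "anxiety": 0.2}
posterior = update(priors, likelihoods)
print({d: round(p, 2) for d, p in posterior.items()})
# → {'hyperthyroidism': 0.84, 'anxiety': 0.16}
```

Each new answer from the patient plays the role of another likelihood table, sharpening or weakening each hypothesis until one stands out (or none does).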

IBM likens the process of training a Watson to teaching a child how to learn.  A child can read a book to learn.  However, he can also learn by a teacher asking questions and reinforcing the answers about that text.

Can I buy a Watson?

Watson will be offered in the cloud in an “as a service” model.  Since Watson is in its own class, let’s call this Watson as a Service (WaaS).  Since Watson’s knowledge is essentially built in tiers, the idea is that IBM will provide the basic core knowledge in a particular WaaS solution space, say all of the corpus about a particular subject – like diabetes – and then different users could build on this.

For example, in September IBM announced an agreement to create the first commercial applications of Watson with WellPoint – a health benefits company. Under the agreement, WellPoint will develop and launch Watson-based solutions to help improve patient care. IBM will develop the base Watson healthcare technology on which WellPoint’s solution will run.  Last month, Cedars-Sinai signed on with WellPoint to help develop an oncology solution using Watson.  Cedars-Sinai’s oncology experts will help develop recommendations on appropriate clinical content for the WellPoint health care solutions. They will assist in the evaluation and testing of these tools.  In fact, these oncologists will “enter hypothetical patient scenarios, evaluate the proposed treatment options generated by IBM Watson, and provide guidance on how to improve the content and utility of the treatment options provided to the physicians.”  Wow.

Moving forward, picture potentially large numbers of core knowledge bases that are trained and available for particular companies to build upon.  This would be available in a public cloud model and potentially a private one as well, but with IBM involvement.  This might include Watsons for law or financial planning or even politics (just kidding) – any area where there is a huge corpus of information that people need to wrap their arms around in order to make better decisions.

IBM is now working with its partners to figure out what the user interface for these Watsons-as-a-Service might look like. Will Watson ask the questions? Can end users, say doctors, put in their own information for Watson to use? This remains to be seen.

Ready for Watson?

In the meantime, IBM recently rolled out its “Ready for Watson” program. The idea is that a move to Watson might not be a linear progression; it depends on the business problem a company is looking to solve. So IBM has tagged certain of its products as “ready” to be incorporated into a Watson solution. IBM Content and Predictive Analytics for Healthcare is one example; it combines IBM’s content analytics and predictive analytics solutions, which are components of Watson. Therefore, a company that uses this solution could migrate it to a Watson-as-a-Service deployment down the road.

So happy anniversary, IBM Watson! You have many people excited and some people a little bit scared. For myself, I am excited to see where Watson is on its first anniversary and am looking forward to seeing what progress it has made by its second.

Four Vendor Views on Big Data and Big Data Analytics: IBM

Next in my discussion of big data providers is IBM. Big data plays right into IBM’s portfolio of solutions in the information management space. It also dovetails very nicely with the company’s Smarter Planet strategy, which holds a vision of the world as a more interconnected, instrumented, and intelligent place. IBM’s Smarter Cities and Smarter Industries initiatives are all part of its solutions portfolio. For companies to be successful in this type of environment requires a new emphasis on big data and big data analytics.

Here’s a quick look at how IBM is positioning around big data, some of its product offerings, and use cases for big data analytics.

IBM

According to IBM, big data has three characteristics: volume, velocity, and variety. IBM is talking about large volumes of both structured and unstructured data. This can include audio and video together with text and traditional structured data, and it can be gathered and analyzed in real time.

IBM has both hardware and software products to support both big data and big data analytics.  These products include:

  • Infosphere Streams – a platform that can be used to perform deep analysis of massive volumes of relational and non-relational data types with sub-millisecond response times. Cognos Real-time Monitoring can also be used with Infosphere Streams for dashboarding capabilities.
  • Infosphere BigInsights – a product that consists of IBM research technologies on top of open source Apache Hadoop. BigInsights provides core installation, development tools, web-based UIs, connectors for integration, integrated text analytics, and BigSheets for end-user visualization.
  • IBM Netezza – a high-capacity appliance that allows companies to analyze petabytes of data in minutes.
  • Cognos Consumer Insights – leverages BigInsights and text analytics capabilities to perform social media sentiment analysis.
  • IBM SPSS – IBM’s predictive and advanced analytics platform, which can read data from various data sources such as Netezza and be integrated with Infosphere Streams to perform advanced analysis.
  • IBM Content Analytics – uses text analytics to analyze unstructured data. This can sit on top of Infosphere BigInsights.

At the Information on Demand (IOD) conference a few months ago, IBM and its customers presented many use cases around big data and big data analytics. Here is what some of the early adopters are doing:

  • Engineering:  Analyzing hourly wind data, radiation, heat and 78 other attributes to determine where to locate the next wind power plant.
  • Business:
    • Analyzing social media data, for example to understand what fans are saying about a sports game in real time.
    • Analyzing customer activity at a zoo to understand guest spending habits, likes and dislikes.
  • Analyzing healthcare data:
    • Analyzing streams of data from medical devices in neonatal units.
    • Healthcare predictive analytics. One hospital is using a product called Content and Predictive Analytics to identify and limit early hospital discharges that would result in re-admittance to the hospital.

IBM is working with its clients and prospects to implement big data initiatives.  These initiatives generally involve a services component given the range of product offerings IBM has in the space and the newness of the market.  IBM is making significant investments in tools, integrated analytic accelerators, and solution accelerators to reduce deployment time and cost to deploy these kinds of solutions.

At IBM, big data is about “the art of the possible.” According to the company, price points on products that may have been too expensive five years ago are coming down. IBM is a good example of a vendor that is both working with customers to push the envelope in terms of what is possible with big data and, at the same time, educating the market about big data. The company believes that big data can change the way companies do business. It’s still early in the game, but IBM has a well-articulated vision around big data. And the solutions its clients discussed were big, bold, and very exciting. The company is certainly a leader in this space.

Four Vendor Views on Big Data and Big Data Analytics Part 1: Attensity

I am often asked whether it is the vendors or the end users who are driving the Big Data market. I usually reply that both are. There are early adopters of any technology that push the vendors to evolve their own products and services. The vendors then show other companies what can be done with this new and improved technology.

Big Data and Big Data Analytics are hot topics right now. Different vendors of course, come at it from their own point of view. Here’s a look at how four vendors (Attensity, IBM, SAS, and SAP) are positioning around this space, some of their product offerings, and use cases for Big Data Analytics.

In Attensity’s world, Big Data is all about high volume customer conversations. Attensity text analytics solutions can be used to analyze both internal and external data sources to better understand the customer experience. For example, they can analyze sources such as call center notes, emails, survey verbatims, and other documents to understand customer behavior. With its recent acquisition of Biz360, the company can combine social media from 75 million sources and analyze this content to understand the customer experience. Since industry estimates put the structured/unstructured data ratio at roughly 20%/80%, this kind of data needs to be addressed. While vendors with Big Data appliances have talked about integrating and analyzing unstructured data as part of the Big Data equation, most of what has been done to date has dealt primarily with structured data. This is changing, but it is good to see a text analytics vendor address the issue head on.

Attensity already has a partnership with Teradata, so it can marry information extracted from its unstructured data (from internal conversations) with structured data stored in the Teradata warehouse. Recently, Attensity extended this partnership to Aster Data, which was acquired by Teradata. Aster Data provides a platform for Big Data Analytics: the Aster MapReduce Platform is a massively parallel software solution that embeds MapReduce analytic processing with data stores for big data analytics on what the company terms “multistructured data sources and types.” Attensity can now be embedded as runtime SQL in the Aster Data library to enable the real time analysis of social media streams. Aster Data will also act as a long-term archival and analytics platform for the Attensity real-time Command Center platform, for social media feeds and iterative exploratory analytics. By mid-2012 the plan is for complete integration with the Attensity Analyze application.

Attensity describes several use cases for the real time analysis of social streams:

1. Voice of the Customer Command Center: the ability to semantically annotate real-time social data streams and combine that with multi-channel customer conversation data in a Command Center view that gives companies a real-time view of what customers are saying about their company, products and brands.
2. Hotspotting: the ability to analyze customer conversations to identify emerging trends. Unlike common keyword based approaches, Hotspot reports identify issues that a company might not already know about, as they emerge, by measuring the “significance” of change in probability for a data value between a historical period and the current period. Attensity then assigns a “temperature” value to mark the degree of difference between the two probabilities. Hot means significantly trending upward in the current period vs. historical. Cold means significantly trending downward in the current period vs. historical.
3. Customer service: the ability to analyze conversations to identify top complaints and issues and prioritize incoming calls, emails or social requests accordingly.
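
To illustrate the Hotspotting idea, here is a toy version: compare an issue's share of conversations in the current period to a historical baseline and assign a "temperature." The threshold, labels, and data are my inventions; Attensity's actual significance measure is not public.

```python
# Toy "hotspotting": label the change in an issue's conversation share
# between a historical period and the current period. Thresholds and
# data are invented for illustration.
def temperature(historical_share, current_share, threshold=0.05):
    delta = current_share - historical_share
    if delta > threshold:
        return "hot"       # significantly trending up vs. historical
    if delta < -threshold:
        return "cold"      # significantly trending down vs. historical
    return "neutral"

# issue -> (share of conversations historically, share this week)
issues = {
    "battery life": (0.02, 0.11),
    "shipping": (0.10, 0.09),
    "pricing": (0.15, 0.04),
}
for issue, (hist, curr) in issues.items():
    print(issue, temperature(hist, curr))
# battery life hot / shipping neutral / pricing cold
```

A real implementation would use a statistical significance test on the two probabilities rather than a fixed threshold, but the hot/cold framing is the same.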

Next Up: SAS

Five basic questions to ask before leaping into low cost social media monitoring

I just finished testing two low cost (<$50.00/mo) social media monitoring tools. They were both easy to use, with clean user interfaces. Both had some nice features, especially around reaching back out to people making comments in social media. However, running these two services side by side brought home some issues with these kinds of offerings, specifically in the areas of coverage, sentiment analysis, and analytics. Note that I am not naming names because I believe the issues I ran into are not unique to these specific tools but apply to the low cost social media monitoring market in general. Some of these issues will also apply to higher priced offerings!

What I did:

I ran an analysis using the term “advanced analytics” including the term itself as well as companies (as additional topics) in the space. I made sure to be as specific and clear as I could be, since I knew topics were keyword based. I let the services run side by side for several weeks, interacting with the tools on a regular basis.

What I noticed:

1. Topic specification. Tools vary in how the end user can input what he or she is looking for. Some will let you refine your keyword search (and these are keyword based; they won’t let you discover topics per se), others won’t. Some will only allow search across all languages; others will allow the user to specify the language. Find out how you can be sure that you are getting what you are searching for. For example, does the tool allow you to be very specific about words that should and should not be included (e.g., SAS is the name of an analytics company and also an airline)? Since these low cost tools often don’t provide a way to build a thesaurus, you need to be careful.
2. Completeness of Coverage. The coverage varied tremendously between the services. Nor was the coverage the same on the same day for the same company name I was tracking (and I was pretty specific, although see #1 above). In fact, it seemed to vary by at least an order of magnitude. I even compared this manually in Twitter streams. When I asked, one company told me that if they weren’t picking up everything, it must be a bug and should be reported (!?). The other company told me all of my content came to me in one big fire hose, because there had been a problem with it before (!?). In both cases, there still seemed to be a problem with the completeness of content; the amount of content just didn’t add up between the two services. In fact, one company told me that since I was on a trial, I wasn’t getting all of the content; yet even with the firehose effect, the numbers didn’t make sense. Oh, and don’t forget to ask whether the service can pull in message boards, and which message boards (i.e., public vs. private). For an analysis, all of these content issues mean that the completeness of buzz might be misrepresented, which can lead to problems.
3. Duplicate Handling. What about the amount of buzz? I thought that part of my content counting discrepancy might be due to how the company was dealing with duplicates. So beware. Some companies count duplicates (such as retweets) as buzz and some do not. However, be sure to ask when duplicate content is considered duplicate and when it is not. One company told me that retweets are not counted in buzz, but are included in the tag cloud (!?).
4. Sentiment analysis. The reality is that most of the low cost tools are not that good at analyzing sentiment. Even though the company will tell you they are 75% accurate, the reality is more like 50%. Case in point: in one offering, one job listing was rated positive and another was rated negative. Looking at the two postings, it wasn’t clear why (shouldn’t a job post be neutral anyway?). Note, however, that many of these tools provide a means to change the sentiment between positive, negative, and neutral (if they don’t, then don’t buy it). So, if sentiment is a big deal to you, be prepared to wade through the content and change sentiment if need be. Also ask how the company does sentiment analysis and find out at what level it does the analysis (article, sentence, or phrase).
5. Analysis. Be prepared to ask a lot of questions about how the company is doing its analysis. For example, sometimes I could not map the total buzz to other analytics numbers (was it duplicate handling or something else?). Additionally, some social media monitoring tools will break down buzz by gender. How is this determined? Some companies determine gender based on a name algorithm, while others use profile information from Facebook or Twitter (obviously not a complete view of buzz by gender, since not all information sources are like Twitter and Facebook). Additionally, some of the tools will only show a percentage (a no-no in this case), while others show both the number and the percentage. Ditto with geolocation information. If the data is incomplete (and isn’t representative of the whole), then there could be a problem with using it for analytical purposes.
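
As a concrete illustration of the duplicate-handling question above, here is one possible counting policy sketched in Python: count retweets separately from original mentions and drop exact duplicate originals. The policy and the sample posts are mine, not any vendor's; the point is that whichever policy a tool uses changes its buzz numbers, so you need to ask.

```python
# One possible duplicate-handling policy: tally retweets separately and
# skip exact duplicate originals, so "buzz" is not inflated.
# Policy and data are invented for illustration.
def count_buzz(posts):
    originals, retweets = 0, 0
    seen = set()
    for post in posts:
        text = post.strip()
        if text.startswith("RT @"):
            retweets += 1
        elif text not in seen:          # drop exact duplicate originals
            seen.add(text)
            originals += 1
    return {"originals": originals, "retweets": retweets}

posts = [
    "Loving the new analytics dashboard",
    "RT @user: Loving the new analytics dashboard",
    "Loving the new analytics dashboard",
    "The sentiment scores look off to me",
]
print(count_buzz(posts))
# → {'originals': 2, 'retweets': 1}
```

A tool that folds retweets into buzz would report 4 mentions here; this policy reports 2 originals plus 1 retweet. Neither is wrong, but you should know which one you are getting.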

What this means

Certainly, the lure of low cost social media platforms is strong, especially for small and medium businesses. However, I would caution people to do their homework and ask the right questions, before purchasing even a low cost product. I would also suggest testing a few products (running them side by side for the same time period, even if you have to pay for it for a month or so) to compare the tools in terms of coverage, sentiment, and analysis.

The reality is that you can end up with an analysis that is completely wrong if you don’t ask the right questions of the service provider. The amount of buzz might not be what you think it is, how your company compares to another company might be wrong based on how you specified the company name, the sentiment might be entirely wrong if you don’t check it, and the analysis may be misleading unless you understand how it was put together.

Five Analytics Predictions for 2011

In 2011 analytics will take center stage as a key trend because companies are at a tipping point with the volume of data they have and their urgent need to do something about it. So, with 2010 now past and 2011 to look forward to, I wanted to take the opportunity to submit my predictions (no pun intended) regarding the analytics and advanced analytics market.

Advanced Analytics gains more steam. Advanced Analytics was hot last year and will remain so in 2011. Growth will come from at least three different sources. First, advanced analytics will increase its footprint in large enterprises. A number of predictive and advanced analytics vendors tried to make their tools easier to use in 2009-2010; in 2011, expect new users at companies already deploying the technology to come on board. Second, more companies will begin to purchase the technology because they see it as a way to increase top line revenue while gaining deeper insights about their customers. Finally, small and mid-sized companies will get into the act, looking for lower cost and user-friendly tools.
Social Media Monitoring Shake Out. The social media monitoring and analysis market is one crowded and confused space, with close to 200 vendors competing across no cost, low cost, and enterprise-cost solution classes. Expect 2011 to be a year of folding and consolidation with at least a third of these companies tanking. Before this happens, expect new entrants to the market for low cost social media monitoring platforms and everyone screaming for attention.
Discovery Rules. Text Analytics will become a mainstream technology as more companies begin to finally understand the difference between simply searching information and actually discovering insight. Part of this will be due to the impact of social media monitoring services that utilize text analytics to discover topics and patterns in unstructured data, rather than simply search social media. However, innovative companies will continue to build text analytics solutions that do more than just analyze social media.
Sentiment Analysis is Supplanted by other Measures. Building on prediction #3, by the end of 2011 sentiment analysis won’t be the be-all and end-all of social media monitoring. Yes, it is important, but the reality is that most low cost social media monitoring vendors don’t do it well. They may tell you that they get 75-80% accuracy, but it ain’t so. In fact, it is probably more like 30-40%. After many users have gotten burned by not questioning sentiment scores, they will begin to look for other meaningful measures.
Data in the cloud continues to expand, as does BI SaaS. Expect there to still be a lot of discussion around data in the cloud. However, business analytics vendors will continue to launch SaaS BI solutions, and companies will continue to buy them, especially small and mid-sized companies that find the SaaS model a good alternative to some pricey enterprise solutions. Expect to see at least ten more vendors enter the market.

On-premise becomes a new word. This last prediction is not really related to analytics (hence the 5 rather than 6 predictions), but I couldn’t resist. People will continue to use the term, “on-premise”, rather than “on-premises” when referring to cloud computing even though it is incorrect. This will continue to drive many people crazy since premise means “a proposition supporting or helping to support a conclusion” (dictionary.com) rather than a singular form of premises. Those of us in the know will finally give up correcting everyone else.
