Deathtrap: Overlearning in Predictive Analytics

I am in the process of gathering survey data for the TDWI Predictive Analytics Best Practices Report.  Right now, I’m in the data analysis phase.  It turns out (not surprisingly) that one of the biggest barriers to adoption of predictive analytics is understanding how the technology works.  Education is definitely needed as more advanced forms of analytics move out to less experienced users.

With regard to education, coincidentally I had the pleasure of speaking to Eric Siegel recently about his book, “Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die” (www.thepredictionbook.com). Eric Siegel is well known in analytics circles.   For those who haven’t read the book, it is a good read.  It is business focused with some great examples of how predictive analytics is being used today.

Eric and I focused our discussion on one of the more technical chapters in the book which addresses the problem known as overfitting (aka overlearning) - an important concept in predictive analytics. Overfitting occurs when a model describes the noise or random error rather than the underlying relationship.  In other words, it occurs when your data fits the model a little too well.   As Eric put it, “Not understanding overfitting in predictive analytics is like driving a car without learning where the brake pedal is. “

While all predictive modeling methods can overlearn, a decision tree is a good technique for intuitively seeing where overlearning can happen.  The decision tree is one of the most popular types of predictive analytics techniques used today.  This is because it is relatively easy to understand – even by the non-statistician – and ease of use is a top priority among end-users and vendors alike.

Here’s a simplified example of a decision tree.  Let’s say that you’re a financial institution that is trying to understand the characteristics of customers who leave (i.e., defect or cancel).  This means that your target variables are leave (yes) and don’t leave (no).  After (hopefully) visualizing or running some descriptive stats to get a sense of the data, and understanding the question being asked, the company puts together what’s called a training set of data into a decision tree program.  The training set is a subset of the overall data set in terms of number of observations.  In this case it might consist of attributes like demographic and personal information about the customer, size of monthly deposits, how long the customer has been with the bank, how long the customer has used online banking, how often they contact the call center, and so on.

Here’s what might come out:

decision tree

The first node of the decision tree is total deposit/month.  This decision tree is saying that if a customer deposits >$4K per month and is using online bill pay for more than two years they are not likely to leave (there would be probabilities associated with this).  However, if they have used online banking for < 2 years and contacted the call center X times, there may be a different outcome.  This makes sense intuitively.  A customer who has been with the bank a long time and is already doing a lot of online bill paying might not want to leave.  Conversely, a customer who isn’t doing a lot of deposits and who has made a lot of calls to the call center, might be having trouble with the online bill pay.  You can see that the tree could branch down and down, each branch with a different probability of an outcome, either yes or no.

Now, here’s the point about overfitting.  You can imagine that this decision tree could branch out bigger and bigger to a point where it could account for every case in the training data, including the noisy ones.  For instance, a rule with a 97% probability might read, “If customer deposits more than $4K a month and has used online bill pay for more than 2 years, and lives in ZYX, and  is greater than 6 feet tall then they will leave.”  As Eric states in his book, “Overlearning is the pitfall of mistaking noise for information, assuming too much about what has been shown in the data.”  If you give the decision tree enough variables, there are going to be spurious predictions.

The way to detect the potential pitfall of overlearning is apply a set of test data to the model.  The test data set is a “hold out” sample.  The idea is to see how well the rules perform with this new data.  In the example above, there is a high probability that the spurious rule above won’t pan out in the test set.

In practice, some software packages will do this work for you.  They will automatically hold out the test sample before supplying you with the results.  The tools will show you the results on the test data.  However, not all do, so it is important to understand this principle.   If you validate your model using hold-out data then overfitting does not have to be a problem.

I want to mention one other point here about noisy data.  With all of the discussion in the media about big data there has been a lot said about people being misled by noisy big data.  As Eric notes, “If you checking 500K variables you’ll have bad luck eventually – you’ll find something spurious. “  However, chances are that this kind of misleading noise is from an individual correlation, not a model.  There is a big difference.  People tend to equate predictive analytics with big data analytics.   The two are not synonymous.

Are there issues with any technique?  Of course.  That’s why education is so important.  However, there is a great deal to be gained from predictive analytics models, as more and more companies are discovering.

For more on the results of the Predictive Analytics BPR see my TDWI blog:

Four Vendor Views on Big Data and Big Data Analytics: IBM

Next in my discussion of big data providers is IBM.   Big data plays right into IBM’s portfolio of solutions in the information management space.  It also dove tails very nicely with the company’s Smarter Planet strategy.  Smarter Planet holds the vision of the world as a more interconnected, instrumented, and intelligent place.  IBM’s Smarter Cities and Smarter Industries are all part of its solutions portfolio.  For companies to be successful in this type of environment requires a new emphasis on big data and big data analytics.

Here’s a quick look at how IBM is positioning around big data, some of its product offerings, and use cases for big data analytics.

IBM

According to IBM, big data has three characteristics.  These are volume, velocity, and variety.   IBM is talking about large volumes of both structured and unstructured data.  This can include audio and video together with text and traditional structured data.  It can be gathered and analyzed in real time.

IBM has both hardware and software products to support both big data and big data analytics.  These products include:

  • Infosphere Streams – a platform that can be used to perform deep analysis of massive volumes of relational and non-relational data types with sub-millisecond response times.   Cognos Real-time Monitoring can also be used with Infosphere Streams for dashboarding capabilities.
  • Infosphere BigInsights – a product that consists of IBM research technologies on top of open source Apache Hadoop.  BigInsights provides core installation, development tools, web-based UIs, connectors for integration, integrated text analytics, and BigSheets for end-user visualization.
  • IBM Netezza – a high capacity appliance that allows companies to analyze pedabytes of data in minutes.
  • Cognos Consumer Insights- Leverages BigInsights and text analytics capabilities to perform social media sentiment analysis.
  • IBM SPSS- IBM’s predictive and advanced analytics platform that can read data from various data sources such as Netezza and be integrated with Infosphere Streams to perform advanced analysis.
  • IBM Content Analytics – uses text analytics to analyze unstructured data.  This can sit on top of Infosphere BigInsights.

At the Information on Demand (IOD) conference a few months ago, IBM and its customers presented many use cases around big data and big data analytics. Here is what some of the early adopters are doing:

  • Engineering:  Analyzing hourly wind data, radiation, heat and 78 other attributes to determine where to locate the next wind power plant.
  • Business:
    • Analyzing social media data, for example to understand what fans are saying about a sports game in real time.
    • Analyzing customer activity at a zoo to understand guest spending habits, likes and dislikes.
  • Analyzing healthcare data:
    • Analyzing streams of data from medical devices in neonatal units.
    •  Healthcare Predictive Analytics.  One hospital is using a product called Content and Predictive analytics to understand limit early hospital discharges which would result in re-admittance to the hospital

IBM is working with its clients and prospects to implement big data initiatives.  These initiatives generally involve a services component given the range of product offerings IBM has in the space and the newness of the market.  IBM is making significant investments in tools, integrated analytic accelerators, and solution accelerators to reduce deployment time and cost to deploy these kinds of solutions.

At IBM, big data is about the “the art of the possible.”   According to the company, price points on products that may have been too expensive five years ago are coming down.  IBM is a good example of a vendor that is both working with customers to push the envelope in terms of what is possible with big data and, at the same time, educating the market about big data.   The company believes that big data can change the way companies do business.  It’s still early in the game, but IBM has a well-articulated vision around big data.  And, the solutions its clients discussed were big, bold, and very exciting.  The company is certainly a leader in this space.

The Inaugural Hurwitz & Associates Predictive Analytics Victory Index is complete!

For more years than I like to admit, I have been focused on the importance of managing data so that it helps companies anticipate changes and therefore be prepared to take proactive action. Therefore, as I watched the market for predictive analytics really emerge I thought it was important to provide customers with a holistic perspective on the value of commercial offerings. I was determined that when I provided this analysis it would be based on real world factors. Therefore, I am delighted to announce the release of the Hurwitz & Associates Victory Index for Predictive Analytics! I’ve been working on this report for a quite some time and I believe that it will be very valuable tool for companies looking to understand predictive analytics and the vendors that play in this market.

Predictive analytics has become a key component of a highly competitive company’s analytics arsenal. Hurwitz & Associates defines predictive analytics as:

A statistical or data mining solution consisting of algorithms and techniques that can be used on both structured and unstructured data (together or individually) to determine future outcomes. It can be deployed for prediction, optimization, forecasting, simulation, and many other uses.

So what is this report all about? The Hurwitz & Associates Victory Index is a market research assessment tool, developed by Hurwitz & Associates that analyzes vendors across four dimensions: Vision, Viability, Validity and Value. Hurwitz & Associates takes a holistic view of the value and benefit of important technologies. We assess not just the technical capability of the technology but its ability to provide tangible value to the business. For the Victory Index we examined more than fifty attributes including: customer satisfaction, value/price, time to value, technical value, breadth and depth of functionality, customer adoption, financial viability, company vitality, strength of intellectual capital, business value, ROI, and clarity and practicality of strategy and vision. We also examine important trends in the predictive analytics market as part of the report and provide detailed overviews of vendor offerings in the space.

Some of the key vendor highlights include:
• Hurwitz & Associates named six vendors as Victors across two categories including SAS, IBM (SPSS), Pegasystems, Pitney Bowes, StatSoft and Angoss.
• Other vendors recognized in the Victory Index include KXEN, Megaputer Intelligence, Rapid-I, Revolution Analytics, SAP, and TIBCO.

Some of the key market findings include:
• Vendors have continued to place an emphasis on improving the technology’s ease of use, making strides towards automating model building capabilities and presenting findings in business context.
• Predictive analytics is no longer relegated to statisticians and mathematicians. The user profile for predictive analytics has shifted dramatically as the ability to leverage data for competitive advantage has placed business analysts in the driver’s seat.
• As companies gather greater volumes of disparate kinds of data, both structured and unstructured, they require solutions that can deliver high performance and scalability.
• The ability to operationalize predictive analytics is growing in importance as companies have come to understand the advantage to incorporating predictive models in their business processes. For example, statisticians at an insurance company might build a model that predicts the likelihood of a claim being fraudulent.

I invite you to find out more about the report by visiting our website: www.hurwitz.com

Informatica announces 9.1 and puts stake in the ground around big data

Earlier this week, Informatica announced the release of the Informatica 9.1 Platform for Big Data. The company joins other data centric vendors such as EMC and IBM by putting its stake in the ground around the hot topic of Big Data. Informatica defines Big Data as, “all data, including both transaction and interaction data, in sets whose size or complexity exceeds the ability of commonly used technologies to capture, manage and process at a reasonable cost and timeframe. Indeed, Informatica ‘s stance is that Big Data is the confluence of the three technology trends including big transaction data, big interaction data and big data processing.” In Informatica parlance the transactional data includes OLTP, OLAP, and data warehouse data; the interaction data might include social media data, call center records, click stream data, and even scientific data like that associated with genomics. Informatica targets native, high performance connectivity and future integration with Hadoop, the Big Data processing platform.

In 9.1 Informatica is providing an updated set of capabilities around self-service, authoritative and trustworthy data (MDM and data quality), data services and data integration. I wanted to focus on the data services here because of the connection to Big Data. Informatica is providing a platform that companies can use to integrate transactional data (at petabyte scale and beyond in volume) and social network data from Facebook, LinkedIn, and Twitter. Additionally, 9.1 provides the capability to move all data into and out of Hadoop in batch or real time using universal connectivity to including mainframe, databases, and applications which can help in managing unstructured data.

So, how will companies utilize this latest release? I recently had the opportunity to speak with Rob Myers, an Informatica customer, who is the manager of BI architecture and data warehousing, MDM, enterprise integration for HealthNow. HealthNow is a BlueCross/BlueShield provider for parts of western New York and the Albany area. The company is expanding geographically and is also providing value added services such as patient portals. It views its mission not simply as a claims processor but as a service provider to healthcare providers and patients. According to Rob, the company is looking to offer other value added services to doctors and patients as part of its competitive strategy. These offerings may include real time claims processing, identifying fraudulent claims, or analytics for healthier outcomes. For example, HealthNow might provide a service where it identifies patients with diabetes and provide proactive services to them to help manage the disease. Or, it might provide physicians with suggestions of tests they might consider for certain patients, given their medical records.

Currently, the company utilizes Informatica PowerCenter and Informatica Data Services for data integration including ETL and data abstraction. HealthNow has one large data warehouse and is currently building out a second. It is exposing data out to a logical model in data services tier. For example, its member portal utilizes data services to enable members to sign in and in real time, integrate 30-40 attributes around each member including demographic information, products, and eligibility for certain services into the portal. In addition, the company’s actuaries, marketing groups, and health services group have been utilizing its data warehouses to perform their own analysis. Rob doesn’t consider the data in these warehouses to be Big Data. Rather they are just sets of relational data. He views Big Data as some of the other data that the company currently has a hard time mining, for example data on social networks and the unstructured data in claims and medical notes. The company is in the beginning phase of determining how to gather and parse through this text and expose it in a way that it can be analyzed. For example, the company is interested in utilizing the data that they already have together with unstructured data and providing predictive analytics to its community. HealthNow is exploring Hadoop data stores as part of this plan and is excited about the direction that Informatica is moving. It views Informatica as the middleware that can get the trusted data out of the various silos and integrated in a way that it can then be analyzed or used in other value-added services.

It is certainly interesting to see what end-users have in mind for Big Data and, for that matter, how they define Big Data. Rob clearly views Big Data as high volume and disparate in nature (i.e. including structured and unstructured data). There seems to be a time dimension to it. He also made the point that its not just about having Big Data, it’s about doing something that he couldn’t do before with it – like processing and analyzing it. This is an important point that vendors and end-users are starting to pick up on. If Big Data were simply about volume of different kinds of data, then it would be a moving target. Really, an important aspect of Big Data is about is being able to perform activities on the data that weren’t possible before. I am glad to companies thinking about their use cases for Big Data and vendors such as Informatica putting a stake in the ground around the subject.

Five Analytics Predictions for 2011

In 2011 analytics will take center stage as a key trend because companies are at a tipping point with the volume of data they have and their urgent need to do something about it. So, with 2010 now past and 2011 to look forward to, I wanted to take the opportunity to submit my predictions (no pun intended) regarding the analytics and advanced analytics market.

Advanced Analytics gains more steam. Advanced Analytics was hot last year and will remain so in 2011. Growth will come from at least three different sources. First, advanced analytics will increase its footprint in large enterprises. A number of predictive and advanced analytics vendors tried to make their tools easier to use in 2009-2010. In 2011 expect new users in companies already deploying the technology to come on board. Second, more companies will begin to purchase the technology because they see it as a way to increase top line revenue while gaining deeper insights about their customers. Finally, small and mid sized companies will get into the act, looking for lower cost and user -friendly tools.
Social Media Monitoring Shake Out. The social media monitoring and analysis market is one crowded and confused space, with close to 200 vendors competing across no cost, low cost, and enterprise-cost solution classes. Expect 2011 to be a year of folding and consolidation with at least a third of these companies tanking. Before this happens, expect new entrants to the market for low cost social media monitoring platforms and everyone screaming for attention.
Discovery Rules. Text Analytics will become a main stream technology as more companies begin to finally understand the difference between simply searching information and actually discovering insight. Part of this will be due to the impact of social media monitoring services that utilize text analytics to discover, rather than simply search social media to find topics and patterns in unstructured data. However, innovative companies will continue to build text analytics solutions to do more than just analyze social media.
Sentiment Analysis is Supplanted by other Measures. Building on prediction #3, by the end of 2011 sentiment analysis won’t be the be all and end all of social media monitoring. Yes, it is important, but the reality is that most low cost social media monitoring vendors don’t do it well. They may tell you that they get 75-80% accuracy, but it ain’t so. In fact, it is probably more like 30-40%. After many users have gotten burned by not questioning sentiment scores, they will begin to look for other meaningful measures.
Data in the cloud continues to expand as well as BI SaaS. Expect there to still be a lot of discussion around data in the cloud. However, business analytics vendors will continue to launch SaaS BI solutions and companies will continue to buy the solutions, especially small and mid sized companies that find the SaaS model a good alternative to some pricey enterprise solutions. Expect to see at least ten more vendors enter the market.

On-premise becomes a new word. This last prediction is not really related to analytics (hence the 5 rather than 6 predictions), but I couldn’t resist. People will continue to use the term, “on-premise”, rather than “on-premises” when referring to cloud computing even though it is incorrect. This will continue to drive many people crazy since premise means “a proposition supporting or helping to support a conclusion” (dictionary.com) rather than a singular form of premises. Those of us in the know will finally give up correcting everyone else.

Who is using advanced analytics?

Advanced analytics is currently a hot topic among businesses, but who is actually using it and why? What are the challenges and benefits to those companies that are using advanced analytics? And, what is keeping some companies from exploring this technology?

Hurwitz & Associates would like your help in answering a short (5 min) survey on advanced analytics. We are interested in understanding what your company’s plans are for advanced analytics. If you’re not planning to use advanced analytics, we’d like to know why. If you’re already using advanced analytics we’d like to understand your experience.

If you participate in this survey we would be happy to send you a report of our findings. Simply provide us your email address at the end of the survey! Thanks!

Here is the link to the survey:
Click here to take survey

Five Predictions for Advanced Analytics in 2010

With 2010 now upon us, I wanted to take the opportunity to talk about five advanced analytics technology trends that will take flight this year.  Some of these are up in the clouds, some down to earth.

  • Text Analytics:  Analyzing unstructured text will continue to be a hot area for companies. Vendors in this space have weathered the economic crisis well and the technology is positioned to do even better once a recovery begins.  Social media analysis really took off in 2009 and a number of text analytics vendors, such as Attensity and Clarabridge, have already partnered with online providers to offer this service. Those that haven’t will do so this year.  Additionally, numerous “listening post” services, dealing with brand image and voice of the customer have also sprung up. However, while voice of the customer has been a hot area and will continue to be, I think other application areas such as competitive intelligence will also gain momentum.  There is a lot of data out on the Internet that can be used to gain insight about markets, trends, and competitors.
  • Predictive Analytics Model Building:  In 2009, there was a lot of buzz about predictive analytics.  For example, IBM bought SPSS and other vendors, such as SAS and Megaputer, also beefed up offerings.  A newish development that will continue to gain steam is predictive analytics in the cloud.  For example, vendors Aha! software and Clario are providing predictive capabilities to users in a cloud-based model.  While different in approach they both speak to the trend that predictive analytics will be hot in 2010.
  • Operationalizing Predictive Analytics:  While not every company can or may want to build a predictive model, there are certainly a lot of uses for operationalizing predictive models as part of a business process.  Forward looking companies are already using this as part of the call center process, in fraud analysis, and churn analysis, to name a few use cases.  The momentum will continue to build making advanced analytics more pervasive.
  • Advanced Analytics in the Cloud:  speaking of putting predictive models in the cloud, business analytics in general will continue to move to the cloud for mid market companies and others that deem it valuable.  Companies such as QlikTech introduced a cloud-based service in 2009.  There are also a number of pure play SaaS vendors out there, like GoodData and others that provide cloud-based services in this space.  Expect to hear more about this in 2010.
  • Analyzing complex data streams.  A number of forward-looking companies with large amounts of real-time data (such as RFID or financial data) are already investing in analyzing these data streams.   Some are using the on-demand capacity of cloud based model to do this.  Expect this trend to continue in 2010.
Follow

Get every new post delivered to your Inbox.

Join 1,190 other followers