Six skills for predictive analytics

Today, I participated in a webinar with Actuate on the skills needed for business analysts to perform predictive modeling.  This is a hot topic and there were hundreds of participants on the call.  In my part of the presentation, I outlined some major trends in predictive analytics (including the fact that the tools are much easier to use) as well as six skills that I thought were important for business analysts building predictive models.  I grouped them into two buckets.  One was the skills needed to frame a problem.  The other was the skills needed to explain and defend an analysis.  These skills were:

  • Critical thinking
  • Domain expertise
  • Data sense
  • Understanding the tools
  • Some level of understanding of the techniques
  • Storytelling ability

I’m sure there are more than these six.  However, what was interesting was that we got a lot of questions from the audience around these skills –  thinking that the message of the webinar was that you don’t need to be quantitative to perform predictive analytics. We got questions about overfitting and other technical considerations in predictive analytics.  I think some people thought that we were advocating the complete dumbing down of predictive analytics and that anyone off the street could build a predictive model.

My point in the Q&A around this was as follows:  Statisticians and data scientists are a scarce resource.  I believe that there are some kinds of predictive analytics that business analysts can perform, hence freeing up the big guns for the more complex work.  I still think that business analysts should be trained in the tools and techniques so they can use them to their fullest and be able to defend their analysis.

Any thoughts?  To hear more about these skills and predictive analytics, register for the webinar to view the archived version!

 

Three entry points for big data initiatives

The TDWI Big Data Maturity Model and Assessment is set to launch November 20th.  Krish Krishnan and I have been working on this for a while, and we’re very excited about it.  There are two parts to the Big Data Maturity Model and Assessment tool. The first is the actual TDWI Big Data Maturity Model Guide. This is a guide that walks you through the actual stages of maturity for big data initiatives and provides examples and characteristics of companies at different stages of maturity. In each of these stages, we look across various dimensions that are necessary for maturity. These include organizational issues, infrastructure, data management, analytics, and governance.

The second piece is the assessment tool. The tool allows respondents to answer a series of about 75 questions in the organization, infrastructure, data management, analytics, and governance dimensions. Once complete, the respondent receives a score in each dimension as well as some expectations and best practices for moving forward. A unique feature of the assessment is that respondents can actually look to see how their scores compare against their peers, by both industry and company size.

We urge you to take the assessment and see where you land relative to your peers regarding your big data efforts.  We know that many companies are in the early stages of their big data journey, so we view this assessment as evolutionary: you can come back and take it more than once. In addition, we will be adding best practices as we learn more about what companies are doing to succeed in their big data efforts.

In the course of our research for the model, Krish and I spoke to numerous companies embarking on big data.  There were a number of patterns that emerged regarding how companies get started in their big data efforts.   Here are a few of them:

  1. Large volumes of structured data are already being analyzed in the company.  Some companies have amassed large volumes (i.e., terabytes) of structured data that they are storing in their data warehouse or in some sort of appliance, often on-premises.  They feel that their BI infrastructure is pretty solid.  Typically, the BI effort is departmental in scope.  Some of these companies are already performing more advanced kinds of analysis, such as predictive analytics, on the data.  Often, they are doing this to understand their customers.  The vision for big data is about augmenting the data they have with other forms of data (often text or geospatial data) to gain more insight.
  2. A specific need for big data. Some companies start a big data effort, almost from scratch, because of a specific business need.  For instance, a wireless provider might be interested in monitoring the network and then predicting where failures will occur.   An insurance company might be interested in telemetric information in order to determine pricing for certain kinds of drivers.  A marketing department might be interested in analyzing  social media data to determine brand reputation or as part of a marketing campaign. Typically these efforts are departmental in scope and are not part of a wider enterprise big data ecosystem.
  3. Building the business on big data.  We spoke to many e-businesses that were building their business model on big data.  While these companies might be somewhat advanced in terms of infrastructure to support big data, often they were still working on the analytics related to the service and typically did not have any form of governance in place.

Deathtrap: Overlearning in Predictive Analytics

I am in the process of gathering survey data for the TDWI Predictive Analytics Best Practices Report.  Right now, I’m in the data analysis phase.  It turns out (not surprisingly) that one of the biggest barriers to adoption of predictive analytics is understanding how the technology works.  Education is definitely needed as more advanced forms of analytics move out to less experienced users.

With regard to education, coincidentally I had the pleasure of speaking to Eric Siegel recently about his book, “Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die” (www.thepredictionbook.com). Eric Siegel is well known in analytics circles.   For those who haven’t read the book, it is a good read.  It is business focused with some great examples of how predictive analytics is being used today.

Eric and I focused our discussion on one of the more technical chapters in the book, which addresses the problem known as overfitting (aka overlearning) - an important concept in predictive analytics. Overfitting occurs when a model describes the noise or random error rather than the underlying relationship.  In other words, it occurs when your model fits the training data a little too well.   As Eric put it, “Not understanding overfitting in predictive analytics is like driving a car without learning where the brake pedal is.”

While all predictive modeling methods can overlearn, a decision tree is a good technique for intuitively seeing where overlearning can happen.  The decision tree is one of the most popular types of predictive analytics techniques used today.  This is because it is relatively easy to understand – even by the non-statistician – and ease of use is a top priority among end-users and vendors alike.

Here’s a simplified example of a decision tree.  Let’s say that you’re a financial institution that is trying to understand the characteristics of customers who leave (i.e., defect or cancel).  This means that your target variable has two values: leave (yes) and don’t leave (no).  After (hopefully) visualizing or running some descriptive stats to get a sense of the data, and understanding the question being asked, the company feeds what’s called a training set of data into a decision tree program.  The training set is a subset of the overall data set in terms of number of observations.  In this case it might consist of attributes like demographic and personal information about the customer, size of monthly deposits, how long the customer has been with the bank, how long the customer has used online banking, how often they contact the call center, and so on.

Here’s what might come out:

[Figure: decision tree]

The first node of the decision tree is total deposit/month.  This decision tree is saying that if a customer deposits >$4K per month and has been using online bill pay for more than two years, they are not likely to leave (there would be probabilities associated with this).  However, if they have used online banking for < 2 years and contacted the call center X times, there may be a different outcome.  This makes sense intuitively.  A customer who has been with the bank a long time and is already doing a lot of online bill paying might not want to leave.  Conversely, a customer who isn’t doing a lot of deposits and who has made a lot of calls to the call center might be having trouble with the online bill pay.  You can see that the tree could branch down and down, each branch with a different probability of an outcome, either yes or no.
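To make the branching concrete, a tree like this can be read as nested if/then rules. Here is a minimal Python sketch of that reading. It is not the output of any real tool: the function name `predict_churn`, the `contact_threshold` default (a stand-in for the unspecified "X times"), and the whole lower-deposit branch are assumptions added for illustration, and a real tree would attach a probability to each leaf rather than a bare label.

```python
def predict_churn(monthly_deposit, years_online_bill_pay, call_center_contacts,
                  contact_threshold=5):
    """Walk an illustrative churn tree top-down and return 'stay' or 'leave'.

    contact_threshold is a hypothetical stand-in for the "X times" in the text.
    """
    if monthly_deposit > 4000:                 # root node: total deposit/month
        if years_online_bill_pay > 2:          # long-time online bill pay user
            return "stay"
        if call_center_contacts >= contact_threshold:
            return "leave"                     # newer user who calls a lot
        return "stay"
    # Lower-deposit branch: assumed here, the example does not spell it out.
    if call_center_contacts >= contact_threshold:
        return "leave"
    return "stay"


print(predict_churn(5000, 3, 0))   # high deposits, long-time bill pay user
print(predict_churn(1000, 1, 9))   # low deposits, many call center contacts
```

Each extra attribute adds another level of nesting, which is exactly how a tree can keep growing until it has a rule for nearly every individual customer.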

Now, here’s the point about overfitting.  You can imagine that this decision tree could branch out bigger and bigger to a point where it could account for every case in the training data, including the noisy ones.  For instance, a rule with a 97% probability might read, “If customer deposits more than $4K a month and has used online bill pay for more than 2 years, and lives in ZYX, and is taller than 6 feet, then they will leave.”  As Eric states in his book, “Overlearning is the pitfall of mistaking noise for information, assuming too much about what has been shown in the data.”  If you give the decision tree enough variables, there are going to be spurious predictions.

The way to detect the potential pitfall of overlearning is to apply a set of test data to the model.  The test data set is a “hold out” sample.  The idea is to see how well the rules perform with this new data.  In the example above, there is a high probability that the spurious rule won’t pan out in the test set.

In practice, some software packages will do this work for you.  They will automatically hold out the test sample before supplying you with the results.  The tools will show you the results on the test data.  However, not all do, so it is important to understand this principle.   If you validate your model using hold-out data then overfitting does not have to be a problem.
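For readers who want to see the hold-out check in action, here is a small, self-contained Python sketch. Everything in it is an assumption made for illustration: the data are synthetic (leaving depends only on monthly deposit, plus randomness, and height is pure noise), and the tiny tree grower is a simplified stand-in for what a real decision tree package does. It grows one tree with a single split and one tree that keeps branching until it explains every training case, then scores both on training data and on a hold-out sample.

```python
import random

random.seed(0)

def make_customers(n):
    """Synthetic rows: ((monthly_deposit, years_online, height_cm), left?)."""
    rows = []
    for _ in range(n):
        deposit = random.uniform(0, 8000)
        years = random.uniform(0, 10)
        height = random.uniform(150, 200)      # pure noise, unrelated to leaving
        p_leave = 0.2 if deposit > 4000 else 0.6
        rows.append(((deposit, years, height), random.random() < p_leave))
    return rows

def gini(rows):
    """Gini impurity of the leave/stay labels in rows."""
    p = sum(label for _, label in rows) / len(rows)
    return 2 * p * (1 - p)

def majority(rows):
    """Most common label in rows (ties count as 'leave')."""
    return sum(label for _, label in rows) * 2 >= len(rows)

def grow(rows, max_depth):
    """Grow a tree of nested (feature, threshold, left, right) tuples."""
    if max_depth == 0 or gini(rows) == 0.0:
        return majority(rows)
    best = None
    for f in range(3):
        values = sorted({x[f] for x, _ in rows})
        for t in values[:-1]:                  # split as "x[f] <= t"
            left = [r for r in rows if r[0][f] <= t]
            right = [r for r in rows if r[0][f] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:                           # no feature can separate the rows
        return majority(rows)
    _, f, t, left, right = best
    return (f, t, grow(left, max_depth - 1), grow(right, max_depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):             # descend until we hit a leaf
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def accuracy(tree, rows):
    return sum(predict(tree, x) == label for x, label in rows) / len(rows)

train, test = make_customers(200), make_customers(200)   # test = hold-out sample
shallow = grow(train, max_depth=1)
deep = grow(train, max_depth=len(train))                  # effectively unlimited

for name, tree in (("one split", shallow), ("fully grown", deep)):
    print(f"{name:12s} train={accuracy(tree, train):.2f} "
          f"test={accuracy(tree, test):.2f}")
```

The fully grown tree scores perfectly on the data it was trained on because it has memorized the noise, and that score does not carry over to the hold-out sample. The gap between the two numbers is the warning sign to look for.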

I want to mention one other point here about noisy data.  With all of the discussion in the media about big data there has been a lot said about people being misled by noisy big data.  As Eric notes, “If you’re checking 500K variables you’ll have bad luck eventually – you’ll find something spurious.”  However, chances are that this kind of misleading noise is from an individual correlation, not a model.  There is a big difference.  People tend to equate predictive analytics with big data analytics.   The two are not synonymous.
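A quick simulation shows why checking enough variables eventually turns up something spurious. This is only a sketch under stated assumptions: the data are random numbers, the counts (`N_OBS = 20` observations, `N_VARS = 2000` candidate variables) are scaled far down from 500K so it runs in a moment, and `pearson` is a hand-written helper, not a library call.

```python
import math
import random

random.seed(1)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

N_OBS, N_VARS = 20, 2000                       # illustrative sizes only

# A target and a pile of candidate variables, all independent random noise.
target = [random.gauss(0, 1) for _ in range(N_OBS)]
noise_vars = [[random.gauss(0, 1) for _ in range(N_OBS)] for _ in range(N_VARS)]

correlations = [pearson(v, target) for v in noise_vars]
best = max(range(N_VARS), key=lambda i: abs(correlations[i]))
strongest = abs(correlations[best])

# Fresh draws play the role of new data for that same "winning" variable.
fresh_target = [random.gauss(0, 1) for _ in range(N_OBS)]
fresh_var = [random.gauss(0, 1) for _ in range(N_OBS)]
fresh = abs(pearson(fresh_var, fresh_target))

print(f"strongest |correlation| among {N_VARS} noise variables: {strongest:.2f}")
print(f"|correlation| for that variable on fresh data: {fresh:.2f}")
```

None of these variables has any real relationship to the target, yet the best of them looks like a strong finding. That is a hazard of scanning individual correlations, which is a different exercise from building a model and validating it on hold-out data.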

Are there issues with any technique?  Of course.  That’s why education is so important.  However, there is a great deal to be gained from predictive analytics models, as more and more companies are discovering.

For more on the results of the Predictive Analytics BPR see my TDWI blog:

Big Data’s Future/Big Data’s Past

I just listened to an interesting IBM Google hangout about big data called Visions of Big Data’s future.  You can watch it here.  There were some great experts on the line including James Kobielus (IBM), Thomas Deutsch (IBM), and Edd Dumbill (Silicon Valley Data Science).

The moderator, David Pittman, asked a fantastic question, “What’s taking longer than you expect in big data?”  It brought me back to 1992 (ok, I’m dating myself) when I used to work at AT&T Bell Laboratories.  At that time, I was working in what might today be called an analytics Center of Excellence.  The group was composed of all kinds of quantitative scientists (economists, statisticians, physicists) as well as computer scientists and other IT-like people. I think the group was called something like the Marketing Models, Systems, and Analysis department.

I had been working with members of Bell Labs Research to take some of the machine learning algorithms they were developing and apply them to our marketing data for analytics like churn analysis.  At that time, I proposed the formation of a group that would consist of market analysts and developers, working together with researchers and some computer scientists.  The idea was to provide continuous innovation around analysis.  I found the proposal today (I’m still sneezing from the dust).  Here is a sentence from it:

[Image: big data from 1992]

Managing and analyzing large amounts of data?  At that point we were even thinking about call detail records.  It goes on to say, “Specifically the group will utilize two software technologies that will help to extract knowledge from databases:  data mining and data archeology.”  The data archeology piece referred to:

[Image: Data discovery 1992]

This exploration of the data is similar to what is termed discovery today.  Here’s a link to the paper that came out of this work.   Interestingly, around this time I also remember going to talk to some people who were developing NLP algorithms for analyzing text.  I remember thinking that the “why” behind customer churn could be found in those call center notes.

I thought about this when I heard the moderator’s question, not because the group I was proposing would have been ahead of its time (let’s face it, AT&T was way ahead of its time with its Center of Excellence in analysis in the first place), but because it’s taken so long to get from there to here, and we’re not even here or there yet.

Five Trends in Predictive Analytics

Predictive analytics, a technology that has been around for decades, has gotten a lot of attention over the past few years, and for good reason.  Companies understand that looking in the rear-view mirror is not enough to remain competitive in the current economy.  Today, adoption of predictive analytics is increasing for a number of reasons, including a better understanding of the value of the technology, the availability of compute power, and the expanding toolset to make it happen. In fact, in a recent TDWI survey at our Chicago World Conference earlier this month, more than 50% of the respondents said that they planned to use predictive analytics in their organization over the next three years. The techniques for predictive analytics are being used on traditional data sets as well as on big data.

Here are five trends that I’m seeing in predictive analytics:

  • Ease of use.  Whereas in the past, statisticians used some sort of scripting language to build a predictive model, vendors are now making their software easier to use.  This includes hiding the complexity of the model building process and the data preparation process via the user interface.  This is not an entirely new trend but it is worth mentioning because it opens up predictive analytics to a wider audience such as marketing.  For example, vendors such as Pitney Bowes, Pegasystems, and KXEN provide solutions targeted to marketing professionals with ease of use as a primary feature.  The caveat here, of course, is that marketers still need the skills and judgment to make sure the software is used properly.
  • For more trends: http://tdwi.org/blogs/fern-halper/list/ferns-blog.aspx

Two Big Data Resources Worth Exploring

It’s a good day.  Our new book, Big Data for Dummies, is being released today and I’m busy working on a Big Data Analytics maturity model at TDWI with Krish Krishnan.  Krish, a faculty member at TDWI, is actually presenting some of the model at the TDWI World Conference:  Big Data Tipping Point taking place during the first week of May (see sidebar).  I would encourage people to attend, even if you aren’t that far along in your big data deployments.  TDWI has terrific courses in all aspects of information management and we understand that most companies will need to leverage their existing infrastructure to support big data initiatives.  In fact the title of this World conference is, “Preparing for the Practical Realities of Big Data.”   Check it out.

Back to the book.  Here’s a look at the Introduction!  Enjoy!

 

Two Weeks and Counting to Big Data for Dummies

I am excited to announce I’m a co-author of Big Data for Dummies which will be released in mid-April 2013.  Here’s the synopsis from Wiley:

Find the right big data solution for your business or organization

Big data management is one of the major challenges facing business, industry, and not-for-profit organizations. Data sets such as customer transactions for a mega-retailer, weather patterns monitored by meteorologists, or social network activity can quickly outpace the capacity of traditional data management tools. If you need to develop or manage big data solutions, you’ll appreciate how these four experts define, explain, and guide you through this new and often confusing concept. You’ll learn what it is, why it matters, and how to choose and implement solutions that work.

  • Effectively managing big data is an issue of growing importance to businesses, not-for-profit organizations, government, and IT professionals
  • Authors are experts in information management, big data, and a variety of solutions
  • Explains big data in detail and discusses how to select and implement a solution, security concerns to consider, data storage and presentation issues, analytics, and much more
  • Provides essential information in a no-nonsense, easy-to-understand style that is empowering

 

Big Data For Dummies cuts through the confusion and helps you take charge of big data solutions for your organization.
