Analyzing Big Data

The term “Big Data” has gained popularity over the past 12-24 months as a) the amount of data available to companies continually increases and b) technologies have emerged to manage this data more effectively. Of course, large volumes of data have been around for a long time. For example, I worked in the telecommunications industry for many years analyzing customer behavior. This required analyzing call records. The problem was that the technology (particularly the infrastructure) couldn’t necessarily support this kind of compute-intensive analysis, so we often analyzed billing records rather than streams of call detail records, or sampled the records instead.

Now companies are looking to analyze everything from the genome to Radio Frequency ID (RFID) tags to business event streams. And newer technologies have emerged to handle massive (TB and PB) quantities of data more effectively. Often this processing takes place on clusters of computers, with the work distributed across many machines. The advent of cloud computing, and the elastic nature of the cloud, has furthered this movement.

A number of frameworks have also emerged to deal with large-scale data processing and support large-scale distributed computing. These include MapReduce and Hadoop:

  • MapReduce is a software framework introduced by Google to support distributed computing on large sets of data. It is designed to take advantage of cloud resources. The computing is done across large clusters of machines, where each machine is referred to as a node. MapReduce can deal with both structured and unstructured data. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges those intermediate pairs (see the sketch after this list).
  • Apache Hadoop is an open source distributed computing platform that is written in Java and inspired by MapReduce. Data is stored over many machines in blocks that are replicated to other servers. Intermediate key/value pairs are typically partitioned across reducers using a hash of the key, so that records with the same key end up together. The output of a map step is a set of organized key/value pairs that can be written to a table, to memory, or to a temporary file to be analyzed.
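
To make the map and reduce steps concrete, here is a minimal word-count sketch in the Hadoop Streaming style (my own illustration, not taken from either project’s documentation; the file names are invented). The mapper emits a key/value pair for every word it reads, and the reducer sums the counts for each key; Hadoop’s job is to distribute these two small programs across the nodes and route each key to a single reducer.

    # mapper.py - emit ("word", 1) for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - input arrives grouped/sorted by key; sum the counts per word
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The logic can be tested locally with Unix pipes (cat input.txt | python mapper.py | sort | python reducer.py); the same pair of scripts can then be handed to Hadoop Streaming to run across a cluster.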

But what about tools to actually analyze this massive amount of data?

Datameer

I recently had a very interesting conversation with the folks at Datameer. Datameer was formed in 2009 to provide business users with a way to analyze massive amounts of data. The idea is straightforward: provide a platform to collect and read different kinds of large data stores, put the data into a Hadoop framework, and then provide tools for analysis of this data. In other words, hide the complexity of Hadoop and provide analysis tools on top of it. The folks at Datameer believe their solution is particularly useful for data greater than 10 TB, where a company may have hit a cost wall using traditional technologies but where a business user might want to analyze some kind of behavior. So website activity, CRM systems, phone records, and POS data might all be candidates for analysis. Datameer provides 164 functions (e.g. group, average, median) for business users, with APIs to target more specific requirements.

For example, suppose you’re in marketing at a wireless service provider and you offered a “free minutes” promotion. You want to analyze the call detail records of those customers who made use of the promotion to get a feel for how customers would use cell service if given unlimited minutes. The chart below shows the call detail records from one particular day of the promotion – July 11th. The chart shows the phone number (MDN) as well as the time the call started and stopped and the duration of the call in milliseconds. Note that the data appear under the “Analytics” tab. The “Data” tab provides tools to read different data sources into Hadoop.

This is just a snapshot – there may be TB of data from that day. So, what about analyzing this data? The chart below illustrates a simple analysis of the longest calls and the phone numbers those calls came from. It also illustrates basic statistics about all of the calls on that day – the average, median, and maximum call duration.
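
As a rough illustration of this kind of aggregation (generic pandas code, not Datameer’s product; the file and column names are invented for the example), here is how the longest calls and the basic duration statistics might be computed once the call detail records are in tabular form:

    import pandas as pd

    # Hypothetical call detail records for July 11th: phone number (MDN),
    # call start/end timestamps, and call duration in milliseconds
    cdrs = pd.read_csv("call_detail_records_2010_07_11.csv",
                       parse_dates=["call_start", "call_end"])

    # The ten longest calls and the numbers they came from
    longest = cdrs.nlargest(10, "duration_ms")[["mdn", "duration_ms"]]

    # Basic statistics across all calls that day
    stats = cdrs["duration_ms"].agg(["mean", "median", "max"])

    print(longest)
    print(stats)

A product like Datameer wraps these kinds of operations in its built-in functions and runs them on Hadoop, so the same analysis scales past what a single machine can hold.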

From this brief example, you can start to visualize the kind of analysis that is possible with Datameer.

Note too that since Datameer runs on top of Hadoop, it can deal with unstructured as well as structured data. The company has some solutions in the unstructured realm (such as basic analysis of Twitter feeds), and is working to provide more sophisticated tools. Datameer offers its software either on a SaaS license or on premises.

In the Cloud?

Not surprisingly, early adopters of the technology are using it in a private cloud model. This makes sense, since companies often want to keep control of their own data. Some of these companies already have Hadoop clusters in place and are looking for analytics capabilities for business use. Others are dealing with big data but have not yet adopted Hadoop. They are looking at a complete “big data BI” type of solution.

So, will there come a day when business users can analyze massive amounts of data without having to drag IT entirely into the picture? Utilizing BI adoption as a model, the folks from Datameer hope so. I’m interested in any thoughts readers might have on this topic!

Social Network Analysis: What is it and why should we care?

When most people think of social networks they think of Facebook and Twitter, but social network analysis has its roots in psychology, sociology, anthropology, and math (see John Scott, Social Network Analysis, for more details). The phrase has a number of different definitions, depending on the discipline you’re interested in, but for the purposes of this discussion social network analysis can be used to understand the patterns of how individuals interact.

I had a very interesting conversation with the folks from SAS last week about social network analysis. SAS has a sophisticated social network analysis solution that draws upon its analytics arsenal to solve some very important problems. These include discovering banking or insurance fraud rings, identifying tax evasion, social services fraud, and health care fraud, to name a few. These are huge issues. For example, the 2009 ABA Deposit Account Fraud Survey found that eight out of ten banks reported having check fraud losses in 2008. A new report by the National Insurance Crime Bureau (NICB) shows an increase in claims related to “opportunistic fraud,” possibly due to the economic downturn. These include workers’ compensation claims and staged and caused accidents.

Whereas some companies (and there are a number of them in this space) use mostly rules (e.g. If the transaction is out of the country, flag it) to identify potential fraud, SAS utilizes a hybrid approach that can also include:

  • Anomalies: e.g., the number of unsecured loans exceeds the norm
  • Patterns: using predictive models to understand account opening and closing patterns
  • Social link analysis: e.g., to identify transactions to suspicious counterparties

Consider the following fraud ring:

  • Robert Madden shares a phone number with Eric Sully, and their accounts have been shut down
  • Robert Madden also shares an address with Chris Clark
  • Chris Clark shares a phone with Sue Clark, and she still has open accounts
  • Sue Clark and Eric Sully also share an address with Joseph Sullins, who has open accounts and who is soft-matched to Joe Sullins, who has many open accounts and has been doing a lot of cash cycling between them.

This is depicted in the ring of fraud that the SAS software found, which is shown above.   The dark accounts indicate accounts that have been closed.  Joe Sullins represents a new burst of accounts that should be investigated.
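
A toy sketch of the link analysis behind an example like this (my own simplification using the open source networkx library, not SAS’s software; the phone numbers and addresses are invented): treat people and shared identifiers as nodes, shared phone numbers and addresses as the links between them, and pull out the connected component around a known suspicious account.

    import networkx as nx

    G = nx.Graph()

    # Entity-link edges: person -- shared identifier (phone number or address)
    G.add_edges_from([
        ("Robert Madden", "phone:555-0101"), ("Eric Sully", "phone:555-0101"),
        ("Robert Madden", "addr:12 Elm St"), ("Chris Clark", "addr:12 Elm St"),
        ("Chris Clark", "phone:555-0102"),   ("Sue Clark", "phone:555-0102"),
        ("Sue Clark", "addr:9 Oak Ave"),     ("Eric Sully", "addr:9 Oak Ave"),
        ("Joseph Sullins", "addr:9 Oak Ave"),
    ])

    # Everyone linked, directly or indirectly, to an account that was shut down
    ring = nx.node_connected_component(G, "Robert Madden")
    people = sorted(n for n in ring if not n.startswith(("phone:", "addr:")))
    print(people)  # candidate fraud ring to hand to investigators

A production solution like the one described here also layers anomaly detection, predictive models, and soft matching of similar names on top of this kind of graph, but the connected-component idea is at the heart of surfacing a ring.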

The SAS solution accepts input from many sources (including text, where it can use text mining to extract information from, say, a claim). The strength of the solution is in its ability to take data from many sources and in the depth of its analytical capability.

Why is this important?

Many companies set up investigation units to look into potential fraud. However, oftentimes there are large numbers of false positives (i.e. cases that are flagged as potential fraud but aren’t), which cost the company a lot of money to investigate. Just think about how many times you’ve been called by your credit card company when you’ve made a big purchase or traveled out of the country and forgot to tell them, and you can understand the dollars wasted on false positives. This cost, of course, pales in comparison to the billions of dollars lost each year to fraud. Social network analysis, especially using more sophisticated analytics, can be used to find previously undetected fraud rings.

Of course, social network analysis has other use cases besides fraud detection. SAS uses social network analysis as part of its Fraud Framework, but it is expanding its vision to include customer churn and viral marketing (i.e. understanding how customers are related to each other). Other use cases include terrorism and crime prevention, company organizational analysis, as well as various kinds of marketing applications such as finding key opinion leaders.

Social network analysis for marketing is an area I expect to see more action in the near term, although people will need to be educated about social networks, the difference between social network analysis and social media analysis (as well as where they overlap) and the value of the use cases.  There seems to be some confusion in the market, but that is the subject of another blog.

My Take on the SAS Analyst Conference

I just got back from the SAS analyst event that was held in Steamboat Springs, Colorado.   It was a great meeting.  Here are some of the themes I heard over the few days I was there:

SAS is a unique place to work.

Consider the following: SAS revenue per employee is somewhat lower than the software industry average because everyone is on the payroll. That’s right. Everyone from the groundskeepers to the health clinic professionals to those involved in advertising is on the SAS payroll. The company treats its employees very well, providing fitness facilities and on-site day care (also staffed by employees on the payroll). You don’t even have to buy your own coffee or soda! The company has found that these kinds of perks have a positive impact. SAS announced no layoffs in 2009, and this further increased morale and productivity. The company actually saw increased profits in 2009. Executives from SAS also made the point that even though they might do their own advertising, etc., they do not want to be insular. The company knows it needs new blood and new ideas. On that note, check out the next two themes:

Innovation is very important to SAS.

Here are some examples:

  • Dr. Goodnight gave his presentation using the latest version of the SAS BI dashboard, which looked pretty slick.
  • SAS has recently introduced some very innovative products and the trend will continue. One example is its social network analysis product that has been doing very well in the market.  The product analyzes social networks and can, for example, uncover groups of people working together to commit fraud.  This product was able to find $32M in welfare fraud in several weeks.
  • SAS continues to enhance its UI, which it has been beaten up about in the past. We also got pre-briefed on some new product announcements that I can’t talk about yet, but other analysts did tweet about them at the conference. There were a lot of tweets at this conference, and they were analyzed in real time.

The partnership with Accenture is a meaningful one.

SAS execs stated that although they may not have that many partnerships, they try to make the ones they have very real. While, on the surface, the recent announcement regarding the Accenture SAS Analytics Group might seem like a me-too after IBM BAO, it is actually different. Accenture’s goal is to transform the front office, much as ERP/CRM did in an earlier wave. It wants to “take the what and turn it into so what and now what?” It views analytics not simply as a technology, but as a new competitive management science that enables agility. It obviously won’t market it that way, as the company takes a business focus. Look for the Accenture SAS Analytics Group to put out services such as churn management as a service and risk and fraud detection as a service. They will operationalize these as part of a business process.

The Cloud!

SAS has a number of SaaS offerings in the market and will, no doubt, introduce more. What I found refreshing was that SAS takes issues around SaaS very seriously. You’d expect a data company to be concerned about its customers’ data, and it is.

Best line of the conference

SAS is putting a lot of effort into making its products easier to use and that is a good thing.  There are ways to get analysis to those people who aren’t that analytical.  In a discussion about the skill level required for people to use advanced analytics, however, one customer commented, “Just because you can turn on a stove doesn’t mean you know how to cook.”  More on this in another post.

Five Predictions for Advanced Analytics in 2010

With 2010 now upon us, I wanted to take the opportunity to talk about five advanced analytics technology trends that will take flight this year.  Some of these are up in the clouds, some down to earth.

  • Text Analytics: Analyzing unstructured text will continue to be a hot area for companies. Vendors in this space have weathered the economic crisis well, and the technology is positioned to do even better once a recovery begins. Social media analysis really took off in 2009, and a number of text analytics vendors, such as Attensity and Clarabridge, have already partnered with online providers to offer this service. Those that haven’t will do so this year. Additionally, numerous “listening post” services dealing with brand image and voice of the customer have also sprung up. However, while voice of the customer has been a hot area and will continue to be, I think other application areas, such as competitive intelligence, will also gain momentum. There is a lot of data out on the Internet that can be used to gain insight about markets, trends, and competitors.
  • Predictive Analytics Model Building: In 2009, there was a lot of buzz about predictive analytics. For example, IBM bought SPSS, and other vendors, such as SAS and Megaputer, also beefed up their offerings. A newish development that will continue to gain steam is predictive analytics in the cloud. For example, vendors Aha! software and Clario are providing predictive capabilities to users in a cloud-based model. While different in approach, they both speak to the trend that predictive analytics will be hot in 2010.
  • Operationalizing Predictive Analytics: While not every company can or may want to build a predictive model, there are certainly a lot of uses for operationalizing predictive models as part of a business process. Forward-looking companies are already using this as part of the call center process, in fraud analysis, and in churn analysis, to name a few use cases. The momentum will continue to build, making advanced analytics more pervasive.
  • Advanced Analytics in the Cloud: Speaking of putting predictive models in the cloud, business analytics in general will continue to move to the cloud for mid-market companies and others that deem it valuable. Companies such as QlikTech introduced a cloud-based service in 2009. There are also a number of pure-play SaaS vendors out there, like GoodData and others, that provide cloud-based services in this space. Expect to hear more about this in 2010.
  • Analyzing Complex Data Streams: A number of forward-looking companies with large amounts of real-time data (such as RFID or financial data) are already investing in analyzing these data streams. Some are using the on-demand capacity of a cloud-based model to do this. Expect this trend to continue in 2010.

Operationalizing Predictive Analytics

There has been a lot of excitement in the market recently around business analytics in general and specifically around predictive analytics. The promise of moving away from the typical rear view mirror approach to a predictive, anticipatory approach is a very compelling value proposition. 

But just how can this be done? Predictive models are complex. So how can companies use them to their best advantage? A number of ideas have emerged to make this happen, including 1) making the models easier to build in the first place and 2) operationalizing models that have been built so users across the organization can utilize their output in various ways. I have written several blogs on the topic.

Given the market momentum around predictive analytics, I was interested to speak to members of the Aha! team about their spin on this subject, which they term “Business Embedded Analytics.” For those of you not familiar with Aha!, the company was formed in 2006 to provide a services platform (i.e. a SaaS platform called Axel) to embed analytics within a business. The company currently has customers in healthcare, telecommunications, and travel and transportation. The idea behind the platform is to allow business analysts to utilize advanced business analytics in their day-to-day jobs by implementing a range of deterministic and stochastic predictive models and then tracking, trending, forecasting, and monitoring business outcomes based on the output of the models.

An example

Here’s an example. Say you work at an insurance company and you are concerned about customers not renewing their policies. Your company might have a lot of data about both past and present customers, including demographic data, the type of policy they have, how long they’ve had it, and so on. This kind of data can be used to create a predictive model of customers who are likely to drop their policy based on the characteristics of customers who have already done so. The Aha! platform allows a company to collect the data necessary to run the model, implement the model, get the results from the model, and continue to update and track it as more data becomes available. This, by itself, is not a new idea. What is interesting about the Axel Services Platform is that the output from the model is displayed as a series of dynamic Key Performance Indicators (KPIs) that the business analyst has created. These KPIs are really important metrics, such as current membership, policy terminations, % disenrolled, and so on. The idea is that once the model is chugging away, and getting more data, it can produce these indicators on an ongoing basis, and analysts can use this information to actively understand and act on what is happening to their customer base. The platform enables analysts to visualize these KPIs, trend them, forecast on them, and change the value of one of the KPIs in order to see the impact that might have on the overall business. Here is a screen shot of the system:

In this instance, these are actual, not forecasted, values of the KPIs (although this could represent a modeled goal). For example, the KPI in the lower right-hand corner of the screen is called Internal Agent Member Retention. This is actually a drill-down of information from the Distribution Channel Performance. The KPI might represent the number of policies renewed on a particular reference date, year to date, etc. If it were a modeled KPI, it might represent the target value for that particular KPI (i.e. in order to make a goal of selling 500,000 policies in a particular time period, an internal agent must sell, say, 450 of them). This goal might change based on seasonality, risk, time periods, and so on.
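
As a rough sketch of the mechanics underneath this kind of workflow (generic pandas and scikit-learn, not the Axel platform itself; the file and column names are invented), a churn model is fit on customers whose outcome is already known, applied to the current book of business, and its scores are then rolled up into the KPIs an analyst would track:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Historical policyholders with a known outcome (1 = dropped the policy)
    history = pd.read_csv("policy_history.csv")
    features = ["age", "tenure_months", "premium", "num_claims"]

    model = LogisticRegression(max_iter=1000)
    model.fit(history[features], history["dropped"])

    # Score current policyholders and roll the scores up into KPIs
    current = pd.read_csv("current_policyholders.csv")
    current["p_drop"] = model.predict_proba(current[features])[:, 1]

    kpis = {
        "current_membership": len(current),
        "predicted_terminations": int((current["p_drop"] > 0.5).sum()),
        "pct_at_risk": round(100 * (current["p_drop"] > 0.5).mean(), 1),
    }
    print(kpis)

The point of a platform like the one described above is that this loop runs continuously as new data arrives, rather than as a one-off modeling exercise, and the resulting KPIs are what the business analyst actually works with.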

Aha! provides tools for collaboration among analysts and a dashboard, so that this information can be shared with members across the organization or across companies. Aha! provides a series of predictive models, but also enables companies to pull in models from outside sources such as SAS or SPSS. The service is currently targeted at enterprise-class companies.

So what?

What does this mean? Simply this: the model, once created, is not static. Rather, its results are part of the business analyst’s day-to-day job. In this way, companies can develop a strategy (for example, around acquisition or retention), create a model to address it, and then continually monitor, analyze, and act on what is happening to their customer base.

When most analytics vendors talk about operationalizing predictive analytics, they generally mean putting a model in a process (say, for a call center) that can be used by call center agents to tell them what they should be offering customers. Call center agents can provide information back into the model, but I haven’t seen a solution where the model represents the business process in quite this way and continuously monitors the process. This can be a tremendous help in the acquisition and retention efforts of a company. I see these kinds of models and processes being very useful in industries that have a lot of small customers who aren’t that “sticky,” meaning they have the potential to churn. In this case, it is not enough to run a model once; it really needs to be part of the business process. In fact, the outcome analytics of the business user are the necessary feedback to calibrate and tune the predictive model (i.e. you might build a model, but it isn’t really the right model). As offers, promotions, etc. are provided to these customers, the results can be understood in a dynamic way, so that, in a sense, you get out ahead of your customer base.

Four reasons why the time is right for IBM to tackle Advanced Analytics

IBM has dominated a good deal of the news in the business analytics world recently. On Friday, it completed the purchase of SPSS and solidified its position in predictive analytics. This is certainly the biggest leg of a recent three-pronged attack on the analytics market that also includes:

  • Purchasing Red Pill.  Red Pill is a privately-held company headquartered in Singapore that provides advanced customer analytics services –  especially in the business process outsourcing arena.  The company has talent in the area of advanced data modeling and simulation for various verticals such as financial services and telecommunications. 
  • Opening a series of solutions centers focused on advanced analytics. Four centers are operating now: in New York (announced last week), Berlin, Beijing, and Tokyo. Others are planned for Washington D.C. and London.

Of course, there is a good deal of organizational (and technology) integration that needs to be done to get all of the pieces working together, and working with all of the other software purchases IBM has made recently. But what is compelling to me is the size of the effort that IBM is putting forth. The company obviously sees an important market opportunity in the advanced analytics market. Why? I can think of at least four reasons:

  • More Data and different kinds of data.  As the amount of data continues to expand, companies are finally realizing that they can use this data for competitive advantage, if they can analyze it properly.  This data includes traditional structured data as well as data from sensors and other instruments that pump out a lot of data, and of course, all of that unstructured data that can be found both within and outside of a company.
  • Computing power.  The computing power now exists to actually analyze this information.  This includes analyzing unstructured information along with utilizing complex algorithms to analyze massive amounts of structured data. And, with the advent of cloud computing, if companies are willing to put their data into the cloud, the compute power increases.
  • The power of analytics. Sure, not everyone at every company understands what a predictive model is, much less how to build one. However, a critical mass of companies have come to realize the power that advanced analytics, such as predictive analysis, can provide. For example, insurance companies are predicting fraud and telecommunications companies are predicting churn. When a company utilizes a new technique with success, it is often more willing to try other new analytical techniques.
  • The analysis can be operationalized. Predictive models have been around for decades. The difference is that 1) the compute power exists and 2) the results of the models can be utilized in operations. I remember developing models to predict churn many years ago, but the problem was that it was difficult to actually put these models into operation. This is changing. For example, companies are using advanced analytics in call centers. When a customer calls, an agent knows if that customer might be likely to disconnect a service. The agent can utilize this information, along with recommendations for new services, to try to retain the customer (a small sketch of this kind of scoring follows the list).
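
Here is a minimal sketch of what that call center scoring step can look like in practice (my own illustration with invented file, feature, and threshold names, not any particular vendor’s API): when a call arrives, the customer’s attributes are scored against a previously trained churn model, and the agent’s screen is told whether to surface a retention offer.

    import joblib  # assumes a churn model was trained and saved earlier
    import pandas as pd

    churn_model = joblib.load("churn_model.joblib")
    FEATURES = ["tenure_months", "monthly_spend", "support_calls_90d"]

    def score_caller(customer: dict, offer_threshold: float = 0.6) -> dict:
        """Score one inbound caller and decide whether to show a retention offer."""
        X = pd.DataFrame([customer])[FEATURES]
        p_churn = float(churn_model.predict_proba(X)[0, 1])
        return {
            "churn_probability": round(p_churn, 2),
            "show_retention_offer": p_churn >= offer_threshold,
        }

    # What the agent desktop might request when a customer calls in
    print(score_caller({"tenure_months": 4, "monthly_spend": 38.0,
                        "support_calls_90d": 3}))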

So, as someone who is passionate about data analysis, I am glad to see that it is finally gaining the traction it deserves.

Analyzing Data in the Cloud

I had an interesting chat with Roman Stanek, CEO of Good Data, last week about the debate over data security and reliability in the cloud. For those of you who are not familiar with Good Data, it provides a collaborative business analytics platform as a SaaS offering.

The upshot of the discussion was something like this:

The argument over data security and reliability in the cloud is the wrong argument. It’s not just about moving your existing data to the cloud. It’s about using the cloud to provide a different level of functionality, capability, and service than you could obtain using a traditional premises solution – even if you move that solution to the “hosted” cloud.

What does this mean? First, companies should not simply be asking the question, “Should I move my data to the cloud?” They should be thinking about the new capabilities the cloud provides as part of the decision-making process. For example, Good Data touts its collaborative capabilities and its ability to do mash-ups and certain kinds of benchmarking (utilizing external information) as differentiators from standard premises-based BI solutions. This leads to the second point: a hosted BI solution is a different animal than a SaaS solution. For example, a user of Good Data could pull in information from other SaaS solutions (such as Salesforce.com) as part of the analysis process. This might be difficult with a vanilla hosted solution.

 So, when BI users think about moving to the public cloud they need to assess the risk vs. the reward of the move.  If they are able to perform certain analysis that they couldn’t perform via a premises model and this analysis is valuable, then any perceived or real risk might be worth it.

What is location intelligence and why is it important?

Visualization can change the way that we look at data and information. If that data contains a geographic/geospatial component, then utilizing location information can provide a new layer of insight for certain kinds of analysis. Location intelligence is the integration and analysis of visual geographic/geospatial information as part of the decision-making process. A few examples where this might be useful include:

  • Analyzing marketing activity
  • Analyzing sales activity
  • Analyzing crime patterns
  • Analyzing utility outages
  • Analyzing military options

I had the opportunity to meet with the team from SpatialKey the other week. SpatialKey offers a location intelligence solution, targeted at decision makers, in a Software as a Service (SaaS) model. The offering comes from Universal Mind, a consulting company that specializes in design and usability and has done a lot of work on dashboards, Geographic Information Systems, and the like. Based on its experience, it developed a cloud-based service to help people utilize geographic information more effectively.

According to the company, all the user needs to get started is a CSV file with their data. Files must contain either an address, which SpatialKey will geocode, or latitude and longitude for mapping purposes; they can also contain any other structured data fields. Here is a screen shot from the system. It shows approximately 1,000 real estate transactions from the Sacramento, California area that were reported over a five-day period.

[Screenshot: SpatialKey heat map of Sacramento real estate transactions]

There are several points to note in this figure. First, the data can be represented as a heat map, meaning areas with large numbers of transactions appear in red and areas with lower numbers appear in green. Second, the software gives the user the ability to add visualization pods, which are graphics (on the left) that drill down into the information. The service also allows you to incrementally add other data sets, so you can visualize patterns. For example, you might choose to add crime rates or foreclosure rates on top of the real estate transactions to understand the area better. The system also provides filtering capabilities through pop-ups and sliders.
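
As a rough illustration of rendering this kind of heat map from a CSV of already geocoded points (generic pandas and folium code, not SpatialKey’s software; the file and column names are invented):

    import pandas as pd
    import folium
    from folium.plugins import HeatMap

    # Hypothetical file: one row per transaction, already geocoded
    sales = pd.read_csv("sacramento_sales.csv")  # columns: latitude, longitude, price

    # Center the map on the data and overlay a heat map of transaction density
    m = folium.Map(location=[sales["latitude"].mean(),
                             sales["longitude"].mean()], zoom_start=11)
    HeatMap(sales[["latitude", "longitude"]].values.tolist()).add_to(m)
    m.save("sales_heatmap.html")  # open in a browser to explore the hot spots

A product in this space adds the geocoding, the drill-down pods, the filtering, and the ability to layer additional data sets on top, which is where the analytic value over a static map comes from.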

SpatialKey has just moved out of beta and into trial.  The company does not intend to compete with traditional BI vendors.  Rather, its intention is to provide a lightweight alternative to traditional BI and GIS systems.  The idea would be to simply export data from different sources (either your company data stores or even other cloud sources such as Salesforce.com) and allow end users to analyze it via a cloud model.

The future of data is more data. Location intelligence solutions will continue to become important as the number of devices, such as RFID tags and other sensors, continues to explode. As these devices spew even more data into organizations, people will want a better way to analyze this information. It makes sense to include geographic visualization as part of the business analytics arsenal.

IBM Business Analytics and Optimization – The Dawn of a New Era

I attended the IBM Business Analytics and Optimization (BAO) briefing yesterday at the IBM Research facility in Hawthorne, NY. At the meeting, IBM executives from Software, Global Business Services, and Research (yes, Research) announced the company’s new consulting organization, which will be led by Fred Balboni. The initiative includes 4,000 GBS consultants working together with the Software Group and Research to deliver solutions to customers dedicated to advanced business analytics and business optimization. The initiative builds off of IBM’s Smarter Planet.

 

IBM believes that there is a great opportunity for companies that can take all of the information they are being inundated with and use it effectively. According to IBM (based on a recent study), only 13% of companies are utilizing analytics to their advantage. The business drivers behind the new practice include the fact that companies are being pressured to make decisions smarter and faster. Optimization is key, as is the ability for organizations to become more predictive. In fact, the word predictive was used a lot yesterday.

 

According to IBM, with an instrumented data explosion, powerful software will be needed to manage this information, analyze it, and act on it. This goes beyond business intelligence and business process management, to what IBM terms business analytics and optimization. BAO operationalizes this information via advanced analytics and optimization. This means that advanced analytics operating on lots of data will be part of solutions that are sold to customers. BAO will go to market with industry-specific applications.

 

‘Been doing this for years

 

IBM was quick to point out that it has been delivering solutions like this to customers for a number of years. Here are a few examples:

 

  • The Sentinel Group, an organization that provides healthcare anti-fraud and abuse services, uses IBM software and advanced analytics to predict insurance fraud.
  • The Fire Department of New York is using IBM software and advanced analytics to “build a state of the art system for collecting and sharing data in real-time that can potentially prevent fires and protect firefighters and other first responders when a fire occurs”.
  • The Operational Risk data exchange (ORX) is using IBM to help its 35 member banks better analyze operational loss data from across the banking industry. This work is being done in conjunction with IBM Research.

 

These solutions were built in conjunction with members of IBM Research, who have been pioneering new techniques for analyzing data. This is a group of 200 mathematicians and other quantitative scientists. In fact, according to IBM, IBM Research has been part of a very large number of client engagements. A few years back, the company formalized the bridge between GBS and Research via the Center for Business Optimization. The new consulting organization is a further outgrowth of this.

 

The Dawn of a New Era

 

The new organization will provide consulting services in the following areas:

  • Strategy
  • Business Intelligence and Business Performance Management
  • Advanced Analytics and Optimization
  • Enterprise Information Management
  • Enterprise Content Management

 

It was significant that the meeting was held at the Research Labs.  We lunched with researchers, met with Brenda Dietrich, VP of Research, and saw a number of solution demos that utilized intellectual property from Research.  IBM believes that its research strength will help to differentiate it from competitors.

 

The research organization is doing some interesting work in many areas of data analysis, including mining blogs, sentiment analysis, and machine learning and predictive analysis. While there are researchers on the team who are more traditional and measure success based on how many papers they publish, there are a large number who get excited about solving real problems for real customers. Brenda Dietrich requires that each lab participate in real-world work.

 

Look, I get excited about business analytics; it’s in my blood. I agree that the world of data is changing and that companies that make the most effective use of information will come out ahead. I’ve been saying this for years. I’m glad that IBM is taking the bull by the horns. I like that Research is involved.

 

It will be interesting to see how effectively IBM can take its IP, reuse it, and make it scale across different customers in different industries in order to solve complex problems. According to IBM, once a specific piece of IP is used several times, it can effectively be made to work across other solutions. On a side note, it will also be interesting to see how this IP might make its way into the Cognos platform. That is not the thrust of this announcement (which is more GBS-centric), but it is worth mentioning.

 

Text Analytics Meets Enterprise Content Management, Round 2 – IBM Content Analyzer

Why? How? These are key questions that business people ask a lot. Questions such as, “Why did our customer retention rate plummet?” or “How come our product quality has declined?” or even “How did we end up in this mess?” often cannot be answered using structured data alone. However, there is a lot of unstructured data out there in content management systems that is ripe for this kind of analysis. Claims, case files, contracts, call center notes, and various forms of correspondence are all prime sources of insight.


This past year, I have become quite interested in what content management vendors are doing about text analytics. This is, in part, due to some research we had done at Hurwitz & Associates, which indicated that companies were planning to deploy their text analytics solutions in conjunction with content management systems. Many BI vendors have already incorporated text analytics into their BI platforms, yet earlier this year there didn’t seem to be much action on this front on the part of the ECM vendors.


Now, several content management providers are stepping up to the plate with offerings in this space. One of these vendors is IBM. IBM’s Content Analyzer, formerly IBM OmniFind Analytics Edition, uses linguistic understanding and trend analysis to allow users to search, mine, and analyze the combined information from their unstructured content and structured data. Content Analyzer consists of two pieces: a backend linguistics component and a visualization and analysis text mining user interface. Content Analyzer also has a direct integration with FileNet P8, which means it understands FileNet formats and that the integrity of any information from the FileNet system is maintained as it moves into Content Analyzer.


Last week, Rashmi Vital, the offering manager for content analytics, gave me a demo of the product. It has come a long way since I wrote about what IBM was doing in the space back in February. The demo she showed me utilized data from the National Highway Traffic Safety Administration (NHTSA) database, which logs consumer complaints about vehicles. The information includes structured data – for example, the incident date and whether there was a crash – and unstructured data, which is the free-text description written by consumers. Rashmi showed me how the text mining capabilities of Content Analyzer can be used for the early detection of quality issues.


Let’s suppose that management wants to understand quality problems with cars, trucks, and SUVs before they become a major safety issue (the auto industry doesn’t need any more trouble than it already has), and wants to understand what specific component of the car is causing the complaint. The user decides to explore a data source of customer complaints. In this example, we are using the NHTSA data, but he or she could obviously get more information from the company’s own warranty claims, reports, and technician notes stored in the content management system.


The user wants to gather all of the incident reports associated with fire. Since fire isn’t a structured data field, the first thing the user can do is either simply search on fire, or set up a small dictionary of terms with words and phrases that also represent fire. These might include words like flame, blaze, spark, burst, and so on. Content Analyzer crawls the documents and puts them in an index. Content Analyzer’s index contains the NLP analytic results for the corpus of documents as well as the documents themselves, because the analyst often wants to see the source.


Here is a screen shot of the IBM Content Analyzer’s visualization tool, called Text Miner. Text Miner provides facilities for real-time statistical analysis on the index for a source dataset. It allows users to analyze the processed data by organizing the data into categories, applying search conditions, and further drilling down to analyze patterns over time or correlations.

[Screenshot: Text Miner view of the search results]


You can see that the search on fire returned about 200,000 documents (there were over 500,000 to start). The user can then correlate fire with specific problems. In this example, the user decides to correlate it with a structured field called “vehicle component”. The vehicle component most highly correlated with fire (and with a high number of hits) is the electrical system wiring. The user can continue to drill down on this information in order to determine what make and model of car had this problem, and once he or she has distilled the analysis to a manageable number of records, examine the actual descriptions of the problem to understand the root cause.


[Screenshot: correlation of fire-related complaints with vehicle components]


Correlation analysis has another benefit: sometimes it can surface trends that are highly unusual and that we would not otherwise consider. Suppose we take the same criteria as above and sort by correlation value (see next figure). It is not surprising to see components like the electrical system or fuel system listed, since we assume these are normal places for potential fires to start. However, just below those components you can see a high correlation between fire and Vehicle Speed Control: Cruise Control Component. Perhaps this is not an area an analyst would consider a potential fire hazard. The high correlation value would be a signal to an analyst to investigate further by drilling down into the component, Vehicle Speed Control: Cruise Control, and into the descriptions that customers submitted. The following view looks at the results of analyzing the phrases related to the current analytic criteria. Being able to drill down to see the actual incident report description allows the analyst to see the issue in its entirety. This is good stuff.


[Screenshot: drill-down into complaint descriptions, sorted by correlation value]
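
A rough approximation of this kind of analysis, using generic pandas rather than Content Analyzer (the term dictionary, file name, and column names are invented for illustration): flag complaints whose free-text description mentions fire-related terms, then rank vehicle components by how strongly they co-occur with those flags.

    import pandas as pd

    complaints = pd.read_csv("nhtsa_complaints.csv")  # columns: component, description

    # Small dictionary of terms that stand in for "fire"
    fire_terms = ["fire", "flame", "blaze", "spark", "smoke", "burn"]
    pattern = "|".join(fire_terms)
    complaints["mentions_fire"] = (
        complaints["description"].str.contains(pattern, case=False, na=False)
    )

    # For each component: count of fire-related complaints and a simple lift score
    overall_rate = complaints["mentions_fire"].mean()
    by_component = complaints.groupby("component")["mentions_fire"].agg(["sum", "mean"])
    by_component["lift"] = by_component["mean"] / overall_rate

    # Components that co-occur with fire far more often than average
    print(by_component.sort_values("lift", ascending=False).head(10))

A dedicated text analytics product adds linguistic processing, indexing at scale, and the drill-down into the original documents, but simple keyword flags plus a co-occurrence ranking capture the basic idea.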

Stay tuned as I plan to showcase what other content management vendors are doing in this space.

I’m interested in your company’s plans for text analytics and content management.


 
