Next-Generation Analytics: Four Findings from TDWI’s Latest Best Practices Report

I recently completed TDWI’s latest Best Practices Report: Next Generation Analytics and Platforms for Business Success. Although the phrase “next-generation analytics and platforms” can evoke images of machine learning, big data, Hadoop, and the Internet of Things (IoT), most organizations are somewhere in between that technology vision and today’s reality of BI and dashboards. For some organizations, next generation can simply mean pushing past reports and dashboards to more advanced forms, such as predictive analytics. Next-generation analytics might move your organization from visualization to big data visualization; from slicing and dicing data to predictive analytics; or from using only structured data to analyzing other data types as well. The market is on the cusp of moving forward.

What are some of the newer next-generation steps that companies are taking to move ahead?

  • Moving to predictive analytics. Predictive analytics is a statistical or data mining technique that can be used on both structured and unstructured data to determine outcomes such as whether a customer will “leave or stay” or “buy or not buy.” Predictive analytics models provide probabilities of certain outcomes. Popular use cases include churn analysis, fraud analysis, and predictive maintenance. Predictive analytics is gaining momentum, and the market is primed for growth if users stick to their plans and can be successful with the technology. In our survey, 39% of respondents stated they are using predictive analytics today, and an additional 46% are planning to use it in the next few years. Organizations often move in fits and starts when it comes to more advanced analytics, but predictive analytics, along with other techniques such as geospatial analytics, text analytics, social media analytics, and stream mining, is gaining interest in the market.
  • Adding disparate data to the mix. Currently, 94% of respondents stated they are using structured data for analytics, and 68% are enriching this structured data with demographic data for analysis. However, companies are also becoming interested in other kinds of data. Sources such as internal text data (27% today), external Web data (29% today), and external social media data (19% today) are set to double or even triple in use for analysis over the next three years. Likewise, while IoT data is used by fewer than 20% of respondents today, another 34% expect to use it in the next three years. Real-time streaming data, which goes hand in hand with IoT data, is also set to grow in use (18% today).
  • Operationalizing and embedding analytics. Operationalizing refers to making analytics part of a business process; i.e., deploying analytics into production so that the output of analytics can be acted upon. Operationalizing occurs in different ways. It may be as simple as manually routing all claims that seem to have a high probability of fraud to a special investigation unit, or it might be as complex as embedding analytics in a system that automatically takes action based on the analytics (a minimal scoring-and-routing sketch follows this list). The market is still relatively new to this concept. Twenty-five percent of respondents have not operationalized their analytics, and another 15% stated they operationalize using manual approaches. Fewer than 10% embed analytics in system processes to operationalize them.
  • Investing in skills. Respondents cited the lack of skilled personnel as a top challenge for next-generation analytics. To overcome this challenge, some respondents talked about hiring fewer but more skilled personnel such as data analysts and data scientists. Others talked about training from within because current employees understand the business. Our survey revealed that many organizations are doing both. Additionally, some organizations are building competency centers where they can train from within. Where funding is limited, organizations are engaging in self-study.
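To make the predictive and operationalizing bullets above concrete, here is a minimal sketch of scoring records for a probability (churn or fraud) and routing high-probability cases into a business process. It is purely illustrative and not taken from the report; the synthetic data, model choice, and the 0.8 routing threshold are all assumptions.

```python
# Minimal sketch: score records for a probability of churn/fraud, then route
# high-probability cases to a special investigation queue ("operationalizing").
# Synthetic data, model choice, and the 0.8 threshold are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training data: two behavioral features and a binary outcome label.
X_train = rng.normal(size=(1000, 2))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1]
           + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Predictive analytics returns a probability of the outcome, not a yes/no.
X_new = rng.normal(size=(5, 2))
probabilities = model.predict_proba(X_new)[:, 1]

# Operationalizing: act on the score as part of a business process.
THRESHOLD = 0.8
for record_id, p in enumerate(probabilities):
    if p >= THRESHOLD:
        print(f"record {record_id}: p={p:.2f} -> route to investigation unit")
    else:
        print(f"record {record_id}: p={p:.2f} -> standard processing")
```

In practice the scoring step would run inside a production system (embedded in a claims workflow, for example) rather than in a standalone script, which is the difference between the manual and embedded approaches respondents described.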

These are only a few of the findings in this Best Practices Report. To download the complete report, click here.

To learn more about all things data, attend a TDWI conference! Each TDWI Conference features a unique program taught by highly qualified, vetted instructors teaching full- and half-day courses on topics of specific interest to the analytics/BI/DW professional.


3 Takeaways from the IBM Big Data Event

Last week I attended the IBM Big Data at the Speed of Business event at IBM’s Almaden research facility. At this analyst event IBM announced multiple capabilities around its big data initiative, including its new BLU Acceleration and IBM PureData System for Hadoop. Additionally, new versions of InfoSphere BigInsights and InfoSphere Streams (for streaming data) were announced as enhancements to IBM’s Big Data Platform. A new version of Informix that includes time series acceleration was also announced.

The overall goal of these products is to make big data more consumable; i.e., to make it simple to manage and analyze big data. For example, IBM PureData System for Hadoop is basically Hadoop as an appliance, making it easier to stand up and deploy. Executives at the event said that a recent customer had gotten its PureData System “loading and interrogating data in 89 minutes.” The solution comes packaged with analytics and visualization technology too. BLU Acceleration combines a number of technologies, including dynamic in-memory processing and active compression, to deliver reporting and analytics that IBM says are 8-25x faster.

For me, some of the most interesting presentations focused on big data analytics.  These included emerging patterns for big data analytics deployments, dealing with time series data, and the notion of the contextual enterprise.

Big data analytics use cases.  IBM has identified five big data use cases from studying hundreds of engagements it has done across 15 different industries. These high-value use cases include:

  • 360-degree view of the customer – utilizing data from internal and external sources, such as social chatter, to understand behavior and “seminal psychometric markers” and gain insight into customer interactions.
  • Security/intelligence – utilizing data from sources like GPS devices and RFID tags, and consuming it at a rate that makes it possible to protect individuals from fraud or cyber attack.

For more, visit my TDWI blog.

Two Weeks and Counting to Big Data for Dummies

I am excited to announce that I’m a co-author of Big Data For Dummies, which will be released in mid-April 2013. Here’s the synopsis from Wiley:

Find the right big data solution for your business or organization

Big data management is one of the major challenges facing business, industry, and not-for-profit organizations. Data sets such as customer transactions for a mega-retailer, weather patterns monitored by meteorologists, or social network activity can quickly outpace the capacity of traditional data management tools. If you need to develop or manage big data solutions, you’ll appreciate how these four experts define, explain, and guide you through this new and often confusing concept. You’ll learn what it is, why it matters, and how to choose and implement solutions that work.

  • Effectively managing big data is an issue of growing importance to businesses, not-for-profit organizations, government, and IT professionals
  • Authors are experts in information management, big data, and a variety of solutions
  • Explains big data in detail and discusses how to select and implement a solution, security concerns to consider, data storage and presentation issues, analytics, and much more
  • Provides essential information in a no-nonsense, easy-to-understand style that is empowering


Big Data For Dummies cuts through the confusion and helps you take charge of big data solutions for your organization.

Hadoop + MapReduce + SQL + Big Data and Analytics: RainStor

As the volume and variety of data continue to increase, we’re going to see more companies entering the market with solutions to address big data, compliant retention, and business analytics. One such company is RainStor, which, while not a new entrant (it has over 150 end customers through direct sales and partner channels), has recently started to market its big data capabilities more aggressively to enterprises. I had an interesting conversation with Ramon Chen, VP of product management at RainStor, the other week.

The RainStor database was originally built in the UK as a government defense project to process large amounts of data in memory. Many of the in-memory features have been retained, while new capabilities, including persistent retention on any physical storage, have been added. Now the company is positioning itself as providing an enterprise database architected for big data. It even runs natively on Hadoop.

The Value Proposition

The value proposition is that RainStor’s technology enables companies to store data in the RainStor database using a unique compression technology to reduce disk space requirements. The company boasts as much as a 40-to-1 compression ratio (>97% reduction in size). Additionally, the software can run on any commodity hardware and storage.

For example, one of RainStor’s clients generates 17B logs a day that it is required to store and access for ninety days. This is the equivalent of 2 petabytes (PB) of raw information over that period, which would ordinarily cost millions of dollars to store. Using RainStor, the company compressed the data 20-fold and retained it on a cost-efficient 100-terabyte (TB) NAS. At the same time, RainStor also replaced an Oracle data warehouse, providing fast response times for queries in support of an operational call center.
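As a quick sanity check on those numbers, here is a small back-of-the-envelope calculation using only the figures quoted above (2 PB of raw data over the 90-day window, retained on a 100 TB NAS); it simply illustrates the arithmetic and is not a vendor benchmark.

```python
# Back-of-the-envelope check of the retention example above, using only the
# figures quoted in the post: 2 PB of raw logs over 90 days, kept on a 100 TB NAS.
raw_bytes = 2 * 1000**5          # 2 petabytes of raw log data
stored_bytes = 100 * 1000**4     # 100 terabytes after compression

ratio = raw_bytes / stored_bytes            # compression ratio
reduction = 1 - stored_bytes / raw_bytes    # fraction of storage eliminated

print(f"compression ratio: {ratio:.0f}:1")   # 20:1, the '20-fold' claim
print(f"size reduction:    {reduction:.0%}") # 95%
```

A 20:1 ratio corresponds to a 95% reduction in footprint; the 40-to-1 figure the company quotes as its upper bound would correspond to a 97.5% reduction, consistent with the “>97%” claim above.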

RainStor ingests the data, stores it, and makes it available for query and other analytic workloads. It comes in two editions – the Big Data Retention Edition and the Big Data Analytics on Hadoop edition. Both editions provide full SQL-92 and ODBC/JDBC access. According to the company, the Hadoop edition is the only database that runs natively on Hadoop and supports access through MapReduce and the Pig Latin language. As a massively parallel processing (MPP) database, RainStor runs on the same Hadoop nodes, writing and supporting access to compressed data on HDFS. It provides security, high availability, and lifecycle management and versioning capabilities.

The idea, then, is that RainStor can dramatically lower the cost of storing data in Hadoop: its compression reduces the node count needed, it accelerates the performance of MapReduce jobs, and it provides full SQL-92 access. This can reduce the need to transfer data out of the Hadoop cluster to a separate enterprise data warehouse. RainStor allows the Hadoop environment to support real-time query access in addition to its batch-oriented MapReduce processing.
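Because both editions expose standard SQL-92 over ODBC/JDBC, querying the compressed data on HDFS looks, from a client’s point of view, like querying any other ODBC source. Below is a generic ODBC client sketch using pyodbc; the DSN name, credentials, table, and columns are hypothetical placeholders, not taken from RainStor documentation.

```python
# Generic ODBC client sketch. The DSN, credentials, table, and columns are
# hypothetical placeholders; any RainStor-specific details are assumptions.
import pyodbc

# Connect through an ODBC data source configured for the database in use.
conn = pyodbc.connect("DSN=rainstor_example;UID=analyst;PWD=secret")
cursor = conn.cursor()

# Plain SQL-92 issued against data that is physically stored compressed on HDFS.
cursor.execute(
    "SELECT device_id, COUNT(*) AS events "
    "FROM call_logs "
    "WHERE event_date >= ? "
    "GROUP BY device_id",
    ("2013-01-01",),
)
for device_id, events in cursor.fetchall():
    print(device_id, events)

conn.close()
```

The point of the example is simply that no Hive, HQL, or MapReduce code is required on the client side; any SQL-speaking BI tool or script can reach the data the same way.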

How does it work?

RainStor is not a relational database; instead, it follows the NoSQL movement by storing data non-relationally. In its case, the data is physically stored as a set of trees with linked values and nodes. The idea, illustrated in a diagram from RainStor, works as follows.


Say a number of records with a common value are ingested into the system. RainStor would throw away duplicates and store the literal only once, while maintaining references to the records that contained that value. So, if the system is loading 1 million records and 500K of them contain the same value, that value is stored only once, saving significant storage. This and additional pattern deduplication mean that the resulting tree structure holds the same data in a significantly smaller footprint, with a higher compression ratio than other databases on the market, according to RainStor. It also doesn’t require re-inflation, unlike binary zip file compression, which takes resources and time to re-inflate. RainStor writes the tree structure as is to disk and, on a read, reads it back from disk. Instead of unraveling all trees all the time, it reads only the relevant trees and branches of trees required to fulfill the query.
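To illustrate the value-deduplication idea in miniature, here is a toy dictionary-encoding sketch: each distinct literal is stored once, and records hold small references to it. It is a simplification for illustration only, not RainStor’s actual on-disk tree format.

```python
# Toy value deduplication: store each distinct literal once and let records
# reference it by id. A simplification, not RainStor's actual tree format.

def deduplicate(records):
    """Return (distinct value -> id, records re-encoded as lists of ids)."""
    values = {}           # literal value -> id; each literal is stored only once
    encoded_records = []  # each record becomes a list of ids referencing values
    for record in records:
        encoded = []
        for value in record:
            if value not in values:
                values[value] = len(values)
            encoded.append(values[value])
        encoded_records.append(encoded)
    return values, encoded_records

# In the post's example, 500K of 1 million records share one value; here two of
# three records share "London", so that literal is stored just once.
records = [("alice", "London"), ("bob", "London"), ("carol", "Paris")]
values, encoded = deduplicate(records)
print(values)   # {'alice': 0, 'London': 1, 'bob': 2, 'carol': 3, 'Paris': 4}
print(encoded)  # [[0, 1], [2, 1], [3, 4]]
```

Because queries can follow the references for just the values they need, only the relevant branches have to be read back, which matches the behavior described above.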


RainStor is a good example of a kind of database that can enable big data analytics. Just as many companies finally “got” the notion of business analytics and the importance of analytics in decision making, so too are they realizing that, as they accumulate and generate ever-increasing amounts of data, there is an opportunity to analyze and act on it.

For example, according to the company, you can put a BI solution, like IBM Cognos, MicroStrategy, Tableau, or SAS, on top of RainStor. RainStor would hold the raw data, and any BI solution would access the data either through MapReduce or ODBC/JDBC (i.e., one platform) with no need to use Hive and HQL. RainStor also recently announced a partnership with IBM BigInsights for its Big Data Analytics on Hadoop edition.

What about big data appliances that are architected for high-performance analytics? RainStor claims that while some big data appliances do have some MapReduce support (Aster Data, for example), it would be cheaper to use its solution together with open source Hadoop. In other words, RainStor on Hadoop would be cheaper than any big data appliance.

It is still early in the game. I am looking forward to seeing some big data analytics implementations that utilize RainStor. I am interested to see use cases that go past querying huge amounts of data and provide some advanced analytics on top of RainStor, or big data visualizations with rapid response times on top of RainStor that need only a small number of nodes. Please keep me posted, RainStor.

Four Vendor Views on Big Data and Big Data Analytics: IBM

Next in my discussion of big data providers is IBM. Big data plays right into IBM’s portfolio of solutions in the information management space. It also dovetails very nicely with the company’s Smarter Planet strategy. Smarter Planet holds the vision of the world as a more interconnected, instrumented, and intelligent place. IBM’s Smarter Cities and Smarter Industries are all part of its solutions portfolio. For companies to be successful in this type of environment requires a new emphasis on big data and big data analytics.

Here’s a quick look at how IBM is positioning around big data, some of its product offerings, and use cases for big data analytics.


According to IBM, big data has three characteristics: volume, velocity, and variety. IBM is talking about large volumes of both structured and unstructured data. This can include audio and video together with text and traditional structured data, and it can be gathered and analyzed in real time.

IBM has both hardware and software products to support big data and big data analytics. These products include:

  • InfoSphere Streams – a platform that can be used to perform deep analysis of massive volumes of relational and non-relational data types with sub-millisecond response times. Cognos Real-time Monitoring can also be used with InfoSphere Streams for dashboarding capabilities.
  • InfoSphere BigInsights – a product that consists of IBM research technologies on top of open source Apache Hadoop. BigInsights provides core installation, development tools, web-based UIs, connectors for integration, integrated text analytics, and BigSheets for end-user visualization.
  • IBM Netezza – a high-capacity appliance that allows companies to analyze petabytes of data in minutes.
  • Cognos Consumer Insights – leverages BigInsights and text analytics capabilities to perform social media sentiment analysis.
  • IBM SPSS – IBM’s predictive and advanced analytics platform, which can read data from various data sources such as Netezza and be integrated with InfoSphere Streams to perform advanced analysis.
  • IBM Content Analytics – uses text analytics to analyze unstructured data. This can sit on top of InfoSphere BigInsights.

At the Information on Demand (IOD) conference a few months ago, IBM and its customers presented many use cases around big data and big data analytics. Here is what some of the early adopters are doing:

  • Engineering: Analyzing hourly wind data, radiation, heat, and 78 other attributes to determine where to locate the next wind power plant.
  • Business:
    • Analyzing social media data, for example to understand what fans are saying about a sports game in real time.
    • Analyzing customer activity at a zoo to understand guest spending habits, likes and dislikes.
  • Analyzing healthcare data:
    • Analyzing streams of data from medical devices in neonatal units.
    • Healthcare predictive analytics. One hospital is using a product called Content and Predictive Analytics to understand and limit early hospital discharges that would result in re-admittance to the hospital.

IBM is working with its clients and prospects to implement big data initiatives. These initiatives generally involve a services component, given the range of product offerings IBM has in the space and the newness of the market. IBM is making significant investments in tools, integrated analytic accelerators, and solution accelerators to reduce the time and cost to deploy these kinds of solutions.

At IBM, big data is about “the art of the possible.” According to the company, price points on products that may have been too expensive five years ago are coming down. IBM is a good example of a vendor that is both working with customers to push the envelope in terms of what is possible with big data and, at the same time, educating the market about big data. The company believes that big data can change the way companies do business. It’s still early in the game, but IBM has a well-articulated vision around big data. And the solutions its clients discussed were big, bold, and very exciting. The company is certainly a leader in this space.

Informatica announces 9.1 and puts a stake in the ground around big data

Earlier this week, Informatica announced the release of the Informatica 9.1 Platform for Big Data. The company joins other data-centric vendors such as EMC and IBM in putting its stake in the ground around the hot topic of Big Data. Informatica defines Big Data as “all data, including both transaction and interaction data, in sets whose size or complexity exceeds the ability of commonly used technologies to capture, manage and process at a reasonable cost and timeframe.” Indeed, Informatica’s stance is that Big Data is the confluence of three technology trends: big transaction data, big interaction data, and big data processing. In Informatica parlance, the transaction data includes OLTP, OLAP, and data warehouse data; the interaction data might include social media data, call center records, clickstream data, and even scientific data like that associated with genomics. Informatica targets native, high-performance connectivity and future integration with Hadoop, the Big Data processing platform.

In 9.1, Informatica is providing an updated set of capabilities around self-service, authoritative and trustworthy data (MDM and data quality), data services, and data integration. I wanted to focus on the data services here because of the connection to Big Data. Informatica is providing a platform that companies can use to integrate transaction data (at petabyte scale and beyond in volume) and social network data from Facebook, LinkedIn, and Twitter. Additionally, 9.1 provides the capability to move all data into and out of Hadoop in batch or real time, using universal connectivity to sources including mainframes, databases, and applications, which can help in managing unstructured data.

So, how will companies utilize this latest release? I recently had the opportunity to speak with Rob Myers, an Informatica customer, who is the manager of BI architecture and data warehousing, MDM, and enterprise integration at HealthNow. HealthNow is a BlueCross/BlueShield provider for parts of western New York and the Albany area. The company is expanding geographically and is also providing value-added services such as patient portals. It views its mission not simply as a claims processor but as a service provider to healthcare providers and patients. According to Rob, the company is looking to offer other value-added services to doctors and patients as part of its competitive strategy. These offerings may include real-time claims processing, identifying fraudulent claims, or analytics for healthier outcomes. For example, HealthNow might provide a service where it identifies patients with diabetes and provides proactive services to help them manage the disease. Or it might provide physicians with suggestions of tests they might consider for certain patients, given their medical records.

Currently, the company utilizes Informatica PowerCenter and Informatica Data Services for data integration, including ETL and data abstraction. HealthNow has one large data warehouse and is currently building out a second. It exposes data through a logical model in the data services tier. For example, its member portal utilizes data services to enable members to sign in and, in real time, integrates 30-40 attributes about each member, including demographic information, products, and eligibility for certain services, into the portal. In addition, the company’s actuaries, marketing groups, and health services group have been utilizing its data warehouses to perform their own analysis. Rob doesn’t consider the data in these warehouses to be Big Data; rather, they are just sets of relational data. He views Big Data as some of the other data that the company currently has a hard time mining, for example data on social networks and the unstructured data in claims and medical notes. The company is in the beginning phase of determining how to gather and parse through this text and expose it in a way that it can be analyzed. For example, the company is interested in utilizing the data it already has together with unstructured data to provide predictive analytics to its community. HealthNow is exploring Hadoop data stores as part of this plan and is excited about the direction Informatica is moving. It views Informatica as the middleware that can get trusted data out of the various silos and integrate it in a way that it can then be analyzed or used in other value-added services.

It is certainly interesting to see what end users have in mind for Big Data and, for that matter, how they define Big Data. Rob clearly views Big Data as high volume and disparate in nature (i.e., including structured and unstructured data). There also seems to be a time dimension to it. He made the point that it’s not just about having Big Data; it’s about doing something with it that he couldn’t do before, like processing and analyzing it. This is an important point that vendors and end users are starting to pick up on. If Big Data were simply about the volume of different kinds of data, it would be a moving target. Really, an important aspect of Big Data is being able to perform activities on the data that weren’t possible before. I am glad to see companies thinking about their use cases for Big Data and vendors such as Informatica putting a stake in the ground around the subject.

EMC and Big Data – Observations from EMC World 2011

I attended EMC’s User Conference last week in Las Vegas. The theme of the event was Big Data meets the Cloud. So, what’s going on with Big Data and EMC? Does this new strategy make sense?

EMC acquired Greenplum in 2010. At the time, EMC described Greenplum as a “shared nothing, massively parallel processing (MPP) data warehousing system.” In other words, it could handle petabytes of data. While the term data warehouse denotes a fairly static data store, at the user conference EMC executives characterized big data as a high volume of disparate data, both structured and unstructured, that is growing fast and may be processed in real time. Big data is becoming increasingly important to the enterprise not just because of the need to store this data but also because of the need to analyze it. Greenplum has some of its own analytical capabilities, but the company recently formed a partnership with SAS to provide more oomph to its analytical arsenal. At the conference, EMC also announced that it has now included Hadoop as part of its Greenplum infrastructure to handle unstructured information.

Given EMC’s strength in data storage and content management, it is logical for EMC to move into the big data arena. However, I am left with some unanswered questions. These include questions related to how EMC will make storage, content management, data management, and data analysis all fit together.

• Data Management. How will data management issues be handled (i.e., quality, loading, etc.)? EMC has a partnership with Informatica and SAS has data management capabilities, but how will all of these components work together?
• Analytics. What analytics solutions will emerge from the partnership with SAS? This is important since EMC is not necessarily known for analytics. SAS is a leader in analytics and can make a great partner for EMC, but its partnership with EMC is not exclusive. Additionally, EMC made a point of the fact that 90% of most enterprises’ data is unstructured. EMC has incorporated Hadoop into Greenplum, ostensibly to deal with unstructured data. EMC executives mentioned that the open source community has even begun developing analytics around Hadoop. EMC Documentum also has some text analytics capabilities as part of CenterStage. SAS has text analytics capabilities as well. How will all of these different components converge into a plan?
• Storage and content management. How do the storage and content management parts of the business fit into the big data roadmap? It was not clear from the discussions at the meeting how EMC plans to integrate its storage platforms into an overall big data analysis strategy. In the short term we may not see a cohesive strategy emerge.

EMC is taking on the right issues by focusing on customers’ needs to manage big data. However, it is a complicated area and I don’t expect EMC to have all of the answers today. The market is still nascent. Rather, it seems to me that EMC is putting its stake in the ground around big data. This will be an important stake for the future.

