3 Takeaways from the IBM Big Data Event

Last week I attended the IBM Big Data at the Speed of Business event at IBM’s research facility in Almaden. At this analyst event, IBM announced multiple capabilities around its big data initiative, including its new BLU Acceleration and IBM PureData System for Hadoop. Additionally, new versions of Infosphere BigInsights and Infosphere Streams (for data streams) were announced as enhancements to IBM’s Big Data Platform. A new version of Informix that includes time series acceleration was also announced.

The overall goal of these products is to make big data more consumable – i.e., to make it simple to manage and analyze big data. For example, IBM PureData System for Hadoop is essentially Hadoop as an appliance, making it easier to stand up and deploy. Executives at the event said that a recent customer had its PureData System “loading and interrogating data in 89 minutes.” The solution comes packaged with analytics and visualization technology too. BLU Acceleration combines a number of technologies, including dynamic in-memory processing and active compression, to make reporting and analytics 8-25x faster.

For me, some of the most interesting presentations focused on big data analytics.  These included emerging patterns for big data analytics deployments, dealing with time series data, and the notion of the contextual enterprise.

Big data analytics use cases. IBM has identified five big data use cases from studying hundreds of engagements it has done across 15 different industries. These high-value use cases include:

  • 360-degree view of a customer – utilizing data from internal and external sources, such as social chatter, to understand behavior and “seminal psychometric markers” to gain insight into customer interactions.
  • Security/intelligence – utilizing data from sources like GPS devices and RFID tags, and consuming it at a rate that can help protect individual safety and guard against fraud or cyber attack.

For more, visit my TDWI blog.

Two Weeks and Counting to Big Data for Dummies

I am excited to announce that I’m a co-author of Big Data For Dummies, which will be released in mid-April 2013. Here’s the synopsis from Wiley:

Find the right big data solution for your business or organization

Big data management is one of the major challenges facing business, industry, and not-for-profit organizations. Data sets such as customer transactions for a mega-retailer, weather patterns monitored by meteorologists, or social network activity can quickly outpace the capacity of traditional data management tools. If you need to develop or manage big data solutions, you’ll appreciate how these four experts define, explain, and guide you through this new and often confusing concept. You’ll learn what it is, why it matters, and how to choose and implement solutions that work.

  • Effectively managing big data is an issue of growing importance to businesses, not-for-profit organizations, government, and IT professionals
  • Authors are experts in information management, big data, and a variety of solutions
  • Explains big data in detail and discusses how to select and implement a solution, security concerns to consider, data storage and presentation issues, analytics, and much more
  • Provides essential information in a no-nonsense, easy-to-understand style that is empowering

 

Big Data For Dummies cuts through the confusion and helps you take charge of big data solutions for your organization.

Hadoop + MapReduce + SQL + Big Data and Analytics: RainStor

As the volume and variety of data continue to increase, we’re going to see more companies entering the market with solutions that address big data, compliant retention, and business analytics. One such company is RainStor, which, while not a new entrant (it has over 150 end customers through direct sales and partner channels), has recently started to market its big data capabilities more aggressively to enterprises. I had an interesting conversation with Ramon Chen, VP of product management at RainStor, the other week.

The RainStor database was built in the UK as a government defense project to process large amounts of data in-memory. Many of the in-memory features have been retained, while new capabilities, including persistent retention on any physical storage, have been added. Now the company is positioning itself as providing an enterprise database architected for big data. It even runs natively on Hadoop.

The Value Proposition

The value proposition is that RainStor’s technology enables companies to store data in the RainStor database using a unique compression technology that reduces disk space requirements. The company boasts as much as a 40:1 compression ratio (a greater than 97% reduction in size). Additionally, the software can run on any commodity hardware and storage.

For example, one of RainStor’s clients generates 17 billion logs a day that it is required to store and access for ninety days. This is the equivalent of 2 petabytes (PB) of raw information over that period, which would ordinarily cost millions of dollars to store. Using RainStor, the company compressed the data 20-fold and retained it on a cost-efficient 100-terabyte (TB) NAS. At the same time, RainStor also replaced an Oracle data warehouse, providing fast response times for queries in support of an operational call center.
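The arithmetic behind those figures checks out; here is a quick back-of-the-envelope verification in Python (using decimal units, i.e., 1 PB = 1,000 TB):

```python
# Quick sanity check of the compression figures quoted above.
raw_tb = 2 * 1000            # 2 PB of raw logs, expressed in TB
compressed_tb = raw_tb / 20  # 20-fold compression reported for this client
print(compressed_tb)         # 100.0 -> matches the 100 TB NAS

reduction_at_40_to_1 = 1 - 1 / 40
print(f"{reduction_at_40_to_1:.1%}")  # 97.5% -> the ">97% reduction" claim
```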

RainStor ingests the data, stores it, and makes it available for query and other analytic workloads. It comes in two editions – the Big Data Retention edition and the Big Data Analytics on Hadoop edition. Both editions provide full SQL-92 and ODBC/JDBC access. According to the company, the Hadoop edition is the only database that runs natively on Hadoop and supports access through MapReduce and the Pig Latin language. As a massively parallel processing (MPP) database, RainStor runs on the same Hadoop nodes, writing and supporting access to compressed data on HDFS. It provides security, high availability, and lifecycle management and versioning capabilities.

The idea, then, is that RainStor can dramatically lower the cost of storing data in Hadoop through its compression, which reduces the node count needed, accelerates the performance of MapReduce jobs, and provides full SQL-92 access. This can reduce the need to transfer data out of the Hadoop cluster to a separate enterprise data warehouse. RainStor allows the Hadoop environment to support real-time query access in addition to its batch-oriented MapReduce processing.
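To make the access model concrete, here is a minimal sketch of what a SQL-92 query over ODBC might look like from Python. RainStor’s SQL-92/ODBC support is per the company’s description; the DSN, table, and column names below are hypothetical, and pyodbc is just one common ODBC client.

```python
# A minimal sketch of querying a RainStor table over ODBC, assuming a
# configured DSN named "rainstor" and a hypothetical call_records table.
import pyodbc

conn = pyodbc.connect("DSN=rainstor")  # hypothetical DSN
cursor = conn.cursor()

# Standard SQL-92 aggregate query; per RainStor, the engine reads only
# the tree branches needed to satisfy the filter rather than
# re-inflating the whole store.
cursor.execute("""
    SELECT caller_domain, COUNT(*) AS hits
    FROM call_records
    WHERE event_date >= '2013-01-01'
    GROUP BY caller_domain
    ORDER BY hits DESC
""")

for domain, hits in cursor.fetchmany(10):
    print(domain, hits)

conn.close()
```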

How does it work?

RainStor is not a relational database; instead, it follows the NoSQL movement by storing data non-relationally. In its case, the data is physically stored as a set of trees with linked values and nodes. The idea is illustrated below (source: RainStor).

[Figure: RainStor’s tree structure of linked values and nodes (source: RainStor)]

Say a number of records with the common value yahoo.com are ingested into the system. RainStor would throw away the duplicates and store the literal yahoo.com only once, while maintaining references to the records that contained that value. So, if the system loads 1 million records and 500K of them contain yahoo.com, the value is stored just once, saving significant storage. This, along with additional pattern deduplication, means that the resulting tree structure holds the same data in a significantly smaller footprint, with a higher compression ratio, than other databases on the market, according to RainStor. It also doesn’t require re-inflation the way binary zip compression does, which takes resources and time. RainStor writes the tree structure as-is to disk, and on a read it reads it back from disk. Instead of unraveling all the trees all the time, it reads only the relevant trees, and branches of trees, required to fulfill the query.
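To see why this saves so much space, here is a toy illustration of value deduplication in Python. This is not RainStor’s actual tree format – just a sketch of the core idea: each distinct literal is stored once, and records hold references to it.

```python
# A toy illustration of value deduplication, not RainStor's on-disk format:
# each distinct literal is stored once, and records keep integer
# references into the value dictionary.
class DedupStore:
    def __init__(self):
        self.values = []   # each distinct literal, stored once
        self.index = {}    # literal -> position in self.values
        self.records = []  # records as tuples of value references

    def ingest(self, record):
        refs = []
        for literal in record:
            if literal not in self.index:
                self.index[literal] = len(self.values)
                self.values.append(literal)
            refs.append(self.index[literal])
        self.records.append(tuple(refs))

    def materialize(self, record_id):
        # Re-expand one record by following its references.
        return [self.values[ref] for ref in self.records[record_id]]

store = DedupStore()
store.ingest(["alice", "yahoo.com", "2013-04-01"])
store.ingest(["bob", "yahoo.com", "2013-04-01"])  # "yahoo.com" reused
print(store.materialize(1))  # ['bob', 'yahoo.com', '2013-04-01']
print(len(store.values))     # 4 distinct literals cover 6 stored fields
```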

Conclusion

RainStor is a good example of the kind of database that can enable big data analytics. Just as many companies finally “got” the notion of business analytics and the importance of analytics in decision making, so too are they realizing that, as they accumulate and generate ever-increasing amounts of data, there is an opportunity to analyze and act on it.

For example, according to the company, you can put a BI solution, like IBM Cognos, MicroStrategy, Tableau, or SAS, on top of RainStor. RainStor would hold the raw data, and any BI solution would access it through either MapReduce or ODBC/JDBC (i.e., one platform), with no need to use Hive and HQL. RainStor also recently announced a partnership with IBM BigInsights for its Big Data Analytics on Hadoop edition.

What about big data appliances that are architected for high-performance analytics? RainStor claims that while some big data appliances do have some MapReduce support (Aster Data, for example), it would be cheaper to use its solution together with open source Hadoop. In other words, RainStor on Hadoop would be cheaper than any big data appliance.

It is still early in the game. I am looking forward to seeing some big data analytics implementations that utilize RainStor. I am interested in use cases that go past querying huge amounts of data and provide some advanced analytics on top of RainStor – or big data visualizations with rapid response times that need only a small number of nodes. Please keep me posted, RainStor.

Four Vendor Views on Big Data and Big Data Analytics: IBM

Next in my discussion of big data providers is IBM. Big data plays right into IBM’s portfolio of solutions in the information management space. It also dovetails very nicely with the company’s Smarter Planet strategy. Smarter Planet holds the vision of the world as a more interconnected, instrumented, and intelligent place. IBM’s Smarter Cities and Smarter Industries are all part of its solutions portfolio. For companies to be successful in this type of environment requires a new emphasis on big data and big data analytics.

Here’s a quick look at how IBM is positioning around big data, some of its product offerings, and use cases for big data analytics.

IBM

According to IBM, big data has three characteristics: volume, velocity, and variety. IBM is talking about large volumes of both structured and unstructured data. This can include audio and video together with text and traditional structured data. And it can be gathered and analyzed in real time.

IBM has both hardware and software products to support big data and big data analytics. These products include:

  • Infosphere Streams – a platform that can be used to perform deep analysis of massive volumes of relational and non-relational data types with sub-millisecond response times. Cognos Real-time Monitoring can also be used with Infosphere Streams for dashboarding capabilities.
  • Infosphere BigInsights – a product that consists of IBM research technologies on top of open source Apache Hadoop. BigInsights provides core installation, development tools, web-based UIs, connectors for integration, integrated text analytics, and BigSheets for end-user visualization.
  • IBM Netezza – a high-capacity appliance that allows companies to analyze petabytes of data in minutes.
  • Cognos Consumer Insights – leverages BigInsights and text analytics capabilities to perform social media sentiment analysis.
  • IBM SPSS – IBM’s predictive and advanced analytics platform, which can read data from various data sources such as Netezza and be integrated with Infosphere Streams to perform advanced analysis.
  • IBM Content Analytics – uses text analytics to analyze unstructured data. This can sit on top of Infosphere BigInsights.

At the Information on Demand (IOD) conference a few months ago, IBM and its customers presented many use cases around big data and big data analytics. Here is what some of the early adopters are doing:

  • Engineering: Analyzing hourly wind data, radiation, heat, and 78 other attributes to determine where to locate the next wind power plant.
  • Business:
    • Analyzing social media data, for example to understand what fans are saying about a sports game in real time.
    • Analyzing customer activity at a zoo to understand guest spending habits, likes, and dislikes.
  • Healthcare:
    • Analyzing streams of data from medical devices in neonatal units.
    • Healthcare predictive analytics: one hospital is using a product called Content and Predictive Analytics to identify and limit early hospital discharges that would result in re-admittance to the hospital.

IBM is working with its clients and prospects to implement big data initiatives. These initiatives generally involve a services component, given the range of product offerings IBM has in the space and the newness of the market. IBM is making significant investments in tools, integrated analytic accelerators, and solution accelerators to reduce the time and cost of deploying these kinds of solutions.

At IBM, big data is about “the art of the possible.” According to the company, price points on products that may have been too expensive five years ago are coming down. IBM is a good example of a vendor that is both working with customers to push the envelope in terms of what is possible with big data and, at the same time, educating the market about big data. The company believes that big data can change the way companies do business. It’s still early in the game, but IBM has a well-articulated vision around big data. And the solutions its clients discussed were big, bold, and very exciting. The company is certainly a leader in this space.

Informatica announces 9.1 and puts a stake in the ground around big data

Earlier this week, Informatica announced the release of the Informatica 9.1 Platform for Big Data. The company joins other data-centric vendors such as EMC and IBM in putting its stake in the ground around the hot topic of Big Data. Informatica defines Big Data as “all data, including both transaction and interaction data, in sets whose size or complexity exceeds the ability of commonly used technologies to capture, manage and process at a reasonable cost and timeframe.” Indeed, Informatica’s stance is that Big Data is the confluence of three technology trends: big transaction data, big interaction data, and big data processing. In Informatica parlance, transaction data includes OLTP, OLAP, and data warehouse data; interaction data might include social media data, call center records, clickstream data, and even scientific data like that associated with genomics. Informatica targets native, high-performance connectivity and future integration with Hadoop, the Big Data processing platform.

In 9.1, Informatica is providing an updated set of capabilities around self-service, authoritative and trustworthy data (MDM and data quality), data services, and data integration. I want to focus on the data services here because of the connection to Big Data. Informatica is providing a platform that companies can use to integrate transactional data (at petabyte scale and beyond in volume) with social network data from Facebook, LinkedIn, and Twitter. Additionally, 9.1 provides the capability to move all data into and out of Hadoop in batch or real time, using universal connectivity to sources including mainframes, databases, and applications, which can help in managing unstructured data.
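As a point of reference for what “moving data into Hadoop” involves at the lowest level, here is a minimal sketch using Hadoop’s standard WebHDFS REST interface from Python. To be clear, this is not Informatica’s API – the 9.1 platform abstracts this plumbing away – and the host, port, and paths below are hypothetical placeholders.

```python
# Batch-loading one file into HDFS over the standard WebHDFS REST API.
# Host, port, and paths are hypothetical; this is generic Hadoop plumbing,
# not Informatica's connectivity layer.
import requests

NAMENODE = "http://namenode.example.com:50070"
HDFS_PATH = "/data/claims/2011-06-01.csv"

# Step 1: ask the NameNode where to write; it redirects to a DataNode.
resp = requests.put(
    f"{NAMENODE}/webhdfs/v1{HDFS_PATH}",
    params={"op": "CREATE", "overwrite": "true"},
    allow_redirects=False,
)
datanode_url = resp.headers["Location"]

# Step 2: stream the file body to the DataNode URL returned above.
with open("claims.csv", "rb") as f:
    requests.put(datanode_url, data=f)
```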

So, how will companies utilize this latest release? I recently had the opportunity to speak with Rob Myers, an Informatica customer, who is the manager of BI architecture and data warehousing, MDM, and enterprise integration for HealthNow. HealthNow is a BlueCross BlueShield provider for parts of western New York and the Albany area. The company is expanding geographically and is also providing value-added services such as patient portals. It views its mission not simply as a claims processor but as a service provider to healthcare providers and patients. According to Rob, the company is looking to offer other value-added services to doctors and patients as part of its competitive strategy. These offerings may include real-time claims processing, identifying fraudulent claims, or analytics for healthier outcomes. For example, HealthNow might provide a service where it identifies patients with diabetes and provides proactive services to help them manage the disease. Or it might provide physicians with suggestions of tests they might consider for certain patients, given their medical records.

Currently, the company utilizes Informatica PowerCenter and Informatica Data Services for data integration, including ETL and data abstraction. HealthNow has one large data warehouse and is currently building out a second. It is exposing data through a logical model in a data services tier. For example, its member portal utilizes data services to let members sign in and, in real time, integrate 30-40 attributes around each member – including demographic information, products, and eligibility for certain services – into the portal. In addition, the company’s actuaries, marketing groups, and health services group have been utilizing its data warehouses to perform their own analysis. Rob doesn’t consider the data in these warehouses to be Big Data; rather, they are just sets of relational data. He views Big Data as some of the other data the company currently has a hard time mining – for example, data on social networks and the unstructured data in claims and medical notes. The company is in the beginning phase of determining how to gather and parse through this text and expose it in a way that it can be analyzed. For example, the company is interested in utilizing the data it already has together with unstructured data and providing predictive analytics to its community. HealthNow is exploring Hadoop data stores as part of this plan and is excited about the direction Informatica is moving. It views Informatica as the middleware that can get trusted data out of the various silos and integrated in a way that it can then be analyzed or used in other value-added services.
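The data services pattern Rob describes – one logical member view assembled in real time from several underlying silos – can be sketched as follows. This is emphatically not HealthNow’s implementation; the sources, fields, and values are hypothetical stubs that just illustrate the federation idea.

```python
# A toy sketch of the data-services idea: a single logical "member" view
# is assembled on request from several underlying sources, so the portal
# never queries the silos directly. All names and values are hypothetical.
def fetch_demographics(member_id):
    return {"name": "J. Smith", "zip": "14201"}   # stub for source 1

def fetch_products(member_id):
    return {"plan": "BlueCare PPO"}               # stub for source 2

def fetch_eligibility(member_id):
    return {"vision": True, "dental": False}      # stub for source 3

def member_view(member_id):
    # The data services tier merges dozens of attributes like these
    # in real time behind one logical model.
    record = {"member_id": member_id}
    record.update(fetch_demographics(member_id))
    record.update(fetch_products(member_id))
    record["eligibility"] = fetch_eligibility(member_id)
    return record

print(member_view("M12345"))
```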

It is certainly interesting to see what end users have in mind for Big Data and, for that matter, how they define Big Data. Rob clearly views Big Data as high volume and disparate in nature (i.e., including structured and unstructured data). There also seems to be a time dimension to it. And he made the point that it’s not just about having Big Data; it’s about doing something with it that he couldn’t do before – like processing and analyzing it. This is an important point that vendors and end users are starting to pick up on. If Big Data were simply about the volume of different kinds of data, then it would be a moving target. Really, an important aspect of Big Data is being able to perform activities on the data that weren’t possible before. I am glad to see companies thinking about their use cases for Big Data and vendors such as Informatica putting a stake in the ground around the subject.

EMC and Big Data – Observations from EMC World 2011

I attended EMC’s User Conference last week in Las Vegas. The theme of the event was Big Data meets the Cloud. So, what’s going on with Big Data and EMC? Does this new strategy make sense?

EMC acquired Greenplum in 2010. At the time, EMC described Greenplum as a “shared nothing, massively parallel processing (MPP) data warehousing system.” In other words, it could handle petabytes of data. While the term data warehouse denotes a fairly static data store, at the user conference EMC executives characterized big data as a high volume of disparate data – structured and unstructured – that is growing fast and may be processed in real time. Big data is becoming increasingly important to the enterprise not just because of the need to store this data but also because of the need to analyze it. Greenplum has some of its own analytical capabilities, but the company recently formed a partnership with SAS to provide more oomph to its analytical arsenal. At the conference, EMC also announced that it has now included Hadoop as part of its Greenplum infrastructure to handle unstructured information.

Given EMC’s strength in data storage and content management, it is logical for EMC to move into the big data arena. However, I am left with some unanswered questions. These include questions related to how EMC will make storage, content management, data management, and data analysis all fit together.

• Data Management. How will data management issues (i.e., quality, loading, etc.) be handled? EMC has a partnership with Informatica, and SAS has data management capabilities, but how will all of these components work together?
• Analytics. What analytics solutions will emerge from the partnership with SAS? This is important since EMC is not necessarily known for analytics. SAS is a leader in analytics and can make a great partner for EMC, but its partnership with EMC is not exclusive. Additionally, EMC made a point of the fact that 90% of most enterprises’ data is unstructured. EMC has incorporated Hadoop into Greenplum, ostensibly to deal with unstructured data. EMC executives mentioned that the open source community has even begun developing analytics around Hadoop. EMC Documentum also has some text analytics capabilities as part of Center Stage. SAS has text analytics capabilities as well. How will all of these different components converge into a plan?
• Storage and content management. How do the storage and content management parts of the business fit into the big data roadmap? It was not clear from the discussions at the meeting how EMC plans to integrate its storage platforms into an overall big data analysis strategy. In the short term we may not see a cohesive strategy emerge.

EMC is taking on the right issues by focusing on customers’ needs to manage big data. However, it is a complicated area and I don’t expect EMC to have all of the answers today. The market is still nascent. Rather, it seems to me that EMC is putting its stake in the ground around big data. This will be an important stake for the future.

Advanced Analytics and the skills needed to make it happen: Takeaways from IBM IOD

Advanced Analytics was a big topic at the IBM IOD conference last week. As part of this, predictive analytics was again an important piece of the story, along with other advanced analytics capabilities IBM has developed or is in the process of developing to support optimization. These include BigInsights (for big data), analyzing data streams, content/text analytics, and, of course, the latest release of Cognos.

One especially interesting topic discussed at the conference was the skills required to make advanced analytics a reality. I have been writing and thinking a lot about this subject, so I was very happy to hear IBM address it head on during the second-day keynote. This keynote included a customer panel and another speaker, Dr. Atul Gawande, and both offered some excellent insights. The panel included Scott Friesen (Best Buy), Scott Futren (Gwinnett County Public Schools), Srinivas Koushik (Nationwide), and Greg Christopher (Nestle). Here are some of the interrelated nuggets from the discussions:

• Ability to deliver vs. the ability to absorb. One panelist made the point that a lot of new insights are being delivered to organizations. In the future, it may become difficult for people to absorb all of this information (and this will require new skills too).
• Analysis and interpretation. People will need to know how to analyze and how to interpret the results of an analysis. As Dr. Gawande pointed out, “Having knowledge is not the same as using knowledge effectively.”
• The right information. One of the panelists mentioned that putting analytics tools in the hands of line people might be too much for them, and instead the company is focusing on giving these employees the right information.
• Leaders need to have capabilities too. If executives are accustomed to using spreadsheets and relying on their gut instincts, then they will also need to learn how to make use of analytics.
• Cultural changes. From call center agents using the results of predictive models to workers on the line seeing reports to business analysts using more sophisticated models, change is coming. This change means people will be changing the way that they work. How this change is handled will require special thought by organizations.

IBM executives also made a point of discussing the critical skills required for analytics. These included strategy development, developing user interfaces, enterprise integration, modeling, and dealing with structured and unstructured data. IBM has, of course, made a huge investment in these skills. GBS executives emphasized the 8,500 employees in its Global Business Services Business Analytics and Optimization group. Executives also pointed to the fact that the company has thousands of partners in this space and that 1 in 3 IBMers will attend analytics training. So, IBM is prepared to help companies in their journey into business analytics.

Are companies there yet? I think that it is going to take organizations time to develop some of these skills (and some they should probably outsource). Sure, analytics has been around a long time. And sure, vendors are making their products easier to use and that is going to help end users become more effective. Even if we’re just talking about a lot of business people making use of analytic software (as opposed to operationalizing it in a business process), the reality is that analytics requires a certain mindset. Additionally, unless someone understands the context of the information he or she is dealing with, it doesn’t matter how user friendly the platform is – they can still get it wrong. People using analytics will need to think critically about data, understand their data, and understand context. They will also need to know what questions to ask.

I whole-heartedly believe it is worth the investment of time and energy to make analytics happen.

Please note:

As luck would have it, I am currently fielding a study on advanced analytics! I am interested in understanding what your company’s plans are for advanced analytics. If you’re not planning to use advanced analytics, I’d like to know why. If you’re already using advanced analytics, I’d like to understand your experience.

If you participate in this survey I would be happy to send you a report of our findings. Simply provide your email address at the end of the survey! Here’s the link:

Click here to take survey

Analyzing Big Data

The term “Big Data” has gained popularity over the past 12-24 months as a) the amounts of data available to companies continually increase and b) technologies have emerged to manage this data more effectively. Of course, large volumes of data have been around for a long time. For example, I worked in the telecommunications industry for many years analyzing customer behavior. This required analyzing call records. The problem was that the technology (particularly the infrastructure) couldn’t necessarily support this kind of compute-intensive analysis, so we often analyzed billing records rather than streams of call detail records, or sampled the records instead.

Now companies are looking to analyze everything from the genome to Radio Frequency ID (RFID) tags to business event streams. And newer technologies have emerged to handle massive (TB and PB) quantities of data more effectively. Often this processing takes place on clusters of computers, meaning that processing occurs across many machines. The advent of cloud computing and the elastic nature of the cloud have furthered this movement.

A number of frameworks have also emerged to deal with large-scale data processing and support large-scale distributed computing. These include MapReduce and Hadoop:

-MapReduce is a software framework introduced by Google to support distributed computing on large sets of data. It is designed to take advantage of cloud resources. This computing is done across large clusters of computers; each machine in a cluster is referred to as a node. MapReduce can deal with both structured and unstructured data. Users specify a map function that processes a key/value pair to generate a set of intermediate pairs, and a reduce function that merges the intermediate values associated with each key.
-Apache Hadoop is an open source distributed computing platform, written in Java and inspired by MapReduce. Data is stored across many machines in blocks that are replicated to other servers. It uses a hash algorithm to cluster data elements that are similar. Hadoop can create a map of organized key/value pairs that can be output to a table, to memory, or to a temporary file to be analyzed.
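To make the map/reduce contract above concrete, here is a single-process sketch in Python. A real Hadoop job would run these same functions distributed across many nodes; the word-count task is just the canonical illustrative example.

```python
# A single-process simulation of the MapReduce contract: a map function
# emits intermediate key/value pairs, the framework groups them by key
# (the "shuffle"), and a reduce function merges each group.
from collections import defaultdict

def map_fn(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Merge the intermediate values for one key.
    return word, sum(counts)

def run_mapreduce(lines):
    grouped = defaultdict(list)
    for line in lines:                  # map phase
        for key, value in map_fn(line):
            grouped[key].append(value)  # shuffle: group by key
    return [reduce_fn(k, v) for k, v in grouped.items()]  # reduce phase

print(run_mapreduce(["big data big analytics", "big data tools"]))
# [('big', 3), ('data', 2), ('analytics', 1), ('tools', 1)]
```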

But what about tools to actually analyze this massive amount of data?

Datameer

I recently had a very interesting conversation with the folks at Datameer. Datameer formed in 2009 to provide business users with a way to analyze massive amounts of data. The idea is straightforward: provide a platform to collect and read different kinds of large data stores into a Hadoop framework, and then provide tools for analysis of this data. In other words, hide the complexity of Hadoop and provide analysis tools on top of it. The folks at Datameer believe their solution is particularly useful for data greater than 10 TB, where a company may have hit a cost wall using traditional technologies but where a business user might want to analyze some kind of behavior. So website activity, CRM systems, phone records, and POS data might all be candidates for analysis. Datameer provides 164 functions (e.g., group, average, median) for business users, with APIs to target more specific requirements.

For example, suppose you’re in marketing at a wireless service provider and you offered a “free minutes” promotion. You want to analyze the call detail records of the customers who made use of the program, to get a feel for how customers would use cell service if given unlimited minutes. The chart below shows the call detail records from one particular day of the promotion – July 11th. It shows the call number (MDN) as well as the times each call started and stopped and the duration of the call in milliseconds. Note that the data appear under the “Analytics” tab; the “Data” tab provides tools to read different data sources into Hadoop.

This is just a snapshot – there may be terabytes of data from that day. So, what about analyzing this data? The chart below illustrates a simple analysis of the longest calls and the phone numbers those calls came from. It also shows basic statistics about all of the calls on that day – the average, median, and maximum call duration.
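For readers who want to see what such a summary boils down to, here is the same kind of calculation in plain Python on a few illustrative records. Datameer, of course, computes these with its built-in functions over Hadoop-scale data; the field layout (MDN, start, stop, duration in milliseconds) follows the example above, and the values are made up.

```python
# The same summary statistics in plain Python, for illustration only.
from statistics import mean, median

call_records = [
    # (mdn, start_ts, stop_ts, duration_ms) - illustrative values
    ("555-0101", "07:02:11", "07:14:53", 762_000),
    ("555-0102", "09:30:05", "09:31:00", 55_000),
    ("555-0103", "11:45:40", "12:40:10", 3_270_000),
]

durations = [rec[3] for rec in call_records]
print("average:", mean(durations))
print("median:", median(durations))
print("maximum:", max(durations))

# The longest calls and the numbers they came from:
for mdn, *_, dur in sorted(call_records, key=lambda r: r[3], reverse=True)[:2]:
    print(mdn, dur)
```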

From this brief example, you can start to visualize the kind of analysis that is possible with Datameer.

Note too that since Datameer runs on top of Hadoop, it can deal with unstructured as well as structured data. The company has some solutions in the unstructured realm (such as basic analysis of Twitter feeds) and is working to provide more sophisticated tools. Datameer offers its software either on a SaaS license or on premises.

In the Cloud?

Not surprisingly, early adopters of the technology are using it in a private cloud model. This makes sense, since companies often want to keep control of their own data. Some of these companies already have Hadoop clusters in place and are looking for analytics capabilities for business use. Others are dealing with big data but have not yet adopted Hadoop; they are looking at a complete “big data BI” type of solution.

So, will there come a day when business users can analyze massive amounts of data without having to drag IT entirely into the picture? Utilizing BI adoption as a model, the folks from Datameer hope so. I’m interested in any thoughts readers might have on this topic!
