Two Big Data Resources Worth Exploring

It's a good day. Our new book, Big Data for Dummies, is being released today, and I'm busy working on a Big Data Analytics maturity model at TDWI with Krish Krishnan. Krish, a faculty member at TDWI, is presenting some of the model at the TDWI World Conference: Big Data Tipping Point, taking place during the first week of May (see sidebar). I would encourage people to attend, even if you aren't that far along in your big data deployments. TDWI has terrific courses in all aspects of information management, and we understand that most companies will need to leverage their existing infrastructure to support big data initiatives. In fact, the title of this World Conference is "Preparing for the Practical Realities of Big Data." Check it out.

Back to the book. Here's a look at the Introduction! Enjoy!

3 Takeaways from the IBM Big Data Event

Last week I attended the IBM Big Data at the Speed of Business event at IBM's research facility in Almaden. At this analyst event, IBM announced multiple capabilities around its big data initiative, including its new BLU Acceleration and the IBM PureData System for Hadoop. Additionally, new versions of InfoSphere BigInsights and InfoSphere Streams (for data streams) were announced as enhancements to IBM's Big Data Platform. A new version of Informix that includes time series acceleration was also announced.

The overall goal of these products is to make big data more consumable, i.e., simpler to manage and analyze. For example, the IBM PureData System for Hadoop is basically Hadoop as an appliance, making it easier to stand up and deploy. Executives at the event said that a recent customer had gotten its PureData System "loading and interrogating data [in] 89 minutes." The solution comes packaged with analytics and visualization technology, too. BLU Acceleration combines a number of technologies, including dynamic in-memory processing and active compression, to make reporting and analytics 8-25x faster.

For me, some of the most interesting presentations focused on big data analytics.  These included emerging patterns for big data analytics deployments, dealing with time series data, and the notion of the contextual enterprise.

Big data analytics use cases. IBM has identified five big data use cases from studying hundreds of engagements it has done across 15 different industries. These high-value use cases include:

  • 360-degree view of a customer – utilizing data from internal and external sources, such as social chatter, to understand behavior and "seminal psychometric markers" and to gain insight into customer interactions.
  • Security/intelligence – utilizing data from sources like GPS devices and RFID tags, and consuming it at a rate that makes it possible to protect individual safety from fraud or cyber attack.

For more, visit my TDWI blog.

Five Best Practices for Text Analytics

It’s been a while since I updated my blog and a lot has changed.  In January, I made the move to TDWI as Research Director for Advanced Analytics.  I’m excited to be there, although I miss Hurwitz & Associates.   One of the last projects I worked on while at Hurwitz & Associates was the Victory Index for Text Analytics.  Click here for more information on the Victory Index.  

As part of my research for the Victory Index, I spent a lot of time talking to companies about how they're using text analytics. By far, one of the biggest use cases for text analytics centers on understanding customer feedback and behavior. Some companies are using internal data such as call center notes, emails, or survey verbatims to gather feedback and understand behavior; others are using social media; and still others are using both.

What are these end users saying about how to be successful with text analytics?  Aside from the important best practices around defining the right problem, getting the right people, and dealing with infrastructure issues, I’ve also heard the following:

Best Practice #1 - Managing expectations among senior leadership. A number of the end users I speak with say that their management often thinks text analytics will work almost out of the box, which can set unrealistic expectations. Some of these executives seem to envision a big funnel where reams of unstructured text enter and concepts, themes, entities, and insights pop out at the other end. Managing expectations is a balancing act. On the one hand, executive management may not want to hear the details about how long it is going to take you to build a taxonomy or integrate data. On the other hand, it is important to get wins under your belt quickly to establish credibility in the technology, because no one wants to wait years to see results. That said, it is still important to establish a reasonable set of goals, prioritize them, and communicate them to everyone. End users find that getting senior management involved, and keeping them informed with well-defined plans for a realistic first project, can be very helpful in managing expectations.

 

For more, visit my TDWI blog.
Hadoop + MapReduce + SQL + Big Data and Analytics: RainStor

As the volume and variety of data continue to increase, we're going to see more companies entering the market with solutions that address big data for compliant retention and business analytics. One such company is RainStor, which, while not a new entrant (it has over 150 end customers through direct sales and partner channels), has recently started to market its big data capabilities more aggressively to enterprises. I had an interesting conversation with Ramon Chen, VP of product management at RainStor, the other week.

The RainStor database was originally built in the UK as a government defense project to process large amounts of data in memory. Many of the in-memory features have been retained, while new capabilities, including persistent retention on any physical storage, have been added. Now the company is positioning itself as providing an enterprise database architected for big data. It even runs natively on Hadoop.

The Value Proposition

The value proposition is that RainStor's technology enables companies to store data in the RainStor database using a unique compression technology that reduces disk space requirements. The company boasts as much as a 40:1 compression ratio (a greater than 97% reduction in size). Additionally, the software can run on any commodity hardware and storage.

For example, one of RainStor's clients generates 17 billion logs a day that it is required to store and access for ninety days. That is the equivalent of 2 petabytes (PB) of raw information over the period, which would ordinarily cost millions of dollars to store. Using RainStor, the company compressed the data 20-fold and retained it on a cost-efficient 100-terabyte (TB) NAS. At the same time, RainStor replaced an Oracle data warehouse, providing fast query response times in support of an operational call center.
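To make the arithmetic concrete, here is a quick back-of-the-envelope check of those figures. This is only a sketch based on the vendor's numbers above; the implied per-record size is my derivation, not a published spec.

```python
# Back-of-the-envelope check of the RainStor retention example above.
# The volumes come from the vendor's quoted figures; the per-record
# size is implied by them, not published.

logs_per_day = 17e9        # 17 billion logs generated per day
retention_days = 90        # required retention window
raw_bytes = 2e15           # ~2 PB of raw data over the window

# Implied average record size: ~2 PB / (17e9 * 90) is roughly 1.3 KB
avg_record_bytes = raw_bytes / (logs_per_day * retention_days)

# A 20-fold compression shrinks 2 PB to roughly 100 TB
compressed_bytes = raw_bytes / 20

print(f"implied average record size: {avg_record_bytes:,.0f} bytes")
print(f"compressed footprint: {compressed_bytes / 1e12:,.0f} TB")
```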

RainStor ingests the data, stores it, and makes it available for query and other analytic workloads. It comes in two editions: the Big Data Retention Edition and the Big Data Analytics on Hadoop Edition. Both editions provide full SQL-92 and ODBC/JDBC access. According to the company, the Hadoop edition is the only database that runs natively on Hadoop and supports access through MapReduce and the Pig Latin language. As a massively parallel processing (MPP) database, RainStor runs on the same Hadoop nodes, writing and supporting access to compressed data on HDFS. It provides security, high availability, and lifecycle management and versioning capabilities.
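Because both editions expose standard SQL-92 over ODBC/JDBC, querying RainStor from an application should look much like querying any other ODBC source. Here is a minimal Python sketch; the DSN name, table, and columns are hypothetical, and the actual connection details depend on RainStor's driver.

```python
# Minimal sketch of querying a SQL-92/ODBC-accessible store from Python.
# The DSN, table, and column names are hypothetical; consult RainStor's
# driver documentation for real connection parameters.
import pyodbc

conn = pyodbc.connect("DSN=rainstor")  # assumes an ODBC DSN is configured
cursor = conn.cursor()

# Plain SQL-92: no Hive/HQL layer needed, per the vendor's claim above
cursor.execute(
    "SELECT caller_id, COUNT(*) AS calls "
    "FROM call_logs "
    "WHERE call_date >= ? "
    "GROUP BY caller_id",
    "2012-01-01",
)
for row in cursor.fetchall():
    print(row.caller_id, row.calls)

conn.close()
```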

The idea, then, is that RainStor can dramatically lower the cost of storing data in Hadoop: its compression reduces the node count needed, accelerates the performance of MapReduce jobs, and comes with full SQL-92 access. This can reduce the need to transfer data out of the Hadoop cluster to a separate enterprise data warehouse. RainStor allows the Hadoop environment to support real-time query access in addition to its batch-oriented MapReduce processing.

How does it work?

RainStor is not a relational database; instead, it follows the NoSQL movement by storing data non-relationally. In its case, the data is physically stored as a set of trees with linked values and nodes. The idea is illustrated below (source: RainStor).

[Image: RainStor's tree-based storage structure]

Say a number of records with the common value yahoo.com are ingested into the system. RainStor would throw away the duplicates, store the literal yahoo.com only once, and maintain references to the records that contained that value. So if the system loads 1 million records and 500K of them contain yahoo.com, the value is stored just once, saving significant storage. This, together with additional pattern deduplication, means the resulting tree structure holds the same data in a significantly smaller footprint, and at a higher compression ratio, than other databases on the market, according to RainStor. It also doesn't require re-inflation the way binary zip-file compression does, which takes resources and time. RainStor writes the tree structure as-is to disk and, on a read, reads it back from disk. Instead of unraveling all the trees all the time, it reads only the relevant trees, and branches of trees, required to fulfill the query.
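Here is a toy illustration of the value-level deduplication described above: each distinct literal is stored once, and records hold references into the shared store. This is a conceptual sketch only, not RainStor's actual tree format.

```python
# Toy illustration of value-level deduplication: each distinct literal
# is stored once, and every record keeps integer references into the
# shared value store. A conceptual sketch, not RainStor's on-disk format.

class DedupStore:
    def __init__(self):
        self.values = []    # each distinct literal, stored once
        self.index = {}     # literal -> position in self.values
        self.records = []   # each record is a tuple of value references

    def add_record(self, fields):
        refs = []
        for value in fields:
            if value not in self.index:
                self.index[value] = len(self.values)
                self.values.append(value)
            refs.append(self.index[value])
        self.records.append(tuple(refs))

    def get_record(self, i):
        return [self.values[ref] for ref in self.records[i]]

store = DedupStore()
store.add_record(["alice", "yahoo.com", "2012-04-01"])
store.add_record(["bob",   "yahoo.com", "2012-04-01"])

# "yahoo.com" and the date are stored once each, however many
# records reference them.
print(store.values)         # ['alice', 'yahoo.com', '2012-04-01', 'bob']
print(store.get_record(1))  # ['bob', 'yahoo.com', '2012-04-01']
```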

Conclusion

RainStor is a good example of the kind of database that can enable big data analytics. Just as many companies finally "got" the notion of business analytics and the importance of analytics in decision making, so too are they realizing that, as they accumulate and generate ever-increasing amounts of data, there is an opportunity to analyze and act on it.

For example, according to the company, you can put a BI solution like IBM Cognos, MicroStrategy, Tableau, or SAS on top of RainStor. RainStor would hold the raw data, and any BI solution would access it either through MapReduce or ODBC/JDBC (i.e., one platform), with no need to use Hive and HQL. RainStor also recently announced a partnership with IBM BigInsights for its Big Data Analytics on Hadoop edition.

What about big data appliances that are architected for high-performance analytics? RainStor claims that while some big data appliances do have some MapReduce support (Aster Data, for example), it would be cheaper to use its solution together with open-source Hadoop. In other words, RainStor on Hadoop would be cheaper than any big data appliance.

It is still early in the game. I am looking forward to seeing some big data analytics implementations that utilize RainStor. I am interested to see use cases that go beyond querying huge amounts of data and provide advanced analytics on top of RainStor, or big data visualizations with rapid response times that need to utilize only a small number of nodes. Please keep me posted, RainStor.

Four Vendor Views on Big Data and Big Data Analytics Part 1: Attensity

I am often asked whether it is the vendors or the end users who are driving the Big Data market. I usually reply that both are. There are early adopters of any technology that push the vendors to evolve their own products and services. The vendors then show other companies what can be done with this new and improved technology.

Big Data and Big Data Analytics are hot topics right now. Different vendors, of course, come at them from their own points of view. Here's a look at how four vendors (Attensity, IBM, SAS, and SAP) are positioning around this space, some of their product offerings, and use cases for Big Data Analytics.

In Attensity's world, Big Data is all about high-volume customer conversations. Attensity's text analytics solutions can be used to analyze both internal and external data sources to better understand the customer experience. For example, they can analyze sources such as call center notes, emails, survey verbatims, and other documents to understand customer behavior. With its recent acquisition of Biz360, the company can combine social media from 75 million sources and analyze this content to understand the customer experience. Since industry estimates put the structured/unstructured data ratio at 20%/80%, this kind of data needs to be addressed. While vendors with Big Data appliances have talked about integrating and analyzing unstructured data as part of the Big Data equation, most of what has been done to date has dealt primarily with structured data. This is changing, but it is good to see a text analytics vendor address the issue head on.

Attensity already has a partnership with Teradata, so it can marry information extracted from its unstructured data (from internal conversations) with structured data stored in the Teradata warehouse. Recently, Attensity extended this partnership to Aster Data, which was acquired by Teradata. Aster Data provides a platform for Big Data Analytics: the Aster MapReduce Platform is a massively parallel software solution that embeds MapReduce analytic processing with data stores for big data analytics on what the company terms "multistructured data sources and types." Attensity can now be embedded as a runtime SQL function in the Aster Data library to enable the real-time analysis of social media streams. Aster Data will also act as a long-term archival and analytics platform for the Attensity real-time Command Center platform for social media feeds and iterative exploratory analytics. By mid-2012, the plan is for complete integration with the Attensity Analyze application.

Attensity describes several use cases for the real-time analysis of social streams:

1. Voice of the Customer Command Center: the ability to semantically annotate real-time social data streams and combine that with multi-channel customer conversation data in a Command Center view that gives companies a real-time view of what customers are saying about their company, products and brands.
2. Hotspotting: the ability to analyze customer conversations to identify emerging trends. Unlike common keyword-based approaches, Hotspot reports identify issues that a company might not already know about, as they emerge, by measuring the "significance" of the change in probability of a data value between a historical period and the current period. Attensity then assigns a "temperature" value to mark the degree of difference between the two probabilities: hot means significantly trending upward in the current period versus the historical one; cold means significantly trending downward (see the sketch after this list).
3. Customer service: the ability to analyze conversations to identify top complaints and issues and prioritize incoming calls, emails or social requests accordingly.
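To make the hotspotting idea in item 2 more tangible, here is a hedged sketch in Python. The two-proportion z-score and the thresholds are illustrative stand-ins for the "significance" measure; Attensity has not published its actual method, so treat this as a conceptual sketch only.

```python
# Illustrative "hotspot" temperature: compare a topic's relative frequency
# in a current window against a historical baseline and label the shift
# hot or cold. The z-test and cutoffs are my choices, not Attensity's.
import math

def temperature(hist_count, hist_total, curr_count, curr_total, z_cut=2.0):
    p_hist = hist_count / hist_total
    p_curr = curr_count / curr_total
    # Pooled two-proportion z-score as a stand-in for "significance of change"
    p_pool = (hist_count + curr_count) / (hist_total + curr_total)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / hist_total + 1 / curr_total))
    z = (p_curr - p_hist) / se if se else 0.0
    if z >= z_cut:
        return "hot", z     # significantly trending upward
    if z <= -z_cut:
        return "cold", z    # significantly trending downward
    return "neutral", z

# e.g. "battery complaint" mentions: 120 of 50,000 historical conversations
# vs. 95 of 10,000 in the current period
print(temperature(120, 50_000, 95, 10_000))  # -> ('hot', ...)
```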

Next up: SAS

Four Findings from the Hurwitz & Associates Advanced Analytics Survey

Hurwitz & Associates conducted an online survey on advanced analytics in January 2011. Over 160 companies across a range of industries and company sizes participated in the survey. The goal of the survey was to understand how companies are using advanced analytics today and what their plans are for the future. Specific topics included:

– Motivation for advanced analytics
– Use cases for advanced analytics
– Kinds of users of advanced analytics
– Challenges with advanced analytics
– Benefits of the technology
– Experiences with BI and advanced analytics
– Plans for using advanced analytics

What is advanced analytics?
Advanced analytics provides algorithms for the complex analysis of either structured or unstructured data. It includes sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data mining techniques. Among its many use cases, it can be deployed to find patterns in data, to predict, optimize, and forecast, and to support complex event processing. Examples include predicting churn, identifying fraud, market basket analysis, and analyzing social media for brand management. Advanced analytics does not include database query and reporting, or OLAP cubes.
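As a concrete illustration of the distinction, here is a minimal predictive-modeling sketch, the kind of analysis that falls under advanced analytics rather than query and reporting. The data is synthetic, and the scikit-learn model choice is mine, purely for illustration.

```python
# Illustrative "advanced analytics" example: fitting a predictive churn
# model rather than running a query or OLAP report. The data is synthetic;
# a real project would start from actual customer history.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic features: monthly spend and support calls for 1,000 customers
spend = rng.normal(50, 15, 1_000)
calls = rng.poisson(2, 1_000)
X = np.column_stack([spend, calls])

# Synthetic label: churn is more likely with low spend and many calls
churn_prob = 1 / (1 + np.exp(0.05 * spend - 0.8 * calls))
y = rng.random(1_000) < churn_prob

model = LogisticRegression().fit(X, y)

# Score a new customer: predicted probability of churning
print(model.predict_proba([[30.0, 5]])[0, 1])
```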

Many early adopters of this technology have used predictive analytics as part of their marketing efforts. However, the diversity of use cases for predictive analytics is growing. In addition to marketing-related analytics in areas such as market basket analysis, promotional mix, consumer behavior analysis, brand loyalty, and churn analysis, companies are using the technology in new and innovative ways. For example, newer industry use cases are emerging, including reliability assessment (i.e., predicting failure in machines), situational awareness and behavior analysis (defense), investment analysis, fraud identification (insurance, finance), predicting disabilities from claims (insurance), and finding patterns in health-related data (medical).

The two charts below illustrate several key findings from the survey on how companies use advanced analytics and who within the organization is using this technology.

• Figure 1 indicates that the top uses for advanced analytics include finding patterns in data and building predictive models.

• Figure 2 illustrates that users of advanced analytics in many organizations have expanded from statisticians and other highly technical staff to include business analysts and other business users. Many vendors anticipated this shift to business users and enhanced their offerings by adding new user interfaces, for example, which suggest or dictate what model should be used, given a certain set of data.

Other highlights include:

• Survey participants have seen a huge business benefit from advanced analytics. In fact, over 40% of the respondents who had implemented advanced analytics believed it had increased their company’s top-line revenue. Only 2% of respondents stated that advanced analytics provided little or no value to their company.
• Regardless of company size, the vast majority of respondents expected the number of users of advanced analytics in their companies to increase over the next six to 12 months. In fact, over 50% of respondents currently using the technology expected the number of users to increase over this time period.

The final report will be published in March 2011. Stay tuned!

Five Requirements for Advanced Analytics

The other day I was looking at the analytics discussion board that I moderate on the Information Management site. I had posted a topic entitled “the value of advanced analytics.” I noticed that the number of views on this topic was at least 3 times as many as on other topics that had been posted on the forum. The second post that generated a lot of traffic was a question about a practical guide to predictive analytics.

Clearly, companies are curious and excited about advanced analytics. Advanced analytics utilizes sophisticated techniques to understand patterns and predict outcomes. It includes complex techniques such as statistical modeling, machine learning, linear programming, mathematics, and even natural language processing (on the unstructured side). While many kinds of "advanced analytics" have been around for the last 20+ years (I used them extensively in the '80s), and the term may simply be a way to invigorate the business analytics market, the point is that companies are finally starting to realize the value this kind of analysis can provide.

Companies want to better understand the value this technology brings and how to get started. And while the number of users interested in advanced analytics continues to increase, the reality is that there will likely be a skills shortage in this area. Why? Because advanced analytics isn't the same beast as what I refer to as "slicing and dicing and shaking and baking" data to produce reports that might include information such as sales per region, revenue per customer, etc.

So what skills does a business user need to face the advanced analytics challenge? It's a tough question. There is a certain thought process that goes into advanced analytics. Here are five skills (there are, no doubt, more) that I would say, at a minimum, you should have:

1. It’s about the data. So, thoroughly understand your data. A business user needs to understand all aspects of his or her data. This includes answers to questions such as, “What is a customer?” “What does it mean if a data field is blank?” “Is there seasonality in my time series data?” It also means understanding what kind of derived variables (e.g. a ratio) you might be interested in and how you want to calculate them.
2. Garbage in, garbage out. Appreciate data quality issues. A business user analyzing data cannot simply assume that the data (from whatever source) is absolutely fine. It might be, but you still need to check. Part of this ties back to understanding your data, but it also means first looking at the data and asking whether it makes sense. And what do you do with data that doesn't make sense? (See the sketch after this list.)
3. Know what questions to ask. I remember a time in graduate school when, excited by having my data and trying to analyze it, a wise professor told me not to simply throw statistical models at the data because you can. First, know what questions you are trying to answer from the data. Ask yourself if you have the right data to answer the questions. Look at the data to see what it is telling you. Then start to consider the models. Knowing what questions to ask will require business acumen.
4. Don’t skip the training step. Know how to use tools and what the tools can do for you. Again, it is simple to throw data at a model, especially if the software system suggests a certain model. However, it is important to understand what the models are good for. When does it make sense to use a decision tree? What about survival analysis? Certain tools will take your data and suggest a model. My concern is that if you don’t know what the model means, it makes it more difficult to defend your output. That is why vendors suggest training.
5. Be able to defend your output. At the end of the day, you're the one who needs to present your analysis to your company. Make sure you know enough to defend it. Turn the analysis upside down, ask questions of it, and make sure you can articulate the output.
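As promised in item 2, here is a minimal sketch of the kind of data sanity checks a business user might run before modeling anything. The column names and rules are made up for illustration.

```python
# Minimal pandas sketch of the sanity checks described in item 2 above.
# Column names and rules are made-up illustrations, not a standard recipe.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "monthly_revenue": [250.0, None, -40.0, 900.0],
    "region": ["east", "west", "west", None],
})

# What does a blank field mean? Surface missing values before modeling.
print(df.isna().sum())

# Does the data make sense? Negative revenue probably signals a data issue.
suspect = df[df["monthly_revenue"] < 0]
print(suspect)

# Compute a derived variable (e.g. a ratio) only after the inputs are trusted
df["revenue_share"] = df["monthly_revenue"] / df["monthly_revenue"].sum()
```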

I could go on and on, but I'll stop here. Advanced analytics tools are simply that – tools. They will be only as good as the person utilizing them, which requires understanding the tools as well as how to think and strategize around the analysis. So my message? Utilized properly, these tools can be great. Utilized incorrectly – well – it's analogous to a do-it-yourself electrician who burns down the house.
