Hadoop + MapReduce + SQL + Big Data and Analytics: RainStor

As the volume and variety of data continues to increase, we’re going to see more companies entering the market with solutions to address big data and compliant retention and business analytics.  One such company is RainStor, which while not a new entrant (with over 150 end-customers through direct sales and partner channels) has recently started to market its big data capabilities more aggressively to enterprises.  I had an interesting conversation with Ramon Chen, VP of product management at RainStor, the other week.   

The RainStor database was built in the UK as a government defense project to process large amounts of data in-memory.  Many of the in-memory features have been retained while new capabilities including persistent retention on any physical storage have been added. And now the company is positioning itself as providing an enterprise database architected for big data. It even runs natively on Hadoop.

The Value Proposition

The value proposition is that Rainstor’s technology enables companies to store data in the RainStor database using a unique compression technology to reduce disk space requirements.  The company boasts as much as a 40 to 1 compression ratio (>97% reduction in size).  Additionally, the software can run on any commodity hardware and storage. 

For example, one of RainStor’s clients generates 17B logs a day that it is required to store and access for ninety days.  This is the equivalent of 2 petabytes (PB) of raw information over that period which would ordinarily cost millions of dollars to store. Using RainStor, the company compressed and retained the data 20 fold in a cost-efficient 100 Terabyte (TB) NAS. At the same time RainStor also replaced an Oracle Data Warehouse providing fast response times to meet queries in support of an operational call center.

RainStor ingests the data, stores it, and makes it available for query and other analytic workloads.  It comes in two editions – the Big Data Retention Edition and the Big Data Analytics on Hadoop edition.  Both editions  provide full SQL-92 and ODBC/JDBC access.  According to the company, the Hadoop edition is the only database that runs natively on Hadoop and supports access through MapReduce and the PIG Latin language. As a massively parallel processing (MPP) database RainStor runs on the same Hadoop nodes, writing and supporting access to compressed data on HDFS. It provides security, high availability, and lifecycle management and versioning capabilities.

The idea then is that RainStor can dramatically lower the cost of storing data in Hadoop through its compression which reduces the node count needed and accelerates the performance of MapReduce jobs and provides full SQL-92 access. This can reduce the need to transfer data out of the Hadoop cluster to a separate enterprise data warehouse.  RainStor allows the Hadoop environment to support real-time query access in addition to its batch-oriented MapReduce processing.

How does it work?

RainStor is not a relational database; instead it follows the NoSQL movement by storing data non-relationally.  In its case the data is physically stored as a set of trees with linked values and nodes.  The idea is illustrated below (source: RainStor) 

Image

Say a number of records with common value yahoo.com are ingested in the system.  Rainstor would throw away duplicates and only store the literal yahoo.com once but maintain references to the records that contained that value.  So, if the system is loading 1 million records and 500K contained yahoo.com it would only be stored once, saving significant storage.  This and additional pattern deduplication means that a resulting tree structure holds the same data in a significantly smaller footprint and higher compression ratio compared to other databases on the market, according to RainStor.  It also doesn’t require re-inflation like binary zip file compression which requires resources and time to re-inflate.  It writes the tree structure as is to disk, when you read it reads it back to disk.  Instead of unraveling all trees all the time, it only reads those relevant trees and branches of trees that are required to fulfill the query.  

Conclusion

RainStor is a good example of a kind of database that can enable big data analytics.  Just as many companies finally “got” the notion of business analytics and the importance of analytics in decision making so too are they realizing that as they accumulate and generate ever increasing amounts of data there is opportunity to analyze and act on it.

For example, according to the company, you can put a BI solution, like IBM Cognos, Microstrategy, Tableau or SAS, on top of RainStor.  RainStor would hold the raw data and any BI solution would access data either through MapReduce or ODBC/JDBC  (i.e. one platform) with no need to use Hive and HQL.  RainStor also recently announced a partnership with IBM BigInsights for its Big Data Analytics on Hadoop edition. 

What about big data appliances that are architected for high performance analytics?  RainStor claims that while some big data appliances  do have some MapReduce support (like Aster Data for example) it would be cheaper to use their solution together with open source Hadoop.  In other words, RainStor on Hadoop would be cheaper than any big data appliance.

It is still early in the game.  I am looking forward to seeing some big data analytics implementations which utilize RainStor.  I am interested to see use cases that go past querying huge amounts of data and provide some advanced analytics on top of RainStor.  Or, big data visualizations with rapid response time on top of RainStor, that only need to utilize a small number of nodes.  Please keep me posted, RainStor.

Follow

Get every new post delivered to your Inbox.

Join 1,189 other followers