While text analytics is considered a “must have” technology by the majority of companies that use it, challenges abound. So I’ve learned from the many companies I’ve talked to while preparing Hurwitz & Associates’ Victory Index for Text Analytics, a tool that assesses not just the technical capability of the technology but also its ability to provide tangible value to the business (look for the results of the Victory Index in about a month). Here are the top five challenges: http://bit.ly/Tuk8DB. Interestingly, most of them have nothing to do with the technology itself.
As the volume and variety of data continue to increase, we’re going to see more companies entering the market with solutions that address big data, compliant retention, and business analytics. One such company is RainStor, which, while not a new entrant (it has over 150 end customers through direct sales and partner channels), has recently started to market its big data capabilities more aggressively to enterprises. I had an interesting conversation with Ramon Chen, VP of product management at RainStor, the other week.
The RainStor database was built in the UK as a government defense project to process large amounts of data in-memory. Many of the in-memory features have been retained, while new capabilities, including persistent retention on any physical storage, have been added. The company is now positioning itself as providing an enterprise database architected for big data. It even runs natively on Hadoop.
The Value Proposition
The value proposition is that RainStor’s technology enables companies to store data in the RainStor database using a unique compression technology that reduces disk space requirements. The company boasts compression ratios as high as 40:1 (a greater than 97% reduction in size). Additionally, the software can run on any commodity hardware and storage.
For example, one of RainStor’s clients generates 17 billion logs a day that it is required to store and access for ninety days. This is the equivalent of 2 petabytes (PB) of raw information over that period, which would ordinarily cost millions of dollars to store. Using RainStor, the company compressed the data 20-fold and retained it on a cost-efficient 100-terabyte (TB) NAS. At the same time, RainStor also replaced an Oracle data warehouse, providing fast response times for queries in support of an operational call center.
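A quick back-of-the-envelope check of those figures (using only the numbers quoted above; the implied average log size is my own inference):

```python
# Back-of-the-envelope check of the storage figures quoted above.
# Assumes 1 PB = 1,000 TB = 1e15 bytes; the 17B logs/day, 90-day window,
# 2 PB raw size, and 20-fold compression come from the example.
LOGS_PER_DAY = 17e9
RETENTION_DAYS = 90
RAW_PB = 2.0
COMPRESSION = 20

total_logs = LOGS_PER_DAY * RETENTION_DAYS        # ~1.53 trillion logs
avg_log_bytes = RAW_PB * 1e15 / total_logs        # implied average log size
compressed_tb = RAW_PB * 1000 / COMPRESSION       # footprint after compression

print(f"total logs: {total_logs:.2e}")
print(f"implied avg log size: {avg_log_bytes:.0f} bytes")
print(f"compressed footprint: {compressed_tb:.0f} TB")  # 100 TB, matching the NAS
```

The numbers are internally consistent: 2 PB divided by the cited 20-fold compression is exactly the 100 TB NAS footprint.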
RainStor ingests the data, stores it, and makes it available for query and other analytic workloads. It comes in two editions – the Big Data Retention Edition and the Big Data Analytics on Hadoop edition. Both editions provide full SQL-92 and ODBC/JDBC access. According to the company, the Hadoop edition is the only database that runs natively on Hadoop, with access supported through MapReduce and the Pig Latin language. As a massively parallel processing (MPP) database, RainStor runs on the same Hadoop nodes, writing compressed data to HDFS and supporting access to it. It provides security, high availability, and lifecycle management and versioning capabilities.
The idea, then, is that RainStor can dramatically lower the cost of storing data in Hadoop: its compression reduces the node count needed, accelerates the performance of MapReduce jobs, and provides full SQL-92 access. This can reduce the need to transfer data out of the Hadoop cluster to a separate enterprise data warehouse. RainStor allows the Hadoop environment to support real-time query access in addition to its batch-oriented MapReduce processing.
How does it work?
RainStor is not a relational database; instead, it follows the NoSQL movement by storing data non-relationally. In its case, the data is physically stored as a set of trees with linked values and nodes. The idea is illustrated below (source: RainStor).
Say a number of records with the common value yahoo.com are ingested into the system. RainStor would throw away duplicates and store the literal yahoo.com only once, while maintaining references to the records that contained that value. So if the system loads 1 million records and 500K of them contain yahoo.com, the value is stored only once, saving significant storage. This, along with additional pattern deduplication, means that the resulting tree structure holds the same data in a significantly smaller footprint, with a higher compression ratio than other databases on the market, according to RainStor. It also doesn’t require re-inflation, unlike binary zip-file compression, which takes resources and time to decompress. RainStor writes the tree structure to disk as is; when you query, it reads it back from disk. Instead of unraveling all of the trees all of the time, it reads only the trees, and branches of trees, required to fulfill the query.
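A minimal sketch of the value-deduplication idea, storing each distinct literal once and keeping references from records (a hypothetical illustration; RainStor’s actual tree format is proprietary):

```python
# Toy illustration of value deduplication: each distinct literal is
# stored once in a value pool; records are stored as lists of references.
# (Hypothetical sketch -- RainStor's on-disk tree format is proprietary.)
class DedupStore:
    def __init__(self):
        self.values = []          # each distinct literal, stored once
        self.index = {}           # literal -> position in self.values
        self.records = []         # each record is a tuple of references

    def add_record(self, fields):
        refs = []
        for value in fields:
            if value not in self.index:
                self.index[value] = len(self.values)
                self.values.append(value)
            refs.append(self.index[value])
        self.records.append(tuple(refs))

    def get_record(self, i):
        return [self.values[ref] for ref in self.records[i]]

store = DedupStore()
# 3 records, but "yahoo.com" is stored only once in the value pool:
store.add_record(["alice", "yahoo.com"])
store.add_record(["bob", "yahoo.com"])
store.add_record(["carol", "gmail.com"])
print(len(store.values))          # 5 distinct literals, not 6
print(store.get_record(1))        # ['bob', 'yahoo.com']
```

The more repetitive the data (think 500K records sharing yahoo.com), the more the reference pool shrinks the footprint relative to storing every literal in place.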
RainStor is a good example of the kind of database that can enable big data analytics. Just as many companies finally “got” the notion of business analytics and the importance of analytics in decision making, so too are they realizing that, as they accumulate and generate ever-increasing amounts of data, there is opportunity to analyze and act on it.
For example, according to the company, you can put a BI solution, like IBM Cognos, MicroStrategy, Tableau, or SAS, on top of RainStor. RainStor would hold the raw data, and the BI solution would access it either through MapReduce or ODBC/JDBC (i.e. one platform), with no need to use Hive and HQL. RainStor also recently announced a partnership with IBM around BigInsights for its Big Data Analytics on Hadoop edition.
What about big data appliances that are architected for high-performance analytics? RainStor claims that while some big data appliances do have some MapReduce support (Aster Data, for example), it would be cheaper to use its solution together with open source Hadoop. In other words, RainStor on Hadoop would be cheaper than any big data appliance.
It is still early in the game. I am looking forward to seeing some big data analytics implementations which utilize RainStor. I am interested to see use cases that go past querying huge amounts of data and provide some advanced analytics on top of RainStor. Or, big data visualizations with rapid response time on top of RainStor, that only need to utilize a small number of nodes. Please keep me posted, RainStor.
I just started writing a blog for AllAnalytics, focusing on advanced analytics. My first posting outlines five use cases for text analytics. These include voice of the customer, fraud, warranty analysis, lead generation, and customer service routing. Check it out.
Of course, there are many more use cases for text analytics. On the horizontal solutions front, these include enhancing search, survey analysis, and eDiscovery. On the vertical side, the list is huge: medical analysis, other scientific research, government intelligence, and so on.
If you want to learn more about text analytics, please join me for my webinar on Best Practices for Text Analytics this Thursday, April 29th, at 2pm ET. You can register here.
This week marks the one-year anniversary of the IBM Watson computer system’s success at Jeopardy!. Since then, IBM has seen a great deal of interest in Watson. Companies want one of those.
But what exactly is Watson and what makes it unique? What does it mean to have a Watson? And, how is commercial Watson different from Jeopardy Watson?
What is Watson and why is it unique?
Watson is a set of technologies that processes and analyzes massive amounts of both structured and unstructured data in a unique way. One statistic given at the recent IOD conference is that Watson can process and analyze information from 200 million books in three seconds. While Watson is very advanced, it uses technologies that are commercially available, along with some “secret sauce” technologies that IBM Research has either enhanced or developed. It combines software technologies from big data, content and predictive analytics, and industry-specific applications to make it work.
Watson includes several core pieces of technology that make it unique.
So what is this secret sauce? Watson understands natural language, generates and evaluates hypotheses, and adapts and learns.
First, Watson uses Natural Language Processing (NLP). NLP is a very broad and complex field, which has developed over the last ten to twenty years. The goals of NLP are to derive meaning from text. NLP generally makes use of linguistic concepts such as grammatical structures and parts of speech. It breaks apart sentences and extracts information such as entities, concepts, and relationships. IBM is using a set of annotators to extract information like symptoms, age, location, and so on.
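To make the idea of annotators concrete, here is a toy sketch of the kind of extraction described above; the regex, symptom list, and `annotate` function are my own illustration, not IBM’s annotators:

```python
import re

# Toy "annotator" in the spirit described above: pull simple facts such
# as age and symptom mentions out of free text. Real NLP pipelines (and
# IBM's annotators) use full parsers and trained models -- this is only
# a keyword/regex sketch for illustration.
SYMPTOMS = {"fatigue", "palpitations", "hair loss", "muscle weakness"}

def annotate(text):
    notes = {"age": None, "symptoms": []}
    m = re.search(r"(\d+)[- ]year[- ]old", text)
    if m:
        notes["age"] = int(m.group(1))      # extract an "age" entity
    lowered = text.lower()
    for s in sorted(SYMPTOMS):              # extract symptom mentions
        if s in lowered:
            notes["symptoms"].append(s)
    return notes

result = annotate("A 42-year-old patient reports fatigue and hair loss.")
print(result)  # {'age': 42, 'symptoms': ['fatigue', 'hair loss']}
```

Even this trivial version shows the shape of the output: structured fields (entities, concepts) derived from unstructured text, which downstream components can then reason over.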
So NLP by itself is not new; what is new is that Watson processes vast amounts of this unstructured data quickly, using an architecture designed for the purpose.
Second, Watson works by generating hypotheses, which are potential answers to a question. It is trained by feeding question and answer (Q/A) data into the system. In other words, it is shown representative questions and learns from the supplied answers. This is called evidence-based learning. The goal is to generate a model that can produce a confidence score (think logistic regression with a large set of attributes). Watson starts with a generic statistical model, looks at the first Q/A pair, and uses it to tweak the coefficients. As it gains more evidence, it continues to tweak the coefficients until it can “say” confidence is high. Training Watson is key, since what is really happening is that the trainers are building statistical models that are scored. At the end of the training, Watson has feature vectors and models, so that eventually it can use the model to probabilistically score the answers. The key here is something that Jeopardy! did not showcase: Watson is not deterministic (i.e. rule-based); it is probabilistic, and that makes it dynamic.
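The coefficient-tweaking described above resembles online logistic regression. Here is a minimal sketch under that assumption; the features, learning rate, and training data are invented for illustration and are not IBM’s actual models:

```python
import math

# Minimal online logistic regression, in the spirit of the coefficient
# "tweaking" described above: each (features, right-answer?) pair nudges
# the weights so that future confidence scores improve.
# Illustrative sketch only -- Watson's real models are far more elaborate.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def confidence(weights, features):
    return sigmoid(sum(w * f for w, f in zip(weights, features)))

def train(pairs, n_features, lr=0.5, epochs=200):
    weights = [0.0] * n_features              # start from a generic model
    for _ in range(epochs):
        for features, label in pairs:         # label: 1 = correct answer
            err = label - confidence(weights, features)
            weights = [w + lr * err * f for w, f in zip(weights, features)]
    return weights

# Each candidate answer has a bias feature plus one evidence score
# (e.g. how well a supporting passage matches the question).
qa_pairs = [([1.0, 0.9], 1), ([1.0, 0.2], 0), ([1.0, 0.8], 1), ([1.0, 0.1], 0)]
w = train(qa_pairs, 2)
print(confidence(w, [1.0, 0.9]))              # high confidence
print(confidence(w, [1.0, 0.1]))              # low confidence
```

After training, candidates with strong supporting evidence score near 1 and weak candidates near 0, which is exactly the kind of probabilistic confidence the text describes.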
When Watson generates a hypothesis, it then scores the hypothesis based on the evidence. Its goal is to get the right answer for the right reason. (So, theoretically, if there are 5 symptoms that must be positive for a certain disease and 4 that must be negative, and Watson has only 4 of the 9 pieces of information, it could ask for more.) The hypothesis with the highest score is presented. By the end of the analysis, Watson knows both when it has the answer and when it doesn’t.
Here’s an example. Suppose you go in to see your doctor because you are not feeling well. Specifically, you might have heart palpitations, fatigue, hair loss, and muscle weakness. You decide to see a doctor to determine whether there is something wrong with your thyroid or whether it is something else. If your doctor has access to a Watson system, he could use it to help advise him regarding your diagnosis. In this case, Watson would already have ingested and curated all of the information in books and journals associated with thyroid disease. It would also have the diagnoses and related information of prior patients, drawn from the electronic medical records of the hospital and of other doctors in the practice. Based on the first set of symptoms you report, it would generate hypotheses along with associated probabilities (i.e. 60% hyperthyroidism, 40% anxiety, etc.). It might then ask for more information. As it is fed this information, e.g. the patient’s history, Watson would continue to refine its hypotheses along with the probability of each being correct. After it has been given all of the information, iterated through it, and presented the diagnosis with the highest confidence level, the physician would use this information to help him make the diagnosis and develop a treatment plan. If Watson doesn’t know the answer, it will state that it does not have an answer or doesn’t have enough information to provide one.
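The last step, answering only when confidence is high enough, can be sketched as follows (my illustration; the hypotheses, scores, and threshold are invented):

```python
# Sketch of the "answer or admit uncertainty" behavior described above:
# rank candidate hypotheses by confidence and abstain when even the best
# falls below a threshold. (My illustration, not Watson's actual logic.)
def best_hypothesis(scores, threshold=0.5):
    """scores: dict mapping hypothesis -> confidence in [0, 1]."""
    name, conf = max(scores.items(), key=lambda kv: kv[1])
    if conf < threshold:
        return None, conf          # not confident enough to answer
    return name, conf

confident = best_hypothesis({"hyperthyroidism": 0.6, "anxiety": 0.4})
print(confident)                   # ('hyperthyroidism', 0.6)

unsure = best_hypothesis({"hyperthyroidism": 0.35, "anxiety": 0.3})
print(unsure)                      # (None, 0.35) -- would ask for more data
```

Returning `None` with the (low) confidence is what lets the system say “I don’t have enough information” rather than guessing.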
IBM likens the process of training a Watson to teaching a child how to learn. A child can read a book to learn. However, he can also learn by having a teacher ask questions and reinforce the answers about that text.
Can I buy a Watson?
Watson will be offered in the cloud in an “as a service” model. Since Watson is in its own class, let’s call this Watson as a Service (WaaS). Since Watson’s knowledge is essentially built in tiers, the idea is that IBM will provide the basic core knowledge in a particular WaaS solution space, say all of the corpus about a particular subject – like diabetes – and then different users could build on this.
For example, in September IBM announced an agreement to create the first commercial applications of Watson with WellPoint – a health benefits company. Under the agreement, WellPoint will develop and launch Watson-based solutions to help improve patient care. IBM will develop the base Watson healthcare technology on which WellPoint’s solution will run. Last month, Cedars-Sinai signed on with WellPoint to help develop an oncology solution using Watson. Cedars-Sinai’s oncology experts will help develop recommendations on appropriate clinical content for the WellPoint health care solutions. They will assist in the evaluation and testing of these tools. In fact, these oncologists will “enter hypothetical patient scenarios, evaluate the proposed treatment options generated by IBM Watson, and provide guidance on how to improve the content and utility of the treatment options provided to the physicians.” Wow.
Moving forward, picture potentially large numbers of core knowledge bases that are trained and available for particular companies to build upon. This would be available in a public cloud model and potentially a private one as well, but with IBM involvement. This might include Watsons for law or financial planning or even politics (just kidding) – any area where there is a huge corpus of information that people need to wrap their arms around in order to make better decisions.
IBM is now working with its partners to figure out what the user interface for these Watsons-as-a-Service might look like. Will Watson ask the questions? Can end users, say doctors, put in their own information for Watson to use? This remains to be seen.
Ready for Watson?
In the meantime, IBM recently rolled out its “Ready for Watson” program. The idea is that a move to Watson might not be a linear progression; it depends on the business problem that companies are looking to solve. So IBM has tagged certain of its products as “ready” to be incorporated into a Watson solution. IBM Content and Predictive Analytics for Healthcare is one example; it combines IBM’s content analytics and predictive analytics solutions, which are components of Watson. Therefore, a company that used this solution could migrate it to a Watson-as-a-Service deployment down the road.
So happy anniversary, IBM Watson! You have many people excited and some people a little bit scared. For myself, I am excited to see where Watson is on its first anniversary and look forward to seeing what progress it has made by its second.
Next in my discussion of big data providers is IBM. Big data plays right into IBM’s portfolio of solutions in the information management space. It also dovetails very nicely with the company’s Smarter Planet strategy. Smarter Planet holds the vision of the world as a more interconnected, instrumented, and intelligent place. IBM’s Smarter Cities and Smarter Industries are all part of its solutions portfolio. For companies to be successful in this type of environment, a new emphasis on big data and big data analytics is required.
Here’s a quick look at how IBM is positioning around big data, some of its product offerings, and use cases for big data analytics.
According to IBM, big data has three characteristics: volume, velocity, and variety. IBM is talking about large volumes of both structured and unstructured data. This can include audio and video together with text and traditional structured data, and it can be gathered and analyzed in real time.
IBM has both hardware and software products to support both big data and big data analytics. These products include:
- InfoSphere Streams – a platform that can be used to perform deep analysis of massive volumes of relational and non-relational data types with sub-millisecond response times. Cognos Real-time Monitoring can also be used with InfoSphere Streams for dashboarding capabilities.
- InfoSphere BigInsights – a product that consists of IBM research technologies on top of open source Apache Hadoop. BigInsights provides core installation, development tools, web-based UIs, connectors for integration, integrated text analytics, and BigSheets for end-user visualization.
- IBM Netezza – a high-capacity appliance that allows companies to analyze petabytes of data in minutes.
- Cognos Consumer Insights – leverages BigInsights and text analytics capabilities to perform social media sentiment analysis.
- IBM SPSS – IBM’s predictive and advanced analytics platform that can read data from various data sources such as Netezza and be integrated with InfoSphere Streams to perform advanced analysis.
- IBM Content Analytics – uses text analytics to analyze unstructured data. This can sit on top of InfoSphere BigInsights.
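To illustrate the streaming-analytics pattern behind a product like InfoSphere Streams, here is a generic sliding-window sketch in plain Python; it shows only the pattern, not IBM’s API:

```python
from collections import deque

# Generic sliding-window stream processing, the pattern behind platforms
# like InfoSphere Streams: compute rolling statistics as events arrive,
# instead of storing everything first and querying later.
# (Plain-Python illustration of the pattern only -- not IBM's API.)
def rolling_mean(stream, window=3):
    buf = deque(maxlen=window)        # keeps only the last `window` values
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

readings = [10, 12, 14, 40, 12, 10]   # e.g. a stream of sensor readings
means = list(rolling_mean(readings))
print([round(m, 1) for m in means])   # [10.0, 11.0, 12.0, 22.0, 22.0, 20.7]
```

The key property is that each result is emitted as its event arrives, with bounded memory, which is what makes sub-second responses over unbounded streams feasible.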
At the Information on Demand (IOD) conference a few months ago, IBM and its customers presented many use cases around big data and big data analytics. Here is what some of the early adopters are doing:
- Engineering: Analyzing hourly wind data, radiation, heat and 78 other attributes to determine where to locate the next wind power plant.
- Analyzing social media data, for example to understand what fans are saying about a sports game in real time.
- Analyzing customer activity at a zoo to understand guest spending habits, likes and dislikes.
- Analyzing healthcare data:
  - Analyzing streams of data from medical devices in neonatal units.
  - Healthcare predictive analytics: one hospital is using a product called Content and Predictive Analytics to limit early hospital discharges that would result in re-admittance to the hospital.
IBM is working with its clients and prospects to implement big data initiatives. These initiatives generally involve a services component, given the range of product offerings IBM has in the space and the newness of the market. IBM is making significant investments in tools, integrated analytic accelerators, and solution accelerators to reduce the time and cost of deploying these kinds of solutions.
At IBM, big data is about “the art of the possible.” According to the company, price points on products that may have been too expensive five years ago are coming down. IBM is a good example of a vendor that is both working with customers to push the envelope in terms of what is possible with big data and, at the same time, educating the market about big data. The company believes that big data can change the way companies do business. It’s still early in the game, but IBM has a well-articulated vision around big data. And, the solutions its clients discussed were big, bold, and very exciting. The company is certainly a leader in this space.
Next up in my discussion on big data providers is SAS. What’s interesting about SAS is that, in many ways, big data analytics is really just an evolution for the company. One of the company’s goals has always been to support complex analytical problem solving. It is well respected by its customers for its ability to analyze data at scale. It is also well regarded for its ETL capabilities. SAS has had parallel processing capabilities for quite some time. Recently, the company has been pushing analytics into databases and appliances. So, in many ways big data is an extension of what SAS has been doing for quite a while.
At SAS, big data goes hand in hand with big data analytics. The company is focused on analyzing big data to make decisions. SAS defines big data as follows: “When volume, velocity and variety of data exceeds an organization’s storage or compute capacity for accurate and timely decision-making.” However, SAS also includes another attribute when discussing big data: relevance to the analysis. In other words, big data analytics is not simply about analyzing large volumes of disparate data types in real time. It is also about helping companies analyze relevant data.
SAS can support several different big data analytics scenarios. It can deal with complete datasets. It can also deal with situations where it is not technically feasible to utilize an entire big data set or where the entire set is not relevant to the analysis. In fact, SAS supports what it terms a “stream it, store it, score it” paradigm to deal with big data relevance. It likens this to an email spam filter that determines what emails are relevant for a person. Only appropriate emails go to the person to be read. Likewise, only relevant data for a particular kind of analysis might be analyzed using SAS statistical and data mining technologies.
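The spam-filter analogy can be sketched as a simple relevance filter applied to a stream; the scoring function, keywords, and threshold below are hypothetical illustrations, not SAS software:

```python
# Sketch of "stream it, store it, score it": score each incoming record
# against a relevance model and persist only what the analysis needs.
# (Hypothetical illustration of the pattern -- not SAS's implementation.)
def relevance_score(record, keywords):
    text = record["text"].lower()
    hits = sum(1 for k in keywords if k in text)
    return hits / len(keywords)

def filter_stream(stream, keywords, threshold=0.5):
    for record in stream:
        if relevance_score(record, keywords) >= threshold:
            yield record              # "store it" -- relevant, keep it

incoming = [
    {"id": 1, "text": "Customer complains about billing error"},
    {"id": 2, "text": "Weekly cafeteria menu update"},
    {"id": 3, "text": "Billing dispute escalated by customer"},
]
kept = list(filter_stream(incoming, keywords=["billing", "customer"]))
print([r["id"] for r in kept])        # [1, 3]
```

Like the spam filter, the point is that irrelevant records never reach the expensive downstream stage, so the warehouse or appliance holds only data worth analyzing.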
The specific solutions that support the “stream it, store it, score it” model include:
- Data reduction of very large data volumes using stream processing. This occurs at the data-preparation stage. SAS Information Management capabilities are leveraged to interface with various data sources, which can be streamed into the platform and filtered based on analytical models built from what SAS terms “organizational knowledge,” using products like SAS Enterprise Miner, SAS Text Miner, and SAS Social Network Analytics. SAS Information Management (SAS DI Studio and DI Server, which include DataFlux capabilities) provides the high-speed filtering and data enrichment (with additional metadata used to build more indices that make the downstream analytics process more efficient). In other words, it utilizes analytics and data management to prioritize, categorize, and normalize data while determining relevance. This means that massive amounts of data do not have to be stored in an appliance or data warehouse.
- SAS High Performance Computing (HPC). SAS HPC includes a combination of grid, in-memory, and in-database technologies. It is appliance-ready software built on specifically configured hardware from SAS database partners. In addition to the technology, SAS provides pre-packaged solutions that use the in-memory architecture approach.
- SAS Business Analytics. SAS offerings include a combination of reporting, BI, and other advanced analytics functionality (including text analytics, forecasting, operations research, and model management and deployment) using some of the same tools (SAS Enterprise Miner, etc.) as listed above. SAS also includes support for mobile devices.
Of course, this same set of products can be used to handle a complete data set.
Additionally, SAS supports a Hadoop implementation to enable its customers to push data into Hadoop and be able to manage it. SAS analytics software can be used to run against Hadoop for analysis. The company is working to utilize SAS within Hadoop so that data does not have to be brought out to SAS software.
SAS has utilized its software to help clients solve big data problems in a number of areas including:
- Retail: Analyzing data in real time at check-out to determine store coupons at big box stores; Markdown optimization at point of sale; Assortment planning
- Finance: Scoring transactional data in real time for credit card fraud prevention and detection; risk modeling, e.g. moving from a single loan risk model to running multiple models against a complete, segmented data set.
- Customer Intelligence: using social media information and social network analysis
For example, one large U.S. insurance company is scoring over 600,000 records per second on a multi-node parallel set of processors.
A differentiator of the SAS approach is that, because the company has been growing its big data capabilities over time, all of the technologies are delivered or supported on a common framework or platform. While newer vendors may try to downplay SAS by saying that its technology has been around for thirty years, why is that a bad thing? That time has let the company grow its analytics arsenal and put together a cohesive solution architected so that the pieces work together. Some of the newer big data analytics vendors don’t have nearly the analytics capability of SAS. Experience matters. Enough said for now.