Are you ready for IBM Watson?

This week marks the one-year anniversary of the IBM Watson computer system's win on Jeopardy!. Since then, IBM has seen a great deal of interest in Watson. Companies want one of their own.

But what exactly is Watson and what makes it unique?  What does it mean to have a Watson?  And, how is commercial Watson different from Jeopardy Watson?

What is Watson and why is it unique?

Watson is a new class of analytic solution

Watson is a set of technologies that processes and analyzes massive amounts of both structured and unstructured data in a unique way. One statistic given at the recent IOD conference is that Watson can process and analyze information from 200 million books in three seconds. While Watson is very advanced, it uses commercially available technologies along with some “secret sauce” that IBM Research has either enhanced or developed. It combines software from big data, content and predictive analytics, and industry-specific applications to make it all work.

Watson includes several core pieces of technology that make it unique

So what is this secret sauce?  Watson understands natural language, generates and evaluates hypotheses, and adapts and learns.

First, Watson uses Natural Language Processing (NLP). NLP is a very broad and complex field that has developed over the last ten to twenty years. The goal of NLP is to derive meaning from text. NLP generally makes use of linguistic concepts such as grammatical structure and parts of speech. It breaks sentences apart and extracts information such as entities, concepts, and relationships. IBM uses a set of annotators to extract information like symptoms, age, location, and so on.
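To make the annotator idea concrete, here is a toy sketch in Python. The patterns, field names, and sample sentence are all invented for illustration; real clinical NLP pipelines rely on trained linguistic models rather than a handful of regular expressions.

```python
import re

# Toy annotators: each one pulls a specific kind of fact out of free text.
# The patterns below are invented for illustration only.
ANNOTATORS = {
    "age": re.compile(r"\b(\d{1,3})[- ]year[- ]old\b", re.IGNORECASE),
    "symptom": re.compile(r"\b(palpitations|fatigue|hair loss|muscle weakness)\b", re.IGNORECASE),
    "location": re.compile(r"\bin (Boston|New York|Chicago)\b", re.IGNORECASE),
}

def annotate(text):
    """Return whatever each annotator finds in the text."""
    results = {}
    for label, pattern in ANNOTATORS.items():
        matches = [m.group(1) for m in pattern.finditer(text)]
        if matches:
            results[label] = matches
    return results

note = "A 42-year-old patient in Boston reports fatigue, hair loss, and palpitations."
print(annotate(note))
# {'age': ['42'], 'symptom': ['fatigue', 'hair loss', 'palpitations'], 'location': ['Boston']}
```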

So NLP by itself is not new. What is new is that Watson processes vast amounts of this unstructured data quickly, using an architecture designed for exactly that.

Second, Watson works by generating hypotheses, which are potential answers to a question. It is trained by feeding question-and-answer (Q/A) data into the system; in other words, it is shown representative questions and learns from the supplied answers. This is called evidence-based learning. The goal is to generate a model that can produce a confidence score (think logistic regression with a bunch of attributes). Watson starts with a generic statistical model, looks at the first Q/A pair, and uses it to tweak the coefficients. As it gains more evidence, it continues to tweak the coefficients until it can “say” that its confidence is high. Training Watson is key, because what is really happening is that the trainers are building statistical models that are then scored. At the end of training, Watson has feature vectors and models it can use to probabilistically score candidate answers. The key here is something the Jeopardy! matches did not showcase: Watson is not deterministic (i.e., rule-based). It is probabilistic, and that makes it dynamic.
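As a rough sketch of the statistical idea (not Watson's actual model), here is what confidence scoring with a logistic regression might look like, using scikit-learn and made-up evidence features for a handful of candidate answers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row describes one candidate answer with a few evidence features
# (e.g., how well it matches the question type, how much supporting text was
# found, how many sources agree). Labels say whether the candidate turned out
# to be correct. All of the numbers here are made up for illustration.
X_train = np.array([
    [0.9, 0.8, 3],   # strong evidence, correct
    [0.2, 0.1, 0],   # weak evidence, wrong
    [0.7, 0.6, 2],
    [0.3, 0.4, 1],
    [0.8, 0.9, 4],
    [0.1, 0.2, 0],
])
y_train = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression()
model.fit(X_train, y_train)

# Scoring new candidates: the predicted probability of being correct plays
# the role of the confidence score, and more training evidence keeps
# adjusting the coefficients behind it.
candidates = np.array([[0.85, 0.7, 3], [0.25, 0.3, 1]])
for features, confidence in zip(candidates, model.predict_proba(candidates)[:, 1]):
    print(features, f"confidence = {confidence:.2f}")
```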

When Watson generates a hypothesis, it scores that hypothesis against the evidence. Its goal is to get the right answer for the right reason. (So, theoretically, if there are five symptoms that must be positive for a certain disease and four that must be negative, and Watson has only four of the nine pieces of information, it could ask for more.) The hypothesis with the highest score is presented. By the end of the analysis, Watson knows when it is confident in an answer and when it is not.

Here’s an example. Suppose you are not feeling well. Specifically, you might have heart palpitations, fatigue, hair loss, and muscle weakness, and you decide to see a doctor to determine whether there is something wrong with your thyroid or it is something else. If your doctor has access to a Watson system, he could use it to help advise him on your diagnosis. In this case, Watson would already have ingested and curated all of the information in the books and journals associated with thyroid disease. It would also have the diagnoses and related information from other patients at the hospital and from other doctors in the practice, drawn from the electronic medical records of prior cases in its data banks. Based on the first set of symptoms you report, it would generate a hypothesis along with associated probabilities (i.e., 60% hyperthyroidism, 40% anxiety, etc.). It might then ask for more information. As it is fed this information, such as the patient history, Watson would continue to refine its hypothesis along with the probability of that hypothesis being correct. Once it has all of the information and has iterated through it, it presents the diagnosis with the highest confidence level, and the physician uses this information to help make the diagnosis and develop a treatment plan. If Watson doesn’t know the answer, it will state that it does not have an answer or does not have enough information to provide one.

IBM likens the process of training a Watson to teaching a child how to learn. A child can learn by reading a book, but he can also learn from a teacher who asks questions about that text and reinforces the answers.

Can I buy a Watson?

Watson will be offered in the cloud in an “as a service” model. Since Watson is in its own class, let’s call this Watson as a Service (WaaS). Because Watson’s knowledge is essentially built in tiers, the idea is that IBM will provide the basic core knowledge for a particular WaaS solution space, say the corpus about a particular subject like diabetes, and then different users could build on this.

For example, in September IBM announced an agreement with WellPoint, a health benefits company, to create the first commercial applications of Watson. Under the agreement, WellPoint will develop and launch Watson-based solutions to help improve patient care, and IBM will develop the base Watson healthcare technology on which WellPoint’s solutions will run. Last month, Cedars-Sinai signed on with WellPoint to help develop an oncology solution using Watson. Cedars-Sinai’s oncology experts will help develop recommendations on appropriate clinical content for the WellPoint health care solutions and will assist in evaluating and testing these tools. In fact, these oncologists will “enter hypothetical patient scenarios, evaluate the proposed treatment options generated by IBM Watson, and provide guidance on how to improve the content and utility of the treatment options provided to the physicians.”  Wow.

Moving forward, picture potentially large numbers of core knowledge bases that are trained and available for particular companies to build upon. These would be available in a public cloud model, and potentially a private one as well, but with IBM involvement. They might include Watsons for law or financial planning or even politics (just kidding) – any area where there is a huge corpus of information that people need to wrap their arms around in order to make better decisions.

IBM is now working with its partners to figure out what the user interface for these Watson-as-a-Service offerings might look like. Will Watson ask the questions? Can end users, say doctors, put in their own information for Watson to use? This remains to be seen.

Ready for Watson?

In the meantime, IBM recently rolled out its “Ready for Watson” program. The idea is that a move to Watson might not be a linear progression; it depends on the business problem a company is looking to solve. So IBM has tagged certain of its products as “ready” to be incorporated into a Watson solution. IBM Content and Predictive Analytics for Healthcare is one example; it combines IBM’s content analytics and predictive analytics solutions, which are components of Watson. Therefore, a company that uses this solution could migrate it to a Watson-as-a-Service deployment down the road.

So happy anniversary, IBM Watson! You have many people excited and some people a little bit scared. For myself, I am excited to see where Watson is on its first anniversary and look forward to seeing what progress it has made by its second.

EMC and Big Data – Observations from EMC World 2011

I attended EMC’s User Conference last week in Las Vegas. The theme of the event was Big Data meets the Cloud. So, what’s going on with Big Data and EMC? Does this new strategy make sense?

EMC acquired Greenplum in 2010. At the time, EMC described Greenplum as a “shared nothing, massively parallel processing (MPP) data warehousing system.” In other words, it could handle petabytes of data. While the term data warehouse denotes a fairly static data store, at the user conference EMC executives characterized big data as a high volume of disparate data, both structured and unstructured, that is growing fast and may need to be processed in real time. Big data is becoming increasingly important to the enterprise not just because of the need to store this data but also because of the need to analyze it. Greenplum has some of its own analytical capabilities, but the company recently formed a partnership with SAS to add more oomph to its analytical arsenal. At the conference, EMC also announced that it has included Hadoop as part of its Greenplum infrastructure to handle unstructured information.

Given EMC’s strength in data storage and content management, it is logical for EMC to move into the big data arena. However, I am left with some unanswered questions, chiefly about how EMC will make storage, content management, data management, and data analysis all fit together.

• Data Management. How will data management issues (e.g., quality, loading) be handled? EMC has a partnership with Informatica, and SAS has data management capabilities, but how will all of these components work together?
• Analytics. What analytics solutions will emerge from the partnership with SAS? This is important since EMC is not necessarily known for analytics. SAS is a leader in analytics and can make a great partner for EMC, but its partnership with EMC is not exclusive. Additionally, EMC made a point of the fact that 90% of most enterprises’ data is unstructured. EMC has incorporated Hadoop into Greenplum, ostensibly to deal with unstructured data, and EMC executives mentioned that the open source community has even begun developing analytics around Hadoop. EMC Documentum also has some text analytics capabilities as part of Center Stage, and SAS has text analytics capabilities of its own. How will all of these different components converge into a plan?
• Storage and content management. How do the storage and content management parts of the business fit into the big data roadmap? It was not clear from the discussions at the meeting how EMC plans to integrate its storage platforms into an overall big data analysis strategy. In the short term we may not see a cohesive strategy emerge.

EMC is taking on the right issues by focusing on customers’ needs to manage big data. However, it is a complicated area and I don’t expect EMC to have all of the answers today. The market is still nascent. Rather, it seems to me that EMC is putting its stake in the ground around big data. This will be an important stake for the future.

Five Analytics Predictions for 2011

In 2011 analytics will take center stage as a key trend because companies are at a tipping point with the volume of data they have and their urgent need to do something about it. So, with 2010 now past and 2011 to look forward to, I wanted to take the opportunity to submit my predictions (no pun intended) regarding the analytics and advanced analytics market.

  • Advanced Analytics gains more steam. Advanced analytics was hot last year and will remain so in 2011. Growth will come from at least three different sources. First, advanced analytics will increase its footprint in large enterprises; a number of predictive and advanced analytics vendors tried to make their tools easier to use in 2009-2010, so in 2011 expect new users to come on board at companies already deploying the technology. Second, more companies will begin to purchase the technology because they see it as a way to increase top-line revenue while gaining deeper insight about their customers. Finally, small and mid-sized companies will get into the act, looking for lower-cost, user-friendly tools.
  • Social Media Monitoring Shake-Out. The social media monitoring and analysis market is one crowded and confused space, with close to 200 vendors competing across no-cost, low-cost, and enterprise-cost solution classes. Expect 2011 to be a year of folding and consolidation, with at least a third of these companies tanking. Before this happens, expect new entrants to the market for low-cost social media monitoring platforms and everyone screaming for attention.
  • Discovery Rules. Text analytics will become a mainstream technology as more companies finally begin to understand the difference between simply searching information and actually discovering insight. Part of this will be due to the impact of social media monitoring services that use text analytics to discover, rather than simply search, social media to find topics and patterns in unstructured data. However, innovative companies will continue to build text analytics solutions that do more than just analyze social media.
  • Sentiment Analysis is Supplanted by Other Measures. Building on prediction #3, by the end of 2011 sentiment analysis won’t be the be-all and end-all of social media monitoring. Yes, it is important, but the reality is that most low-cost social media monitoring vendors don’t do it well. They may tell you that they get 75-80% accuracy, but it ain’t so. In fact, it is probably more like 30-40%. After many users have gotten burned by not questioning sentiment scores, they will begin to look for other meaningful measures.
  • Data in the cloud continues to expand, as does BI SaaS. Expect there to still be a lot of discussion around data in the cloud. However, business analytics vendors will continue to launch SaaS BI solutions, and companies will continue to buy them, especially small and mid-sized companies that find the SaaS model a good alternative to some pricey enterprise solutions. Expect to see at least ten more vendors enter the market.

On-premise becomes a new word. This last prediction is not really related to analytics (hence the five rather than six predictions), but I couldn’t resist. People will continue to use the term “on-premise,” rather than “on-premises,” when referring to cloud computing, even though it is incorrect. This will continue to drive many people crazy, since premise means “a proposition supporting or helping to support a conclusion” (dictionary.com) rather than a singular form of premises. Those of us in the know will finally give up correcting everyone else.

Analyzing Big Data

The term “Big Data” has gained popularity over the past 12-24 months as a) the amount of data available to companies has continually increased and b) technologies have emerged to manage this data more effectively. Of course, large volumes of data have been around for a long time. For example, I worked in the telecommunications industry for many years analyzing customer behavior, which required analyzing call records. The problem was that the technology (particularly the infrastructure) couldn’t necessarily support this kind of compute-intensive analysis, so we often analyzed billing records rather than streams of call detail records, or sampled the records instead.

Now companies are looking to analyze everything from the genome to Radio Frequency ID (RFID) tags to business event streams. And newer technologies have emerged to handle massive (TB and PB) quantities of data more effectively. Often this processing takes place on clusters of computers, meaning that work is spread across many machines. The advent of cloud computing, and the elastic nature of the cloud, has furthered this movement.

A number of frameworks have also emerged to deal with large-scale data processing and support large-scale distributed computing. These include MapReduce and Hadoop:

-MapReduce is a software framework introduced by Google to support distributed computing on large sets of data. It is designed to take advantage of cloud resources, with the computing done across large clusters of machines; each machine in a cluster is referred to as a node. MapReduce can deal with both structured and unstructured data. Users specify a map function that processes a key/value pair to generate a set of intermediate pairs, and a reduce function that merges these pairs (a minimal sketch of the idea follows this list).
-Apache Hadoop is an open source distributed computing platform that is written in Java and inspired by MapReduce. Data is stored across many machines in blocks that are replicated to other servers. It uses a hash algorithm to cluster data elements that are similar. Hadoop can create a map of organized key/value pairs that can be output to a table, to memory, or to a temporary file for analysis.
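To make the map/reduce idea concrete, here is a toy, single-process word-count sketch in Python. The real frameworks run the same two steps in parallel across many nodes and handle the shuffle, fault tolerance, and storage for you; none of that machinery is shown here.

```python
from collections import defaultdict

def map_step(document):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in document.lower().split():
        yield word, 1

def reduce_step(word, counts):
    """Reduce: merge all intermediate values that share a key."""
    return word, sum(counts)

documents = ["big data meets the cloud", "the cloud meets big data"]

# "Shuffle": group the intermediate pairs by key before reducing.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_step(doc):
        grouped[word].append(count)

results = dict(reduce_step(word, counts) for word, counts in grouped.items())
print(results)  # {'big': 2, 'data': 2, 'meets': 2, 'the': 2, 'cloud': 2}
```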

But what about tools to actually analyze this massive amount of data?

Datameer

I recently had a very interesting conversation with the folks at Datameer. Datameer formed in 2009 to provide business users with a way to analyze massive amounts of data. The idea is straightforward: provide a platform to collect and read different kinds of large data stores, put the data into a Hadoop framework, and then provide tools for analyzing it. In other words, hide the complexity of Hadoop and provide analysis tools on top of it. The folks at Datameer believe their solution is particularly useful for data sets greater than 10 TB, where a company may have hit a cost wall using traditional technologies but a business user might still want to analyze some kind of behavior. So website activity, CRM systems, phone records, and POS data might all be candidates for analysis. Datameer provides 164 functions (e.g., group, average, median) for business users, with APIs to target more specific requirements.

For example, suppose you’re in marketing at a wireless service provider and you offered a “free minutes” promotion. You want to analyze the call detail records of those customers who made use of the program to get a feel for how customers would use cell service if given unlimited minutes. The chart below shows the call detail records from one particular day of the promotion – July 11th. The chart shows the call number (MDN) as well as the time the call started and stopped and the duration of the call in milliseconds. Note that the data appear under the “analytics” tab. The “Data” tab provides tools to read different data sources into Hadoop.

This is just a snapshot – there may be TB of data from that day. So, what about analyzing this data? The chart below illustrates a simple analysis of the longest calls and the phone numbers those calls came from. It also illustrates basic statistics about all of the calls on that day – the average, median, and maximum call duration.
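As a rough illustration of the kind of summary described above (this is not Datameer's spreadsheet-style interface), here is how the same questions might be answered with pandas on a handful of made-up call detail records:

```python
import pandas as pd

# Made-up call detail records for one day of the promotion: phone number (MDN),
# call start time, and call duration in milliseconds.
cdrs = pd.DataFrame({
    "mdn": ["555-0101", "555-0102", "555-0101", "555-0103", "555-0102"],
    "start": pd.to_datetime([
        "2010-07-11 09:01", "2010-07-11 09:05", "2010-07-11 10:30",
        "2010-07-11 11:15", "2010-07-11 12:00",
    ]),
    "duration_ms": [180_000, 2_400_000, 60_000, 900_000, 45_000],
})

# Basic statistics for the day: average, median, and maximum call duration.
print(cdrs["duration_ms"].agg(["mean", "median", "max"]))

# The longest calls and the numbers they came from.
print(cdrs.nlargest(3, "duration_ms")[["mdn", "duration_ms"]])

# Total minutes per subscriber: a simple view of how the free minutes were used.
print((cdrs.groupby("mdn")["duration_ms"].sum() / 60_000).rename("minutes"))
```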

From this brief example, you can start to visualize the kind of analysis that is possible with Datameer.

Note too that since Datameer runs on top of Hadoop, it can deal with unstructured as well as structured data. The company has some solutions in the unstructured realm (such as basic analysis of Twitter feeds) and is working to provide more sophisticated tools. Datameer offers its software either on a SaaS license or on premises.

In the Cloud?

Not surprisingly, early adopters of the technology are using it in a private cloud model. This makes sense, since companies often want to keep control of their own data. Some of these companies already have Hadoop clusters in place and are looking for analytics capabilities for business use. Others are dealing with big data but have not yet adopted Hadoop; they are looking for a complete “big data BI” type of solution.

So, will there come a day when business users can analyze massive amounts of data without having to drag IT entirely into the picture? Utilizing BI adoption as a model, the folks from Datameer hope so. I’m interested in any thoughts readers might have on this topic!

Metrics Matter

I had a very interesting conversation last week with Dyke Hensen, SVP of Product Strategy for PivotLink. For those of you not familiar with PivotLink, the company is a SaaS BI provider that has been in business for about 10 years (since before the term SaaS became popular). Historically, the company has worked with small to mid-sized firms, which often have large volumes of data (hundreds of millions of rows) drawn from disparate data sources. For example, the company has done a lot of work with retail POS systems, ecommerce sites, and the like. PivotLink enables companies to integrate information into a large columnar database and create dashboards to help slice and dice the information for decision-making.

Recently, the company announced ReadiMetrix, a SaaS BI service designed to provide “best practices-based metrics to accelerate time to insight.” The company provides metrics in three areas: Sales, Marketing, and HR. These are actionable measures that companies can use to measure themselves against their objectives. If some of this sounds vaguely familiar (e.g., LucidEra), you might be asking yourself, “Can this succeed?” Here are four reasons to think that it might:

  • PivotLink has been around for the past decade. It has an established customer base and business model, and it knows what its customers want. It should be able to upsell existing customers, and it knows how to sell to new ones.
  • From a technical perspective, ReadiMetrix is not a tab in Salesforce.com like many other business SaaS services. Rather, the company is partnering with integrators like Boomi to provide the connectors to on-premises as well as cloud-based applications. So it is not trying to do the integration itself (which often trips companies up). The integration also utilizes an SOA-based approach, which enables flexibility.
  • The company is building a community of best practices around metrics to continue to grow what it can provide and to raise awareness around the importance of metrics.
  • SaaS BI has some legs. Since the economic downturn, companies have realized the importance of gaining insight from their data, and BI companies of all kinds (on and off premises) have benefited from this. Additionally, the concept of a metric is not necessarily new (think Balanced Scorecard and other measurement systems), so the chasm has been crossed in that regard.

Of course, a critical factor for success will be whether companies actually think they need or want these kinds of metrics. Many companies may believe that they are “all set” when it comes to metrics. However, all too often I’ve seen firms think that “more is better” when it comes to information, rather than considering a selected number of metrics with drill-down capability underneath. The right metrics require some thought. I think the idea of an established set of metrics, developed in conjunction with a best practices community, might be appealing for companies that do not have expertise in developing their own. It will be important for PivotLink to educate the market on why these categories of metrics matter and what their value is.

My Take on the SAS Analyst Conference

I just got back from the SAS analyst event that was held in Steamboat Springs, Colorado.   It was a great meeting.  Here are some of the themes I heard over the few days I was there:

SAS is a unique place to work.

Consider the following: SAS revenue per employee is somewhat lower than the software industry average because everyone is on the payroll. That’s right. Everyone from the groundskeepers to the health clinic professionals to those involved in advertising is on the SAS payroll. The company treats its employees very well, providing fitness facilities and on-site day care (also on the payroll). You don’t even have to buy your own coffee or soda! The company has found that these kinds of perks have a positive impact. SAS announced no layoffs in 2009, which further increased morale and productivity, and the company actually saw increased profits in 2009. Executives from SAS also made the point that even though they might have their own advertising staff and so on, they do not want to be insular. The company knows it needs new blood and new ideas. On that note, check out the next two themes:

Innovation is very important to SAS.

Here are some examples:

  • Dr. Goodnight gave his presentation using the latest version of the SAS BI dashboard, which looked pretty slick.
  • SAS has recently introduced some very innovative products, and the trend will continue. One example is its social network analysis product, which has been doing very well in the market. The product analyzes social networks and can, for example, uncover groups of people working together to commit fraud; it was able to find $32M in welfare fraud in a matter of weeks (a toy sketch of this kind of graph analysis follows this list).
  • SAS continues to enhance its UI, which it has been beaten up about in the past. We also got pre-briefed on some new product announcements that I can’t talk about yet, but other analysts did tweet about them at the conference. There were a lot of tweets at this conference, and they were analyzed in real time.
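To give a flavor of the kind of graph analysis involved (a toy sketch, not SAS's product), here is a small example using the networkx library, with invented links between people who share an address, a phone number, or a claim:

```python
import networkx as nx

# Invented relationships; in practice these edges would come from shared
# addresses, phone numbers, bank accounts, or claims.
edges = [
    ("alice", "bob"), ("bob", "carol"), ("carol", "alice"),  # a tight cluster
    ("dave", "erin"),
    ("frank", "grace"), ("grace", "heidi"),
]

G = nx.Graph(edges)

# Connected components are a first cut at "rings"; larger, denser components
# are candidates for closer review by an investigator.
for component in nx.connected_components(G):
    subgraph = G.subgraph(component)
    print(sorted(component), "density =", round(nx.density(subgraph), 2))
```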

The partnership with Accenture is a meaningful one.

SAS execs stated that although they may not have that many partnerships, they try to make the ones they have very real. While, on the surface, the recent announcement regarding the Accenture SAS Analytics Group might seem like a me-too after IBM BAO, it is actually different. Accenture’s goal is to transform the front office in the way ERP/CRM was transformed. It wants to “take the what and turn it into so what and now what.” It views analytics not simply as a technology, but as a new competitive management science that enables agility. It obviously won’t market it that way, as the company takes a business focus. Look for the Accenture SAS Analytics Group to put out offerings such as churn management as a service and risk and fraud detection as a service, operationalized as part of a business process.

The Cloud!

SAS has a number of SaaS offerings in the market and will, no doubt, introduce more. What I found refreshing was that SAS takes the issues around SaaS very seriously. You’d expect a data company to be concerned about its customers’ data, and it is.

Best line of the conference

SAS is putting a lot of effort into making its products easier to use, and that is a good thing. There are ways to get analysis to people who aren’t that analytical. In a discussion about the skill level required to use advanced analytics, however, one customer commented, “Just because you can turn on a stove doesn’t mean you know how to cook.”  More on this in another post.

Five Predictions for Advanced Analytics in 2010

With 2010 now upon us, I wanted to take the opportunity to talk about five advanced analytics technology trends that will take flight this year.  Some of these are up in the clouds, some down to earth.

  • Text Analytics: Analyzing unstructured text will continue to be a hot area for companies. Vendors in this space have weathered the economic crisis well, and the technology is positioned to do even better once a recovery begins. Social media analysis really took off in 2009, and a number of text analytics vendors, such as Attensity and Clarabridge, have already partnered with online providers to offer this service; those that haven’t will do so this year. Additionally, numerous “listening post” services dealing with brand image and voice of the customer have also sprung up. While voice of the customer has been a hot area and will continue to be, I think other application areas, such as competitive intelligence, will also gain momentum. There is a lot of data out on the Internet that can be used to gain insight about markets, trends, and competitors.
  • Predictive Analytics Model Building: In 2009, there was a lot of buzz about predictive analytics. For example, IBM bought SPSS, and other vendors, such as SAS and Megaputer, beefed up their offerings. A newish development that will continue to gain steam is predictive analytics in the cloud. For example, the vendors Aha! software and Clario are providing predictive capabilities to users in a cloud-based model. While different in approach, they both speak to the trend that predictive analytics will be hot in 2010.
  • Operationalizing Predictive Analytics: While not every company can or wants to build a predictive model, there are certainly a lot of uses for operationalizing predictive models as part of a business process. Forward-looking companies are already doing this in the call center, in fraud analysis, and in churn analysis, to name a few use cases. The momentum will continue to build, making advanced analytics more pervasive.
  • Advanced Analytics in the Cloud: Speaking of putting predictive models in the cloud, business analytics in general will continue to move to the cloud for mid-market companies and others that deem it valuable. Companies such as QlikTech introduced cloud-based services in 2009, and there are also a number of pure-play SaaS vendors out there, like GoodData and others, that provide cloud-based services in this space. Expect to hear more about this in 2010.
  • Analyzing Complex Data Streams: A number of forward-looking companies with large amounts of real-time data (such as RFID or financial data) are already investing in analyzing these data streams. Some are using the on-demand capacity of a cloud-based model to do this. Expect this trend to continue in 2010.

Top of Mind – Data in the Cloud

I attended Cloud Camp Boston yesterday. It was a great meeting with some good discussions, and several hundred people attended. What struck me about the general session (when all attendees were present) was that there was a lot of interest in data in the cloud. For example, during the “unpanel” (where people become panelists in real time), 50% of the questions that were up for grabs (5 of the 10) dealt with data in the cloud. That’s pretty significant. The questions included:

  • How do I integrate large amounts of enterprise data in the cloud? (Answers included various approaches, from the more traditional to new vendor technology.)
  • How do I move my enterprise data into the cloud? (Answers included shipping it FedEx on a hard drive and making sure there is a proven chain of custody around the transfer.)
  • How do I ensure the security of my data in the cloud? (No answer; that deserved its own breakout session.)
  • What is the maximum sustained data transfer rate in the cloud? (Answers included: no one knows until it takes a server down, though someone mentioned that a year ago 8 gigabytes a second took down a cloud provider.)
  • How do applications (and data) interoperate in the cloud? (Answers included that standards need to rule.)

There were some interesting breakout sessions as well. One covered the aforementioned security (and audit) topic, another was an intro to cloud computing (moderated by Judith Hurwitz), one was about channel strategies, and there were a number of others. I attended a breakout session about analytics and BI in the cloud, and again, for obvious reasons, much of the discussion was data-centric. Some of the discussion items included:

  • What public data sets are available in the cloud? 
  • What is the data infrastructure needed to support various kinds of data analysis? 
  • What SaaS vendors offer business analytics in the cloud? 
  • How do I determine what apps/data make sense to move to the cloud?

 The upshot?  Data in the cloud – moving it, securing it, accessing it, manipulating it, and analyzing it – is going to be a hot topic in 2010.

Analyzing Data in the Cloud

I had an interesting chat last week with Roman Stanek, CEO of Good Data, about the debate over data security and reliability in the cloud. For those of you who are not familiar with Good Data, it provides a collaborative business analytics platform as a SaaS offering.

The upshot of the discussion was something like this:

The argument over data security and reliability in the cloud is the wrong argument. It’s not just about moving your existing data to the cloud. It’s about using the cloud to provide a different level of functionality, capability, and service than you could obtain using a traditional premises solution, even if you move that solution to the “hosted” cloud.

What does this mean? First, companies should not simply be asking the question, “Should I move my data to the cloud?” They should be thinking about the new capabilities the cloud provides as part of the decision-making process. For example, Good Data touts its collaborative capabilities and its ability to do mash-ups and certain kinds of benchmarking (utilizing external information) as differentiators from standard premises-based BI solutions. This leads to the second point: a hosted BI solution is a different animal from a SaaS solution. For example, a user of Good Data could pull in information from other SaaS solutions (such as Salesforce.com) as part of the analysis process. This might be difficult with a vanilla hosted solution.

So, when BI users think about moving to the public cloud, they need to assess the risk versus the reward of the move. If they are able to perform analysis that they couldn’t perform with a premises model, and that analysis is valuable, then any perceived or real risk might be worth it.

Security and Reliability of Data in the Cloud

Over the past few days, I got a chance to speak with two different companies in the business analytics space about data in the cloud. One was a SaaS provider, the other an enterprise software vendor. Two vendors, two different stories, and together they illustrate that the jury is still very much out on how end users feel about putting their sensitive data in the cloud.

The SaaS provider runs its operation in the Amazon EC2 cloud (and no, I do not believe the company was using Amazon’s new Virtual Private Cloud services). Interestingly, the company said that even organizations in the public sector were starting to get comfortable with the level of security and reliability of the cloud. In fact, it said that the security and reliability of a cloud data center were, more often than not, better than the security and reliability of the infrastructure on a customer’s premises. This is an argument I have heard before.

The enterprise software vendor also provides a cloud-like option to its customers.  This company told me that 80% of its customers did not want to keep their data in a cloud environment because of security concerns.  These customers are analyzing some pretty sensitive data about customers, revenue, and the like.

Considerations

When you think about data in the cloud, it is important to consider it from at least two perspectives: yours and the cloud provider’s. Let’s say you are a mid-sized company running a business analytics application in the cloud. From your perspective, the amount of data that you are storing and processing in this service may not be that great. However, your SaaS provider might have five thousand customers and may be running its application across many servers. It may house your data, and that of the 4,999 other companies it calls clients, on multiple database servers. Once your company’s data is in the SaaS provider’s database, it may sit alongside data from other companies. The concern, of course, is that your data is in a shared environment that you don’t control. The SaaS provider will tell you that since this is its business, it has a higher level of skill around issues such as security and reliability than might exist in your own company. And this may be true, depending on your company. Each organization needs to evaluate its own needs and issues and make a decision for itself.

Here are some issues to consider about security and reliability:

Data Security

  • Different kinds of data require different levels of security. There are a huge number of issues associated with security, including transporting the data securely to the cloud, as well as data access and data leakage. (Those interested should check out a very interesting paper by Ristenpart, Tromer, Shacham, and Savage that looks at potential threats from “non-provider affiliated malicious parties.”)

  • Along with this, there are controls over your data that need to be addressed. These include controls to ensure data integrity, such as completeness, accuracy, and reasonableness. There are processing controls to ensure that data remains accurate, and there also need to be output controls in place. And, of course, there need to be controls over the actual transport of data from your company to the cloud.

  • There are also data compliance issues to think about. These might include retention as well as issues such as cross-country data transfer.

  • Data ownership: Who owns your data once it goes into the cloud? Some service providers might want to take your data, merge it with other data, and do some analysis on it.

Reliability/Availability

  • Availability: A provider might state that its servers are available 99.999% of the time, but read the contract. Does this uptime include scheduled maintenance? (See the short calculation after this list.)

  • Business continuity plans: If your cloud provider’s data center goes down, what plans are in place to get your data back up and available again? For example, a SaaS vendor might tell you that it backs up data every day, but it might take several days to get the backup onto systems in another facility.

  • Loss of data: What provisions are in your contract if something happens and your provider loses your data?

  • Contract termination: How will your data be returned if the contract is terminated?

  • Vendor lock-in: If you create applications with one cloud vendor and then decide to move to another vendor, you need to find out how difficult it will be to move your data from one to the next.
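On the availability point above, a quick back-of-the-envelope calculation shows why the “nines” in a contract deserve scrutiny even before you ask about scheduled maintenance:

```python
# Downtime allowed per year at different availability levels.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for availability in (0.999, 0.9999, 0.99999):
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} uptime allows about {downtime_min:.0f} minutes of downtime a year")
```

Five nines works out to roughly five minutes a year, so a contract that quietly excludes maintenance windows can change the picture considerably.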
