Five Best Practices for Text Analytics

It’s been a while since I updated my blog and a lot has changed.  In January, I made the move to TDWI as Research Director for Advanced Analytics.  I’m excited to be there, although I miss Hurwitz & Associates.   One of the last projects I worked on while at Hurwitz & Associates was the Victory Index for Text Analytics.  Click here for more information on the Victory Index.  

As part of my research for the Victory Index, I spent a lot of time talking to companies about how they're using text analytics. By far, one of the biggest use cases for text analytics centers on understanding customer feedback and behavior. Some companies are using internal data such as call center notes, emails, or survey verbatims to gather feedback and understand behavior; others are using social media; and still others are using both.

What are these end users saying about how to be successful with text analytics?  Aside from the important best practices around defining the right problem, getting the right people, and dealing with infrastructure issues, I’ve also heard the following:

Best Practice #1 - Managing expectations among senior leadership. A number of the end users I speak with say that their management often thinks text analytics will work almost out of the box, and this can establish unrealistic expectations. Some of these executives seem to envision a big funnel where reams of unstructured text enter and concepts, themes, entities, and insights pop out at the other end. Managing expectations is a balancing act. On the one hand, executive management may not want to hear the details about how long it is going to take to build a taxonomy or integrate data. On the other hand, it is important to get wins under your belt quickly to establish credibility in the technology, because no one wants to wait years to see results. That said, it is still important to establish a reasonable set of goals, prioritize them, and communicate them to everyone. End users find that getting senior management involved, and keeping them informed with a well-defined plan for a realistic first project, can be very helpful in handling expectations.

 

For more, visit my TDWI blog.

Hadoop + MapReduce + SQL + Big Data and Analytics: RainStor

As the volume and variety of data continue to increase, we're going to see more companies entering the market with solutions to address big data, compliant retention, and business analytics. One such company is RainStor, which, while not a new entrant (it has over 150 end customers through direct sales and partner channels), has recently started to market its big data capabilities more aggressively to enterprises. I had an interesting conversation with Ramon Chen, VP of product management at RainStor, the other week.

The RainStor database was built in the UK as a government defense project to process large amounts of data in memory. Many of the in-memory features have been retained, while new capabilities, including persistent retention on any physical storage, have been added. And now the company is positioning itself as providing an enterprise database architected for big data. It even runs natively on Hadoop.

The Value Proposition

The value proposition is that RainStor's technology enables companies to store data in the RainStor database using a unique compression technology to reduce disk space requirements. The company boasts as much as a 40 to 1 compression ratio (a greater than 97% reduction in size). Additionally, the software can run on any commodity hardware and storage.

For example, one of RainStor's clients generates 17 billion logs a day that it is required to store and access for ninety days. This is the equivalent of 2 petabytes (PB) of raw information over that period, which would ordinarily cost millions of dollars to store. Using RainStor, the company compressed the data 20-fold and retained it on a cost-efficient 100 terabyte (TB) NAS. At the same time, RainStor replaced an Oracle data warehouse, providing fast response times for queries in support of an operational call center.
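
Back-of-the-envelope arithmetic makes these figures easy to sanity-check. The sketch below is just that, a rough calculation: the roughly 1.3 KB-per-record size is inferred from the numbers above rather than stated by RainStor.

```python
# Rough sanity check of the retention example above (inferred figures noted).
LOGS_PER_DAY = 17e9         # 17 billion log records per day
RETENTION_DAYS = 90
RAW_PETABYTES = 2           # raw volume quoted for the 90-day window
COMPRESSION_RATIO = 20      # 20-fold compression quoted for this workload

records = LOGS_PER_DAY * RETENTION_DAYS                   # ~1.53 trillion records
raw_bytes = RAW_PETABYTES * 10**15
bytes_per_record = raw_bytes / records                    # ~1,300 bytes per record (inferred)
compressed_tb = raw_bytes / COMPRESSION_RATIO / 10**12    # ~100 TB

print(f"records retained:     {records:.2e}")
print(f"implied record size:  ~{bytes_per_record:.0f} bytes")
print(f"compressed footprint: ~{compressed_tb:.0f} TB")
```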

RainStor ingests the data, stores it, and makes it available for query and other analytic workloads. It comes in two editions: the Big Data Retention Edition and the Big Data Analytics on Hadoop Edition. Both editions provide full SQL-92 and ODBC/JDBC access. According to the company, the Hadoop edition is the only database that runs natively on Hadoop and supports access through MapReduce and the Pig Latin language. As a massively parallel processing (MPP) database, RainStor runs on the same Hadoop nodes, writing and supporting access to compressed data on HDFS. It provides security, high availability, and lifecycle management and versioning capabilities.

The idea, then, is that RainStor can dramatically lower the cost of storing data in Hadoop: its compression reduces the node count needed and accelerates the performance of MapReduce jobs, while full SQL-92 access can reduce the need to transfer data out of the Hadoop cluster to a separate enterprise data warehouse. RainStor also allows the Hadoop environment to support real-time query access in addition to its batch-oriented MapReduce processing.
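
Because access is standard SQL-92 over ODBC/JDBC, querying RainStor-managed data from a script or reporting tool looks like querying any other relational source. The snippet below is only a sketch of that idea using Python's pyodbc; the DSN, table, and column names are hypothetical, not taken from RainStor documentation.

```python
import pyodbc

# Hypothetical DSN, table, and column names, purely for illustration;
# the point is that access is plain SQL-92 over ODBC, not hand-written MapReduce.
conn = pyodbc.connect("DSN=rainstor_archive")
cursor = conn.cursor()

cursor.execute(
    """
    SELECT caller_id, COUNT(*) AS calls
    FROM call_detail_records
    WHERE call_date >= ?
    GROUP BY caller_id
    ORDER BY calls DESC
    """,
    "2012-01-01",
)

for caller_id, calls in cursor.fetchmany(10):
    print(caller_id, calls)

conn.close()
```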

How does it work?

RainStor is not a relational database; in keeping with the NoSQL movement, it stores data non-relationally. In its case, the data is physically stored as a set of trees with linked values and nodes, an approach the following example (courtesy of RainStor) illustrates.


Say a number of records containing the common value yahoo.com are ingested into the system. RainStor would discard the duplicates, store the literal yahoo.com only once, and maintain references back to the records that contained that value. So if the system loads 1 million records and 500K of them contain yahoo.com, the value is stored just once, saving significant storage. According to RainStor, this value deduplication, together with additional pattern deduplication, means the resulting tree structure holds the same data in a significantly smaller footprint, and at a higher compression ratio, than other databases on the market. It also doesn't require re-inflation the way binary zip-style compression does, which takes time and resources. RainStor writes the tree structure to disk as is and reads it back in the same form; instead of unraveling every tree every time, it reads only the trees, and branches of trees, required to fulfill a query.
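
To make the idea concrete, here is a minimal value-deduplication sketch in Python. It is not RainStor's implementation (which also applies pattern deduplication across a tree structure); it only shows how storing a repeated literal once, with references back to the records, shrinks the footprint.

```python
# Minimal value-deduplication sketch (illustrative only, not RainStor's format).
class DedupStore:
    def __init__(self):
        self.values = []        # each distinct literal stored exactly once
        self.value_index = {}   # literal -> position in self.values
        self.records = []       # each record is a tuple of value references

    def ingest(self, record):
        refs = []
        for field in record:
            if field not in self.value_index:
                self.value_index[field] = len(self.values)
                self.values.append(field)
            refs.append(self.value_index[field])
        self.records.append(tuple(refs))

    def materialize(self, record_id):
        """Rebuild a record from its value references."""
        return tuple(self.values[i] for i in self.records[record_id])


store = DedupStore()
store.ingest(("alice", "yahoo.com", "2012-03-01"))
store.ingest(("bob", "yahoo.com", "2012-03-01"))   # "yahoo.com" is not stored again
print(store.materialize(1))    # ('bob', 'yahoo.com', '2012-03-01')
print(len(store.values))       # 4 distinct literals for 6 ingested field values
```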

Conclusion

RainStor is a good example of the kind of database that can enable big data analytics. Just as many companies finally "got" the notion of business analytics and the importance of analytics in decision making, so too are they realizing that, as they accumulate and generate ever-increasing amounts of data, there is an opportunity to analyze and act on it.

For example, according to the company, you can put a BI solution such as IBM Cognos, MicroStrategy, Tableau, or SAS on top of RainStor. RainStor would hold the raw data, and any BI solution would access it either through MapReduce or through ODBC/JDBC (i.e., one platform), with no need to use Hive and HQL. RainStor also recently announced a partnership with IBM around BigInsights for its Big Data Analytics on Hadoop edition.

What about big data appliances that are architected for high-performance analytics? RainStor claims that while some big data appliances do have MapReduce support (Aster Data, for example), it would be cheaper to use its solution together with open source Hadoop. In other words, RainStor on Hadoop would be cheaper than any big data appliance.

It is still early in the game. I am looking forward to seeing some big data analytics implementations that utilize RainStor. I am especially interested in use cases that go beyond querying huge amounts of data and provide advanced analytics on top of RainStor, or big data visualizations with rapid response times that need only a small number of nodes. Please keep me posted, RainStor.

Informatica announces 9.1 and puts a stake in the ground around big data

Earlier this week, Informatica announced the release of the Informatica 9.1 Platform for Big Data. The company joins other data-centric vendors such as EMC and IBM by putting its stake in the ground around the hot topic of Big Data. Informatica defines Big Data as "all data, including both transaction and interaction data, in sets whose size or complexity exceeds the ability of commonly used technologies to capture, manage and process at a reasonable cost and timeframe." Indeed, Informatica's stance is that Big Data is the confluence of three technology trends: big transaction data, big interaction data, and big data processing. In Informatica parlance, transaction data includes OLTP, OLAP, and data warehouse data; interaction data might include social media data, call center records, clickstream data, and even scientific data such as that associated with genomics. Informatica is targeting native, high-performance connectivity and future integration with Hadoop, the Big Data processing platform.

In 9.1, Informatica is providing an updated set of capabilities around self-service, authoritative and trustworthy data (MDM and data quality), data services, and data integration. I wanted to focus on the data services here because of the connection to Big Data. Informatica is providing a platform that companies can use to integrate transaction data (at petabyte scale and beyond in volume) with social network data from Facebook, LinkedIn, and Twitter. Additionally, 9.1 provides the capability to move all data into and out of Hadoop, in batch or real time, using universal connectivity to sources including mainframes, databases, and applications, which can help in managing unstructured data.

So, how will companies utilize this latest release? I recently had the opportunity to speak with Rob Myers, an Informatica customer who is the manager of BI architecture, data warehousing, MDM, and enterprise integration for HealthNow. HealthNow is a BlueCross BlueShield provider for parts of western New York and the Albany area. The company is expanding geographically and is also providing value-added services such as patient portals. It views itself not simply as a claims processor but as a service provider to healthcare providers and patients. According to Rob, the company is looking to offer other value-added services to doctors and patients as part of its competitive strategy. These offerings may include real-time claims processing, identifying fraudulent claims, or analytics for healthier outcomes. For example, HealthNow might provide a service where it identifies patients with diabetes and provides proactive services to help them manage the disease. Or it might provide physicians with suggestions for tests they might consider for certain patients, given their medical records.

Currently, the company utilizes Informatica PowerCenter and Informatica Data Services for data integration, including ETL and data abstraction. HealthNow has one large data warehouse and is currently building out a second. It is exposing data through a logical model in a data services tier. For example, its member portal utilizes data services to let members sign in and, in real time, integrates 30-40 attributes for each member, including demographic information, products, and eligibility for certain services, into the portal. In addition, the company's actuaries, marketing groups, and health services group have been utilizing its data warehouses to perform their own analysis. Rob doesn't consider the data in these warehouses to be Big Data; rather, they are just sets of relational data. He views Big Data as some of the other data the company currently has a hard time mining, for example data on social networks and the unstructured data in claims and medical notes. The company is in the beginning phase of determining how to gather and parse this text and expose it in a way that it can be analyzed. For example, the company is interested in combining the data it already has with this unstructured data and providing predictive analytics to its community. HealthNow is exploring Hadoop data stores as part of this plan and is excited about the direction Informatica is moving. It views Informatica as the middleware that can get trusted data out of the various silos and integrated in a way that it can then be analyzed or used in other value-added services.

It is certainly interesting to see what end users have in mind for Big Data and, for that matter, how they define Big Data. Rob clearly views Big Data as high volume and disparate in nature (i.e., including structured and unstructured data). There also seems to be a time dimension to it. He made the point that it's not just about having Big Data; it's about doing something with it that he couldn't do before, like processing and analyzing it. This is an important point that vendors and end users are starting to pick up on. If Big Data were simply about the volume of different kinds of data, then it would be a moving target. Really, an important aspect of Big Data is being able to perform activities on the data that weren't possible before. I am glad to see companies thinking about their use cases for Big Data and vendors such as Informatica putting a stake in the ground around the subject.

Four Findings from the Hurwitz & Associates Advanced Analytics Survey

Hurwitz & Associates conducted an online survey on advanced analytics in January 2011. Over 160 companies across a range of industries and company sizes participated in the survey. The goal of the survey was to understand how companies are using advanced analytics today and what their plans are for the future. Specific topics included:

- Motivation for advanced analytics
- Use cases for advanced analytics
- Kinds of users of advanced analytics
- Challenges with advanced analytics
- Benefits of the technology
- Experiences with BI and advanced analytics
- Plans for using advanced analytics

What is advanced analytics?
Advanced analytics provides algorithms for complex analysis of either structured or unstructured data. It includes sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data mining techniques. Among its many use cases, it can be deployed for finding patterns in data, prediction, optimization, forecasting, and complex event processing. Examples include predicting churn, identifying fraud, market basket analysis, and analyzing social media for brand management. Advanced analytics does not include database query and reporting or OLAP cubes.

Many early adopters of this technology have used predictive analytics as part of their marketing efforts. However, the diversity of use cases for predictive analytics is growing. In addition to marketing-related analytics for use in areas such as market basket analysis, promotional mix, consumer behavior analysis, brand loyalty, and churn analysis, companies are using the technology in new and innovative ways. For example, newer industry use cases are emerging, including reliability assessment (i.e., predicting failure in machines), situational awareness and behavior (defense), investment analysis, fraud identification (insurance, finance), predicting disabilities from claims (insurance), and finding patterns in health-related data (medical).

The two charts below illustrate several key findings from the survey on how companies use advanced analytics and who within the organization is using this technology.

• Figure 1 indicates that the top uses for advanced analytics include finding patterns in data and building predictive models.

• Figure 2 illustrates that the users of advanced analytics in many organizations have expanded from statisticians and other highly technical staff to include business analysts and other business users. Many vendors anticipated this shift to business users and enhanced their offerings by adding, for example, new user interfaces that suggest or dictate which model should be used, given a certain set of data.

Other highlights include:

• Survey participants have seen a huge business benefit from advanced analytics. In fact, over 40% of the respondents who had implemented advanced analytics believed it had increased their company’s top-line revenue. Only 2% of respondents stated that advanced analytics provided little or no value to their company.
• Regardless of company size, the vast majority of respondents expected the number of users of advanced analytics in their companies to increase over the next six to 12 months. In fact, over 50% of respondents currently using the technology expected the number of users to increase over this time period.

The final report will be published in March 2011. Stay tuned!

Five Analytics Predictions for 2011

In 2011 analytics will take center stage as a key trend because companies are at a tipping point with the volume of data they have and their urgent need to do something about it. So, with 2010 now past and 2011 to look forward to, I wanted to take the opportunity to submit my predictions (no pun intended) regarding the analytics and advanced analytics market.

1. Advanced Analytics gains more steam. Advanced analytics was hot last year and will remain so in 2011. Growth will come from at least three different sources. First, advanced analytics will increase its footprint in large enterprises. A number of predictive and advanced analytics vendors tried to make their tools easier to use in 2009-2010; in 2011, expect new users in companies already deploying the technology to come on board. Second, more companies will begin to purchase the technology because they see it as a way to increase top-line revenue while gaining deeper insights about their customers. Finally, small and mid-sized companies will get into the act, looking for lower-cost and user-friendly tools.
2. Social Media Monitoring Shake-Out. The social media monitoring and analysis market is one crowded and confused space, with close to 200 vendors competing across no-cost, low-cost, and enterprise-cost solution classes. Expect 2011 to be a year of folding and consolidation, with at least a third of these companies tanking. Before this happens, expect new entrants to the market for low-cost social media monitoring platforms, and everyone screaming for attention.
3. Discovery Rules. Text analytics will become a mainstream technology as more companies begin to finally understand the difference between simply searching information and actually discovering insight. Part of this will be due to the impact of social media monitoring services that utilize text analytics to discover, rather than simply search, social media to find topics and patterns in unstructured data. However, innovative companies will continue to build text analytics solutions to do more than just analyze social media.
4. Sentiment Analysis is Supplanted by Other Measures. Building on prediction #3, by the end of 2011 sentiment analysis won't be the be-all and end-all of social media monitoring. Yes, it is important, but the reality is that most low-cost social media monitoring vendors don't do it well. They may tell you that they get 75-80% accuracy, but it ain't so. In fact, it is probably more like 30-40%. After many users have gotten burned by not questioning sentiment scores, they will begin to look for other meaningful measures.
5. Data in the Cloud Continues to Expand, as Does BI SaaS. Expect there to still be a lot of discussion around data in the cloud. Business analytics vendors will continue to launch SaaS BI solutions, and companies will continue to buy them, especially small and mid-sized companies that find the SaaS model a good alternative to some pricey enterprise solutions. Expect to see at least ten more vendors enter the market.

On-premise becomes a new word. This last prediction is not really related to analytics (hence five rather than six predictions), but I couldn't resist. People will continue to use the term "on-premise" rather than "on-premises" when referring to cloud computing, even though it is incorrect. This will continue to drive many people crazy, since premise means "a proposition supporting or helping to support a conclusion" (dictionary.com) rather than a singular form of premises. Those of us in the know will finally give up correcting everyone else.

Advanced Analytics and the skills needed to make it happen: Takeaways from IBM IOD

Advanced analytics was a big topic at the IBM IOD conference last week. As part of this, predictive analytics was again an important piece of the story, along with other advanced analytics capabilities IBM has developed, or is in the process of developing, to support optimization. These include BigInsights (for big data), analysis of data streams, content/text analytics, and, of course, the latest release of Cognos.

One especially interesting topic discussed at the conference was the skills required to make advanced analytics a reality. I have been writing and thinking a lot about this subject, so I was very happy to hear IBM address it head on during the second-day keynote. This keynote included a customer panel and another speaker, Dr. Atul Gawande, and both offered some excellent insights. The panel included Scott Friesen (Best Buy), Scott Futren (Gwinnett County Public Schools), Srinivas Koushik (Nationwide), and Greg Christopher (Nestle). Here are some of the interrelated nuggets from the discussions:

• Ability to deliver vs. the ability to absorb. One panelist made the point that a lot of new insights are being delivered to organizations. In the future, it may become difficult for people to absorb all of this information (and this will require new skills too).
• Analysis and interpretation. People will need to know how to analyze and how to interpret the results of an analysis. As Dr. Gawande pointed out, “Having knowledge is not the same as using knowledge effectively.”
• The right information. One of the panelists mentioned that putting analytics tools in the hands of line people might be too much for them, and instead the company is focusing on giving these employees the right information.
• Leaders need to have capabilities too. If executives are accustomed to using spreadsheets and relying on their gut instincts, then they will also need to learn how to make use of analytics.
• Cultural changes. From call center agents using the results of predictive models to workers on the line seeing reports to business analysts using more sophisticated models, change is coming. This change means people will be changing the way that they work. How this change is handled will require special thought by organizations.

IBM executives also made a point of discussing the critical skills required for analytics. These included strategy development, developing user interfaces, enterprise integration, modeling, and dealing with structured and unstructured data. IBM has, of course, made a huge investment in these skills. GBS executives emphasized the 8,500 employees in its Global Business Services Business Analytics and Optimization group. Executives also pointed to the fact that the company has thousands of partners in this space and that 1 in 3 IBMers will attend analytics training. So, IBM is prepared to help companies in their journey into business analytics.

Are companies there yet? I think it is going to take organizations time to develop some of these skills (and some they should probably outsource). Sure, analytics has been around a long time. And sure, vendors are making their products easier to use, and that is going to help end users become more effective. But even if we're just talking about business people making use of analytic software (as opposed to operationalizing it in a business process), the reality is that analytics requires a certain mindset. Additionally, unless someone understands the context of the information he or she is dealing with, it doesn't matter how user-friendly the platform is; they can still get it wrong. People using analytics will need to think critically about data, understand their data, and understand context. They will also need to know what questions to ask.

I whole-heartedly believe it is worth the investment of time and energy to make analytics happen.

Please note:

As luck would have it, I am currently fielding a study on advanced analytics! I am interested in understanding what your company's plans are for advanced analytics. If you're not planning to use advanced analytics, I'd like to know why. If you're already using advanced analytics, I'd like to understand your experience.

If you participate in this survey I would be happy to send you a report of our findings. Simply provide your email address at the end of the survey! Here’s the link:

Click here to take survey

Who is using advanced analytics?

Advanced analytics is currently a hot topic among businesses, but who is actually using it and why? What are the challenges and benefits to those companies that are using advanced analytics? And, what is keeping some companies from exploring this technology?

Hurwitz & Associates would like your help in answering a short (5 min) survey on advanced analytics. We are interested in understanding what your company’s plans are for advanced analytics. If you’re not planning to use advanced analytics, we’d like to know why. If you’re already using advanced analytics we’d like to understand your experience.

If you participate in this survey we would be happy to send you a report of our findings. Simply provide us your email address at the end of the survey! Thanks!

Here is the link to the survey:
Click here to take survey

Five requirements for Advanced Analytics

The other day I was looking at the analytics discussion board that I moderate on the Information Management site. I had posted a topic entitled “the value of advanced analytics.” I noticed that the number of views on this topic was at least 3 times as many as on other topics that had been posted on the forum. The second post that generated a lot of traffic was a question about a practical guide to predictive analytics.

Clearly, companies are curious and excited about advanced analytics. Advanced analytics utilizes sophisticated techniques to understand patterns and predict outcomes. It includes complex techniques such as statistical modeling, machine learning, linear programming, mathematics, and even natural language processing (on the unstructured side). While many kinds of "advanced analytics" have been around for the last 20+ years (I used them extensively in the '80s), and the term may simply be a way to invigorate the business analytics market, the point is that companies are finally starting to realize the value this kind of analysis can provide.

Companies want to better understand the value this technology brings and how to get started. And while the number of users interested in advanced analytics continues to increase, the reality is that there will likely be a skills shortage in this area. Why? Because advanced analytics isn't the same beast as what I refer to as "slicing and dicing and shaking and baking" data to produce reports that might include information such as sales per region, revenue per customer, and so on.

So what skills does a business user need to face the advanced analytics challenge? It's a tough question. There is a certain thought process that goes into advanced analytics. Here are five skills (there are, no doubt, more) that, at a minimum, you should have:

1. It’s about the data. So, thoroughly understand your data. A business user needs to understand all aspects of his or her data. This includes answers to questions such as, “What is a customer?” “What does it mean if a data field is blank?” “Is there seasonality in my time series data?” It also means understanding what kind of derived variables (e.g. a ratio) you might be interested in and how you want to calculate them.
2. Garbage in, garbage out. Appreciate data quality issues. A business user analyzing data cannot simply assume that the data (from whatever source) is absolutely fine. It might be, but you still need to check. Part of this ties to understanding your data, but it also means first looking at the data and asking whether it makes sense. And what do you do with data that doesn't make sense? (A small sanity-check sketch follows this list.)
3. Know what questions to ask. I remember a time in graduate school when, excited simply to have my data and start analyzing it, I was told by a wise professor not to throw statistical models at the data just because I could. First, know what questions you are trying to answer from the data. Ask yourself if you have the right data to answer them. Look at the data to see what it is telling you. Then start to consider the models. Knowing what questions to ask will require business acumen.
4. Don’t skip the training step. Know how to use tools and what the tools can do for you. Again, it is simple to throw data at a model, especially if the software system suggests a certain model. However, it is important to understand what the models are good for. When does it make sense to use a decision tree? What about survival analysis? Certain tools will take your data and suggest a model. My concern is that if you don’t know what the model means, it makes it more difficult to defend your output. That is why vendors suggest training.
5. Be able to defend your output. At the end of the day, you're the one who needs to present your analysis to your company. Make sure you know enough to defend it. Turn the analysis upside down, ask questions of it, and make sure you can articulate the output.
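
As a small illustration of points 1 and 2, the sketch below runs a few basic sanity checks on a hypothetical sales extract using pandas: counting blanks, flagging values that cannot be right, and computing a derived ratio variable. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical sales extract; column names and values are illustrative only.
df = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", None],
    "region":      ["East", "West", None, "East"],
    "revenue":     [1200.0, -50.0, 980.0, 430.0],
    "orders":      [10, 4, 0, 3],
})

# 1. Understand the data: how many blanks are there, and what does a blank mean here?
print(df.isna().sum())

# 2. Garbage in, garbage out: flag values that cannot be right before any modeling.
suspect = df[(df["revenue"] < 0) | (df["customer_id"].isna())]
print(suspect)

# A simple derived variable (revenue per order), masking zero-order rows
# to avoid dividing by zero.
df["revenue_per_order"] = df["revenue"] / df["orders"].where(df["orders"] > 0)
print(df[["customer_id", "revenue_per_order"]])
```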

I could go on and on, but I'll stop here. Advanced analytics tools are simply that: tools. They will be only as good as the person utilizing them, which requires understanding the tools as well as how to think and strategize around the analysis. So my message? Utilized properly, these tools can be great. Utilized incorrectly, well, it's analogous to a do-it-yourself electrician who burns down the house.

Five Predictions for Advanced Analytics in 2010

With 2010 now upon us, I wanted to take the opportunity to talk about five advanced analytics technology trends that will take flight this year.  Some of these are up in the clouds, some down to earth.

  • Text Analytics:  Analyzing unstructured text will continue to be a hot area for companies. Vendors in this space have weathered the economic crisis well, and the technology is positioned to do even better once a recovery begins. Social media analysis really took off in 2009, and a number of text analytics vendors, such as Attensity and Clarabridge, have already partnered with online providers to offer this service; those that haven't will do so this year. Additionally, numerous "listening post" services, dealing with brand image and voice of the customer, have also sprung up. However, while voice of the customer has been a hot area and will continue to be, I think other application areas such as competitive intelligence will also gain momentum. There is a lot of data out on the Internet that can be used to gain insight about markets, trends, and competitors.
  • Predictive Analytics Model Building:  In 2009, there was a lot of buzz about predictive analytics. For example, IBM bought SPSS, and other vendors, such as SAS and Megaputer, also beefed up their offerings. A newish development that will continue to gain steam is predictive analytics in the cloud. For example, vendors Aha! software and Clario are providing predictive capabilities to users in a cloud-based model. While different in approach, they both speak to the trend that predictive analytics will be hot in 2010.
  • Operationalizing Predictive Analytics:  While not every company can or may want to build a predictive model, there are certainly a lot of uses for operationalizing predictive models as part of a business process. Forward-looking companies are already using this as part of the call center process, in fraud analysis, and in churn analysis, to name a few use cases. The momentum will continue to build, making advanced analytics more pervasive.
  • Advanced Analytics in the Cloud:  Speaking of putting predictive models in the cloud, business analytics in general will continue to move to the cloud for mid-market companies and others that deem it valuable. Companies such as QlikTech introduced a cloud-based service in 2009. There are also a number of pure-play SaaS vendors out there, like GoodData and others, that provide cloud-based services in this space. Expect to hear more about this in 2010.
  • Analyzing Complex Data Streams:  A number of forward-looking companies with large amounts of real-time data (such as RFID or financial data) are already investing in analyzing these data streams. Some are using the on-demand capacity of a cloud-based model to do this. Expect this trend to continue in 2010.