What is Networked Content and Why Should We Care?

This is the first in a series of blog posts about text analytics and content management. This one uses an interview format.

I recently had an interesting conversation with Daniel Mayer of TEMIS regarding his new paper, the Networked Content Manifesto. I just finished reading it and found it insightful about how enriched content might be used today and into the future.

So what is networked content? According to the Manifesto, networked content “creates a network of semantic links between documents that enable new forms of navigation and improves retrieval from a collection of documents.” It uses text analytics techniques to extract semantic metadata from documents. This metadata can be used to link documents together across the organization, thus providing a rich source of connected content for use by an entire company. Picture 50,000 documents linked together across a company by enriched metadata that includes people, places, things, facts, or concepts and you can start to visualize what this might look like.
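To make that concrete, here is a minimal sketch in Python of the linking step, with invented documents and metadata; in practice the metadata would come from a text analytics engine rather than hand-written sets.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical output of a text analytics engine: semantic metadata per document.
doc_metadata = {
    "report-001.pdf": {"Boston", "Acme Corp", "acquisition"},
    "memo-017.docx": {"Acme Corp", "Q3 earnings"},
    "brief-203.pdf": {"Boston", "Q3 earnings"},
}

# Link any two documents that share at least one piece of metadata.
links = defaultdict(list)
for (doc_a, meta_a), (doc_b, meta_b) in combinations(doc_metadata.items(), 2):
    shared = meta_a & meta_b
    if shared:
        links[doc_a].append((doc_b, shared))
        links[doc_b].append((doc_a, shared))

# Navigate the network from any starting document.
for related, shared in links["report-001.pdf"]:
    print(f"report-001.pdf -> {related} via {sorted(shared)}")
```

Scale that up from three documents to tens of thousands and the semantic links become a navigable network spanning the whole organization.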

Here is an excerpt of my conversation with Daniel:

FH: So, what is the value of networked content?

DM: Semantic metadata creates a richer index than was previously possible using techniques such as manual tagging. There are five benefits that semantic metadata provides. The first two benefits are that it makes content more findable and easier to explore. You can’t find what you don’t know how to query. In many cases people don’t know how they should be searching. Any advanced search engine with facets is a simple example of how you can leverage metadata to enhance information access by enabling exploration.

The third benefit is that networked content can boost insight into a subject of interest by revealing context and placing it into perspective. Context is revealed by showing what else there is around your precise search – for example, related documents. Perspective is typically reached through analytics, that is, attaining a high level of insight into what can be found in a large number of documents, like articles or call center notes.

The final two benefits are more future-looking. The first of these is something we call “proactive delivery”. Up to now, people mostly access information by using search engines to return documents associated with a certain topic. For example, I might ask, “What are all of the restaurants in Boston?” But by leveraging information about your past behavior, your location, or your profile, I can proactively send you alerts about relevant restaurants you might be interested in. This is done by some advanced portals today, and the same principle can be applied to virtually any form of content.

The last benefit is tight integration with workflow applications. Today, people are used to searching Google or other search engines, which require a dedicated interface. If you are writing a report and need to go to the web to look for more information, this interferes with your workflow. But instead, it is possible to pipe content directly to your workflow so that you don’t need to interrupt your work to access it. For example, we can foresee how in the near future, when typing a report in a word processing application such as MS Word, you will be able to receive, right in the interface, bits of information related contextually to what you are typing. As a chemist, you might receive suggestions of scientific articles based on the metadata extracted from the text you are typing. Likewise, content management interfaces in the future will be enriched with widgets that provide related documents and analytics.

FH: How is networked content different from other kinds of advanced classification systems provided by content management vendors today?

DM: Networked Content is ultimately a vision for how content can be better managed and distributed by leveraging semantic content enrichment. This vision is underpinned by an entire technical ecosystem, of which the Content Management System is only one element. Our White Paper illustrates how text analytics engines such as the Luxid® Content Enrichment Platform are a key part of this emerging ecosystem.

Making a blanket comparison is difficult, but generally speaking, Networked Content can leverage a level of granularity and domain specificity that the classification systems you are referring to don’t generally support.

FH: Do you need a taxonomy or ontology to make this work?

DM: I’d like to make sure we use caution when we use these terms. A taxonomy or ontology can certainly be helpful. If a customer wants to improve navigation in content and already has an enterprise taxonomy, it will undoubtedly help by providing guidance and structure. However, in most cases it is not sufficient in and of itself to perform content enrichment. To do this you need to build an actual engine that is able to process text and identify within it the characteristics that trigger the assignment of metadata (either by extracting concepts from the text itself or by judging the text as a whole). In the news domain, for example, the standard IPTC taxonomy is used to categorize news articles into topic areas such as economy, politics, or sports, and into subcategories like economy/economic policy or economy/macroeconomics. You can think of this as a file cabinet where you ultimately want to file every article. The IPTC taxonomy tells you the structure the file cabinet should have, but it doesn’t do the filing for you. For that, you need to build the metadata extraction engine. That’s where we come in. We provide a platform that includes standard extraction engines – which we call Skill Cartridges® – as well as the full development environment to customize them, extend their coverage, and develop new ones from the ground up if needed.
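(An aside from me: to make the file-cabinet analogy concrete, here is a deliberately naive, keyword-based sketch in Python of the “filing” step. The categories and trigger terms are invented, and real Skill Cartridges® are of course far more sophisticated than simple string matching.)

```python
# The taxonomy supplies the categories; an extraction engine does the filing.
# Category names and trigger terms below are illustrative, not the real IPTC scheme.
TAXONOMY_TRIGGERS = {
    "economy/economic policy": {"central bank", "fiscal", "stimulus"},
    "economy/macroeconomics": {"gdp", "inflation", "unemployment"},
    "sport": {"match", "tournament", "league"},
}

def categorize(article_text: str) -> list[str]:
    """Return the taxonomy categories whose trigger terms appear in the article."""
    text = article_text.lower()
    return [cat for cat, triggers in TAXONOMY_TRIGGERS.items()
            if any(term in text for term in triggers)]

print(categorize("GDP growth slowed as inflation outpaced the central bank's target."))
# -> ['economy/economic policy', 'economy/macroeconomics']
```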

FH: I know that TEMIS is heavily into the publishing industry and you cite publishing examples in the Manifesto. What other use cases do you see?

DM: The Life Sciences industry (especially Pharma and Crop Science) has been an early adopter of this technology for applications such as scientific discovery, IP management, knowledge management, and pharmacovigilance. These are typical use cases for all research-intensive sectors. Another group of common use cases for this technology in the private sector is what we call Market Intelligence: understanding your competitors and complementors (Competitive Intelligence), your customers (Voice of the Customer), and/or what is being said about you (Sentiment Analysis). You can think of all of these as departmental applications in the sense that they primarily serve the needs of one department: R&D, Marketing, Strategy, and so on.

Furthermore, we believe there is an ongoing trend for the Enterprise to adopt Networked Content transversally, beyond departmental applications, as a basic service of its core information system. There, content enrichment can act as the glue between content management, search, and BI, and can bring productivity gains and boost insight throughout the organization. This is what has led us to deploy within EMC Documentum and Microsoft SharePoint 2010. In the future, all of the departmental applications will become even more ubiquitous thanks to such deployments.

FH: How does Networked Content relate to the Semantic Web?

DM: They are very much related. The Semantic Web has been primarily concerned with how information that is available on the Web should be intelligently structured to facilitate access and manipulation by machines. Networked Content is focused on corporate – or private – content and how it can be connected with other content, either private, or public.

The Three C’s – conceptual search, clustering, and categorization

I recently had the opportunity to speak with Richard Turner, VP of Marketing at Content Analyst. Content Analyst was originally part of SAIC and spun out about five years ago. The company provides content management, eDiscovery, and content workflow solutions to its clients – primarily as OEM solutions.

The tool set is called CAAT (Content Analyst Analytic Technology).  It includes five core features:

  • Concept search: uses concepts rather than keywords to search through documents.
  • Dynamic clustering: classifies documents into clusters.
  • Categorization: classifies documents into user-defined categories.
  • Summarization: identifies conceptually definitive sentences in documents.
  • Foreign language: identifies the language of a document and can work across any language.

Let me just briefly touch upon concept search, dynamic clustering, and categorization.  Concept search is interesting because when most people think search, they think keyword search.  However, keywords may not give you what you’re looking for.  For example, I might be interested in finding documents that deal with banks.  However, a document might not state the word “bank” explicitly.  Rather, words like finance, money, and so on might occur in the document.  So, if you don’t insert the right word into the search engine, you will not get back all relevant documents.  Concept search allows you to search on the concept (not the keyword) “bank”, so you get back documents related to the concept even if they don’t contain the exact word.  CAAT learns the meaning of the word “bank” in a given set of documents from co-occurring words like “money”, “exchange rate”, and so on.  It also learns that a bank (as in financial institution) is not the same as the bank on the side of a river because of other terms in the document (such as money, transfer, or ATM).

Dynamic clustering enables the user to organize documents into content-based groups called clusters.  You can also categorize documents by tagging examples that fall into a certain cluster and then training the system to recognize similar documents that belong in the same category.  You literally tag a document as belonging to a certain category and then give the system other, similar documents to train on.  In eDiscovery applications, this can dramatically cut down the amount of time needed to find the right documents.  In the past, this was done manually, which obviously could be very time intensive.
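As a rough illustration of that two-step workflow – cluster first, then categorize from a few tagged examples – here is a minimal sketch using scikit-learn rather than CAAT itself; the documents, labels, and parameters are invented.

```python
# Unsupervised clustering, then categorization trained from two tagged examples.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid

docs = [
    "the bank approved the loan after a credit review",   # finance
    "loan rates and money transfers at the bank",          # finance
    "the river bank flooded after heavy rain",             # nature
    "we hiked along the muddy river bank at dawn",         # nature
]
X = TfidfVectorizer().fit_transform(docs)

# Dynamic clustering: group documents by content with no labels supplied.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("clusters:", clusters)

# Categorization: tag two example documents, then classify the remaining ones.
clf = NearestCentroid().fit(X[[0, 2]].toarray(), ["finance", "nature"])
print("predicted:", clf.predict(X[[1, 3]].toarray()))  # typically ['finance', 'nature']
```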

How do they do it?

The company uses a technique called Latent Semantic Indexing (LSI), along with other patented technology, to accomplish all of this.  The upshot is that LSI uses a vector representation of the information found in the documents to analyze the term space.  Essentially, it removes the grammar, then counts and weights the occurrence of terms in the document (e.g., how often a word appears on a page or in a document).  It does this across all of the documents, and then collapses the matrix using a technique patented at Bell Labs.  The more negative a term, the greater its distance from a page.  Since the approach is mathematical, there is no need to put together a dictionary or thesaurus.  And it’s this mathematical approach that makes the system language independent.
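A rough sketch of those LSI mechanics might look like the following; the documents are invented, and the TF-IDF weighting is my own assumption rather than the weighting CAAT actually uses.

```python
# Build a weighted term-document matrix, collapse it with SVD, and compare a
# query against documents in the reduced "concept" space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the bank raised interest rates on money transfers",
    "wire a money transfer from your account at the bank",
    "the river bank was muddy after the rain",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()   # weighted term-document matrix (docs x terms)

# "Collapse the matrix": keep only the top k singular vectors, i.e. the latent concepts.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs_lsi = U[:, :k] * S[:k]             # each document as a k-dimensional concept vector

# A query is projected into the same concept space and compared against the documents.
query = vec.transform(["exchange rate and money"]).toarray()
scores = (docs_lsi @ (query @ Vt[:k].T).T).ravel()
print(scores)                           # the two financial documents typically score highest
```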

Some people have argued this technique can’t scale because the matrix would be too large and it would be hard to keep in memory.  However, when I asked the folks at Content Analyst about this, they told me that they have been working on the problem and that CAAT contains a number of features to optimize memory and throughput.  The company regularly works with litigation clients who might get 1-2 TB of information from opposing counsel and are using CAAT for clustering, categorization, and search.  The company also works with organizations that have created indexes of more than 45 million documents (over 8 TB).  That’s a lot of data!

Conceptual Search and Classification and Content Management

Aside from eDiscovery, Content Analyst is also being used in applications such as improving search in media and publishing and, of course, government applications.  The company is also looking into other application areas.

Concept search is definitely a big step up from keyword search and is important for any kind of document that might be stored in a content management system.  Automatic classification and clustering would also be huge (as would summarization and foreign language recognition).  This could move Content Analyst into other areas, including analyzing claims, medical records, and customer feedback.  Content management vendors such as IBM and EMC are definitely moving in the direction of providing more intelligence in their content management products.  This makes perfect sense, since a lot of unstructured information is stored in these systems.  Hopefully, other content management vendors will catch up soon.

A different way to search?

I recently had an interesting conversation about classification and search with James Zubok, CFO of Brainware, Inc.  Brainware is a Virginia-based company that was once part of SER Systems AG, a former German ECM company.  Brainware provides products that help companies extract information from unstructured and semi-structured documents, such as invoices, order forms, and contracts, without using templates.  The company also offers some interesting classification and search technology, and that is what our conversation focused on.

We discussed two different but interrelated technologies that Brainware has developed: a search engine based on n-grams, and a classification engine that uses neural networks.  Brainware offers both enterprise and desktop editions of each.  I received a demo of the desktop version of the products and now have both running on my laptop.

A Search example

On the desktop search side, the product, called Globalbrain Personal Edition, differs from many other search products on the market in that it does not use keyword search.  Rather, its searches are natural-language based, using a patented n-gram approach.  When indexing a word, the word is parsed into overlapping three-letter fragments and a vector is created from them.  For example, the word sample would be parsed as sam, amp, mpl, etc.  According to Brainware, this three-letter snippet approach makes the search engine language independent.  The capability provided by Brainware lets users search not simply on keywords, but on whole paragraphs.  For example, I have many documents (in various formats) on my desktop that deal with all of the companies I speak with.  Say I want to find documents relating to specific challenges companies faced in deploying their text analytics solutions.  Rather than simply inputting “text analytics” and “challenges”, I can type in a phrase or even a paragraph with the wording I’m looking for.  This returns a much more targeted set of documents.
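As a rough illustration of the general trigram idea (a sketch, not Brainware’s patented engine; the file names, texts, and query are invented), the approach might look like this:

```python
# Break text into overlapping three-letter snippets, turn documents and the
# query into snippet-count vectors, and rank documents by cosine similarity.
from collections import Counter
from math import sqrt

def trigrams(text: str) -> Counter:
    """Count overlapping 3-character snippets, e.g. 'sample' -> sam, amp, mpl, ple."""
    t = " ".join(text.lower().split())
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "notes1.txt": "the challenges the company faced deploying its text analytics solution",
    "notes2.txt": "quarterly revenue figures and next year's marketing plan",
}
query = "what challenges did companies face when deploying text analytics"

for name, text in docs.items():
    print(name, round(cosine(trigrams(query), trigrams(text)), 3))
```

Because everything is reduced to character snippets, a long, loosely worded query can still match the right document, and no language-specific dictionary is needed.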


A Classification example

On the desktop classification front, the product is very easy to use.  I simply loaded the software, which provided a user interface where I could define classes and then train the system to automatically classify documents based on a few training examples.  As I mentioned, I have many documents on my desktop that deal with various technology areas, and I might want to classify them in an intelligent manner for some research I’m planning.  So, I created several classes – text analytics, visualization, and MDM – and then dragged documents that I felt fell into each category onto those classes.  I trained the system on these examples.

Brainware provides a visual interface that lets me view how “good” the learned set is via a series of points in three-dimensional space.  The closer together the points (from the same class) are on the plot, the better the classification will be.  Also, the more separate the various class points are, the better the classification.  In my classification test, the visualization and the MDM documents were tightly clustered, while the text analytics information was not.  In any event, I then ran the classifier over the rest of my documents (supplying a few parameters) and the system automatically classified what it could.  It also gave me a list of documents that it couldn’t classify, but suggested the appropriate categories. I could then just drag those documents into the appropriate categories and run the classifier again. I should add that it did a good job of suggesting the right class for the documents it put in the unclassified category.
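A hedged sketch of that train-then-classify loop, including the “unclassified but suggested” bucket, could look like the following. It reuses the trigrams() and cosine() helpers from the search sketch above; the classes, example texts, and threshold are invented, and Brainware’s actual engine is neural-network based.

```python
# Train class centroids from example documents, classify new documents, and
# leave low-confidence documents unclassified with a suggested category.
THRESHOLD = 0.25  # below this similarity, a document is left unclassified (tunable)

training_examples = {
    "text analytics": "entity extraction and sentiment analysis of call center notes",
    "visualization": "interactive charts and dashboards for exploring survey data",
    "MDM": "master data management and customer record deduplication",
}
centroids = {cls: trigrams(text) for cls, text in training_examples.items()}

def classify(text: str):
    scores = {cls: cosine(trigrams(text), c) for cls, c in centroids.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= THRESHOLD:
        return best, scores[best]
    return f"unclassified (suggested: {best})", scores[best]

print(classify("a dashboard of charts visualizing customer survey responses"))
print(classify("agenda for the quarterly board meeting"))
```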


Brainware on an enterprise level


The enterprise edition of the product combines the search and classification capabilities and lets users search and classify over 400 different document types. 


Now, Brainware isn’t planning to compete with Google, Yahoo!, Fast, etc.  Rather, the company sees its search as a complement to these inverted index approaches.  The idea would be to embed its search into other applications that deal with archiving, document management, or e-discovery, to name a few.  The classification piece could also be embedded into the appropriate applications.  I asked if the company was in discussions with content management providers and service providers that store emails and documents.  It would seem to me that this software would be a natural complement to some of these systems. My understanding is that the company is looking for partnerships in the area.  Brainware currently has a partnership with Attensity, a text analytics provider, to help classify and search documents as part of the text analytics process.


I’m interested to see what will develop with this company.