The Importance of multi-language support in advanced search and text analytics

I had an interesting briefing with the Basis Technology team the other week. They updated me on the latest release of their technology, Rosette 7. In case you're not familiar with Basis Technology, it provides the multilingual engine embedded in some of the biggest Internet search engines out there – including Google, Bing, and Yahoo. Enterprises and government agencies also use it. But the company is not just about keyword search. Its technology also enables the extraction of entities (about 18 different kinds) such as organizations, names, and places. What does this mean? It means the software can discover these kinds of entities across massive amounts of data and perform context-sensitive discovery in many different languages.

An Example

Here’s a simple example. Say you’re in the Canadian consulate and you want to understand what is being said about Canada across the world. You type “Canada” into your search engine and get back a listing of documents. How do you make sense of this? Using Basis Technology entity extraction (an enhancement to search and a basic component of text analytics), you could actually perform faceted (i.e. guided) navigation across multiple languages. This is illustrated in the figure below. Here, the user typed “Canada” into the search engine and got back 89 documents. In the main pane of the browser, the word Canada is highlighted (marked with an arrow) in a number of different languages, so you know it appears in these documents. On the left-hand side of the screen is the guided navigation pane. For example, you can see that 15 documents contain a reference to Obama and another 6 contain a reference to Barack Obama. This is not necessarily a co-occurrence within a sentence, just within the document. So, any of these articles would contain a reference to both Obama and Canada, which would help you determine what Obama might have said about Canada, or what the connection is between Canada and the BBC (under organizations). This idea is not necessarily new, but the strong multilingual capabilities make it compelling for global organizations.
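
Basis does not publish Rosette’s internals, so the following is only a minimal sketch of the idea behind faceted navigation over extracted entities: given the documents returned for a query, count how many of them mention each extracted person, location, or organization. The documents and entity values below are invented for illustration.

```python
from collections import Counter

# Hypothetical output of a multilingual entity extractor: for each document
# that matched the query "Canada", a list of (entity_type, normalized_name).
documents = {
    "doc_en_01": [("LOCATION", "Canada"), ("PERSON", "Barack Obama")],
    "doc_fr_02": [("LOCATION", "Canada"), ("PERSON", "Obama")],
    "doc_ar_03": [("LOCATION", "Canada"), ("ORGANIZATION", "BBC")],
}

def facet_counts(docs):
    """Count, per entity type, how many documents mention each entity."""
    counts = {}
    for entities in docs.values():
        # Count each (type, name) pair at most once per document.
        for etype, name in set(entities):
            counts.setdefault(etype, Counter())[name] += 1
    return counts

for etype, counter in facet_counts(documents).items():
    print(etype)
    for name, n in counter.most_common():
        print(f"  {name}: {n} documents")
```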

If you have eagle eyes, you will notice that the search on Canada returned 89 documents, but the entity “Canada” returned only 61 documents. This illustrates what entity extraction is all about. When the search for Canada was run on the Rosette Name Indexer tab (see the upper right-hand corner of the screen shot), the query matched Canada against all automatically extracted entities across the documents – every person, location, and organization with a similar name, including entities like “Canada Post” and “Canada Life,” which are organizations, not the country itself. The other 28 documents therefore contain a Canada variant that refers to an organization or other entity rather than the country.
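
The difference between the two counts comes down to this: a keyword match fires on any occurrence of the string, while an entity match only fires when the extractor has resolved that occurrence to the location Canada. Here is a toy sketch of that distinction (the documents and pre-extracted entities are invented, not Rosette output):

```python
# Invented example documents with pre-extracted entities attached.
docs = [
    {"text": "Obama visited Canada last week.",
     "entities": [("LOCATION", "Canada"), ("PERSON", "Obama")]},
    {"text": "Canada Post announced new rates.",
     "entities": [("ORGANIZATION", "Canada Post")]},
    {"text": "Canada Life reported quarterly earnings.",
     "entities": [("ORGANIZATION", "Canada Life")]},
]

# Keyword search: any document containing the string "Canada".
keyword_hits = [d for d in docs if "canada" in d["text"].lower()]

# Entity search: only documents where "Canada" resolved to a location.
entity_hits = [d for d in docs
               if ("LOCATION", "Canada") in d["entities"]]

print(len(keyword_hits), "keyword matches")
print(len(entity_hits), "entity matches")
```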

Use Cases

There are obviously a number of different use cases where the ability to extract entities across languages can be important.  Here are three:

  • Watch lists.  With the ability to extract entities, such as people, in multiple languages, this kind of technology is well suited to government or financial watch lists.  Basis can resolve matches and translate names in 9 different languages, including resolving multiple spelling variations of foreign names.  It also enables organizations to match names of people, places, and organizations against entries in a multilingual database (a generic sketch of this kind of fuzzy name matching follows this list).
  • Legal discovery.  Basis Technology can identify 55 different languages.  Companies would use this capability, for example, to identify multiple languages within a document and then route documents appropriately.  Additionally, Basis can extract entities in 15 different languages (and search in 21), so the technology could be used to process large document sets and extract the entities associated with them to find the right set of documents needed in legal discovery.
  • Brand image, competitive intelligence.  The technology can be used to extract company names across multiple languages.  The software can also be used against disparate data sources, such as internal document management systems as well as external sources such as the Internet.  This means it could comb the Internet to extract company names (and variations on those names) in multiple languages.  I would expect this technology to be used by “listening posts” and other “Voice of the Customer” services in the near future.
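
Basis does not disclose its matching algorithms, but the general idea behind matching spelling variants against a watch list can be sketched with a simple string-similarity comparison. The sketch below uses Python’s standard difflib; the names and threshold are made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical watch list, already transliterated/romanized.
watch_list = ["Mohammed al-Farouq", "Ivan Petrov", "Canada Life Holdings"]

def best_match(name, candidates, threshold=0.75):
    """Return the closest watch-list entry if it clears the threshold."""
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return (match, score) if score >= threshold else (None, score)

# Spelling variants of the same foreign name should still resolve,
# while an unrelated name should not.
for variant in ["Muhammad al Farouk", "Ivan Petroff", "John Smith"]:
    print(variant, "->", best_match(variant, watch_list))
```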

While this technology is not a text analytics analysis platform, it does provide an important piece of core functionality needed in a global economy.  Look for more announcements from the company in 2010 around enhanced search in additional languages.

The Three C’s – conceptual search, clustering, and categorization

I recently had the opportunity to speak with Richard Turner, VP of Marketing at Content Analyst.  Content Analyst was originally part of SAIC and was spun out about 5 years ago.  The company provides content management, eDiscovery, and content workflow solutions to its clients – primarily as OEM solutions.

The tool set is called CAAT (Content Analyst Analytic Technology).  It includes five core features:

  • Concept search: uses concepts rather than key words to search through documents.
  • Dynamic Clustering: classifies documents into clusters.
  • Categorization: classifies documents into user-defined categories.
  • Summarization: identifies conceptually definitive sentences in documents.
  • Foreign Language: identifies the language in a document and can work across any language.

Let me just briefly touch upon concept search, dynamic clustering, and categorization.  Concept search is interesting because when most people think search, they think keyword search.  However, keywords may not give you what you’re looking for.  For example, I might be interested in finding documents that deal with banks.  However, a document might not state the word bank explicitly; rather, words like finance, money, and so on might occur in it.  So, if you don’t insert the right word into the search engine, you will not get back all relevant documents.  Concept search allows you to search on the concept (not the keyword) “bank,” so you get back documents related to the concept even if they don’t contain the exact word.  CAAT learns what “bank” means in a given set of documents from co-occurring words like “money”, “exchange rate”, and so on.  It also learns that a bank as a financial institution is not the same as the bank on the side of a river because of other terms in the document (such as money, transfer, or ATM).
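
Content Analyst’s implementation is proprietary, but the effect can be approximated with off-the-shelf latent semantic indexing, for example scikit-learn’s TF-IDF weighting plus truncated SVD. In the toy corpus below, a query for “bank” can surface a document that never uses the word because it shares latent vocabulary such as “transfers” and “exchange rates” (the corpus is tiny and invented, so the exact scores are only illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The bank lowered interest rates and fees on money transfers.",
    "Customers visit the bank to exchange money and check rates.",
    "Wire transfers, exchange rates, and ATM fees keep rising.",  # no "bank"
    "We walked along the river and watched the sunset.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

# Collapse the term-document space into a couple of latent "concepts".
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(tfidf)

# Query on the concept "bank" and rank all documents by similarity.
query_vector = lsa.transform(vectorizer.transform(["bank"]))
scores = cosine_similarity(query_vector, doc_vectors)[0]

for score, text in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.2f}  {text}")
```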

Dynamic clustering enables the user to organize documents into content-based groups called clusters.  You can also categorize documents by taking examples that fall into a certain cluster and then training the system to recognize similar documents that belong in the same category.  You literally tag a document as belonging to a certain category and then give the system examples of other, similar documents to train on.  In eDiscovery applications, this can dramatically cut down the amount of time needed to find the right documents.  In the past, this was done manually, which obviously could be very time intensive.
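
As a rough analogy (not CAAT’s actual algorithm), the two steps might look like the sketch below: unsupervised clustering proposes groupings on its own, then a handful of hand-labeled example documents train a nearest-centroid classifier that labels the rest. The documents, class names, and parameters are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestCentroid

docs = [
    "Quarterly earnings call transcript and revenue guidance.",
    "Revenue and profit figures for the fiscal year.",
    "Patent filing describing a new battery chemistry.",
    "Lab notes on electrolyte materials for the battery prototype.",
    "Shareholder meeting minutes on revenue and earnings.",
    "Test results for a lithium battery cell prototype.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# 1. Dynamic clustering: let the system propose groupings on its own.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", clusters)

# 2. Categorization: tag a few example documents by hand ...
train_idx = [0, 2]                       # one financial doc, one battery doc
train_labels = ["finance", "research"]
clf = NearestCentroid()
clf.fit(X[train_idx], train_labels)

# ... then let the trained model label the remaining documents.
print("predicted:", clf.predict(X[[1, 3, 4, 5]]))
```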

How do they do it?

The company uses a technique called Latent Semantic Indexing (LSI), along with other patented technology, to accomplish all of this.  Here is a good link that explains LSI.  The upshot is that LSI uses a vector representation of the information found in the documents to analyze the term space.  Essentially, it strips out the grammar, then counts and weights the occurrence of terms (e.g. how often a word appears on a page or in a document).  It does this across all of the documents and then collapses the resulting matrix using a technique patented at Bell Labs.  The lower (or more negative) a term’s score in this reduced space, the greater its distance from a given page.  Since the approach is purely mathematical, there is no need to put together dictionaries or thesauri, and it is this mathematical approach that makes the system language independent.
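
For the mechanics, here is a bare-bones sketch of the LSI math the paragraph describes, using numpy on a tiny term-document count matrix. The weighting and dimensionality are simplified stand-ins; CAAT’s own weighting scheme and patented optimizations are not public.

```python
import numpy as np

terms = ["bank", "money", "transfer", "river", "water"]
# Term-document count matrix: rows are terms, columns are documents.
counts = np.array([
    [2, 1, 0],   # bank
    [1, 2, 0],   # money
    [1, 1, 0],   # transfer
    [0, 0, 2],   # river
    [0, 0, 1],   # water
], dtype=float)

# Simple log weighting of raw counts (a stand-in for proprietary weights).
weighted = np.log1p(counts)

# "Collapse the matrix": keep only the k strongest singular directions.
U, S, Vt = np.linalg.svd(weighted, full_matrices=False)
k = 2
term_vectors = U[:, :k] * S[:k]       # terms placed in the latent space
doc_vectors = Vt[:k, :].T * S[:k]     # documents placed in the same space

# Similarity of each term to document 0: low or negative values mean the
# term sits far from that document in the reduced space.
doc0 = doc_vectors[0]
for term, vec in zip(terms, term_vectors):
    sim = vec @ doc0 / (np.linalg.norm(vec) * np.linalg.norm(doc0))
    print(f"{term:10s} {sim:+.2f}")
```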

Some people have argued that this technique can’t scale because the matrix would be too large and hard to keep in memory.  However, when I asked the folks at Content Analyst about this, they told me that they have been working on the problem and that CAAT contains a number of features to optimize memory and throughput.  The company regularly works with litigation clients who might receive 1-2 TB of information from opposing counsel and are using CAAT for clustering, categorization, and search.  The company also works with organizations that have created indexes of 45+ million documents (>8 TB).  That’s a lot of data!

Conceptual Search, Classification, and Content Management

Aside from eDiscovery, Content Analyst is also being used in applications such as improving search in media and publishing and, of course, in government.  The company is also looking into other application areas.

Concept search is definitely a big step up from keyword search and is important for any kind of document that might be stored in a content management system.  Automatic classification and clustering would also be huge (as would summarization and foreign language recognition).  This could move Content Analyst into other areas, including analyzing claims, medical records, and customer feedback.  Content management vendors such as IBM and EMC are definitely moving in the direction of providing more intelligence in their content management products.  This makes perfect sense, since a lot of unstructured information is stored in these systems.  Hopefully, other content management vendors will catch up soon.

A different way to search?

I recently had an interesting conversation about classification and search with James Zubok, CFO of Brainware, Inc.  Brainware is a Virginia-based company that was once part of SER Systems AG, a former German ECM company.  Brainware provides products that help companies extract information from unstructured and semi-structured documents, such as invoices, order forms, and contracts, without using templates.  The company also offers some interesting classification and search technology, and that is what our conversation focused on.

We discussed two different but interrelated technologies that Brainware has developed: a search engine based on n-grams and a classification engine that uses neural networks.  Brainware offers both enterprise and desktop editions of each.  I received a demo of the desktop versions of the products and now have both running on my laptop.

A Search example

On the desktop search side, the product, called Globalbrain Personal Edition, differs from many other search products on the market in that it does not use keyword search.  Rather, its searches are natural-language based, using a patented n-gram approach.  When indexing a word, the word is broken into overlapping three-letter segments and a vector is created from them.  For example, the word sample would be parsed as sam, amp, mpl, and so on.  According to Brainware, this three-letter snippet approach makes the search engine language independent.  The capability provided by Brainware lets users search not simply on keywords but on whole paragraphs.  For example, I have many documents (in various formats) on my desktop that deal with all of the companies I speak with.  Say I want to find documents relating to the specific challenges companies faced in deploying their text analytics solutions.  Rather than simply inputting “text analytics” and “challenges,” I can type in a phrase or even a paragraph with the wording I’m looking for.  This returns a much more targeted set of documents.
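
Brainware’s exact method is patented, but the flavor of trigram matching can be sketched as follows: decompose each string into overlapping three-letter snippets and compare the resulting vectors (here by cosine similarity), which is what makes whole-paragraph queries and misspellings workable. This is a generic illustration, not Globalbrain’s implementation; the documents and query are invented.

```python
from collections import Counter
from math import sqrt

def trigrams(text):
    """Overlapping three-letter snippets, e.g. 'sample' -> sam, amp, mpl, ple."""
    t = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "Challenges the company faced deploying its text analytics solution.",
    "Quarterly revenue figures and sales pipeline for the region.",
    "Lessons learned rolling out a text analytics platform at scale.",
]

# Query with a whole phrase rather than isolated keywords.
query = trigrams("problems encountered when deploying text analytics")
for doc in sorted(docs, key=lambda d: cosine(query, trigrams(d)), reverse=True):
    print(f"{cosine(query, trigrams(doc)):.2f}  {doc}")
```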

A Classification example

On the desktop classification front, the product is very easy to use.  I simply installed the software, which provides a user interface where I could define classes and then train the system to automatically classify documents based on a few training examples.  As I mentioned, I have many documents on my desktop that deal with various technology areas, and I might want to classify them in an intelligent manner for some research I’m planning.  So, I created several classes: text analytics, visualization, and MDM.  I then dragged documents that I felt fell into each category onto those classes and trained the system on these examples.

Brainware provides a visual interface that lets me view how “good” the learned set is via a series of points in three-dimensional space.  The closer together the points (from the same class) are on the plot, the better the classification will be.  Also, the more separate the various class points are, the better the classification.  In my classification test, the visualization and the MDM documents were tightly clustered, while the text analytics information was not.  In any event, I then ran the classifier over the rest of my documents (supplying a few parameters) and the system automatically classified what it could.  It also gave me a list of documents that it couldn’t classify, but suggested the appropriate categories. I could then just drag those documents into the appropriate categories and run the classifier again. I should add that it did a good job of suggesting the right class for the documents it put in the unclassified category.
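
The “couldn’t classify, but suggested a category” behavior can be imitated with a confidence threshold: score each new document against each trained class, assign it only when the best score clears the bar, and otherwise report the best guess as a suggestion. This is a rough centroid-based sketch, not Brainware’s neural-network approach; the training snippets, class names, and threshold are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

training = {
    "text analytics": ["entity extraction and sentiment from call notes",
                       "classifying support emails with text mining"],
    "visualization":  ["dashboards and charting for executive reporting",
                       "interactive charts and dashboard layout tips"],
    "MDM":            ["master data management for customer records",
                       "golden record matching and data governance"],
}

labels = list(training)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform([d for docs in training.values() for d in docs])

# One centroid per class: the mean of its two training vectors.
centroids = np.vstack([
    np.asarray(X[i * 2:(i + 1) * 2].mean(axis=0)) for i in range(len(labels))
])

def classify(text, threshold=0.2):
    scores = cosine_similarity(vectorizer.transform([text]), centroids)[0]
    best = int(np.argmax(scores))
    if scores[best] >= threshold:
        return labels[best], round(float(scores[best]), 2)
    # Below the bar: leave unclassified but still suggest the closest class.
    return f"unclassified (suggest: {labels[best]})", round(float(scores[best]), 2)

print(classify("notes on sentiment analysis of survey text"))
print(classify("travel itinerary for the sales kickoff"))
```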

Brainware on an enterprise level

The enterprise edition of the product combines the search and classification capabilities and lets users search and classify over 400 different document types. 

Now, Brainware isn’t planning to compete with Google, Yahoo!, Fast, etc.  Rather, the company sees its search as a complement to these inverted index approaches.  The idea would be to embed its search into other applications that deal with archiving, document management, or e-discovery, to name a few.  The classification piece could also be embedded into the appropriate applications.  I asked if the company was in discussions with content management providers and service providers that store emails and documents.  It would seem to me that this software would be a natural complement to some of these systems. My understanding is that the company is looking for partnerships in the area.  Brainware currently has a partnership with Attensity, a text analytics provider, to help classify and search documents as part of the text analytics process.

I’m interested to see what will develop with this company.
