Content analytics: The main enterprise search engine pillar
The comedian Steve Martin wasn’t thinking about content analytics in 1978 when he recorded his classic album, “A Wild and Crazy Guy.” One of his bits, though, turns out to be quite thought-provoking from the perspective of today’s content intelligence and natural language processing workloads.
Martin suggests a “great dirty trick you can play on a three-year-old.” Noting that kids learn how to talk by listening to adults, he says, hey, why not just talk wrong when the kid is around? That way, he’ll show up for his first day of school and proclaim, “Me mambo dogface in the banana patch!”
Whether or not you find Martin’s joke funny, it inadvertently points to a split in the Artificial Intelligence (AI) field that affects how enterprise systems parse language. An empiricist will tell you that a machine can become intelligent if you expose it to enough examples of data. That is, if you allow a computer to read a thousand books about biology, it will start to understand how living organisms function.
A rationalist, in contrast, will argue that the machine will only become intelligent if the data examples are paired with logical structures that model how reasoning works. Dumping a thousand biology books’ worth of data into a computer just creates a jumble of disconnected biology information. No higher understanding is achieved.
Both schools of thought have merit in AI as it is applied to the work of analyzing content. On its own, however, the pure empiricist viewpoint doesn’t work well. If you do not deploy an adequate natural language processing (NLP) toolset in your enterprise, any attempt to analyze content will give you the corporate equivalent of “Me mambo dogface in the banana patch!”
What is content analytics?
Content analytics is a version of data analytics that deals with information contained inside written content, such as digital documents. Content analytics blends the empiricist and rationalist approaches to AI as it seeks to find structure and meaning in enormous repositories of seemingly random information. From the empiricist perspective, a content analytics solution must consume, or crawl, all the information contained in the content. This might mean “reading” a couple of million PDFs.
To make any sense of this mass of words, however, content analytics has to apply—or learn—some rationalist concepts to put content into context. For example, the word “table” in “Table of Contents” is not a table like the kind where you eat dinner. You know that. I know that.
But, does a machine know that? Probably not right at the start. The content analytics solution can learn to differentiate such terms, but this requires Natural Language Understanding (NLU) capabilities.
Natural Language Understanding, which is a subset of NLP, involves software that can understand human language. It accomplishes this task by means of a lexicon, which is a huge compendium of human language, along with rules-based processes that turn natural language into data a computer can use in analytics workloads.
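To make the lexicon-plus-rules idea concrete, here is a minimal sketch in Python of how a system might tell the two kinds of “table” apart. Everything in it is an assumption made for illustration: the tiny `LEXICON`, the `CONTEXT_CUES` word lists and the `disambiguate` function are hypothetical, and a real NLU component would rely on a far larger lexicon and far richer rules.

```python
import re

# Hypothetical mini-lexicon: each word maps to the senses it can carry.
LEXICON = {
    "table": ["furniture", "document_structure"],
}

# Hypothetical context rules: cue words that point toward each sense.
CONTEXT_CUES = {
    "furniture": {"dinner", "eat", "chair", "kitchen", "sit"},
    "document_structure": {"contents", "figure", "column", "row", "page"},
}


def disambiguate(word, sentence):
    """Return the sense of `word` best supported by the sentence, or None."""
    senses = LEXICON.get(word.lower())
    if not senses:
        return None  # the word is not in the lexicon at all
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    # Score each sense by how many of its cue words appear in the sentence.
    scores = {sense: len(tokens & CONTEXT_CUES[sense]) for sense in senses}
    best = max(senses, key=lambda sense: scores[sense])
    # With no cue words present, the rules cannot decide.
    return best if scores[best] > 0 else None


print(disambiguate("table", "See the Table of Contents on page 2"))
print(disambiguate("table", "We eat dinner at the table"))
```

The lexicon supplies the candidate meanings and the rules pick among them based on the surrounding words. When no cue word is present, the sketch returns `None` rather than guessing.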
How to strengthen your environment through content analytics
Content analytics is an essential element of an information-driven organization. To understand why this is the case, consider the quality of information available to employees if there were no content analytics in an organization’s enterprise search solution. Without content analytics, a search engine can only return results that feature a keyword or phrase. Those results will be poorly tuned and possibly even totally irrelevant.
For example, imagine you’re looking for a copy of last year’s brand marketing plan. You type “branding plan” into your enterprise search tool, because you think that’s what the plan was actually called. You’re not sure.
If your search tool has content analytics capabilities, it will have already read through all the documents that might fit the term “branding plan.” It will have associated the term “branding plan” with what the document is actually called, which is “Branding Vision and Plan.” The solution returns this document among the top search results.
In contrast, a solution that does not use content analytics might return no results, because nothing matches “branding plan.” Or, worse, it could show you hundreds of results that contain the words “branding” or “plan.” That’s completely useless and a waste of time.
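The difference between these outcomes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TITLES` list is made up apart from “Branding Vision and Plan,” and the `ranked_search` function uses simple word overlap as a crude stand-in for the associations a real content analytics solution would build. It is not how any particular search product works.

```python
import re

# Hypothetical document titles; only the first comes from the example above.
TITLES = [
    "Branding Vision and Plan",
    "Q3 Travel Plan",
    "Branding Guidelines for Partners",
    "Office Floor Plan",
]


def tokens(text):
    """Lowercase a string and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_phrase_search(query, titles):
    """Return only titles that contain the query words as one unbroken phrase."""
    phrase = " ".join(tokens(query))
    return [t for t in titles if phrase in " ".join(tokens(t))]


def any_keyword_search(query, titles):
    """Return every title that shares at least one word with the query, unranked."""
    wanted = set(tokens(query))
    return [t for t in titles if wanted & set(tokens(t))]


def ranked_search(query, titles):
    """Return matching titles, those sharing the most query words first."""
    wanted = set(tokens(query))
    scored = [(len(wanted & set(tokens(t))), t) for t in titles]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps original order on ties
    return [t for _, t in scored]


print(exact_phrase_search("branding plan", TITLES))
print(any_keyword_search("branding plan", TITLES))
print(ranked_search("branding plan", TITLES)[0])
```

The exact-phrase search finds nothing, the any-keyword search returns every title in the list, and only the ranked version puts “Branding Vision and Plan” at the top.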
As this example shows, content analytics depends on the ability to perform unstructured data analytics. Documents like PDFs and PowerPoints are unstructured. Unlike structured data, the neatly stored information in the rows and columns of a database, unstructured data is loose and subjective. Content, by definition, is unstructured data.
One concept that sometimes gets confused with content analytics is that of predictive analytics. Predictive analytics is an analytics process that examines data to make predictions about future events. For instance, a predictive analytics tool might ingest thousands of maintenance reports for a fleet of trucks. The tool might arrive at the conclusion that a truck will need new brake pads after it has driven a certain number of miles, regardless of what the official maintenance manual says.
A predictive analytics tool could interpret a data stream of real-time truck odometer readings taken from Internet-connected sensors and flag a truck that needs new brake pads. By alerting a maintenance manager about the need for the new pads, the predictive analytics tool could help avert an accident. Predictive analytics and content analytics, while separate workloads, do overlap. In this example, the truck maintenance reports are unstructured content.
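As a rough sketch of the brake pad example, the Python below learns a replacement interval from past pad lifetimes and then flags trucks based on their latest odometer readings. All names and figures are hypothetical. The sketch also assumes the mileage numbers have already been extracted from the maintenance reports, which is the part content analytics would handle, and it uses a simple median with a safety margin in place of a real predictive model.

```python
from statistics import median

# Hypothetical history: miles each set of brake pads lasted before replacement,
# assumed to have been extracted from maintenance reports already.
PAD_LIFE_MILES = [41000, 38500, 44000, 39500, 42000]


def learn_replacement_interval(pad_lifetimes, safety_margin=0.9):
    """Estimate a replacement interval: the median pad life, reduced by a margin."""
    return median(pad_lifetimes) * safety_margin


def trucks_needing_pads(readings, last_replaced_at, interval):
    """Return IDs of trucks whose miles on the current pads reach the interval."""
    flagged = []
    for truck_id, odometer in readings.items():
        miles_on_pads = odometer - last_replaced_at.get(truck_id, 0)
        if miles_on_pads >= interval:
            flagged.append(truck_id)
    return sorted(flagged)


interval = learn_replacement_interval(PAD_LIFE_MILES)

# Hypothetical latest odometer readings and the mileage at each truck's last pad change.
readings = {"T1": 120000, "T2": 95000, "T3": 60000}
last_replaced_at = {"T1": 80000, "T2": 70000, "T3": 30000}

print(trucks_needing_pads(readings, last_replaced_at, interval))
```

In this made-up data, only truck T1 has driven far enough on its current pads to be flagged for the maintenance manager.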
Content analytics with enterprise search
Search and content analytics go together. In fact, you could argue that enterprise content analytics is essential for effective enterprise search. As the “branding plan” example shows, without content analytics, search won’t work very well.
Content analytics software enriches enterprise search. In addition to handling analytics content, a content analytics solution embodies the rules and NLU processing capabilities required for meaningful analysis of unstructured content.