Inside Enterprise Search: The Search Index
Large enterprises and organizations collect an enormous amount of content and data, each day adding new pieces to the stockpile. More than 80% of that data is unstructured, scattered, and siloed across a variety of systems, formats, and languages.
To help employees spend less time searching this mass of data and more time problem solving, enterprise search was born. An enterprise search engine provides access to information from any source, in any format, whether structured or unstructured, through a single platform and user interface.
As with Google, building a search index is part of what makes an enterprise search engine work. But unlike Google, an enterprise search engine must work across not only web pages but also databases, applications, email, content management systems (CMS), and more. This makes the enterprise search indexing process unique.
Let’s walk through the details of an enterprise search index.
What is a search index?
The search index is the place where all of the data that the search platform collected is stored and organized. At the simplest level, it’s similar to an index in the back of a book, where you can look up a keyword and see all of the pages that have information related to that term.
The point of the search index is to retrieve the most relevant information for a query quickly and accurately. Large organizations can have tens of millions of documents, so a search index is necessary to understand and categorize them all. If someone in a manufacturing organization, for example, searches for a part number, everything that relates to that part number, from Slack conversations to call center logs and PLM system documents, can be retrieved in an instant.
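The back-of-the-book analogy corresponds to what is commonly called an inverted index: a map from each term to the documents that contain it. The following is a minimal sketch of that idea, not a description of any particular product's internals. The document IDs, part numbers, and function names are all hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def lookup(index, term):
    """Return the sorted IDs of the documents that contain the term."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical documents from three different sources.
documents = {
    "slack-101": "Customer asked about part PN-4471 delivery",
    "call-202": "Caller reported defect in PN-4471 housing",
    "plm-303": "Specification sheet for part PN-9000",
}

index = build_inverted_index(documents)
print(lookup(index, "PN-4471"))  # ['call-202', 'slack-101']
```

Because the lookup is a single dictionary access, its cost does not grow with the number of documents, which is why an index can answer a query far faster than scanning every source.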
The most important part of the search indexing process is extracting meaning from every kind of enterprise dataset, in any language. An advanced enterprise search platform’s search index should be able to:
- Discern meaning from structured, unstructured, and semi-structured content.
- Support all major languages, including double-byte character sets (Chinese and Japanese).
- Conduct a full syntax analysis in many global languages.
- Build a rich index based on statistical, linguistic, and semantic analyses.
How does an enterprise search index work?
Every enterprise search platform will have its own indexing processes. Here at Sinequa, we have a variety of modes to start the process, depending on the freshness and update frequency desired for each data source.
- On-demand mode: indexing is triggered manually by the administrator, as needed.
- Scheduled mode: automatic indexing at pre-set intervals or according to a set calendar.
- Trigger mode: indexing automatically triggered based on set events (e.g., whenever a document is added to a particular location, or after 1,000 records are added to a database).
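The three modes above can be sketched as a single decision function that answers the question "should this source be indexed now?". This is an illustrative sketch only. The `SourceState` fields, the 24-hour default interval, and the mode names are assumptions made for the example, with the 1,000-record threshold borrowed from the trigger-mode example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class SourceState:
    """Hypothetical per-source settings and counters."""
    mode: str                                  # "on_demand", "scheduled", or "trigger"
    last_indexed: datetime
    interval: timedelta = timedelta(hours=24)  # used by scheduled mode
    pending_records: int = 0                   # used by trigger mode
    record_threshold: int = 1000               # used by trigger mode
    admin_requested: bool = False              # used by on-demand mode

def should_index(state: SourceState, now: datetime) -> bool:
    """Decide whether an indexing run should start for this source."""
    if state.mode == "on_demand":
        return state.admin_requested
    if state.mode == "scheduled":
        return now - state.last_indexed >= state.interval
    if state.mode == "trigger":
        return state.pending_records >= state.record_threshold
    raise ValueError(f"unknown mode: {state.mode}")

now = datetime(2024, 1, 2, 12, 0)
nightly = SourceState("scheduled", last_indexed=datetime(2024, 1, 1, 12, 0))
database = SourceState("trigger", last_indexed=now, pending_records=999)
print(should_index(nightly, now))   # True: 24 hours have elapsed
print(should_index(database, now))  # False: one record short of the threshold
```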
Depending on the circumstances, search indexing can either be complete or incremental:
- Complete indexing: the source is fully indexed (or re-indexed). This is used for initial indexing of a new data source or when a data store is replaced rather than updated.
- Incremental indexing: only new or updated data is indexed.
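The difference between the two approaches comes down to which documents are selected for processing. Here is a minimal sketch, assuming each document carries a `modified_at` timestamp (a hypothetical field name) and that the time of the previous run is known.

```python
from datetime import datetime

def select_for_indexing(source, last_indexed_at=None):
    """Return the subset of `source` that needs indexing.

    With no previous run (last_indexed_at is None) this is a complete
    indexing pass. Otherwise only documents modified since the last run
    are selected, which is an incremental pass.
    """
    if last_indexed_at is None:
        return dict(source)
    return {
        doc_id: doc
        for doc_id, doc in source.items()
        if doc["modified_at"] > last_indexed_at
    }

# Hypothetical data source with three documents.
source = {
    "doc-1": {"modified_at": datetime(2024, 1, 1), "text": "Original manual"},
    "doc-2": {"modified_at": datetime(2024, 3, 5), "text": "Updated spec"},
    "doc-3": {"modified_at": datetime(2024, 3, 9), "text": "New report"},
}

print(sorted(select_for_indexing(source)))                        # all three documents
print(sorted(select_for_indexing(source, datetime(2024, 3, 1))))  # ['doc-2', 'doc-3']
```

This sketch only covers new and updated documents. Detecting documents that were deleted from the source since the last run would need additional bookkeeping.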
Why enterprise search needs multiple indexes
If enterprise search were simply a matter of matching keywords, a single index would be enough. But the complexity of enterprise search, and the fact that much of the information is unstructured, necessitate the use of multiple indexes. When these indexes are combined, the platform can provide the most comprehensive assessment of the meaning of the text.
At Sinequa, our search index is a combination of three types of indexes: full text, structured, and semantic. Together, they deliver relevance at scale.
- Full text index: This index contains statistics of key terms and descriptions about how those terms appear within documents. When the information is processed, various statistics are gathered about the words, entities, and phrases present in the source documents.
- Structured index: This index contains the structured metadata of documents (titles, dates, authors, keywords, document version, etc.) whenever that metadata is available.
- Semantic index: This index is focused on the meaning of the information in the text. Deep text analysis and natural language processing distinguish between nuances in meaning, for instance, to understand that a glass of wine holds a drink, while a glass window refers to the material, and glasses are something you wear to see better.
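To make the division of labor concrete, the sketch below keeps term statistics and document metadata in two separate structures and consults both at query time. It is a simplified illustration with hypothetical documents and field names, not a description of how Sinequa stores its indexes. The semantic index is left out, because word-sense analysis of the kind described above requires natural language processing well beyond a short example.

```python
from collections import Counter

# Hypothetical source documents.
documents = {
    "d1": {"author": "Ana", "text": "glass window glass pane"},
    "d2": {"author": "Ben", "text": "glass of wine"},
}

# Full text index: per-document term frequencies.
full_text_index = {
    doc_id: Counter(doc["text"].lower().split())
    for doc_id, doc in documents.items()
}

# Structured index: per-document metadata.
structured_index = {
    doc_id: {"author": doc["author"]} for doc_id, doc in documents.items()
}

def search(term, author=None):
    """Rank documents by term frequency, optionally filtered by author."""
    results = []
    for doc_id, counts in full_text_index.items():
        frequency = counts[term.lower()]
        if frequency == 0:
            continue
        if author is not None and structured_index[doc_id]["author"] != author:
            continue
        results.append((doc_id, frequency))
    # Highest frequency first; ties broken by document ID.
    return sorted(results, key=lambda r: (-r[1], r[0]))

print(search("glass"))                # [('d1', 2), ('d2', 1)]
print(search("glass", author="Ben"))  # [('d2', 1)]
```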
In addition to building multiple indexes, an advanced enterprise search indexing process should also incorporate a feedback loop. The system should record which returned documents users actually open and, if ratings are enabled, the rating each document receives. This helps the system learn which documents are most used and most highly rated, which in turn helps improve the quality of future search results.
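One simple way to picture such a feedback loop is as a re-ranking step that adjusts each document's base relevance score using recorded usage and ratings. The scoring formula, the weights, and the shape of the `feedback` dictionary below are assumptions chosen for illustration. A production system would likely use a more sophisticated learned model.

```python
import math

def rerank(results, feedback, usage_weight=0.1, rating_weight=0.2):
    """Re-order (doc_id, base_score) pairs using usage and rating feedback.

    `feedback` maps doc_id to {"clicks": int, "ratings": [numbers]}.
    Documents without feedback keep their base score.
    """
    def adjusted_score(item):
        doc_id, base_score = item
        stats = feedback.get(doc_id, {})
        clicks = stats.get("clicks", 0)
        ratings = stats.get("ratings", [])
        mean_rating = sum(ratings) / len(ratings) if ratings else 0.0
        # log1p dampens the effect of very large click counts.
        return (base_score
                + usage_weight * math.log1p(clicks)
                + rating_weight * mean_rating)

    return sorted(results, key=adjusted_score, reverse=True)

results = [("doc-a", 1.0), ("doc-b", 0.9)]
feedback = {"doc-b": {"clicks": 50, "ratings": [5, 4]}}

print([doc_id for doc_id, _ in rerank(results, feedback)])  # ['doc-b', 'doc-a']
```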
The importance of security in the enterprise search index
Another layer that is unique to the enterprise search index is security. In a large organization, managing who has access to what content is of critical importance. This extends from the rules governing which data is ingested in the first place all the way to controlling the content that populates the auto-suggest feature.
An advanced enterprise search index will include role-based access control, to ensure that only authorized content is returned to each user depending on their access level. Content-based access control should also be included, allowing administrators to lock content as needed.
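Both controls can be pictured as a filter applied to search results before they reach the user. The sketch below assumes a hypothetical access control list (`acl`) that maps each document to the roles allowed to see it, plus a set of documents an administrator has locked. Documents with no ACL entry are denied by default, which is a design choice made for this example.

```python
def filter_by_access(results, user_roles, acl, locked=frozenset()):
    """Keep only the documents the user is authorized to see.

    results:    ordered list of document IDs from the search step
    user_roles: roles held by the user making the query
    acl:        maps document ID to the set of roles allowed to read it
    locked:     document IDs an administrator has locked for everyone
    """
    roles = set(user_roles)
    allowed = []
    for doc_id in results:
        if doc_id in locked:
            continue  # content-based lock overrides any role
        if acl.get(doc_id, set()) & roles:
            allowed.append(doc_id)  # user holds at least one permitted role
    return allowed

# Hypothetical access rules.
acl = {
    "d1": {"engineering"},
    "d2": {"hr"},
    "d3": {"engineering", "hr"},
}
hits = ["d1", "d2", "d3", "d4"]

print(filter_by_access(hits, {"engineering"}, acl))                 # ['d1', 'd3']
print(filter_by_access(hits, {"engineering"}, acl, locked={"d3"}))  # ['d1']
```

Applying the same filter to auto-suggest candidates would keep suggestions from revealing the existence of content a user is not allowed to open.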