Elasticsearch is a powerful tool for classifying unstructured text.

  • Enhance your document information in 4 easy steps
  • Find valuable information in your company's data
  • Enrich documents
  • Use this information for data-driven decision making

In a recent use case for a large international consulting company, we had to exploit unstructured documents so that our client could analyse their internal data and find similar content, based on a tag classification.

That’s typically a case where Elasticsearch can look like magic. In just a few days, using the core functionality of Elasticsearch and Kibana, we were able to automatically classify all of the company's documents, in 3 languages, with the information that really matters to them.

The overall process can be divided into 4 steps:

1. Ingest

Sending many different document types in a semi-structured format to create a usable Elasticsearch index is quite easy with the right tools.

Thanks to Elastic developers, the open-source FSCrawler app does exactly this.

You’ll just need to follow the documentation and add it to a cron job to send any Tika-compatible document to Elasticsearch.
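As a sketch, a crontab entry along these lines would re-scan your document folder every hour (the job name and install path are hypothetical; the folder itself is configured in the job's `_settings.yaml`):

```
# Hypothetical crontab entry: run the FSCrawler job "company_docs"
# once per hour; --loop 1 makes FSCrawler scan once and exit.
0 * * * * /opt/fscrawler/bin/fscrawler company_docs --loop 1
```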

2. Analyse

It’s the longest part, as you’ll need to run some tests by yourself. Thanks to shingles, you can create tokens made of several words. You’ll have to do some trial and error to find the right size limits for your shingles and to add the low-value words to the stopword list of your index.

Don’t forget to set fielddata=true on your content field to be able to aggregate on it.
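As an illustration, an index configuration along these lines combines a shingle filter, a custom stopword list and fielddata on the content field (the index name, shingle sizes and stopwords are assumptions to adapt to your own data):

```json
PUT /documents
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        },
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "of", "and", "company"]
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords", "my_shingles"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "shingle_analyzer",
        "fielddata": true
      }
    }
  }
}
```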

Kibana is a great tool for this step. 

Keep the most valuable terms, remove the other ones, filter, and play with Kibana until you retrieve what you want. A significant terms aggregation can be a great help here.
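For example, a significant terms aggregation on the shingled content field could look like this (index name, field name and the example query term are assumptions):

```json
POST /documents/_search
{
  "size": 0,
  "query": {
    "match": { "content": "consulting" }
  },
  "aggs": {
    "significant_content": {
      "significant_terms": { "field": "content" }
    }
  }
}
```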

3. Enrich

For this part, we wrote a small script to clean up keywords, normalise them the way we need, and dedupe them after normalisation. Now it’s time to create your percolators.
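A minimal sketch of such a cleanup step might look like this (the exact normalisation rules — lowercasing, accent stripping, whitespace collapsing — are assumptions to adapt to your keywords):

```python
import unicodedata

def normalise(keyword: str) -> str:
    """Lowercase, strip accents and collapse whitespace."""
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", keyword)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())

def dedupe(keywords):
    """Deduplicate after normalisation, keeping first-seen order."""
    seen = set()
    result = []
    for kw in keywords:
        norm = normalise(kw)
        if norm and norm not in seen:
            seen.add(norm)
            result.append(norm)
    return result

print(dedupe(["Décision", "decision ", "Data  Driven", "data driven"]))
# → ['decision', 'data driven']
```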

They will be used in the next step to add our keywords on the original documents.
A basic percolator document will look like this:

{
  "query": {
    "match_phrase": {
      "content": "search phrase"
    }
  },
  "tag": "tag value"
}
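Note that percolators have to be stored in an index whose mapping declares a percolator field, and which also maps the fields the stored queries refer to. Something along these lines should work (the index and field names are assumptions):

```json
PUT /tags
{
  "mappings": {
    "properties": {
      "query":   { "type": "percolator" },
      "tag":     { "type": "keyword" },
      "content": { "type": "text" }
    }
  }
}
```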


Then, the last step is to use these percolators to find the tags to add to your documents.
Percolate each original document and update it with the matched tags.
That’s 2 queries per document.
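Those 2 queries could look like this (index names, document id and tag values are hypothetical): first a percolate search to find the matching tags, then an update on the original document:

```json
GET /tags/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "content": "the full text of the original document"
      }
    }
  }
}

POST /documents/_update/doc-id
{
  "doc": {
    "tags": ["tag value"]
  }
}
```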

Congratulations, you now have a structured field for analysis!

4. Exploit

It was the most interesting step for our client! This is where you can build dashboards customised for each of their use cases.

Strategic decisions can now be taken using simple dashboards like this one: 

Spoon consulting is a certified partner of Elastic

As a certified partner of Elastic, Spoon Consulting offers high-level consulting for all kinds of companies.

Read more about your own Elasticsearch use case in Spoon Consulting's posts

Or contact Spoon Consulting now