Note: this is a re-post of our original article on Medium: Denormalization for index mapping in Elasticsearch.
Don’t hesitate to support us with a clap.

TL;DR

  • Elasticsearch is not a relational database. You will not be able to join indices the way you used to join tables.
  • Denormalization is not natural, but it is key to efficiency in an Elasticsearch application.
  • Thinking about your data mapping from the very beginning will allow your app to fly for years.

Why Denormalize?

If you look for a definition of Elasticsearch, you will probably find something like this: 

Elasticsearch is a distributed, open-source search and analytics engine built on Apache Lucene

At Spoon Consulting, we like to use another definition: 

Elasticsearch is a distributed NoSQL database optimized for search

NoSQL databases come with their own rules.
In general, they are optimized for specific use cases and designed to fit specific needs. 

So when Elastic optimizes its NoSQL database for search, the focus is SPEED.

When analyzing the performance of classical relational databases, we all know that JOINs have a cost and can be very heavy and very slow. 

The best way to speed them up as much as possible is to… remove them. 

That’s why, in Elasticsearch, an index CANNOT be joined with another one. 
So if you need data, it must be in your index.

As a developer, you worked for years and years to apply the state of the art of normalization, never repeating any piece of information anywhere. You are probably proud enough of some DB designs to hang them on the wall, next to your awards… 

How to denormalize your data structure?

In SQL terms, your goal is to gather several tables into a single one. 
And for this, you will have to flatten and repeat yourself a lot!

It’s really hard to wrap your head around.
But it’s the key to turning a 200 ms response time into 5 ms. Optimizing data in the Elasticsearch paradigm does not mean the same thing as optimizing data in PostgreSQL. 

Let’s take a very simple blog database example: 

[Figure: classical normalized blog post model]

First, we ask ourselves: what do we want to search for? 
We want to search on posts, so our index will be built on posts (not on blogs) and will keep track of the data relevant to our search use case.

Then an Elasticsearch document will probably look like this: 

POST denormalized-blog-posts/_doc
{
  "id": "12345",
  "title": "Denormalization for elastic search",
  "post_content": "my long text content",
  "blog": {
    "id": 1,
    "title": "Spoon Consulting blog on Elasticsearch",  
    "slogan": "better, faster, bigger",
    "user": {
      "id": "1",
      "username": "Jérémy Gachet"
    }
  },
  "tags": ["elasticsearch", "bdd", "beginner"]
}

  1. The blog part will be repeated on each post, and so will the user part.
  2. We are only interested in the names of the tags, so we flatten them into a simple array on our post.

It can be very disturbing when you are new to Elasticsearch, but don’t let your DRY side design your models. 
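Since this article is about index mapping, here is a minimal sketch of what an explicit mapping for such an index could look like. The field types are assumptions (they depend on how you want to search each field), and Elasticsearch could also infer a mapping dynamically from the document above:

PUT denormalized-blog-posts
{
  "mappings": {
    "properties": {
      "id":           { "type": "keyword" },
      "title":        { "type": "text" },
      "post_content": { "type": "text" },
      "blog": {
        "properties": {
          "id":     { "type": "keyword" },
          "title":  { "type": "text" },
          "slogan": { "type": "text" },
          "user": {
            "properties": {
              "id":       { "type": "keyword" },
              "username": { "type": "keyword" }
            }
          }
        }
      },
      "tags": { "type": "keyword" }
    }
  }
}

Note that blog and blog.user are plain object fields here: they are flattened into the post document, not joined at query time.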

Special nesting data structures

More complex data structures exist in Elasticsearch. 

But they are much slower and more complex to handle, so don’t use them if you don’t know what you are doing. 

Nested Data Types

In our previous example, we added a ‘sub-object’ to our mapping.

In Elasticsearch, a nested field is an “array of objects” whose items can be queried independently of each other.

It can be really useful but it has some caveats.
We will probably write a dedicated article on this blog to describe it in detail. 

Going back to our example, we could have managed tags like this: 

{
  "some_usefull_key": "some useful content",
  "tags": [
    {
      "id": 45,
      "name": "elasticsearch"
    },
    {
      "id": 401,
      "name": "bdd"
    },
    {
      "id": 2,
      "name": "beginner"
    }
  ]
}
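If we want Elasticsearch to treat each tag object as an independent unit (so that a query combining id 45 with name "bdd" does not match this document by accident), the tags field has to be declared with the nested type. A minimal sketch, with an assumed index name and field types:

PUT nested-blog-posts
{
  "mappings": {
    "properties": {
      "tags": {
        "type": "nested",
        "properties": {
          "id":   { "type": "keyword" },
          "name": { "type": "keyword" }
        }
      }
    }
  }
}

Without "type": "nested", Elasticsearch flattens the array internally and loses the association between each id and its name.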

Parent Child (Join data type)

But didn’t you tell us that we cannot join indices? 

No, we still cannot.

In Elasticsearch, parents and children live in the same index. 

It is entirely possible to have completely different objects, with completely different fields, within the same index.

DON’T DO THIS!

This type exists for very specific use cases with a lot of child updates or, as the official documentation puts it, if your data contains a one-to-many relationship where one entity significantly outnumbers the other entity.

This data type is more complicated to handle and has very poor performance.

I will not detail parent/child here, and probably never will on this blog.
If you want to know more, you can read the official documentation here.

Caveats of Denormalization

Flattened data is easier to find quickly, and that should be your main purpose in using Elasticsearch. 

But of course, repeating data makes documents and indices bigger on disk.
This is usually not an issue in most search use cases. 
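If you want to keep an eye on that footprint, the _cat API shows the size of an index on disk (the column selection here is just a convenient subset):

GET _cat/indices/denormalized-blog-posts?v&h=index,docs.count,store.size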

Update denormalized documents

And you are probably already asking yourself (or shouting at me from behind your screen): what about updates? 
If an author’s name changes, do I have to update all the referring documents in my index? 

In one word… Yes.   

But Elastic offers several ways to do it easily. 

Bulk Update

First, you can construct a batch on the application side and post all the changes in a single bulk query. 

POST denormalized-blog-posts/_bulk
{ "update" : {"_id" : "cQ_G4nMBw1rncBcgZz2d"} }
{ "doc" : {"user.pseudo" : "my new very cool pseudo"} }
{ "update" : {"_id" : "Ae_Cr6MBw5rncNcgJzia"} }
{ "doc" : {"user.pseudo" : "my new very cool pseudo"} }

It will be far less expensive than updating each document with its own separate request. 
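For comparison, here is a sketch of the per-document equivalent, which you would have to send once for every post to update:

POST denormalized-blog-posts/_update/cQ_G4nMBw1rncBcgZz2d
{
  "doc": {
    "blog": { "user": { "username": "my new very cool username" } }
  }
}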

_update_by_query

This is the most powerful way to update a lot of data in a maintenance operation. 

  1. Create a search query that matches the data you want to update 
  2. Ask Elasticsearch to do the whole job itself:
POST denormalized-blog-posts/_update_by_query
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "user.id": 45
          }
        }
      ]
    }
  },
  "script": {
    "source": "ctx._source.['user']['pseudo'] = 'my new very cool pseudo';",
    "lang": "painless"
  }
}

The _update_by_query API is very powerful. You can parallelize, manage throttling, script complex conditions, and more. 
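For instance, parallelization and throttling are driven by query-string parameters (the values below are assumptions to tune for your own cluster):

POST denormalized-blog-posts/_update_by_query?slices=auto&requests_per_second=500&conflicts=proceed
{
  "query": {
    "term": {
      "blog.user.id": 1
    }
  },
  "script": {
    "source": "ctx._source['blog']['user']['username'] = 'my new very cool username';",
    "lang": "painless"
  }
}

slices=auto lets Elasticsearch split the work into one slice per shard, requests_per_second throttles the batches, and conflicts=proceed keeps the task running when it hits version conflicts.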

We will definitely write a dedicated post about it on this blog.


Spoon Consulting is a certified partner of Elastic

As a certified Elastic partner, Spoon Consulting offers high-level consulting for all kinds of companies.

Read more about your own Elasticsearch use case in Spoon Consulting’s posts.

Or contact Spoon Consulting now.