Scoring TF/IDF with Elasticsearch

10/05/2020


Relevance scoring is arguably the most game-changing difference between SQL databases and Elasticsearch.
It works well by default for most use cases, but sooner or later you will want to know more about it.

TL;DR

Elasticsearch returns results ordered by relevance score
The score is calculated with the TF/IDF algorithm
Scoring is very powerful and customizable

Order by relevance score

In classical databases (SQL or NoSQL), a query returns results ordered by a value. By default this is often the ID in descending order, but it can be whatever you need (publication date, rating, alphabetical order, etc.).

Elasticsearch can do the same, of course (and with the recent addition of index sorting, it can do so very efficiently).

But it’s not the default behavior.

Let’s take a simple example with a search on this blog:

GET /_search?filter_path=hits.hits
{
  "query": {
    "multi_match": {
      "query": "elastic", # look for the word "elastic"
      "fields": ["post_title", "post_content_filtered", "terms.category.name"] # in each of these fields
    }
  },
  "_source": ["post_title"] # return only the title
}

The response will be:

{
  "hits" : {
    "hits" : [
      {
        "_score" : 1.640074,
        "_source" : {
          "post_title" : "More Like This query (MLT)  - Suggest similar content with Elasticsearch"
        }
      },
      {
        "_score" : 1.6229129,
        "_source" : {
          "post_title" : "Paginating term aggregation"
        }
      },
      {
        "_score" : 1.58848,
        "_source" : {
          "post_title" : "Kibana Map on the Elastic stack 7.6+"
        }
      },
…
]

As you can see, for each “match” of my query, Elasticsearch computes a score and returns documents ordered with the best score first.

This can be defined as a relevance score.

For each searched field, Elasticsearch calculates a score based on the TF/IDF algorithm and the length of the field, then combines the per-field scores into the document’s relevance score.

Let’s dig into this.

Elastic’s TF/IDF scoring algorithm

Let’s begin with a simple explanation.

Three main factors are taken into account:

Term frequency (TF): the more often the search term appears in a field, the more relevant that field is.
Inverse document frequency (IDF): the more documents in the index contain the search term, the less that term counts.
Let’s take our example: this blog is about Elasticsearch. If a user searches for “Elasticsearch scoring”, almost every post on the site mentions Elasticsearch, so that word says little about which post is the best match. “Scoring”, on the other hand, sets posts apart: because it appears in fewer documents, “scoring” weighs more than “Elasticsearch” in the context of spoon-elastic’s content.
Field length: the shorter a field is, the more a match in it counts. If the words “Elasticsearch scoring” are found in a 100-character title, the document gets a better score than if they are found somewhere in the full body of a post.
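To make the IDF idea concrete, here is a minimal Python sketch. The four-post corpus is invented for illustration, and the `idf` helper is a hypothetical name that simply applies the classic formula detailed further down (1 + ln(maxDocs / (docFreq + 1))). It is not Elasticsearch's actual code.

```python
import math

# Hypothetical mini-corpus: every post mentions "elasticsearch",
# only one of them mentions "scoring".
posts = [
    "elasticsearch scoring explained",
    "elasticsearch index sorting",
    "elasticsearch term aggregation",
    "elasticsearch kibana maps",
]

def idf(term, docs):
    """Classic Lucene-style IDF: 1 + ln(maxDocs / (docFreq + 1))."""
    doc_freq = sum(1 for doc in docs if term in doc.split())
    return 1 + math.log(len(docs) / (doc_freq + 1))

idf_elasticsearch = idf("elasticsearch", posts)
idf_scoring = idf("scoring", posts)

print(f"idf(elasticsearch) = {idf_elasticsearch:.3f}")
print(f"idf(scoring)       = {idf_scoring:.3f}")
```

The term that appears everywhere gets the lower weight, the rare one gets the higher weight, which is exactly the behavior described above.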

The detailed function is:

score(q,d) =
    queryNorm(q)
  · coord(q,d)
  · ∑ (
        tf(t in d)     # tf = sqrt(termFreq)
      · idf(t)²        # idf = 1 + ln(maxDocs / (docFreq + 1))
      · t.getBoost()   # query-time boost applied on the field
      · norm(t,d)      # 1 / sqrt(numFieldTerms)
    ) (t in q)

score(q,d) is the relevance score of document d for query q
queryNorm(q) is the query normalization factor
coord(q,d) is the coordination factor
The sum of the weights for each term t in the query q for document d
tf(t in d) is the term frequency for term t in document d
idf(t) is the inverse document frequency for term t
t.getBoost() is the boost that has been applied to the query
norm(t,d) is the field-length norm, combined with the index-time field-level boost, if any
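As a rough illustration of how these factors combine, the sketch below implements the sum in Python for a single field. It is a simplification under stated assumptions: `queryNorm` and `coord` are treated as 1, the document statistics are made up, and `classic_score` is a hypothetical helper, not an Elasticsearch or Lucene API.

```python
import math

def tf(term_freq):
    # Term frequency factor: sqrt(termFreq)
    return math.sqrt(term_freq)

def idf(doc_freq, max_docs):
    # Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))
    return 1 + math.log(max_docs / (doc_freq + 1))

def norm(num_field_terms):
    # Field-length norm: 1 / sqrt(numFieldTerms)
    return 1 / math.sqrt(num_field_terms)

def classic_score(query_terms, field_terms, doc_freqs, max_docs, boost=1.0):
    """Sum of tf * idf^2 * boost * norm over the query terms found in the field.

    queryNorm and coord are left out (treated as 1) to keep the sketch short.
    """
    total = 0.0
    for term in query_terms:
        freq = field_terms.count(term)
        if freq == 0:
            continue  # a term that is absent from the field adds nothing
        total += (
            tf(freq)
            * idf(doc_freqs[term], max_docs) ** 2
            * boost
            * norm(len(field_terms))
        )
    return total

# Hypothetical statistics: 10 documents, "elasticsearch" in 9 of them, "scoring" in 2.
doc_freqs = {"elasticsearch": 9, "scoring": 2}
query = ["elasticsearch", "scoring"]

# The same three words, once as a short title and once buried in a longer body.
title = "scoring with elasticsearch".split()
body = ("scoring with elasticsearch " + "some other words " * 20).split()

title_score = classic_score(query, title, doc_freqs, max_docs=10)
body_score = classic_score(query, body, doc_freqs, max_docs=10)
print(f"title: {title_score:.3f}  body: {body_score:.3f}")
```

With identical matches, the short title scores higher than the long body because of the field-length norm, and most of the score comes from the rarer term “scoring”.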

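If you want to know more, all the details of the “practical scoring function” are in the official Elasticsearch documentation. I promise it’s nearly the same thing as the simplified version in the first part. Note that this is the classic TF/IDF formula. Since version 5.0, Elasticsearch scores with BM25 by default, a TF/IDF variant built on the same three factors, which is why the explain output further down shows slightly different formulas for tf and idf.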

This algorithm is very powerful out of the box.

Your titles will carry more weight than your content, and an article that mentions your query term 10 times will be more relevant than a post that mentions it only once.

Of course, you can build on these features by tweaking the weight of the parameters, or even by writing your own scoring function. But tweaking the scoring will be the subject of another post.

To understand how a document’s score is calculated for a match, we can use the _explain API with a search on the title of spoon-elastic’s blog posts:

# <index> is a placeholder for the index that holds the posts
GET /<index>/_explain/4019
{
  "query": {
    "match": {
      "post_title": "elastic"
    }
  }
}

The result looks like this:

{
  "_id" : "4019",
  "matched" : true,
  "explanation" : {
    "value" : 1.58848,
    "description" : "weight(post_title:elast in 0) [PerFieldSimilarity], result of:",
    "details" : [
      {
        "value" : 1.58848,
        "description" : "score(freq=1.0), computed as boost * idf * tf from:",
        "details" : [
          {
            "value" : 2.2,
            "description" : "boost",
            "details" : [ ]
          },
          {
            "value" : 1.9924302,
            "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
            "details" : [
              {
                "value" : 1,
                "description" : "n, number of documents containing term"
              },
              {
                "value" : 10,
                "description" : "N, total number of documents with field"
              }
            ]
          },
          {
            "value" : 0.3623898,
            "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
            "details" : [
              {
                "value" : 1.0, # "elastic" occurs once in this document
                "description" : "freq, occurrences of term within document"
              },
              {
                "value" : 1.2, # Default Value
                "description" : "k1, term saturation parameter"
              },
              {
                "value" : 0.75,  # Default Value
                "description" : "b, length normalization parameter"
              },
              {
                "value" : 6.0, # 6 words in the title field
                "description" : "dl, length of field"
              },
              {
                "value" : 3.7, # Average word count in post_title field in the shard
                "description" : "avgdl, average length of field"
              }
            ]
          }
        ]
      }
    ]
  }
}

As the description in the explain output says, the score for this document is:
boost * idf * tf = 2.2 * 1.9924302 * 0.3623898 = 1.58848
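The arithmetic can be checked with a few lines of Python. This sketch only plugs the values from the explain output above into the formulas printed in its `description` fields (the BM25 formulas Elasticsearch uses by default). The variable names are my own, and the last digits may differ slightly because Elasticsearch reports single-precision floats.

```python
import math

# Values copied from the explain output above.
boost = 2.2
n, N = 1, 10            # documents containing the term, documents with the field
freq = 1.0              # occurrences of the term in this title
k1, b = 1.2, 0.75       # default values reported by the explain output
dl, avgdl = 6.0, 3.7    # length of this field, average length of the field

# idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
idf = math.log(1 + (N - n + 0.5) / (n + 0.5))

# tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))

# score, computed as boost * idf * tf
score = boost * idf * tf

print(f"idf={idf:.7f} tf={tf:.7f} score={score:.5f}")
```

Changing `dl` in this sketch is a quick way to see the field-length effect: a longer title lowers `tf`, and the score with it.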

