In Elasticsearch, paginating aggregation results is a recurring need.
By default, Elastic returns all the buckets of your aggregation. While a query filter is often enough, this is not always the desired behavior.
A first possibility is to greatly increase the size parameter and paginate on the front-end side.
It can be a good solution… for a few hundred results and a low cardinality.
But if we don’t want to crash our app, we can probably do better.
Depending on your specific use case you will have several choices:
Bucket Sort aggregation
Elasticsearch supports the bucket sort aggregation in v6.1 and later. It accepts "sort", "size" and "from" parameters applied to the buckets of the parent aggregation.
Example from the official documentation:
POST /sales/_search
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_sales": {
          "sum": {
            "field": "price"
          }
        },
        "sales_bucket_sort": {
          "bucket_sort": {
            "sort": [
              { "total_sales": { "order": "desc" } }
            ],
            "size": 3
          }
        }
      }
    }
  }
}
But there is no big performance gain, because this is a pipeline aggregation: it is applied to the buckets of the parent aggregation after they have all been computed. So you still have to set a large size on the parent aggregation (and pay the cost of computing it).
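To paginate with this approach, you combine "from" and "size" inside the bucket_sort. Here is a small Python helper (hypothetical, not from the official docs) that builds the request body for an arbitrary page of the sales example above:

```python
def bucket_sort_page(page, page_size):
    """Build a request body returning one page of monthly sales buckets.

    Note: the parent date_histogram still computes every bucket;
    bucket_sort only trims the response, hence the limited performance gain.
    """
    return {
        "size": 0,
        "aggs": {
            "sales_per_month": {
                "date_histogram": {
                    "field": "date",
                    "calendar_interval": "month",
                },
                "aggs": {
                    "total_sales": {"sum": {"field": "price"}},
                    "sales_bucket_sort": {
                        "bucket_sort": {
                            "sort": [{"total_sales": {"order": "desc"}}],
                            # skip the pages already served
                            "from": page * page_size,
                            "size": page_size,
                        }
                    },
                },
            }
        },
    }
```

For instance, `bucket_sort_page(2, 3)` produces a body with `"from": 6` and `"size": 3`, i.e. the third page of three buckets.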
Easiness: 5
Performance: 2.5
Capabilities: 1
OK for low to medium cardinality, but everything still has to be computed, so there is no real performance improvement.
Partitions aggregations
Partitioning an aggregation is more interesting: it truly divides the aggregation into chunks that can be requested separately.
- Use the cardinality aggregation to estimate the total number of unique values
- Pick a value for num_partitions to break that number into more manageable chunks
- Pick a size value for the number of responses we want from each partition
- Run a test request
GET /_search
{
  "size": 0,
  "aggs": {
    "expired_sessions": {
      "terms": {
        "field": "account_id",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 10000,
        "order": {
          "last_access": "asc"
        }
      },
      "aggs": {
        "last_access": {
          "max": {
            "field": "access_date"
          }
        }
      }
    }
  }
}
Far better: you get real server-side pagination.
But it is only applicable to the simplest aggregations.
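The steps above can be sketched in Python. This hypothetical helper (the function name and page size are assumptions) derives num_partitions from a cardinality estimate and yields one request body per partition:

```python
import math

def partition_requests(field, cardinality, page_size=10000):
    """Yield one terms-aggregation request body per partition.

    `cardinality` would come from a prior cardinality aggregation on `field`.
    num_partitions is chosen so each partition fits within `page_size` terms.
    """
    num_partitions = max(1, math.ceil(cardinality / page_size))
    for partition in range(num_partitions):
        yield {
            "size": 0,
            "aggs": {
                "expired_sessions": {
                    "terms": {
                        "field": field,
                        # each request covers a disjoint slice of the terms
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "size": page_size,
                    },
                    "aggs": {
                        "last_access": {"max": {"field": "access_date"}}
                    },
                }
            },
        }
```

With an estimated cardinality of 25,000 and pages of 10,000 terms, this yields three request bodies, for partitions 0, 1 and 2.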
Easiness: 3
Performance: 4
Capabilities: 3
Composite aggregation
This is the most powerful option, but it introduces more complexity and supports only a few aggregation sources (terms, histograms and geotile_grid).
It is not a pipeline aggregation but a multi-bucket aggregation, which means it also gives you real server-side pagination.
Basically, it works like search_after:
GET /_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 2,
        "sources": [
          { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } },
          { "product": { "terms": { "field": "product", "order": "asc" } } }
        ],
        "after": { "date": 1494288000000, "product": "mad max" }
      }
    }
  }
}
The response contains an after_key object that can be passed as-is in the after parameter of the next request, paginating over several aggregation sources at once:
"after_key" : {
"date" : 1594080000000,
"product" : "AE"
}
Easiness: 2
Performance: 4
Capabilities: 4
With recent versions of Elasticsearch, you should be able to handle all your pagination use cases.
Spoon consulting is a certified partner of Elastic
As a certified partner of the Elastic company, Spoon Consulting offers high-level consulting for all kinds of companies.
Read more about your own Elasticsearch use case in Spoon Consulting’s posts.