In previous posts Boolean compound queries, Full Text Query, Query term and Source filtering, Elasticsearch in filter context we performed search query and filter context and retrieved documents and specific fields of our interest. For analytics purpose we want result oriented response instead of verbose document. The man agenda of this post is to use Elasticsearch aggregation to execute analytical queries and get appropriate concise response.
Prerequisite :-
1. Elasticsearch instance is up and running.
2. Prepare dataset for analytical queries.
--> Remove [ and ] from file.
--> Using regex replace method replace "},{" with "}\n{"
--> Append index info before each payload: Find-Replace {"name" with {"index" : {}}\n{"name"
If you have processed file then Download cleaned and processed file.
4. Update json fie with new line at end of file.
5. Using "_bulk" API update all document with indices employee.
6. Validate 1000 documents are in place in index employees with type name personal
Aggregate query(metric aggregation)with search query : Below query finds average salary for all employee - either she is female employee or whose state value contains "Mississippi".
Elasticsearch stats aggregation - Wide range of statistics in one query. Below query uses "stats" aggregation to find various statistics on salary fields. it gives min, max, average and sum of salary in one go.
Number of Unique values of given filed (cardinality of field) : Below query find number of unique age value in entire document. "cardinality" aggregation gives number of unique values for age field.
Note:- Since Fielddata is disabled on text fields by default, if we replace field name as "gender", elastic search throws exception - "Fielddata is disabled on text fields by default. Set fielddata=true on [gender] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
What is "fielddata" in context of Elasticsearch(ELS) ?
In ELS Text field values are stored in an in-memory data structure called fielddata. fielddata is built on demand when a field is used for aggregations, sorting etc. Since fielddata on text fields take up lots of heap space so fielddata is disabled by default on text fields.
How to enable fielddata on textfield in Elasticsearch ?
Using "_mapping" API of elasticsearch we can enable fielddata on textfield. Below sample query enable fielddata on gender.
Once fielddata is enabled on gender we can execute query associated with cardinality aggregation with textfield . Below query display cardinality of gender as 2.
Bucket aggregations by field values : Below query execute term aggregation query and divide all documents in index employee into two category. All 1000 documents is split into two blocks (524 and 476) documents.
Range aggregation with key/value response: With <"keyed" : true> we indicate that response of this query will gives in key/value pair. Below query displays response as key/value pair.
Range aggregation with custom key value : Here we are passing key values in as input and same is displayed as key in output like "young", "quarter-aged", etc. instead of "30, "to" : 40" as displayed above.
Find average age of male and female within range of age (3 Level nesting): Outermost "aggs" specifies bucketing aggregation which divides documents into two bucket (male and Female). Second "aggs" split buckets into range of age and finally inner "aggs" find average age of each rage of age.
====******======
Prerequisite :-
1. Elasticsearch instance is up and running.
2. Prepare dataset for analytical queries.
- Open https://www.json-generator.com/ in browser.
- Copy and paste below sample template to generate compact form of JSON & save it in file "employee.json"
[ '{{repeat(1000, 1000)}}', { name: '{{firstName()}} {{surname()}}', age: '{{integer(18, 75)}}', salary:'{{integer(10000,80000)}}', gender: '{{gender()}}', email: '{{email()}}', phone: '+1 {{phone()}}', street: '{{integer(100, 999)}} {{street()}}', city: '{{city()}}', state: '{{state()}}, {{integer(100, 10000)}}' } ]
--> Remove [ and ] from file.
--> Using regex replace method replace "},{" with "}\n{"
--> Append index info before each payload: Find-Replace {"name" with {"index" : {}}\n{"name"
If you have processed file then Download cleaned and processed file.
4. Update json fie with new line at end of file.
5. Using "_bulk" API update all document with indices employee.
curl -H "Content-Type: application/x-ndjson" -XPOST 'localhost:9200/employees/personal/_bulk?pretty&refresh' --data-binary @"employees.json"
➜ Desktop curl -XGET 'localhost:9200/_cat/indices?v&pretty' health status index uuid pri rep docs.count docs.deleted store.size pri.store.size yellow open employee LU8xvoyMRwi-0o5K2JCyMg 5 1 100 0 86.1kb 86.1kb yellow open employees 3MQomR4CSRCLYYzywkZ9vg 5 1 1000 0 94.1kb 94.1kb yellow open customers KSbOq8eySwGgvJdH7VfQWQ 5 1 1000 0 485.1kb 485.1kb yellow open products Lf8I7-H1QPeU6DrDyHWb1A 5 1 8 0 17.5kb 17.5kb
Metrics aggregations
Find average salary of all employees: Below query is an example of aggregation query. It finds average salary of all employee. Here "aggs" indicates that query is of type aggregation, "avg_age" is filed name where result is assigned and "avg" is aggregate functions to compute average. Finally, field name indicates that - average is computed on which field. Since aggregation query is not interested in document so size= 0.➜ Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d' { "size" : 0, "aggs" : { "avg_age" : { "avg" : { "field" : "salary" } } } } ' -H 'Content-Type: application/json' { "took" : 17, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 1000, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "avg_age" : { "value" : 44966.073 } } }
Aggregate query(metric aggregation)with search query : Below query finds average salary for all employee - either she is female employee or whose state value contains "Mississippi".
➜ Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d' { "size" : 0, "query" : { "bool" : { "should": [ { "match": { "state": "Mississippi" } }, { "match": { "gender": "female" } } ] } }, "aggs" : { "avg_age" : { "avg" : { "field" : "salary" } } } } ' -H 'Content-Type: application/json' { "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 483, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "avg_age" : { "value" : 44222.351966873706 } } }
Elasticsearch stats aggregation - Wide range of statistics in one query. Below query uses "stats" aggregation to find various statistics on salary fields. it gives min, max, average and sum of salary in one go.
➜ Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d' { "size" : 0, "aggs" : { "age_stats" : { "stats" : { "field" : "salary" } } } } ' -H 'Content-Type: application/json' { "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 1000, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "age_stats" : { "count" : 1000, "min" : 10026.0, "max" : 79968.0, "avg" : 44966.073, "sum" : 4.4966073E7 } } }
Number of Unique values of given filed (cardinality of field) : Below query find number of unique age value in entire document. "cardinality" aggregation gives number of unique values for age field.
➜ Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d' { "size" : 0, "aggs" : { "age_count" : { "cardinality" : { "field" : "age" } } } } ' -H 'Content-Type: application/json' { "took" : 18, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 1000, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "age_count" : { "value" : 58 } } }
What is "fielddata" in context of Elasticsearch(ELS) ?
In ELS Text field values are stored in an in-memory data structure called fielddata. fielddata is built on demand when a field is used for aggregations, sorting etc. Since fielddata on text fields take up lots of heap space so fielddata is disabled by default on text fields.
How to enable fielddata on textfield in Elasticsearch ?
Using "_mapping" API of elasticsearch we can enable fielddata on textfield. Below sample query enable fielddata on gender.
➜ Desktop curl -XPUT 'localhost:9200/employees/_mapping/personal?pretty' -d' { "properties": { "gender": { "type": "text", "fielddata": true } } } ' -H 'Content-Type: application/json' { "acknowledged" : true }
Once fielddata is enabled on gender we can execute query associated with cardinality aggregation with textfield . Below query display cardinality of gender as 2.
➜ Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d' { "size" : 0, "aggs" : { "age_count" : { "cardinality" : { "field" : "gender" } } } } ' -H 'Content-Type: application/json' { "took" : 23, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 1000, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "age_count" : { "value" : 2 } } }
Bucketing aggregations
In elastic search each document is indexed and associated with some type name(logical grouping). For analytics purpose, we can logically group these indexed documents into buckets and each bucket satisfies some criterion, it is termed as Bucketing aggregation. Bucketing aggregation is perfomed using "_search" API.Bucket aggregations by field values : Below query execute term aggregation query and divide all documents in index employee into two category. All 1000 documents is split into two blocks (524 and 476) documents.
➜ Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d' { "size" : 0, "aggs" : { "gender_bucket" : { "terms" : { "field" : "gender" } } } } ' -H 'Content-Type: application/json' { "took" : 16, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 1000, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "gender_bucket" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "male", "doc_count" : 524 }, { "key" : "female", "doc_count" : 476 } ] } } }
Range aggregation with key/value response: With <"keyed" : true> we indicate that response of this query will gives in key/value pair. Below query displays response as key/value pair.
➜ Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d' { "size" : 0, "aggs" : { "age_ranges" : { "range" : { "field" : "age", "keyed" : true, "ranges" : [ { "to" : 30 }, { "from" : 30, "to" : 40 }, { "from" : 40, "to" : 55 }, { "from" : 55 } ] } } } } ' -H 'Content-Type: application/json' { "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 1000, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "age_ranges" : { "buckets" : { "*-30.0" : { "to" : 30.0, "doc_count" : 216 }, "30.0-40.0" : { "from" : 30.0, "to" : 40.0, "doc_count" : 174 }, "40.0-55.0" : { "from" : 40.0, "to" : 55.0, "doc_count" : 248 }, "55.0-*" : { "from" : 55.0, "doc_count" : 362 } } } } }
➜ Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d' { "size" : 0, "aggs" : { "age_ranges" : { "range" : { "field" : "age", "keyed" : true, "ranges" : [ { "key": "young", "to" : 35 }, { "key": "quarter-aged", "from" : 35, "to" : 45 }, { "key": "middle-aged", "from" : 45, "to" : 65 }, { "key": "senior", "from" : 55 } ] } } } } ' -H 'Content-Type: application/json' { "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 1000, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "age_ranges" : { "buckets" : { "young" : { "to" : 35.0, "doc_count" : 302 }, "quarter-aged" : { "from" : 35.0, "to" : 45.0, "doc_count" : 169 }, "middle-aged" : { "from" : 45.0, "to" : 65.0, "doc_count" : 324 }, "senior" : { "from" : 55.0, "doc_count" : 362 } } } } }
Nested aggregation - Multi-level aggregation
Metric aggregation inside Bucketing aggregation (Two level nesting) : Find average age of each gender. Below query first perform bucketing aggregation and then average age is computed. Outer "aggs" keyword specifies a gender bucket and inner "aggs" does average computation.➜ Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d' { "size" : 0, "aggs" : { "gender_bucket" : { "terms" : { "field" : "gender" }, "aggs": { "average_age": { "avg": { "field": "age" } } } } } } ' -H 'Content-Type: application/json' { "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 1000, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "gender_bucket" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "male", "doc_count" : 524, "average_age" : { "value" : 47.333969465648856 } }, { "key" : "female", "doc_count" : 476, "average_age" : { "value" : 45.71848739495798 } } ] } } }
Find average age of male and female within range of age (3 Level nesting): Outermost "aggs" specifies bucketing aggregation which divides documents into two bucket (male and Female). Second "aggs" split buckets into range of age and finally inner "aggs" find average age of each rage of age.
➜ Desktop curl -XPOST 'localhost:9200/employees/_search?&pretty' -d' { "size" : 0, "aggs" : { "gender_bucket" : { "terms" : { "field" : "gender" }, "aggs" : { "age_ranges" : { "range" : { "field" : "age", "keyed" : true, "ranges" : [ { "key": "young", "to" : 35 }, { "key": "middle-aged", "from" : 35, "to" : 50 }, { "key": "senior", "from" : 55 } ] }, "aggs": { "average_age": { "avg": { "field": "age" } } } } } } } } ' -H 'Content-Type: application/json' { "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 1000, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "gender_bucket" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "male", "doc_count" : 524, "age_ranges" : { "buckets" : { "young" : { "to" : 35.0, "doc_count" : 145, "average_age" : { "value" : 26.048275862068966 } }, "middle-aged" : { "from" : 35.0, "to" : 50.0, "doc_count" : 135, "average_age" : { "value" : 42.19259259259259 } }, "senior" : { "from" : 55.0, "doc_count" : 192, "average_age" : { "value" : 65.734375 } } } } }, { "key" : "female", "doc_count" : 476, "age_ranges" : { "buckets" : { "young" : { "to" : 35.0, "doc_count" : 157, "average_age" : { "value" : 25.840764331210192 } }, "middle-aged" : { "from" : 35.0, "to" : 50.0, "doc_count" : 112, "average_age" : { "value" : 41.517857142857146 } }, "senior" : { "from" : 55.0, "doc_count" : 170, "average_age" : { "value" : 65.5 } } } } } ] } } }
Filter aggregation
Find average salary of all employees who belongs to minnesota. It to also an nested aggregation where first filtering is applied followed by aggregation is applied to find average salary.➜ Desktop curl -XPOST 'localhost:9200/employees/_search?size=0&pretty' -d' { "aggs" : { "state" : { "filter" : { "term": { "state": "minnesota" } }, "aggs" : { "avg_age" : { "avg" : { "field" : "salary" } } } } } } ' -H 'Content-Type: application/json' { "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 1000, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "state" : { "doc_count" : 17, "avg_age" : { "value" : 43855.35294117647 } } } }
====******======
Tags:
Elasticsearch
Đặt vé tại phòng vé Aivivu, tham khảo
ReplyDeletevé máy bay đi Mỹ khứ hồi
vé máy bay từ los về việt nam
vé máy bay từ canada về việt nam bao nhiêu tiền
giá vé máy bay nhật việt vietnam airline
khi nào mở lại đường bay hàn quốc
Vé máy bay từ Đài Loan về VN
khách sạn cách ly hồ chí minh