Druid Query Best Practices

Introduction#

Druid is a distributed data processing system that supports real-time multidimensional analysis. It supports high-speed real-time data ingestion and processing, as well as real-time and flexible multidimensional data analysis queries. It also supports pre-aggregation ingestion and aggregate analysis based on timestamps.

Term	Meaning
Timestamp column	The column in the data that contains timestamps, and all queries are centered around time.
Dimension column	The column in the data used for filtering.
Metric column	The aggregated column in the data, supports calculations such as sum, count, mix, and max.
Aggregation	The process of merging data based on timestamp column, dimension column, metric column, and aggregation granularity.
Aggregation granularity	The length of time that data is merged into one record.

Best Practices#

To fit the characteristics of Druid and avoid inefficient operations that can cause slow queries or even timeouts, follow these practices:

Increase pre-aggregation ratio
- Set the aggregation granularity to minutes.
- Remove unnecessary fields, and try to avoid including detailed fields.
- If one data source cannot satisfy all queries, split it into multiple data sources with different dimension columns.
- Set fields that require count distinct as hyperUnique metrics.
Avoid complex row-level calculations in SQL, and utilize pre-calculated fields as much as possible.
Avoid combining group by with count distinct on high cardinality fields.
- First, avoid grouping by high cardinality fields (fields with more than tens of thousands of values).
- Both group by and count distinct use heap storage for intermediate structures. Combining them can easily lead to query timeouts or exceeding resource limits. More seriously, it can cause Druid to perform full garbage collection operations, affecting the entire cluster.
- Druid also caches successful queries, so similar SQL queries (with slightly different time ranges but identical other conditions) can hit the cache.
Place filtering conditions in the outermost layer of the where clause.
Use the correct types for queries, and recommend using = or in to convert range filters to equality filters.
Avoid using like or regular expressions to extract large texts. It is recommended to extract them into separate fields when producing data upstream to reduce storage and query costs.
Use group by Floor(__time to DAY) instead of group by __time to specify a specific time granularity. The efficiency difference between the two is significant.