Introduction#
Druid is a distributed data processing system that supports real-time multidimensional analysis. It supports high-speed real-time data ingestion and processing, as well as real-time and flexible multidimensional data analysis queries. It also supports pre-aggregation ingestion and aggregate analysis based on timestamps.
| Term | Meaning |
|---|---|
| Timestamp column | The column in the data that contains timestamps, and all queries are centered around time. |
| Dimension column | The column in the data used for filtering. |
| Metric column | The aggregated column in the data, supports calculations such as sum, count, mix, and max. |
| Aggregation | The process of merging data based on timestamp column, dimension column, metric column, and aggregation granularity. |
| Aggregation granularity | The length of time that data is merged into one record. |
Best Practices#
To fit the characteristics of Druid and avoid inefficient operations that can cause slow queries or even timeouts, follow these practices:
- Increase pre-aggregation ratio
- Set the aggregation granularity to minutes.
- Remove unnecessary fields, and try to avoid including detailed fields.
- If one data source cannot satisfy all queries, split it into multiple data sources with different dimension columns.
- Set fields that require
count distinctashyperUniquemetrics.
- Avoid complex row-level calculations in
SQL, and utilize pre-calculated fields as much as possible. - Avoid combining
group bywithcount distincton high cardinality fields.- First, avoid grouping by high cardinality fields (fields with more than tens of thousands of values).
- Both
group byandcount distinctuseheapstorage for intermediate structures. Combining them can easily lead to query timeouts or exceeding resource limits. More seriously, it can causeDruidto perform full garbage collection operations, affecting the entire cluster. Druidalso caches successful queries, so similarSQLqueries (with slightly different time ranges but identical other conditions) can hit the cache.
- Place filtering conditions in the outermost layer of the
whereclause. - Use the correct types for queries, and recommend using
=orinto convert range filters to equality filters. - Avoid using
likeor regular expressions to extract large texts. It is recommended to extract them into separate fields when producing data upstream to reduce storage and query costs. - Use
group by Floor(__time to DAY)instead ofgroup by __timeto specify a specific time granularity. The efficiency difference between the two is significant.