Introduction#
Druid
is a distributed data processing system that supports real-time multidimensional analysis. It supports high-speed real-time data ingestion and processing, as well as real-time and flexible multidimensional data analysis queries. It also supports pre-aggregation ingestion and aggregate analysis based on timestamps.
Term | Meaning |
---|---|
Timestamp column | The column in the data that contains timestamps, and all queries are centered around time. |
Dimension column | The column in the data used for filtering. |
Metric column | The aggregated column in the data, supports calculations such as sum, count, mix, and max. |
Aggregation | The process of merging data based on timestamp column, dimension column, metric column, and aggregation granularity. |
Aggregation granularity | The length of time that data is merged into one record. |
Best Practices#
To fit the characteristics of Druid
and avoid inefficient operations that can cause slow queries or even timeouts, follow these practices:
- Increase pre-aggregation ratio
- Set the aggregation granularity to minutes.
- Remove unnecessary fields, and try to avoid including detailed fields.
- If one data source cannot satisfy all queries, split it into multiple data sources with different dimension columns.
- Set fields that require
count distinct
ashyperUnique
metrics.
- Avoid complex row-level calculations in
SQL
, and utilize pre-calculated fields as much as possible. - Avoid combining
group by
withcount distinct
on high cardinality fields.- First, avoid grouping by high cardinality fields (fields with more than tens of thousands of values).
- Both
group by
andcount distinct
useheap
storage for intermediate structures. Combining them can easily lead to query timeouts or exceeding resource limits. More seriously, it can causeDruid
to perform full garbage collection operations, affecting the entire cluster. Druid
also caches successful queries, so similarSQL
queries (with slightly different time ranges but identical other conditions) can hit the cache.
- Place filtering conditions in the outermost layer of the
where
clause. - Use the correct types for queries, and recommend using
=
orin
to convert range filters to equality filters. - Avoid using
like
or regular expressions to extract large texts. It is recommended to extract them into separate fields when producing data upstream to reduce storage and query costs. - Use
group by Floor(__time to DAY)
instead ofgroup by __time
to specify a specific time granularity. The efficiency difference between the two is significant.