jeremygo

jeremygo

我是把下一颗珍珠串在绳子上的人

Druid Query Best Practices

Introduction#

Druid is a distributed data processing system that supports real-time multidimensional analysis. It supports high-speed real-time data ingestion and processing, as well as real-time and flexible multidimensional data analysis queries. It also supports pre-aggregation ingestion and aggregate analysis based on timestamps.

TermMeaning
Timestamp columnThe column in the data that contains timestamps, and all queries are centered around time.
Dimension columnThe column in the data used for filtering.
Metric columnThe aggregated column in the data, supports calculations such as sum, count, mix, and max.
AggregationThe process of merging data based on timestamp column, dimension column, metric column, and aggregation granularity.
Aggregation granularityThe length of time that data is merged into one record.

image

Best Practices#

To fit the characteristics of Druid and avoid inefficient operations that can cause slow queries or even timeouts, follow these practices:

  • Increase pre-aggregation ratio
    • Set the aggregation granularity to minutes.
    • Remove unnecessary fields, and try to avoid including detailed fields.
    • If one data source cannot satisfy all queries, split it into multiple data sources with different dimension columns.
    • Set fields that require count distinct as hyperUnique metrics.
  • Avoid complex row-level calculations in SQL, and utilize pre-calculated fields as much as possible.
  • Avoid combining group by with count distinct on high cardinality fields.
    • First, avoid grouping by high cardinality fields (fields with more than tens of thousands of values).
    • Both group by and count distinct use heap storage for intermediate structures. Combining them can easily lead to query timeouts or exceeding resource limits. More seriously, it can cause Druid to perform full garbage collection operations, affecting the entire cluster.
    • Druid also caches successful queries, so similar SQL queries (with slightly different time ranges but identical other conditions) can hit the cache.
  • Place filtering conditions in the outermost layer of the where clause.
  • Use the correct types for queries, and recommend using = or in to convert range filters to equality filters.
  • Avoid using like or regular expressions to extract large texts. It is recommended to extract them into separate fields when producing data upstream to reduce storage and query costs.
  • Use group by Floor(__time to DAY) instead of group by __time to specify a specific time granularity. The efficiency difference between the two is significant.
Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.