[metrics] Support DDSketch in the parquet pipeline #6257
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 727f085864
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs
I cannot review this PR due to lack of context. I know what DDSketches are, but I do not know what they are used for in the context of the metrics ingestion pipeline, why they are stored in different files, etc.
💡 Codex Review
Reviewed commit: 6e7d6a90ae
quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs
let keys_inner = keys_builder.values();
for &k in &dp.keys {
    keys_inner.append_value(k);
Where do we store the length of keys/counts? I.e. how do we recover the number of elements per data_point from _inner arrays?
They are stored implicitly by Arrow: it keeps an array of offsets, so the length of row i is offsets[i+1] - offsets[i].
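To make the offsets encoding concrete, here is a minimal Rust sketch (no Arrow dependency; the real layout lives in arrow-rs `ListArray`/`ListBuilder`) of how a flat child buffer plus an offsets array recovers per-row lengths:

```rust
fn main() {
    // Three list rows: [1, 2], [3], [4, 5, 6], stored as one flat
    // values buffer plus an offsets array of length rows + 1.
    let values: Vec<i32> = vec![1, 2, 3, 4, 5, 6];
    let offsets: Vec<usize> = vec![0, 2, 3, 6];

    for i in 0..offsets.len() - 1 {
        // The length of row i is offsets[i + 1] - offsets[i].
        let row = &values[offsets[i]..offsets[i + 1]];
        println!("row {i}: len={} values={row:?}", row.len());
    }
}
```

The same scheme applies to both the `keys` and `counts` child arrays, so no explicit length column is needed.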
💡 Codex Review
Reviewed commit: 37f4298f66
quickwit/quickwit-metastore/migrations/postgresql/28_create-sketch-splits.up.sql
💡 Codex Review
Reviewed commit: 9a38c47b7c
- split_metadata: Option<&MetricsSplitMetadata>,
+ split_metadata: Option<&ParquetSplitMetadata>,
  ) -> Result<(RecordBatch, WriterProperties), ParquetWriteError> {
      validate_required_fields(&batch.schema())
Validate sketch batches with sketch schema before writing
prepare_write always calls validate_required_fields, which enforces the metrics schema (metric_type/value) even when the split metadata is Sketches. In the new sketch pipeline, sketch batches reach this writer with fields like count/sum/min/max/keys/counts, so writes fail with SchemaValidation and sketch splits cannot be produced or uploaded.
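One possible shape of a fix (a hypothetical sketch only; `SplitKind`, `required_fields`, and the field lists are illustrative names inferred from the review comment, not quickwit's actual types) is to dispatch validation on the split kind recorded in the metadata instead of always enforcing the metrics schema:

```rust
// Illustrative only: branch schema validation on the split kind so that
// sketch batches (count/sum/min/max/keys/counts) are not rejected by the
// metrics-schema check (metric_type/value).
#[derive(Clone, Copy, PartialEq)]
enum SplitKind {
    Metrics,
    Sketches,
}

fn required_fields(kind: SplitKind) -> &'static [&'static str] {
    match kind {
        SplitKind::Metrics => &["metric_type", "value"],
        SplitKind::Sketches => &["count", "sum", "min", "max", "keys", "counts"],
    }
}

fn validate_required_fields(schema_fields: &[&str], kind: SplitKind) -> Result<(), String> {
    for required in required_fields(kind) {
        if !schema_fields.contains(required) {
            return Err(format!("missing required field `{required}`"));
        }
    }
    Ok(())
}

fn main() {
    let sketch_fields = ["count", "sum", "min", "max", "keys", "counts"];
    // A sketch batch passes the sketch check but fails the metrics check.
    assert!(validate_required_fields(&sketch_fields, SplitKind::Sketches).is_ok());
    assert!(validate_required_fields(&sketch_fields, SplitKind::Metrics).is_err());
    println!("validation dispatch ok");
}
```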
  const DELETE_SPLITS_QUERY: &str = r#"
  WITH input_splits AS (
-     SELECT input_splits.split_id, metrics_splits.split_state
+     SELECT input_splits.split_id, {table_name}.split_state
Interpolate parquet table name in delete query
delete_parquet_splits_impl constructs DELETE_SPLITS_QUERY as a raw string containing {table_name} placeholders but executes it directly, so Postgres sees literal {table_name} tokens instead of metrics_splits/sketch_splits. This makes parquet split deletion fail at runtime for both delete RPCs on Postgres.
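A minimal fix sketch (hypothetical helper name and abbreviated SQL; quickwit's real query is longer) is to substitute the placeholder before executing, validating the table name against an allowlist since a table name cannot be passed as a bind parameter:

```rust
// Illustrative only: the {table_name} placeholder must be replaced before
// the query string reaches Postgres, otherwise Postgres sees the literal
// token. The allowlist keeps the interpolation injection-safe.
const DELETE_SPLITS_QUERY: &str = r#"
    WITH input_splits AS (
        SELECT input_splits.split_id, {table_name}.split_state
        FROM {table_name}
    )
    DELETE FROM {table_name} USING input_splits
"#;

fn build_delete_query(table_name: &str) -> Result<String, String> {
    match table_name {
        "metrics_splits" | "sketch_splits" => {
            Ok(DELETE_SPLITS_QUERY.replace("{table_name}", table_name))
        }
        other => Err(format!("unexpected parquet split table: {other}")),
    }
}

fn main() {
    let query = build_delete_query("sketch_splits").unwrap();
    assert!(query.contains("sketch_splits.split_state"));
    assert!(!query.contains("{table_name}"));
    println!("query ok");
}
```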
Description
This PR can be reviewed commit by commit.
This PR updates the parquet pipeline to process DDSketches. See https://datadoghq.atlassian.net/wiki/spaces/QKHS/pages/6291357728/DDSketch+in+Parquet for more information about the DDSketch spec.
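For reviewers unfamiliar with DDSketch: it is a quantile sketch that buckets values by a logarithmic key, which is why a data point serializes naturally as parallel `keys`/`counts` arrays. An illustrative Rust sketch of the standard DDSketch key mapping (not quickwit's implementation):

```rust
// Illustrative DDSketch key mapping: with relative accuracy `alpha`,
// gamma = (1 + alpha) / (1 - alpha), and a positive value x falls into
// bucket key = ceil(ln(x) / ln(gamma)). The sketch stores a sparse map
// of bucket keys to counts, which flattens to parallel keys/counts
// arrays when written to parquet.
fn ddsketch_key(x: f64, alpha: f64) -> i64 {
    let gamma = (1.0 + alpha) / (1.0 - alpha);
    (x.ln() / gamma.ln()).ceil() as i64
}

fn main() {
    let alpha = 0.01;
    // Values within one bucket differ by at most a factor of gamma,
    // which bounds the relative error of quantile estimates by alpha.
    assert_eq!(ddsketch_key(1.0, alpha), 0);
    assert!(ddsketch_key(2.0, alpha) < ddsketch_key(200.0, alpha));
    println!("key(100.0) = {}", ddsketch_key(100.0, alpha));
}
```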
How was this PR tested?