ClickHouseIO: Add DateTime64 support for sub-second timestamp precision #38510
Eliaaazzz wants to merge 1 commit into
Conversation
ClickHouse's DateTime64(precision[, 'timezone']) was not recognized by
TableSchema or the column-type parser, so pipelines emitting sub-second
timestamps (log/event ingestion, financial data) could not write to
DateTime64 columns.
This adds:
* TypeName.DATETIME64 with precision (0-9) and optional timezone fields,
plus a ColumnType.dateTime64(precision[, timezone]) factory.
* Parser grammar for DateTime64(<precision>[, '<timezone>']) so the type
is also recognized inside Nullable(...) and Array(...) via the
existing primitive() rule.
* Beam schema mapping picks the narrowest logical type that round-trips
the requested precision:
precision <= 3 → Joda DATETIME (preserves existing pipelines).
precision 4-6 → SqlTypes.TIMESTAMP (MicrosInstant).
precision >= 7 → NanosInstant, the only built-in logical type that
carries full nanosecond precision through a Row;
MicrosInstant would reject sub-micro nanos.
* Writer serialization as a little-endian Int64 of
epoch_seconds * 10^precision + sub_second_units, accepting both Joda
ReadableInstant and java.time.Instant inputs; floor division on
negative timestamps matches ClickHouse's own encoding.
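The encoding in the bullet above can be sketched in a few lines. This is an illustrative helper, not the PR's actual code; it leans on the fact that java.time.Instant already stores a floored epoch second plus a non-negative nanosecond, which yields the same floor semantics for negative timestamps that the writer gets from explicit floor division:

```java
import java.time.Instant;

public class DateTime64Encoding {
  // Illustrative sketch of the Int64 encoding described above:
  // epoch_seconds * 10^precision + sub_second_units. Instant.getEpochSecond()
  // is already floored and getNano() is non-negative, so negative timestamps
  // encode the way ClickHouse expects without extra floorDiv/floorMod calls.
  static long encodeDateTime64(Instant ts, int precision) {
    long pow = 1;
    for (int i = 0; i < precision; i++) {
      pow = Math.multiplyExact(pow, 10L); // 10^precision, overflow-checked
    }
    long nanosPerUnit = 1_000_000_000L / pow;          // e.g. 100 ns per unit at precision 7
    long subSecondUnits = ts.getNano() / nanosPerUnit; // truncates sub-unit nanos
    return Math.addExact(Math.multiplyExact(ts.getEpochSecond(), pow), subSecondUnits);
  }

  public static void main(String[] args) {
    Instant t = Instant.parse("2024-01-01T00:00:00.123456789Z");
    System.out.println(encodeDateTime64(t, 3)); // millis: 1704067200123
    System.out.println(encodeDateTime64(t, 7)); // 100 ns ticks: 17040672001234567
    // Half a second before the epoch: floored second -1, nano 500_000_000
    System.out.println(encodeDateTime64(Instant.parse("1969-12-31T23:59:59.500Z"), 3)); // -500
  }
}
```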
Tests: parser cases for DateTime64(3), DateTime64(6,'UTC'),
DateTime64(9), Nullable(DateTime64(...)) and Array(DateTime64(...));
schema-mapping tests for the micros and nanos buckets; encoder unit
tests covering Joda/java.time inputs, zero/nano/negative edge cases and
the precision-7 100 ns tick truncation path; round-trip integration
tests against the ClickHouse testcontainer for precisions 3/6/9 (with
non-micro-aligned nanos for the nanos case) plus a nullable case.
Closes apache#38466
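As a rough illustration of the strings the grammar accepts, here is a hypothetical regex-based sketch; the real change extends the parser's existing primitive() rule rather than using a regex:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateTime64Parse {
  // Rough regex approximation of DateTime64(<precision>[, '<timezone>']).
  // Illustration only: the actual parser handles this inside its grammar,
  // so the type also works nested in Nullable(...) and Array(...).
  private static final Pattern DATETIME64 =
      Pattern.compile("DateTime64\\((\\d)(?:\\s*,\\s*'([^']+)')?\\)");

  static String describe(String columnType) {
    Matcher m = DATETIME64.matcher(columnType);
    if (!m.matches()) {
      return "no match";
    }
    String tz = m.group(2); // null when no timezone argument was given
    return "precision=" + m.group(1) + (tz == null ? "" : " tz=" + tz);
  }

  public static void main(String[] args) {
    System.out.println(describe("DateTime64(3)"));        // precision=3
    System.out.println(describe("DateTime64(6, 'UTC')")); // precision=6 tz=UTC
  }
}
```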
Summary of Changes (Gemini Code Assist)
This pull request introduces support for ClickHouse's DateTime64 data type, allowing users to write timestamps with sub-second precision (up to nanoseconds) to ClickHouse tables. The changes include updates to the schema model, parser, and writer to accommodate the precision and timezone requirements of DateTime64, while maintaining backward compatibility for existing pipelines using standard DateTime.
Code Review
This pull request adds support for ClickHouse DateTime64 columns with sub-second precision to ClickHouseIO. The implementation includes a new encoding method for timestamps, updates to the ColumnType parser, and logic to map ClickHouse precisions to appropriate Beam logical types like DATETIME, SqlTypes.TIMESTAMP, and NanosInstant. Comprehensive tests were added to verify the encoding and schema mapping. Reviewers recommended using a constant for NanosInstant to improve performance and suggested adding Objects.requireNonNull when unboxing the precision value to prevent potential NullPointerExceptions.
    } else if (p <= 6) {
      return Schema.FieldType.logicalType(SqlTypes.TIMESTAMP);
    } else {
      return Schema.FieldType.logicalType(new NanosInstant());
For better performance and consistency with other logical types (like SqlTypes.TIMESTAMP), consider defining a private constant for NanosInstant instead of instantiating it every time getEquivalentFieldType is called for a DATETIME64 column with precision ≥ 7.
-      return Schema.FieldType.logicalType(new NanosInstant());
+      return Schema.FieldType.logicalType(NANOS_INSTANT);
        break;
      case DATETIME64:
        BinaryStreamUtils.writeInt64(stream, encodeDateTime64(value, columnType.precision()));
The precision() method on ColumnType is marked as @Nullable. While the factory and parser ensure it is set for DATETIME64, unboxing it here to an int for the encodeDateTime64 call could theoretically throw a NullPointerException if a ColumnType was manually constructed via the builder without a precision. Consider adding a null check or using Objects.requireNonNull for robustness.
-        BinaryStreamUtils.writeInt64(stream, encodeDateTime64(value, columnType.precision()));
+        BinaryStreamUtils.writeInt64(stream, encodeDateTime64(value, java.util.Objects.requireNonNull(columnType.precision())));
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment

Thanks, this looks pretty complete. R: @BentsiLeviav (ClickHouseIO owner) could you please take a look?

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment
ClickHouseIO's TableSchema and column-type parser only recognized DateTime (second precision), so pipelines emitting sub-second timestamps (log/event ingestion, financial data) could not write to ClickHouse tables declared with DateTime64(precision[, 'timezone']) columns.

This change adds first-class DateTime64 support to ClickHouseIO:
* TypeName.DATETIME64; ColumnType carries precision (0-9, validated) and an optional timezone, with a ColumnType.dateTime64(precision[, timezone]) factory.
* Parser grammar for DateTime64(<precision>[, '<timezone>']), also reachable through Nullable(...) and Array(...) via the existing primitive() rule.
* Schema mapping: precision <= 3 → Joda DATETIME (preserves existing pipelines); precision 4-6 → SqlTypes.TIMESTAMP (MicrosInstant); precision >= 7 → NanosInstant, the only built-in logical type that preserves full nanosecond precision through a Row (MicrosInstant rejects non-micro-aligned nanos).
* Writer serializes DateTime64 as a little-endian Int64 of epoch_seconds * 10^precision + sub_second_units, accepting both Joda ReadableInstant and java.time.Instant. Uses Math.floorDiv/Math.floorMod so negative timestamps match ClickHouse's encoding, and Math.multiplyExact/Math.addExact for overflow safety.

Tests:
* TableSchemaTest: parser cases for DateTime64(3), DateTime64(6, 'UTC'), DateTime64(9), Nullable(DateTime64(...)), Array(DateTime64(...)); schema-mapping tests for the millis, micros and nanos buckets; precision-range validation.
* ClickHouseWriterTest: encoder unit tests covering Joda and java.time.Instant inputs, precision 0/3/6/7/9, negative timestamps and the precision-7 100 ns truncation path.
* ClickHouseIOIT: round-trip integration tests against the ClickHouse test container for precisions 3/6/9 (the 9-precision case uses non-micro-aligned nanos) and Nullable(DateTime64(6)).

fixes #38466
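The precision bucketing in the schema-mapping bullet can be sketched as a standalone function. The enum here is a hypothetical stand-in for the actual Beam logical types (Joda DATETIME, SqlTypes.TIMESTAMP, NanosInstant), not real API:

```java
public class DateTime64Mapping {
  // Stand-in for the Beam logical types the mapping targets (illustrative).
  enum BeamLogicalType { JODA_DATETIME, MICROS_INSTANT, NANOS_INSTANT }

  // Narrowest logical type that round-trips the requested precision,
  // mirroring the mapping described above.
  static BeamLogicalType logicalTypeFor(int precision) {
    if (precision < 0 || precision > 9) {
      throw new IllegalArgumentException("DateTime64 precision must be 0-9, got " + precision);
    }
    if (precision <= 3) {
      return BeamLogicalType.JODA_DATETIME;  // millis; keeps existing pipelines working
    } else if (precision <= 6) {
      return BeamLogicalType.MICROS_INSTANT; // SqlTypes.TIMESTAMP
    } else {
      return BeamLogicalType.NANOS_INSTANT;  // only built-in type carrying full nanos
    }
  }

  public static void main(String[] args) {
    System.out.println(logicalTypeFor(3)); // JODA_DATETIME
    System.out.println(logicalTypeFor(6)); // MICROS_INSTANT
    System.out.println(logicalTypeFor(9)); // NANOS_INSTANT
  }
}
```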