[SPARK-55978][SQL] Add TABLESAMPLE SYSTEM block sampling with DSv2 pushdown#54972
[SPARK-55978][SQL] Add TABLESAMPLE SYSTEM block sampling with DSv2 pushdown#54972stanyao wants to merge 1 commit intoapache:masterfrom
Conversation
bd6f9cf to
afbd7d5
Compare
|
cc @szehon-ho |
afbd7d5 to
340c545
Compare
|
Rebased and resolved the new merge conflict above. |
szehon-ho
left a comment
There was a problem hiding this comment.
Thanks, the idea seems reasonable to me in DSV2 API. Added some comments
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownTableSample.java
Show resolved
Hide resolved
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownTableSample.java
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownTableSample.java
Outdated
Show resolved
Hide resolved
...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
Outdated
Show resolved
Hide resolved
...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
Show resolved
Hide resolved
Summary: CONTEXT: Review feedback on PR apache#54972 for TABLESAMPLE SYSTEM block sampling. WHAT: - Add DSv2 SampleMethod Java enum to replace boolean isSystemSampling parameter - Update SupportsPushDownTableSample Javadoc to clarify BERNOULLI vs SYSTEM - Replace _LEGACY_ERROR_TEMP_0035 with UNSUPPORTED_FEATURE.TABLESAMPLE_SYSTEM - Restrict SYSTEM sample pushdown to direct table scans only (no subqueries/views) - Remove findScanBuilderHolder that incorrectly traversed through Filter/Project Test Plan: Existing tests. Error class added to error-conditions.json.
32fd86b to
84cdd7f
Compare
Summary: CONTEXT: Review feedback on PR apache#54972 for TABLESAMPLE SYSTEM block sampling. WHAT: - Add DSv2 SampleMethod Java enum to replace boolean isSystemSampling parameter - Update SupportsPushDownTableSample Javadoc to clarify BERNOULLI vs SYSTEM - Replace _LEGACY_ERROR_TEMP_0035 with UNSUPPORTED_FEATURE.TABLESAMPLE_SYSTEM - Restrict SYSTEM sample pushdown to direct table scans only (no subqueries/views) - Remove findScanBuilderHolder that incorrectly traversed through Filter/Project Test Plan: Existing tests. Error class added to error-conditions.json.
84cdd7f to
b390671
Compare
Summary: CONTEXT: Review feedback on PR apache#54972 for TABLESAMPLE SYSTEM block sampling. WHAT: - Add DSv2 SampleMethod Java enum to replace boolean isSystemSampling parameter - Update SupportsPushDownTableSample Javadoc to clarify BERNOULLI vs SYSTEM - Replace _LEGACY_ERROR_TEMP_0035 with UNSUPPORTED_FEATURE.TABLESAMPLE_SYSTEM - Restrict SYSTEM sample pushdown to direct table scans only (no subqueries/views) - Remove findScanBuilderHolder that incorrectly traversed through Filter/Project Test Plan: Existing tests. Error class added to error-conditions.json.
szehon-ho
left a comment
There was a problem hiding this comment.
Thanks, leaving a few more comments
...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
Outdated
Show resolved
Hide resolved
...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala
Show resolved
Hide resolved
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownTableSample.java
Show resolved
Hide resolved
| val metadata = conn.getMetaData | ||
| // scalastyle:off line.size.limit | ||
| assert(metadata.getSQLKeywords === "ADD,AFTER,AGGREGATE,ALWAYS,ANALYZE,ANTI,ANY_VALUE,ARCHIVE,ASC,BINDING,BUCKET,BUCKETS,BYTE,CACHE,CACHED,CASCADE,CATALOG,CATALOGS,CHANGE,CHANGES,CLEAR,CLUSTER,CLUSTERED,CODEGEN,COLLATION,COLLATIONS,COLLECTION,COLUMNS,COMMENT,COMPACT,COMPACTIONS,COMPENSATION,COMPUTE,CONCATENATE,CONTAINS,CONTINUE,COST,DATA,DATABASE,DATABASES,DATEADD,DATEDIFF,DATE_ADD,DATE_DIFF,DAYOFYEAR,DAYS,DBPROPERTIES,DEFINED,DEFINER,DELAY,DELIMITED,DESC,DFS,DIRECTORIES,DIRECTORY,DISTRIBUTE,DIV,DO,ELSEIF,ENFORCED,ESCAPED,EVOLUTION,EXCHANGE,EXCLUDE,EXCLUSIVE,EXIT,EXPLAIN,EXPORT,EXTEND,EXTENDED,FIELDS,FILEFORMAT,FIRST,FLOW,FOLLOWING,FORMAT,FORMATTED,FOUND,FUNCTIONS,GENERATED,GEOGRAPHY,GEOMETRY,HANDLER,HOURS,IDENTIFIED,IDENTIFIER,IF,IGNORE,ILIKE,IMMEDIATE,INCLUDE,INCLUSIVE,INCREMENT,INDEX,INDEXES,INPATH,INPUT,INPUTFORMAT,INVOKER,ITEMS,ITERATE,JSON,KEY,KEYS,LAST,LAZY,LEAVE,LEVEL,LIMIT,LINES,LIST,LOAD,LOCATION,LOCK,LOCKS,LOGICAL,LONG,LOOP,MACRO,MAP,MATCHED,MATERIALIZED,MEASURE,METRICS,MICROSECOND,MICROSECONDS,MILLISECOND,MILLISECONDS,MINUS,MINUTES,MONTHS,MSCK,NAME,NAMESPACE,NAMESPACES,NANOSECOND,NANOSECONDS,NORELY,NULLS,OFFSET,OPTION,OPTIONS,OUTPUTFORMAT,OVERWRITE,PARTITIONED,PARTITIONS,PERCENT,PIVOT,PLACING,PRECEDING,PRINCIPALS,PROCEDURES,PROPERTIES,PURGE,QUARTER,QUERY,RECORDREADER,RECORDWRITER,RECOVER,RECURSION,REDUCE,REFRESH,RELY,RENAME,REPAIR,REPEAT,REPEATABLE,REPLACE,RESET,RESPECT,RESTRICT,ROLE,ROLES,SCHEMA,SCHEMAS,SECONDS,SECURITY,SEMI,SEPARATED,SERDE,SERDEPROPERTIES,SETS,SHORT,SHOW,SINGLE,SKEWED,SORT,SORTED,SOURCE,STATISTICS,STORED,STRATIFY,STREAM,STREAMING,STRING,STRUCT,SUBSTR,SYNC,SYSTEM_TIME,SYSTEM_VERSION,TABLES,TARGET,TBLPROPERTIES,TERMINATED,TIMEDIFF,TIMESTAMPADD,TIMESTAMPDIFF,TIMESTAMP_LTZ,TIMESTAMP_NTZ,TINYINT,TOUCH,TRANSACTION,TRANSACTIONS,TRANSFORM,TRUNCATE,TRY_CAST,TYPE,UNARCHIVE,UNBOUNDED,UNCACHE,UNLOCK,UNPIVOT,UNSET,UNTIL,USE,VAR,VARIABLE,VARIANT,VERSION,VIEW,VIEWS,VOID,WATERMARK,WEEK,WEEKS,WHILE,X,YEARS,ZONE") | ||
| assert(metadata.getSQLKeywords === "ADD,AFTER,AGGREGATE,ALWAYS,ANALYZE,ANTI,ANY_VALUE,ARCHIVE,ASC,BERNOULLI,BINDING,BUCKET,BUCKETS,BYTE,CACHE,CACHED,CASCADE,CATALOG,CATALOGS,CHANGE,CHANGES,CLEAR,CLUSTER,CLUSTERED,CODEGEN,COLLATION,COLLATIONS,COLLECTION,COLUMNS,COMMENT,COMPACT,COMPACTIONS,COMPENSATION,COMPUTE,CONCATENATE,CONTAINS,CONTINUE,COST,DATA,DATABASE,DATABASES,DATEADD,DATEDIFF,DATE_ADD,DATE_DIFF,DAYOFYEAR,DAYS,DBPROPERTIES,DEFINED,DEFINER,DELAY,DELIMITED,DESC,DFS,DIRECTORIES,DIRECTORY,DISTRIBUTE,DIV,DO,ELSEIF,ENFORCED,ESCAPED,EVOLUTION,EXCHANGE,EXCLUDE,EXCLUSIVE,EXIT,EXPLAIN,EXPORT,EXTEND,EXTENDED,FIELDS,FILEFORMAT,FIRST,FLOW,FOLLOWING,FORMAT,FORMATTED,FOUND,FUNCTIONS,GENERATED,GEOGRAPHY,GEOMETRY,HANDLER,HOURS,IDENTIFIED,IDENTIFIER,IF,IGNORE,ILIKE,IMMEDIATE,INCLUDE,INCLUSIVE,INCREMENT,INDEX,INDEXES,INPATH,INPUT,INPUTFORMAT,INVOKER,ITEMS,ITERATE,JSON,KEY,KEYS,LAST,LAZY,LEAVE,LEVEL,LIMIT,LINES,LIST,LOAD,LOCATION,LOCK,LOCKS,LOGICAL,LONG,LOOP,MACRO,MAP,MATCHED,MATERIALIZED,MEASURE,METRICS,MICROSECOND,MICROSECONDS,MILLISECOND,MILLISECONDS,MINUS,MINUTES,MONTHS,MSCK,NAME,NAMESPACE,NAMESPACES,NANOSECOND,NANOSECONDS,NORELY,NULLS,OFFSET,OPTION,OPTIONS,OUTPUTFORMAT,OVERWRITE,PARTITIONED,PARTITIONS,PERCENT,PIVOT,PLACING,PRECEDING,PRINCIPALS,PROCEDURES,PROPERTIES,PURGE,QUARTER,QUERY,RECORDREADER,RECORDWRITER,RECOVER,RECURSION,REDUCE,REFRESH,RELY,RENAME,REPAIR,REPEAT,REPEATABLE,REPLACE,RESET,RESPECT,RESTRICT,ROLE,ROLES,SCHEMA,SCHEMAS,SECONDS,SECURITY,SEMI,SEPARATED,SERDE,SERDEPROPERTIES,SETS,SHORT,SHOW,SINGLE,SKEWED,SORT,SORTED,SOURCE,STATISTICS,STORED,STRATIFY,STREAM,STREAMING,STRING,STRUCT,SUBSTR,SYNC,SYSTEM_TIME,SYSTEM_VERSION,TABLES,TARGET,TBLPROPERTIES,TERMINATED,TIMEDIFF,TIMESTAMPADD,TIMESTAMPDIFF,TIMESTAMP_LTZ,TIMESTAMP_NTZ,TINYINT,TOUCH,TRANSACTION,TRANSACTIONS,TRANSFORM,TRUNCATE,TRY_CAST,TYPE,UNARCHIVE,UNBOUNDED,UNCACHE,UNLOCK,UNPIVOT,UNSET,UNTIL,USE,VAR,VARIABLE,VARIANT,VERSION,VIEW,VIEWS,VOID,WATERMARK,WEEK,WEEKS,WHILE,X,YEARS,ZONE") |
There was a problem hiding this comment.
by the way, this doesnt support for Spark connect yet? (proto Sample message should include sample_method)
There was a problem hiding this comment.
@szehon-ho , Spark Connect is outside the scope of this PR. It will require changes in wire protocol and all language clients, which is worth a focused review. Is it OK I log a JIRA item to track a follow up addition of Spark Connect support? I also plan to log another JIRA item to add DataFrame API support too.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TableSampleInfo.scala
Show resolved
Hide resolved
Summary: CONTEXT: Addressing code review feedback on PR apache#54972. WHAT: - Use PhysicalOperation extractor in pushDownSample for System sampling, consistent with other pushdown rules - Remove unnecessary SubqueryAlias match (eliminated before this rule runs) - Guard pushDownJoin against composing with pushed samples to prevent silent sample loss - Use user-friendly error message for TABLESAMPLE_SYSTEM - Include sample method in EXPLAIN pushed sample metadata - Add 'as x' alias to BERNOULLI REPEATABLE parser test for consistency Test Plan: Existing tests. No golden file or EXPLAIN string assertions affected.
3437a06 to
ee53951
Compare
Summary: CONTEXT: Review feedback on PR apache#54972 for TABLESAMPLE SYSTEM block sampling. WHAT: - Add DSv2 SampleMethod Java enum to replace boolean isSystemSampling parameter - Update SupportsPushDownTableSample Javadoc to clarify BERNOULLI vs SYSTEM - Replace _LEGACY_ERROR_TEMP_0035 with UNSUPPORTED_FEATURE.TABLESAMPLE_SYSTEM - Restrict SYSTEM sample pushdown to direct table scans only (no subqueries/views) - Remove findScanBuilderHolder that incorrectly traversed through Filter/Project Test Plan: Existing tests. Error class added to error-conditions.json.
Summary: CONTEXT: Addressing code review feedback on PR apache#54972. WHAT: - Use PhysicalOperation extractor in pushDownSample for System sampling, consistent with other pushdown rules - Remove unnecessary SubqueryAlias match (eliminated before this rule runs) - Guard pushDownJoin against composing with pushed samples to prevent silent sample loss - Use user-friendly error message for TABLESAMPLE_SYSTEM - Include sample method in EXPLAIN pushed sample metadata - Add 'as x' alias to BERNOULLI REPEATABLE parser test for consistency Test Plan: Existing tests. No golden file or EXPLAIN string assertions affected.
ca7eab3 to
549db12
Compare
Summary: CONTEXT: Review feedback on PR apache#54972 for TABLESAMPLE SYSTEM block sampling. WHAT: - Add DSv2 SampleMethod Java enum to replace boolean isSystemSampling parameter - Update SupportsPushDownTableSample Javadoc to clarify BERNOULLI vs SYSTEM - Replace _LEGACY_ERROR_TEMP_0035 with UNSUPPORTED_FEATURE.TABLESAMPLE_SYSTEM - Restrict SYSTEM sample pushdown to direct table scans only (no subqueries/views) - Remove findScanBuilderHolder that incorrectly traversed through Filter/Project Test Plan: Existing tests. Error class added to error-conditions.json.
Summary: CONTEXT: Addressing code review feedback on PR apache#54972. WHAT: - Use PhysicalOperation extractor in pushDownSample for System sampling, consistent with other pushdown rules - Remove unnecessary SubqueryAlias match (eliminated before this rule runs) - Guard pushDownJoin against composing with pushed samples to prevent silent sample loss - Use user-friendly error message for TABLESAMPLE_SYSTEM - Include sample method in EXPLAIN pushed sample metadata - Add 'as x' alias to BERNOULLI REPEATABLE parser test for consistency Test Plan: Existing tests. No golden file or EXPLAIN string assertions affected.
06ab21c to
447d891
Compare
|
Squashed multiple commits into a single one to make code review easier. |
…shdown This PR adds support for ANSI SQL `TABLESAMPLE SYSTEM` (block-level sampling) alongside the existing `TABLESAMPLE BERNOULLI` (row-level sampling). **SQL grammar**: Extended `TABLESAMPLE` to accept an optional `SYSTEM` or `BERNOULLI` qualifier before the sample method. Added both as non-reserved keywords. `TABLESAMPLE SYSTEM` only supports `PERCENT` sampling and does not support `REPEATABLE`. **Logical plan**: Introduced `SampleMethod` sealed trait (`Bernoulli`/`System`) and added it to the `Sample` node. Default is `Bernoulli` for backward compatibility. **DSv2 pushdown**: Added `SampleMethod` Java enum and extended `SupportsPushDownTableSample.pushTableSample()` with a new overload. Sources that don't override the new method reject SYSTEM sampling by default. SYSTEM pushdown is restricted to direct table scans via `PhysicalOperation`. **Physical planning**: SYSTEM samples that aren't pushed down to a DSv2 source raise an `AnalysisException` -- there is no row-level fallback since block sampling is data-source dependent.
447d891 to
a2170d4
Compare
What changes were proposed in this pull request?
This PR adds support for ANSI SQL
TABLESAMPLE SYSTEM(block-level sampling) alongside the existingTABLESAMPLE BERNOULLI(row-level sampling). Key changes:TABLESAMPLEto accept an optionalSYSTEMorBERNOULLIqualifier before the sample method. Added both as non-reserved keywords.SampleMethodsealed trait (Bernoulli/System) and added it to theSamplenode. Default isBernoullifor backward compatibility.TABLESAMPLE SYSTEMonly supportsPERCENTsampling and does not supportREPEATABLE. Other methods (ROWS, BUCKET, BYTES) are rejected with clear error messages.TABLESAMPLE SYSTEMis pushed down to data sources via an extendedSupportsPushDownTableSample.pushTableSample()overload withisSystemSamplingflag. Sources that don't override the new method reject SYSTEM sampling by default.AnalysisException— there is no row-level fallback since block sampling is data-source dependent.Why are the changes needed?
ANSI SQL defines two sampling methods:
BERNOULLI(row-level) andSYSTEM(implementation-dependent, typically block/split-level). Block sampling is significantly faster for large tables since it avoids per-row evaluation, making it useful for approximate queries and data exploration. Many databases (PostgreSQL, Hive, Trino) support this distinction.Does this PR introduce any user-facing change?
Yes. New SQL syntax
TABLESAMPLE SYSTEM (x PERCENT)andTABLESAMPLE BERNOULLI (x PERCENT).BERNOULLIandSYSTEMare added as non-reserved keywords. Existing queries without these keywords behave identically to before.How was this patch tested?
PlanParserSuitecovering: basic parsing, case insensitivity, boundary fractions, unsupported methods (ROWS/BUCKET with SYSTEM), REPEATABLE rejection, fraction validation, identifier preservation, and subquery contexts.SQLQuerySuitetests pass.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (thoroughly refined, reviewed, and tested by human)