Skip to content

fix(spark): preserve raw number text in json_tuple to match Spark#21264

Open
davidlghellin wants to merge 3 commits intoapache:mainfrom
davidlghellin:fix/json_tuple_raw_numbers
Open

fix(spark): preserve raw number text in json_tuple to match Spark#21264
davidlghellin wants to merge 3 commits intoapache:mainfrom
davidlghellin:fix/json_tuple_raw_numbers

Conversation

@davidlghellin
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

json_tuple was parsing JSON numbers with serde_json::Value and re-serializing them via Number::to_string(). This loses precision in two cases verified against Spark 4.1.1:

Input Spark DataFusion (before)
1.5e10 1.5E10 15000000000.0
99999999999999999999 99999999999999999999 1e+20

What changes are included in this PR?

  • Switched from serde_json::Value to HashMap<String, Box<RawValue>> in json_tuple_inner to preserve original JSON text for numbers
  • Added raw_value feature to serde_json dependency in datafusion-spark (lightweight, no behavior change for other code)
  • Spark uppercases exponent notation (1.5e101.5E10), handled with a simple replace('e', "E")
  • Added 8 new SLT tests: scientific notation, large integers, normal int/float, trailing comma, empty key, "null" as key, interleaved exists/missing fields
  • Added 5 unit tests for number precision edge cases

Are these changes tested?

Yes.

  • 7 unit tests in json_tuple.rs (5 new for number precision + 2 existing)
  • 27 SLT tests in spark/json/json_tuple.slt (8 new + 19 existing)
  • All results validated against Spark 4.1.1

Are there any user-facing changes?

Yes — json_tuple now returns the original JSON number text instead of a re-serialized float. This is a correctness fix aligning with Spark behavior.

Copilot AI review requested due to automatic review settings March 30, 2026 20:57
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Mar 30, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Aligns datafusion-spark’s json_tuple behavior with Apache Spark by preserving the original JSON numeric text (avoiding precision/format changes caused by parsing into serde_json::Value and re-stringifying).

Changes:

  • Updated json_tuple_inner to deserialize JSON objects into HashMap<String, Box<RawValue>> and emit raw JSON text for non-string scalars (with special handling for exponent casing).
  • Enabled serde_json’s raw_value feature for the datafusion-spark crate.
  • Added SLT coverage for additional edge cases and numeric formatting/precision scenarios, plus unit tests for numeric cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
datafusion/spark/src/function/json/json_tuple.rs Switches parsing to RawValue to preserve numeric text; adds unit tests for numeric edge cases.
datafusion/spark/Cargo.toml Enables serde_json raw_value feature needed for RawValue deserialization.
datafusion/sqllogictest/test_files/spark/json/json_tuple.slt Adds Spark-compatibility SLT cases (invalid JSON, key edge cases, numeric precision/format).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +161 to +163
// Numbers, booleans: use raw text as-is
// Spark uppercases exponent: 1.5e10 → 1.5E10
if raw_str.contains('e') {
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exponent uppercasing logic is applied to any non-string JSON value that contains the character 'e'. This will corrupt values like the boolean literal false (becomes falsE) and can also mutate nested objects/arrays if their raw JSON contains 'e' in keys/strings (e.g. { "here": 1 } becomes { "hErE": 1 }). Uppercasing should only be applied when the raw value is a JSON number that uses exponent notation (e/E), not for booleans/objects/arrays.

Suggested change
// Numbers, booleans: use raw text as-is
// Spark uppercases exponent: 1.5e10 → 1.5E10
if raw_str.contains('e') {
// Numbers, booleans, objects, arrays: use raw text as-is
// Spark uppercases exponent in numeric literals: 1.5e10 → 1.5E10
if (raw_str.contains('e') || raw_str.contains('E'))
&& serde_json::from_str::<serde_json::Number>(raw_str).is_ok()
{
// Only adjust exponent marker for valid numeric literals

Copilot uses AI. Check for mistakes.
Comment on lines +160 to +167
} else {
// Numbers, booleans: use raw text as-is
// Spark uppercases exponent: 1.5e10 → 1.5E10
if raw_str.contains('e') {
builder.append_value(raw_str.replace('e', "E"));
} else {
builder.append_value(raw_str);
}
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json_tuple currently returns the raw text for -0 (i.e. "-0"), but the test comment notes Spark returns "0". If the goal is Spark compatibility, consider normalizing negative zero to "0" (and then assert exactly "0" in the unit test) so behavior is deterministic and matches Spark.

Copilot uses AI. Check for mistakes.
Comment on lines +278 to +284
// RawValue preserves '-0', but Spark returns '0'
// This is acceptable — both are valid representations
let result = json_tuple_single(r#"{"v":-0}"#, "v");
assert!(
result == Some("-0".to_string()) || result == Some("0".to_string()),
"expected '-0' or '0', got {:?}",
result
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unit test allows either "-0" or "0", which makes it possible for json_tuple to diverge from Spark without failing CI (the comment states Spark returns "0"). If Spark compatibility is required here, the test should assert a single expected value after the implementation normalizes negative zero.

Suggested change
// RawValue preserves '-0', but Spark returns '0'
// This is acceptable — both are valid representations
let result = json_tuple_single(r#"{"v":-0}"#, "v");
assert!(
result == Some("-0".to_string()) || result == Some("0".to_string()),
"expected '-0' or '0', got {:?}",
result
// Ensure compatibility with Spark by expecting normalized "0"
assert_eq!(
json_tuple_single(r#"{"v":-0}"#, "v"),
Some("0".to_string())

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// Spark uppercases exponent in numeric literals:
// 1.5e10 → 1.5E10
// Only apply to numbers (not booleans like "false")
let first = raw_str.as_bytes().first();
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raw_str.as_bytes().first() returns Option<&u8>, but matches!(first, Some(b'0'..=b'9' | b'-')) is matching as if it were Option<u8>. This should fail to compile due to the &u8 vs u8 mismatch. Consider using .copied() (or otherwise dereferencing) before matching so the pattern works on Option<u8>.

Suggested change
let first = raw_str.as_bytes().first();
let first = raw_str.as_bytes().first().copied();

Copilot uses AI. Check for mistakes.
Comment on lines +160 to +164
} else {
// Numbers, booleans, objects, arrays: raw text
// Spark uppercases exponent in numeric literals:
// 1.5e10 → 1.5E10
// Only apply to numbers (not booleans like "false")
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation now returns the raw JSON token text not only for numbers, but also for booleans/objects/arrays (builder.append_value(raw_str)), whereas the previous serde_json::Value + to_string() path would have produced a canonicalized serialization (e.g., potentially different whitespace/key ordering). If the intended user-facing change is only number formatting/precision, consider keeping to_string() for non-number values or updating the PR description/docs to reflect the broader behavior change.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

spark sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants