Fix non-deterministic iteration in SessionStateBuilder#21262
Fix non-deterministic iteration in SessionStateBuilder#21262shehab-ali wants to merge 1 commit intoapache:mainfrom
Conversation
a9333c6 to
97e37c6
Compare
| /// one [`Arc`] per logical function. | ||
| fn dedup_function_registry_by_canonical_name<T>( | ||
| map: &HashMap<String, Arc<T>>, | ||
| canonical_name: impl Fn(&T) -> &str, |
There was a problem hiding this comment.
Do we need canonical_name as it's own function if all the uses are the same function?
There was a problem hiding this comment.
those function have different types though: WindowUDF, ScalarUDF, AggregateUDF
There was a problem hiding this comment.
ah I see now, my bad.
97e37c6 to
936c4bd
Compare
ahmed-mez
left a comment
There was a problem hiding this comment.
Thank you for fixing this! I hope the maintainers review this soon.
| /// matches the canonical name. The session stores one hash map entry per alias | ||
| /// plus the canonical name; filtering to canonical-name entries yields exactly | ||
| /// one [`Arc`] per logical function. | ||
| fn dedup_function_registry_by_canonical_name<T>( |
There was a problem hiding this comment.
minor: I think the function could consume the map instead of borrowing it, to avoid some arc clones
fn dedup_function_registry_by_canonical_name<T>(
map: HashMap<String, Arc<T>>,
canonical_name: impl Fn(&T) -> &str,
) -> Vec<Arc<T>> {
map.into_iter()
.filter(|(key, udf)| key.as_str() == canonical_name(udf.as_ref()))
.map(|(_, udf)| udf) // no Arc::clone needed
.collect()
}There was a problem hiding this comment.
Alternatively, it might be more robust to dedup by identity
fn dedup_by_identity<T>(map: HashMap<String, Arc<T>>) -> Vec<Arc<T>> {
let mut seen = HashSet::new();
map.into_values()
.filter(|arc| seen.insert(Arc::as_ptr(arc)))
.collect()
}|
Thanks for the description of the bug that's being encountered here. However, I'm having trouble understanding the implications. Is it a potential performance issue, if we have a massive number of UDFs? There's mention of it not being deterministic but what knock-on effects would this have? I tried out the test introduced here but it passes on main, so I want to understand what exactly this is fixing. |
|
The bug showed up when SessionState is rebuilt via SessionStateBuilder::new_from_existing(), a user's UDF override can be silently reverted, causing the wrong function to execute at query time. Example: a user registers postgres_to_char with alias "to_char" to override the built-in to_char. Queries using to_char() correctly call the custom implementation. But when any code path rebuilds the session state, HashMap::into_values() emits both the custom UDF and the built-in (which survived under its "date_format" alias key). build() re-registers both, and last writer wins. If the built-in happens to register last, "to_char" now points back to the built-in, and subsequent queries silently call the wrong function. This happens nondeterministically depending on HashMap iteration order. With these changes, the filter key.as_str() == udf.name() only keeps entries where the HashMap key is the UDF's canonical name. This drops the built-in because its canonical key ("to_char") was overwritten by the custom UDF; it only exists under "date_format", which doesn't match its canonical name "to_char". So the built-in is never passed to build(), and the custom UDF's override is preserved. |
|
Thanks for the example. I was able to reproduce it via modifying the test included here to have two registered UDFs as described. It would be good to enhance the test introduced here to clearly reproduce the error (even if it happens randomly).
Is this dropping of |
Which issue does this PR close?
Problem:
SessionStateBuilder::new_from_existing()previously usedHashMap::into_values().collect_vec()for scalar, aggregate, and window functions. Since the HashMap stores entries for both the canonical name and each alias (all pointing to the sameArc<UDF>),into_values()produced duplicate entries in arbitrary order. Whenbuild()re-registered them, the last-writer-wins behavior on shared alias keys was nondeterministic.Fix: Introduced
dedup_function_registry_by_canonical_name()which:Iterates HashMap keys in sorted order (deterministic)
Keeps only one
Arc<UDF>per unique canonical name (no duplicates)This ensures build() re-registers each function exactly once, making the round-trip through
new_from_existing()→build()deterministic and alias-preserving.Bug Example
Problem: When SessionState is rebuilt via SessionStateBuilder::new_from_existing(), a user's UDF override can be silently reverted, causing the wrong function to execute at query time.
Concrete example: a user registers postgres_to_char with alias "to_char" to override the built-in to_char. Queries using to_char(...) correctly call the custom implementation. But when any code path rebuilds the session state, HashMap::into_values() emits both the custom UDF and the built-in (which survived under its "date_format" alias key). build() re-registers both, and last writer wins. If the built-in happens to register last, "to_char" now points back to the built-in, and subsequent queries silently call the wrong function. This happens nondeterministically depending on HashMap iteration order.
What changed: the filter key.as_str() == udf.name() only keeps entries where the HashMap key is the UDF's canonical name. This drops the built-in because its canonical key ("to_char") was overwritten by the custom UDF; it only exists under "date_format", which doesn't match its canonical name "to_char". So the built-in is never passed to build(), and the custom UDF's override is preserved.
Are these changes tested?
Added unit test
Are there any user-facing changes?
No