At yesterday's PyData meetup in Zurich, one question made me realize that we handle the interaction between group rules and row-level rules incorrectly: if a row-level rule removes a row and that removal causes a group rule to fail, we do not notice it. For example:
import dataframely as dy
import polars as pl


class DiagnosisSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    diagnosis = dy.String(primary_key=True, regex="^[A-Z]{3}$")
    is_main = dy.Bool(nullable=False)

    # Group rule: every invoice must have exactly one main diagnosis.
    @dy.rule(group_by=["invoice_id"])
    def exactly_one_main_diagnosis() -> pl.Expr:
        return pl.col("is_main").sum() == 1


df = pl.DataFrame(
    {
        "invoice_id": ["A", "A", "A"],
        # "123" fails the regex, so the row holding the only main diagnosis is dropped.
        "diagnosis": ["ABC", "ABD", "123"],
        "is_main": [False, False, True],
    }
)

good, _ = DiagnosisSchema.filter(df)
print(good)
results in
shape: (2, 3)
┌────────────┬───────────┬─────────┐
│ invoice_id ┆ diagnosis ┆ is_main │
│ ---        ┆ ---       ┆ ---     │
│ str        ┆ str       ┆ bool    │
╞════════════╪═══════════╪═════════╡
│ A          ┆ ABC       ┆ false   │
│ A          ┆ ABD       ┆ false   │
└────────────┴───────────┴─────────┘
which clearly violates the schema since we don't have a main diagnosis for the group.