Apply sample limit to UDFailedRowsExpressionQuery#1986
Apply sample limit to UDFailedRowsExpressionQuery#1986denisdallinga wants to merge 1 commit intosodadata:v3from
Conversation
This changes the way samples are collected from UserDefinedFailedRowsExpressionQueries. In order to apply the samples limit given in the check configuration, or apply the default samples limit, the failed rows expression needs to fire off an additional SampleQuery. The SampleQuery executes a copy of the query, appending a LIMIT clause. [sodadata#1985](sodadata#1985)
|
Denis Dallinga seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
|
|
@denisdallinga I'm working on a PR to address this issue. Maybe we can team up to get more momentum on this issue? |
m1n0
left a comment
There was a problem hiding this comment.
Thanks for this contribution! It looks like this works, but I would suggest a few changes to make it consistent with how other check work:
- use
.fetchone()method of the query class to get the result - set the check metric value
- if there are results and check has samples limit >0 then create a new sample query object with limit
You can check duplicate check query that does exactly that for inspiration, it should be pretty much 100% copy-paste except for the sql itself. https://github.com/sodadata/soda-core/blob/main/soda/core/soda/execution/query/duplicates_query.py#L95-L107
One note - please keep in mind that different data sources use different syntax for limit - TOP/FETCH FIRST, and it's even different syntax (you dont just append it to the end of the query) - so the way to go here is to:
- refactor
get_failed_rows_sqlto take the sql string itself from the data_source class - override that sql generation in each datasource that does not use LIMIT - the ones that need that override have the
LIMIT_KEYWORDattribute overridden
Lastly - this should be covered by tests as its error-prone with all the variances in how limit is done across data sources, there already is a test for how the limit keyword is applied here, it just needs to be adjusted to include user defined check with expression.
I hope this is not too much!

Github Issue: #1985
This changes the way samples are collected from
UserDefinedFailedRowsExpressionQueries. In order to apply the samples limit given in the check configuration, or apply the default samples limit, the failed rows expression needs to fire off an additional SampleQuery. The SampleQuery executes a copy of the query, appending a LIMIT clause.