Skip to content

Apply sample limit to UDFailedRowsExpressionQuery#1986

Draft
denisdallinga wants to merge 1 commit intosodadata:v3from
denisdallinga:soda-1985-fix-failed-row-checks-not-honoring-sample-limit
Draft

Apply sample limit to UDFailedRowsExpressionQuery#1986
denisdallinga wants to merge 1 commit intosodadata:v3from
denisdallinga:soda-1985-fix-failed-row-checks-not-honoring-sample-limit

Conversation

@denisdallinga
Copy link

@denisdallinga denisdallinga commented Jan 11, 2024

Github Issue: #1985

This changes the way samples are collected from UserDefinedFailedRowsExpressionQueries. In order to apply the samples limit given in the check configuration, or apply the default samples limit, the failed rows expression needs to fire off an additional SampleQuery. The SampleQuery executes a copy of the query, appending a LIMIT clause.

This changes the way samples are collected from
UserDefinedFailedRowsExpressionQueries. In order to apply the samples
limit given in the check configuration, or apply the default samples
limit, the failed rows expression needs to fire off an additional
SampleQuery. The SampleQuery executes a copy of the query, appending a
LIMIT clause.

[sodadata#1985](sodadata#1985)
@CLAassistant
Copy link

CLAassistant commented Jan 11, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Denis Dallinga seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@sonarqubecloud
Copy link

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@NathanBick
Copy link

@denisdallinga I'm working on a PR to address this issue. Maybe we can team up to get more momentum on this issue?

Copy link
Contributor

@m1n0 m1n0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution! It looks like this works, but I would suggest a few changes to make it consistent with how other check work:

  • use .fetchone() method of the query class to get the result
  • set the check metric value
  • if there are results and check has samples limit >0 then create a new sample query object with limit

You can check duplicate check query that does exactly that for inspiration, it should be pretty much 100% copy-paste except for the sql itself. https://github.com/sodadata/soda-core/blob/main/soda/core/soda/execution/query/duplicates_query.py#L95-L107

One note - please keep in mind that different data sources use different syntax for limit - TOP/FETCH FIRST, and it's even different syntax (you dont just append it to the end of the query) - so the way to go here is to:

  • refactor get_failed_rows_sql to take the sql string itself from the data_source class
  • override that sql generation in each datasource that does not use LIMIT - the ones that need that override have the LIMIT_KEYWORD attribute overridden

Lastly - this should be covered by tests as its error-prone with all the variances in how limit is done across data sources, there already is a test for how the limit keyword is applied here, it just needs to be adjusted to include user defined check with expression.

I hope this is not too much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants