Fix stale DB redirects appearing as new assessments in quality logs#1107
Fix stale DB redirects appearing as new assessments in quality logs#1107arnchlmcodes wants to merge 5 commits intoopenzim:mainfrom
Conversation
ba03457 to
81f4b9b
Compare
|
I love it! 4 years old challenge fixed by a two lines patch! |
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #1107 +/- ##
==========================================
- Coverage 92.78% 92.73% -0.05%
==========================================
Files 74 74
Lines 4308 4310 +2
==========================================
Hits 3997 3997
- Misses 311 313 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
audiodude
left a comment
There was a problem hiding this comment.
Shouldn't you just add a comparison between page_touched and timestamp_dt to the SQL query itself? Why try to compare after the row has already been retrieved?
|
@audiodude Changed the approach to check the condition within the SQL query itself, Please take a look and lmk |
| args_dict = { | ||
| "title": wiki_db_title, | ||
| "namespace": namespace, | ||
| "timestamp": timestamp_dt.strftime("%Y%m%d%H%M%S") |
There was a problem hiding this comment.
You should not use a string literal for the timestamp format.
As you may have noticed, WP1 has very strange, unorthodox, and confusing ways of storing timestamps in MariaDB. It comes down to the fact that:
- Timestamps are stored as strings, in a format such that string comparison is equivalent to comparison of corresponding TIMESTAMP types.
- The WP1 application database, often referenced with the variable
wp10dbuses a different such string fromat than the Wikipedia replica database (often referenced aswikidb).
The formatting constants are stored in constants.py: https://github.com/openzim/wp1/blob/main/wp1/constants.py#L16-L17
See an example here:
https://github.com/openzim/wp1/blob/main/wp1/models/wp10/log.py
Not that the "WP1" model of log stores one of its timestamp fields in the WP1 format, because it is native data generated by our application, and the other in the Wikipedia replica format, because it is data taken from the replica db.
Since you are querying the replica db here (wikidb), it seems you have in fact introduced a bug, because the formatting you chose looks more like the WP1 format. But please confirm.
page_touched was being parsed into timestamp_dt directly, overwriting the project's last run cutoff before any comparison could be made. Now it Stores the parsed value in row_timestamp_dt and return None if not newer than last run
Fixes #462