fix(spark): parse month, day without leading zeros#7739
Conversation
geooo109
left a comment
There was a problem hiding this comment.
@deepyaman thanks for the PR, nice work!
I tested this a bit. Databricks and Hive seem to have the same strict/lenient distinction as Spark3/4 here. I also tried a few examples on Spark2 and they worked for both padded and non-padded formats, so Spark 2 looks lenient (you can also verify this).
This problem is a bit tricky though. If we apply the reverse mapping, we break the round-trip (and we'd also need to cover the try_* functions, etc).
using the code of this PR (ran the example on spark4):
input spark:
SELECT TRY_TO_TIMESTAMP('2000-1-2', 'yyyy-MM-dd')
> NULL
output spark:
SELECT TRY_TO_TIMESTAMP('2000-1-2', 'yyyy-M-d')
> 2000-01-02 00:00:00
In this example ^ we end up changing the semantics of the final query, it returned NULL before and now it returns a value.
My suggestion would be to first check which dialects are lenient vs non-lenient for this case in general. For example, if it's only the Spark hierarchy, then we should focus just on that and investigate whether an existing mapping can hold this "lenient" logic. If none fits, we can create a "virtual" one to tackle this specific problem.
@geooo109 thanks for taking the time to review!
Sorry, I'm not sure I'm following. I agree that Spark2 is lenient by default, which is why I didn't change how things worked in
Yes, 100% agree, this is exactly the risk I'd called out:
I personally don't see it as a huge risk in the real world, as I wrote, but I called it out because it does break an important property.
My understanding is what I also mentioned in the PR description:
I think the issue is isolated to Spark 3+ (where I made the changes), but let me know if you think it's in the wrong place, or if I'm missing some action items on how to address this differently. Might need a bit more guidance on next steps, but I agree with what you said, but don't understand what (if anything) to change. :) |
|
I think the suggestion for now is to test how other dialects behave in practice, to figure out what dialects we should modify and how. I agree that we shouldn't change the roundtrip like that, altering the semantics. These formats are very common in SQL codebases and so I don't want to risk any regressions. If all (or most) other dialects have the "lax" parsing semantics, like duckdb, we should preserve the existing On the other hand, when parsing Spark, we should produce a different canonical format for its own On how to proceed: please test against some dialects that you have access to (postgres, mysql, duckdb, clickhouse perhaps? some of these have online REPLs), and let us know for which ones you don't so we can test them ourselves. |
Makes sense.
OK, I'll try and find some time to go through my notes on this, and also test them more thoroughly in the next day or few; my looking into this before raising the PR was based on more Googling syntax/specs for different dialects than actually testing, and I really only saw this difference behavior on Spark 3+, but I can probably run some test against Ibis-supported backend dialects pretty easily. |
Resolves ibis-project/ibis#12004 (happy to also raise an issue on this repo, if helpful)
In short, Spark 3+ is different from most other SQL dialects in that
MMandddwill not parse single-digit months and days without leading zeros (it's very strict).Unfortunately, this creates some round-trip complications, although it shouldn't be a big deal unless somebody is relying on Spark 3+'s inability to parse something like
"2026-6-9"using"yyyy-MM-dd", and we round trip that back out to the more lenient"yyyy-M-d"and then their dates parse (like they would in just about any other SQL dialect with"yyyy-MM-dd"...).Note that there are a couple tests where the result in other dialects is getting spit out like
"%Y-%-m-%-d"now (instead of"%Y-%m-%d"), which reflects that leniency. I was trying to suppress that by adding some additional mappings to the Spark dialect:But I've decided to get feedback before going further down that rabbithole. 🐰