Skip to content

fix(spark): parse month, day without leading zeros#7739

Open
deepyaman wants to merge 3 commits into
tobymao:mainfrom
deepyaman:patch-1
Open

fix(spark): parse month, day without leading zeros#7739
deepyaman wants to merge 3 commits into
tobymao:mainfrom
deepyaman:patch-1

Conversation

@deepyaman

Copy link
Copy Markdown

Resolves ibis-project/ibis#12004 (happy to also raise an issue on this repo, if helpful)

In short, Spark 3+ is different from most other SQL dialects in that MM and dd will not parse single-digit months and days without leading zeros (it's very strict).

Unfortunately, this creates some round-trip complications, although it shouldn't be a big deal unless somebody is relying on Spark 3+'s inability to parse something like "2026-6-9" using "yyyy-MM-dd", and we round trip that back out to the more lenient "yyyy-M-d" and then their dates parse (like they would in just about any other SQL dialect with "yyyy-MM-dd"...).

Note that there are a couple tests where the result in other dialects is getting spit out like "%Y-%-m-%-d" now (instead of "%Y-%m-%d"), which reflects that leniency. I was trying to suppress that by adding some additional mappings to the Spark dialect:

diff --git a/sqlglot/dialects/spark.py b/sqlglot/dialects/spark.py
index 5ea11f6f..4e9524ee 100644
--- a/sqlglot/dialects/spark.py
+++ b/sqlglot/dialects/spark.py
@@ -17,6 +17,17 @@ class Spark(Spark2):
     ARRAY_FUNCS_PROPAGATES_NULLS = True
     EXPRESSION_METADATA = EXPRESSION_METADATA.copy()
 
+    TIME_MAPPING = {
+        **Spark2.TIME_MAPPING,
+        "M": "%m",
+        "d": "%d",
+    }
+
+    INVERSE_TIME_MAPPING = {
+        "%m": "MM",
+        "%d": "DD",
+    }
+
     LENIENT_INVERSE_TIME_MAPPING = {v: k for k, v in Spark2.TIME_MAPPING.items()} | {
         # Parse zero-padded months and days, as per strptime() behavior.
         "%m": "M",

But I've decided to get feedback before going further down that rabbithole. 🐰

@deepyaman deepyaman marked this pull request as ready for review June 11, 2026 15:59
@georgesittas georgesittas requested a review from geooo109 June 15, 2026 11:44

@geooo109 geooo109 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepyaman thanks for the PR, nice work!

I tested this a bit. Databricks and Hive seem to have the same strict/lenient distinction as Spark3/4 here. I also tried a few examples on Spark2 and they worked for both padded and non-padded formats, so Spark 2 looks lenient (you can also verify this).

This problem is a bit tricky though. If we apply the reverse mapping, we break the round-trip (and we'd also need to cover the try_* functions, etc).

using the code of this PR (ran the example on spark4):

input spark:
SELECT TRY_TO_TIMESTAMP('2000-1-2', 'yyyy-MM-dd')
> NULL

output spark:
SELECT TRY_TO_TIMESTAMP('2000-1-2', 'yyyy-M-d')
> 2000-01-02 00:00:00

In this example ^ we end up changing the semantics of the final query, it returned NULL before and now it returns a value.

My suggestion would be to first check which dialects are lenient vs non-lenient for this case in general. For example, if it's only the Spark hierarchy, then we should focus just on that and investigate whether an existing mapping can hold this "lenient" logic. If none fits, we can create a "virtual" one to tackle this specific problem.

@deepyaman

Copy link
Copy Markdown
Author

@deepyaman thanks for the PR, nice work!

@geooo109 thanks for taking the time to review!

I tested this a bit. Databricks and Hive seem to have the same strict/lenient distinction as Spark3/4 here. I also tried a few examples on Spark2 and they worked for both padded and non-padded formats, so Spark 2 looks lenient (you can also verify this).

Sorry, I'm not sure I'm following. I agree that Spark2 is lenient by default, which is why I didn't change how things worked in spark2.py. Spark 2 used the lenient SimpleDateFormat, but from Spark 3 onwards it shifted to the strict DateTimeFormatter. Databricks is aligned with Spark 3+ (so strict; some examples updated). I honestly didn't look into Hive (haven't dealt with it in years), but left it untouched.

This problem is a bit tricky though. If we apply the reverse mapping, we break the round-trip (and we'd also need to cover the try_* functions, etc).

using the code of this PR (ran the example on spark4):

input spark:
SELECT TRY_TO_TIMESTAMP('2000-1-2', 'yyyy-MM-dd')
> NULL

output spark:
SELECT TRY_TO_TIMESTAMP('2000-1-2', 'yyyy-M-d')
> 2000-01-02 00:00:00

In this example ^ we end up changing the semantics of the final query, it returned NULL before and now it returns a value.

Yes, 100% agree, this is exactly the risk I'd called out:

Unfortunately, this creates some round-trip complications, although it shouldn't be a big deal unless somebody is relying on Spark 3+'s inability to parse something like "2026-6-9" using "yyyy-MM-dd", and we round trip that back out to the more lenient "yyyy-M-d" and then their dates parse (like they would in just about any other SQL dialect with "yyyy-MM-dd"...).

I personally don't see it as a huge risk in the real world, as I wrote, but I called it out because it does break an important property.

My suggestion would be to first check which dialects are lenient vs non-lenient for this case in general. For example, if it's only the Spark hierarchy, then we should focus just on that and investigate whether an existing mapping can hold this "lenient" logic. If none fits, we can create a "virtual" one to tackle this specific problem.

My understanding is what I also mentioned in the PR description:

In short, Spark 3+ [including Databricks] is different from most other SQL dialects in that MM and dd will not parse single-digit months and days without leading zeros (it's very strict).

I think the issue is isolated to Spark 3+ (where I made the changes), but let me know if you think it's in the wrong place, or if I'm missing some action items on how to address this differently. Might need a bit more guidance on next steps, but I agree with what you said, but don't understand what (if anything) to change. :)

@georgesittas

georgesittas commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

I think the suggestion for now is to test how other dialects behave in practice, to figure out what dialects we should modify and how. I agree that we shouldn't change the roundtrip like that, altering the semantics. These formats are very common in SQL codebases and so I don't want to risk any regressions.

If all (or most) other dialects have the "lax" parsing semantics, like duckdb, we should preserve the existing %m format and implicitly treat it as "lax". So, Spark will convert it into M. Similarly for other formats like %D.

On the other hand, when parsing Spark, we should produce a different canonical format for its own MM, so that it doesn't map back to %m. This new format doesn't have to be a valid strftime format, but can be something like %mstrict, or whatever. Basically something that can help us create a bijection and preserve the roundtrip, while encoding/representing the semantics in the name for clarity purposes.

On how to proceed: please test against some dialects that you have access to (postgres, mysql, duckdb, clickhouse perhaps? some of these have online REPLs), and let us know for which ones you don't so we can test them ourselves.

@deepyaman

Copy link
Copy Markdown
Author

I think the suggestion for now is to test how other dialects behave in practice, to figure out what dialects we should modify and how. I agree that we shouldn't change the roundtrip like that, altering the semantics. These formats are very common in SQL codebases and so I don't want to risk any regressions.

If all (or most) other dialects have the "lax" parsing semantics, like duckdb, we should preserve the existing %m, etc, formats and implicitly treat them as "lax". So, Spark will convert them into M, etc.

On the other hand, when parsing Spark, we should produce different canonical formats for its own MM, so that it doesn't map back to %m. This new format doesn't have to be a valid strftime format, but can be something like %mstrict, or whatever. Basically something that can help us create a bijection and preserve the roundtrip, ideally while encoding/representing the semantics in the name for clarity purposes.

Makes sense.

On how to proceed: please test against some dialects that you have access to (postgres, mysql, duckdb, clickhouse perhaps? some of these have online REPLs), and let us know for which ones you don't so we can test them ourselves.

OK, I'll try and find some time to go through my notes on this, and also test them more thoroughly in the next day or few; my looking into this before raising the PR was based on more Googling syntax/specs for different dialects than actually testing, and I really only saw this difference behavior on Spark 3+, but I can probably run some test against Ibis-supported backend dialects pretty easily.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: to_date fails to handle single-digit values in pyspark

3 participants