fix(spark): parse month, day without leading zeros by deepyaman · Pull Request #7739 · tobymao/sqlglot

deepyaman · 2026-06-11T14:27:21Z

Resolves ibis-project/ibis#12004 (happy to also raise an issue on this repo, if helpful)

In short, Spark 3+ is different from most other SQL dialects in that MM and dd will not parse single-digit months and days without leading zeros (it's very strict).

Unfortunately, this creates some round-trip complications, although it shouldn't be a big deal unless somebody is relying on Spark 3+'s inability to parse something like "2026-6-9" using "yyyy-MM-dd", and we round trip that back out to the more lenient "yyyy-M-d" and then their dates parse (like they would in just about any other SQL dialect with "yyyy-MM-dd"...).
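To make the strict/lenient distinction concrete, here is a minimal sketch. Python's `datetime.strptime` stands in for the lenient dialects, since its `%m` and `%d` accept values with or without a leading zero. The `strict_parse` helper and its regex are hypothetical; they only emulate the two-digit requirement described above for "yyyy-MM-dd" and are not how Spark implements it.

```python
import re
from datetime import datetime

# Lenient behavior: strptime accepts both padded and non-padded month/day.
lenient_padded = datetime.strptime("2026-06-09", "%Y-%m-%d")
lenient_unpadded = datetime.strptime("2026-6-9", "%Y-%m-%d")

# Hypothetical emulation of a strict "yyyy-MM-dd" pattern: exactly two
# digits for month and day. This is an illustration, not Spark's code.
STRICT_YYYY_MM_DD = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def strict_parse(text):
    """Return a datetime, or None when month/day are not zero-padded."""
    match = STRICT_YYYY_MM_DD.fullmatch(text)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    return datetime(year, month, day)
```

With these definitions, `strict_parse("2026-6-9")` returns `None` while the lenient parse of the same string succeeds, which is the mismatch this PR is about.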

Note that there are a couple tests where the result in other dialects is getting spit out like "%Y-%-m-%-d" now (instead of "%Y-%m-%d"), which reflects that leniency. I was trying to suppress that by adding some additional mappings to the Spark dialect:

diff --git a/sqlglot/dialects/spark.py b/sqlglot/dialects/spark.py
index 5ea11f6f..4e9524ee 100644
--- a/sqlglot/dialects/spark.py
+++ b/sqlglot/dialects/spark.py
@@ -17,6 +17,17 @@ class Spark(Spark2):
     ARRAY_FUNCS_PROPAGATES_NULLS = True
     EXPRESSION_METADATA = EXPRESSION_METADATA.copy()
 
+    TIME_MAPPING = {
+        **Spark2.TIME_MAPPING,
+        "M": "%m",
+        "d": "%d",
+    }
+
+    INVERSE_TIME_MAPPING = {
+        "%m": "MM",
+        "%d": "dd",
+    }
+
     LENIENT_INVERSE_TIME_MAPPING = {v: k for k, v in Spark2.TIME_MAPPING.items()} | {
         # Parse zero-padded months and days, as per strptime() behavior.
         "%m": "M",

But I've decided to get feedback before going further down that rabbit hole. 🐰

geooo109

@deepyaman thanks for the PR, nice work!

I tested this a bit. Databricks and Hive seem to have the same strict/lenient distinction as Spark3/4 here. I also tried a few examples on Spark2 and they worked for both padded and non-padded formats, so Spark 2 looks lenient (you can also verify this).

This problem is a bit tricky though. If we apply the reverse mapping, we break the round-trip (and we'd also need to cover the try_* functions, etc).

Using the code from this PR (I ran the example on Spark 4):

input spark:
SELECT TRY_TO_TIMESTAMP('2000-1-2', 'yyyy-MM-dd')
> NULL

output spark:
SELECT TRY_TO_TIMESTAMP('2000-1-2', 'yyyy-M-d')
> 2000-01-02 00:00:00

In this example ^ we end up changing the semantics of the final query: it returned NULL before, and now it returns a value.
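The format rewrite behind that example can be sketched with plain dictionaries. The two mappings below are simplified, hypothetical stand-ins for the dialect tables (sqlglot's real ones are larger), and `translate` is an illustrative helper, not a sqlglot function. The sketch only shows why writing `%m` and `%d` back out as `M` and `d` changes a Spark format on a Spark-to-Spark round trip.

```python
import re

# Simplified, hypothetical parse-side mapping (Spark token -> canonical).
SPARK_TO_CANONICAL = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "M": "%-m", "d": "%-d"}

# Lenient inverse in the spirit of this PR: %m and %d are generated as M and d.
CANONICAL_TO_SPARK_LENIENT = {"%Y": "yyyy", "%m": "M", "%d": "d", "%-m": "M", "%-d": "d"}


def translate(fmt, mapping):
    """Rewrite every known token in fmt, trying longer tokens first."""
    keys = sorted(mapping, key=len, reverse=True)
    pattern = "|".join(re.escape(key) for key in keys)
    return re.sub(pattern, lambda m: mapping[m.group(0)], fmt)


original = "yyyy-MM-dd"
canonical = translate(original, SPARK_TO_CANONICAL)
round_tripped = translate(canonical, CANONICAL_TO_SPARK_LENIENT)
```

Here `canonical` is `"%Y-%m-%d"` and `round_tripped` is `"yyyy-M-d"`, so the strict input format comes back as the lenient one.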

My suggestion would be to first check which dialects are lenient vs non-lenient for this case in general. For example, if it's only the Spark hierarchy, then we should focus just on that and investigate whether an existing mapping can hold this "lenient" logic. If none fits, we can create a "virtual" one to tackle this specific problem.

deepyaman · 2026-06-16T04:39:46Z

> @deepyaman thanks for the PR, nice work!

@geooo109 thanks for taking the time to review!

> I tested this a bit. Databricks and Hive seem to have the same strict/lenient distinction as Spark3/4 here. I also tried a few examples on Spark2 and they worked for both padded and non-padded formats, so Spark 2 looks lenient (you can also verify this).

Sorry, I'm not sure I'm following. I agree that Spark2 is lenient by default, which is why I didn't change how things worked in spark2.py. Spark 2 used the lenient SimpleDateFormat, but from Spark 3 onwards it shifted to the strict DateTimeFormatter. Databricks is aligned with Spark 3+ (so strict; some examples updated). I honestly didn't look into Hive (haven't dealt with it in years), but left it untouched.

> This problem is a bit tricky though. If we apply the reverse mapping, we break the round-trip (and we'd also need to cover the try_* functions, etc).
>
> using the code of this PR (ran the example on spark4):
>
> input spark:
> SELECT TRY_TO_TIMESTAMP('2000-1-2', 'yyyy-MM-dd')
> > NULL
>
> output spark:
> SELECT TRY_TO_TIMESTAMP('2000-1-2', 'yyyy-M-d')
> > 2000-01-02 00:00:00
>
> In this example ^ we end up changing the semantics of the final query, it returned NULL before and now it returns a value.

Yes, 100% agree, this is exactly the risk I'd called out:

> Unfortunately, this creates some round-trip complications, although it shouldn't be a big deal unless somebody is relying on Spark 3+'s inability to parse something like "2026-6-9" using "yyyy-MM-dd", and we round trip that back out to the more lenient "yyyy-M-d" and then their dates parse (like they would in just about any other SQL dialect with "yyyy-MM-dd"...).

I personally don't see it as a huge risk in the real world, as I wrote, but I called it out because it does break an important property.

> My suggestion would be to first check which dialects are lenient vs non-lenient for this case in general. For example, if it's only the Spark hierarchy, then we should focus just on that and investigate whether an existing mapping can hold this "lenient" logic. If none fits, we can create a "virtual" one to tackle this specific problem.

My understanding is what I also mentioned in the PR description:

> In short, Spark 3+ [including Databricks] is different from most other SQL dialects in that MM and dd will not parse single-digit months and days without leading zeros (it's very strict).

I think the issue is isolated to Spark 3+ (where I made the changes), but let me know if you think it's in the wrong place, or if I'm missing some action items on how to address this differently. I agree with what you said, but I don't understand what (if anything) to change, so I might need a bit more guidance on next steps. :)

georgesittas · 2026-06-16T06:07:28Z

I think the suggestion for now is to test how other dialects behave in practice, to figure out what dialects we should modify and how. I agree that we shouldn't change the roundtrip like that, altering the semantics. These formats are very common in SQL codebases and so I don't want to risk any regressions.

If all (or most) other dialects have the "lax" parsing semantics, like duckdb, we should preserve the existing %m format and implicitly treat it as "lax". So, Spark will convert it into M. Similarly for other formats like %d.

On the other hand, when parsing Spark, we should produce a different canonical format for its own MM, so that it doesn't map back to %m. This new format doesn't have to be a valid strftime format, but can be something like %mstrict, or whatever. Basically something that can help us create a bijection and preserve the roundtrip, while encoding/representing the semantics in the name for clarity purposes.
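One way to picture this proposal is a pair of mappings that are exact inverses of each other. Everything below is a hedged sketch: `%mstrict` is the placeholder name floated above, `%dstrict` is an assumed counterpart for days, the choice to map Spark's `M`/`d` onto the lax `%m`/`%d` is likewise an assumption, and `translate` is an illustrative helper rather than sqlglot API.

```python
import re

# Hypothetical parse-side mapping: Spark's strict MM/dd get their own
# canonical tokens, while the lax M/d share %m/%d with lenient dialects.
SPARK_TO_CANONICAL = {
    "yyyy": "%Y",
    "MM": "%mstrict",  # placeholder name from the discussion
    "dd": "%dstrict",  # assumed counterpart, not named in the thread
    "M": "%m",
    "d": "%d",
}

# Every canonical token maps back to exactly one Spark token.
CANONICAL_TO_SPARK = {canon: spark for spark, canon in SPARK_TO_CANONICAL.items()}


def translate(fmt, mapping):
    """Rewrite every known token in fmt, trying longer tokens first."""
    keys = sorted(mapping, key=len, reverse=True)
    pattern = "|".join(re.escape(key) for key in keys)
    return re.sub(pattern, lambda m: mapping[m.group(0)], fmt)


strict_canonical = translate("yyyy-MM-dd", SPARK_TO_CANONICAL)
strict_round_trip = translate(strict_canonical, CANONICAL_TO_SPARK)
from_lenient_dialect = translate("%Y-%m-%d", CANONICAL_TO_SPARK)
```

Under these assumptions a Spark `yyyy-MM-dd` survives the round trip unchanged, while a `%Y-%m-%d` coming from a lenient dialect is generated as `yyyy-M-d`. How a lenient target dialect should render `%mstrict` is left open here, as it is in the thread.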

On how to proceed: please test against some dialects that you have access to (postgres, mysql, duckdb, clickhouse perhaps? some of these have online REPLs), and let us know for which ones you don't so we can test them ourselves.

deepyaman · 2026-06-16T06:17:33Z

> I think the suggestion for now is to test how other dialects behave in practice, to figure out what dialects we should modify and how. I agree that we shouldn't change the roundtrip like that, altering the semantics. These formats are very common in SQL codebases and so I don't want to risk any regressions.
>
> If all (or most) other dialects have the "lax" parsing semantics, like duckdb, we should preserve the existing %m, etc, formats and implicitly treat them as "lax". So, Spark will convert them into M, etc.
>
> On the other hand, when parsing Spark, we should produce different canonical formats for its own MM, so that it doesn't map back to %m. This new format doesn't have to be a valid strftime format, but can be something like %mstrict, or whatever. Basically something that can help us create a bijection and preserve the roundtrip, ideally while encoding/representing the semantics in the name for clarity purposes.

Makes sense.

> On how to proceed: please test against some dialects that you have access to (postgres, mysql, duckdb, clickhouse perhaps? some of these have online REPLs), and let us know for which ones you don't so we can test them ourselves.

OK, I'll try to find some time to go through my notes on this and test the dialects more thoroughly in the next day or so. My investigation before raising the PR was based more on Googling syntax/specs for different dialects than on actually testing, and I really only saw this difference in behavior on Spark 3+, but I can probably run some tests against Ibis-supported backend dialects pretty easily.

deepyaman added 2 commits June 10, 2026 17:05

fix(spark): parse month, day without leading zeros

4806819

fix(spark): also fix generation of exp.StrToTime

4c2e16b

deepyaman mentioned this pull request Jun 11, 2026

fix(backends): handle single-digit month/day in pyspark to_date ibis-project/ibis#12011

Open

chore(spark): narrow type to fix mypy attr-defined

af89ba3

deepyaman marked this pull request as ready for review June 11, 2026 15:59

georgesittas requested a review from geooo109 June 15, 2026 11:44

georgesittas assigned geooo109 Jun 15, 2026

geooo109 reviewed Jun 15, 2026


deepyaman mentioned this pull request Jun 16, 2026

test: add single-digit month/day date/timestamp cast tests (#12004) ibis-project/ibis#12021

Draft

deepyaman wants to merge 3 commits into tobymao:main from deepyaman:patch-1
