Support custom types #5254

Merged
penberg merged 64 commits into tursodatabase:main from
glommer:custom-types
Feb 24, 2026

@glommer
Contributor

@glommer glommer commented Feb 14, 2026

Description

This PR adds custom types for SQLite. Custom types can only be used with STRICT tables. Users can define their own custom types by supplying SQLite expressions, and we seed the table with some initial built-in ones.

This PR also introduces a new fuzzer that checks that the expressions we generate stay valid in SQL constructs such as ORDER BY, indexes, etc.

I can now run this fuzzer for hours without issues.

Note that this depends on PR #5207, which is included here.

Motivation and context

We live in 2026. Types won.

Description of AI Usage

This was over a week of Claude Coding. I focused a lot on validation, and aside from the fuzzer loop, used reviewing agents extensively.


@turso-bot turso-bot bot left a comment

Please review @pereman2

@glommer
Contributor Author

glommer commented Feb 15, 2026

Update for maintainers: I am currently going through a list of bugs found by @LeMikaelF's clanker.

Will update the PR as I make progress.

@glommer
Contributor Author

glommer commented Feb 23, 2026

too many conflicts so I stashed everything into one commit (FYI @penberg )

glommer and others added 27 commits February 24, 2026 07:17
Per the manual, CAST(x AS custom_type) should produce the stored
(encoded) representation. Previously it applied both encode and decode,
making it a no-op for symmetric encode/decode pairs like cents
(value*100 / value/100 = identity).

Now CAST(42 AS cents) correctly returns 4200 (the encoded form),
and CAST('hello' AS reversed) returns 'olleh'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
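
The CAST change in the commit above can be modeled with a minimal Python sketch. The `CustomType` class and the `cast_old`/`cast_new` helpers are hypothetical names used only for illustration. They mimic the described behavior and are not the engine's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class CustomType:
    """Hypothetical model of a custom type: an ENCODE/DECODE pair."""
    name: str
    encode: Callable[[Any], Any]
    decode: Callable[[Any], Any]


# Illustrative stand-ins for the 'cents' and 'reversed' types named in the commit.
cents = CustomType("cents", encode=lambda v: v * 100, decode=lambda v: v // 100)
reversed_text = CustomType("reversed", encode=lambda s: s[::-1], decode=lambda s: s[::-1])


def cast_old(value: Any, ty: CustomType) -> Any:
    # Previous behavior: encode followed by decode, an identity for symmetric pairs.
    return ty.decode(ty.encode(value))


def cast_new(value: Any, ty: CustomType) -> Any:
    # Behavior after the fix: CAST yields the stored (encoded) representation.
    return ty.encode(value)
```

Under this model, `cast_old(42, cents)` gives back 42, while `cast_new(42, cents)` gives 4200 and `cast_new('hello', reversed_text)` gives 'olleh', matching the values quoted in the commit message.
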
Custom type names like "doubled" contain substrings ("DOUB") that
SQLite's name-based rules map to REAL affinity. This caused integer
values to be stored as floats and returned with wrong typeof().

Fix: when resolving custom type columns, override both the Column's
Type bits and base affinity bits with values derived from the BASE
type name. This is done in handle_schema_row (for ParseSchema) and
resolve_all_custom_type_affinities (for schema reparse after loading
types from __turso_internal_types).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
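
As a rough illustration of the affinity problem in the commit above, the following Python sketch applies SQLite's documented substring rules to a declared type name, then shows a lookup that prefers the BASE type. The `resolved_affinity` function and its `custom_base_types` dictionary are hypothetical and only approximate the override the commit describes.

```python
def name_based_affinity(declared_type: str) -> str:
    """SQLite's column-affinity rules, applied in order to the declared type name."""
    t = declared_type.upper()
    if "INT" in t:
        return "INTEGER"
    if any(s in t for s in ("CHAR", "CLOB", "TEXT")):
        return "TEXT"
    if "BLOB" in t or t == "":
        return "BLOB"
    if any(s in t for s in ("REAL", "FLOA", "DOUB")):
        return "REAL"
    return "NUMERIC"


def resolved_affinity(declared_type: str, custom_base_types: dict) -> str:
    # Hypothetical registry lookup: a custom type takes its affinity from
    # its BASE type name instead of from its own name.
    base = custom_base_types.get(declared_type.lower())
    return name_based_affinity(base if base is not None else declared_type)
```

With these definitions, `name_based_affinity("doubled")` is `"REAL"` because the name contains "DOUB", whereas `resolved_affinity("doubled", {"doubled": "integer"})` is `"INTEGER"`.
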
Introduce emit_user_facing_column_value() — a single abstraction that
converts a stored column value to its user-facing form. For custom type
columns this applies the DECODE function; for regular columns it is a
plain copy. Both SELECT and RETURNING now go through this helper.

Previously RETURNING mapped column references directly to the write
registers which hold encoded storage values, so INSERT/UPDATE/DELETE
RETURNING showed raw encoded values (e.g. 4200 instead of 42 for a
cents type).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When reopening a database, make_from_btree loads table definitions from
the sqlite_schema btree but does not read the contents of
__turso_internal_types. This meant the type_registry was empty after
reopen, causing SELECT to return raw encoded values and PRAGMA
list_types to omit user-defined types.

Extract shared helpers Schema::load_type_definitions() and
Connection::query_stored_type_definitions() so both the initial open
path (lib.rs) and schema reparse path (connection.rs) load custom
types through the same code.

Add integration tests for reopen, schema change after reopen, and
new connection visibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The optimizer builds ephemeral auto-indexes for the inner table in joins.
These indexes store raw encoded column values. When building the seek key,
encode_seek_keys_for_custom_types re-encodes the decoded seek key to match
the index contents. However, for aliased tables (e.g. FROM t1 a, t1 b),
this function searched for the table by identifier ("a"/"b") while the
ephemeral index stored the base table name ("t1"). The lookup failed
silently, skipping the encode step, so the decoded seek key (e.g. 10)
could never match the encoded index value (e.g. 1000).

Fix: add find_table_by_table_name() that searches by the underlying table
name rather than the alias, and use it as a fallback when the identifier
lookup fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
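
The alias fallback described in the commit above might look roughly like this in Python. `TableRef`, `find_table_by_identifier`, and `resolve_index_table` are hypothetical names; only `find_table_by_table_name` is taken from the commit message, and its real signature is not shown there.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TableRef:
    identifier: str   # alias as written in the query, e.g. "a"
    table_name: str   # underlying table, e.g. "t1"


def find_table_by_identifier(tables: List[TableRef], name: str) -> Optional[TableRef]:
    return next((t for t in tables if t.identifier == name), None)


def find_table_by_table_name(tables: List[TableRef], name: str) -> Optional[TableRef]:
    return next((t for t in tables if t.table_name == name), None)


def resolve_index_table(tables: List[TableRef], name: str) -> Optional[TableRef]:
    # Try the identifier first, then fall back to the underlying table name.
    return find_table_by_identifier(tables, name) or find_table_by_table_name(tables, name)
```

For `FROM t1 a, t1 b`, looking up "t1" by identifier alone finds nothing, which corresponds to the silent failure; the fallback finds the table so the seek key can be encoded.
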
The existing fuzzer tested cross-table joins (t1 JOIN t2) but not
self-joins (t1 a JOIN t1 b). This gap meant the auto-index alias
lookup bug fixed in bdc308e would not have been caught by the fuzzer.

Add pattern 39 that performs a self-join on t1 custom type column
and verifies both the join condition (a.val == b.val) and that the
result count is at least t1_rows (each row matches at least itself).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In the UPSERT DO UPDATE path, excluded.column was reading from the
already-encoded insertion registers. When the UPSERT then encoded the
SET columns again, values were double-encoded (e.g. 50 became 5000
for a cents type with ENCODE value * 100).

Similarly, the WHERE clause in DO UPDATE was comparing against encoded
values from disk (e.g. WHERE t1.amount < 20 evaluated 1000 < 20
instead of 10 < 20).

The fix:
- Create decoded copies of current_start registers for WHERE/SET
  expressions (current_start itself stays encoded for trigger OLD
  registers)
- Create decoded copies of excluded (insertion) registers so
  excluded.column references see user-facing values
- Decode new_start in-place (was copied from encoded current_start)
- Encode ALL columns in new_start before writing (not just SET
  columns), since non-SET columns are now decoded too

Add emit_custom_type_decode_columns helper (mirrors the existing
encode helper) and extend rewrite_expr_to_registers to accept
decoded excluded register base for proper column resolution.

Fixes Bugs 7, 13, 15, 22, 27 from the custom types bug list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SQLite already has special treatment for boolean: it accepts true/false
as literals that map to 1/0, and CAST(1 AS boolean) returns 1. Our
int_to_boolean DECODE function was returning text 'true'/'false', which
diverged from SQLite behavior and caused CAST(1 AS boolean) to break
after the encode-only CAST fix.

Change DECODE to `CASE WHEN value THEN 1 ELSE 0 END` so boolean
columns display as 0/1, matching SQLite semantics. The ENCODE function
(boolean_to_int) is unchanged and still validates/normalizes user input.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Disable constant hoisting for SET expressions on columns with custom
type encode functions. The encode is applied in-place to the target
register inside the update loop, so a hoisted constant would be
encoded repeatedly on each iteration (99 → 9900 → 990000 → ...).

We disable hoisting rather than working around it because:
1. Encode functions may be non-deterministic (e.g. datetime('now'))
2. Even for deterministic encodes, hoisting the pre-encode value and
   encoding in-place causes progressive double-encoding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
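
The progressive double-encoding described in the commit above can be reproduced with a small Python model. The register and loop here are a simplification; `encode_cents` and both update functions are hypothetical and stand in for the generated bytecode.

```python
from typing import List


def encode_cents(value: int) -> int:
    return value * 100


def update_with_hoisting(row_count: int, constant: int) -> List[int]:
    # The constant is evaluated once, outside the loop, into a shared register.
    register = constant
    written = []
    for _ in range(row_count):
        register = encode_cents(register)  # in-place encode of the hoisted register
        written.append(register)
    return written


def update_without_hoisting(row_count: int, constant: int) -> List[int]:
    written = []
    for _ in range(row_count):
        register = constant                # re-evaluated on every iteration
        register = encode_cents(register)
        written.append(register)
    return written
```

With three rows and the constant 99, the hoisted variant writes 9900, 990000, 99000000, while the non-hoisted variant writes 9900 for every row.
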
In STRICT tables, CHECK constraint comparisons are now type-checked at
CREATE TABLE time. Every comparison operand must have a determinable,
compatible type. If the type cannot be determined (e.g. function calls),
the user must use an explicit CAST.

This prevents Bug 10 (CHECK constraints seeing encoded custom type values)
by rejecting the problematic pattern entirely: comparing a custom type
column against a raw literal is a type error. The user must write
CHECK(amount < CAST(50 AS cents)) instead of CHECK(amount < 50).

Type compatibility rules:
- INTEGER and REAL are mutually compatible (numeric)
- TEXT only with TEXT, BLOB only with BLOB
- ANY compatible with everything, NULL compatible with everything
- Custom types only compatible with the same custom type
- Function calls require CAST (return type unknown)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
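
The compatibility rules listed in the commit above can be sketched as a single predicate. This is an illustrative Python model, not the checker's code: the `compatible` function is hypothetical, `None` is used here to mean "type could not be determined", and the order in which the rules are applied (ANY and NULL before the custom type rule) is an assumption.

```python
from typing import Optional

NUMERIC = {"INTEGER", "REAL"}
WILDCARD = {"ANY", "NULL"}


def compatible(left: Optional[str], right: Optional[str]) -> bool:
    """Operands are built-in type names in upper case, custom type names,
    or None when the type is unknown (e.g. a function call without CAST)."""
    if left is None or right is None:
        return False  # undeterminable type: an explicit CAST is required
    if left in WILDCARD or right in WILDCARD:
        return True   # ANY and NULL are compatible with everything
    if left in NUMERIC and right in NUMERIC:
        return True   # INTEGER and REAL are mutually compatible
    return left == right  # TEXT/TEXT, BLOB/BLOB, same custom type
```

In this model `CHECK(amount < 50)` on a cents column corresponds to `compatible("cents", "INTEGER")`, which is rejected, while `CHECK(amount < CAST(50 AS cents))` corresponds to `compatible("cents", "cents")`, which is accepted.
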
VACUUM INTO failed with three separate errors when the source database
had custom types: (a) creating __turso_internal_types in the dest was
rejected by the reserved-name check, (b) INSERT INTO it was rejected
by may-not-be-modified, and (c) CREATE TABLE for STRICT tables failed
because custom types were not registered in the dest connection schema.

Fixed by temporarily marking the dest connection as nested during
prepare() for internal tables and copying the source type registry
into the dest after creating the types table.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CREATE TYPE allowed names like INTEGER, TEXT, REAL, BLOB, ANY, and INT,
which shadow the column type system and create undroppable types. Now
rejected at CREATE TYPE time with a clear error message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Foreign key CASCADE and SET NULL actions failed with custom types because
the key registers used for parent index probes contained decoded values
while the index stored encoded values. Add decode_fk_key_registers helper
that decodes FK key registers before comparison, applied in delete actions,
update actions, and drop table checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Custom type columns in STRICT tables bypassed TypeCheck because encoding
ran before validation, silently converting wrong-type inputs (e.g.
'hello' * 100 = 0 for a cents column). Add typed parameter syntax to
CREATE TYPE so types can declare the expected input type for the value
parameter, e.g. CREATE TYPE cents(value integer) BASE integer.

A pre-encode TypeCheck now validates user input against the declared
value type before encoding runs. The existing post-encode TypeCheck
remains to validate encoded output against the BASE storage type.

Updated all 13 built-in types with typed params (uuid expects text,
boolean expects any, numeric expects any, etc.). Backward compatible:
old untyped params default to any.

Also moves child-side FK checks in UPDATE to after encoding so that
new values probed against parent indexes are properly encoded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop the type parameter from OPERATOR syntax (now OPERATOR '+' func_name
instead of OPERATOR '+' (type) -> func_name). Old syntax still parses
for backward compatibility.

Operators now only fire when both operands are the same custom type, or
when one is a custom type column and the other is a compatible literal
(matching the type's value parameter). Literals are encoded before being
passed to the operator function so both args are in the same form.

Fixes three bugs: operators firing for wrong types, register clobbering
when encoding literals across loop iterations, and reversed comparisons
when the literal appears on the LHS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
INSERT and UPDATE paths computed expression index values from encoded
(storage) column registers, while SELECT/DELETE computed from decoded
(user-facing) values. This mismatch caused index lookups to fail and
DELETE to corrupt the database with "IdxDelete: no matching index entry
found" errors.

Add decode_custom_type_registers_in_expr() that walks rewritten
expression trees, decodes custom type Expr::Register nodes into
temporary registers before evaluation, ensuring all paths produce
consistent index keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NOT NULL was checked on the encoded (stored) value, allowing "ghost NULLs"
where ENCODE produces a non-NULL value but DECODE returns NULL. The user
would see NULL in a NOT NULL column, violating the constraint contract.

Now emit_notnulls() decodes custom type values into a temporary register
before the NULL check, ensuring the user-facing value is verified.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
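
A "ghost NULL" as described in the commit above can be shown with a toy type whose DECODE maps a stored value to NULL. The two check functions and the `zero_is_null` pair below are hypothetical and only model the before/after behavior.

```python
from typing import Any, Callable, Optional


def not_null_check_old(user_value: Any, encode: Callable, decode: Callable) -> bool:
    # Previous behavior: only the encoded (stored) value was checked.
    return encode(user_value) is not None


def not_null_check_new(user_value: Any, encode: Callable, decode: Callable) -> bool:
    # New behavior: decode into a temporary and check the user-facing value.
    stored = encode(user_value)
    return decode(stored) is not None


# Hypothetical type: stores integers as-is but displays 0 as NULL.
def zero_is_null_encode(value: int) -> int:
    return value


def zero_is_null_decode(stored: int) -> Optional[int]:
    return None if stored == 0 else stored
```

For the input 0, the old check passes even though the user would later read NULL from the column; the new check rejects it.
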
Non-deterministic functions like random(), changes(), and
last_insert_rowid() in ENCODE expressions produce different stored
values for the same user input, breaking UNIQUE constraints, equality
lookups, index seeks, and JOIN matching.

The existing validate_type_expr() now checks is_deterministic() on
resolved built-in functions and rejects them at CREATE TYPE time.
External (extension) functions are excluded from the check since they
default to non-deterministic but may be deterministic (e.g. uuid_blob).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ALTER TABLE ADD COLUMN with NOT NULL rejected the operation when the
column had no explicit DEFAULT, even if the custom type defined one.
The type-level DEFAULT was not consulted for either the NOT NULL
feasibility check or the actual column default value.

Now the NOT NULL check considers the type-level DEFAULT, and for
NOT NULL columns without an explicit DEFAULT, the type default is
propagated to the column definition so existing rows use it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
INSERT INTO t DEFAULT VALUES was not consulting type-level defaults
defined via CREATE TYPE ... DEFAULT expr. When a column had no explicit
column-level DEFAULT but its custom type had a DEFAULT clause, the
DEFAULT VALUES path produced NULL instead of the type default value.

This was inconsistent with column-list INSERT (e.g. INSERT INTO t(id)
VALUES (1)) which correctly applied type-level defaults for omitted
columns.

Fixed both DefaultValues code paths in bind_insert() and
init_source_emission() to check type-level defaults when no column
default exists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… NULL

The built-in date, time, and timestamp types used bare date()/time()/
datetime() as their ENCODE expressions. These SQLite functions return
NULL for invalid input rather than raising an error, which meant
invalid values like 'not-a-date' were silently converted to NULL on
STRICT tables — a data integrity violation.

Changed the ENCODE expressions to wrap the function calls in a CASE
that checks for NULL output (when input was non-NULL) and raises an
ABORT error with a descriptive message. NULL input is still passed
through unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
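
The behavior the commit above guards against can be observed with Python's standard `sqlite3` module, since SQLite's `date()` returns NULL for invalid input. The `encode_date` wrapper is a hypothetical stand-in for the new ENCODE expression: it does the NULL check in Python rather than in a SQL CASE, and the error text is made up for this sketch.

```python
import sqlite3
from typing import Optional


def encode_date(value: Optional[str]) -> Optional[str]:
    """NULL input passes through; input that date() cannot parse raises."""
    if value is None:
        return None
    conn = sqlite3.connect(":memory:")
    try:
        (encoded,) = conn.execute("SELECT date(?)", (value,)).fetchone()
    finally:
        conn.close()
    if encoded is None:
        # Hypothetical message; the actual error text is not quoted in the commit.
        raise ValueError(f"invalid date value: {value!r}")
    return encoded
```

Calling `encode_date("not-a-date")` raises instead of silently storing NULL, while `encode_date(None)` still returns `None`.
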
…e DDL

.dump was emitting CREATE TABLE __turso_internal_types(...) and INSERT
statements for the internal metadata table. When this output was fed
back into a fresh database, it failed with 'Object name reserved for
internal use' because direct creation of __turso_internal_types is
blocked.

Now .dump emits the original CREATE TYPE statements (extracted from
the sql column of __turso_internal_types) before any table DDL. The
internal metadata table itself is skipped in the table dump loop.
This produces a clean, restorable SQL dump that recreates custom types
via the proper CREATE TYPE syntax.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In the UPSERT DO UPDATE path, BEFORE UPDATE triggers received NEW
registers that were already encoded (post custom type encoding). The
trigger context was created with new_encoded=false, so fire_trigger's
decode_trigger_registers skipped decoding. This caused NEW.column
references in trigger bodies to show raw encoded values (e.g. 2000
instead of 20 for a cents type with ENCODE value * 100).

Fixed by using new_after_with_override_conflict (which sets
new_encoded=true) instead of new_with_override_conflict for the
BEFORE trigger context in the UPSERT path. This matches the actual
state of the registers at that point and lets the existing decode
logic in fire_trigger handle decoding correctly.

Also updated pragma-list-types test expectations to reflect the new
date/time/timestamp ENCODE expressions from the Bug 30 fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… OPERATOR '<'

Custom types with encode/decode but no OPERATOR '<' were silently producing
wrong ORDER BY results — the sorter compared encoded (on-disk) values, which
may have completely different ordering than the user expects. This commit makes
the behavior explicit and safe.

Key changes:

1. **Naked OPERATOR '<' syntax**: Types can now declare `OPERATOR '<'` without
   a function name, meaning "use the base type's built-in comparison on encoded
   values." With a function name (`OPERATOR '<' func`), the comparator transforms
   encoded values before comparing.

2. **Block ORDER BY** on custom type columns without any '<' operator, with a
   clear error: "cannot ORDER BY column 'X' of type 'Y': type does not declare
   OPERATOR '<'".

3. **Block CREATE INDEX** on non-orderable custom type columns (expression
   indexes like `CREATE INDEX idx ON t(length(val))` remain allowed).

4. **Add OPERATOR '<' to built-in types** where base type comparison on encoded
   values produces correct ordering: date, time, timestamp, varchar, smallint,
   boolean, uuid, bytea. Types without meaningful ordering (json, jsonb, inet)
   are left without '<'.

5. **Sort keys are always encoded values**: Sorting operates on the encoded
   (on-disk) representation, never decoded values. DECODE is purely a
   presentation layer. For deduplicated columns (where the sort key IS the
   result column), DECODE is applied after sorting for display.

6. **Replace test_reverse_encode/decode with string_reverse**: The two identical
   test functions that just reversed strings were consolidated into a single
   `string_reverse` scalar function, which is genuinely useful as both a
   function and a sort comparator.

Parser changes (parser/src/ast.rs, parser.rs, ast/fmt.rs):
- TypeOperator.func_name changed from String to Option<String>
- Parser accepts three syntaxes: naked, named, and old (type)->func
- SQL serialization conditionally emits function name

Core changes (schema.rs, order_by.rs, index.rs, expr.rs, execute.rs):
- Built-in type definitions updated with OPERATOR '<'
- ORDER BY validation added in init_order_by
- CREATE INDEX validation added in translate_create_index
- Naked operators fall through to standard comparison in expr dispatch
- string_reverse added as both scalar function and sort comparator
- All custom type sort keys suppress decode (not just types without '<')
- Post-sort decode restored for deduplicated columns

Tests: 27 new ordering tests (custom_type_ordering.sqltest) covering error
cases, naked '<' with identity/monotonic/non-monotonic encodings, custom
comparators, built-in types, and index verification (each sort test duplicated
with/without index to ensure identical results).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
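
The "sort keys are always encoded values" rule from the commit above can be illustrated with a short Python model. The `order_by` function is hypothetical, the error text is abbreviated from the message quoted in the commit, and string reversal is used as an example of a non-monotonic encoding.

```python
from typing import Any, Callable, List


def order_by(values: List[Any], encode: Callable, decode: Callable,
             has_lt_operator: bool = True) -> List[Any]:
    if not has_lt_operator:
        # Abbreviated form of the error described in the commit.
        raise ValueError("type does not declare OPERATOR '<'")
    stored = sorted(encode(v) for v in values)  # sort on the encoded form
    return [decode(s) for s in stored]          # decode only for display


def reverse(s: str) -> str:
    return s[::-1]


def identity(s: str) -> str:
    return s
```

With the reversing encoding, `order_by(["ab", "ba", "ca"], reverse, reverse)` yields `["ba", "ca", "ab"]`, because the stored strings "ab", "ac", "ba" are what gets compared. With an identity encoding the same input comes back as `["ab", "ba", "ca"]`. This is why a type has to opt in with OPERATOR '<' before ORDER BY is allowed.
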
Custom types (CREATE TYPE, DROP TYPE, built-in types, sqlite_turso_types
virtual table) were unconditionally available even without the
--experimental-strict flag. Since custom types only work with STRICT
tables, they should be gated behind the same flag.

Changes:
- Schema::with_options(enable_strict) conditionally bootstraps built-in
  custom types and registers the sqlite_turso_types virtual table
- CREATE TYPE and DROP TYPE return a clear error when strict is disabled
- PRAGMA list_types shows only base types (INTEGER, REAL, TEXT, BLOB,
  ANY) when strict is disabled; shows all types when enabled
- Custom type loading from __turso_internal_types during database open
  and schema refresh is skipped when strict is disabled
- Updated documentation to note the --experimental-strict requirement

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Decouple custom types from --experimental-strict: a new
--experimental-custom-types flag independently gates CREATE TYPE /
DROP TYPE. Fix post-rebase compilation errors from upstream API
changes. Fix affinity bug where affinity_with_strict() used
name-based affinity instead of respecting base affinity override
for custom type columns. Add @requires-file custom_types annotations
to all custom type test files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two changes in the custom-types work affect snapshot bytecode:

1. RETURNING clause (all 7 returning snapshots): emit_returning_results()
   now calls emit_user_facing_column_value() which allocates fresh
   registers and emits Copy instructions so custom-type columns can be
   decoded before being returned. For regular columns this is a no-op
   copy but the instructions are still emitted.

2. Integrity check table order (2 multi-table snapshots): adding
   sqlite_turso_types to the Schema tables HashMap changes the
   HashMap iteration order used by integrity_check when enumerating
   tables, resulting in t1 being visited before t2 instead of after.
   The check is correct regardless of order.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@penberg penberg merged commit dc4916f into tursodatabase:main Feb 24, 2026
88 checks passed