Skip to content

[Bug] FDW OPTIONS 'encoding' silently parses non-numeric names (e.g. 'UTF8') as SQL_ASCII via atoi() #1726

@talmacschen-arch

Description

@talmacschen-arch

Apache Cloudberry version

main (also reproduces on 1.x). Verified against current main at commit ef0b6fb3462.

What happened

When defining a foreign table via the FDW path with a symbolic encoding name in OPTIONS:

CREATE FOREIGN TABLE ext_t (c text)
SERVER gp_exttable_server
OPTIONS (
    formatter     'pxfwritable_import',
    format        'custom',
    format_type   'b',
    location_uris 'pxf://t?PROFILE=jdbc&SERVER=s',
    encoding      'UTF8'
);

the DDL succeeds, but the encoding actually stored on the foreign table is SQL_ASCII (0), not UTF8 (PG_UTF8 = 6).

Root cause — both the FDW catalog reader and the option validator parse encoding with atoi():

  • src/backend/access/external/external.c:280
    if (pg_strcasecmp(def->defname, "encoding") == 0)
    {
        extentry->encoding = atoi(defGetString(def));   /* 'UTF8' → 0 */
        ...
    }
  • gpcontrib/gp_exttable_fdw/option.c:136-143
    else if (pg_strcasecmp(def->defname, "encoding") == 0)
    {
        char *encoding = (char *) defGetString(def);
        if (!PG_VALID_ENCODING(atoi(encoding)))         /* atoi('UTF8') == 0 == SQL_ASCII, passes */
            ereport(ERROR, ...);
    }

atoi("UTF8") returns 0, and PG_VALID_ENCODING(0) is true (PG_SQL_ASCII), so validation silently passes. Anyone who writes encoding 'UTF8', encoding 'utf-8', encoding 'GBK', etc. ends up with SQL_ASCII.

Confirm with:

SELECT ftoptions FROM pg_foreign_table WHERE ftrelid = 'ext_t'::regclass;
-- {... ,encoding=UTF8, ...}     -- stored verbatim; atoi() at read time → 0

Downstream consequences:

  1. On the PXF binary path (format_type='b'), the PXF server emits UTF-8 bytes via String.getBytes(StandardCharsets.UTF_8). The segment's gpdbwritable formatter checks the foreign-table encoding and rejects anything other than UTF-8, surfacing as:

    gpdbwritable formatter can only import UTF8 formatted data. Define the external table with ENCODING UTF8...

    The error message is misleading because the user's DDL does say ENCODING UTF8 — but it has been silently turned into SQL_ASCII by atoi().

  2. On other formats and protocols, the misconfiguration is silent: multi-byte text gets stored as opaque bytes, and length() / substring() / LIKE / downstream re-encoding misbehave only when CJK or accented characters appear.

By contrast, the legacy CREATE EXTERNAL TABLE ... ENCODING 'UTF8' path is fine because src/backend/commands/exttablecmds.c:209-216 resolves the name with pg_char_to_encoding() before persisting it as a numeric string into OPTIONS. Only the FDW OPTIONS entry point bypasses that translation.

What you think should happen instead

encoding in FDW OPTIONS should accept the same forms accepted everywhere else in Postgres / Cloudberry — symbolic names (UTF8, GBK, LATIN1, …) and numeric IDs — and reject everything else with a clear error. Symbolic names should be resolved via pg_char_to_encoding(), just like the legacy CREATE EXTERNAL TABLE path.

Specifically:

OPTIONS spelling Today Expected
encoding '6' UTF8 ✅ UTF8 ✅ (unchanged)
encoding 'UTF8' SQL_ASCII ❌ (silent) UTF8 ✅
encoding 'utf-8' SQL_ASCII ❌ (silent) UTF8 ✅
encoding 'GBK' SQL_ASCII ❌ (silent) GBK ✅
encoding 'bogus' SQL_ASCII ❌ (silent) ERROR — unrecognized encoding ✅

Implementation note (FDW validator interface limitation)

The legacy CREATE EXTERNAL TABLE path achieves write-time normalization via a custom dispatch (DefineExternalRelation, src/backend/commands/exttablecmds.c:76) that pre-resolves the encoding and writes a numeric string into stmt->options before calling CreateForeignTable. The standard FDW validator contract (src/backend/commands/foreigncmds.c:121-216 transformGenericOptions — the array to be persisted is built at line 190 before the validator is called at line 202-212, and the validator returns void) does not permit option mutation. So the FDW path needs a different injection point than just the validator. Two implementations are viable:

  • Write-time normalization, either via a ProcessUtility_hook registered by gp_exttable_fdw or via a small site-specific branch in CreateForeignTable keyed on the gp_exttable_fdw validator. The read path (external.c:280) keeps its atoi. Existing rows with encoding 'UTF8' retain SQL_ASCII semantics on upgrade and only get fixed when the user re-issues ALTER … OPTIONS (SET encoding 'UTF8') — i.e. no silent meaning change on upgrade, which is the safer behavior. pg_dump output is also stabilized to numeric form.
  • Read-side name-awareness in GetExtFromForeignTableOptions (external.c:280) and the validator (option.c:139-142). Simpler (~30 LOC), but existing rows that today silently render as SQL_ASCII would start being interpreted as actual UTF8 on upgrade — this must be called out in release notes.

Update (2026-05-09): direction settled — going with the second shape.

After implementing the first shape (write-time normalization via ProcessUtility_hook registered in gp_exttable_fdw's _PG_init), the author hit a load-order issue that makes it unworkable without changing deployment expectations:

  • The hook is installed by _PG_init, which only runs when the extension's shared library is dlopen-ed.
  • That happens lazily, when one of the extension's PG_FUNCTION_INFO_V1 functions is first invoked. None of them is called during connection setup — the first to fire is the validator (gp_exttable_permission_check), called from inside transformGenericOptions (src/backend/commands/foreigncmds.c:202-212) on the user's first CREATE FOREIGN TABLE.
  • By the time the validator triggers dlopen and _PG_init registers our ProcessUtility_hook, the hook check for the current statement has already been passed (it happens at the top of ProcessUtility). So the very first CREATE FOREIGN TABLE … encoding 'UTF8' of every fresh backend would still bypass the hook — the exact reproduction documented at the top of this issue.
  • Subsequent statements in the same session would work, but that defeats the goal: the bug bites at table-creation time, not later.

The standard cure in Cloudberry / PG (which pg_hint_plan follows — every one of its tests starts with LOAD 'pg_hint_plan'; and its README requires session_preload_libraries) is to require the extension to be preloaded. That changes the deployment story for gp_exttable_fdw, which today does not need preloading, and was therefore not chosen here.

The PR will take the second shape: read-side name-awareness in both GetExtFromForeignTableOptions (external.c:280) and the validator (option.c:139-142), via a small shared helper parse_fdw_encoding_option(const char *) that resolves names through pg_char_to_encoding() and falls back to a strict numeric form (no atoi, since atoi("UTF8")==0 is the bug). On-disk values stay as the user wrote them; correctness is established at read time.

Acknowledged trade-off (called out as a release-note item in the PR): existing rows whose ftoptions already contains a literal encoding=UTF8 and which therefore silently render as SQL_ASCII today will start being interpreted as the requested encoding on upgrade. Since such rows are already misconfigured, "upgrade fixes them" is the desired behavior for nearly all affected users — but it is a behavior change. A detection query will ship in the PR body so operators who, for any reason, want to pin a specific table to its old (broken) SQL_ASCII semantics can re-issue ALTER FOREIGN TABLE … OPTIONS (SET encoding '<numeric>') before upgrade.

How to reproduce

CREATE EXTENSION IF NOT EXISTS gp_exttable_fdw;

CREATE FOREIGN TABLE ext_utf8 (c text)
SERVER gp_exttable_server
OPTIONS (
    format        'text',
    format_type   't',
    location_uris 'file:///tmp/x.txt',
    encoding      'UTF8'
);

-- Bug: encoding is stored verbatim and atoi('UTF8')==0 → SQL_ASCII
SELECT ftoptions FROM pg_foreign_table WHERE ftrelid = 'ext_utf8'::regclass;
SELECT e.encoding, pg_encoding_to_char(e.encoding)
FROM   pg_exttable e
JOIN   pg_class c ON c.oid = e.reloid
WHERE  c.relname = 'ext_utf8';
--  encoding | pg_encoding_to_char
-- ----------+---------------------
--         0 | SQL_ASCII

-- Causal chain made explicit:
--   user wrote     :  encoding 'UTF8'
--   stored in      :  pg_foreign_table.ftoptions  →  encoding=UTF8   (literal, see ftoptions output above)
--   atoi("UTF8")   :  0                                              (stops at first non-digit)
--   effective enc  :  SQL_ASCII                                      (PG_SQL_ASCII = 0)

The same misbehavior applies whenever pxf:// URIs are used together with SERVER gp_exttable_server (which dispatches through gp_exttable_fdw); in that combination the segment-side gpdbwritable formatter raises the misleading Define the external table with ENCODING UTF8 error at SELECT time.

Operating System

Linux 5.14.0-503.38.1.el9_5.x86_64 (Rocky / RHEL 9.5). Issue is OS-independent.

Anything else

  • Affected scope: the gp_exttable_fdw FDW (used by the built-in gp_exttable_server).
    • Bug sites: src/backend/access/external/external.c:278-283 (read path) and gpcontrib/gp_exttable_fdw/option.c:136-143 (validator).
    • The standalone pxf_fdw FDW is not affected: its validator routes encoding through ProcessCopyOptions (see gpcontrib/pxf_fdw/pxf_option.c:367), which natively accepts symbolic names. The PXF binary symptom described above shows up specifically when pxf:// URIs are used with SERVER gp_exttable_server, which dispatches through gp_exttable_fdw, not through the standalone pxf_fdw FDW.
  • The runtime "Define the external table with ENCODING UTF8" message lives in the separate apache/cloudberry-pxf-server repository (gpdbwritable formatter, Java); a follow-up issue/PR will land there to point users at the OPTIONS clause once the root cause here is fixed.
  • pg_char_to_encoding() is already declared in src/include/mb/pg_wchar.h:567 and used by the legacy CREATE EXTERNAL TABLE path, so no new dependency is needed.

Are you willing to submit PR?

Yes — a follow-up PR is in preparation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions