Apache Cloudberry version
main (also reproduces on 1.x). Verified against current main at commit ef0b6fb3462.
What happened
When defining a foreign table via the FDW path with a symbolic encoding name in OPTIONS:
CREATE FOREIGN TABLE ext_t (c text)
SERVER gp_exttable_server
OPTIONS (
formatter 'pxfwritable_import',
format 'custom',
format_type 'b',
location_uris 'pxf://t?PROFILE=jdbc&SERVER=s',
encoding 'UTF8'
);
the DDL succeeds, but the encoding actually stored on the foreign table is SQL_ASCII (0), not UTF8 (PG_UTF8 = 6).
Root cause — both the FDW catalog reader and the option validator parse encoding with atoi():
src/backend/access/external/external.c:280
if (pg_strcasecmp(def->defname, "encoding") == 0)
{
extentry->encoding = atoi(defGetString(def)); /* 'UTF8' → 0 */
...
}
gpcontrib/gp_exttable_fdw/option.c:136-143
else if (pg_strcasecmp(def->defname, "encoding") == 0)
{
char *encoding = (char *) defGetString(def);
if (!PG_VALID_ENCODING(atoi(encoding))) /* atoi('UTF8') == 0 == SQL_ASCII, passes */
ereport(ERROR, ...);
}
atoi("UTF8") returns 0, and PG_VALID_ENCODING(0) is true (PG_SQL_ASCII), so validation silently passes. Anyone who writes encoding 'UTF8', encoding 'utf-8', encoding 'GBK', etc. ends up with SQL_ASCII.
Confirm with:
SELECT ftoptions FROM pg_foreign_table WHERE ftrelid = 'ext_t'::regclass;
-- {... ,encoding=UTF8, ...} -- stored verbatim; atoi() at read time → 0
Downstream consequences:
-
On the PXF binary path (format_type='b'), the PXF server emits UTF-8 bytes via String.getBytes(StandardCharsets.UTF_8). The segment's gpdbwritable formatter checks the foreign-table encoding and rejects anything other than UTF-8, surfacing as:
gpdbwritable formatter can only import UTF8 formatted data. Define the external table with ENCODING UTF8...
The error message is misleading because the user's DDL does say ENCODING UTF8 — but it has been silently turned into SQL_ASCII by atoi().
-
On other formats and protocols, the misconfiguration is silent: multi-byte text gets stored as opaque bytes, and length() / substring() / LIKE / downstream re-encoding misbehave only when CJK or accented characters appear.
By contrast, the legacy CREATE EXTERNAL TABLE ... ENCODING 'UTF8' path is fine because src/backend/commands/exttablecmds.c:209-216 resolves the name with pg_char_to_encoding() before persisting it as a numeric string into OPTIONS. Only the FDW OPTIONS entry point bypasses that translation.
What you think should happen instead
encoding in FDW OPTIONS should accept the same forms accepted everywhere else in Postgres / Cloudberry — symbolic names (UTF8, GBK, LATIN1, …) and numeric IDs — and reject everything else with a clear error. Symbolic names should be resolved via pg_char_to_encoding(), just like the legacy CREATE EXTERNAL TABLE path.
Specifically:
| OPTIONS spelling |
Today |
Expected |
encoding '6' |
UTF8 ✅ |
UTF8 ✅ (unchanged) |
encoding 'UTF8' |
SQL_ASCII ❌ (silent) |
UTF8 ✅ |
encoding 'utf-8' |
SQL_ASCII ❌ (silent) |
UTF8 ✅ |
encoding 'GBK' |
SQL_ASCII ❌ (silent) |
GBK ✅ |
encoding 'bogus' |
SQL_ASCII ❌ (silent) |
ERROR — unrecognized encoding ✅ |
Implementation note (FDW validator interface limitation)
The legacy CREATE EXTERNAL TABLE path achieves write-time normalization via a custom dispatch (DefineExternalRelation, src/backend/commands/exttablecmds.c:76) that pre-resolves the encoding and writes a numeric string into stmt->options before calling CreateForeignTable. The standard FDW validator contract (src/backend/commands/foreigncmds.c:121-216 transformGenericOptions — the array to be persisted is built at line 190 before the validator is called at line 202-212, and the validator returns void) does not permit option mutation. So the FDW path needs a different injection point than just the validator. Two implementations are viable:
- Write-time normalization, either via a
ProcessUtility_hook registered by gp_exttable_fdw or via a small site-specific branch in CreateForeignTable keyed on the gp_exttable_fdw validator. The read path (external.c:280) keeps its atoi. Existing rows with encoding 'UTF8' retain SQL_ASCII semantics on upgrade and only get fixed when the user re-issues ALTER … OPTIONS (SET encoding 'UTF8') — i.e. no silent meaning change on upgrade, which is the safer behavior. pg_dump output is also stabilized to numeric form.
- Read-side name-awareness in
GetExtFromForeignTableOptions (external.c:280) and the validator (option.c:139-142). Simpler (~30 LOC), but existing rows that today silently render as SQL_ASCII would start being interpreted as actual UTF8 on upgrade — this must be called out in release notes.
Update (2026-05-09): direction settled — going with the second shape.
After implementing the first shape (write-time normalization via ProcessUtility_hook registered in gp_exttable_fdw's _PG_init), the author hit a load-order issue that makes it unworkable without changing deployment expectations:
- The hook is installed by
_PG_init, which only runs when the extension's shared library is dlopen-ed.
- That happens lazily, when one of the extension's
PG_FUNCTION_INFO_V1 functions is first invoked. None of them is called during connection setup — the first to fire is the validator (gp_exttable_permission_check), called from inside transformGenericOptions (src/backend/commands/foreigncmds.c:202-212) on the user's first CREATE FOREIGN TABLE.
- By the time the validator triggers
dlopen and _PG_init registers our ProcessUtility_hook, the hook check for the current statement has already been passed (it happens at the top of ProcessUtility). So the very first CREATE FOREIGN TABLE … encoding 'UTF8' of every fresh backend would still bypass the hook — the exact reproduction documented at the top of this issue.
- Subsequent statements in the same session would work, but that defeats the goal: the bug bites at table-creation time, not later.
The standard cure in Cloudberry / PG (which pg_hint_plan follows — every one of its tests starts with LOAD 'pg_hint_plan'; and its README requires session_preload_libraries) is to require the extension to be preloaded. That changes the deployment story for gp_exttable_fdw, which today does not need preloading, and was therefore not chosen here.
The PR will take the second shape: read-side name-awareness in both GetExtFromForeignTableOptions (external.c:280) and the validator (option.c:139-142), via a small shared helper parse_fdw_encoding_option(const char *) that resolves names through pg_char_to_encoding() and falls back to a strict numeric form (no atoi, since atoi("UTF8")==0 is the bug). On-disk values stay as the user wrote them; correctness is established at read time.
Acknowledged trade-off (called out as a release-note item in the PR): existing rows whose ftoptions already contains a literal encoding=UTF8 and which therefore silently render as SQL_ASCII today will start being interpreted as the requested encoding on upgrade. Since such rows are already misconfigured, "upgrade fixes them" is the desired behavior for nearly all affected users — but it is a behavior change. A detection query will ship in the PR body so operators who, for any reason, want to pin a specific table to its old (broken) SQL_ASCII semantics can re-issue ALTER FOREIGN TABLE … OPTIONS (SET encoding '<numeric>') before upgrade.
How to reproduce
CREATE EXTENSION IF NOT EXISTS gp_exttable_fdw;
CREATE FOREIGN TABLE ext_utf8 (c text)
SERVER gp_exttable_server
OPTIONS (
format 'text',
format_type 't',
location_uris 'file:///tmp/x.txt',
encoding 'UTF8'
);
-- Bug: encoding is stored verbatim and atoi('UTF8')==0 → SQL_ASCII
SELECT ftoptions FROM pg_foreign_table WHERE ftrelid = 'ext_utf8'::regclass;
SELECT e.encoding, pg_encoding_to_char(e.encoding)
FROM pg_exttable e
JOIN pg_class c ON c.oid = e.reloid
WHERE c.relname = 'ext_utf8';
-- encoding | pg_encoding_to_char
-- ----------+---------------------
-- 0 | SQL_ASCII
-- Causal chain made explicit:
-- user wrote : encoding 'UTF8'
-- stored in : pg_foreign_table.ftoptions → encoding=UTF8 (literal, see ftoptions output above)
-- atoi("UTF8") : 0 (stops at first non-digit)
-- effective enc : SQL_ASCII (PG_SQL_ASCII = 0)
The same misbehavior applies whenever pxf:// URIs are used together with SERVER gp_exttable_server (which dispatches through gp_exttable_fdw); in that combination the segment-side gpdbwritable formatter raises the misleading Define the external table with ENCODING UTF8 error at SELECT time.
Operating System
Linux 5.14.0-503.38.1.el9_5.x86_64 (Rocky / RHEL 9.5). Issue is OS-independent.
Anything else
- Affected scope: the
gp_exttable_fdw FDW (used by the built-in gp_exttable_server).
- Bug sites:
src/backend/access/external/external.c:278-283 (read path) and gpcontrib/gp_exttable_fdw/option.c:136-143 (validator).
- The standalone
pxf_fdw FDW is not affected: its validator routes encoding through ProcessCopyOptions (see gpcontrib/pxf_fdw/pxf_option.c:367), which natively accepts symbolic names. The PXF binary symptom described above shows up specifically when pxf:// URIs are used with SERVER gp_exttable_server, which dispatches through gp_exttable_fdw, not through the standalone pxf_fdw FDW.
- The runtime "Define the external table with ENCODING UTF8" message lives in the separate
apache/cloudberry-pxf-server repository (gpdbwritable formatter, Java); a follow-up issue/PR will land there to point users at the OPTIONS clause once the root cause here is fixed.
pg_char_to_encoding() is already declared in src/include/mb/pg_wchar.h:567 and used by the legacy CREATE EXTERNAL TABLE path, so no new dependency is needed.
Are you willing to submit PR?
Yes — a follow-up PR is in preparation.
Apache Cloudberry version
main (also reproduces on 1.x). Verified against current
mainat commitef0b6fb3462.What happened
When defining a foreign table via the FDW path with a symbolic encoding name in
OPTIONS:the DDL succeeds, but the encoding actually stored on the foreign table is
SQL_ASCII(0), notUTF8(PG_UTF8 = 6).Root cause — both the FDW catalog reader and the option validator parse
encodingwithatoi():src/backend/access/external/external.c:280gpcontrib/gp_exttable_fdw/option.c:136-143atoi("UTF8")returns0, andPG_VALID_ENCODING(0)is true (PG_SQL_ASCII), so validation silently passes. Anyone who writesencoding 'UTF8',encoding 'utf-8',encoding 'GBK', etc. ends up withSQL_ASCII.Confirm with:
Downstream consequences:
On the PXF binary path (
format_type='b'), the PXF server emits UTF-8 bytes viaString.getBytes(StandardCharsets.UTF_8). The segment's gpdbwritable formatter checks the foreign-table encoding and rejects anything other than UTF-8, surfacing as:The error message is misleading because the user's DDL does say
ENCODING UTF8— but it has been silently turned intoSQL_ASCIIbyatoi().On other formats and protocols, the misconfiguration is silent: multi-byte text gets stored as opaque bytes, and
length()/substring()/LIKE/ downstream re-encoding misbehave only when CJK or accented characters appear.By contrast, the legacy
CREATE EXTERNAL TABLE ... ENCODING 'UTF8'path is fine becausesrc/backend/commands/exttablecmds.c:209-216resolves the name withpg_char_to_encoding()before persisting it as a numeric string into OPTIONS. Only the FDW OPTIONS entry point bypasses that translation.What you think should happen instead
encodingin FDWOPTIONSshould accept the same forms accepted everywhere else in Postgres / Cloudberry — symbolic names (UTF8,GBK,LATIN1, …) and numeric IDs — and reject everything else with a clear error. Symbolic names should be resolved viapg_char_to_encoding(), just like the legacyCREATE EXTERNAL TABLEpath.Specifically:
encoding '6'encoding 'UTF8'encoding 'utf-8'encoding 'GBK'encoding 'bogus'Implementation note (FDW validator interface limitation)
The legacy
CREATE EXTERNAL TABLEpath achieves write-time normalization via a custom dispatch (DefineExternalRelation,src/backend/commands/exttablecmds.c:76) that pre-resolves the encoding and writes a numeric string intostmt->optionsbefore callingCreateForeignTable. The standard FDW validator contract (src/backend/commands/foreigncmds.c:121-216transformGenericOptions— the array to be persisted is built at line 190 before the validator is called at line 202-212, and the validator returns void) does not permit option mutation. So the FDW path needs a different injection point than just the validator. Two implementations are viable:ProcessUtility_hookregistered bygp_exttable_fdwor via a small site-specific branch inCreateForeignTablekeyed on thegp_exttable_fdwvalidator. The read path (external.c:280) keeps itsatoi. Existing rows withencoding 'UTF8'retainSQL_ASCIIsemantics on upgrade and only get fixed when the user re-issuesALTER … OPTIONS (SET encoding 'UTF8')— i.e. no silent meaning change on upgrade, which is the safer behavior.pg_dumpoutput is also stabilized to numeric form.GetExtFromForeignTableOptions(external.c:280) and the validator (option.c:139-142). Simpler (~30 LOC), but existing rows that today silently render asSQL_ASCIIwould start being interpreted as actual UTF8 on upgrade — this must be called out in release notes.Update (2026-05-09): direction settled — going with the second shape.
After implementing the first shape (write-time normalization via
ProcessUtility_hookregistered ingp_exttable_fdw's_PG_init), the author hit a load-order issue that makes it unworkable without changing deployment expectations:_PG_init, which only runs when the extension's shared library isdlopen-ed.PG_FUNCTION_INFO_V1functions is first invoked. None of them is called during connection setup — the first to fire is the validator (gp_exttable_permission_check), called from insidetransformGenericOptions(src/backend/commands/foreigncmds.c:202-212) on the user's firstCREATE FOREIGN TABLE.dlopenand_PG_initregisters ourProcessUtility_hook, the hook check for the current statement has already been passed (it happens at the top ofProcessUtility). So the very firstCREATE FOREIGN TABLE … encoding 'UTF8'of every fresh backend would still bypass the hook — the exact reproduction documented at the top of this issue.The standard cure in Cloudberry / PG (which
pg_hint_planfollows — every one of its tests starts withLOAD 'pg_hint_plan';and its README requiressession_preload_libraries) is to require the extension to be preloaded. That changes the deployment story forgp_exttable_fdw, which today does not need preloading, and was therefore not chosen here.The PR will take the second shape: read-side name-awareness in both
GetExtFromForeignTableOptions(external.c:280) and the validator (option.c:139-142), via a small shared helperparse_fdw_encoding_option(const char *)that resolves names throughpg_char_to_encoding()and falls back to a strict numeric form (noatoi, sinceatoi("UTF8")==0is the bug). On-disk values stay as the user wrote them; correctness is established at read time.Acknowledged trade-off (called out as a release-note item in the PR): existing rows whose
ftoptionsalready contains a literalencoding=UTF8and which therefore silently render asSQL_ASCIItoday will start being interpreted as the requested encoding on upgrade. Since such rows are already misconfigured, "upgrade fixes them" is the desired behavior for nearly all affected users — but it is a behavior change. A detection query will ship in the PR body so operators who, for any reason, want to pin a specific table to its old (broken)SQL_ASCIIsemantics can re-issueALTER FOREIGN TABLE … OPTIONS (SET encoding '<numeric>')before upgrade.How to reproduce
The same misbehavior applies whenever
pxf://URIs are used together withSERVER gp_exttable_server(which dispatches throughgp_exttable_fdw); in that combination the segment-side gpdbwritable formatter raises the misleadingDefine the external table with ENCODING UTF8error at SELECT time.Operating System
Linux 5.14.0-503.38.1.el9_5.x86_64 (Rocky / RHEL 9.5). Issue is OS-independent.
Anything else
gp_exttable_fdwFDW (used by the built-ingp_exttable_server).src/backend/access/external/external.c:278-283(read path) andgpcontrib/gp_exttable_fdw/option.c:136-143(validator).pxf_fdwFDW is not affected: its validator routesencodingthroughProcessCopyOptions(seegpcontrib/pxf_fdw/pxf_option.c:367), which natively accepts symbolic names. The PXF binary symptom described above shows up specifically whenpxf://URIs are used withSERVER gp_exttable_server, which dispatches throughgp_exttable_fdw, not through the standalonepxf_fdwFDW.apache/cloudberry-pxf-serverrepository (gpdbwritable formatter, Java); a follow-up issue/PR will land there to point users at the OPTIONS clause once the root cause here is fixed.pg_char_to_encoding()is already declared insrc/include/mb/pg_wchar.h:567and used by the legacyCREATE EXTERNAL TABLEpath, so no new dependency is needed.Are you willing to submit PR?
Yes — a follow-up PR is in preparation.