Skip to content

Fix: FDW OPTIONS encoding accepts symbolic names (issue #1726)#1727

Open
talmacschen-arch wants to merge 3 commits into
apache:mainfrom
talmacschen-arch:fix/fdw-options-encoding-1726
Open

Fix: FDW OPTIONS encoding accepts symbolic names (issue #1726)#1727
talmacschen-arch wants to merge 3 commits into
apache:mainfrom
talmacschen-arch:fix/fdw-options-encoding-1726

Conversation

@talmacschen-arch
Copy link
Copy Markdown

Fixes #1726

What does this PR do?

Make the gp_exttable_fdw FDW accept both symbolic encoding names (UTF8, utf-8, GBK, …) and numeric encoding IDs in the encoding OPTIONS entry, mirroring what the legacy CREATE EXTERNAL TABLE … ENCODING … path has always accepted, and what the rest of PostgreSQL accepts everywhere else (client_encoding, pg_dump, pg_conversion, CREATE DATABASE, …).

Before this PR, both the FDW catalog reader (src/backend/access/external/external.c:280) and the option validator (gpcontrib/gp_exttable_fdw/option.c:139,142) parsed the value with atoi(). atoi("UTF8") silently returns 0 (SQL_ASCII) and PG_VALID_ENCODING(0) is true, so:

OPTIONS spelling Before After
encoding '6' UTF8 ✅ UTF8 ✅ (unchanged)
encoding 'UTF8' SQL_ASCII (silent ❌) UTF8 ✅
encoding 'utf-8' SQL_ASCII (silent ❌) UTF8 ✅
encoding 'GBK' SQL_ASCII (silent ❌) GBK ✅
encoding 'bogus' SQL_ASCII (silent ❌) ERROR — unrecognized ✅

Affected scope is gp_exttable_fdw (used by gp_exttable_server). The standalone pxf_fdw is unaffected — its validator routes encoding through ProcessCopyOptions, which is already name-aware.

Implementation

A small shared helper parse_fdw_encoding_option(const char *) is added in src/backend/access/external/external.c (declared in src/include/access/external.h):

  1. Try pg_char_to_encoding(value) — the same translation the legacy CREATE EXTERNAL TABLE path uses.
  2. Otherwise try a strict numeric form (strtol + end-of-string + PG_VALID_ENCODING checks). atoi is intentionally avoided: atoi("UTF8") silently mistranslates and that is the bug being fixed.
  3. Otherwise ereport(ERROR).

Both the FDW validator (gp_exttable_permission_check) and the read path (GetExtFromForeignTableOptions) call this helper. No on-disk format change: values are stored verbatim in pg_foreign_table.ftoptions exactly as the user wrote them, and resolved at read time. This keeps pg_dump output round-trip-safe and avoids silently rewriting catalog rows during upgrade.

The bypass on the runtime PXF error message ("Define the external table with ENCODING UTF8...") lives in the separate apache/cloudberry-pxf-server repository and will be addressed in a follow-up PR there.

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Behavior change on upgrade — see "User-facing changes" below
  • Documentation update

Test Plan

Added cases to gpcontrib/gp_exttable_fdw/input/gp_exttable_fdw.source:

  • encoding '6' → resolves to UTF8 (control)
  • encoding 'UTF8' / encoding 'utf-8' → resolves to UTF8 (the fix)
  • encoding 'GBK' → resolves to GBK
  • encoding 'bogus' → ereports
  • ALTER FOREIGN TABLE … OPTIONS (SET encoding 'UTF8') on a table that
    was originally created with encoding '0' → reads as UTF8 after the ALTER

The pre-existing encoding '-1' ereport test was kept; only the error message text was updated to the new "\"-1\" is not a valid encoding name or code" form.

  • Unit tests added/updated (gpcontrib/gp_exttable_fdw/)
  • Integration tests added/updated
  • Passed make installcheck (to be run by reviewer / CI)
  • Passed make -C src/test installcheck-cbdb-parallel (to be run)

Impact

User-facing changes (release-note worthy ⚠️):

Foreign tables that were created before this fix with a non-numeric encoding option (e.g. encoding 'UTF8') had the literal name silently stored in pg_foreign_table.ftoptions. With the old atoi() reader they took effect as SQL_ASCII. After this fix, the same on-disk row is interpreted as the named encoding (e.g. UTF8) — i.e. the encoding actually requested by the user.

For most affected users this is the desired correction (data was already being silently mishandled). Operators who want to identify which existing tables are about to flip semantics can run:

SELECT n.nspname, c.relname,
       (SELECT split_part(opt, '=', 2)
        FROM   unnest(ftoptions) AS opt
        WHERE  opt LIKE 'encoding=%') AS stored_encoding
FROM   pg_foreign_table ft
JOIN   pg_class c ON c.oid = ft.ftrelid
JOIN   pg_namespace n ON n.oid = c.relnamespace
WHERE  ft.ftserver IN (SELECT oid FROM pg_foreign_server
                       WHERE  srvname IN ('gp_exttable_server'))
       AND EXISTS (
         SELECT 1 FROM unnest(ftoptions) AS opt
         WHERE  opt LIKE 'encoding=%'
                AND substring(opt FROM 'encoding=(.*)') !~ '^[0-9]+$'
       );

If a table appears that should not flip, the operator can pin it by re-issuing
ALTER FOREIGN TABLE … OPTIONS (SET encoding '<numeric>') before upgrade.

Performance: none. Two extra string compares + one syscache-free name lookup on each foreign-table read; same for validation.

Dependencies: none. pg_char_to_encoding() is already declared in src/include/mb/pg_wchar.h:567 and used by the legacy path.

Checklist

  • Followed contribution guide
  • Added/updated documentation (release-note line; doc PR not required)
  • Reviewed code for security implications (no new external input surfaces; helper ereports on invalid input)
  • Requested review from cloudberry committers

Additional Context

The design discussion in the issue (#1726) considered a write-time normalization variant via ProcessUtility_hook, where the option value would be rewritten to its numeric form before persisting into ftoptions. That approach was abandoned because the hook is only installed by the extension's _PG_init, which does not run until the first call into the extension's shared library — and on the very first CREATE FOREIGN TABLE of a fresh backend, the hook check has already been passed by the time the validator triggers dlopen. Adopting it would require listing gp_exttable_fdw in session_preload_libraries, which changes deployment expectations. The chosen read-side resolution works without any preload requirement.

Both the FDW catalog reader (src/backend/access/external/external.c)
and the gp_exttable_fdw option validator
(gpcontrib/gp_exttable_fdw/option.c) parsed the "encoding" OPTIONS value
with atoi(). atoi("UTF8") returns 0 (PG_SQL_ASCII) and PG_VALID_ENCODING(0)
is true, so symbolic names like 'UTF8', 'utf-8', 'GBK' silently fell through
validation and were stored as SQL_ASCII at read time. By contrast, the
legacy CREATE EXTERNAL TABLE ... ENCODING ... path resolves names via
pg_char_to_encoding() and persists a numeric form into OPTIONS — only the
FDW OPTIONS entry point bypassed that translation.

Add a small shared helper parse_fdw_encoding_option(const char *) in
src/backend/access/external/external.c (declared in
src/include/access/external.h):

  - first try pg_char_to_encoding(name) — same logic as the legacy path;
  - otherwise try a strict numeric form via strtol() with end-of-string
    and PG_VALID_ENCODING() checks (atoi is intentionally avoided, since
    atoi("UTF8")==0 is the bug being fixed);
  - otherwise ereport(ERROR).

Both the validator and GetExtFromForeignTableOptions() call this helper.
On-disk values in pg_foreign_table.ftoptions are stored verbatim as the
user wrote them; correctness is established at read time. This avoids a
ProcessUtility_hook approach, which is unworkable here because the
extension's _PG_init runs lazily on the first dlopen, after the current
statement's hook check has already passed.

Affected scope: gp_exttable_fdw (used by gp_exttable_server). The
standalone pxf_fdw is unaffected — its validator already routes encoding
through ProcessCopyOptions, which is name-aware.

Behavior change on upgrade: existing rows whose ftoptions literally contain
encoding=<name> have, until now, been silently interpreted as SQL_ASCII.
After this fix they are interpreted as the named encoding. This will be
called out in the release notes; a detection query is provided in the PR
description for operators who wish to pin specific tables to numeric form
before upgrade.

Tests added in gpcontrib/gp_exttable_fdw/{input,output}/gp_exttable_fdw.source
cover encoding '6' / 'UTF8' / 'utf-8' / 'GBK' / 'bogus' and an
ALTER FOREIGN TABLE ... OPTIONS (SET encoding 'UTF8') path. The pre-existing
encoding '-1' error case has its expected error message updated to match
the new helper's wording.
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @talmacschen-arch welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!

The new tests added in the previous commit had column header lines
without the trailing-space padding that psql's aligned output emits
to match the separator. The pre-existing ext_special_uri header
(' a | b') was also unintentionally stripped of its trailing space
during the same edit.

Pure whitespace fix. No behavior change.
pg_regress diffs the expected and actual .out files strictly, including
the final newline count. The new encoding test block ended with a
stray empty line (";\n\n") while psql produces ";\n", causing a 1-line
diff at end-of-file. Pure whitespace fix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] FDW OPTIONS 'encoding' silently parses non-numeric names (e.g. 'UTF8') as SQL_ASCII via atoi()

2 participants