NIFI-15754 Add Google Cloud Storage Provider for Iceberg#11052
NIFI-15754 Add Google Cloud Storage Provider for Iceberg#11052exceptionfactory wants to merge 1 commit intoapache:mainfrom
Conversation
- Added Iceberg GCS module and NAR - Added Iceberg FileIO implementation using Java HttpClient for GCS REST operations
|
|
||
| try { | ||
| final HttpResponse<Void> response = httpClientProvider.send(request, HttpResponse.BodyHandlers.discarding()); | ||
| return response.statusCode() == HTTP_OK; |
There was a problem hiding this comment.
Isn't it risky here to only check for HTTP_OK? what if you get something like error 500 or error 429 and assume that the file does not exist and you then overwrite the file (I mean OutputFile.create() path, which calls toInputFile().exists() to guard against overwriting an existing file).
There was a problem hiding this comment.
That's a good point, I'll make an adjustment to throw an exception for status codes other than 200 and 404.
| */ | ||
| @Override | ||
| public void close() { | ||
| httpClientProvider.close(); |
There was a problem hiding this comment.
Should we have a null check here? (if close() is called before initialize() is re-invoked)
There was a problem hiding this comment.
Although initialize() is required before use, yes, adding a null check is a reasonable safeguard.
| try { | ||
| return httpClient.send(request, bodyHandler); | ||
| } catch (final InterruptedException e) { |
There was a problem hiding this comment.
Should we implement retry with backoff for transient failures (429, 500, 503, ...)?
There was a problem hiding this comment.
Yes, it looks like that would more closely align with the Google Cloud Storage library behavior, I will implement some basic retry and backoff strategy.
| static final String DEFAULT_SERVICE_HOST = "https://storage.googleapis.com"; | ||
| static final int DEFAULT_WRITE_CHUNK_SIZE = 8 * 1024 * 1024; | ||
| static final int DEFAULT_READ_CHUNK_SIZE = 2 * 1024 * 1024; | ||
| static final int MINIMUM_CHUNK_SIZE = 256 * 1024; |
There was a problem hiding this comment.
We never seem to be enforcing this compared to what we get with WRITE_CHUNK_SIZE_BYTES. I understand that if the catalog is well configured that should not be an issue but thought I'd flag.
There was a problem hiding this comment.
Good catch, that was leftover from some earlier iteration. I will remove this value to avoid confusion.
Summary
NIFI-15754 Adds Google Cloud Storage support to Iceberg modules with a
GCSIcebergFileIOProviderController Service implementation.The new Controller Service is packaged in a separate
nifi-iceberg-gcsmodule and bundled innifi-iceberg-gcs-nar, following the pattern of the AWS S3 and Azure Data Lake Storage implementations.Although the iceberg-gcp module from the Apache Iceberg project includes a GCSFileIO implementation, the class depends on the google-cloud-storage library which has dozens of direct and transitive dependencies.
Instead of using the
GCSFileIOclass, the Controller Service package includes a direct implementation of the IcebergFileIOinterface namedGoogleCloudStorageFileIO. The implementation uses the Java HttpClient for REST operations with the GCS API. Supporting Bearer Token authentication with vended credentials from an Iceberg REST Catalog, theFileIOimplementation avoids multiple layers of dependencies. The new implementation reads the same properties defined in the Iceberg GCPProperties to support compatibility with Iceberg Catalogs.The
InputStreamandOutputStreamimplementations handle direct interaction with Google Cloud Storage, including support for resumable uploads.Tests for the
FileIOimplementation include interaction with the OkHttp MockWebServer to verify expected HTTP requests and responses.Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
Pull Request Tracking
NIFI-00000NIFI-00000VerifiedstatusPull Request Formatting
mainbranchVerification
Please indicate the verification steps performed prior to pull request creation.
Build
./mvnw clean install -P contrib-checkLicensing
LICENSEandNOTICEfilesDocumentation