
feat(dataproc): Add dataproc source and list/get clusters/jobs tools#2407

Open
dborowitz wants to merge 5 commits into googleapis:main from dborowitz:dataproc

Conversation

@dborowitz
Contributor

Description

Add a new source for Dataproc, which is closely related to Serverless Spark. Similar to get/list batches, we have get/list clusters and jobs, with minor API differences.
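
For context, a minimal sketch of how a Dataproc source might initialize its clients, assuming the cloud.google.com/go/dataproc/v2/apiv1 library and Dataproc's regional endpoints; the type and field names here are illustrative, not the actual code in internal/sources/dataproc/dataproc.go:

```go
package dataproc

import (
	"context"
	"fmt"

	dataprocapi "cloud.google.com/go/dataproc/v2/apiv1"
	"google.golang.org/api/option"
)

// Source bundles the two Dataproc controller clients (illustrative names).
type Source struct {
	Project  string
	Region   string
	Clusters *dataprocapi.ClusterControllerClient
	Jobs     *dataprocapi.JobControllerClient
}

// newSource creates ClusterController and JobController clients pinned to a
// regional endpoint, since Dataproc supports regions rather than arbitrary
// locations.
func newSource(ctx context.Context, project, region string) (*Source, error) {
	endpoint := option.WithEndpoint(fmt.Sprintf("%s-dataproc.googleapis.com:443", region))
	clusters, err := dataprocapi.NewClusterControllerClient(ctx, endpoint)
	if err != nil {
		return nil, fmt.Errorf("cluster controller client: %w", err)
	}
	jobs, err := dataprocapi.NewJobControllerClient(ctx, endpoint)
	if err != nil {
		return nil, fmt.Errorf("job controller client: %w", err)
	}
	return &Source{Project: project, Region: region, Clusters: clusters, Jobs: jobs}, nil
}
```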

PR Checklist

Thank you for opening a Pull Request! Before submitting your PR, there are a
few things you can do to make sure it goes smoothly:

  • Make sure you reviewed
    CONTRIBUTING.md
  • Make sure to open an issue as a
    bug/issue
    before writing your code! That way we can discuss the change, evaluate
    designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)
  • Make sure to add ! if this involves a breaking change

🛠️ Part of #2405

@dborowitz dborowitz requested a review from a team as a code owner February 3, 2026 05:47
@gemini-code-assist
Contributor

Summary of Changes

Hello @dborowitz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the system's capabilities by integrating with Google Cloud Dataproc. It provides a new data source and a suite of tools that enable users to programmatically interact with Dataproc clusters and jobs, offering functionalities such as retrieving specific resource details and listing resources with filtering options. This enhancement mirrors existing functionalities for Serverless Spark, providing a consistent experience for managing big data processing environments.

Highlights

  • New Dataproc Source: Introduced a new source type for Google Cloud Dataproc, allowing the system to interact with Dataproc clusters and jobs.
  • Dataproc Cluster Management Tools: Added tools to retrieve details of a specific Dataproc cluster (get_cluster) and list/filter multiple clusters (list_clusters).
  • Dataproc Job Management Tools: Implemented tools to fetch information about a particular Dataproc job (get_job) and list/filter various jobs (list_jobs).
  • Comprehensive Documentation and Testing: Included extensive documentation for the new Dataproc source and its associated tools, along with new integration tests to validate functionality.
Changelog
  • .ci/integration.cloudbuild.yaml
    • Added a new integration test step for Dataproc to the CI pipeline.
  • cmd/root.go
    • Imported new Dataproc tool and source packages to register them with the command-line interface.
  • cmd/root_test.go
    • Updated prebuilt tools tests to include the new Dataproc configuration and toolset.
  • docs/en/reference/prebuilt-tools.md
    • Added documentation for the new Dataproc prebuilt tools, detailing environment variables, required permissions, and available tools.
  • docs/en/resources/sources/dataproc.md
    • Created new documentation describing the Dataproc source, its purpose, available tools, and IAM requirements.
  • docs/en/resources/tools/dataproc/_index.md
    • Added an index file for the new Dataproc tools documentation.
  • docs/en/resources/tools/dataproc/dataproc-get-cluster.md
    • Documented the dataproc-get-cluster tool, including its parameters, compatible sources, and response format.
  • docs/en/resources/tools/dataproc/dataproc-get-job.md
    • Documented the dataproc-get-job tool, including its parameters, compatible sources, and response format.
  • docs/en/resources/tools/dataproc/dataproc-list-clusters.md
    • Documented the dataproc-list-clusters tool, including its parameters, compatible sources, and response format.
  • docs/en/resources/tools/dataproc/dataproc-list-jobs.md
    • Documented the dataproc-list-jobs tool, including its parameters, compatible sources, and response format.
  • internal/prebuiltconfigs/tools/dataproc.yaml
    • Added YAML configuration for the Dataproc source and its associated list_clusters, get_cluster, list_jobs, and get_job tools.
  • internal/sources/dataproc/dataproc.go
    • Implemented the core Dataproc source, including client initialization for Dataproc ClusterController and JobController services.
  • internal/sources/dataproc/dataproc_test.go
    • Added unit tests for parsing Dataproc source configurations from YAML, covering success and failure cases.
  • internal/tools/dataproc/common/urls.go
    • Introduced utility functions for generating Google Cloud Console URLs for Dataproc clusters and jobs, including links to monitoring and logs.
  • internal/tools/dataproc/common/urls_test.go
    • Added unit tests for the Dataproc URL generation utility functions.
  • internal/tools/dataproc/dataprocgetcluster/dataprocgetcluster.go
    • Implemented the dataproc-get-cluster tool, which retrieves details for a specified Dataproc cluster.
  • internal/tools/dataproc/dataprocgetcluster/dataprocgetcluster_test.go
    • Added unit tests for parsing the dataproc-get-cluster tool configuration.
  • internal/tools/dataproc/dataprocgetjob/dataprocgetjob.go
    • Implemented the dataproc-get-job tool, which retrieves details for a specified Dataproc job.
  • internal/tools/dataproc/dataprocgetjob/dataprocgetjob_test.go
    • Added unit tests for parsing the dataproc-get-job tool configuration.
  • internal/tools/dataproc/dataproclistclusters/dataproclistclusters.go
    • Implemented the dataproc-list-clusters tool, enabling listing and filtering of Dataproc clusters.
  • internal/tools/dataproc/dataproclistclusters/dataproclistclusters_test.go
    • Added unit tests for parsing the dataproc-list-clusters tool configuration.
  • internal/tools/dataproc/dataproclistjobs/dataproclistjobs.go
    • Implemented the dataproc-list-jobs tool, enabling listing and filtering of Dataproc jobs.
  • internal/tools/dataproc/dataproclistjobs/dataproclistjobs_test.go
    • Added unit tests for parsing the dataproc-list-jobs tool configuration.
  • tests/dataproc/dataproc_integration_test.go
    • Added comprehensive integration tests for all new Dataproc tools, covering get-cluster, get-job, list-clusters, and list-jobs functionality, including error handling and authentication.
Activity
  • The pull request introduces a new feature to integrate with Google Cloud Dataproc.
  • New Go packages and files were added to define the Dataproc source and its associated tools.
  • Configuration files were updated to include the new Dataproc source and tools.
  • Documentation was created for all new Dataproc-related components.
  • Integration tests were developed to ensure the correct functionality and robustness of the new Dataproc features.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new Dataproc source and associated tools for interacting with Dataproc clusters and jobs. While the implementation is generally solid, several critical security and stability issues have been identified. Specifically, the URL generation for Cloud Logging is vulnerable to filter injection if resource names contain double quotes, and the Invoke methods in the list tools use unsafe type assertions on user-supplied parameters, which can lead to application panics and Denial of Service (DoS). Additionally, there are areas for improvement regarding resource handling, data correctness in one of the tools, and inconsistencies in documentation and tests. Addressing these points will significantly enhance the robustness, clarity, and security of the new features.
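
To make the two flagged classes of issues concrete, here is a hedged sketch of the kind of fixes they imply (illustrative helper names, not the PR's code): escape double quotes before interpolating a resource name into a Cloud Logging filter, and use the two-value form of type assertions so bad parameters return an error instead of panicking.

```go
package common

import (
	"fmt"
	"strings"
)

// logFilter escapes double quotes before embedding a resource name in a
// Cloud Logging filter, so a name containing a double quote cannot break
// out of the quoted value and inject extra filter clauses.
func logFilter(clusterName string) string {
	escaped := strings.ReplaceAll(clusterName, `"`, `\"`)
	return fmt.Sprintf(`resource.labels.cluster_name="%s"`, escaped)
}

// stringParam uses the two-value type assertion; the one-value form
// (paramMap[key].(string)) panics when the value is not a string.
func stringParam(paramMap map[string]any, key string) (string, error) {
	v, ok := paramMap[key].(string)
	if !ok {
		return "", fmt.Errorf("parameter %q must be a string", key)
	}
	return v, nil
}
```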

@dborowitz
Contributor Author

Integration tests are failing because the Cloud Build SA is missing the Dataproc Viewer/Editor IAM roles.

@dborowitz dborowitz force-pushed the dataproc branch 2 times, most recently from 182297a to d0c3c41 on February 3, 2026 17:21
@dborowitz
Contributor Author

I made a mistake basing my original PR on a several-week-old local repo without syncing first. I cribbed from the Serverless implementation but missed this important refactoring, so I guess I need to go rewrite this now...

@dborowitz
Contributor Author

> I made a mistake basing my original PR on a several-week-old local repo without syncing first. I cribbed from the Serverless implementation but missed this important refactoring, so I guess I need to go rewrite this now...

Done, as a separate commit. I would have tried harder to change the old commits in order if the tests weren't already passing, but they were, and this is a pure refactoring.

@dborowitz dborowitz closed this Feb 3, 2026
@dborowitz dborowitz reopened this Feb 3, 2026
@dborowitz dborowitz added the tests: run Label to trigger Github Action tests. label Feb 3, 2026
@github-actions github-actions bot removed the tests: run Label to trigger Github Action tests. label Feb 3, 2026
Very closely analogous to the serverless-spark source and
serverless-spark-list-batches tool; these are separate APIs for two
closely related GCP products. There are of course minor differences in
the APIs; for example, Dataproc generally supports only regions, not
arbitrary locations.

One wrinkle is that a known issue (KI) with the list jobs RPC causes it to be very
slow in a project with many serverless batches (like the test project),
unless filtering by cluster. This is mentioned in the param description
so LLMs can provide it; in the tests, we always add it based on an env
var. Unlike other env vars, the cluster name in the test project is
arbitrary but not a secret.
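
A sketch of how the tests might wire that in (the env var and parameter names here are illustrative, not necessarily those in tests/dataproc/dataproc_integration_test.go):

```go
package dataproc_test

import (
	"os"
	"testing"
)

// TestListJobs sketches the pattern described above: always constrain
// list-jobs by cluster so the test project's many serverless batches do
// not trigger the slow ListJobs path.
func TestListJobs(t *testing.T) {
	clusterName := os.Getenv("DATAPROC_CLUSTER_NAME") // arbitrary but not secret
	if clusterName == "" {
		t.Skip("DATAPROC_CLUSTER_NAME not set")
	}
	params := map[string]any{
		"clusterName": clusterName, // avoids the known-issue slow path
	}
	_ = params // pass to the list-jobs tool invocation here
}
```
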
The previous commits adding Dataproc tool implementations were based on
an old version of main, before rebasing past 0691a6f, which refactored
serverlessspark to use the new pattern of putting business logic into
the Source.

This commit refactors all Dataproc tools at once, rather than
refactoring incrementally and dealing with the rebase of the commits in
the pending PR.
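
For readers unfamiliar with that pattern, a rough sketch of what "business logic in the Source" means here, with illustrative names (the real refactoring is in 0691a6f and this PR's follow-up commit): the tool layer only validates parameters and delegates, while the Source owns the clients and issues the RPC.

```go
package dataproc

import (
	"context"

	dataprocpb "cloud.google.com/go/dataproc/v2/apiv1/dataprocpb"
)

// GetCluster is Source-side business logic: it issues the RPC using the
// source's project, region, and ClusterController client, so each tool's
// Invoke stays a thin parameter-validation wrapper.
func (s *Source) GetCluster(ctx context.Context, name string) (*dataprocpb.Cluster, error) {
	return s.Clusters.GetCluster(ctx, &dataprocpb.GetClusterRequest{
		ProjectId:   s.Project,
		Region:      s.Region,
		ClusterName: name,
	})
}
```
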
@dborowitz dborowitz added the tests: run Label to trigger Github Action tests. label Feb 4, 2026
@github-actions github-actions bot removed the tests: run Label to trigger Github Action tests. label Feb 4, 2026
@duwenxin99 duwenxin99 added the docs: deploy-preview Label to trigger Github Action docs preview. label Feb 17, 2026

and `labels`. For example: `status.state = ACTIVE AND clusterName = mycluster`.
Supported `status.state` values are: `ACTIVE`, `INACTIVE`, `CREATING`, `RUNNING`,
`ERROR`, `DELETING`, `UPDATING`, `STOPPING`, `STOPPED`.
- **`pageSize`** (optional): The maximum number of clusters to return in a single

Can we also specify the default value here?

func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, error) {
source, err := tools.GetCompatibleSource[compatibleSource](resourceMgr, t.Config.Source, t.Name, kind)
if err != nil {
return nil, err

We just had a refactor of the Tool Invoke() function. We need to return util.ClientServerError for developer errors and util.AgentError for agent errors that can be self-corrected. Let me know if there's confusion around it.

paramMap := params.AsMap()
name, ok := paramMap["clusterName"].(string)
if !ok {
return nil, fmt.Errorf("missing required parameter: clusterName")

Suggested change
return nil, fmt.Errorf("missing required parameter: clusterName")
return nil, util.NewAgentError("invalid or missing 'clusterName' parameter", nil)

func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, error) {
source, err := tools.GetCompatibleSource[compatibleSource](resourceMgr, t.Config.Source, t.Name, kind)
if err != nil {
return nil, err

Suggested change
return nil, err
return nil, util.NewClientServerError("source used is not compatible with the tool", http.StatusInternalServerError, err)

return nil, fmt.Errorf("missing required parameter: clusterName")
}
if strings.Contains(name, "/") {
return nil, fmt.Errorf("clusterName must be a short name without '/': %s", name)

return nil, util.NewAgentError(fmt.Sprintf("clusterName must be a short name without '/': %s", name), nil)

}

// Invoke executes the tool's operation.
func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, error) {

Suggested change
func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, error) {
func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, util.ToolboxError) {
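
Putting these suggestions together, the refactored Invoke would look roughly like the fragment below; a sketch assuming the util.ToolboxError interface and the constructors named above, with the trailing RPC call elided:

```go
func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, util.ToolboxError) {
	source, err := tools.GetCompatibleSource[compatibleSource](resourceMgr, t.Config.Source, t.Name, kind)
	if err != nil {
		// Developer/config error: the wrong source kind was wired to this tool.
		return nil, util.NewClientServerError("source used is not compatible with the tool", http.StatusInternalServerError, err)
	}
	paramMap := params.AsMap()
	name, ok := paramMap["clusterName"].(string)
	if !ok {
		// Agent error: the model can self-correct by re-supplying the parameter.
		return nil, util.NewAgentError("invalid or missing 'clusterName' parameter", nil)
	}
	if strings.Contains(name, "/") {
		return nil, util.NewAgentError(fmt.Sprintf("clusterName must be a short name without '/': %s", name), nil)
	}
	// ... delegate to the source as before ...
}
```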

@duwenxin99
Contributor

Hi @dborowitz, other than some refactoring and nits, the PR LGTM. Let me know when the updates are made and I'll approve it. Thanks!
