
feat(dataproc): Add dataproc source and list/get clusters/jobs tools#2407

Open
dborowitz wants to merge 5 commits into googleapis:main from dborowitz:dataproc

Conversation

@dborowitz
Contributor

Description

Add a new source for Dataproc, which is closely related to Serverless Spark. Similar to get/list batches, we have get/list clusters and jobs, with minor API differences.
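
For context, a minimal sketch of how a Dataproc source might initialize its clients, assuming the cloud.google.com/go/dataproc/v2/apiv1 library and Dataproc's regional endpoints; the type and field names here are illustrative, not the actual code in internal/sources/dataproc/dataproc.go:

```go
package dataproc

import (
	"context"
	"fmt"

	dataprocapi "cloud.google.com/go/dataproc/v2/apiv1"
	"google.golang.org/api/option"
)

// Source bundles the two Dataproc controller clients (illustrative names).
type Source struct {
	Project  string
	Region   string
	Clusters *dataprocapi.ClusterControllerClient
	Jobs     *dataprocapi.JobControllerClient
}

// newSource creates ClusterController and JobController clients pinned to a
// regional endpoint, since Dataproc supports regions rather than arbitrary
// locations.
func newSource(ctx context.Context, project, region string) (*Source, error) {
	endpoint := option.WithEndpoint(fmt.Sprintf("%s-dataproc.googleapis.com:443", region))
	clusters, err := dataprocapi.NewClusterControllerClient(ctx, endpoint)
	if err != nil {
		return nil, fmt.Errorf("cluster controller client: %w", err)
	}
	jobs, err := dataprocapi.NewJobControllerClient(ctx, endpoint)
	if err != nil {
		return nil, fmt.Errorf("job controller client: %w", err)
	}
	return &Source{Project: project, Region: region, Clusters: clusters, Jobs: jobs}, nil
}
```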

PR Checklist

Thank you for opening a Pull Request! Before submitting your PR, there are a
few things you can do to make sure it goes smoothly:

  • Make sure you reviewed
    CONTRIBUTING.md
  • Make sure to open an issue as a
    bug/issue
    before writing your code! That way we can discuss the change, evaluate
    designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)
  • Make sure to add ! if this involves a breaking change

🛠️ Part of #2405

@dborowitz dborowitz requested a review from a team as a code owner February 3, 2026 05:47
@gemini-code-assist
Contributor

Summary of Changes

Hello @dborowitz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the system's capabilities by integrating with Google Cloud Dataproc. It provides a new data source and a suite of tools that enable users to programmatically interact with Dataproc clusters and jobs, offering functionalities such as retrieving specific resource details and listing resources with filtering options. This enhancement mirrors existing functionalities for Serverless Spark, providing a consistent experience for managing big data processing environments.

Highlights

  • New Dataproc Source: Introduced a new source type for Google Cloud Dataproc, allowing the system to interact with Dataproc clusters and jobs.
  • Dataproc Cluster Management Tools: Added tools to retrieve details of a specific Dataproc cluster (get_cluster) and list/filter multiple clusters (list_clusters).
  • Dataproc Job Management Tools: Implemented tools to fetch information about a particular Dataproc job (get_job) and list/filter various jobs (list_jobs).
  • Comprehensive Documentation and Testing: Included extensive documentation for the new Dataproc source and its associated tools, along with new integration tests to validate functionality.
Changelog
  • .ci/integration.cloudbuild.yaml
    • Added a new integration test step for Dataproc to the CI pipeline.
  • cmd/root.go
    • Imported new Dataproc tool and source packages to register them with the command-line interface.
  • cmd/root_test.go
    • Updated prebuilt tools tests to include the new Dataproc configuration and toolset.
  • docs/en/reference/prebuilt-tools.md
    • Added documentation for the new Dataproc prebuilt tools, detailing environment variables, required permissions, and available tools.
  • docs/en/resources/sources/dataproc.md
    • Created new documentation describing the Dataproc source, its purpose, available tools, and IAM requirements.
  • docs/en/resources/tools/dataproc/_index.md
    • Added an index file for the new Dataproc tools documentation.
  • docs/en/resources/tools/dataproc/dataproc-get-cluster.md
    • Documented the dataproc-get-cluster tool, including its parameters, compatible sources, and response format.
  • docs/en/resources/tools/dataproc/dataproc-get-job.md
    • Documented the dataproc-get-job tool, including its parameters, compatible sources, and response format.
  • docs/en/resources/tools/dataproc/dataproc-list-clusters.md
    • Documented the dataproc-list-clusters tool, including its parameters, compatible sources, and response format.
  • docs/en/resources/tools/dataproc/dataproc-list-jobs.md
    • Documented the dataproc-list-jobs tool, including its parameters, compatible sources, and response format.
  • internal/prebuiltconfigs/tools/dataproc.yaml
    • Added YAML configuration for the Dataproc source and its associated list_clusters, get_cluster, list_jobs, and get_job tools.
  • internal/sources/dataproc/dataproc.go
    • Implemented the core Dataproc source, including client initialization for Dataproc ClusterController and JobController services.
  • internal/sources/dataproc/dataproc_test.go
    • Added unit tests for parsing Dataproc source configurations from YAML, covering success and failure cases.
  • internal/tools/dataproc/common/urls.go
    • Introduced utility functions for generating Google Cloud Console URLs for Dataproc clusters and jobs, including links to monitoring and logs.
  • internal/tools/dataproc/common/urls_test.go
    • Added unit tests for the Dataproc URL generation utility functions.
  • internal/tools/dataproc/dataprocgetcluster/dataprocgetcluster.go
    • Implemented the dataproc-get-cluster tool, which retrieves details for a specified Dataproc cluster.
  • internal/tools/dataproc/dataprocgetcluster/dataprocgetcluster_test.go
    • Added unit tests for parsing the dataproc-get-cluster tool configuration.
  • internal/tools/dataproc/dataprocgetjob/dataprocgetjob.go
    • Implemented the dataproc-get-job tool, which retrieves details for a specified Dataproc job.
  • internal/tools/dataproc/dataprocgetjob/dataprocgetjob_test.go
    • Added unit tests for parsing the dataproc-get-job tool configuration.
  • internal/tools/dataproc/dataproclistclusters/dataproclistclusters.go
    • Implemented the dataproc-list-clusters tool, enabling listing and filtering of Dataproc clusters.
  • internal/tools/dataproc/dataproclistclusters/dataproclistclusters_test.go
    • Added unit tests for parsing the dataproc-list-clusters tool configuration.
  • internal/tools/dataproc/dataproclistjobs/dataproclistjobs.go
    • Implemented the dataproc-list-jobs tool, enabling listing and filtering of Dataproc jobs.
  • internal/tools/dataproc/dataproclistjobs/dataproclistjobs_test.go
    • Added unit tests for parsing the dataproc-list-jobs tool configuration.
  • tests/dataproc/dataproc_integration_test.go
    • Added comprehensive integration tests for all new Dataproc tools, covering get-cluster, get-job, list-clusters, and list-jobs functionality, including error handling and authentication.
Activity
  • The pull request introduces a new feature to integrate with Google Cloud Dataproc.
  • New Go packages and files were added to define the Dataproc source and its associated tools.
  • Configuration files were updated to include the new Dataproc source and tools.
  • Documentation was created for all new Dataproc-related components.
  • Integration tests were developed to ensure the correct functionality and robustness of the new Dataproc features.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new Dataproc source and associated tools for interacting with Dataproc clusters and jobs. While the implementation is generally solid, several critical security and stability issues have been identified. Specifically, the URL generation for Cloud Logging is vulnerable to filter injection if resource names contain double quotes, and the Invoke methods in the list tools use unsafe type assertions on user-supplied parameters, which can lead to application panics and Denial of Service (DoS). Additionally, there are areas for improvement regarding resource handling, data correctness in one of the tools, and inconsistencies in documentation and tests. Addressing these points will significantly enhance the robustness, clarity, and security of the new features.
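
To make the two flagged classes of issues concrete, here is a hedged sketch of the kind of fixes they imply (illustrative helper names, not the PR's code): escape double quotes before interpolating a resource name into a Cloud Logging filter, and use the two-value form of type assertions so bad parameters return an error instead of panicking.

```go
package common

import (
	"fmt"
	"strings"
)

// logFilter escapes double quotes before embedding a resource name in a
// Cloud Logging filter, so a name containing a double quote cannot break
// out of the quoted value and inject extra filter clauses.
func logFilter(clusterName string) string {
	escaped := strings.ReplaceAll(clusterName, `"`, `\"`)
	return fmt.Sprintf(`resource.labels.cluster_name="%s"`, escaped)
}

// stringParam uses the two-value type assertion; the one-value form
// (paramMap[key].(string)) panics when the value is not a string.
func stringParam(paramMap map[string]any, key string) (string, error) {
	v, ok := paramMap[key].(string)
	if !ok {
		return "", fmt.Errorf("parameter %q must be a string", key)
	}
	return v, nil
}
```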

@dborowitz
Contributor Author

Integration tests are failing because the Cloud Build SA is missing the Dataproc Viewer/Editor IAM roles.

@dborowitz dborowitz force-pushed the dataproc branch 2 times, most recently from 182297a to d0c3c41 on February 3, 2026 17:21
@dborowitz
Contributor Author

I made a mistake basing my original PR on a several-week-old local repo without syncing first. I cribbed from the Serverless implementation but missed this important refactoring, so I guess I need to go rewrite this now...

@dborowitz
Contributor Author

> I made a mistake basing my original PR on a several-week-old local repo without syncing first. I cribbed from the Serverless implementation but missed this important refactoring, so I guess I need to go rewrite this now...

Done, as a separate commit. I would have tried harder to change the old commits in order if the tests weren't already passing, but they were, and this is a pure refactoring.

@dborowitz dborowitz closed this Feb 3, 2026
@dborowitz dborowitz reopened this Feb 3, 2026
@dborowitz dborowitz added the tests: run Label to trigger Github Action tests. label Feb 3, 2026
@github-actions github-actions bot removed the tests: run Label to trigger Github Action tests. label Feb 3, 2026
Very closely analogous to the serverless-spark source and
serverless-spark-list-batches tool; these are separate APIs for two
closely related GCP products. There are of course minor differences in
the APIs; for example, Dataproc generally supports only regions, not
arbitrary locations.

One wrinkle is that a known issue (KI) with the list jobs RPC causes it to be very
slow in a project with many serverless batches (like the test project),
unless filtering by cluster. This is mentioned in the param description
so LLMs can provide it; in the tests, we always add it based on an env
var. Unlike other env vars, the cluster name in the test project is
arbitrary but not a secret.
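
A sketch of how the tests might wire that in (the env var and parameter names here are illustrative, not necessarily those in tests/dataproc/dataproc_integration_test.go):

```go
package dataproc_test

import (
	"os"
	"testing"
)

// TestListJobs sketches the pattern described above: always constrain
// list-jobs by cluster so the test project's many serverless batches do
// not trigger the slow ListJobs path.
func TestListJobs(t *testing.T) {
	clusterName := os.Getenv("DATAPROC_CLUSTER_NAME") // arbitrary but not secret
	if clusterName == "" {
		t.Skip("DATAPROC_CLUSTER_NAME not set")
	}
	params := map[string]any{
		"clusterName": clusterName, // avoids the known-issue slow path
	}
	_ = params // pass to the list-jobs tool invocation here
}
```
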
The previous commits adding Dataproc tool implementations were based on
an old version of main, before rebasing past 0691a6f, which refactored
serverlessspark to use the new pattern of putting business logic into
the Source.

This commit refactors all Dataproc tools at once, rather than
refactoring incrementally and dealing with the rebase of the commits in
the pending PR.
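
For readers unfamiliar with that pattern, a rough sketch of what "business logic in the Source" means here, with illustrative names (the real refactoring is in 0691a6f and this PR's follow-up commit): the tool layer only validates parameters and delegates, while the Source owns the clients and issues the RPC.

```go
package dataproc

import (
	"context"

	dataprocpb "cloud.google.com/go/dataproc/v2/apiv1/dataprocpb"
)

// GetCluster is Source-side business logic: it issues the RPC using the
// source's project, region, and ClusterController client, so each tool's
// Invoke stays a thin parameter-validation wrapper.
func (s *Source) GetCluster(ctx context.Context, name string) (*dataprocpb.Cluster, error) {
	return s.Clusters.GetCluster(ctx, &dataprocpb.GetClusterRequest{
		ProjectId:   s.Project,
		Region:      s.Region,
		ClusterName: name,
	})
}
```
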
@dborowitz dborowitz added the tests: run Label to trigger Github Action tests. label Feb 4, 2026
@github-actions github-actions bot removed the tests: run Label to trigger Github Action tests. label Feb 4, 2026
@duwenxin99 duwenxin99 added the docs: deploy-preview Label to trigger Github Action docs preview. label Feb 17, 2026

and `labels`. For example: `status.state = ACTIVE AND clusterName = mycluster`.
Supported `status.state` values are: `ACTIVE`, `INACTIVE`, `CREATING`, `RUNNING`,
`ERROR`, `DELETING`, `UPDATING`, `STOPPING`, `STOPPED`.
- **`pageSize`** (optional): The maximum number of clusters to return in a single

Can we also specify the default value here?

func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, error) {
source, err := tools.GetCompatibleSource[compatibleSource](resourceMgr, t.Config.Source, t.Name, kind)
if err != nil {
return nil, err

We just had a refactor of the Tool Invoke() function. We need to return util.ClientServerError for developer errors and util.AgentError for agent errors that can be self-corrected. Let me know if there's confusion around it.

paramMap := params.AsMap()
name, ok := paramMap["clusterName"].(string)
if !ok {
return nil, fmt.Errorf("missing required parameter: clusterName")

Suggested change
return nil, fmt.Errorf("missing required parameter: clusterName")
return nil, util.NewAgentError("invalid or missing 'clusterName' parameter", nil)

func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, error) {
source, err := tools.GetCompatibleSource[compatibleSource](resourceMgr, t.Config.Source, t.Name, kind)
if err != nil {
return nil, err

Suggested change
return nil, err
return nil, util.NewClientServerError("source used is not compatible with the tool", http.StatusInternalServerError, err)

return nil, fmt.Errorf("missing required parameter: clusterName")
}
if strings.Contains(name, "/") {
return nil, fmt.Errorf("clusterName must be a short name without '/': %s", name)

return nil, util.NewAgentError(fmt.Sprintf("clusterName must be a short name without '/': %s", name), nil)

}

// Invoke executes the tool's operation.
func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, error) {

Suggested change
func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, error) {
func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, util.ToolboxError) {
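
Putting these suggestions together, the refactored Invoke would look roughly like the fragment below; a sketch assuming the util.ToolboxError interface and the constructors named above, with the trailing RPC call elided:

```go
func (t Tool) Invoke(ctx context.Context, resourceMgr tools.SourceProvider, params parameters.ParamValues, accessToken tools.AccessToken) (any, util.ToolboxError) {
	source, err := tools.GetCompatibleSource[compatibleSource](resourceMgr, t.Config.Source, t.Name, kind)
	if err != nil {
		// Developer/config error: the wrong source kind was wired to this tool.
		return nil, util.NewClientServerError("source used is not compatible with the tool", http.StatusInternalServerError, err)
	}
	paramMap := params.AsMap()
	name, ok := paramMap["clusterName"].(string)
	if !ok {
		// Agent error: the model can self-correct by re-supplying the parameter.
		return nil, util.NewAgentError("invalid or missing 'clusterName' parameter", nil)
	}
	if strings.Contains(name, "/") {
		return nil, util.NewAgentError(fmt.Sprintf("clusterName must be a short name without '/': %s", name), nil)
	}
	// ... delegate to the source as before ...
}
```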

@duwenxin99
Contributor

Hi @dborowitz, other than some refactoring and nits, the PR LGTM. Let me know when the updates are made and I'll approve it. Thanks!
