Conversation

@rafiss
Collaborator

@rafiss rafiss commented Dec 24, 2025

Previously, INSPECT jobs without a historical AS OF SYSTEM TIME clause
would not create protected timestamp records, but still used an AOST
clause with the current timestamp. If span processing took a long time
(especially with BulkLowQoS admission control), garbage collection could
occur before the query completed, resulting in "batch timestamp must be
after replica GC threshold" errors.

This change adds per-span protected timestamp (PTS) protection when INSPECT
uses "now" as the AOST. The implementation uses a coordinator-based
approach where:

  1. When a processor starts processing a span and picks "now" as the
    timestamp, it sends a new "span started" progress message containing
    the span and timestamp via InspectProcessorProgress.

  2. The coordinator's progress tracker receives this message and calls
    TryToProtectBeforeGC for the relevant tables in that span. This
    waits until 80% of the GC TTL has elapsed before creating a PTS,
    avoiding unnecessary PTS creation for quick operations.

  3. When span processing completes (existing behavior), the coordinator
    cleans up the PTS for that span. Any remaining PTS records are
    cleaned up when the tracker terminates (e.g., on job cancellation).

This coordinator-based design keeps PTS management centralized rather
than distributed across processors, simplifying cleanup and error
handling. PTS failures are logged but don't fail the job since the
protection is best-effort.
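
To make the flow above concrete, here is a minimal, self-contained Go sketch. The names (tracker, releases, the simplified tryToProtectBeforeGC signature) are hypothetical stand-ins, and the helper's GC-TTL wait is condensed into a plain elapsed-time check; this is not the actual inspect code.

```go
package main

import (
	"fmt"
	"time"
)

// tracker is a stand-in for the coordinator's progress tracker. It
// reacts to the "span started" / "span finished" progress messages
// that processors send back over the DistSQL flow.
type tracker struct {
	releases map[string]func() // span -> PTS cleanup, if one was created
	gcTTL    time.Duration     // GC TTL of the inspected table
}

// spanStarted corresponds to steps 1 and 2: a processor picked "now"
// as its read timestamp, and the coordinator tries to protect it.
func (t *tracker) spanStarted(span string, readTS time.Time) {
	release, err := tryToProtectBeforeGC(readTS, t.gcTTL)
	if err != nil {
		// Best effort only: log and keep going, never fail the job.
		fmt.Printf("could not protect %s: %v\n", span, err)
		return
	}
	if release != nil {
		t.releases[span] = release
	}
}

// spanFinished corresponds to step 3: drop protection for spans that
// no longer need it.
func (t *tracker) spanFinished(span string) {
	if release, ok := t.releases[span]; ok {
		release()
		delete(t.releases, span)
	}
}

// terminate releases anything left over, e.g. on job cancellation.
func (t *tracker) terminate() {
	for span, release := range t.releases {
		release()
		delete(t.releases, span)
	}
}

// tryToProtectBeforeGC mimics the helper's contract as described
// above: only install a PTS once ~80% of the GC TTL has elapsed since
// the read timestamp, so quick spans never create one. It returns a
// cleanup func when it did.
func tryToProtectBeforeGC(readTS time.Time, gcTTL time.Duration) (func(), error) {
	if time.Since(readTS) < gcTTL*8/10 {
		return nil, nil // too fresh; GC cannot be close yet
	}
	fmt.Println("protected data at", readTS.Format(time.RFC3339))
	return func() { fmt.Println("released PTS for", readTS.Format(time.RFC3339)) }, nil
}

func main() {
	t := &tracker{releases: map[string]func(){}, gcTTL: time.Hour}
	t.spanStarted("/Table/104/1", time.Now().Add(-55*time.Minute)) // old enough: protected
	t.spanFinished("/Table/104/1")                                 // cleanup (step 3)
	t.terminate()                                                  // no-op here
}
```

The in-memory map of cleanup callbacks mirrors the "map of cleanups" questioned in the review below, and the second commit in this PR replaces the per-span records with a single record protected at the minimum timestamp.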

sql/inspect: use minimum timestamp for PTS protection

Previously, the INSPECT job called TryToProtectBeforeGC per span with
different timestamps. Since the job only stores one PTS record, each
new span's call to Protect would update the existing record's timestamp
via UpdateTimestamp, which removes protection for older spans.

To address this, this patch changes the PTS strategy to track the
minimum (oldest) timestamp across all active spans and protect only at
that timestamp. Since PROTECT_AFTER mode protects all data at or after
the specified timestamp, protecting at the minimum covers all active
spans. When the oldest span completes, the PTS is updated to the new
minimum timestamp, allowing GC of data between the old and new minimum.

Resolves: #159866
Epic: None

Release note: None
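
A minimal, self-contained sketch of that single-record strategy (minTimestampPTS, refresh, and the plain time.Time timestamps are illustrative stand-ins; the real code updates the job's record via UpdateTimestamp and works with hlc timestamps):

```go
package main

import (
	"fmt"
	"time"
)

// minTimestampPTS models the strategy above: the job owns exactly one
// PTS record, so it protects the oldest timestamp still in use by any
// active span. With PROTECT_AFTER semantics, everything at or after
// that timestamp is safe from GC, which covers every newer span too.
type minTimestampPTS struct {
	active      map[string]time.Time // span -> read timestamp
	protectedAt time.Time            // zero value => no PTS installed
}

// refresh recomputes the minimum and moves (or releases) the single
// record whenever that minimum changes, e.g. after a span starts or
// completes. Advancing the record is what lets GC reclaim data between
// the old and new minimum.
func (m *minTimestampPTS) refresh() {
	var min time.Time
	for _, ts := range m.active {
		if min.IsZero() || ts.Before(min) {
			min = ts
		}
	}
	switch {
	case min.IsZero():
		m.protectedAt = time.Time{} // nothing in flight: release the PTS
		fmt.Println("PTS released")
	case !min.Equal(m.protectedAt):
		m.protectedAt = min // UpdateTimestamp on the existing record
		fmt.Println("PTS protects at or after", m.protectedAt.Unix())
	}
}

func main() {
	m := &minTimestampPTS{active: map[string]time.Time{}}
	m.active["span-a"] = time.Unix(100, 0)
	m.refresh() // protects at t=100
	m.active["span-b"] = time.Unix(120, 0)
	m.refresh() // no change: span-b is already covered by PROTECT_AFTER
}
```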

@rafiss rafiss requested a review from spilchen December 24, 2025 22:14
@rafiss rafiss added the backport-26.1.x Flags PRs that need to be backported to 26.1 label Dec 24, 2025
@rafiss rafiss requested a review from a team as a code owner December 24, 2025 22:14
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@fqazi fqazi left a comment

:lgtm_strong:

@fqazi reviewed 3 files and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @spilchen).

@rafiss rafiss force-pushed the inspect-per-span-now-pts branch from 36f90d0 to 62a680e Compare December 30, 2025 06:36
@rafiss rafiss requested review from a team as code owners December 30, 2025 06:36
@rafiss rafiss requested review from kev-cao and removed request for a team December 30, 2025 06:36
@rafiss rafiss force-pushed the inspect-per-span-now-pts branch from 62a680e to 07daa4b Compare December 30, 2025 07:12
In addition to checkpointing in the job, now we also log progress to
text logs periodically in order to enhance observability.

Release note: None
@dt
Contributor

dt commented Dec 30, 2025

Moving from span to object ID in TryToProtectBeforeGC seems good to me, since the PTS system has been defined in terms of objects rather than spans for a while now. Could we go even further and use table ID in the message from the workers to the coordinator? I'd also imagine we only need to protect a given table once, even if there are multiple processors running on (spans of) it?

That said, doesn't this clobber the PTS ID stored in the job each time it is run? The switch in jobsprotectedts.Protect is actually on the list of to-be-deleted anti-patterns in jobs, at least the part that switches over specific job types to record the PTS ID in their individual legacy progress, since how any one job stores its own state should be wholly confined to its own implementation (and should stop using legacy progress). Do we need to persist the individual PTS separately in an info key (updated transactionally with creating them) to ensure they're all durably recorded for cleanup rather than the in-memory map of cleanups?

Aside, speaking of persisting: `InspectProcessorProgress` isn't persisted, right? So it should be in execinfrapb instead of jobspb?

@github-actions

github-actions bot commented Jan 5, 2026

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Jan 5, 2026
@rafiss
Collaborator Author

rafiss commented Jan 5, 2026

doesn't this clobber the PTS ID stored in the job each time it is run?

Thanks for pointing that out. I didn't realize that the PTS system would clobber the existing record. I had thought it would protect all the timestamps until each one had its cleaner called, but on further reflection the way it works now makes sense.

I have reworked this (currently in a separate commit, but I can squash if that's preferred) so that INSPECT only protects the minimum timestamp that is currently being used by all processors. When that timestamp is done, that PTS record will be cleaned up and the coordinator finds the next smallest timestamp that needs to be protected.

Could we go even further and use table ID in the message from the workers to the coordinator?

I don't think so. We still need to know when a specific span starts being processed and is done being processed, since the timestamp that needs to be protected then later cleaned up is associated with that specific span.
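
A compressed, hypothetical walk-through of that behavior, with integers standing in for timestamps:

```go
package main

import "fmt"

// Each event recomputes the minimum across active spans; that minimum
// is what the job's single PTS record protects.
func main() {
	active := map[string]int64{} // span -> the "now" timestamp it picked
	minTS := func() int64 {
		var m int64
		for _, ts := range active {
			if m == 0 || ts < m {
				m = ts
			}
		}
		return m
	}

	active["A"] = 10
	fmt.Println("A started  -> protect at", minTS()) // 10
	active["B"] = 12
	fmt.Println("B started  -> protect at", minTS()) // still 10; B is covered
	delete(active, "A")
	fmt.Println("A finished -> protect at", minTS()) // 12; data between 10 and 12 can now be GC'd
	delete(active, "B")
	fmt.Println("B finished -> protect at", minTS()) // 0, i.e. release the PTS record
}
```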

@rafiss rafiss changed the title from sql/inspect: add per-span protected timestamp for "now" AOST case to sql/inspect: add protected timestamp for "now" AOST case Jan 5, 2026
@rafiss rafiss requested a review from fqazi January 5, 2026 20:20
Previously, TryToProtectBeforeGC accepted a catalog.TableDescriptor
parameter but only used it to call GetID() in two places. This was
unnecessarily restrictive and forced callers to load a full table
descriptor just to pass the ID.

This change simplifies the function signature to accept a descpb.ID
directly. The most significant improvement is in inspect/progress.go,
where this eliminates an unnecessary DescsTxn call that was only used
to load the descriptor for its ID.

Release note: None
Epic: None
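
A hypothetical before/after sketch of that signature change, with TableID and tableDescriptor standing in for descpb.ID and catalog.TableDescriptor:

```go
package main

import "fmt"

// TableID plays the role of descpb.ID: a plain integer identifier.
type TableID uint32

// tableDescriptor stands in for catalog.TableDescriptor; loading one
// can require a catalog transaction (the DescsTxn call mentioned above).
type tableDescriptor struct{ id TableID }

func (d *tableDescriptor) GetID() TableID { return d.id }

// Before: the helper only ever used desc.GetID(), yet callers had to
// load the whole descriptor first.
func tryToProtectBeforeGCOld(desc *tableDescriptor) {
	fmt.Println("protect table", desc.GetID())
}

// After: accept the ID directly; callers that already know the ID
// (e.g. from decoding a span's table prefix) skip the descriptor load.
func tryToProtectBeforeGCNew(id TableID) {
	fmt.Println("protect table", id)
}

func main() {
	tryToProtectBeforeGCOld(&tableDescriptor{id: 104})
	tryToProtectBeforeGCNew(104)
}
```
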
@rafiss rafiss force-pushed the inspect-per-span-now-pts branch from 4e881fe to c344d07 Compare January 5, 2026 20:22
@github-actions

github-actions bot commented Jan 5, 2026

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@rafiss rafiss added the O-AI-Review-Not-Helpful AI reviewer produced result which was incorrect or unhelpful label Jan 5, 2026
Contributor

@spilchen spilchen left a comment

Looks great. Just a couple of minor nits.
:lgtm:

@spilchen partially reviewed 7 files and made 3 comments.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @fqazi, @kev-cao, and @rafiss).


pkg/sql/inspect/progress.go line 342 at r8 (raw file):

	if !needsNewPTS {
		log.VEventf(ctx, 2, "INSPECT: span %s at %s covered by existing PTS at %s",
			spanStarted, tsToProtect, t.mu.currentPTSTimestamp)

do we have a mutex held for t.mu.currentPTSTimestamp?


pkg/sql/inspect/progress.go line 361 at r8 (raw file):

		_, tableID, err := t.codec.DecodeTablePrefix(spanStarted.Key)
		if err != nil {
			log.Dev.Warningf(ctx, "failed to decode table ID from span %s: %v", spanStarted, err)

do we need to remove the span from activeSpanTimestamps if we get in here? Or can we extract the tableID sooner, right at the very start?

Collaborator Author

@rafiss rafiss left a comment

@rafiss made 2 comments.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @fqazi, @kev-cao, and @spilchen).


pkg/sql/inspect/progress.go line 342 at r8 (raw file):

Previously, spilchen wrote…

do we have a mutex held for t.mu.currentPTSTimestamp?

nice catch; fixed
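
For reference, the fix presumably amounts to the usual mutex-guarded access pattern; a simplified, hypothetical sketch (not the actual progress.go code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tracker mirrors the shape hinted at in the review thread: anything
// under t.mu must only be read or written while t.mu is held.
type tracker struct {
	mu struct {
		sync.Mutex
		currentPTSTimestamp time.Time
	}
}

// currentPTS reads the protected timestamp under the lock and returns
// a copy, so callers (such as the log.VEventf in the diff above) never
// touch t.mu.currentPTSTimestamp without holding t.mu.
func (t *tracker) currentPTS() time.Time {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.mu.currentPTSTimestamp
}

func main() {
	t := &tracker{}
	t.mu.Lock()
	t.mu.currentPTSTimestamp = time.Unix(100, 0)
	t.mu.Unlock()
	fmt.Println("covered by existing PTS at", t.currentPTS())
}
```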


pkg/sql/inspect/progress.go line 361 at r8 (raw file):

Previously, spilchen wrote…

do we need to remove the span from activeSpanTimestamps if we get in here? Or can we extract the tableID sooner, right at the very start?

I don't think we need to remove it. Keeping it in activeSpanTimestamps actually seems desirable, since the span is still being actively processed, even if we fail to add the protected timestamp (or fail to decode the key).

@rafiss rafiss force-pushed the inspect-per-span-now-pts branch from c344d07 to 3224624 Compare January 6, 2026 21:52
@github-actions

github-actions bot commented Jan 6, 2026

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@rafiss rafiss force-pushed the inspect-per-span-now-pts branch from 3224624 to 1e07873 Compare January 6, 2026 22:01
@rafiss
Collaborator Author

rafiss commented Jan 6, 2026

tftr!

bors r+

craig bot pushed a commit that referenced this pull request Jan 6, 2026
160138: sql/inspect: add protected timestamp for "now" AOST case r=rafiss a=rafiss

Previously, INSPECT jobs without a historical AS OF SYSTEM TIME clause
would not create protected timestamp records, but still used an AOST
clause with the current timestamp. If span processing took a long time
(especially with BulkLowQoS admission control), garbage collection could
occur before the query completed, resulting in "batch timestamp must be
after replica GC threshold" errors.

This change adds per-span protected timestamp protection when INSPECT
uses "now" as the AOST. The implementation uses a coordinator-based
approach where:

1. When a processor starts processing a span and picks "now" as the
    timestamp, it sends a new "span started" progress message containing
    the span and timestamp via InspectProcessorProgress.

2. The coordinator's progress tracker receives this message and calls
    TryToProtectBeforeGC for the relevant tables in that span. This
    waits until 80% of the GC TTL has elapsed before creating a PTS,
    avoiding unnecessary PTS creation for quick operations.

3. When span processing completes (existing behavior), the coordinator
    cleans up the PTS for that span. Any remaining PTS records are
    cleaned up when the tracker terminates (e.g., on job cancellation).

This coordinator-based design keeps PTS management centralized rather
than distributed across processors, simplifying cleanup and error
handling. PTS failures are logged but don't fail the job since the
protection is best-effort.

### sql/inspect: use minimum timestamp for PTS protection

Previously, the INSPECT job called TryToProtectBeforeGC per span with
different timestamps. Since the job only stores one PTS record, each
new span's call to Protect would update the existing record's timestamp
via UpdateTimestamp, which removes protection for older spans.

To address this, this patch changes the PTS strategy to track the
minimum (oldest) timestamp across all active spans and protect only at
that timestamp. Since PROTECT_AFTER mode protects all data at or after
the specified timestamp, protecting at the minimum covers all active
spans. When the oldest span completes, the PTS is updated to the new
minimum timestamp, allowing GC of data between the old and new minimum.

Resolves: #159866
Epic: None

Release note: None

160570: sql/ttl: enable TTL tests to run with secondary tenants r=rafiss a=rafiss

Previously, TTL tests used `TestIsForStuffThatShouldWorkWithSecondaryTenantsButDoesntYet` and manually controlled tenant creation. This prevented the tests from benefiting from the standard tenant randomization in the test framework.

This commit makes several changes to enable TTL tests to work with tenants:

1. Updates `newRowLevelTTLTestJobTestHelper` to use `ApplicationLayer(0)` instead of manually starting tenants, leveraging the built-in tenant randomization logic.

2. Fixes `SplitTable` in testcluster to use `TestingMakePrimaryIndexKeyForTenant` with the correct codec, so range splits work correctly for tenant tables.

3. Fixes external process tenant startup to propagate version settings from the parent, allowing tenants to start when the cluster is running at an older version (e.g., MinSupported).

4. Removes the `testMultiTenant` parameter from the test helper since tenant mode is now controlled by the framework's randomization.

Resolves: #109391

Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
@craig
Contributor

craig bot commented Jan 6, 2026

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 7, 2026

@craig craig bot merged commit 2d75bf2 into cockroachdb:master Jan 7, 2026
22 of 24 checks passed
@blathers-crl

blathers-crl bot commented Jan 7, 2026

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 0c988ff to blathers/backport-release-26.1-160138: POST https://api.github.com/repos/rafiss/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 26.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
