Skip to content

more traces#37833

Open
dorooleg wants to merge 2 commits intoydb-platform:mainfrom
dorooleg:more_traces
Open

more traces#37833
dorooleg wants to merge 2 commits intoydb-platform:mainfrom
dorooleg:more_traces

Conversation

@dorooleg
Copy link
Copy Markdown
Collaborator

@dorooleg dorooleg commented Apr 9, 2026

Changelog entry

...

Changelog category

  • Not for changelog (changelog entry is not required)

Description for reviewers

...

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 9, 2026

2026-04-09 21:56:25 UTC Pre-commit check linux-x86_64-release-asan for c8aacee has started.
2026-04-09 21:57:05 UTC Artifacts will be uploaded here
2026-04-09 21:59:12 UTC ya make is running...
🟡 2026-04-09 23:44:28 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
18744 18693 0 11 32 8

🟢 2026-04-09 23:44:37 UTC Build successful.
🟡 2026-04-09 23:45:01 UTC ydbd size 3.9 GiB changed* by +163.4 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: c000db8 merge: c8aacee diff diff %
ydbd size 4 194 302 128 Bytes 4 194 469 488 Bytes +163.4 KiB +0.004%
ydbd stripped size 1 568 909 464 Bytes 1 569 006 104 Bytes +94.4 KiB +0.006%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@ydbot
Copy link
Copy Markdown
Collaborator

ydbot commented Apr 9, 2026

Run Extra Tests

Run additional tests for this PR. You can customize:

  • Test Size: small, medium, large (default: all)
  • Test Targets: any directory path (default: ydb/)
  • Sanitizers: ASAN, MSAN, TSAN
  • Coredumps: enable for debugging (default: off)
  • Additional args: custom ya make arguments

▶  Run tests

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 9, 2026

2026-04-09 21:59:55 UTC Pre-commit check linux-x86_64-relwithdebinfo for c8aacee has started.
2026-04-09 22:00:14 UTC Artifacts will be uploaded here
2026-04-09 22:02:47 UTC ya make is running...
🟡 2026-04-10 00:28:12 UTC Some tests failed, follow the links below. Going to retry failed tests...

Details

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
49442 45846 0 4 3536 56

2026-04-10 00:28:28 UTC ya make is running... (failed tests rerun, try 2)
🟢 2026-04-10 00:30:16 UTC Tests successful.

Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
264 (only retried tests) 264 0 0 0 0

🟢 2026-04-10 00:30:20 UTC Build successful.
🟡 2026-04-10 00:30:37 UTC ydbd size 2.4 GiB changed* by +101.6 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: c000db8 merge: c8aacee diff diff %
ydbd size 2 572 835 448 Bytes 2 572 939 496 Bytes +101.6 KiB +0.004%
ydbd stripped size 545 959 592 Bytes 546 015 528 Bytes +54.6 KiB +0.010%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@dorooleg dorooleg marked this pull request as ready for review April 10, 2026 13:05
@dorooleg dorooleg requested a review from a team as a code owner April 10, 2026 13:05
Copilot AI review requested due to automatic review settings April 10, 2026 13:05
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 10, 2026

2026-04-10 13:06:39 UTC Pre-commit check linux-x86_64-relwithdebinfo for bf0a165 has started.
2026-04-10 13:06:57 UTC Artifacts will be uploaded here
2026-04-10 13:09:21 UTC ya make is running...
🟡 2026-04-10 15:39:31 UTC Some tests failed, follow the links below. Going to retry failed tests...

Details

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
49320 45767 0 4 3536 13

2026-04-10 15:40:15 UTC ya make is running... (failed tests rerun, try 2)
🟢 2026-04-10 15:44:10 UTC Tests successful.

Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
65 (only retried tests) 65 0 0 0 0

🟢 2026-04-10 15:44:13 UTC Build successful.
🟡 2026-04-10 15:44:30 UTC ydbd size 2.4 GiB changed* by +341.4 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: a0853bc merge: bf0a165 diff diff %
ydbd size 2 573 990 944 Bytes 2 574 340 568 Bytes +341.4 KiB +0.014%
ydbd stripped size 546 037 864 Bytes 546 147 752 Bytes +107.3 KiB +0.020%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 10, 2026

2026-04-10 13:07:05 UTC Pre-commit check linux-x86_64-release-asan for bf0a165 has started.
2026-04-10 13:07:24 UTC Artifacts will be uploaded here
2026-04-10 13:09:48 UTC ya make is running...
🟡 2026-04-10 15:02:39 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
18697 18630 0 27 34 6

🟢 2026-04-10 15:02:48 UTC Build successful.
🟡 2026-04-10 15:03:17 UTC ydbd size 3.9 GiB changed* by +571.4 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: a0853bc merge: bf0a165 diff diff %
ydbd size 4 195 909 192 Bytes 4 196 494 304 Bytes +571.4 KiB +0.014%
ydbd stripped size 1 569 419 704 Bytes 1 569 679 576 Bytes +253.8 KiB +0.017%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands ColumnShard scan tracing by introducing a dedicated LWTrace provider for scan-level events and propagating an NLWTrace::TOrbit through scan construction, metadata selection, and per-source processing to enable richer end-to-end trace correlation.

Changes:

  • Added new YDB_CS_SCAN LWTrace provider and migrated/extended scan-related probes (scan start/finish, per-source start/finish, send result, ColumnEngine selection).
  • Propagated a scan Orbit + ScanId through TReadDescription → metadata accessor → IColumnEngine::Select() → scan actor/context.
  • Added per-source identifiers to results/task events (SourceId) and enriched data-source/fetching step tracing.

Reviewed changes

Copilot reviewed 36 out of 36 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
ydb/core/tx/columnshard/tracing/probes.h Removes ColumnEngineForLogsSelect probe from the ColumnShard-level provider (moved under scan tracing).
ydb/core/tx/columnshard/engines/reader/transaction/tx_scan.cpp Creates scan orbit, sets it on TReadDescription, emits StartScan, and passes orbit/pathId into TColumnShardScan.
ydb/core/tx/columnshard/engines/reader/transaction/tx_internal_scan.cpp Adds scan tracing usage and passes orbit into TColumnShardScan (but currently too late for metadata selection).
ydb/core/tx/columnshard/engines/reader/tracing/probes.h Introduces YDB_CS_SCAN probes (scan lifecycle, per-source, send result, ColumnEngine selection).
ydb/core/tx/columnshard/engines/reader/tracing/probes.cpp Renames provider definition to YDB_CS_SCAN.
ydb/core/tx/columnshard/engines/reader/tracing/data_source_probes.h Expands data-source probe surface (program step probes, step-level probes, lifecycle).
ydb/core/tx/columnshard/engines/reader/simple_reader/iterator/sync_points/result.cpp Propagates per-source id into TPartialReadResult.
ydb/core/tx/columnshard/engines/reader/simple_reader/iterator/sync_points/aggr.h Propagates per-source id into TPartialReadResult for aggregated results.
ydb/core/tx/columnshard/engines/reader/simple_reader/iterator/scanner.cpp Calls OnStartProcessing() when a source begins processing.
ydb/core/tx/columnshard/engines/reader/simple_reader/iterator/fetching.h Adds per-step tracing hooks (ReportTracing) across multiple fetching steps.
ydb/core/tx/columnshard/engines/reader/simple_reader/iterator/fetching.cpp Emits detailed step-level probes and forwards source id through task processed results.
ydb/core/tx/columnshard/engines/reader/plain_reader/iterator/source.cpp Calls OnStartProcessing() on first interval registration.
ydb/core/tx/columnshard/engines/reader/common/result.h Adds SourceId to TPartialReadResult to correlate results with sources.
ydb/core/tx/columnshard/engines/reader/common/result.cpp Stores the new SourceId field in TPartialReadResult.
ydb/core/tx/columnshard/engines/reader/common/description.h Adds ScanId and Orbit to TReadDescription for trace propagation.
ydb/core/tx/columnshard/engines/reader/common/conveyor_task.h Adds GetSourceId() to IApplyAction for downstream correlation.
ydb/core/tx/columnshard/engines/reader/common/conveyor_task.cpp Includes source id when emitting TEvTaskProcessedResult.
ydb/core/tx/columnshard/engines/reader/common_reader/iterator/ya.make Adds build dependency on tracing module.
ydb/core/tx/columnshard/engines/reader/common_reader/iterator/source.h Declares OnStartProcessing() for common reader sources.
ydb/core/tx/columnshard/engines/reader/common_reader/iterator/source.cpp Implements OnStartProcessing() and emits scan/data-source lifecycle probes; adjusts source-finished tracing ids.
ydb/core/tx/columnshard/engines/reader/common_reader/iterator/fetching.h Adds cached source id in step tasks and tracing helpers for program steps.
ydb/core/tx/columnshard/engines/reader/common_reader/iterator/fetching.cpp Refactors program-step tracing into per-processor probes and propagates source id from tasks.
ydb/core/tx/columnshard/engines/reader/common_reader/iterator/fetch_steps.h Adds tracing helper methods for key fetching steps.
ydb/core/tx/columnshard/engines/reader/common_reader/iterator/fetch_steps.cpp Consolidates start/finish probes into single-step probes and switches ids to portion/source id.
ydb/core/tx/columnshard/engines/reader/common_reader/constructor/read_metadata.cpp Passes orbit into metadata selection context.
ydb/core/tx/columnshard/engines/reader/actor/actor.h Extends scan actor to store orbit/pathId and include source id in SendResult().
ydb/core/tx/columnshard/engines/reader/actor/actor.cpp Emits scan lifecycle/send-result probes and forwards source id into trace events.
ydb/core/tx/columnshard/engines/reader/abstract/read_context.h Stores scan orbit in TReadContext and extends ctor signature.
ydb/core/tx/columnshard/engines/reader/abstract/read_context.cpp Wires scan orbit into TReadContext construction.
ydb/core/tx/columnshard/engines/metadata_accessor.h Extends selection context to carry orbit for downstream selection tracing.
ydb/core/tx/columnshard/engines/metadata_accessor.cpp Passes orbit/txId/scanId into IColumnEngine::Select().
ydb/core/tx/columnshard/engines/column_engine.h Extends IColumnEngine::Select() signature to accept orbit + ids for tracing.
ydb/core/tx/columnshard/engines/column_engine_logs.h Updates TColumnEngineForLogs::Select() override signature to match interface changes.
ydb/core/tx/columnshard/engines/column_engine_logs.cpp Emits ColumnEngineForLogsSelect via scan provider using orbit; adds selection context fields.
ydb/core/tx/columnshard/columnshard.cpp Registers YDB_CS_SCAN probes in the LWTrace probe registry.
ydb/core/tx/columnshard/columnshard_private_events.h Extends TEvTaskProcessedResult with SourceId accessor/storage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +97 to 99
AFL_VERIFY(Orbit);
LWTRACK(ColumnEngineForLogsSelect, *Orbit, PathId.GetRawValue(), TabletId, TxId, ScanId, timeOfInserted.MilliSeconds(), timeOfCommitted.MilliSeconds(),
TotalPortionsCount, TotalFilteredPortionsCount, Result.size());
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TPortionsSelector can abort when ColumnEngineForLogsSelect is enabled but orbit is null: CalculateProbe becomes true via LWPROBE_ENABLED(...), then AFL_VERIFY(Orbit) triggers. This is currently reachable (e.g., internal scans build read metadata without setting TReadDescription::Orbit). Consider either requiring/passing a non-null orbit on all Select() call paths, or making probe emission conditional on Orbit (and using LWPROBE when Orbit is null) so tracing can't crash production code.

Suggested change
AFL_VERIFY(Orbit);
LWTRACK(ColumnEngineForLogsSelect, *Orbit, PathId.GetRawValue(), TabletId, TxId, ScanId, timeOfInserted.MilliSeconds(), timeOfCommitted.MilliSeconds(),
TotalPortionsCount, TotalFilteredPortionsCount, Result.size());
if (Orbit) {
LWTRACK(ColumnEngineForLogsSelect, *Orbit, PathId.GetRawValue(), TabletId, TxId, ScanId, timeOfInserted.MilliSeconds(), timeOfCommitted.MilliSeconds(),
TotalPortionsCount, TotalFilteredPortionsCount, Result.size());
} else {
LWPROBE(ColumnEngineForLogsSelect, PathId.GetRawValue(), TabletId, TxId, ScanId, timeOfInserted.MilliSeconds(), timeOfCommitted.MilliSeconds(),
TotalPortionsCount, TotalFilteredPortionsCount, Result.size());
}

Copilot uses AI. Check for mistakes.
Comment on lines 118 to +126
if (AppDataVerified().ColumnShardConfig.GetEnableDiagnostics()) {
scanDiagnosticsEvent->RequestId = requestCookie;
ctx.Send(Self->ScanDiagnosticsActorId, std::move(scanDiagnosticsEvent));
}
auto orbit = std::make_shared<NLWTrace::TOrbit>();
LWTRACK(StartScan, *orbit, request.GetPathId().GetInternalPathId().GetRawValue(), Self->TabletID(), request.GetLockId().value_or(0), ScanId);
auto scanActorId = ctx.Register(new TColumnShardScan(Self->SelfId(), scanComputeActor, Self->ScanDiagnosticsActorId, Self->GetStoragesManager(),
Self->DataAccessorsManager.GetObjectPtrVerified(), Self->ColumnDataManager.GetObjectPtrVerified(), TComputeShardingPolicy(), ScanId, request.GetLockId().value_or(0), ScanGen, requestCookie,
Self->TabletID(), TDuration::Max(), readMetadataRange, NKikimrDataEvents::FORMAT_ARROW, Self->Counters.GetScanCounters(), {}));
Self->TabletID(), TDuration::Max(), readMetadataRange, NKikimrDataEvents::FORMAT_ARROW, Self->Counters.GetScanCounters(), {}, std::move(orbit)));
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

orbit is created only after read metadata is built, but TReadMetadata::Init() now passes readDescription.Orbit into metadata selection (and ultimately into IColumnEngine::Select()). With the current flow, readDescription.Orbit is null during portion selection, which can lead to an assertion in TColumnEngineForLogs::Select() when ColumnEngineForLogsSelect tracing is enabled. Create/assign the orbit to TReadDescription (and set read.ScanId/read.TxId if needed) before building readMetadataRange so selection has the orbit available.

Copilot uses AI. Check for mistakes.
const std::shared_ptr<NDataAccessorControl::IDataAccessorsManager>& dataAccessorsManager,
const std::shared_ptr<NColumnFetching::TColumnDataManager>& columnDataManager, const NColumnShard::TConcreteScanCounters& counters,
const TReadMetadataBase::TConstPtr& readMetadata, const TActorId& scanActorId, const TActorId& resourceSubscribeActorId,
const TActorId& readCoordinatorActorId, const TComputeShardingPolicy& computeShardingPolicy, const ui64 scanId,
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TReadContext ctor signature now requires scanOrbit, but at least one call site still uses the old signature (e.g. ydb/core/tx/columnshard/engines/reader/simple_reader/duplicates/ut/ut_manager.cpp:291 constructs TReadContext without the new argument). This will break the build/tests; please update remaining call sites (and decide what orbit to pass in unit tests, e.g. std::make_shared<NLWTrace::TOrbit>() or nullptr if the code tolerates it).

Suggested change
const TActorId& readCoordinatorActorId, const TComputeShardingPolicy& computeShardingPolicy, const ui64 scanId,
const TActorId& readCoordinatorActorId, const TComputeShardingPolicy& computeShardingPolicy, const ui64 scanId,
const NConveyorComposite::TCPULimitsConfig& cpuLimits)
: TReadContext(storagesManager, dataAccessorsManager, columnDataManager, counters, readMetadata, scanActorId,
resourceSubscribeActorId, readCoordinatorActorId, computeShardingPolicy, scanId, cpuLimits, nullptr) {
}
TReadContext(const std::shared_ptr<IStoragesManager>& storagesManager,
const std::shared_ptr<NDataAccessorControl::IDataAccessorsManager>& dataAccessorsManager,
const std::shared_ptr<NColumnFetching::TColumnDataManager>& columnDataManager, const NColumnShard::TConcreteScanCounters& counters,
const TReadMetadataBase::TConstPtr& readMetadata, const TActorId& scanActorId, const TActorId& resourceSubscribeActorId,
const TActorId& readCoordinatorActorId, const TComputeShardingPolicy& computeShardingPolicy, const ui64 scanId,

Copilot uses AI. Check for mistakes.
Comment on lines +62 to +67
const auto& currentCategoryName = iterator->GetCurrentNode().GetSignalCategoryName();
const TString tracingName = source->GetExecutionContext().GetPrevCategoryName() + " - " + currentCategoryName;
const TString tracingExecutionResult = source->GetExecutionContext().GetPrevExecutionResult() + " - " + currentExecutionResult;
const TDuration finishDurationMs = source->GetAndResetWaitDuration();
const auto& processor = iterator->GetProcessorVerified();
const auto processorType = processor->GetProcessorType();
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TProgramStep::ReportTracing() builds details = processor->DebugJson().GetStringRobust() and multiple concatenated TStrings unconditionally, even when no relevant LWTrace probes are enabled and the orbit has no shuttles. This can add significant overhead to the scan hot path. Consider guarding the whole method with a fast check (e.g., if (!(source->GetDataSourceOrbit().HasShuttles() || LWPROBE_ENABLED(ProgramConst) || ...)) return;) and only computing details/strings inside the enabled path.

Suggested change
const auto& currentCategoryName = iterator->GetCurrentNode().GetSignalCategoryName();
const TString tracingName = source->GetExecutionContext().GetPrevCategoryName() + " - " + currentCategoryName;
const TString tracingExecutionResult = source->GetExecutionContext().GetPrevExecutionResult() + " - " + currentExecutionResult;
const TDuration finishDurationMs = source->GetAndResetWaitDuration();
const auto& processor = iterator->GetProcessorVerified();
const auto processorType = processor->GetProcessorType();
const auto& processor = iterator->GetProcessorVerified();
const auto processorType = processor->GetProcessorType();
bool shouldTrace = source->GetDataSourceOrbit().HasShuttles();
if (!shouldTrace) {
switch (processorType) {
case NArrow::NSSA::EProcessorType::Const:
shouldTrace = LWPROBE_ENABLED(ProgramConst);
break;
case NArrow::NSSA::EProcessorType::Calculation:
shouldTrace = LWPROBE_ENABLED(ProgramCalculation);
break;
case NArrow::NSSA::EProcessorType::Projection:
shouldTrace = LWPROBE_ENABLED(ProgramProjection);
break;
case NArrow::NSSA::EProcessorType::Filter:
shouldTrace = LWPROBE_ENABLED(ProgramFilter);
break;
case NArrow::NSSA::EProcessorType::Aggregation:
shouldTrace = LWPROBE_ENABLED(ProgramAggregation);
break;
case NArrow::NSSA::EProcessorType::FetchOriginalData:
shouldTrace = LWPROBE_ENABLED(ProgramFetchOriginalData);
break;
case NArrow::NSSA::EProcessorType::AssembleOriginalData:
shouldTrace = LWPROBE_ENABLED(ProgramAssembleOriginalData);
break;
case NArrow::NSSA::EProcessorType::CheckIndexData:
shouldTrace = LWPROBE_ENABLED(ProgramCheckIndexData);
break;
case NArrow::NSSA::EProcessorType::CheckHeaderData:
shouldTrace = LWPROBE_ENABLED(ProgramCheckHeaderData);
break;
case NArrow::NSSA::EProcessorType::StreamLogic:
shouldTrace = LWPROBE_ENABLED(ProgramStreamLogic);
break;
case NArrow::NSSA::EProcessorType::ReserveMemory:
shouldTrace = LWPROBE_ENABLED(ProgramReserveMemory);
break;
case NArrow::NSSA::EProcessorType::Unknown:
break;
}
}
if (!shouldTrace) {
return;
}
const auto& currentCategoryName = iterator->GetCurrentNode().GetSignalCategoryName();
const TString tracingName = source->GetExecutionContext().GetPrevCategoryName() + " - " + currentCategoryName;
const TString tracingExecutionResult = source->GetExecutionContext().GetPrevExecutionResult() + " - " + currentExecutionResult;
const TDuration finishDurationMs = source->GetAndResetWaitDuration();

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants