Conversation
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
There was a problem hiding this comment.
Pull request overview
This PR expands ColumnShard scan tracing by introducing a dedicated LWTrace provider for scan-level events and propagating an NLWTrace::TOrbit through scan construction, metadata selection, and per-source processing to enable richer end-to-end trace correlation.
Changes:
- Added new
YDB_CS_SCANLWTrace provider and migrated/extended scan-related probes (scan start/finish, per-source start/finish, send result, ColumnEngine selection). - Propagated a scan
Orbit+ScanIdthroughTReadDescription→ metadata accessor →IColumnEngine::Select()→ scan actor/context. - Added per-source identifiers to results/task events (
SourceId) and enriched data-source/fetching step tracing.
Reviewed changes
Copilot reviewed 36 out of 36 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| ydb/core/tx/columnshard/tracing/probes.h | Removes ColumnEngineForLogsSelect probe from the ColumnShard-level provider (moved under scan tracing). |
| ydb/core/tx/columnshard/engines/reader/transaction/tx_scan.cpp | Creates scan orbit, sets it on TReadDescription, emits StartScan, and passes orbit/pathId into TColumnShardScan. |
| ydb/core/tx/columnshard/engines/reader/transaction/tx_internal_scan.cpp | Adds scan tracing usage and passes orbit into TColumnShardScan (but currently too late for metadata selection). |
| ydb/core/tx/columnshard/engines/reader/tracing/probes.h | Introduces YDB_CS_SCAN probes (scan lifecycle, per-source, send result, ColumnEngine selection). |
| ydb/core/tx/columnshard/engines/reader/tracing/probes.cpp | Renames provider definition to YDB_CS_SCAN. |
| ydb/core/tx/columnshard/engines/reader/tracing/data_source_probes.h | Expands data-source probe surface (program step probes, step-level probes, lifecycle). |
| ydb/core/tx/columnshard/engines/reader/simple_reader/iterator/sync_points/result.cpp | Propagates per-source id into TPartialReadResult. |
| ydb/core/tx/columnshard/engines/reader/simple_reader/iterator/sync_points/aggr.h | Propagates per-source id into TPartialReadResult for aggregated results. |
| ydb/core/tx/columnshard/engines/reader/simple_reader/iterator/scanner.cpp | Calls OnStartProcessing() when a source begins processing. |
| ydb/core/tx/columnshard/engines/reader/simple_reader/iterator/fetching.h | Adds per-step tracing hooks (ReportTracing) across multiple fetching steps. |
| ydb/core/tx/columnshard/engines/reader/simple_reader/iterator/fetching.cpp | Emits detailed step-level probes and forwards source id through task processed results. |
| ydb/core/tx/columnshard/engines/reader/plain_reader/iterator/source.cpp | Calls OnStartProcessing() on first interval registration. |
| ydb/core/tx/columnshard/engines/reader/common/result.h | Adds SourceId to TPartialReadResult to correlate results with sources. |
| ydb/core/tx/columnshard/engines/reader/common/result.cpp | Stores the new SourceId field in TPartialReadResult. |
| ydb/core/tx/columnshard/engines/reader/common/description.h | Adds ScanId and Orbit to TReadDescription for trace propagation. |
| ydb/core/tx/columnshard/engines/reader/common/conveyor_task.h | Adds GetSourceId() to IApplyAction for downstream correlation. |
| ydb/core/tx/columnshard/engines/reader/common/conveyor_task.cpp | Includes source id when emitting TEvTaskProcessedResult. |
| ydb/core/tx/columnshard/engines/reader/common_reader/iterator/ya.make | Adds build dependency on tracing module. |
| ydb/core/tx/columnshard/engines/reader/common_reader/iterator/source.h | Declares OnStartProcessing() for common reader sources. |
| ydb/core/tx/columnshard/engines/reader/common_reader/iterator/source.cpp | Implements OnStartProcessing() and emits scan/data-source lifecycle probes; adjusts source-finished tracing ids. |
| ydb/core/tx/columnshard/engines/reader/common_reader/iterator/fetching.h | Adds cached source id in step tasks and tracing helpers for program steps. |
| ydb/core/tx/columnshard/engines/reader/common_reader/iterator/fetching.cpp | Refactors program-step tracing into per-processor probes and propagates source id from tasks. |
| ydb/core/tx/columnshard/engines/reader/common_reader/iterator/fetch_steps.h | Adds tracing helper methods for key fetching steps. |
| ydb/core/tx/columnshard/engines/reader/common_reader/iterator/fetch_steps.cpp | Consolidates start/finish probes into single-step probes and switches ids to portion/source id. |
| ydb/core/tx/columnshard/engines/reader/common_reader/constructor/read_metadata.cpp | Passes orbit into metadata selection context. |
| ydb/core/tx/columnshard/engines/reader/actor/actor.h | Extends scan actor to store orbit/pathId and include source id in SendResult(). |
| ydb/core/tx/columnshard/engines/reader/actor/actor.cpp | Emits scan lifecycle/send-result probes and forwards source id into trace events. |
| ydb/core/tx/columnshard/engines/reader/abstract/read_context.h | Stores scan orbit in TReadContext and extends ctor signature. |
| ydb/core/tx/columnshard/engines/reader/abstract/read_context.cpp | Wires scan orbit into TReadContext construction. |
| ydb/core/tx/columnshard/engines/metadata_accessor.h | Extends selection context to carry orbit for downstream selection tracing. |
| ydb/core/tx/columnshard/engines/metadata_accessor.cpp | Passes orbit/txId/scanId into IColumnEngine::Select(). |
| ydb/core/tx/columnshard/engines/column_engine.h | Extends IColumnEngine::Select() signature to accept orbit + ids for tracing. |
| ydb/core/tx/columnshard/engines/column_engine_logs.h | Updates TColumnEngineForLogs::Select() override signature to match interface changes. |
| ydb/core/tx/columnshard/engines/column_engine_logs.cpp | Emits ColumnEngineForLogsSelect via scan provider using orbit; adds selection context fields. |
| ydb/core/tx/columnshard/columnshard.cpp | Registers YDB_CS_SCAN probes in the LWTrace probe registry. |
| ydb/core/tx/columnshard/columnshard_private_events.h | Extends TEvTaskProcessedResult with SourceId accessor/storage. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| AFL_VERIFY(Orbit); | ||
| LWTRACK(ColumnEngineForLogsSelect, *Orbit, PathId.GetRawValue(), TabletId, TxId, ScanId, timeOfInserted.MilliSeconds(), timeOfCommitted.MilliSeconds(), | ||
| TotalPortionsCount, TotalFilteredPortionsCount, Result.size()); |
There was a problem hiding this comment.
TPortionsSelector can abort when ColumnEngineForLogsSelect is enabled but orbit is null: CalculateProbe becomes true via LWPROBE_ENABLED(...), then AFL_VERIFY(Orbit) triggers. This is currently reachable (e.g., internal scans build read metadata without setting TReadDescription::Orbit). Consider either requiring/passing a non-null orbit on all Select() call paths, or making probe emission conditional on Orbit (and using LWPROBE when Orbit is null) so tracing can't crash production code.
| AFL_VERIFY(Orbit); | |
| LWTRACK(ColumnEngineForLogsSelect, *Orbit, PathId.GetRawValue(), TabletId, TxId, ScanId, timeOfInserted.MilliSeconds(), timeOfCommitted.MilliSeconds(), | |
| TotalPortionsCount, TotalFilteredPortionsCount, Result.size()); | |
| if (Orbit) { | |
| LWTRACK(ColumnEngineForLogsSelect, *Orbit, PathId.GetRawValue(), TabletId, TxId, ScanId, timeOfInserted.MilliSeconds(), timeOfCommitted.MilliSeconds(), | |
| TotalPortionsCount, TotalFilteredPortionsCount, Result.size()); | |
| } else { | |
| LWPROBE(ColumnEngineForLogsSelect, PathId.GetRawValue(), TabletId, TxId, ScanId, timeOfInserted.MilliSeconds(), timeOfCommitted.MilliSeconds(), | |
| TotalPortionsCount, TotalFilteredPortionsCount, Result.size()); | |
| } |
| if (AppDataVerified().ColumnShardConfig.GetEnableDiagnostics()) { | ||
| scanDiagnosticsEvent->RequestId = requestCookie; | ||
| ctx.Send(Self->ScanDiagnosticsActorId, std::move(scanDiagnosticsEvent)); | ||
| } | ||
| auto orbit = std::make_shared<NLWTrace::TOrbit>(); | ||
| LWTRACK(StartScan, *orbit, request.GetPathId().GetInternalPathId().GetRawValue(), Self->TabletID(), request.GetLockId().value_or(0), ScanId); | ||
| auto scanActorId = ctx.Register(new TColumnShardScan(Self->SelfId(), scanComputeActor, Self->ScanDiagnosticsActorId, Self->GetStoragesManager(), | ||
| Self->DataAccessorsManager.GetObjectPtrVerified(), Self->ColumnDataManager.GetObjectPtrVerified(), TComputeShardingPolicy(), ScanId, request.GetLockId().value_or(0), ScanGen, requestCookie, | ||
| Self->TabletID(), TDuration::Max(), readMetadataRange, NKikimrDataEvents::FORMAT_ARROW, Self->Counters.GetScanCounters(), {})); | ||
| Self->TabletID(), TDuration::Max(), readMetadataRange, NKikimrDataEvents::FORMAT_ARROW, Self->Counters.GetScanCounters(), {}, std::move(orbit))); |
There was a problem hiding this comment.
orbit is created only after read metadata is built, but TReadMetadata::Init() now passes readDescription.Orbit into metadata selection (and ultimately into IColumnEngine::Select()). With the current flow, readDescription.Orbit is null during portion selection, which can lead to an assertion in TColumnEngineForLogs::Select() when ColumnEngineForLogsSelect tracing is enabled. Create/assign the orbit to TReadDescription (and set read.ScanId/read.TxId if needed) before building readMetadataRange so selection has the orbit available.
| const std::shared_ptr<NDataAccessorControl::IDataAccessorsManager>& dataAccessorsManager, | ||
| const std::shared_ptr<NColumnFetching::TColumnDataManager>& columnDataManager, const NColumnShard::TConcreteScanCounters& counters, | ||
| const TReadMetadataBase::TConstPtr& readMetadata, const TActorId& scanActorId, const TActorId& resourceSubscribeActorId, | ||
| const TActorId& readCoordinatorActorId, const TComputeShardingPolicy& computeShardingPolicy, const ui64 scanId, |
There was a problem hiding this comment.
TReadContext ctor signature now requires scanOrbit, but at least one call site still uses the old signature (e.g. ydb/core/tx/columnshard/engines/reader/simple_reader/duplicates/ut/ut_manager.cpp:291 constructs TReadContext without the new argument). This will break the build/tests; please update remaining call sites (and decide what orbit to pass in unit tests, e.g. std::make_shared<NLWTrace::TOrbit>() or nullptr if the code tolerates it).
| const TActorId& readCoordinatorActorId, const TComputeShardingPolicy& computeShardingPolicy, const ui64 scanId, | |
| const TActorId& readCoordinatorActorId, const TComputeShardingPolicy& computeShardingPolicy, const ui64 scanId, | |
| const NConveyorComposite::TCPULimitsConfig& cpuLimits) | |
| : TReadContext(storagesManager, dataAccessorsManager, columnDataManager, counters, readMetadata, scanActorId, | |
| resourceSubscribeActorId, readCoordinatorActorId, computeShardingPolicy, scanId, cpuLimits, nullptr) { | |
| } | |
| TReadContext(const std::shared_ptr<IStoragesManager>& storagesManager, | |
| const std::shared_ptr<NDataAccessorControl::IDataAccessorsManager>& dataAccessorsManager, | |
| const std::shared_ptr<NColumnFetching::TColumnDataManager>& columnDataManager, const NColumnShard::TConcreteScanCounters& counters, | |
| const TReadMetadataBase::TConstPtr& readMetadata, const TActorId& scanActorId, const TActorId& resourceSubscribeActorId, | |
| const TActorId& readCoordinatorActorId, const TComputeShardingPolicy& computeShardingPolicy, const ui64 scanId, |
| const auto& currentCategoryName = iterator->GetCurrentNode().GetSignalCategoryName(); | ||
| const TString tracingName = source->GetExecutionContext().GetPrevCategoryName() + " - " + currentCategoryName; | ||
| const TString tracingExecutionResult = source->GetExecutionContext().GetPrevExecutionResult() + " - " + currentExecutionResult; | ||
| const TDuration finishDurationMs = source->GetAndResetWaitDuration(); | ||
| const auto& processor = iterator->GetProcessorVerified(); | ||
| const auto processorType = processor->GetProcessorType(); |
There was a problem hiding this comment.
TProgramStep::ReportTracing() builds details = processor->DebugJson().GetStringRobust() and multiple concatenated TStrings unconditionally, even when no relevant LWTrace probes are enabled and the orbit has no shuttles. This can add significant overhead to the scan hot path. Consider guarding the whole method with a fast check (e.g., if (!(source->GetDataSourceOrbit().HasShuttles() || LWPROBE_ENABLED(ProgramConst) || ...)) return;) and only computing details/strings inside the enabled path.
| const auto& currentCategoryName = iterator->GetCurrentNode().GetSignalCategoryName(); | |
| const TString tracingName = source->GetExecutionContext().GetPrevCategoryName() + " - " + currentCategoryName; | |
| const TString tracingExecutionResult = source->GetExecutionContext().GetPrevExecutionResult() + " - " + currentExecutionResult; | |
| const TDuration finishDurationMs = source->GetAndResetWaitDuration(); | |
| const auto& processor = iterator->GetProcessorVerified(); | |
| const auto processorType = processor->GetProcessorType(); | |
| const auto& processor = iterator->GetProcessorVerified(); | |
| const auto processorType = processor->GetProcessorType(); | |
| bool shouldTrace = source->GetDataSourceOrbit().HasShuttles(); | |
| if (!shouldTrace) { | |
| switch (processorType) { | |
| case NArrow::NSSA::EProcessorType::Const: | |
| shouldTrace = LWPROBE_ENABLED(ProgramConst); | |
| break; | |
| case NArrow::NSSA::EProcessorType::Calculation: | |
| shouldTrace = LWPROBE_ENABLED(ProgramCalculation); | |
| break; | |
| case NArrow::NSSA::EProcessorType::Projection: | |
| shouldTrace = LWPROBE_ENABLED(ProgramProjection); | |
| break; | |
| case NArrow::NSSA::EProcessorType::Filter: | |
| shouldTrace = LWPROBE_ENABLED(ProgramFilter); | |
| break; | |
| case NArrow::NSSA::EProcessorType::Aggregation: | |
| shouldTrace = LWPROBE_ENABLED(ProgramAggregation); | |
| break; | |
| case NArrow::NSSA::EProcessorType::FetchOriginalData: | |
| shouldTrace = LWPROBE_ENABLED(ProgramFetchOriginalData); | |
| break; | |
| case NArrow::NSSA::EProcessorType::AssembleOriginalData: | |
| shouldTrace = LWPROBE_ENABLED(ProgramAssembleOriginalData); | |
| break; | |
| case NArrow::NSSA::EProcessorType::CheckIndexData: | |
| shouldTrace = LWPROBE_ENABLED(ProgramCheckIndexData); | |
| break; | |
| case NArrow::NSSA::EProcessorType::CheckHeaderData: | |
| shouldTrace = LWPROBE_ENABLED(ProgramCheckHeaderData); | |
| break; | |
| case NArrow::NSSA::EProcessorType::StreamLogic: | |
| shouldTrace = LWPROBE_ENABLED(ProgramStreamLogic); | |
| break; | |
| case NArrow::NSSA::EProcessorType::ReserveMemory: | |
| shouldTrace = LWPROBE_ENABLED(ProgramReserveMemory); | |
| break; | |
| case NArrow::NSSA::EProcessorType::Unknown: | |
| break; | |
| } | |
| } | |
| if (!shouldTrace) { | |
| return; | |
| } | |
| const auto& currentCategoryName = iterator->GetCurrentNode().GetSignalCategoryName(); | |
| const TString tracingName = source->GetExecutionContext().GetPrevCategoryName() + " - " + currentCategoryName; | |
| const TString tracingExecutionResult = source->GetExecutionContext().GetPrevExecutionResult() + " - " + currentExecutionResult; | |
| const TDuration finishDurationMs = source->GetAndResetWaitDuration(); |
Changelog entry
...
Changelog category
Description for reviewers
...