Align graph rendering with existing z-score pipeline#63
d2r3v wants to merge 41 commits into Refactor/vertical-graph-layout from …
Conversation
Resolved conflict in scripts/app.py:
- HEAD had accidentally placed the REPO_OWNER/REPO_NAMES configuration block inside the run_batch_extraction function docstring.
- Kept the correct docstring from Refactor/vertical-graph-layout.
- The configuration block remains only in __main__, where it belongs.
Implement adaptive 3-pass edge pruning for Markov graph visualisation.
## Changes
### process_model/graphing.py
**New: prune_edges_by_zscore(G, z_min, top_k)** (Pass 1)
- Groups outgoing edges by source node and computes per-node mean/sigma.
- Keeps edge (u->v) if z >= z_min OR edge is in the top-K highest-weight
outgoing edges for u (deterministic tie-break: desc weight, asc target name).
- Edge cases: sigma==0 -> top-K only; < min_out_edges_to_zscore -> keep all.
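The Pass 1 rule above can be sketched roughly as follows. This is a simplified illustration using a plain dict-of-dicts digraph rather than the project's actual graph objects; the exact signature of the real `prune_edges_by_zscore` may differ.

```python
from statistics import mean, pstdev

def prune_edges_by_zscore(edges, z_min=1.0, top_k=1, min_out_edges_to_zscore=3):
    """edges: {u: {v: weight}}. Returns the set of kept (u, v) pairs.

    Keep (u, v) if the z-score of its weight among u's outgoing edges is
    >= z_min, OR it is among u's top-K heaviest edges (ties broken by
    descending weight, then ascending target name, for determinism)."""
    kept = set()
    for u, out in edges.items():
        weights = list(out.values())
        # Too few outgoing edges to z-score meaningfully: keep them all.
        if len(weights) < min_out_edges_to_zscore:
            kept.update((u, v) for v in out)
            continue
        mu, sigma = mean(weights), pstdev(weights)
        ranked = sorted(out, key=lambda v: (-out[v], v))  # deterministic order
        top = set(ranked[:top_k])
        for v, w in out.items():
            if v in top:
                kept.add((u, v))
            elif sigma > 0 and (w - mu) / sigma >= z_min:
                kept.add((u, v))
            # sigma == 0: all weights equal, z undefined -> top-K only
    return kept
```

Note how the sigma == 0 branch falls through to top-K only, matching the edge case listed above.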
**New: _DSU** - Disjoint Set Union (Union-Find) with path compression and
union-by-rank for near-constant, O(α(n))-amortised component tracking.
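A sketch of what such a DSU typically looks like (illustrative, not the project's `_DSU` verbatim):

```python
class DSU:
    """Union-Find with path compression and union by rank."""
    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.rank = {x: 0 for x in items}

    def find(self, x):
        # Path halving: point visited nodes at their grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same component
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True
```

`union` returning True exactly when two components merge is what makes the bridge-repair pass below it easy to write.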
**New: prune_edges_connectivity_preserving(G, z_min, top_k)** (Pass 1+2+3)
- Pass 1: delegates to prune_edges_by_zscore.
- Pass 2 (connectivity repair): builds a DSU over all nodes; greedily adds
back the minimum number of highest-weight original edges (bridge candidates)
needed to reduce the pruned graph to a single weakly-connected component.
Uses (u,v) string tie-breaking for determinism.
- Pass 3 (orphan guardrail): any node with degree-0 in the kept set gets its
strongest incident edge (in or out) restored from the original graph.
- Prints a structured diagnostic line per graph:
[PRUNE] Label: N edges -> P1:X -> Conn:Y (+B bridges) -> Final:Z (+O orphan fixes)
| components: C1->C2 | z_min=.. top_k=..
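The Pass 2 repair idea can be sketched roughly like this. Again a simplification: dict-based edge maps and the function shape are assumptions; the real code operates on the project's graph objects and delegates component tracking to `_DSU`.

```python
def repair_connectivity(all_edges, kept, nodes):
    """Greedily restore highest-weight original edges until the kept set is
    weakly connected. all_edges: {(u, v): weight}; kept: set of (u, v).
    Returns (new_kept, bridges_added)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[rb] = ra
        return True

    for u, v in kept:                      # components induced by Pass 1
        union(u, v)
    bridges = []
    # Candidates: pruned edges, heaviest first; (u, v) string tie-break
    # keeps the selection deterministic.
    candidates = sorted((e for e in all_edges if e not in kept),
                        key=lambda e: (-all_edges[e], e))
    for u, v in candidates:
        if union(u, v):                    # joins two components -> a bridge
            bridges.append((u, v))
    return kept | set(bridges), bridges
```

Because an edge is added only when `union` actually merges two components, exactly K-1 bridges are restored for K initial components, which is the minimality property the tests below check.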
**build_markov_graph**: wired to call prune_edges_connectivity_preserving by
default (preserve_connectivity=True); falls back to raw z-score pruning with
the --no-preserve-connectivity flag.
**New CLI flags (main)**:
--z-min FLOAT z-score threshold (default: 1.0)
--top-k INT always keep top-K edges per node (default: 1)
--no-preserve-connectivity disable Pass 2+3; raw z-score pruning only
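The flags above could be wired with argparse roughly as follows (a sketch; the real `main` may structure this differently):

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Markov graph rendering")
    p.add_argument("--z-min", type=float, default=1.0,
                   help="z-score threshold for Pass 1 edge pruning")
    p.add_argument("--top-k", type=int, default=1,
                   help="always keep the top-K outgoing edges per node")
    p.add_argument("--no-preserve-connectivity", action="store_true",
                   help="disable Pass 2+3; raw z-score pruning only")
    return p
```

argparse maps `--no-preserve-connectivity` to `args.no_preserve_connectivity`, so `preserve_connectivity=not args.no_preserve_connectivity` recovers the default-on behaviour.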
### test/process_model/test_graphing.py
Added TestPruneEdgesByZScore (11 tests) and TestPruneEdgesConnectivityPreserving
(4 tests) covering:
- sigma==0 with top-K=1 and top-K=2
- single outgoing edge always kept
- standout edge + top-K fallback
- tie-breaking determinism
- 2-component repair with best bridge selection
- orphan node restoration via Pass 3
- deterministic bridge tie-breaking
- minimality: exactly (K-1) bridges for K initial components
All 19 tests pass.
# Conflicts:
#   event_labelling/CodeStructure_Branching/main.py
#   event_labelling/PR/get_clean_pr_label.py
…om IDF pruning
Team avg-session graphs now use filter_edges_by_zscore (from clustering.py)
on the pre-computed zscores.csv instead of a custom runtime IDF scorer.
Removed:
- build_edge_idf_map, score_team_edges, prune_team_edges_distinctive
- idf_map / n_teams params from build_markov_graph, render_team_graphs,
render_cluster_graphs
- --z-min, --score-min, --top-k CLI flags
Simplified:
- repair_connectivity / fix_orphans: weight-only sorting (no scores dict)
- build_markov_graph: Pass 1 removed; Pass 2+3 on pre-filtered input
- render_team_graphs: takes avg_zscored_df (pre-filtered by caller)
- render_cluster_graphs: no idf params
- main(): loads zscores.csv, applies filter_edges_by_zscore, new
--z-threshold flag (default 1.645, matches clustering.Z_THRESHOLD)
Tests updated: TestIdfPruning replaced with TestConnectivityRepair (5 tests)
covering repair_connectivity, fix_orphans, and full pipeline determinism.
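The new loading step in `main()` might look roughly like this. The column names (`source`, `target`, `zscore`) are assumptions about the zscores.csv schema, and the function is a stand-in for the real `filter_edges_by_zscore` in clustering.py.

```python
import csv
import io

def load_significant_transitions(csv_text, z_threshold=0.5):
    """Filter pre-computed transition z-scores, keeping only rows at or
    above the threshold. Column names are illustrative; the actual
    zscores.csv schema may differ."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r for r in rows if float(r["zscore"]) >= z_threshold]
```

Taking text (or a file handle) rather than a path keeps the filter trivially testable; the same predicate is what `filter_edges_by_zscore` applies on the clustering side.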
Refactor: Clean Scripts into Utility
AdaraPutri
left a comment
hey Dhruv, thanks for fixing the pruning methods. Pass 2 and Pass 3 sound like an interesting and solid approach, especially for higher z-score thresholds, where the avg-session graphs get really sparse and it’s easy to end up with disconnected components / floating nodes.
I ran the code across all 22 teams (PR labels) with the recommended setting of --z-threshold 0.5 and only one team (Team 11) got flagged as disjoint + “fixed” by Pass 2 (components: 2->1, +1 bridge). But visually the graph looked the same with vs without --no-preserve-connectivity:
To double check, I added a debug statement right after the Pass 2/3 block in build_markov_graph() (after final_count = len(keep_set)) to compare what we’re rendering vs what the passes are actually keeping. For Team 11 it showed:

So it looks like Pass 2 is adding the bridge edge into keep_set, but it never gets added into G (and we render using for u, v, data in G.edges(data=True)), which means the “bridge” can’t actually appear in the PNG. That would explain why the logs show +1 bridge but the diagram doesn’t change.
I tried again with a higher threshold (--z-threshold 1.645) and it becomes more obvious: 17 teams were identified as disconnected / got +1 bridge, but again the graphs looked identical with vs without preservation, which I think is the same root issue (edges added to keep_set, but not present in G so they aren’t drawable).
I think this can be fixed pretty cleanly by syncing G with keep_set after Pass 2/3:
- for any `(u, v)` in `keep_set` that isn’t already in `G`, add it into `G` using the weight from `G_original`
- then recompute probabilities (at least for the affected source nodes, or just recompute all `prob` values once after the repair)
That way the preservation passes actually affect what gets rendered.
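A rough sketch of that sync step, using dict-of-dicts graphs for brevity (the function name and shapes are illustrative, not the actual fix):

```python
def sync_graph_with_keep_set(G, G_original, keep_set):
    """G, G_original: {u: {v: weight}}. Ensure every kept edge is present
    in G (bridges restored by Pass 2/3 get their original weight back),
    then recompute per-source transition probabilities."""
    for u, v in keep_set:
        if v not in G.get(u, {}):
            # Bridge edge was in keep_set but never added to G: restore it.
            G.setdefault(u, {})[v] = G_original[u][v]
    probs = {}
    for u, out in G.items():
        total = sum(out.values())
        probs[u] = {v: w / total for v, w in out.items()}
    return G, probs
```

Recomputing all `prob` values once after the repair (rather than per affected node) is the simpler of the two options suggested above and is cheap at these graph sizes.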
One more thought: I’m not totally sure we need Pass 3 (orphans) in its current form. In theory it’s meant to reattach isolated nodes, but since all_nodes is derived from set(G.nodes()) (i.e., nodes that survived the filtered edge list), nodes that lose all edges due to pruning usually won’t be present to “fix” in the first place. So it might be doing very little in practice unless we expand the node scope, or we can keep it as a guardrail but it may not fire often.
Feature/comm label
…, fix cluster aggregation
Force-pushed f3c1d5e to 83c4d19
aliyahnurdafika
left a comment
Great work on this PR! I gained a lot of new insights from these updates. I’ve left a few questions and comments to deepen my understanding. Good job!
> - Extracts and normalizes usernames from comment author fields
> - Removes malformed or missing author entries
> - Overwrites original CSV files (backup originals if you need raw data)
Nice documentation update! I have a question: would it help to mention exactly which script is responsible for that overwrite step?
```python
if "pr_id" not in df.columns or "event" not in df.columns:
    raise KeyError("communication labels CSV must include at least: pr_id, event")
```
Good work on this validation change! We only require the essential columns (`pr_id`, `event`) needed for the timestamp logic, which keeps things consistent.
```python
"pr_id": row.get("pr_id"),
"timestamp": ts,
# keep EXACT same event cell as in original CSV (string form)
"event": row.get("event"),
```
I have a question: here we normalize the event cell for timestamp selection,

```python
events = _parse_event_cell(row.get("event"))
ts = _pick_timestamp(row, events)
```

but we still write the original raw event value:

```python
"event": row.get("event"),
```

Is the clean output intentionally preserving the original event representation, or should it write the normalized event value instead? Thanks!
> sigma == 0, so z is undefined. Only the top-K edge (alphabetically
> first target when weights tie) must be kept.
These explanations are really helpful! Not only explaining what is being tested, but also why that behavior is expected.
```python
    ("A1", "S2", 5),  # bridge 1
    ("A2", "S3", 5),  # bridge 2
])
kept = prune_edges_connectivity_preserving(G, z_min=1.0, top_k=1)
```
I think it would be helpful to briefly document why specific z_min values (e.g., 1.0, 999) are used in these tests. Maybe a brief comment at the top of the file could clarify the purpose of these values.
…uie/processAnalysis into feature/edge-zscore-emphasis
Force-pushed 2fbd799 to 925678d


Summary
This PR removes the custom edge scoring/pruning logic in `graphing.py` and aligns the graph rendering pipeline with the existing z-score workflow already implemented in the repository.

Previously, graph rendering used a separate IDF-based pruning approach. The repository already computes transition z-scores in `zscore_calculation.py` and uses them in `clustering.py` to filter statistically significant transitions. This change reuses that pipeline instead of maintaining a second pruning system.

Key Changes
1. Reuse existing z-score pipeline
`graphing.py` now consumes the z-score outputs produced by `zscore_calculation.py`. Edges for avg-session team graphs are filtered using the existing `filter_edges_by_zscore` logic from `clustering.py`.

2. Remove custom IDF pruning
Removed the following functions from `graphing.py`:

- `build_edge_idf_map`
- `score_team_edges`
- `prune_team_edges_distinctive`

This eliminates the duplicate pruning system and simplifies the graphing code.
3. Structural repair retained
The structural safeguards remain:
These operate only on the displayed node set and use raw edges as candidate bridges.
4. CLI configuration
Added a configurable filtering threshold: `--z-threshold`.

Default: 0.5 (the setting recommended in review). This default was chosen because stricter statistical cutoffs (e.g. `1.645`) produced overly sparse graphs in this dataset. The threshold remains configurable for stricter filtering if needed.

Resulting Pipeline
This keeps the statistical filtering logic consistent with the rest of the pipeline while ensuring graphs remain structurally interpretable.
Verification
Tested using multiple thresholds:
Connectivity repair ensures filtered graphs remain usable when filtering removes bridge edges.
Notes
Cluster graphs may still appear disconnected if the aggregated cluster edge data itself contains multiple components. In those cases the warning reflects the underlying data rather than a rendering error.