
Align graph rendering with existing z-score pipeline#63

Open
d2r3v wants to merge 41 commits into Refactor/vertical-graph-layout from feature/edge-zscore-emphasis

Conversation


@d2r3v d2r3v commented Jan 27, 2026

Summary

This PR removes the custom edge scoring/pruning logic in graphing.py and aligns the graph rendering pipeline with the existing z-score workflow already implemented in the repository.

Previously, graph rendering used a separate IDF-based pruning approach. The repository already computes transition z-scores in zscore_calculation.py and uses them in clustering.py to filter statistically significant transitions. This change reuses that pipeline instead of maintaining a second pruning system.


Key Changes

1. Reuse existing z-score pipeline

graphing.py now consumes the z-score outputs produced by zscore_calculation.py.

Edges for avg-session team graphs are filtered using:

abs(z_score) >= z_threshold

Filtering is applied using the existing filter_edges_by_zscore logic from clustering.py.
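
As a minimal standalone sketch of the filtering rule (the real `filter_edges_by_zscore` in clustering.py may differ in signature; the `z_score` column name is assumed from zscores.csv):

```python
import pandas as pd

def filter_edges_by_zscore(edges: pd.DataFrame, z_threshold: float = 0.5) -> pd.DataFrame:
    """Keep only edges whose transition z-score is statistically notable."""
    return edges[edges["z_score"].abs() >= z_threshold].copy()

edges = pd.DataFrame({
    "source": ["A", "A", "B"],
    "target": ["B", "C", "C"],
    "z_score": [1.2, 0.1, -0.8],
})
kept = filter_edges_by_zscore(edges, z_threshold=0.5)
# only A->B (z=1.2) and B->C (z=-0.8) satisfy abs(z) >= 0.5
```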


2. Remove custom IDF pruning

Removed the following functions from graphing.py:

  • build_edge_idf_map
  • score_team_edges
  • prune_team_edges_distinctive

This eliminates the duplicate pruning system and simplifies the graphing code.


3. Structural repair retained

The structural safeguards remain:

  • repair_connectivity – restores minimal bridge edges from the raw graph if filtering disconnects the graph
  • fix_orphans – restores a minimal incident edge for isolated nodes

These operate only on the displayed node set and use raw edges as candidate bridges.
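
A rough sketch of what these two safeguards do, using networkx (the real implementations in graphing.py may differ; the `(u, v, weight)` raw-edge tuples and the tie-break order are assumptions based on the commit notes):

```python
import networkx as nx

def repair_connectivity(G: nx.DiGraph, raw_edges) -> None:
    """If filtering split G, greedily add back the heaviest raw edges
    that bridge two weakly-connected components until one remains."""
    comps = list(nx.weakly_connected_components(G))
    while len(comps) > 1:
        best = None
        # deterministic order: descending weight, then (u, v) names
        for u, v, w in sorted(raw_edges, key=lambda e: (-e[2], e[0], e[1])):
            cu = next((i for i, c in enumerate(comps) if u in c), None)
            cv = next((i for i, c in enumerate(comps) if v in c), None)
            if cu is not None and cv is not None and cu != cv:
                best = (u, v, w)
                break
        if best is None:
            break  # no raw edge can connect the remaining components
        G.add_edge(best[0], best[1], weight=best[2])
        comps = list(nx.weakly_connected_components(G))

def fix_orphans(G: nx.DiGraph, raw_edges) -> None:
    """Reattach each degree-0 node via its strongest incident raw edge."""
    for n in [n for n in G.nodes if G.degree(n) == 0]:
        incident = [e for e in raw_edges if n in (e[0], e[1])]
        if incident:
            u, v, w = max(incident, key=lambda e: e[2])
            G.add_edge(u, v, weight=w)
```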


4. CLI configuration

Added a configurable filtering threshold:

--z-threshold FLOAT

Default:

0.5

This default was chosen because stricter statistical cutoffs (e.g. 1.645) produced overly sparse graphs in this dataset. The threshold remains configurable for stricter filtering if needed.
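
The flag wiring is standard argparse; a minimal sketch (help text paraphrases the rationale above):

```python
import argparse

parser = argparse.ArgumentParser(description="Render avg-session team graphs")
parser.add_argument(
    "--z-threshold",
    type=float,
    default=0.5,
    help="minimum |z| for an edge to be kept; stricter cutoffs such as "
         "1.645 produced overly sparse graphs on this dataset",
)
args = parser.parse_args(["--z-threshold", "1.0"])
# argparse maps --z-threshold to args.z_threshold
```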


Resulting Pipeline

raw avg-session edges
      ↓
z-score computation (existing step)
      ↓
z-score filtering (abs(z) >= threshold)
      ↓
build filtered graph
      ↓
connectivity repair (raw edges as candidates)
      ↓
orphan repair

This keeps the statistical filtering logic consistent with the rest of the pipeline while ensuring graphs remain structurally interpretable.


Verification

Tested using multiple thresholds:

| Threshold | Result |
| --- | --- |
| 0 | No filtering (baseline graph) |
| 0.5 | Balanced sparsity and readability |
| 1.0+ | Graphs often became overly sparse |

Connectivity repair ensures filtered graphs remain usable when filtering removes bridge edges.


Notes

Cluster graphs may still appear disconnected if the aggregated cluster edge data itself contains multiple components. In those cases the warning reflects the underlying data rather than a rendering error.

@d2r3v d2r3v self-assigned this Jan 27, 2026
@d2r3v d2r3v added the enhancement New feature or request label Jan 27, 2026
@d2r3v d2r3v marked this pull request as draft January 27, 2026 02:30
@d2r3v d2r3v linked an issue Jan 27, 2026 that may be closed by this pull request
@d2r3v d2r3v changed the title WIP: edge zscore emphasis Edge zscore emphasis Feb 2, 2026
Mahatav and others added 18 commits February 2, 2026 07:32
Resolved conflict in scripts/app.py:
- HEAD had accidentally placed REPO_OWNER/REPO_NAMES configuration block
  inside the run_batch_extraction function docstring.
- Kept the correct docstring from Refactor/vertical-graph-layout.
- Configuration block remains only in __main__ where it belongs.
Implement adaptive 3-pass edge pruning for Markov graph visualisation.

## Changes

### process_model/graphing.py

**New: prune_edges_by_zscore(G, z_min, top_k)** (Pass 1)
- Groups outgoing edges by source node and computes per-node mean/sigma.
- Keeps edge (u->v) if z >= z_min OR edge is in the top-K highest-weight
  outgoing edges for u (deterministic tie-break: desc weight, asc target name).
- Edge cases: sigma==0 -> top-K only; < min_out_edges_to_zscore -> keep all.
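
The Pass-1 rule described above can be sketched as a simplified standalone function (the parameter name `min_out` stands in for `min_out_edges_to_zscore`; the real function operates on a graph rather than tuples):

```python
from statistics import mean, pstdev

def prune_edges_by_zscore(edges, z_min=1.0, top_k=1, min_out=3):
    """edges: list of (u, v, weight). Keep (u, v) if its weight is a
    per-source outlier (z >= z_min) or among u's top-K heaviest edges."""
    by_source = {}
    for u, v, w in edges:
        by_source.setdefault(u, []).append((v, w))
    kept = set()
    for u, outs in by_source.items():
        # deterministic tie-break: descending weight, ascending target name
        ranked = sorted(outs, key=lambda e: (-e[1], e[0]))
        top = {v for v, _ in ranked[:top_k]}
        if len(outs) < min_out:
            kept.update((u, v) for v, _ in outs)  # too few edges to z-score
            continue
        mu = mean(w for _, w in outs)
        sigma = pstdev(w for _, w in outs)
        for v, w in outs:
            # sigma == 0 -> z undefined -> fall back to top-K only
            if v in top or (sigma > 0 and (w - mu) / sigma >= z_min):
                kept.add((u, v))
    return kept
```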

**New: _DSU** - Disjoint Set Union (Union-Find) with path compression and
union-by-rank for O(alpha) component tracking.

**New: prune_edges_connectivity_preserving(G, z_min, top_k)** (Pass 1+2+3)
- Pass 1: delegates to prune_edges_by_zscore.
- Pass 2 (connectivity repair): builds a DSU over all nodes; greedily adds
  back the minimum number of highest-weight original edges (bridge candidates)
  needed to reduce the pruned graph to a single weakly-connected component.
  Uses (u,v) string tie-breaking for determinism.
- Pass 3 (orphan guardrail): any node with degree-0 in the kept set gets its
  strongest incident edge (in or out) restored from the original graph.
- Prints a structured diagnostic line per graph:
    [PRUNE] Label: N edges -> P1:X -> Conn:Y (+B bridges) -> Final:Z (+O orphan fixes)
             |  components: C1->C2  |  z_min=.. top_k=..

**build_markov_graph**: wired to call prune_edges_connectivity_preserving by
default (preserve_connectivity=True); falls back to raw z-score pruning with
the --no-preserve-connectivity flag.

**New CLI flags (main)**:
  --z-min FLOAT              z-score threshold (default: 1.0)
  --top-k INT                always keep top-K edges per node (default: 1)
  --no-preserve-connectivity disable Pass 2+3; raw z-score pruning only

### test/process_model/test_graphing.py

Added TestPruneEdgesByZScore (11 tests) and TestPruneEdgesConnectivityPreserving
(4 tests) covering:
  - sigma==0 with top-K=1 and top-K=2
  - single outgoing edge always kept
  - standout edge + top-K fallback
  - tie-breaking determinism
  - 2-component repair with best bridge selection
  - orphan node restoration via Pass 3
  - deterministic bridge tie-breaking
  - minimality: exactly (K-1) bridges for K initial components

All 19 tests pass.
@d2r3v d2r3v changed the title Edge zscore emphasis feat: Connectivity-Preserving Z-Score Edge Pruning Feb 23, 2026
@d2r3v d2r3v marked this pull request as ready for review February 23, 2026 04:41
@d2r3v d2r3v requested review from AdaraPutri and Mahatav February 23, 2026 04:42
# Conflicts:
#	event_labelling/CodeStructure_Branching/main.py
#	event_labelling/PR/get_clean_pr_label.py
@d2r3v d2r3v changed the title feat: Connectivity-Preserving Z-Score Edge Pruning Adaptive & Distinctiveness-Based Edge Pruning Pipeline Mar 2, 2026
@Mahatav Mahatav deleted the branch Refactor/vertical-graph-layout March 2, 2026 17:27
@Mahatav Mahatav closed this Mar 2, 2026
@Mahatav Mahatav reopened this Mar 2, 2026

@Mahatav Mahatav left a comment


Looks good.

d2r3v added 4 commits March 9, 2026 00:39
…om IDF pruning

Team avg-session graphs now use filter_edges_by_zscore (from clustering.py)
on the pre-computed zscores.csv instead of a custom runtime IDF scorer.

Removed:
  - build_edge_idf_map, score_team_edges, prune_team_edges_distinctive
  - idf_map / n_teams params from build_markov_graph, render_team_graphs,
    render_cluster_graphs
  - --z-min, --score-min, --top-k CLI flags

Simplified:
  - repair_connectivity / fix_orphans: weight-only sorting (no scores dict)
  - build_markov_graph: Pass 1 removed; Pass 2+3 on pre-filtered input
  - render_team_graphs: takes avg_zscored_df (pre-filtered by caller)
  - render_cluster_graphs: no idf params
  - main(): loads zscores.csv, applies filter_edges_by_zscore, new
    --z-threshold flag (default 1.645, matches clustering.Z_THRESHOLD)

Tests updated: TestIdfPruning replaced with TestConnectivityRepair (5 tests)
covering repair_connectivity, fix_orphans, and full pipeline determinism.
@d2r3v d2r3v changed the title Adaptive & Distinctiveness-Based Edge Pruning Pipeline Align graph rendering with existing z-score pipeline Mar 9, 2026
Refactor: Clean Scripts into Utility

@AdaraPutri AdaraPutri left a comment


hey Dhruv, thanks for fixing the pruning methods. Pass 2 and Pass 3 sound like an interesting and solid approach, especially for higher z-score thresholds where the avg-session graphs get really sparse and it’s easy to end up with disconnected components / floating nodes.

I ran the code across all 22 teams (PR labels) with the recommended setting of --z-threshold 0.5 and only one team (Team 11) got flagged as disjoint + “fixed” by Pass 2 (components: 2->1, +1 bridge). But visually the graph looked the same with vs without --no-preserve-connectivity:

Before: [image]
After: [image]

To double check, I added a debug statement right after the Pass 2/3 block in build_markov_graph() (after final_count = len(keep_set)) to compare what we’re rendering vs what the passes are actually keeping. For Team 11 it showed:
[debug output image]

So it looks like Pass 2 is adding the bridge edge into keep_set, but it never gets added into G (and we render using for u, v, data in G.edges(data=True)), which means the “bridge” can’t actually appear in the PNG. That would explain why the logs show +1 bridge but the diagram doesn’t change.

I tried again with a higher threshold (--z-threshold 1.645) and it becomes more obvious: 17 teams were identified as disconnected / got +1 bridge, but again the graphs looked identical with vs without preservation, which I think is the same root issue (edges added to keep_set, but not present in G so they aren’t drawable).

I think this can be fixed pretty cleanly by syncing G with keep_set after Pass 2/3:

  • for any (u, v) in keep_set that isn’t already in G, add it into G using the weight from G_original
  • then recompute probabilities (at least for the affected source nodes, or just recompute all prob values once after the repair)

That way the preservation passes actually affect what gets rendered.
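
The proposed fix could look roughly like this (a sketch, not the actual patch; the `weight`/`prob` attribute names and `keep_set` as a set of `(u, v)` pairs are assumptions from the discussion above):

```python
import networkx as nx

def sync_graph_with_keepset(G, G_original, keep_set):
    """Add bridge/orphan edges that Pass 2/3 put in keep_set but never
    inserted into G, then recompute per-source transition probabilities."""
    for u, v in keep_set:
        if not G.has_edge(u, v) and G_original.has_edge(u, v):
            G.add_edge(u, v, weight=G_original[u][v]["weight"])
    # recompute all prob values once after the repair
    for u in G.nodes:
        total = sum(d["weight"] for _, _, d in G.out_edges(u, data=True))
        for _, _, d in G.out_edges(u, data=True):
            d["prob"] = d["weight"] / total if total else 0.0
    return G
```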

One more thought: I’m not totally sure we need Pass 3 (orphans) in its current form. In theory it reattaches isolated nodes, but since all_nodes is derived from set(G.nodes()) (i.e., nodes that survived the filtered edge list), nodes that lose all their edges to pruning usually won’t be present to “fix” in the first place. So it may do very little in practice unless we expand the node scope; alternatively, we can keep it as a guardrail, with the caveat that it will rarely fire.

@d2r3v d2r3v force-pushed the feature/edge-zscore-emphasis branch from f3c1d5e to 83c4d19 Compare March 16, 2026 07:36
@d2r3v d2r3v requested a review from AdaraPutri March 16, 2026 07:37

@aliyahnurdafika aliyahnurdafika left a comment


Great work on this PR! I gained a lot of new insights from these updates. I’ve left a few questions and comments to deepen my understanding. Good job!


- Extracts and normalizes usernames from comment author fields
- Removes malformed or missing author entries
- Overwrites original CSV files (backup originals if you need raw data)


Nice documentation update! One question: would it help to mention exactly which script is responsible for that overwrite step?

Comment on lines +73 to +74
if "pr_id" not in df.columns or "event" not in df.columns:
raise KeyError("communication labels CSV must include at least: pr_id, event")


Good work on this validation change! We only require the essential columns (pr_id, event), which keeps the check consistent with the timestamp logic.

"pr_id": row.get("pr_id"),
"timestamp": ts,
# keep EXACT same event cell as in original CSV (string form)
"event": row.get("event"),


A question: here we normalize the event cell for timestamp selection,

events = _parse_event_cell(row.get("event"))
ts = _pick_timestamp(row, events)

but we still write the original raw event value:

"event": row.get("event"),

Is the clean output intentionally preserving the original event representation, or should it write the normalized event value instead? Thanks!

Comment thread on test/process_model/test_graphing.py (outdated)
Comment on lines +97 to +98
sigma == 0, so z is undefined. Only the top-K edge (alphabetically
first target when weights tie) must be kept.


These explanations are really helpful! Not only explaining what is being tested, but also why that behavior is expected.

Comment thread on test/process_model/test_graphing.py (outdated)
("A1", "S2", 5), # bridge 1
("A2", "S3", 5), # bridge 2
])
kept = prune_edges_connectivity_preserving(G, z_min=1.0, top_k=1)


I think it would be helpful to briefly document why specific z_min values (e.g., 1.0, 999) are used in these tests. Maybe a brief comment at the top of the file could clarify the purpose of these values.

@d2r3v d2r3v force-pushed the feature/edge-zscore-emphasis branch from 2fbd799 to 925678d Compare March 23, 2026 07:14


Development

Successfully merging this pull request may close these issues.

Z-score-based edge emphasis

4 participants