Align graph rendering with existing z-score pipeline#63
d2r3v wants to merge 41 commits into Refactor/vertical-graph-layout from …
Conversation
Resolved conflict in scripts/app.py:
- HEAD had accidentally placed the REPO_OWNER/REPO_NAMES configuration block inside the run_batch_extraction function docstring.
- Kept the correct docstring from Refactor/vertical-graph-layout.
- The configuration block remains only in __main__, where it belongs.
Implement adaptive 3-pass edge pruning for Markov graph visualisation.
## Changes
### process_model/graphing.py
**New: prune_edges_by_zscore(G, z_min, top_k)** (Pass 1)
- Groups outgoing edges by source node and computes per-node mean/sigma.
- Keeps edge (u->v) if z >= z_min OR edge is in the top-K highest-weight
outgoing edges for u (deterministic tie-break: desc weight, asc target name).
- Edge cases: sigma==0 -> top-K only; < min_out_edges_to_zscore -> keep all.
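The Pass 1 rule above can be sketched roughly as follows. This is a simplified illustration using a plain dict-of-dicts digraph rather than the project's actual graph objects; the exact signature of the real `prune_edges_by_zscore` may differ.

```python
from statistics import mean, pstdev

def prune_edges_by_zscore(edges, z_min=1.0, top_k=1, min_out_edges_to_zscore=3):
    """edges: {u: {v: weight}}. Returns the set of kept (u, v) pairs.

    Keep (u, v) if the z-score of its weight among u's outgoing edges is
    >= z_min, OR it is among u's top-K heaviest edges (ties broken by
    descending weight, then ascending target name, for determinism)."""
    kept = set()
    for u, out in edges.items():
        weights = list(out.values())
        # Too few outgoing edges to z-score meaningfully: keep them all.
        if len(weights) < min_out_edges_to_zscore:
            kept.update((u, v) for v in out)
            continue
        mu, sigma = mean(weights), pstdev(weights)
        ranked = sorted(out, key=lambda v: (-out[v], v))  # deterministic order
        top = set(ranked[:top_k])
        for v, w in out.items():
            if v in top:
                kept.add((u, v))
            elif sigma > 0 and (w - mu) / sigma >= z_min:
                kept.add((u, v))
            # sigma == 0: all weights equal, z undefined -> top-K only
    return kept
```

Note how the sigma == 0 branch falls through to top-K only, matching the edge case listed above.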
**New: _DSU** - Disjoint Set Union (Union-Find) with path compression and
union-by-rank for near-constant, O(α(n))-amortised component tracking.
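A sketch of what such a DSU typically looks like (illustrative, not the project's `_DSU` verbatim):

```python
class DSU:
    """Union-Find with path compression and union by rank."""
    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.rank = {x: 0 for x in items}

    def find(self, x):
        # Path halving: point visited nodes at their grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same component
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True
```

`union` returning True exactly when two components merge is what makes the bridge-repair pass below it easy to write.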
**New: prune_edges_connectivity_preserving(G, z_min, top_k)** (Pass 1+2+3)
- Pass 1: delegates to prune_edges_by_zscore.
- Pass 2 (connectivity repair): builds a DSU over all nodes; greedily adds
back the minimum number of highest-weight original edges (bridge candidates)
needed to reduce the pruned graph to a single weakly-connected component.
Uses (u,v) string tie-breaking for determinism.
- Pass 3 (orphan guardrail): any node with degree-0 in the kept set gets its
strongest incident edge (in or out) restored from the original graph.
- Prints a structured diagnostic line per graph:
[PRUNE] Label: N edges -> P1:X -> Conn:Y (+B bridges) -> Final:Z (+O orphan fixes)
| components: C1->C2 | z_min=.. top_k=..
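The Pass 2 repair idea can be sketched roughly like this. Again a simplification: dict-based edge maps and the function shape are assumptions; the real code operates on the project's graph objects and delegates component tracking to `_DSU`.

```python
def repair_connectivity(all_edges, kept, nodes):
    """Greedily restore highest-weight original edges until the kept set is
    weakly connected. all_edges: {(u, v): weight}; kept: set of (u, v).
    Returns (new_kept, bridges_added)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[rb] = ra
        return True

    for u, v in kept:                      # components induced by Pass 1
        union(u, v)
    bridges = []
    # Candidates: pruned edges, heaviest first; (u, v) string tie-break
    # keeps the selection deterministic.
    candidates = sorted((e for e in all_edges if e not in kept),
                        key=lambda e: (-all_edges[e], e))
    for u, v in candidates:
        if union(u, v):                    # joins two components -> a bridge
            bridges.append((u, v))
    return kept | set(bridges), bridges
```

Because an edge is added only when `union` actually merges two components, exactly K-1 bridges are restored for K initial components, which is the minimality property the tests below check.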
**build_markov_graph**: wired to call prune_edges_connectivity_preserving by
default (preserve_connectivity=True); falls back to raw z-score pruning with
the --no-preserve-connectivity flag.
**New CLI flags (main)**:
--z-min FLOAT z-score threshold (default: 1.0)
--top-k INT always keep top-K edges per node (default: 1)
--no-preserve-connectivity disable Pass 2+3; raw z-score pruning only
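The flags above could be wired with argparse roughly as follows (a sketch; the real `main` may structure this differently):

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Markov graph rendering")
    p.add_argument("--z-min", type=float, default=1.0,
                   help="z-score threshold for Pass 1 edge pruning")
    p.add_argument("--top-k", type=int, default=1,
                   help="always keep the top-K outgoing edges per node")
    p.add_argument("--no-preserve-connectivity", action="store_true",
                   help="disable Pass 2+3; raw z-score pruning only")
    return p
```

argparse maps `--no-preserve-connectivity` to `args.no_preserve_connectivity`, so `preserve_connectivity=not args.no_preserve_connectivity` recovers the default-on behaviour.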
### test/process_model/test_graphing.py
Added TestPruneEdgesByZScore (11 tests) and TestPruneEdgesConnectivityPreserving
(4 tests) covering:
- sigma==0 with top-K=1 and top-K=2
- single outgoing edge always kept
- standout edge + top-K fallback
- tie-breaking determinism
- 2-component repair with best bridge selection
- orphan node restoration via Pass 3
- deterministic bridge tie-breaking
- minimality: exactly (K-1) bridges for K initial components
All 19 tests pass.
# Conflicts:
#   event_labelling/CodeStructure_Branching/main.py
#   event_labelling/PR/get_clean_pr_label.py
…om IDF pruning
Team avg-session graphs now use filter_edges_by_zscore (from clustering.py)
on the pre-computed zscores.csv instead of a custom runtime IDF scorer.
Removed:
- build_edge_idf_map, score_team_edges, prune_team_edges_distinctive
- idf_map / n_teams params from build_markov_graph, render_team_graphs,
render_cluster_graphs
- --z-min, --score-min, --top-k CLI flags
Simplified:
- repair_connectivity / fix_orphans: weight-only sorting (no scores dict)
- build_markov_graph: Pass 1 removed; Pass 2+3 on pre-filtered input
- render_team_graphs: takes avg_zscored_df (pre-filtered by caller)
- render_cluster_graphs: no idf params
- main(): loads zscores.csv, applies filter_edges_by_zscore, new
--z-threshold flag (default 1.645, matches clustering.Z_THRESHOLD)
Tests updated: TestIdfPruning replaced with TestConnectivityRepair (5 tests)
covering repair_connectivity, fix_orphans, and full pipeline determinism.
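The new loading step in `main()` might look roughly like this. The column names (`source`, `target`, `zscore`) are assumptions about the zscores.csv schema, and the function is a stand-in for the real `filter_edges_by_zscore` in clustering.py.

```python
import csv
import io

def load_significant_transitions(csv_text, z_threshold=0.5):
    """Filter pre-computed transition z-scores, keeping only rows at or
    above the threshold. Column names are illustrative; the actual
    zscores.csv schema may differ."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r for r in rows if float(r["zscore"]) >= z_threshold]
```

Taking text (or a file handle) rather than a path keeps the filter trivially testable; the same predicate is what `filter_edges_by_zscore` applies on the clustering side.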
Refactor: Clean Scripts into Utility
AdaraPutri
left a comment
hey Dhruv, thanks for fixing the pruning methods. Pass 2 and Pass 3 sound like an interesting and solid approach, especially for higher z-score thresholds, where the avg-session graphs get really sparse and it’s easy to end up with disconnected components / floating nodes.
I ran the code across all 22 teams (PR labels) with the recommended setting of --z-threshold 0.5 and only one team (Team 11) got flagged as disjoint + “fixed” by Pass 2 (components: 2->1, +1 bridge). But visually the graph looked the same with vs without --no-preserve-connectivity:
To double check, I added a debug statement right after the Pass 2/3 block in build_markov_graph() (after final_count = len(keep_set)) to compare what we’re rendering vs what the passes are actually keeping. For Team 11 it showed:

So it looks like Pass 2 is adding the bridge edge into keep_set, but it never gets added into G (and we render using for u, v, data in G.edges(data=True)), which means the “bridge” can’t actually appear in the PNG. That would explain why the logs show +1 bridge but the diagram doesn’t change.
I tried again with a higher threshold (--z-threshold 1.645) and it becomes more obvious: 17 teams were identified as disconnected / got +1 bridge, but again the graphs looked identical with vs without preservation, which I think is the same root issue (edges added to keep_set, but not present in G so they aren’t drawable).
I think this can be fixed pretty cleanly by syncing G with keep_set after Pass 2/3:
- for any `(u, v)` in `keep_set` that isn’t already in `G`, add it into `G` using the weight from `G_original`
- then recompute probabilities (at least for the affected source nodes, or just recompute all `prob` values once after the repair)
That way the preservation passes actually affect what gets rendered.
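A rough sketch of that sync step, using dict-of-dicts graphs for brevity (the function name and shapes are illustrative, not the actual fix):

```python
def sync_graph_with_keep_set(G, G_original, keep_set):
    """G, G_original: {u: {v: weight}}. Ensure every kept edge is present
    in G (bridges restored by Pass 2/3 get their original weight back),
    then recompute per-source transition probabilities."""
    for u, v in keep_set:
        if v not in G.get(u, {}):
            # Bridge edge was in keep_set but never added to G: restore it.
            G.setdefault(u, {})[v] = G_original[u][v]
    probs = {}
    for u, out in G.items():
        total = sum(out.values())
        probs[u] = {v: w / total for v, w in out.items()}
    return G, probs
```

Recomputing all `prob` values once after the repair (rather than per affected node) is the simpler of the two options suggested above and is cheap at these graph sizes.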
One more thought: I’m not totally sure we need Pass 3 (orphans) in its current form. In theory it’s meant to reattach isolated nodes, but since all_nodes is derived from set(G.nodes()) (i.e., nodes that survived the filtered edge list), nodes that lose all edges due to pruning usually won’t be present to “fix” in the first place. So it might be doing very little in practice unless we expand the node scope, or we can keep it as a guardrail but it may not fire often.
Feature/comm label
…, fix cluster aggregation
Force-pushed f3c1d5e to 83c4d19
aliyahnurdafika
left a comment
Great work on this PR! I gained a lot of new insights from these updates. I’ve left a few questions and comments to deepen my understanding. Good job!
> - Extracts and normalizes usernames from comment author fields
> - Removes malformed or missing author entries
> - Overwrites original CSV files (backup originals if you need raw data)
Nice documentation update! I have a question: would it help to mention exactly which script is responsible for that overwrite step?
```python
if "pr_id" not in df.columns or "event" not in df.columns:
    raise KeyError("communication labels CSV must include at least: pr_id, event")
```
Good work on this validation change! We only require the essential columns (`pr_id`, `event`) needed for the timestamp logic, which keeps things consistent.
```python
"pr_id": row.get("pr_id"),
"timestamp": ts,
# keep EXACT same event cell as in original CSV (string form)
"event": row.get("event"),
```
I have a question: here we normalize the event cell for timestamp selection,

```python
events = _parse_event_cell(row.get("event"))
ts = _pick_timestamp(row, events)
```

but we still write the original raw event value:

```python
"event": row.get("event"),
```

Is the clean output intentionally preserving the original event representation, or should it write the normalized event value instead? Thanks!
> sigma == 0, so z is undefined. Only the top-K edge (alphabetically
> first target when weights tie) must be kept.
These explanations are really helpful! Not only explaining what is being tested, but also why that behavior is expected.
```python
    ("A1", "S2", 5),  # bridge 1
    ("A2", "S3", 5),  # bridge 2
])
kept = prune_edges_connectivity_preserving(G, z_min=1.0, top_k=1)
```
I think it would be helpful to briefly document why specific z_min values (e.g., 1.0, 999) are used in these tests. Maybe a brief comment at the top of the file could clarify the purpose of these values.
…uie/processAnalysis into feature/edge-zscore-emphasis
Force-pushed 2fbd799 to 925678d


Summary
This PR removes the custom edge scoring/pruning logic in `graphing.py` and aligns the graph rendering pipeline with the existing z-score workflow already implemented in the repository.

Previously, graph rendering used a separate IDF-based pruning approach. The repository already computes transition z-scores in `zscore_calculation.py` and uses them in `clustering.py` to filter statistically significant transitions. This change reuses that pipeline instead of maintaining a second pruning system.

Key Changes
1. Reuse existing z-score pipeline
`graphing.py` now consumes the z-score outputs produced by `zscore_calculation.py`. Edges for avg-session team graphs are filtered using the existing `filter_edges_by_zscore` logic from `clustering.py`.

2. Remove custom IDF pruning
Removed the following functions from `graphing.py`:

- `build_edge_idf_map`
- `score_team_edges`
- `prune_team_edges_distinctive`

This eliminates the duplicate pruning system and simplifies the graphing code.
3. Structural repair retained
The structural safeguards remain:
These operate only on the displayed node set and use raw edges as candidate bridges.
4. CLI configuration
Added a configurable filtering threshold: `--z-threshold`.

Default: 0.5 (the setting recommended in review). This default was chosen because stricter statistical cutoffs (e.g. `1.645`) produced overly sparse graphs in this dataset. The threshold remains configurable for stricter filtering if needed.

Resulting Pipeline
This keeps the statistical filtering logic consistent with the rest of the pipeline while ensuring graphs remain structurally interpretable.
Verification
Tested using multiple thresholds:
Connectivity repair ensures filtered graphs remain usable when filtering removes bridge edges.
Notes
Cluster graphs may still appear disconnected if the aggregated cluster edge data itself contains multiple components. In those cases the warning reflects the underlying data rather than a rendering error.