Skip to content

Unified mixs-legacy-diff tool: v4+ schema comparison with CI tests#1115

Open
turbomam wants to merge 25 commits intomainfrom
1114-legacy-diff-tool
Open

Unified mixs-legacy-diff tool: v4+ schema comparison with CI tests#1115
turbomam wants to merge 25 commits intomainfrom
1114-legacy-diff-tool

Conversation

@turbomam
Copy link
Copy Markdown
Member

@turbomam turbomam commented Feb 3, 2026

Summary

  • Unified mixs-legacy-diff CLI replaces both the old diff-releases script (2408 lines, LinkML-only) and the original format-agnostic diff tool into a single entry point
  • Supports v4 Excel (.xls), v5 Excel (.xlsx), and v6+ LinkML YAML — any combination
  • Scoped to v4+ only; pre-v4 formats (2009–2011 Word/Excel) removed as unstable — cataloged in assets/terminology_by_version.yaml for future reimplementation
  • Mapping hints (expected renames, splits, merges, deletions) unified into YAML format grouped by entity type

What changed

Area Before After
Diff tools Two separate tools with incompatible mapping formats One unified mixs-legacy-diff CLI
Mapping hints 5 TSV files (flat) Single mapping_config.yaml (grouped by slots/classes/enums/subsets)
Format support v4, v5, v6+, plus unstable pre-v4 v4, v5, v6+ (stable only)
CI tests None 30 tests on every PR touching diff code
Release workflow Hardcoded path that 404'd on pre-v6.2 tags Full GitHub spec inputs with correct paths
Test data Required local mixs-legacy checkout Auto-downloads from public GitHub repo

Release workflow change

The create-release-pr.yaml workflow inputs changed from bare refs to full GitHub specs:

Before (broken for pre-v6.2 tags):

diff_old: mixs6.0.0        # → constructs src/mixs/schema/mixs.yaml → 404!
diff_new: main

After (explicit paths, works for any tag):

diff_old: GenomicsStandardsConsortium/[email protected]:src/mixs/schema/mixs.yaml
diff_new: GenomicsStandardsConsortium/mixs@main:src/mixs/schema/mixs.yaml

To compare against mixs6.0.0, use: GenomicsStandardsConsortium/[email protected]:model/schema/mixs.yaml

The key insight: the schema path changed from model/schema/mixs.yaml (tags mixs6.0.0 through mixs6.1.1) to src/mixs/schema/mixs.yaml (tags v6.2.0 onward). The old workflow hardcoded the newer path, so comparing against pre-v6.2 tags would silently fail.

Usage

# Install dependencies
poetry install --with legacy-diff

# v4 → v5
mixs-legacy-diff --old ../mixs-legacy/mixs4/MIxS_210514.xls \
  --new ../mixs-legacy/mixs5/mixs_v5.xlsx

# v5 → current LinkML
mixs-legacy-diff --old ../mixs-legacy/mixs5/mixs_v5.xlsx \
  --new src/mixs/schema/mixs.yaml

# Tagged release → tagged release (fetches from GitHub)
mixs-legacy-diff \
  --old "GenomicsStandardsConsortium/[email protected]:src/mixs/schema/mixs.yaml" \
  --new "GenomicsStandardsConsortium/[email protected]:src/mixs/schema/mixs.yaml"

# With mapping hints for known renames/deletions
mixs-legacy-diff --old ... --new ... \
  --mappings-dir assets/between_diff_mappings/6_to_pre_7

Test plan

  • 30 automated tests (v4→v5, v5→v6, v6→v6, format detection, models)
  • Interactive: v4 xls → v5 xlsx (343 → 600 terms, 328 shared)
  • Interactive: v5 xlsx → v6 LinkML (600 → 1059 terms, 560 shared)
  • Interactive: v4 xls → v6 LinkML (343 → 1059 terms, 306 shared)
  • Interactive: GitHub tagged v6.2.0 → v6.2.3
  • Interactive: GitHub mixs6.0.0 → main with mapping config (804 → 1099 terms)
  • Release workflow simulation with new inputs

Closes #1114

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings February 3, 2026 16:57
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 3, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://GenomicsStandardsConsortium.github.io/mixs/pr-preview/pr-1115/

Built to branch gh-pages at 2026-03-10 23:02 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds a new mixs-legacy-diff CLI tool for comparing MIxS schemas across different historical formats and versions. The tool enables comparisons between Excel (.xls, .xlsx), Word (.docx), XSD (.xsd), and LinkML YAML formats through a configuration-driven approach using YAML profiles.

Changes:

  • Adds new src/mixs/diff/ module with readers for Excel, Word, XSD, and LinkML formats
  • Implements configuration-based Excel parsing using YAML profiles in assets/legacy_format_profiles/
  • Provides normalized schema representation through NormalizedTerm and NormalizedSchema classes for cross-format comparison
  • Adds optional dependency group legacy-diff containing openpyxl, xlrd, and python-docx

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
pyproject.toml Adds optional legacy-diff dependency group and CLI entry point
poetry.lock Updates lock file with new dependencies for legacy format support
src/mixs/diff/__init__.py Module initialization with clear warnings about internal-only data models
src/mixs/diff/models.py Defines normalized data structures for cross-format schema comparison
src/mixs/diff/cli.py CLI entry point with argument validation and error handling
src/mixs/diff/comparison.py Schema comparison logic with support for term name mappings
src/mixs/diff/output.py YAML and text output formatting for comparison results
src/mixs/diff/readers/base.py Abstract base reader with format detection
src/mixs/diff/readers/excel_reader.py Profile-driven Excel (.xls, .xlsx) reader
src/mixs/diff/readers/linkml_reader.py LinkML YAML reader with GitHub spec support
src/mixs/diff/readers/word_reader.py Word document (.docx) table parser
src/mixs/diff/readers/xsd_reader.py XML Schema (.xsd) parser
assets/legacy_format_profiles/*.yaml Configuration files for Excel format variations

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@turbomam turbomam force-pushed the 1114-legacy-diff-tool branch from d6ce56c to adbaf5d Compare March 10, 2026 19:28
@turbomam turbomam changed the title Add mixs-legacy-diff CLI tool for comparing historical schema formats Unified mixs-legacy-diff tool: v4+ schema comparison with CI tests Mar 10, 2026
turbomam and others added 24 commits March 10, 2026 16:59
Implements a new CLI tool that enables comparison of MIxS schemas across
any combination of formats and versions (Excel, Word, XSD, LinkML YAML).

Features:
- Configuration-driven Excel parsing via YAML profiles in
  assets/legacy_format_profiles/ (no hardcoded version-specific logic)
- Readers for .xls, .xlsx, .docx, .xsd, and LinkML YAML
- Support for GitHub specs (owner/repo@commit:path)
- Name mapping support for tracking term renames across versions
- YAML and summary text output formats

New files:
- src/mixs/diff/ - Python module with readers and comparison logic
- assets/legacy_format_profiles/ - YAML configs for Excel format variations

Dependencies added as optional group (poetry install --with legacy-diff):
- openpyxl (xlsx), xlrd <2.0 (xls), python-docx (docx)

Closes #1114

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Includes:
- version_registry.yaml: machine-readable catalog of all known versions (2005-2026)
- mixs_version_timeline.md: human-readable timeline with methodology notes
- README.md: brief explanation of contents

Covers versions from earliest MIGS draft (2005) through v6.2.3, with dates
derived from GitHub releases API, xlsx/docx metadata, and publication DOIs.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Fix linkml_reader to download imported schema files for multi-file
  schemas like v6.0.0 and v6.1.x (recursively fetches imports to temp dir)
- Add v5 to v6.0.0 comparison results (94 vs 804 terms, 85 shared)
- Add v5 to v6.2.0 comparison results (94 vs 1059 terms, 82 shared)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Keep only v5_to_v6.0.0 for the initial v5 to v6 transition.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
v4 (55 terms) → v5 (94 terms): 47 shared, 8 removed, 47 added

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Notes are included in output metadata to document which files were
used and which were not (e.g., package-specific extracts).

Updated v4_to_v5 diff with notes documenting unused files.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
v4→v5 mappings:
- env_biome → env_broad_scale
- env_feature → env_local_scale
- env_material → env_medium
- annot_source → annot

v5→v6 mappings:
- health_disease_stat → host_disease_stat
- 16s_recover → x_16s_recover
- 16s_recover_software → x_16s_recover_software

Regenerated diff reports with mappings applied.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add MappingConfig dataclass supporting renames, splits, and deletions
- mapping_config.yaml format for documenting term changes with reasons
- Update comparator to account for splits (1:N) and intentional deletions
- Update summary output to display splits and deletion reasons
- Fix v6.0.0 class detection (case-insensitive matching)
- Add votu→otu mappings for v5→v6

v4→v5: All 8 "removed" terms now explained (4 renames, 2 splits, 2 deletions)
v5→v6: All 9 "removed" terms now explained (6 renames, 3 deletions)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Fix package name normalization to handle hyphens and slashes
- Fix checklist name mapping for v6.0.0 names (e.g., "MIGS eukaryote" → "migs_eu")
- Allow mixin classes that match checklist patterns (v6.0.0 uses mixins)
- Consistent snake_case output for all package/checklist names

Package comparison: v5→v6.0.0 now shows 16 shared (was 4)
Checklist comparison: v5→v6.0.0 now shows 10 shared (was 0)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add "Structured comment name" to name column mappings
- Add "Checklist item" to item column mappings
- Add v2.x sheet names (migs_mims_miens, MIMARKS, etc.)

v2.1 files now correctly load 49 terms for comparison with v4.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Create v2.1_to_v4 diff report documenting schema evolution
- Add mapping config explaining term renames (country→geo_loc_name,
  source_mat_identifiers→source_mat_id, submit_to_insdc→submitted_to_insdc)
- Document environment→env_biome/env_feature/env_material split
- All 49 v2.1 terms now accounted for (0 unexplained removals)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add excel_2010.yaml profile for v2.0 files (headers at row 16)
- Update excel_2010_2011.yaml to only match /2011/ directory
- Add header_row support to Excel reader
- Add term merge support (N:1 mapping, opposite of splits)
- Create v2.0_to_v2.1 diff documenting:
  - 3 renames (geo_loc_name→country, etc.)
  - 1 merge (biome+feature+material→environment)
  - 4 temporary removals (re-added in v4)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add --old-additional and --new-additional CLI options for merging multiple
  files into a single schema representation
- Add NormalizedSchema.merge() method for combining schemas
- Update v2.0 to v2.1 diff to use pooled files:
  - v2.0: MIGS_MIMS + MIENS (55 terms)
  - v2.1: Migs_mims_miens + MIMARKS (57 terms)
- Update mapping config to reflect that MIMARKS kept old naming conventions
  (no renames/merges needed for pooled comparison)
- Only 2 terms genuinely removed (annot_source, assembly_name)
- 4 terms added by v2.1 core reorganization (country, environment, etc.)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Update v2.1_to_v4 to use pooled v2.1 (Migs_mims_miens + MIMARKS)
- Add deletions for MIMARKS biome/feature/material terms (superseded by env_*)
- Now correctly tracks 57 v2.1 terms → 55 v4 terms
- Only 2 genuinely new terms in v4 (annot_source, assembly_name re-added)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Track term membership changes in checklists and packages
- Detect requirement level changes (mandatory <-> optional)
- Add package composition comparison (terms added/removed per package)
- Add --include-membership CLI flag (default: enabled)
- Create terminology_by_version.yaml glossary
- Regenerate all diff reports with membership data

New output sections:
- MEMBERSHIP CHANGES: tracks terms added/removed from groups
- PACKAGE COMPOSITION CHANGES: tracks term list changes per package
- YAML membership_differences and package_composition_differences

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Major changes:
- Add cache-based Word reader (removes API dependency)
- Document XSD as parallel branch, not part of main lineage
- Create mapping configs for all early transitions
- Add diff results for entire pre-Excel era

Mapping configs added:
- v1.1_to_v1.2: 1 rename, 8 deletions documented
- v1.2_to_may2009: 5 renames, 1 split, 2 deletions
- may2009_to_nov2009: rename tracking
- nov2009_to_oct2010: verbose→abbreviated naming

All diffs now have zero unexplained term removals.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
New features:
- Term evolution report (term_evolution.yaml) tracking all terms
  across versions with events (added, renamed, split, removed)
- Package composition now shows specific terms added/removed
- Definition changes show inline diffs with [-removed-]/[+added+]
- Terminology glossary updated with early version documentation

Files added:
- src/mixs/diff/evolution.py - term evolution tracking module
- assets/diff_results/term_evolution.yaml - 845 terms, 911 events

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Let the diff engine discover and report unexplained removals
rather than pre-documenting them with generic reasons like
"Removed in vX". This helps users see what terms were actually
lost between versions.

Changes:
- v5→v6: Keep only 3 structural deletions, remove 25+ generic ones
- v4→v5: Remove generic deletions, add alt_elev split and finishing_strategy merge
- Regenerate all affected diff results

Now shows:
- v4→v5: 11 unexplained removals (alt, elev, finishing_strategy, etc.)
- v5→v6: 28 unexplained removals (host_blood_press_diast, etc.)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The packages/environmental_packages sheets contain 500+ additional
terms that were not being loaded because the reader only updated
membership for existing terms. Now creates new NormalizedTerm
objects for terms found in packages sheets.

This fix increased term counts:
- v5: 94 → 600 terms
- v4: 55 → 343 terms

Extracts all available fields from package sheet rows:
- item, definition, expected_value, value_syntax
- example, section, preferred_unit, occurrence, mixs_id

Co-Authored-By: Claude Opus 4.5 <[email protected]>
deptry flagged all `from mixs import ...` in src/mixs/diff/ as
transitive dependency violations (DEP003) because it didn't
recognize `mixs` as the current project's own package. Adding
`known_first_party = ["mixs"]` to [tool.deptry] resolves all 23
false positives.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Move `import re` to module level in xsd_reader.py and excel_reader.py
- Remove unused typing imports across 5 files
- Wrap openpyxl workbook in try-finally for proper cleanup on error
- Use path_lower consistently in version detection checks

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The pre-v4 format readers (Word .docx, XSD, 2009-2011 Excel) had parsing
issues with headerless formats and inflated term counts. Rather than ship
broken code, scope the tool to the stable v4→v5→v6+ comparison chain and
catalog older formats in version_registry.yaml for historical reference.

Removed:
- Word reader, XSD reader, pre-extracted word_extracted_terms.json
- 5 pre-v4 Excel format profiles (2009, 2009_early, 2010, 2010_miens, 2010_2011)
- 8 pre-v4 diff result directories and mapping configs
- Cross-version evolution tracker (evolution.py)
- python-docx dependency

Added:
- 30 tests covering all three comparison paths (v4→v5, v5→v6, v6→v6)
  plus reader unit tests, format detection, and model tests

Retained:
- version_registry.yaml (full historical catalog from 2005 to present)
- terminology_by_version.yaml (naming convention reference)
- v4, v5, and default Excel format profiles
- 4_to_5, 5_to_6, and 6_to_pre_7 mapping configs and diff results
- Delete diff-releases script (2408 lines) — replaced by mixs-legacy-diff
- Convert 5 TSV mapping files to unified mapping_config.yaml (grouped by entity type)
- Update comparison.py to handle grouped renames/deletions format
- Fix release workflow: inputs now accept full GitHub specs with paths,
  fixing 404 when comparing pre-v6.2 tags (model/schema/ vs src/mixs/schema/)
- Add test-legacy-diff.yaml CI workflow that runs all 30 tests on PRs
- Add tests/conftest.py to auto-download Excel files from public mixs-legacy repo
- Remove unnecessary skip gates on LinkML-only tests

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add src/mixs/diff/README.md with tag/path reference table, usage
  examples, mapping hint format, and release workflow integration docs
- Delete contrib/docs/SCHEMA_DIFFING.md (referenced deleted script and
  personal Mac paths)
- Delete contrib/docs/diff_two_linkml_mixs_releases_analysis.md
  (described the 2700-line script that no longer exists)
- Delete assets/diff_results/releases.yaml and
  schema_comparison_results.yaml (outputs from old diff-releases tool)
- Update release/README.md: replace "working to develop robust methods"
  with pointer to mixs-legacy-diff tool
- Improve workflow input descriptions with path guidance and doc link

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@turbomam turbomam force-pushed the 1114-legacy-diff-tool branch from f5b8254 to 1ca3de5 Compare March 10, 2026 21:01
@turbomam
Copy link
Copy Markdown
Member Author

Rebase & cross-reference note (2026-03-10)

This PR was rebased onto current main along with #1090, #1110, and #1130.

Merge order: this PR should merge FIRST. It changes the release workflow parameters (unified mixs-legacy-diff tool) and replaces the diff result files that both #1110 and #1090 reference. Specifically:

The test-legacy-diff CI workflow is now running for the first time on this branch.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 39 out of 45 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +468 to +471
# Update term's package membership
req_idx = col_map.get("requirement")
requirement = str(row[req_idx]).strip() if req_idx and row[req_idx] else ""
schema.terms[term_name].package_membership[pkg_name_normalized] = requirement
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue in _process_packages_sheet_xls: if req_idx and row[req_idx] will skip requirement values when the requirement column is at index 0. Prefer req_idx is not None (and bounds check) to avoid dropping data.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid bug. On line 395 (xlsx) and line 470 (xls), if req_idx and row[req_idx] will incorrectly evaluate to falsy when req_idx == 0 (i.e., when the requirement column is the first column). Should be if req_idx is not None and row[req_idx]. Will fix.

Comment on lines +195 to +203
# Get readers
old_fmt = detect_format(old) if old_format == 'auto' else old_format
new_fmt = detect_format(new) if new_format == 'auto' else new_format

logger.info(f"Detected formats - Old: {old_fmt}, New: {new_fmt}")

old_reader = get_reader(old, profiles_dir=profile_dir)
new_reader = get_reader(new, profiles_dir=profile_dir)

Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The --old-format / --new-format options are computed (old_fmt/new_fmt) but not actually used to choose the reader: get_reader(old, ...) always re-detects from the path. As a result, forcing formats has no effect beyond logging. Wire the forced formats into reader selection (e.g., pass a format_override into get_reader() or select ExcelReader/LinkMLReader directly based on the option).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid observation. The --old-format and --new-format options compute old_fmt/new_fmt (lines 196-197) but then get_reader() at lines 201-202 uses the raw path for auto-detection, ignoring the format override. The format variables are only used for the log message at line 199. Will fix — get_reader() should accept a format override parameter, or we should select the reader based on the resolved format.

Comment on lines +7 to +9
import os
import urllib.request
from pathlib import Path
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import: os is imported but not used in this test fixture module. Remove it to avoid lint noise and keep dependencies clear.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid — import os on line 7 of conftest.py is unused. Will remove.

Comment on lines +6 to +12
import logging
import os
import re
import shutil
import tempfile
from pathlib import Path
from typing import Optional
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import: os is imported but not referenced anywhere in this module. Removing it avoids lint noise and makes the module's dependencies clearer.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid — import os on line 7 of linkml_reader.py is unused (was needed by the old temp-file approach that used os.unlink(), which was refactored to use shutil.rmtree() instead). Will remove.

paths:
- "src/mixs/diff/**"
- "tests/test_legacy_diff.py"
- "tests/conftest.py"
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the workflow triggers, pull_request.paths includes assets/legacy_format_profiles/** and assets/between_diff_mappings/**, but the push trigger only watches src/mixs/diff/** and tests. That means pushes to main that only change mapping/profile assets won't run this job. Consider mirroring the asset paths under push.paths as well for consistent coverage.

Suggested change
- "tests/conftest.py"
- "tests/conftest.py"
- "assets/legacy_format_profiles/**"
- "assets/between_diff_mappings/**"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid — the push trigger (lines 12-17) is missing the asset paths that are present in the pull_request trigger (assets/legacy_format_profiles/** and assets/between_diff_mappings/**). Changes to format profiles or mapping configs pushed directly to main would not trigger the workflow. Will fix.

Comment on lines +393 to +396
# Update term's package membership
req_idx = col_map.get("requirement")
requirement = str(row[req_idx]).strip() if req_idx and row[req_idx] else ""
schema.terms[term_name].package_membership[pkg_name_normalized] = requirement
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In _process_packages_sheet_xlsx, requirement = ... if req_idx and row[req_idx] else "" treats column index 0 as falsy, so a valid requirement column at index 0 would be ignored. Use an explicit req_idx is not None check (and bounds check) instead of truthiness.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same bug as comment #12if req_idx and row[req_idx] on line 395 (xlsx path) has the same falsy-zero problem. Will fix both together using req_idx is not None.

Comment on lines +640 to +648
# Load merges with descriptions (N:1 [old1, old2] -> new)
merges_data = data.get('merges', {}) or {}
for new_name, merge_info in merges_data.items():
if isinstance(merge_info, list):
config.merges[new_name] = merge_info
elif isinstance(merge_info, dict):
config.merges[new_name] = merge_info.get('sources', merge_info.get('old_names', []))
if 'description' in merge_info:
config.merge_descriptions[new_name] = merge_info['description']
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_mapping_config() doesn't parse the merge structure used in assets/between_diff_mappings/4_to_5/mapping_config.yaml (merges: <old_name>: {target: <new_name>, ...}). The current code expects merges to be keyed by the new name with a list/dict of source names, so merges from that config will silently become empty. Update the parser to accept the {target: ...} form (or change the mapping YAML structure to match the parser) so merges are actually applied.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the current code in load_mapping_config() (comparison.py lines 596-671), it does handle merges — see lines 641-648 which parse the merges key from the YAML config, supporting both list and dict formats with optional descriptions. The merge structure is parsed into config.merges (Dict[str, List[str]]) and config.merge_descriptions. If the comment is about a specific merge YAML structure that isn't being parsed correctly, could you share an example of the config that fails?

Comment on lines +164 to +172
def setUpClass(cls):
cls.v4_file, cls.v5_file = ensure_legacy_files()
cls.output_dir = Path("/tmp/test_legacy_diff_v4_to_v5")
if cls.output_dir.exists():
shutil.rmtree(cls.output_dir)
cls.result, cls.parsed, cls.summary_path = _run_comparison(
cls.v4_file, cls.v5_file, cls.output_dir,
mappings_dir=MAPPINGS_DIR / "4_to_5",
)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests write to hard-coded /tmp/test_legacy_diff_* directories. This can cause cross-run conflicts (parallel test runs) and makes the suite less portable (non-POSIX environments). Prefer tempfile.TemporaryDirectory() (or pytest's tmp_path) to allocate unique per-run output dirs.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid concern. The tests use hardcoded Path("/tmp/test_legacy_diff_*") paths (lines 166, 224, 271). Since these are unittest.TestCase classes (not pytest functions), the tmp_path fixture isn't available directly. However, the tests do clean up via tearDownClass with shutil.rmtree(). The pragmatic fix would be to switch to tempfile.mkdtemp() for cross-platform safety — /tmp doesn't exist on Windows. Will address.

Comment on lines +15 to +20
# Raw GitHub URLs for the two Excel files we need
_GITHUB_RAW = "https://raw.githubusercontent.com/GenomicsStandardsConsortium/mixs-legacy/master"
_EXCEL_FILES = {
"mixs4/MIxS_210514.xls": f"{_GITHUB_RAW}/mixs4/MIxS_210514.xls",
"mixs5/mixs_v5.xlsx": f"{_GITHUB_RAW}/mixs5/mixs_v5.xlsx",
}
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test data download URLs are pinned to the mixs-legacy repo's master branch. This makes tests non-deterministic if the upstream files change or if the default branch is renamed. Consider pinning to an immutable ref (tag/commit SHA) or mirroring the test fixtures into this repo to keep CI stable.

Copilot uses AI. Check for mistakes.
- Fix falsy-zero bug: use `req_idx is not None` instead of `req_idx` in
  both _process_packages_sheet_xlsx and _process_packages_sheet_xls
- Wire --old-format/--new-format CLI options through to get_reader() via
  new format_override parameter so format overrides actually take effect
- Remove unused `import os` from conftest.py and linkml_reader.py
- Add missing asset paths to push trigger in test-legacy-diff.yaml
- Replace hardcoded /tmp paths with tempfile.mkdtemp() in tests

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add mixs-legacy-diff CLI tool for comparing historical MIxS schema formats

2 participants