Unified mixs-legacy-diff tool: v4+ schema comparison with CI tests#1115
Unified mixs-legacy-diff tool: v4+ schema comparison with CI tests#1115
Conversation
|
There was a problem hiding this comment.
Pull request overview
This pull request adds a new mixs-legacy-diff CLI tool for comparing MIxS schemas across different historical formats and versions. The tool enables comparisons between Excel (.xls, .xlsx), Word (.docx), XSD (.xsd), and LinkML YAML formats through a configuration-driven approach using YAML profiles.
Changes:
- Adds new
src/mixs/diff/module with readers for Excel, Word, XSD, and LinkML formats - Implements configuration-based Excel parsing using YAML profiles in
assets/legacy_format_profiles/ - Provides normalized schema representation through
NormalizedTermandNormalizedSchemaclasses for cross-format comparison - Adds optional dependency group
legacy-diffcontaining openpyxl, xlrd, and python-docx
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
pyproject.toml |
Adds optional legacy-diff dependency group and CLI entry point |
poetry.lock |
Updates lock file with new dependencies for legacy format support |
src/mixs/diff/__init__.py |
Module initialization with clear warnings about internal-only data models |
src/mixs/diff/models.py |
Defines normalized data structures for cross-format schema comparison |
src/mixs/diff/cli.py |
CLI entry point with argument validation and error handling |
src/mixs/diff/comparison.py |
Schema comparison logic with support for term name mappings |
src/mixs/diff/output.py |
YAML and text output formatting for comparison results |
src/mixs/diff/readers/base.py |
Abstract base reader with format detection |
src/mixs/diff/readers/excel_reader.py |
Profile-driven Excel (.xls, .xlsx) reader |
src/mixs/diff/readers/linkml_reader.py |
LinkML YAML reader with GitHub spec support |
src/mixs/diff/readers/word_reader.py |
Word document (.docx) table parser |
src/mixs/diff/readers/xsd_reader.py |
XML Schema (.xsd) parser |
assets/legacy_format_profiles/*.yaml |
Configuration files for Excel format variations |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
d6ce56c to
adbaf5d
Compare
Implements a new CLI tool that enables comparison of MIxS schemas across any combination of formats and versions (Excel, Word, XSD, LinkML YAML). Features: - Configuration-driven Excel parsing via YAML profiles in assets/legacy_format_profiles/ (no hardcoded version-specific logic) - Readers for .xls, .xlsx, .docx, .xsd, and LinkML YAML - Support for GitHub specs (owner/repo@commit:path) - Name mapping support for tracking term renames across versions - YAML and summary text output formats New files: - src/mixs/diff/ - Python module with readers and comparison logic - assets/legacy_format_profiles/ - YAML configs for Excel format variations Dependencies added as optional group (poetry install --with legacy-diff): - openpyxl (xlsx), xlrd <2.0 (xls), python-docx (docx) Closes #1114 Co-Authored-By: Claude Opus 4.5 <[email protected]>
Includes: - version_registry.yaml: machine-readable catalog of all known versions (2005-2026) - mixs_version_timeline.md: human-readable timeline with methodology notes - README.md: brief explanation of contents Covers versions from earliest MIGS draft (2005) through v6.2.3, with dates derived from GitHub releases API, xlsx/docx metadata, and publication DOIs. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Fix linkml_reader to download imported schema files for multi-file schemas like v6.0.0 and v6.1.x (recursively fetches imports to temp dir) - Add v5 to v6.0.0 comparison results (94 vs 804 terms, 85 shared) - Add v5 to v6.2.0 comparison results (94 vs 1059 terms, 82 shared) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Keep only v5_to_v6.0.0 for the initial v5 to v6 transition. Co-Authored-By: Claude Opus 4.5 <[email protected]>
v4 (55 terms) → v5 (94 terms): 47 shared, 8 removed, 47 added Co-Authored-By: Claude Opus 4.5 <[email protected]>
Notes are included in output metadata to document which files were used and which were not (e.g., package-specific extracts). Updated v4_to_v5 diff with notes documenting unused files. Co-Authored-By: Claude Opus 4.5 <[email protected]>
v4→v5 mappings: - env_biome → env_broad_scale - env_feature → env_local_scale - env_material → env_medium - annot_source → annot v5→v6 mappings: - health_disease_stat → host_disease_stat - 16s_recover → x_16s_recover - 16s_recover_software → x_16s_recover_software Regenerated diff reports with mappings applied. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add MappingConfig dataclass supporting renames, splits, and deletions - mapping_config.yaml format for documenting term changes with reasons - Update comparator to account for splits (1:N) and intentional deletions - Update summary output to display splits and deletion reasons - Fix v6.0.0 class detection (case-insensitive matching) - Add votu→otu mappings for v5→v6 v4→v5: All 8 "removed" terms now explained (4 renames, 2 splits, 2 deletions) v5→v6: All 9 "removed" terms now explained (6 renames, 3 deletions) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Fix package name normalization to handle hyphens and slashes - Fix checklist name mapping for v6.0.0 names (e.g., "MIGS eukaryote" → "migs_eu") - Allow mixin classes that match checklist patterns (v6.0.0 uses mixins) - Consistent snake_case output for all package/checklist names Package comparison: v5→v6.0.0 now shows 16 shared (was 4) Checklist comparison: v5→v6.0.0 now shows 10 shared (was 0) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add "Structured comment name" to name column mappings - Add "Checklist item" to item column mappings - Add v2.x sheet names (migs_mims_miens, MIMARKS, etc.) v2.1 files now correctly load 49 terms for comparison with v4. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Create v2.1_to_v4 diff report documenting schema evolution - Add mapping config explaining term renames (country→geo_loc_name, source_mat_identifiers→source_mat_id, submit_to_insdc→submitted_to_insdc) - Document environment→env_biome/env_feature/env_material split - All 49 v2.1 terms now accounted for (0 unexplained removals) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add excel_2010.yaml profile for v2.0 files (headers at row 16) - Update excel_2010_2011.yaml to only match /2011/ directory - Add header_row support to Excel reader - Add term merge support (N:1 mapping, opposite of splits) - Create v2.0_to_v2.1 diff documenting: - 3 renames (geo_loc_name→country, etc.) - 1 merge (biome+feature+material→environment) - 4 temporary removals (re-added in v4) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add --old-additional and --new-additional CLI options for merging multiple files into a single schema representation - Add NormalizedSchema.merge() method for combining schemas - Update v2.0 to v2.1 diff to use pooled files: - v2.0: MIGS_MIMS + MIENS (55 terms) - v2.1: Migs_mims_miens + MIMARKS (57 terms) - Update mapping config to reflect that MIMARKS kept old naming conventions (no renames/merges needed for pooled comparison) - Only 2 terms genuinely removed (annot_source, assembly_name) - 4 terms added by v2.1 core reorganization (country, environment, etc.) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Update v2.1_to_v4 to use pooled v2.1 (Migs_mims_miens + MIMARKS) - Add deletions for MIMARKS biome/feature/material terms (superseded by env_*) - Now correctly tracks 57 v2.1 terms → 55 v4 terms - Only 2 genuinely new terms in v4 (annot_source, assembly_name re-added) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Track term membership changes in checklists and packages - Detect requirement level changes (mandatory <-> optional) - Add package composition comparison (terms added/removed per package) - Add --include-membership CLI flag (default: enabled) - Create terminology_by_version.yaml glossary - Regenerate all diff reports with membership data New output sections: - MEMBERSHIP CHANGES: tracks terms added/removed from groups - PACKAGE COMPOSITION CHANGES: tracks term list changes per package - YAML membership_differences and package_composition_differences Co-Authored-By: Claude Opus 4.5 <[email protected]>
Major changes: - Add cache-based Word reader (removes API dependency) - Document XSD as parallel branch, not part of main lineage - Create mapping configs for all early transitions - Add diff results for entire pre-Excel era Mapping configs added: - v1.1_to_v1.2: 1 rename, 8 deletions documented - v1.2_to_may2009: 5 renames, 1 split, 2 deletions - may2009_to_nov2009: rename tracking - nov2009_to_oct2010: verbose→abbreviated naming All diffs now have zero unexplained term removals. Co-Authored-By: Claude Opus 4.5 <[email protected]>
New features: - Term evolution report (term_evolution.yaml) tracking all terms across versions with events (added, renamed, split, removed) - Package composition now shows specific terms added/removed - Definition changes show inline diffs with [-removed-]/[+added+] - Terminology glossary updated with early version documentation Files added: - src/mixs/diff/evolution.py - term evolution tracking module - assets/diff_results/term_evolution.yaml - 845 terms, 911 events Co-Authored-By: Claude Opus 4.5 <[email protected]>
Let the diff engine discover and report unexplained removals rather than pre-documenting them with generic reasons like "Removed in vX". This helps users see what terms were actually lost between versions. Changes: - v5→v6: Keep only 3 structural deletions, remove 25+ generic ones - v4→v5: Remove generic deletions, add alt_elev split and finishing_strategy merge - Regenerate all affected diff results Now shows: - v4→v5: 11 unexplained removals (alt, elev, finishing_strategy, etc.) - v5→v6: 28 unexplained removals (host_blood_press_diast, etc.) Co-Authored-By: Claude Opus 4.5 <[email protected]>
The packages/environmental_packages sheets contain 500+ additional terms that were not being loaded because the reader only updated membership for existing terms. Now creates new NormalizedTerm objects for terms found in packages sheets. This fix increased term counts: - v5: 94 → 600 terms - v4: 55 → 343 terms Extracts all available fields from package sheet rows: - item, definition, expected_value, value_syntax - example, section, preferred_unit, occurrence, mixs_id Co-Authored-By: Claude Opus 4.5 <[email protected]>
deptry flagged all `from mixs import ...` in src/mixs/diff/ as transitive dependency violations (DEP003) because it didn't recognize `mixs` as the current project's own package. Adding `known_first_party = ["mixs"]` to [tool.deptry] resolves all 23 false positives. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Move `import re` to module level in xsd_reader.py and excel_reader.py - Remove unused typing imports across 5 files - Wrap openpyxl workbook in try-finally for proper cleanup on error - Use path_lower consistently in version detection checks Co-Authored-By: Claude Opus 4.6 <[email protected]>
The pre-v4 format readers (Word .docx, XSD, 2009-2011 Excel) had parsing issues with headerless formats and inflated term counts. Rather than ship broken code, scope the tool to the stable v4→v5→v6+ comparison chain and catalog older formats in version_registry.yaml for historical reference. Removed: - Word reader, XSD reader, pre-extracted word_extracted_terms.json - 5 pre-v4 Excel format profiles (2009, 2009_early, 2010, 2010_miens, 2010_2011) - 8 pre-v4 diff result directories and mapping configs - Cross-version evolution tracker (evolution.py) - python-docx dependency Added: - 30 tests covering all three comparison paths (v4→v5, v5→v6, v6→v6) plus reader unit tests, format detection, and model tests Retained: - version_registry.yaml (full historical catalog from 2005 to present) - terminology_by_version.yaml (naming convention reference) - v4, v5, and default Excel format profiles - 4_to_5, 5_to_6, and 6_to_pre_7 mapping configs and diff results
- Delete diff-releases script (2408 lines) — replaced by mixs-legacy-diff - Convert 5 TSV mapping files to unified mapping_config.yaml (grouped by entity type) - Update comparison.py to handle grouped renames/deletions format - Fix release workflow: inputs now accept full GitHub specs with paths, fixing 404 when comparing pre-v6.2 tags (model/schema/ vs src/mixs/schema/) - Add test-legacy-diff.yaml CI workflow that runs all 30 tests on PRs - Add tests/conftest.py to auto-download Excel files from public mixs-legacy repo - Remove unnecessary skip gates on LinkML-only tests Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add src/mixs/diff/README.md with tag/path reference table, usage examples, mapping hint format, and release workflow integration docs - Delete contrib/docs/SCHEMA_DIFFING.md (referenced deleted script and personal Mac paths) - Delete contrib/docs/diff_two_linkml_mixs_releases_analysis.md (described the 2700-line script that no longer exists) - Delete assets/diff_results/releases.yaml and schema_comparison_results.yaml (outputs from old diff-releases tool) - Update release/README.md: replace "working to develop robust methods" with pointer to mixs-legacy-diff tool - Improve workflow input descriptions with path guidance and doc link Co-Authored-By: Claude Opus 4.6 <[email protected]>
f5b8254 to
1ca3de5
Compare
|
Rebase & cross-reference note (2026-03-10) This PR was rebased onto current Merge order: this PR should merge FIRST. It changes the release workflow parameters (unified
The |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 39 out of 45 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Update term's package membership | ||
| req_idx = col_map.get("requirement") | ||
| requirement = str(row[req_idx]).strip() if req_idx and row[req_idx] else "" | ||
| schema.terms[term_name].package_membership[pkg_name_normalized] = requirement |
There was a problem hiding this comment.
Same issue in _process_packages_sheet_xls: if req_idx and row[req_idx] will skip requirement values when the requirement column is at index 0. Prefer req_idx is not None (and bounds check) to avoid dropping data.
There was a problem hiding this comment.
Valid bug. On line 395 (xlsx) and line 470 (xls), if req_idx and row[req_idx] will incorrectly evaluate to falsy when req_idx == 0 (i.e., when the requirement column is the first column). Should be if req_idx is not None and row[req_idx]. Will fix.
| # Get readers | ||
| old_fmt = detect_format(old) if old_format == 'auto' else old_format | ||
| new_fmt = detect_format(new) if new_format == 'auto' else new_format | ||
|
|
||
| logger.info(f"Detected formats - Old: {old_fmt}, New: {new_fmt}") | ||
|
|
||
| old_reader = get_reader(old, profiles_dir=profile_dir) | ||
| new_reader = get_reader(new, profiles_dir=profile_dir) | ||
|
|
There was a problem hiding this comment.
The --old-format / --new-format options are computed (old_fmt/new_fmt) but not actually used to choose the reader: get_reader(old, ...) always re-detects from the path. As a result, forcing formats has no effect beyond logging. Wire the forced formats into reader selection (e.g., pass a format_override into get_reader() or select ExcelReader/LinkMLReader directly based on the option).
There was a problem hiding this comment.
Valid observation. The --old-format and --new-format options compute old_fmt/new_fmt (lines 196-197) but then get_reader() at lines 201-202 uses the raw path for auto-detection, ignoring the format override. The format variables are only used for the log message at line 199. Will fix — get_reader() should accept a format override parameter, or we should select the reader based on the resolved format.
tests/conftest.py
Outdated
| import os | ||
| import urllib.request | ||
| from pathlib import Path |
There was a problem hiding this comment.
Unused import: os is imported but not used in this test fixture module. Remove it to avoid lint noise and keep dependencies clear.
There was a problem hiding this comment.
Valid — import os on line 7 of conftest.py is unused. Will remove.
| import logging | ||
| import os | ||
| import re | ||
| import shutil | ||
| import tempfile | ||
| from pathlib import Path | ||
| from typing import Optional |
There was a problem hiding this comment.
Unused import: os is imported but not referenced anywhere in this module. Removing it avoids lint noise and makes the module's dependencies clearer.
There was a problem hiding this comment.
Valid — import os on line 7 of linkml_reader.py is unused (was needed by the old temp-file approach that used os.unlink(), which was refactored to use shutil.rmtree() instead). Will remove.
| paths: | ||
| - "src/mixs/diff/**" | ||
| - "tests/test_legacy_diff.py" | ||
| - "tests/conftest.py" |
There was a problem hiding this comment.
In the workflow triggers, pull_request.paths includes assets/legacy_format_profiles/** and assets/between_diff_mappings/**, but the push trigger only watches src/mixs/diff/** and tests. That means pushes to main that only change mapping/profile assets won't run this job. Consider mirroring the asset paths under push.paths as well for consistent coverage.
| - "tests/conftest.py" | |
| - "tests/conftest.py" | |
| - "assets/legacy_format_profiles/**" | |
| - "assets/between_diff_mappings/**" |
There was a problem hiding this comment.
Valid — the push trigger (lines 12-17) is missing the asset paths that are present in the pull_request trigger (assets/legacy_format_profiles/** and assets/between_diff_mappings/**). Changes to format profiles or mapping configs pushed directly to main would not trigger the workflow. Will fix.
| # Update term's package membership | ||
| req_idx = col_map.get("requirement") | ||
| requirement = str(row[req_idx]).strip() if req_idx and row[req_idx] else "" | ||
| schema.terms[term_name].package_membership[pkg_name_normalized] = requirement |
There was a problem hiding this comment.
In _process_packages_sheet_xlsx, requirement = ... if req_idx and row[req_idx] else "" treats column index 0 as falsy, so a valid requirement column at index 0 would be ignored. Use an explicit req_idx is not None check (and bounds check) instead of truthiness.
There was a problem hiding this comment.
Same bug as comment #12 — if req_idx and row[req_idx] on line 395 (xlsx path) has the same falsy-zero problem. Will fix both together using req_idx is not None.
| # Load merges with descriptions (N:1 [old1, old2] -> new) | ||
| merges_data = data.get('merges', {}) or {} | ||
| for new_name, merge_info in merges_data.items(): | ||
| if isinstance(merge_info, list): | ||
| config.merges[new_name] = merge_info | ||
| elif isinstance(merge_info, dict): | ||
| config.merges[new_name] = merge_info.get('sources', merge_info.get('old_names', [])) | ||
| if 'description' in merge_info: | ||
| config.merge_descriptions[new_name] = merge_info['description'] |
There was a problem hiding this comment.
load_mapping_config() doesn't parse the merge structure used in assets/between_diff_mappings/4_to_5/mapping_config.yaml (merges: <old_name>: {target: <new_name>, ...}). The current code expects merges to be keyed by the new name with a list/dict of source names, so merges from that config will silently become empty. Update the parser to accept the {target: ...} form (or change the mapping YAML structure to match the parser) so merges are actually applied.
There was a problem hiding this comment.
Looking at the current code in load_mapping_config() (comparison.py lines 596-671), it does handle merges — see lines 641-648 which parse the merges key from the YAML config, supporting both list and dict formats with optional descriptions. The merge structure is parsed into config.merges (Dict[str, List[str]]) and config.merge_descriptions. If the comment is about a specific merge YAML structure that isn't being parsed correctly, could you share an example of the config that fails?
| def setUpClass(cls): | ||
| cls.v4_file, cls.v5_file = ensure_legacy_files() | ||
| cls.output_dir = Path("/tmp/test_legacy_diff_v4_to_v5") | ||
| if cls.output_dir.exists(): | ||
| shutil.rmtree(cls.output_dir) | ||
| cls.result, cls.parsed, cls.summary_path = _run_comparison( | ||
| cls.v4_file, cls.v5_file, cls.output_dir, | ||
| mappings_dir=MAPPINGS_DIR / "4_to_5", | ||
| ) |
There was a problem hiding this comment.
The tests write to hard-coded /tmp/test_legacy_diff_* directories. This can cause cross-run conflicts (parallel test runs) and makes the suite less portable (non-POSIX environments). Prefer tempfile.TemporaryDirectory() (or pytest's tmp_path) to allocate unique per-run output dirs.
There was a problem hiding this comment.
Valid concern. The tests use hardcoded Path("/tmp/test_legacy_diff_*") paths (lines 166, 224, 271). Since these are unittest.TestCase classes (not pytest functions), the tmp_path fixture isn't available directly. However, the tests do clean up via tearDownClass with shutil.rmtree(). The pragmatic fix would be to switch to tempfile.mkdtemp() for cross-platform safety — /tmp doesn't exist on Windows. Will address.
| # Raw GitHub URLs for the two Excel files we need | ||
| _GITHUB_RAW = "https://raw.githubusercontent.com/GenomicsStandardsConsortium/mixs-legacy/master" | ||
| _EXCEL_FILES = { | ||
| "mixs4/MIxS_210514.xls": f"{_GITHUB_RAW}/mixs4/MIxS_210514.xls", | ||
| "mixs5/mixs_v5.xlsx": f"{_GITHUB_RAW}/mixs5/mixs_v5.xlsx", | ||
| } |
There was a problem hiding this comment.
The test data download URLs are pinned to the mixs-legacy repo's master branch. This makes tests non-deterministic if the upstream files change or if the default branch is renamed. Consider pinning to an immutable ref (tag/commit SHA) or mirroring the test fixtures into this repo to keep CI stable.
- Fix falsy-zero bug: use `req_idx is not None` instead of `req_idx` in both _process_packages_sheet_xlsx and _process_packages_sheet_xls - Wire --old-format/--new-format CLI options through to get_reader() via new format_override parameter so format overrides actually take effect - Remove unused `import os` from conftest.py and linkml_reader.py - Add missing asset paths to push trigger in test-legacy-diff.yaml - Replace hardcoded /tmp paths with tempfile.mkdtemp() in tests Co-Authored-By: Claude Opus 4.6 <[email protected]>
Summary
mixs-legacy-diffCLI replaces both the olddiff-releasesscript (2408 lines, LinkML-only) and the original format-agnostic diff tool into a single entry pointassets/terminology_by_version.yamlfor future reimplementationWhat changed
mixs-legacy-diffCLImapping_config.yaml(grouped by slots/classes/enums/subsets)mixs-legacycheckoutRelease workflow change
The
create-release-pr.yamlworkflow inputs changed from bare refs to full GitHub specs:Before (broken for pre-v6.2 tags):
After (explicit paths, works for any tag):
To compare against
mixs6.0.0, use:GenomicsStandardsConsortium/[email protected]:model/schema/mixs.yamlThe key insight: the schema path changed from
model/schema/mixs.yaml(tagsmixs6.0.0throughmixs6.1.1) tosrc/mixs/schema/mixs.yaml(tagsv6.2.0onward). The old workflow hardcoded the newer path, so comparing against pre-v6.2 tags would silently fail.Usage
Test plan
Closes #1114
🤖 Generated with Claude Code