Skip to content

Test MIxS build against linkml main + TSV list normalization demo#1130

Open
turbomam wants to merge 11 commits intomainfrom
1129-linkml-1.10-csv-tsv-demo
Open

Test MIxS build against linkml main + TSV list normalization demo#1130
turbomam wants to merge 11 commits intomainfrom
1129-linkml-1.10-csv-tsv-demo

Conversation

@turbomam
Copy link
Copy Markdown
Member

@turbomam turbomam commented Mar 3, 2026

Closes #1129

Summary

  • Upgrade linkml and linkml-runtime to the fix-3250-empty-cell-crash branch (includes PR #3134 configurable list formatting + PR #3251 empty-cell crash fix)
  • Add schema-aware TSV list normalization demo for mixed-format input files
  • Add bare-pipe TSV round-trip demo (--list-wrapper none): YAML → TSV → YAML works end-to-end
  • New Makefile targets: normalize-tsv-demo, normalize-tsv-roundtrip, tsv-bare-pipe-roundtrip

TODO: Switch to linkml main or PyPI release

Both linkml and linkml-runtime are pinned to the fix-3250-empty-cell-crash branch because the empty-cell crash fix (linkml#3251) hasn't merged to main yet. Once #3251 merges, update pyproject.toml to point at main (or a PyPI release if available). This is marked with TODO comments in pyproject.toml.

What this demonstrates

  1. normalize_tsv_lists.py — pre-processing step for mixed-format TSV files (some rows bracketed, some bare-pipe, some spaced). linkml-convert expects one wrapper style per file, so this normalizes to a single style first.
  2. --list-wrapper none round-trip — with the empty-cell fix, linkml-convert dumps and loads bare-pipe TSV correctly. No brackets needed.
  3. Validator gaplinkml-validate still can't parse multivalued TSV fields (linkml#3147).

Test plan

  • make clean all all-contrib test passes
  • make normalize-tsv-demo normalizes messy TSV
  • make normalize-tsv-roundtrip completes full bracket-format TSV → YAML → TSV cycle
  • make tsv-bare-pipe-roundtrip completes full bare-pipe TSV round-trip
  • Switch pyproject.toml to linkml main/PyPI once #3251 merges

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 3, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://GenomicsStandardsConsortium.github.io/mixs/pr-preview/pr-1130/

Built to branch gh-pages at 2026-03-10 22:59 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates MIxS’ development/build environment to exercise LinkML mainline changes while adding a schema-aware demo workflow for normalizing heterogeneous multivalued TSV formatting into a canonical form that LinkML tooling can reliably parse.

Changes:

  • Pin linkml / linkml-runtime to GitHub main and refresh the Poetry lockfile.
  • Add a TSV list-normalization demo (script + README + sample input/output artifacts).
  • Add Makefile demo targets for normalization and round-trip conversions.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pyproject.toml Switches LinkML dependencies to git main for compatibility testing.
poetry.lock Updates locked dependency graph for the LinkML mainline bump.
src/data/examples/tsv-normalization/normalize_tsv_lists.py Adds a SchemaView-driven TSV normalizer for multivalued slots.
src/data/examples/tsv-normalization/README.md Documents the demo workflow and current upstream limitations.
src/data/examples/tsv-normalization/MimsSoil-messy-lists.tsv Provides messy TSV input examples with varied list styles.
src/data/examples/tsv-normalization/MimsSoil-normalized-lists.tsv Provides normalized TSV output example.
src/data/examples/tsv-normalization/MimsSoil-normalized.yaml Provides normalized YAML output example.
Makefile Adds demo targets for normalization and TSV/YAML round-trips.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

turbomam and others added 10 commits March 10, 2026 16:59
Upgrade linkml to GitHub main (1.10.0.post78.dev0, commit 1f5d5fa) and
linkml-runtime to ^1.10 — full build and test suite passes with no
regressions.

Add schema-aware TSV normalization demo showing how heterogeneous list
formatting (bracketed pipes, bare pipes, spaced pipes) can be detected
and normalized using SchemaView to identify multivalued slots. Includes
messy input TSV, normalization script, and Makefile targets
(normalize-tsv-demo, normalize-tsv-roundtrip).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Both linkml and linkml-runtime now install from the same monorepo
commit, resolving the version skew (linkml#3241). Updates the TSV
normalization README to reflect that --list-wrapper none works for
dumping but crashes on load with empty multivalued cells (linkml#3250).
Dump target works now; load target commented out pending linkml#3250 fix.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Switch both linkml and linkml-runtime to the fix-3250-empty-cell-crash
branch which fixes the json_clean crash on empty multivalued cells
(linkml/linkml#3250, PR linkml/linkml#3251).

Uncomment the bare-pipe TSV load target in Makefile now that the
crash is fixed — the full YAML → TSV → YAML round-trip works.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The lock file must be committed alongside pyproject.toml changes
so CI can install without re-resolving.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Use induced_slot() instead of get_slot() to respect slot_usage overrides
- Remove unused `re` import
- Remove unused --delimiter CLI option (hardcoded to pipe)
- Update README to reflect that bare-pipe load now works (fix-3250 branch)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The normalize script still fills a gap: linkml-convert expects one
list wrapper style per invocation, but real-world MIxS TSV files
often mix bracketed, bare-pipe, and spaced-pipe formats. The script
is a pre-processing step that normalizes mixed-format files to a
single canonical style.

Updated the script docstring to explain when you need it vs. when
linkml-convert handles things directly. Rewrote the README to frame
the full workflow: normalize → round-trip → bare-pipe demo.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Once linkml PR #3251 (empty-cell crash fix) merges to main,
both linkml and linkml-runtime should be switched from the
fix-3250-empty-cell-crash branch to main or a PyPI release.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…tations

Add a Next Steps section to the TSV normalization README covering:
- Boolean slot demo with Mimag/HumanAssociated example data (needs #3144)
- Validator integration once linkml#3147 is addressed
- Schema-level annotations to avoid per-command CLI flags

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@turbomam
Copy link
Copy Markdown
Member Author

Rebase & cross-reference note (2026-03-10)

This PR was rebased onto current main along with #1115, #1090, and #1110.

This PR (LinkML main compatibility test) is independent of the other three and can merge in any order. The other three have inter-dependencies:

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +301 to +337
# Round-trip: normalized TSV → YAML → TSV to prove linkml-convert can parse it.
normalize-tsv-roundtrip: $(TSV_NORM_OUTPUT) contrib/mixs-patterns-materialized.yaml
@echo "=== Converting normalized TSV → YAML ==="
$(RUN) linkml-convert \
-s contrib/mixs-patterns-materialized.yaml \
-C MixsCompliantData -S mims_soil_data \
-f tsv -t yaml --no-validate \
$(TSV_NORM_OUTPUT) > $(TSV_NORM_YAML)
@echo "Output: $(TSV_NORM_YAML)"
@echo ""
@cat $(TSV_NORM_YAML)
@echo ""
@echo "=== Round-trip: YAML → TSV ==="
$(RUN) linkml-convert \
-s contrib/mixs-patterns-materialized.yaml \
-C MixsCompliantData -S mims_soil_data \
-f yaml -t tsv --no-validate \
$(TSV_NORM_YAML)

# --- Bare-pipe TSV demo (exercises --list-wrapper none from linkml PR #3134) ---

# Dump: YAML → bare-pipe TSV (works now).
$(TSV_BARE_PIPE_OUTPUT): $(TSV_NORM_YAML) contrib/mixs-patterns-materialized.yaml
@echo "=== Dumping YAML → bare-pipe TSV (--list-wrapper none) ==="
$(RUN) linkml-convert \
-s contrib/mixs-patterns-materialized.yaml \
-C MixsCompliantData -S mims_soil_data \
-f yaml -t tsv --no-validate \
--list-wrapper none \
$(TSV_NORM_YAML) > $(TSV_BARE_PIPE_OUTPUT)
@echo "Output: $(TSV_BARE_PIPE_OUTPUT)"
@echo ""
@cat $(TSV_BARE_PIPE_OUTPUT)

# Load: bare-pipe TSV → YAML (requires linkml PR #3251 or later).
$(TSV_BARE_PIPE_YAML): $(TSV_BARE_PIPE_OUTPUT) contrib/mixs-patterns-materialized.yaml
@echo "=== Loading bare-pipe TSV → YAML (--list-wrapper none) ==="
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$(TSV_BARE_PIPE_OUTPUT) depends on $(TSV_NORM_YAML), but there is no rule to build $(TSV_NORM_YAML) (it’s only created as a side effect of normalize-tsv-roundtrip). This currently works only if MimsSoil-normalized.yaml is already present. Add an explicit $(TSV_NORM_YAML): ... rule (or make bare-pipe targets depend on normalize-tsv-roundtrip) to avoid “No rule to make target” failures when the YAML artifact is missing.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point. $(TSV_NORM_YAML) is only created as a side effect of the normalize-tsv-roundtrip recipe (which redirects to > $(TSV_NORM_YAML)), but there is no standalone rule for it. Currently this works because MimsSoil-normalized.yaml is committed, so make finds it on disk. But if the file is ever removed or .gitignored, make $(TSV_BARE_PIPE_OUTPUT) would fail with "No rule to make target."

The fix would be to either: (a) add an explicit $(TSV_NORM_YAML) rule that runs the linkml-convert step, or (b) make $(TSV_BARE_PIPE_OUTPUT) depend on normalize-tsv-roundtrip instead. Will address in a follow-up commit.

Comment on lines +121 to +126
reader = csv.DictReader(inf, delimiter='\t')
fieldnames = reader.fieldnames

with open(output_path, 'w', newline='') as outf:
writer = csv.DictWriter(outf, fieldnames=fieldnames, delimiter='\t')
writer.writeheader()
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the input TSV is empty or missing a header row, csv.DictReader(...).fieldnames will be None and csv.DictWriter will raise a TypeError. Add an explicit check after reading fieldnames and raise a clear error (or exit nonzero) when no header is present.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid defensive coding suggestion. If the input file is empty or has no header, csv.DictReader.fieldnames would be None and csv.DictWriter would raise an unhelpful TypeError. Adding an explicit check after reading fieldnames to raise a clear error message is worthwhile. Will address in a follow-up commit.

Comment on lines +66 to +67
| `MimsSoil-normalized-lists.tsv` | No (generated) | Output of normalization step |
| `MimsSoil-normalized.yaml` | No (generated) | YAML from round-tripping the normalized TSV |
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The “Files” table marks MimsSoil-normalized-lists.tsv and MimsSoil-normalized.yaml as “No (generated)”, but both files are committed in this PR. Update the table to match what’s actually checked in, or stop committing these artifacts and add them to ignore rules.

Suggested change
| `MimsSoil-normalized-lists.tsv` | No (generated) | Output of normalization step |
| `MimsSoil-normalized.yaml` | No (generated) | YAML from round-tripping the normalized TSV |
| `MimsSoil-normalized-lists.tsv` | Yes | Output of normalization step |
| `MimsSoil-normalized.yaml` | Yes | YAML from round-tripping the normalized TSV |

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — MimsSoil-normalized-lists.tsv and MimsSoil-normalized.yaml are both committed in the repo, but the Files table marks them as "No (generated)." The table should be updated to say "Yes" for those two rows. Will fix in a follow-up commit.

Comment on lines +283 to +299
# Normalize heterogeneous list formatting in a messy TSV using schema awareness.
# Shows: bare pipes, spaced pipes, and bracketed pipes all converge to canonical format.
normalize-tsv-demo: $(TSV_NORM_OUTPUT)

$(TSV_NORM_OUTPUT): $(TSV_NORM_INPUT) $(TSV_NORM_SCRIPT) $(SOURCE_SCHEMA_PATH)
@echo "=== Normalizing messy TSV lists ==="
$(RUN) python $(TSV_NORM_SCRIPT) \
--schema $(SOURCE_SCHEMA_PATH) \
--target-class MimsSoil \
--input $(TSV_NORM_INPUT) \
--output $(TSV_NORM_OUTPUT)
@echo ""
@echo "=== Before (multivalued columns) ==="
@awk -F'\t' 'NR==1{for(i=1;i<=NF;i++)h[i]=$$i} NR>0{printf "%s\t%s\t%s\n",$$1,$$5,$$6}' $(TSV_NORM_INPUT)
@echo ""
@echo "=== After (multivalued columns) ==="
@awk -F'\t' 'NR==1{for(i=1;i<=NF;i++)h[i]=$$i} NR>0{printf "%s\t%s\t%s\n",$$1,$$5,$$6}' $(TSV_NORM_OUTPUT)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalize-tsv-demo only depends on $(TSV_NORM_OUTPUT). Because MimsSoil-normalized-lists.tsv is committed, make normalize-tsv-demo will typically do nothing on a fresh checkout (and won’t print the before/after awk output). Consider giving normalize-tsv-demo its own recipe that always runs the script (or writes to a temp file) so the demo consistently executes.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid observation. Since MimsSoil-normalized-lists.tsv is committed, make normalize-tsv-demo will see the target as up-to-date and skip the recipe (including the before/after awk output). This is standard make behavior — the recipe only runs when inputs are newer than the output.

For a demo target that should always show output, there are a few options: (a) make it .PHONY so it always runs, (b) have it write to a temp file so the committed file doesn't short-circuit make, or (c) accept that users should rm the output first or touch the input to force a rebuild. Since this is explicitly a "standalone demo" and the help text tells users how to run it, I think (a) making it .PHONY is the simplest fix. Will address in a follow-up commit.

1. Add explicit rule for $(TSV_NORM_YAML) so $(TSV_BARE_PIPE_OUTPUT)
   has a proper prerequisite instead of relying on a side effect of
   normalize-tsv-roundtrip.

2. Guard against empty/missing TSV headers in normalize_tsv_lists.py
   by raising a clear ValueError when fieldnames is None.

3. Fix README Files table: MimsSoil-normalized-lists.tsv and
   MimsSoil-normalized.yaml are committed, not "No (generated)".

4. Add $(TSV_NORM_OUTPUT) to .PHONY so normalize-tsv-demo always
   rebuilds even when the output file is already committed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test MIxS build against linkml main + TSV list normalization demo

2 participants