Test MIxS build against linkml main + TSV list normalization demo#1130
Test MIxS build against linkml main + TSV list normalization demo#1130
Conversation
|
There was a problem hiding this comment.
Pull request overview
Updates MIxS’ development/build environment to exercise LinkML mainline changes while adding a schema-aware demo workflow for normalizing heterogeneous multivalued TSV formatting into a canonical form that LinkML tooling can reliably parse.
Changes:
- Pin
linkml/linkml-runtimeto GitHubmainand refresh the Poetry lockfile. - Add a TSV list-normalization demo (script + README + sample input/output artifacts).
- Add Makefile demo targets for normalization and round-trip conversions.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
pyproject.toml |
Switches LinkML dependencies to git main for compatibility testing. |
poetry.lock |
Updates locked dependency graph for the LinkML mainline bump. |
src/data/examples/tsv-normalization/normalize_tsv_lists.py |
Adds a SchemaView-driven TSV normalizer for multivalued slots. |
src/data/examples/tsv-normalization/README.md |
Documents the demo workflow and current upstream limitations. |
src/data/examples/tsv-normalization/MimsSoil-messy-lists.tsv |
Provides messy TSV input examples with varied list styles. |
src/data/examples/tsv-normalization/MimsSoil-normalized-lists.tsv |
Provides normalized TSV output example. |
src/data/examples/tsv-normalization/MimsSoil-normalized.yaml |
Provides normalized YAML output example. |
Makefile |
Adds demo targets for normalization and TSV/YAML round-trips. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Upgrade linkml to GitHub main (1.10.0.post78.dev0, commit 1f5d5fa) and linkml-runtime to ^1.10 — full build and test suite passes with no regressions. Add schema-aware TSV normalization demo showing how heterogeneous list formatting (bracketed pipes, bare pipes, spaced pipes) can be detected and normalized using SchemaView to identify multivalued slots. Includes messy input TSV, normalization script, and Makefile targets (normalize-tsv-demo, normalize-tsv-roundtrip). Co-Authored-By: Claude Opus 4.6 <[email protected]>
Both linkml and linkml-runtime now install from the same monorepo commit, resolving the version skew (linkml#3241). Updates the TSV normalization README to reflect that --list-wrapper none works for dumping but crashes on load with empty multivalued cells (linkml#3250).
Dump target works now; load target commented out pending linkml#3250 fix. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Switch both linkml and linkml-runtime to the fix-3250-empty-cell-crash branch which fixes the json_clean crash on empty multivalued cells (linkml/linkml#3250, PR linkml/linkml#3251). Uncomment the bare-pipe TSV load target in Makefile now that the crash is fixed — the full YAML → TSV → YAML round-trip works. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The lock file must be committed alongside pyproject.toml changes so CI can install without re-resolving. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Use induced_slot() instead of get_slot() to respect slot_usage overrides - Remove unused `re` import - Remove unused --delimiter CLI option (hardcoded to pipe) - Update README to reflect that bare-pipe load now works (fix-3250 branch) Co-Authored-By: Claude Opus 4.6 <[email protected]>
The normalize script still fills a gap: linkml-convert expects one list wrapper style per invocation, but real-world MIxS TSV files often mix bracketed, bare-pipe, and spaced-pipe formats. The script is a pre-processing step that normalizes mixed-format files to a single canonical style. Updated the script docstring to explain when you need it vs. when linkml-convert handles things directly. Rewrote the README to frame the full workflow: normalize → round-trip → bare-pipe demo. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Once linkml PR #3251 (empty-cell crash fix) merges to main, both linkml and linkml-runtime should be switched from the fix-3250-empty-cell-crash branch to main or a PyPI release. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…tations Add a Next Steps section to the TSV normalization README covering: - Boolean slot demo with Mimag/HumanAssociated example data (needs #3144) - Validator integration once linkml#3147 is addressed - Schema-level annotations to avoid per-command CLI flags Co-Authored-By: Claude Opus 4.6 <[email protected]>
a4c7c9b to
67021ea
Compare
|
Rebase & cross-reference note (2026-03-10) This PR was rebased onto current This PR (LinkML main compatibility test) is independent of the other three and can merge in any order. The other three have inter-dependencies:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Round-trip: normalized TSV → YAML → TSV to prove linkml-convert can parse it. | ||
| normalize-tsv-roundtrip: $(TSV_NORM_OUTPUT) contrib/mixs-patterns-materialized.yaml | ||
| @echo "=== Converting normalized TSV → YAML ===" | ||
| $(RUN) linkml-convert \ | ||
| -s contrib/mixs-patterns-materialized.yaml \ | ||
| -C MixsCompliantData -S mims_soil_data \ | ||
| -f tsv -t yaml --no-validate \ | ||
| $(TSV_NORM_OUTPUT) > $(TSV_NORM_YAML) | ||
| @echo "Output: $(TSV_NORM_YAML)" | ||
| @echo "" | ||
| @cat $(TSV_NORM_YAML) | ||
| @echo "" | ||
| @echo "=== Round-trip: YAML → TSV ===" | ||
| $(RUN) linkml-convert \ | ||
| -s contrib/mixs-patterns-materialized.yaml \ | ||
| -C MixsCompliantData -S mims_soil_data \ | ||
| -f yaml -t tsv --no-validate \ | ||
| $(TSV_NORM_YAML) | ||
|
|
||
| # --- Bare-pipe TSV demo (exercises --list-wrapper none from linkml PR #3134) --- | ||
|
|
||
| # Dump: YAML → bare-pipe TSV (works now). | ||
| $(TSV_BARE_PIPE_OUTPUT): $(TSV_NORM_YAML) contrib/mixs-patterns-materialized.yaml | ||
| @echo "=== Dumping YAML → bare-pipe TSV (--list-wrapper none) ===" | ||
| $(RUN) linkml-convert \ | ||
| -s contrib/mixs-patterns-materialized.yaml \ | ||
| -C MixsCompliantData -S mims_soil_data \ | ||
| -f yaml -t tsv --no-validate \ | ||
| --list-wrapper none \ | ||
| $(TSV_NORM_YAML) > $(TSV_BARE_PIPE_OUTPUT) | ||
| @echo "Output: $(TSV_BARE_PIPE_OUTPUT)" | ||
| @echo "" | ||
| @cat $(TSV_BARE_PIPE_OUTPUT) | ||
|
|
||
| # Load: bare-pipe TSV → YAML (requires linkml PR #3251 or later). | ||
| $(TSV_BARE_PIPE_YAML): $(TSV_BARE_PIPE_OUTPUT) contrib/mixs-patterns-materialized.yaml | ||
| @echo "=== Loading bare-pipe TSV → YAML (--list-wrapper none) ===" |
There was a problem hiding this comment.
$(TSV_BARE_PIPE_OUTPUT) depends on $(TSV_NORM_YAML), but there is no rule to build $(TSV_NORM_YAML) (it’s only created as a side effect of normalize-tsv-roundtrip). This currently works only if MimsSoil-normalized.yaml is already present. Add an explicit $(TSV_NORM_YAML): ... rule (or make bare-pipe targets depend on normalize-tsv-roundtrip) to avoid “No rule to make target” failures when the YAML artifact is missing.
There was a problem hiding this comment.
Valid point. $(TSV_NORM_YAML) is only created as a side effect of the normalize-tsv-roundtrip recipe (which redirects to > $(TSV_NORM_YAML)), but there is no standalone rule for it. Currently this works because MimsSoil-normalized.yaml is committed, so make finds it on disk. But if the file is ever removed or .gitignored, make $(TSV_BARE_PIPE_OUTPUT) would fail with "No rule to make target."
The fix would be to either: (a) add an explicit $(TSV_NORM_YAML) rule that runs the linkml-convert step, or (b) make $(TSV_BARE_PIPE_OUTPUT) depend on normalize-tsv-roundtrip instead. Will address in a follow-up commit.
| reader = csv.DictReader(inf, delimiter='\t') | ||
| fieldnames = reader.fieldnames | ||
|
|
||
| with open(output_path, 'w', newline='') as outf: | ||
| writer = csv.DictWriter(outf, fieldnames=fieldnames, delimiter='\t') | ||
| writer.writeheader() |
There was a problem hiding this comment.
If the input TSV is empty or missing a header row, csv.DictReader(...).fieldnames will be None and csv.DictWriter will raise a TypeError. Add an explicit check after reading fieldnames and raise a clear error (or exit nonzero) when no header is present.
There was a problem hiding this comment.
Valid defensive coding suggestion. If the input file is empty or has no header, csv.DictReader.fieldnames would be None and csv.DictWriter would raise an unhelpful TypeError. Adding an explicit check after reading fieldnames to raise a clear error message is worthwhile. Will address in a follow-up commit.
| | `MimsSoil-normalized-lists.tsv` | No (generated) | Output of normalization step | | ||
| | `MimsSoil-normalized.yaml` | No (generated) | YAML from round-tripping the normalized TSV | |
There was a problem hiding this comment.
The “Files” table marks MimsSoil-normalized-lists.tsv and MimsSoil-normalized.yaml as “No (generated)”, but both files are committed in this PR. Update the table to match what’s actually checked in, or stop committing these artifacts and add them to ignore rules.
| | `MimsSoil-normalized-lists.tsv` | No (generated) | Output of normalization step | | |
| | `MimsSoil-normalized.yaml` | No (generated) | YAML from round-tripping the normalized TSV | | |
| | `MimsSoil-normalized-lists.tsv` | Yes | Output of normalization step | | |
| | `MimsSoil-normalized.yaml` | Yes | YAML from round-tripping the normalized TSV | |
There was a problem hiding this comment.
Good catch — MimsSoil-normalized-lists.tsv and MimsSoil-normalized.yaml are both committed in the repo, but the Files table marks them as "No (generated)." The table should be updated to say "Yes" for those two rows. Will fix in a follow-up commit.
| # Normalize heterogeneous list formatting in a messy TSV using schema awareness. | ||
| # Shows: bare pipes, spaced pipes, and bracketed pipes all converge to canonical format. | ||
| normalize-tsv-demo: $(TSV_NORM_OUTPUT) | ||
|
|
||
| $(TSV_NORM_OUTPUT): $(TSV_NORM_INPUT) $(TSV_NORM_SCRIPT) $(SOURCE_SCHEMA_PATH) | ||
| @echo "=== Normalizing messy TSV lists ===" | ||
| $(RUN) python $(TSV_NORM_SCRIPT) \ | ||
| --schema $(SOURCE_SCHEMA_PATH) \ | ||
| --target-class MimsSoil \ | ||
| --input $(TSV_NORM_INPUT) \ | ||
| --output $(TSV_NORM_OUTPUT) | ||
| @echo "" | ||
| @echo "=== Before (multivalued columns) ===" | ||
| @awk -F'\t' 'NR==1{for(i=1;i<=NF;i++)h[i]=$$i} NR>0{printf "%s\t%s\t%s\n",$$1,$$5,$$6}' $(TSV_NORM_INPUT) | ||
| @echo "" | ||
| @echo "=== After (multivalued columns) ===" | ||
| @awk -F'\t' 'NR==1{for(i=1;i<=NF;i++)h[i]=$$i} NR>0{printf "%s\t%s\t%s\n",$$1,$$5,$$6}' $(TSV_NORM_OUTPUT) |
There was a problem hiding this comment.
normalize-tsv-demo only depends on $(TSV_NORM_OUTPUT). Because MimsSoil-normalized-lists.tsv is committed, make normalize-tsv-demo will typically do nothing on a fresh checkout (and won’t print the before/after awk output). Consider giving normalize-tsv-demo its own recipe that always runs the script (or writes to a temp file) so the demo consistently executes.
There was a problem hiding this comment.
Valid observation. Since MimsSoil-normalized-lists.tsv is committed, make normalize-tsv-demo will see the target as up-to-date and skip the recipe (including the before/after awk output). This is standard make behavior — the recipe only runs when inputs are newer than the output.
For a demo target that should always show output, there are a few options: (a) make it .PHONY so it always runs, (b) have it write to a temp file so the committed file doesn't short-circuit make, or (c) accept that users should rm the output first or touch the input to force a rebuild. Since this is explicitly a "standalone demo" and the help text tells users how to run it, I think (a) making it .PHONY is the simplest fix. Will address in a follow-up commit.
1. Add explicit rule for $(TSV_NORM_YAML) so $(TSV_BARE_PIPE_OUTPUT) has a proper prerequisite instead of relying on a side effect of normalize-tsv-roundtrip. 2. Guard against empty/missing TSV headers in normalize_tsv_lists.py by raising a clear ValueError when fieldnames is None. 3. Fix README Files table: MimsSoil-normalized-lists.tsv and MimsSoil-normalized.yaml are committed, not "No (generated)". 4. Add $(TSV_NORM_OUTPUT) to .PHONY so normalize-tsv-demo always rebuilds even when the output file is already committed. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Closes #1129
Summary
fix-3250-empty-cell-crashbranch (includes PR #3134 configurable list formatting + PR #3251 empty-cell crash fix)--list-wrapper none): YAML → TSV → YAML works end-to-endnormalize-tsv-demo,normalize-tsv-roundtrip,tsv-bare-pipe-roundtripTODO: Switch to linkml main or PyPI release
Both
linkmlandlinkml-runtimeare pinned to thefix-3250-empty-cell-crashbranch because the empty-cell crash fix (linkml#3251) hasn't merged to main yet. Once #3251 merges, updatepyproject.tomlto point atmain(or a PyPI release if available). This is marked with TODO comments inpyproject.toml.What this demonstrates
normalize_tsv_lists.py— pre-processing step for mixed-format TSV files (some rows bracketed, some bare-pipe, some spaced).linkml-convertexpects one wrapper style per file, so this normalizes to a single style first.--list-wrapper noneround-trip — with the empty-cell fix,linkml-convertdumps and loads bare-pipe TSV correctly. No brackets needed.linkml-validatestill can't parse multivalued TSV fields (linkml#3147).Test plan
make clean all all-contrib testpassesmake normalize-tsv-demonormalizes messy TSVmake normalize-tsv-roundtripcompletes full bracket-format TSV → YAML → TSV cyclemake tsv-bare-pipe-roundtripcompletes full bare-pipe TSV round-trip🤖 Generated with Claude Code