Summary
A full uv run pytest tests/e2e/ -m e2e run (563 passed / 30 failed / 56 skipped) surfaced a cluster of pre-existing failures in the winml run / winml serve inference layer (TestInferenceAllModels, TestSchemaAllModels). They are not caused by task-detection work — each was bisected to an origin that predates the recent task PRs (#801, #807, #841, #851, #850, #834), and they reproduce identically on those commits and on plain origin/main.
This issue tracks them so they're not mistaken for regressions. Of the 30 failures, the 25 described below are tracked here; the other 5 are [dml] failures and are excluded because they are environment-only (the test machine has no DirectML EP).
Why the daily E2E pipeline is green despite these
All 25 failures live in tests/e2e/test_run_e2e.py and tests/e2e/test_serve_e2e.py (TestInferenceAllModels / TestSchemaAllModels). The daily Modelkit E2E Test pipeline's pytestTargets is [analyze, inspect, build, compile, config, export, optimize, quantize, sys, perf, eval] — run and serve are not in it, so these suites are never executed by CI.
CI's Phase 1 (winml perf) does include bert-base-multilingual-cased (fill-mask + masked-lm) and clip-vit-base-patch32 (zero-shot-image-classification) in its curated models list, but Phase 1 is perf benchmarking of the exported ONNX model — it never invokes the HF pipeline() / WinML wrapper / candidate_labels path that fails here. So perf is green while run/serve inference is not.
Net: these are latent gaps the daily E2E gate does not cover. They only surface in a full local uv run pytest tests/e2e/ -m e2e.
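A small check could make this coverage gap visible by comparing the e2e test files on disk against pytestTargets. The sketch below assumes the test_<target>_e2e.py naming seen in the two failing files; the function name and the hard-coded targets list are hypothetical and not part of the existing pipeline.

```python
import re

# Targets copied from the pipeline's pytestTargets list quoted above.
PYTEST_TARGETS = [
    "analyze", "inspect", "build", "compile", "config", "export",
    "optimize", "quantize", "sys", "perf", "eval",
]


def uncovered_suites(test_files, targets):
    """Return e2e test files whose <target> part is not in the CI targets list."""
    covered = set(targets)
    missing = []
    for name in test_files:
        # Assumed naming convention: test_<target>_e2e.py
        match = re.fullmatch(r"test_(\w+)_e2e\.py", name)
        if match and match.group(1) not in covered:
            missing.append(name)
    return sorted(missing)


files = ["test_run_e2e.py", "test_serve_e2e.py", "test_perf_e2e.py", "conftest.py"]
print(uncovered_suites(files, PYTEST_TARGETS))
```

In a real check the file list would come from listing tests/e2e/ rather than a literal.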
1. fill-mask inference — generic wrapper lacks base_model_prefix
Error: AttributeError: 'WinMLModelForGenericTask' object has no attribute 'base_model_prefix'
Affected (run + serve): roberta-base, xlm-roberta-base, distilbert-base-uncased, bert-base-multilingual-cased, bert-base-multilingual-uncased, bert-base-uncased, all-mpnet-base-v2, multi-qa-mpnet-base-dot-v1 — all fill-mask.
Root cause: TASK_TO_WINML_CLASS["fill-mask"] = "WinMLModelForMaskedLM", but that class is never implemented (the table comment says "Not yet implemented — falls back to WinMLModelForGenericTask at runtime"; present since the initial commit 0a6350a5). The generic fallback has no base_model_prefix, which HF's FillMaskPipeline.ensure_exactly_one_mask_token references.
Contributing: the e2e harness feeds these models SAMPLE_TEXT ("The quick brown fox…") with no [MASK] token, so even with a proper wrapper the pipeline would still end on the "no mask token" error path.
Suggested fix: implement WinMLModelForMaskedLM (with base_model_prefix), and/or have the harness pass an input containing the model's own mask token for fill-mask ([MASK] for BERT-style tokenizers, <mask> for RoBERTa-style ones).
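For the harness side of the fix, a minimal sketch of building a fill-mask input from the tokenizer's own mask token is shown below. The helper name and the template sentence are hypothetical, and SimpleNamespace objects stand in for real tokenizers; the only property relied on is the mask_token attribute that Hugging Face tokenizers expose.

```python
from types import SimpleNamespace


def build_fill_mask_input(tokenizer, template="Paris is the capital of {mask}."):
    """Fill a (hypothetical) template with the tokenizer's own mask token."""
    mask_token = getattr(tokenizer, "mask_token", None)
    if not mask_token:
        raise ValueError("tokenizer defines no mask_token; not a fill-mask model")
    text = template.format(mask=mask_token)
    # The harness only needs one mask; reject templates yielding zero or several.
    if text.count(mask_token) != 1:
        raise ValueError("template must contain exactly one {mask} placeholder")
    return text


# Stand-ins for real tokenizers: BERT-style uses "[MASK]", RoBERTa-style "<mask>".
bert_like = SimpleNamespace(mask_token="[MASK]")
roberta_like = SimpleNamespace(mask_token="<mask>")
print(build_fill_mask_input(bert_like))
print(build_fill_mask_input(roberta_like))
```

Reading the token from the tokenizer avoids hard-coding [MASK], which would not match the RoBERTa-family models in the affected list.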
2. masked-lm catalog entry uses a non-canonical task name
Error: KeyError: Unknown task masked-lm (and at the export step ValueError: bert doesn't support task masked-lm for the onnx backend. Supported tasks are: feature-extraction, fill-mask, …)
Affected (run + serve + schema): google-bert/bert-base-multilingual-cased — masked-lm.
Origin (bisected): c941ff9b — #759 "Update hub_models.json with latest model catalog" (2026-05-27), the commit that added this (model, task) pair. It has been failing since it was introduced (born red).
Root cause: hub_models.json uses the alias masked-lm where the canonical export/pipeline task is fill-mask; nothing normalizes it in the run/serve path, so it reaches the ONNX exporter and HF pipeline() verbatim.
Suggested fix: change the catalog entry to fill-mask, or normalize task aliases at the run/serve entry point. (Note: once normalized, this model then hits failure 1 above, the missing base_model_prefix.)
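The second option, normalizing aliases at the entry point, could look roughly like the sketch below. The table and function names are hypothetical; only the masked-lm to fill-mask pair is taken from this issue, and any further aliases would need to be confirmed against the catalog.

```python
# Hypothetical alias table; only this one pair comes from the issue.
TASK_ALIASES = {"masked-lm": "fill-mask"}


def normalize_task(task: str) -> str:
    """Map a catalog task alias to its canonical name; pass other names through."""
    key = task.strip().lower()
    return TASK_ALIASES.get(key, key)


print(normalize_task("masked-lm"))
print(normalize_task("fill-mask"))
```

Calling this once where run/serve read the task from the catalog would keep the alias from reaching the ONNX exporter or pipeline().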
3. depth-estimation — output tensor not serializable
Error: run: TypeError: Object of type Tensor is not JSON serializable; serve: pydantic_core.PydanticSerializationError: Unable to serialize unknown type: <class 'torch.Tensor'>
Affected (run + serve): dpt-hybrid-midas — depth-estimation.
Root cause: the depth-estimation pipeline output contains a torch.Tensor that the run/serve serializer doesn't convert to a JSON-safe type.
Suggested fix: convert tensor outputs to lists in the depth postprocess / serialization path.
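A minimal sketch of such a conversion is shown below. It duck-types on a tolist() method (which torch.Tensor and NumPy arrays both provide) so the example runs without torch installed; the function name, the FakeTensor stand-in, and the "predicted_depth" key used in the demo are assumptions for illustration, not the existing serializer.

```python
import json


def to_json_safe(value):
    """Recursively replace tensor-like values (anything with .tolist()) by lists."""
    if isinstance(value, dict):
        return {str(k): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    tolist = getattr(value, "tolist", None)
    if callable(tolist):
        # torch.Tensor.tolist() returns nested Python lists (or a scalar).
        return to_json_safe(tolist())
    return value


class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without torch."""

    def __init__(self, data):
        self._data = data

    def tolist(self):
        return self._data


output = {"predicted_depth": FakeTensor([[0.5, 1.0]])}
print(json.dumps(to_json_safe(output)))
```

Any other non-JSON values in the pipeline output would need their own handling; this sketch only covers tensor-like objects.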
4. zero-shot (image + text) — missing candidate_labels
Error: ZeroShotImageClassificationPipeline.__call__() missing 1 required positional argument: 'candidate_labels'; and object of type 'NoneType' has no len() for zero-shot-classification.
Affected (run + serve): laion/CLIP-ViT-B-32-laion2B-s34B-b79K (zero-shot-image-classification), lxyuan/distilbert-base-multilingual-cased-sentiments-student (zero-shot-classification).
Root cause: zero-shot pipelines require candidate_labels, which the run/serve harness / TASK_REGISTRY input schema does not supply.
Suggested fix: provide candidate_labels for zero-shot tasks (harness input + TASK_REGISTRY schema).
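On the harness side, supplying the labels could be as simple as the sketch below. The helper name and the default label list are hypothetical; the two task names are the ones listed in this section, and caller-supplied labels are left untouched.

```python
ZERO_SHOT_TASKS = {"zero-shot-classification", "zero-shot-image-classification"}

# Hypothetical defaults for the harness; real runs would likely pick labels per model.
DEFAULT_CANDIDATE_LABELS = ["positive", "negative", "neutral"]


def build_pipeline_kwargs(task, user_kwargs=None):
    """Return call kwargs, adding candidate_labels for zero-shot tasks if absent."""
    kwargs = dict(user_kwargs or {})
    if task in ZERO_SHOT_TASKS and not kwargs.get("candidate_labels"):
        kwargs["candidate_labels"] = list(DEFAULT_CANDIDATE_LABELS)
    return kwargs


print(build_pipeline_kwargs("zero-shot-classification"))
print(build_pipeline_kwargs("fill-mask"))
```

The TASK_REGISTRY schema change would still be needed so that serve clients can pass their own labels.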
Provenance / not task-detection
All four categories live in the model-wrapper, serialization, and pipeline-input layers — none in loader/resolution task detection. Verified empirically by re-running representative cases across commits (origin/main, pre-#807, pre-#801, and #759), with identical failures throughout. The in-flight unify-task-detection PR (#878) does not touch the run/serve wrapper layer.