E2E run/serve inference fails for fill-mask, masked-lm, depth-estimation, zero-shot (pre-existing; not covered by daily E2E pipeline) #892

@timenick

Summary

A full local run of uv run pytest tests/e2e/ -m e2e (563 passed / 30 failed / 56 skipped) surfaced a cluster of pre-existing failures in the winml run / winml serve inference layer (TestInferenceAllModels, TestSchemaAllModels). They are not caused by task-detection work: each was bisected to an origin that predates the recent task PRs (#801, #807, #841, #851, #850, #834), and they reproduce identically on those commits and on plain origin/main.

This issue tracks them so they're not mistaken for regressions. (5 additional [dml] failures are excluded — they're environment-only: the test machine has no DirectML EP.)

Why the daily E2E pipeline is green despite these

All 25 failures live in tests/e2e/test_run_e2e.py and tests/e2e/test_serve_e2e.py (TestInferenceAllModels / TestSchemaAllModels). The daily Modelkit E2E Test pipeline's pytestTargets is [analyze, inspect, build, compile, config, export, optimize, quantize, sys, perf, eval]; run and serve are not in it, so these suites are never executed by CI.
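The coverage gap can be checked mechanically. The sketch below is only an illustration: it assumes every e2e file follows the test_<target>_e2e.py naming seen for run and serve, and the helper name uncovered_suites is hypothetical. The target list is copied from this issue, not read from the pipeline definition.

```python
import re
from pathlib import PurePosixPath

# pytestTargets as quoted in this issue (not read from the real pipeline YAML).
PYTEST_TARGETS = [
    "analyze", "inspect", "build", "compile", "config", "export",
    "optimize", "quantize", "sys", "perf", "eval",
]


def uncovered_suites(test_files, targets):
    """Return the suite names whose e2e file is not selected by any target.

    Assumption: a file named test_<name>_e2e.py belongs to target <name>.
    Files that do not match that pattern are ignored.
    """
    covered = set(targets)
    missing = []
    for path in test_files:
        match = re.fullmatch(r"test_(.+)_e2e\.py", PurePosixPath(path).name)
        if match and match.group(1) not in covered:
            missing.append(match.group(1))
    return missing
```

Given the two files named above, this returns run and serve, which matches the gap described here.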

CI's Phase 1 (winml perf) does include bert-base-multilingual-cased (fill-mask + masked-lm) and clip-vit-base-patch32 (zero-shot-image-classification) in its curated models list, but Phase 1 is perf benchmarking of the exported ONNX model — it never invokes the HF pipeline() / WinML wrapper / candidate_labels path that fails here. So perf is green while run/serve inference is not.

Net: these are latent gaps the daily E2E gate does not cover. They only surface in a full local uv run pytest tests/e2e/ -m e2e.

1. fill-mask inference — generic wrapper lacks base_model_prefix

Error: AttributeError: 'WinMLModelForGenericTask' object has no attribute 'base_model_prefix'

Affected (run + serve): roberta-base, xlm-roberta-base, distilbert-base-uncased, bert-base-multilingual-cased, bert-base-multilingual-uncased, bert-base-uncased, all-mpnet-base-v2, multi-qa-mpnet-base-dot-v1 — all fill-mask.

Root cause: TASK_TO_WINML_CLASS["fill-mask"] = "WinMLModelForMaskedLM", but that class is never implemented (the table comment says "Not yet implemented — falls back to WinMLModelForGenericTask at runtime"; present since the initial commit 0a6350a5). The generic fallback has no base_model_prefix, which HF's FillMaskPipeline.ensure_exactly_one_mask_token references.

Contributing: the e2e harness feeds these models SAMPLE_TEXT ("The quick brown fox…") with no [MASK] token, so the pipeline reaches the "no mask token" error path regardless.

Suggested fix: implement WinMLModelForMaskedLM (with base_model_prefix), and/or have the harness pass a [MASK]-containing input for fill-mask.
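For the harness half of the fix, a minimal sketch of building a masked input follows. The helper name with_mask_token is hypothetical. It takes the mask token as a parameter because the token differs per tokenizer: BERT-style tokenizers use [MASK], RoBERTa-style ones use <mask>. In the harness the value would presumably come from tokenizer.mask_token.

```python
def with_mask_token(text: str, mask_token: str) -> str:
    """Return text containing mask_token, suitable as fill-mask input.

    If the text already contains the mask token it is returned unchanged;
    otherwise the last whitespace-separated word is replaced by the token.
    """
    if mask_token in text:
        return text
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    words[-1] = mask_token
    return " ".join(words)
```

This only keeps the pipeline off the "no mask token" error path; it does not replace implementing WinMLModelForMaskedLM.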

2. masked-lm catalog entry uses a non-canonical task name

Error: KeyError: Unknown task masked-lm (and at the export step ValueError: bert doesn't support task masked-lm for the onnx backend. Supported tasks are: feature-extraction, fill-mask, …)

Affected (run + serve + schema): google-bert/bert-base-multilingual-cased (masked-lm).

Origin (bisected): c941ff9b (#759) "Update hub_models.json with latest model catalog" (2026-05-27), the commit that added this (model, task) pair. It has been failing since it was introduced (born red).

Root cause: hub_models.json uses the alias masked-lm where the canonical export/pipeline task is fill-mask; nothing normalizes it in the run/serve path, so it reaches the ONNX exporter and HF pipeline() verbatim.

Suggested fix: change the catalog entry to fill-mask, or normalize task aliases at the run/serve entry point. (Note: once normalized, it then hits the base_model_prefix failure described in section 1 above.)
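If normalization is chosen over editing the catalog, it could be as small as a lookup applied before the task name reaches the exporter or pipeline(). This is a sketch only: TASK_ALIASES and normalize_task are hypothetical names, and the only alias included is the one reported in this issue.

```python
# Hypothetical alias table; masked-lm -> fill-mask is the only pair
# this issue establishes.
TASK_ALIASES = {
    "masked-lm": "fill-mask",
}


def normalize_task(task: str) -> str:
    """Map a catalog task alias to its canonical name; pass others through."""
    return TASK_ALIASES.get(task, task)
```

Calling this once at the run/serve entry point would keep the alias out of every downstream layer.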

3. depth-estimation — output tensor not serializable

Error: run: TypeError: Object of type Tensor is not JSON serializable; serve: pydantic_core.PydanticSerializationError: Unable to serialize unknown type: <class 'torch.Tensor'>

Affected (run + serve): dpt-hybrid-midas (depth-estimation).

Root cause: the depth-estimation pipeline output contains a torch.Tensor that the run/serve serializer doesn't convert to a JSON-safe type.

Suggested fix: convert tensor outputs to lists in the depth postprocess / serialization path.
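One possible shape for that conversion is a recursive walk over the pipeline output before serialization. The function name to_json_safe is hypothetical, and the sketch duck-types on a tolist() method (which torch.Tensor and numpy.ndarray both provide) so it does not import torch. Other non-serializable values that may appear in the output are out of scope here.

```python
from typing import Any


def to_json_safe(value: Any) -> Any:
    """Recursively replace tensor-like values with plain Python lists.

    Anything exposing a callable tolist() is converted with it; dicts,
    lists and tuples are walked; everything else is returned unchanged.
    """
    if isinstance(value, dict):
        return {key: to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    tolist = getattr(value, "tolist", None)
    if callable(tolist):
        return tolist()
    return value
```

Applying it in one shared serialization path would cover both the json.dumps failure in run and the pydantic failure in serve.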

4. zero-shot (image + text) — missing candidate_labels

Error: ZeroShotImageClassificationPipeline.__call__() missing 1 required positional argument: 'candidate_labels'; and object of type 'NoneType' has no len() for zero-shot-classification.

Affected (run + serve): laion/CLIP-ViT-B-32-laion2B-s34B-b79K (zero-shot-image-classification), lxyuan/distilbert-base-multilingual-cased-sentiments-student (zero-shot-classification).

Root cause: zero-shot pipelines require candidate_labels, which the run/serve harness / TASK_REGISTRY input schema does not supply.

Suggested fix: provide candidate_labels for zero-shot tasks (harness input + TASK_REGISTRY schema).
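On the harness side, one way to supply the argument is to fill it in when building the keyword arguments for the pipeline call. Everything named below (ZERO_SHOT_TASKS, DEFAULT_CANDIDATE_LABELS, build_call_kwargs) is hypothetical, and the default labels are placeholders chosen for illustration, not values from the repository.

```python
from typing import Any, Dict, Optional

ZERO_SHOT_TASKS = {"zero-shot-classification", "zero-shot-image-classification"}

# Placeholder labels for test runs; real defaults would be chosen per model.
DEFAULT_CANDIDATE_LABELS = ["positive", "negative", "neutral"]


def build_call_kwargs(
    task: str, user_kwargs: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Return pipeline call kwargs, adding candidate_labels for zero-shot tasks.

    Caller-supplied labels always win; defaults are used only when the
    task is zero-shot and no non-empty candidate_labels was given.
    """
    kwargs = dict(user_kwargs or {})
    if task in ZERO_SHOT_TASKS and not kwargs.get("candidate_labels"):
        kwargs["candidate_labels"] = list(DEFAULT_CANDIDATE_LABELS)
    return kwargs
```

The result would then be passed as pipe(inputs, **kwargs). The TASK_REGISTRY schema change is still needed so that serve clients can send their own labels.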

Provenance / not task-detection

All four categories live in the model-wrapper, serialization, and pipeline-input layers — none in loader/resolution task detection. Verified empirically by re-running representative cases across commits (origin/main, pre-#807, pre-#801, and #759), with identical failures throughout. The in-flight unify-task-detection PR (#878) does not touch the run/serve wrapper layer.

Metadata

Labels

P2 (Medium: minor bug or non-critical improvement)
bug (Something isn't working)
triaged (Issue has been triaged)
