E2E run/serve inference fails for fill-mask, masked-lm, depth-estimation, zero-shot (pre-existing; not covered by daily E2E pipeline) #892

## Summary

A full `uv run pytest tests/e2e/ -m e2e` run (563 passed / 30 failed / 56 skipped) surfaced a cluster of **pre-existing** failures in the `winml run` / `winml serve` inference layer (`TestInferenceAllModels`, `TestSchemaAllModels`). They are **not** caused by task-detection work — each was bisected to an origin that predates the recent task PRs (#801, #807, #841, #851, #850, #834), and they reproduce identically on those commits and on plain `origin/main`.

This issue tracks them so they're not mistaken for regressions. The other 5 of the 30 failures are `[dml]` cases and are excluded here: they're environment-only, because the test machine has no DirectML EP.

## Why the daily E2E pipeline is green despite these

All 25 failures live in `tests/e2e/test_run_e2e.py` and `tests/e2e/test_serve_e2e.py` (`TestInferenceAllModels` / `TestSchemaAllModels`). The daily **Modelkit E2E Test** pipeline's `pytestTargets` is `[analyze, inspect, build, compile, config, export, optimize, quantize, sys, perf, eval]` — **`run` and `serve` are not in it**, so these suites are never executed by CI.

CI's Phase 1 (`winml perf`) *does* include `bert-base-multilingual-cased` (fill-mask + masked-lm) and `clip-vit-base-patch32` (zero-shot-image-classification) in its curated `models` list, but Phase 1 is **perf benchmarking** of the exported ONNX model — it never invokes the HF `pipeline()` / WinML wrapper / `candidate_labels` path that fails here. So perf is green while `run`/`serve` inference is not.

Net: these are latent gaps the daily E2E gate does not cover. They only surface in a full local `uv run pytest tests/e2e/ -m e2e`.
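The gap can be checked mechanically. The sketch below copies `pytestTargets` as quoted above and assumes (this mapping is not confirmed against the pipeline definition) that a target `X` selects `tests/e2e/test_X_e2e.py`. `target_of` and `uncovered_files` are illustrative names, not existing helpers.

```python
import re

# Copied from the daily "Modelkit E2E Test" pipeline as quoted in this issue.
PYTEST_TARGETS = ["analyze", "inspect", "build", "compile", "config", "export",
                  "optimize", "quantize", "sys", "perf", "eval"]

# The two files that contain all 25 failures.
FAILING_FILES = ["tests/e2e/test_run_e2e.py", "tests/e2e/test_serve_e2e.py"]


def target_of(path):
    """Return the target name embedded in a test file path, or None.

    Assumption: target X corresponds to a file named test_X_e2e.py.
    """
    match = re.search(r"test_([a-z]+)_e2e\.py$", path)
    return match.group(1) if match else None


def uncovered_files(files, targets):
    """List the files whose target is not in the pipeline's target list."""
    return [f for f in files if target_of(f) not in targets]


print(uncovered_files(FAILING_FILES, PYTEST_TARGETS))
```

Under that naming assumption, both failing files come back as uncovered, which matches the observation that CI never runs them.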

## 1. fill-mask inference — generic wrapper lacks `base_model_prefix`

**Error:** `AttributeError: 'WinMLModelForGenericTask' object has no attribute 'base_model_prefix'`

**Affected (run + serve):** `roberta-base`, `xlm-roberta-base`, `distilbert-base-uncased`, `bert-base-multilingual-cased`, `bert-base-multilingual-uncased`, `bert-base-uncased`, `all-mpnet-base-v2`, `multi-qa-mpnet-base-dot-v1` — all `fill-mask`.

**Root cause:** `TASK_TO_WINML_CLASS["fill-mask"] = "WinMLModelForMaskedLM"`, but that class is **never implemented** (the table comment says "Not yet implemented — falls back to `WinMLModelForGenericTask` at runtime"; present since the initial commit `0a6350a5`). The generic fallback has no `base_model_prefix`, which HF's `FillMaskPipeline.ensure_exactly_one_mask_token` references.

**Contributing:** the e2e harness feeds these models `SAMPLE_TEXT` ("The quick brown fox…"), which contains **no mask token** (such as `[MASK]`), so the pipeline always takes its "no mask token" error path, which is where `base_model_prefix` gets read.

**Suggested fix:** implement `WinMLModelForMaskedLM` (with `base_model_prefix`), and/or have the harness pass an input containing the model's own mask token for fill-mask (`[MASK]` for BERT-style tokenizers, `<mask>` for RoBERTa-style ones).
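A minimal sketch of the harness side of that fix, assuming the harness can read the mask token from the loaded tokenizer (`tokenizer.mask_token` in HF). The helper name `fill_mask_sample` and the template sentence are hypothetical and not taken from the existing harness.

```python
def fill_mask_sample(mask_token, template="The capital of France is {mask}."):
    """Build a fill-mask input containing exactly one mask token.

    mask_token: the tokenizer's mask token, e.g. "[MASK]" or "<mask>".
    template:   hypothetical sample sentence with a single {mask} placeholder.
    """
    if not mask_token:
        raise ValueError("tokenizer has no mask token; cannot build a fill-mask input")
    text = template.format(mask=mask_token)
    # Fill-mask expects exactly one masked position per input in this sketch.
    if text.count(mask_token) != 1:
        raise ValueError("template must produce exactly one mask token")
    return text


print(fill_mask_sample("[MASK]"))
print(fill_mask_sample("<mask>"))
```

Taking the token from the tokenizer rather than hard-coding `[MASK]` keeps one code path for both the BERT-family and RoBERTa/MPNet-family models in the affected list.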

## 2. `masked-lm` catalog entry uses a non-canonical task name

**Error:** `KeyError: Unknown task masked-lm` (and at the export step `ValueError: bert doesn't support task masked-lm for the onnx backend. Supported tasks are: feature-extraction, fill-mask, …`)

**Affected (run + serve + schema):** `google-bert/bert-base-multilingual-cased` — `masked-lm`.

**Origin (bisected):** `c941ff9b` — **#759 "Update hub_models.json with latest model catalog"** (2026-05-27), the commit that added this `(model, task)` pair. It has been failing since it was introduced (born red).

**Root cause:** `hub_models.json` uses the alias `masked-lm` where the canonical export/pipeline task is `fill-mask`; nothing normalizes it in the run/serve path, so it reaches the ONNX exporter and HF `pipeline()` verbatim.

**Suggested fix:** change the catalog entry to `fill-mask`, or normalize task aliases at the run/serve entry point. (Note: once normalized, the model then hits the fill-mask failure described in section 1.)
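If normalization is preferred over editing the catalog, it could look roughly like the sketch below. `TASK_ALIASES` and `normalize_task` are hypothetical names; the only alias listed is the one this issue identifies.

```python
# Hypothetical alias table: maps non-canonical catalog names to the
# canonical export/pipeline task name.
TASK_ALIASES = {
    "masked-lm": "fill-mask",
}


def normalize_task(task):
    """Return the canonical task name, leaving unknown names unchanged."""
    return TASK_ALIASES.get(task, task)


print(normalize_task("masked-lm"))
print(normalize_task("depth-estimation"))
```

Calling this once at the run/serve entry point would mean neither the ONNX exporter nor `pipeline()` ever sees the alias.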

## 3. depth-estimation — output tensor not serializable

**Error:** `run`: `TypeError: Object of type Tensor is not JSON serializable`; `serve`: `pydantic_core.PydanticSerializationError: Unable to serialize unknown type: <class 'torch.Tensor'>`

**Affected (run + serve):** `dpt-hybrid-midas` — `depth-estimation`.

**Root cause:** the depth-estimation pipeline output contains a `torch.Tensor` that the run/serve serializer doesn't convert to a JSON-safe type.

**Suggested fix:** convert tensor outputs to lists in the depth postprocess / serialization path.
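One possible shape for that conversion is sketched below. It relies on duck typing (any object exposing `.tolist()`, which `torch.Tensor` does) so the example runs without torch; `to_json_safe` and `FakeTensor` are illustrative names, and the output key is only an example. Other non-JSON values, such as images, would need their own handling.

```python
import json


def to_json_safe(value):
    """Recursively replace tensor-like objects with plain Python lists."""
    if isinstance(value, dict):
        return {key: to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    # torch.Tensor (and numpy arrays) expose .tolist(); use it when present.
    tolist = getattr(value, "tolist", None)
    if callable(tolist):
        return to_json_safe(tolist())
    return value


class FakeTensor:
    """Stand-in for torch.Tensor so this example has no torch dependency."""

    def __init__(self, data):
        self._data = data

    def tolist(self):
        return self._data


output = {"predicted_depth": FakeTensor([[0.5, 1.25], [2.0, 3.0]])}
print(json.dumps(to_json_safe(output)))
```

Applying the conversion in one shared serialization step, rather than only in the depth postprocess, would cover both `run` and `serve`. Which layer owns it is a design decision this sketch does not settle.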

## 4. zero-shot (image + text) — missing `candidate_labels`

**Error:** `ZeroShotImageClassificationPipeline.__call__() missing 1 required positional argument: 'candidate_labels'`; and `object of type 'NoneType' has no len()` for zero-shot-classification.

**Affected (run + serve):** `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` (`zero-shot-image-classification`), `lxyuan/distilbert-base-multilingual-cased-sentiments-student` (`zero-shot-classification`).

**Root cause:** zero-shot pipelines require `candidate_labels`, which the run/serve harness / `TASK_REGISTRY` input schema does not supply.

**Suggested fix:** provide `candidate_labels` for zero-shot tasks (harness input + `TASK_REGISTRY` schema).
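A hedged sketch of how the harness might assemble the extra pipeline argument. `ZERO_SHOT_TASKS` and `build_pipeline_kwargs` are hypothetical names, and the real `TASK_REGISTRY` schema may carry the labels differently; only the `candidate_labels` keyword comes from the errors above.

```python
# Tasks whose HF pipelines require candidate_labels, per the errors above.
ZERO_SHOT_TASKS = {
    "zero-shot-classification",
    "zero-shot-image-classification",
}


def build_pipeline_kwargs(task, candidate_labels=None):
    """Return extra keyword arguments to pass when calling the pipeline.

    Fails early with a clear message instead of letting the pipeline
    raise a TypeError about a missing or None argument.
    """
    if task not in ZERO_SHOT_TASKS:
        return {}
    if not candidate_labels:
        raise ValueError(f"task {task!r} requires a non-empty candidate_labels list")
    return {"candidate_labels": list(candidate_labels)}


print(build_pipeline_kwargs("zero-shot-image-classification", ["cat", "dog"]))
print(build_pipeline_kwargs("depth-estimation"))
```

The harness would then call the pipeline as `pipe(inputs, **build_pipeline_kwargs(task, labels))`, where `labels` is whatever sample label set the harness chooses.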

## Provenance / not task-detection

All four categories live in the model-wrapper, serialization, and pipeline-input layers — none in `loader/resolution` task detection. Verified empirically by re-running representative cases across commits (`origin/main`, pre-#807, pre-#801, and #759), with identical failures throughout. The in-flight unify-task-detection PR (#878) does not touch the `run`/`serve` wrapper layer.

