Wide performance discrepancy between LM Studio MLX and mlx_vlm.generate with Qwen3.5-35B-A3B #285

@Belluxx

When running the 4-bit MLX build of Qwen3.5-35B-A3B on an M4 with a prompt of roughly 3,000 tokens, I see the following:

  • LM Studio consistently generates at around 78 tok/s, with a TTFT of about 6 s
  • mlx_vlm.generate reaches 108 tok/s, with a TTFT of about 2 s

The gap in generation speed is nearly the same even with very short prompts (<100 tokens).
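
To put the figures above in relative terms, here is a minimal Python sketch of the arithmetic. It uses only the tok/s and TTFT values reported in this issue. The `Run` class, the 500-token output length, and the simple "TTFT plus tokens divided by speed" estimate are illustrative assumptions, not measurements or part of either tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical container for one backend's reported numbers."""
    name: str
    tok_per_s: float  # generation speed reported in this issue
    ttft_s: float     # time to first token reported in this issue

    def total_time(self, n_tokens: int) -> float:
        # Rough estimate: wait for the first token, then generate at a steady rate.
        return self.ttft_s + n_tokens / self.tok_per_s


lm_studio = Run("LM Studio", tok_per_s=78.0, ttft_s=6.0)
mlx_vlm = Run("mlx_vlm.generate", tok_per_s=108.0, ttft_s=2.0)

# Relative gaps between the two backends.
gen_ratio = mlx_vlm.tok_per_s / lm_studio.tok_per_s
ttft_ratio = lm_studio.ttft_s / mlx_vlm.ttft_s

N_TOKENS = 500  # assumed output length, chosen only for illustration

print(f"generation speed ratio: {gen_ratio:.2f}x")
print(f"TTFT ratio: {ttft_ratio:.1f}x")
for run in (lm_studio, mlx_vlm):
    print(f"{run.name}: ~{run.total_time(N_TOKENS):.1f} s for {N_TOKENS} tokens")
```

Under these assumptions, mlx_vlm.generate is roughly 1.4x faster at generation and 3x faster to first token, so for short outputs the TTFT difference accounts for a large share of the end-to-end gap.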
