gem-cap-chan is a utility for batch-captioning images with natural language using OpenAI-compatible multimodal models such as Gemma3. It is designed for creating high-quality datasets for Stable Diffusion and LoRA training.
- API Flexibility: Works with any OpenAI-compatible endpoint (local or cloud-based)
- Batch Processing: Recursively process entire directories of training images
- Optimized Captions: Default prompt tuned for Stable Diffusion/LoRA training
- Smart Image Handling: Automatic resizing and format conversion
- Progress Tracking: Real-time progress with ETA and performance metrics
- Failure Recovery: Automatic retries with error skipping
- Security: Token authentication for remote endpoints
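The "Smart Image Handling" step above can be sketched as follows. This is a minimal illustration using Pillow; the `max_size` default and the RGB conversion mirror the documented behavior, but the exact logic (resampling filter, JPEG quality) is an assumption, not the script's actual code:

```python
from io import BytesIO

from PIL import Image


def prepare_image(img: Image.Image, max_size: int = 1024) -> bytes:
    """Downscale so the longest side is at most max_size, then re-encode as JPEG."""
    if max(img.size) > max_size:
        # thumbnail() resizes in place and preserves the aspect ratio
        img.thumbnail((max_size, max_size), Image.LANCZOS)
    if img.mode != "RGB":
        img = img.convert("RGB")  # JPEG cannot store an alpha channel
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=90)
    return buf.getvalue()
```

Sending a downscaled JPEG instead of the original file keeps request payloads small without noticeably hurting caption quality at typical training resolutions.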
- Python 3.7+
- Pillow
- Requests
- OpenAI-compatible multimodal endpoint (e.g., llama.cpp with mmproj support)
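For reference, OpenAI-compatible multimodal endpoints accept base64-encoded images inside a chat completion request. A sketch of the payload such a request carries (the prompt text, model name, and `max_tokens` value here are placeholders, not the script's actual settings):

```python
import base64


def build_caption_request(image_bytes: bytes, prompt: str, model: str = "gemma3") -> dict:
    """Build an OpenAI-style /v1/chat/completions payload with an inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        # Images travel as data: URLs in the standard chat format
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
        "max_tokens": 300,
    }
```

The payload would then be POSTed to `{api_base}/v1/chat/completions` (with an `Authorization: Bearer <token>` header when an API token is configured).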
- Clone the repository:
git clone https://github.com/2dameneko/gem-cap-chan
- Install dependencies (if your system does not have these components installed by default):
pip install Pillow requests
- Start your multimodal API server (example for llama.cpp):
llama-server --model "gemma3-27b.Q4_K_M.gguf" \
  --mmproj "gemma3-27b-mmproj.gguf" \
  --host 0.0.0.0 --port 5000
- Run captioning:
python gem-cap-chan.py /path/to/training_images
- Captions will be saved as `.txt` files in the output directory
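Each caption file takes its image's filename with a `.txt` extension, as is conventional for Stable Diffusion/LoRA training data. A sketch of the mapping (the helper name is hypothetical; the convention itself is what matters):

```python
from pathlib import Path
from typing import Optional


def caption_path(image_path: str, output_dir: Optional[str] = None) -> Path:
    """Map an image path to its .txt caption file, optionally in another directory."""
    img = Path(image_path)
    # When no output directory is given, captions sit next to their images
    out_dir = Path(output_dir) if output_dir else img.parent
    return out_dir / (img.stem + ".txt")
```

Trainers such as kohya_ss pick up these sidecar `.txt` files automatically when they share a basename with the image.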
Only the input directory is required; all other options fall back to sensible defaults. Available CLI options (`python gem-cap-chan.py -h`):
| Argument | Description |
|---|---|
| `input_dir` | Directory containing images to caption (required) |
| `--api_base` | API base URL (default: `http://localhost:5000`) |
| `--api_token` | Authentication token for secure/remote endpoints |
| `--output_dir` | Output directory for caption files (default: same as `input_dir`) |
| `--max_size` | Max image dimension for resizing (pixels, default: 1024) |
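The options above map naturally onto a standard argparse interface. A minimal sketch mirroring the table's names and defaults (this is an illustration, not the script's actual parser code):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Batch-caption images for Stable Diffusion/LoRA training"
    )
    parser.add_argument("input_dir", help="Directory containing images to caption")
    parser.add_argument("--api_base", default="http://localhost:5000",
                        help="API base URL")
    parser.add_argument("--api_token", default=None,
                        help="Authentication token for secure/remote endpoints")
    parser.add_argument("--output_dir", default=None,
                        help="Output directory for caption files (default: input_dir)")
    parser.add_argument("--max_size", type=int, default=1024,
                        help="Max image dimension for resizing, in pixels")
    return parser
```

Because `input_dir` is positional it is required, while every `--` option falls back to its default when omitted.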
Modify the `DEFAULT_PROMPT` variable in the script for different caption styles.
`.jpg`, `.jpeg`, `.png`, `.webp`, `.bmp`
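Recursive discovery of these formats can be sketched with pathlib. The extension set comes from the list above; case-insensitive matching and sorted ordering are assumptions for illustration:

```python
from pathlib import Path
from typing import List

# Supported extensions, per the list of formats above
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def find_images(root: str) -> List[Path]:
    """Recursively collect supported image files, sorted for stable ordering."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    )
```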
- 0.1: Initial release with local endpoint support
This project is a proof of concept and is not production-ready.
- OpenAI API specification: OpenAI
- llama.cpp: ggerganov/llama.cpp
- Gemma3: Google DeepMind
- Pillow: Python Imaging Library
Model Implementation Credits
Gemma3 27b · Gemma3 27b DPO Abliterated
Thank you for your interest in gem-cap-chan!