Commit b2eeac3

Authored by sgurunat, actions-user, pvishwan, AhmedSeemalK, vhpintel
Finetuning Blueprint Solution (#88)
* Release v1.5.2

Signed-off-by: amberjain1 <amber.jain@intel.com>
Signed-off-by: psurabh <pradeep.surabhi@intel.com>
Signed-off-by: mdfaheem-intel <mohammad.faheem@intel.com>
Signed-off-by: vivekrsintc <vivek.rs@intel.com>
Co-authored-by: pvishwan <pramodh.vishwanath@intel.com>
Co-authored-by: AhmedSeemalK <ahmed.seemal@intel.com>
Co-authored-by: vhpintel <vijay.kumar.h.p@intel.com>
Co-authored-by: sgurunat <gurunath.s@intel.com>
Co-authored-by: jaswanth8888 <jaswanth.karani@intel.com>
Co-authored-by: sandeshk-intel <sandesh.kumar.s@intel.com>
Co-authored-by: vinayK34 <vinay3.kumar@intel.com>
Signed-off-by: Github Actions <actions@github.com>

* Adding Finetuning as a blueprint solution as part of release v1.5.2

Signed-off-by: S, Gurunath <gurunath.s@intel.com>

* False positive Bandit scan issue in gpu_engine file; added comment to suppress it

Signed-off-by: S, Gurunath <gurunath.s@intel.com>

---------

Signed-off-by: amberjain1 <amber.jain@intel.com>
Signed-off-by: psurabh <pradeep.surabhi@intel.com>
Signed-off-by: mdfaheem-intel <mohammad.faheem@intel.com>
Signed-off-by: vivekrsintc <vivek.rs@intel.com>
Signed-off-by: Github Actions <actions@github.com>
Signed-off-by: S, Gurunath <gurunath.s@intel.com>
Co-authored-by: Github Actions <actions@github.com>
Co-authored-by: pvishwan <pramodh.vishwanath@intel.com>
Co-authored-by: AhmedSeemalK <ahmed.seemal@intel.com>
Co-authored-by: vhpintel <vijay.kumar.h.p@intel.com>
Co-authored-by: jaswanth8888 <jaswanth.karani@intel.com>
Co-authored-by: sandeshk-intel <sandesh.kumar.s@intel.com>
Co-authored-by: vinayK34 <vinay3.kumar@intel.com>
1 parent 6774feb commit b2eeac3

279 files changed

Lines changed: 35221 additions & 5 deletions

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
# Fine-Tuning Service

Copyright (C) 2024-2025 Intel Corporation
SPDX-License-Identifier: Apache-2.0

This is a reference (blueprint) fine-tuning solution for Intel® AI for Enterprise Inference. It deploys a complete LLM fine-tuning stack alongside the existing inference cluster.

## What Gets Deployed

| Component | Namespace | Purpose |
|-----------|-----------|---------|
| Fine-Tuning API | `finetuning-api` | OpenAI-compatible API for managing fine-tuning jobs |
| Data Preparation Service | `dataprep` | Document processing, Q&A dataset generation |
| Celery Workers | `dataprep` | Async processing (Docling, LlamaIndex) |
| Fine-Tuning UI | `finetuning-ui` | Web interface |
| PostgreSQL (x2) | `dataprep`, `finetuning-api` | Persistent storage per service |
| Redis (x2) | `dataprep`, `finetuning-api` | Caching and task queuing |
| MinIO | `dataprep` | Shared object storage for training files |

> **Note — Nvidia GPU Training Engine (Unsloth):** The actual GPU fine-tuning workload runs on a **separate Nvidia GPU machine**, not on the Enterprise Inference cluster. To deploy the Nvidia/Unsloth fine-tuning engine on that machine, follow the instructions in [src/finetuning-engine/README.md](src/finetuning-engine/README.md). Once it is running, set its URL and Keycloak credentials in `blueprints/finetuning_service/finetune-config.cfg` before deploying this service.

## Directory Structure

```
blueprints/finetuning_service/
├── README.md                          # This file
├── finetune-config.cfg                # User-facing configuration
├── playbooks/
│   ├── deploy-all.yml                 # Main orchestration playbook
│   ├── deploy-finetuning-api.yml      # Fine-Tuning API
│   ├── deploy-dataprep.yml            # Data Preparation Service
│   ├── deploy-ui.yml                  # Fine-Tuning UI
│   └── build-images.yml               # Container image builds
├── vars/
│   └── finetune-plugin-vars.yml       # Internal deployment variables
├── scripts/
│   └── setup-keycloak-finetuning.sh   # Keycloak realm/client setup
└── src/
    ├── api/                           # Fine-Tuning API source (FastAPI)
    ├── dataprep/                      # Data Preparation source (FastAPI)
    ├── ui/                            # Fine-Tuning UI source (Next.js)
    └── finetuning-engine/             # Nvidia/Unsloth GPU training backend
```

## Quick Start

### Step 1: Deploy the Nvidia Fine-Tuning Engine (GPU machine)

Before deploying the cluster-side services, the Nvidia/Unsloth training backend must be running on a GPU machine. Follow the instructions in [src/finetuning-engine/README.md](src/finetuning-engine/README.md).

### Step 2: Configure

Edit `core/inventory/inference-config.cfg` to enable the plugin:

```properties
deploy_finetune_plugin=on
```

Then edit `blueprints/finetuning_service/finetune-config.cfg` to set the Nvidia backend URL and Keycloak credentials:

```properties
nvidia_finetune_backend_url: https://your-nvidia-gpu-server:8443
nvidia_keycloak_token_url: https://your-keycloak-server/realms/finetuning/protocol/openid-connect/token
nvidia_keycloak_client_id: finetuning-api
nvidia_keycloak_client_secret: <client-secret-from-nvidia-keycloak>
```

### Step 3: Generate Secrets

```bash
cd core/scripts
./generate-vault-secrets.sh
```

### Step 4: Deploy

```bash
cd core
./inference-stack-deploy.sh
```

Choose option **1** (Fresh Install) or **3** (Update Cluster).

### Step 5: Access

After successful deployment:

- **UI**: `https://<cluster-url>/enterprise-ai/ui`
- **API docs**: `https://<cluster-url>/enterprise-ai/api/docs`
- **Data Prep docs**: `https://<cluster-url>/enterprise-ai/dataprep/docs`
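
One way to confirm that routing works after deployment is to request each of the endpoints listed above and look at the HTTP status. The following is a minimal sketch, assuming `curl` is installed on the machine you run it from; the function names and the `CURL` override are illustrative and not part of this repository.

```bash
# Print the three documented endpoints for a given cluster host.
# Usage: finetune_endpoints <cluster-url>
finetune_endpoints() (
  base="https://$1/enterprise-ai"
  printf '%s\n' "$base/ui" "$base/api/docs" "$base/dataprep/docs"
)

# Request each endpoint and print "<status> <url>".
# CURL can be overridden (hypothetical knob, e.g. CURL=echo) for a dry run.
check_finetune_endpoints() (
  finetune_endpoints "$1" | while IFS= read -r url; do
    code=$(${CURL:-curl} -s -o /dev/null -w '%{http_code}' "$url")
    printf '%s %s\n' "$code" "$url"
  done
)
```

For example, `check_finetune_endpoints my-cluster.example.com` prints one status line per endpoint. Which status codes count as healthy depends on how authentication is configured; an unauthenticated request may be redirected to a login page rather than answered directly.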

## Configuration

Two configuration files must be edited before deployment.

### 1. Enable the plugin (`core/inventory/inference-config.cfg`)

```properties
# Enable the fine-tuning service
deploy_finetune_plugin=on
```

### 2. Nvidia backend & Keycloak settings (`blueprints/finetuning_service/finetune-config.cfg`)

```properties
# URL of the Nvidia/Unsloth fine-tuning engine (deployed separately on a GPU machine)
nvidia_finetune_backend_url: https://your-nvidia-gpu-server:8443

# Keycloak token endpoint on the Nvidia machine's Keycloak
# Format: https://<keycloak-host>/realms/<realm>/protocol/openid-connect/token
nvidia_keycloak_token_url: https://your-keycloak-server/realms/finetuning/protocol/openid-connect/token

# OAuth2 client credentials used by the Fine-Tuning API to authenticate with the Nvidia backend
nvidia_keycloak_client_id: finetuning-api
nvidia_keycloak_client_secret: <client-secret-from-nvidia-keycloak>

# Set to false only in development with self-signed certificates
nvidia_keycloak_verify_ssl: true
```
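
Because the shipped `finetune-config.cfg` leaves the four required values empty, it can help to check them before starting a deployment. The sketch below is one possible pre-flight check, assuming the file keeps the flat `key: value` layout shown above; `cfg_value` and `check_finetune_config` are hypothetical helper names, not scripts from this repository.

```bash
# Print the value of a "key: value" entry with surrounding whitespace removed.
# Usage: cfg_value <file> <key>
cfg_value() (
  sed -n "s/^$2:[[:space:]]*//p" "$1" | sed 's/[[:space:]]*$//' | head -n 1
)

# Report every required key whose value is empty or missing.
# Exits non-zero if at least one required key is unset.
check_finetune_config() (
  file=${1:-blueprints/finetuning_service/finetune-config.cfg}
  status=0
  for key in nvidia_finetune_backend_url nvidia_keycloak_token_url \
             nvidia_keycloak_client_id nvidia_keycloak_client_secret; do
    if [ -z "$(cfg_value "$file" "$key")" ]; then
      echo "missing value: $key" >&2
      status=1
    fi
  done
  exit "$status"
)
```

Run `check_finetune_config` from the repository root, or pass the path to the file as the first argument. The check only tests for non-empty values, so a placeholder such as `<client-secret-from-nvidia-keycloak>` still passes.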

### Advanced Settings (`blueprints/finetuning_service/vars/finetune-plugin-vars.yml`)

Edit this file to customise:
- Resource requests/limits (CPU, Memory)
- Replica counts
- Storage sizes
- Image repositories and tags
- Base URL paths

## Manual Deployment (without inference-stack-deploy.sh)

Run from the `core/` directory:

```bash
ansible-playbook -i inventory/hosts.yml \
  ../blueprints/finetuning_service/playbooks/deploy-all.yml \
  --vault-password-file inventory/.vault-passfile
```

Or deploy individual components:

```bash
# Data Preparation Service
ansible-playbook -i inventory/hosts.yml \
  ../blueprints/finetuning_service/playbooks/deploy-dataprep.yml \
  --vault-password-file inventory/.vault-passfile

# Fine-Tuning API
ansible-playbook -i inventory/hosts.yml \
  ../blueprints/finetuning_service/playbooks/deploy-finetuning-api.yml \
  --vault-password-file inventory/.vault-passfile

# Fine-Tuning UI
ansible-playbook -i inventory/hosts.yml \
  ../blueprints/finetuning_service/playbooks/deploy-ui.yml \
  --vault-password-file inventory/.vault-passfile
```
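
The three component commands above can be wrapped in a loop that stops at the first failure. This is a sketch only: the order (Data Prep, then API, then UI) is inferred from the order used in this README and in the image build, and the function name and the `ANSIBLE_PLAYBOOK` override are hypothetical.

```bash
# Run the component playbooks in order, stopping at the first failure.
# Run from core/. ANSIBLE_PLAYBOOK can be overridden (hypothetical knob,
# e.g. ANSIBLE_PLAYBOOK=echo) to print the commands without running them.
deploy_finetune_components() (
  playbook_dir=../blueprints/finetuning_service/playbooks
  for name in deploy-dataprep deploy-finetuning-api deploy-ui; do
    ${ANSIBLE_PLAYBOOK:-ansible-playbook} -i inventory/hosts.yml \
      "$playbook_dir/$name.yml" \
      --vault-password-file inventory/.vault-passfile || exit 1
  done
)
```

Calling `ANSIBLE_PLAYBOOK=echo deploy_finetune_components` first is a cheap way to review the exact commands before running them for real.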

## Deployment Status Checks

```bash
kubectl get pods -n dataprep
kubectl get pods -n finetuning-api
kubectl get pods -n finetuning-ui
```
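
Instead of polling `kubectl get pods` by hand, `kubectl wait` can block until the pods report Ready. The sketch below assumes the three namespaces listed above; the function name, the default timeout of `300s`, and the `KUBECTL` override are illustrative choices rather than values taken from this repository.

```bash
# Wait until every pod in the three namespaces reports Ready.
# Usage: wait_for_finetune_pods [timeout]   (default: 300s, an assumed value)
# KUBECTL can be overridden (hypothetical knob, e.g. KUBECTL=echo) for a dry run.
wait_for_finetune_pods() (
  timeout=${1:-300s}
  for ns in dataprep finetuning-api finetuning-ui; do
    ${KUBECTL:-kubectl} wait --for=condition=Ready pods --all \
      -n "$ns" --timeout="$timeout" || exit 1
  done
)
```

Pods that have already run to completion, such as finished image-build jobs, never become Ready, so this check may time out if such pods are still present in a namespace.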

## Troubleshooting

### Pod not starting

```bash
kubectl logs -n <namespace> <pod-name>
```
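
Logs are empty when a container never started, so it is often useful to look at the pod's events as well. The helpers below are a sketch using standard `kubectl` subcommands; the function names and the `KUBECTL` override are hypothetical.

```bash
# List pods in a namespace that are not in the Running phase.
# Usage: pods_not_running <namespace>
pods_not_running() (
  ${KUBECTL:-kubectl} get pods -n "$1" --field-selector=status.phase!=Running
)

# Show the pod's events (scheduling, image pulls, probes), then the last
# lines logged by the previous container instance, if there was one.
# Usage: inspect_pod <namespace> <pod-name>
inspect_pod() (
  ${KUBECTL:-kubectl} describe pod -n "$1" "$2"
  ${KUBECTL:-kubectl} logs -n "$1" "$2" --previous --tail=50
)
```

`pods_not_running` also lists pods that completed successfully, and `logs --previous` returns an error when the container has not restarted, so neither output is by itself a sign of a problem.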

### Keycloak authentication issues

Re-run the Keycloak setup script:

```bash
bash blueprints/finetuning_service/scripts/setup-keycloak-finetuning.sh
```

Verify both clients exist in the Keycloak admin console: `finetuning-backend` (confidential) and `finetuning-ui` (public).
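
A confidential client can also be checked from the command line by requesting a token with the OAuth2 client-credentials grant. This is a minimal sketch assuming `curl` is available and the token endpoint follows the `/realms/<realm>/protocol/openid-connect/token` format used in `finetune-config.cfg`; the function names and the `CURL` override are hypothetical.

```bash
# Build the token endpoint URL in the documented format.
# Usage: keycloak_token_url <keycloak-host> <realm>
keycloak_token_url() (
  printf 'https://%s/realms/%s/protocol/openid-connect/token\n' "$1" "$2"
)

# Request a token with the OAuth2 client-credentials grant.
# Usage: request_token <token-url> <client-id> <client-secret>
# CURL can be overridden (hypothetical knob, e.g. CURL=echo) for a dry run.
request_token() (
  ${CURL:-curl} -sS -X POST "$1" \
    --data-urlencode "grant_type=client_credentials" \
    --data-urlencode "client_id=$2" \
    --data-urlencode "client_secret=$3"
)
```

A JSON response containing `access_token` indicates that the client ID and secret are accepted; an error response usually points to a wrong secret, realm, or client configuration. Passing the secret as an argument exposes it in the shell history and process list, so treat this as a diagnostic step only.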

### Database connection issues

```bash
kubectl get pods -n dataprep | grep postgres
kubectl get pods -n finetuning-api | grep postgres
```

## Updating Configuration

```bash
# Edit config
vi core/inventory/inference-config.cfg

# Redeploy (choose option 3 - Update)
cd core && ./inference-stack-deploy.sh
```

## Uninstalling

```bash
kubectl delete namespace dataprep
kubectl delete namespace finetuning-api
kubectl delete namespace finetuning-ui
```

Alternatively, set the flag to `off` in `core/inventory/inference-config.cfg` and redeploy:

```properties
deploy_finetune_plugin=off
```
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
---
###############################################################################
# Fine-Tuning Plugin Configuration
###############################################################################
# Only essential settings required - everything else is auto-configured
###############################################################################

###############################################################################
# Nvidia Fine-Tuning Backend Configuration (REQUIRED)
###############################################################################
# URL of your Nvidia fine-tuning backend API
# Example: https://nvidia-server.company.com:8443 or http://192.168.1.100:8443
nvidia_finetune_backend_url:

###############################################################################
# Keycloak OAuth2 Configuration for Nvidia Backend (REQUIRED)
###############################################################################
# The Fine-Tuning API uses OAuth2 client credentials to authenticate with
# the Nvidia backend. These are the credentials configured in Keycloak.

# Keycloak token endpoint URL
# Format: https://<keycloak-domain>/realms/<realm-name>/protocol/openid-connect/token
nvidia_keycloak_token_url:

# OAuth2 Client ID for authenticating with Nvidia backend
nvidia_keycloak_client_id:

# OAuth2 Client Secret for authenticating with Nvidia backend
nvidia_keycloak_client_secret:

# Whether to verify SSL certificates when calling Keycloak (true/false)
# Set to false only in development with self-signed certificates
nvidia_keycloak_verify_ssl: true
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Copyright (C) 2024-2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
---
###########################################################################
# Build Container Images using BuildKit (rootless)
###########################################################################
- name: Display Build Message
  debug:
    msg:
      - "=============================================="
      - "Building container images with BuildKit (rootless)..."
      - "This will take several minutes."
      - "Build Order: DataPrep → API → UI"
      - "=============================================="
  run_once: true

- name: Ensure BuildKit scripts are executable
  shell: |
    chmod +x {{ dataprep_source_path }}/buildkit/data-prep/deploy-dataprep.sh
    chmod +x {{ dataprep_source_path }}/buildkit/celery/deploy-celery.sh
    chmod +x {{ finetune_api_source_path }}/buildkit/deploy-finetuning.sh
    chmod +x {{ finetune_ui_source_path }}/deployment/buildkit/deploy-frontend.sh
  run_once: true

- name: Set proxy build args for BuildKit
  set_fact:
    buildkit_proxy_args: >-
      {% if env_proxy is defined and env_proxy.http_proxy | default('') != '' %}
      HTTP_PROXY={{ env_proxy.http_proxy }}
      http_proxy={{ env_proxy.http_proxy }}
      {% if env_proxy.https_proxy | default('') != '' %}
      HTTPS_PROXY={{ env_proxy.https_proxy }}
      https_proxy={{ env_proxy.https_proxy }}
      {% endif %}
      {% if env_proxy.no_proxy | default('') != '' %}
      NO_PROXY={{ env_proxy.no_proxy }}
      no_proxy={{ env_proxy.no_proxy }}
      {% endif %}
      {% else %}
      {% endif %}
  run_once: true

- name: Build Data Prep Backend Image with BuildKit
  shell: |
    kubectl delete job buildkit-data-prep-backend -n {{ dataprep_namespace }} --ignore-not-found=true 2>/dev/null || true
    cd {{ dataprep_source_path }}/buildkit/data-prep && \
    NAMESPACE={{ dataprep_namespace }} {{ buildkit_proxy_args | trim }} ./deploy-dataprep.sh
  run_once: true

###########################################################################
# Build Data Prep Celery Worker Image with BuildKit
###########################################################################
- name: Build Celery Worker Image with BuildKit
  shell: |
    kubectl delete job buildkit-celery-worker -n {{ dataprep_namespace }} --ignore-not-found=true 2>/dev/null || true
    cd {{ dataprep_source_path }}/buildkit/celery && \
    NAMESPACE={{ dataprep_namespace }} {{ buildkit_proxy_args | trim }} ./deploy-celery.sh
    kubectl wait --for=condition=complete --timeout=600s job/buildkit-celery-worker -n {{ dataprep_namespace }}
  run_once: true

###########################################################################
# Build Fine-Tuning API Image with BuildKit (SECOND - uses MinIO from DataPrep)
###########################################################################
- name: Build Fine-Tuning API Image with BuildKit
  shell: |
    cd {{ finetune_api_source_path }}/buildkit && \
    NAMESPACE={{ finetune_api_namespace }} {{ buildkit_proxy_args | trim }} ./deploy-finetuning.sh
  run_once: true

###########################################################################
# Build Fine-Tuning UI Image with BuildKit (THIRD)
###########################################################################
- name: Build Fine-Tuning UI Image with BuildKit
  shell: |
    cd {{ finetune_ui_source_path }}/deployment/buildkit && \
    NAMESPACE={{ finetune_ui_namespace }} \
    {{ buildkit_proxy_args | trim }} \
    NEXT_PUBLIC_BASE_PATH=/enterprise-ai/ui \
    NEXT_PUBLIC_AUTH_URL={{ ('https://' + cluster_url + '/realms/' + finetune_keycloak_realm) | quote }} \
    NEXT_PUBLIC_FILES_BASE_URL={{ ('https://' + cluster_url + '/enterprise-ai') | quote }} \
    NEXT_PUBLIC_DATAPREP_BASE_URL={{ ('https://' + cluster_url + '/enterprise-ai') | quote }} \
    NEXT_PUBLIC_FINETUNING_API_URL={{ ('https://' + cluster_url + '/enterprise-ai') | quote }} \
    NEXT_TELEMETRY_DISABLED=1 \
    ./deploy-frontend.sh
  run_once: true

- name: Display Build Complete Message
  debug:
    msg:
      - "=============================================="
      - "Container images built successfully!"
      - "=============================================="
  run_once: true
