This example walks through the steps to test inference with a deployed model.
- At this point, it is assumed the node or cluster has been deployed with Intel® AI for Enterprise Inference, either on-premises or on a CSP. If not, follow the deployment guide or refer to all offerings to set up a server.
- Navigate to the Enterprise-Inference/core folder.
- Have the list of models deployed on the node or cluster ready. To see the list, log on to the node or cluster and follow these instructions:
Method 1: This works only if APISIX and Keycloak are deployed. Otherwise, refer to Method 2 below. Run this command to see the list of models deployed:

```bash
kubectl get apisixroutes
```

Method 2:

a) Run `inference-stack-deploy.sh`:

```bash
./inference-stack-deploy.sh
```

b) Select Option 3: Update Deployed Inference Cluster to go into the Update Existing Cluster menu.

c) Select Option 2: Manage LLM Models to go into the Manage LLM Models menu.

d) Select Option 3: List Installed Models to check all deployed models on the node or cluster.

e) After the script has finished, scroll up in the terminal to the section titled "Print Installed Models in Comma Separated Format" to see the list of deployed models.
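If that comma-separated output from step e) needs to be reused programmatically, a trivial sketch (the model names below are placeholders, not actual output):

```python
# Hypothetical example: parse the comma-separated list printed by the script
# into a Python list. The string below is a placeholder, not real output.
installed = "meta-llama/Llama-3.1-8B-Instruct,mistralai/Mistral-7B-Instruct-v0.2"
models = [name.strip() for name in installed.split(",") if name.strip()]
print(models)
```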
For servers supporting LiteLLM: alternatively, run a short Python script with the OpenAI client to get the list of models. This can only be done if the base URL and API key have already been acquired.
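A minimal sketch of that approach, assuming LiteLLM is deployed and that `BASE_URL` and `OPENAI_API_KEY` are exported as described later in this guide (it omits the self-signed-certificate workaround used in the full script below):

```python
from openai import OpenAI
import os

# Assumes BASE_URL and OPENAI_API_KEY are already exported in the shell.
# The OpenAI client reads the key automatically from OPENAI_API_KEY.
client = OpenAI(base_url=os.environ["BASE_URL"])

# The models endpoint is only served by LiteLLM-backed deployments.
for model in client.models.list():
    print(model.id)
```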
Run the command below to generate an API token used to access the node or cluster. The `BASE_URL` needs to be set to the domain used in the setup process.

```bash
source scripts/generate-token.sh
```

Save the token for later use.
- Install Python. Ensure the version is compatible.
- Install `openai`:

```bash
pip install openai
```

- Set environment variables:
- `BASE_URL` is the HTTPS endpoint of the remote server with the model of choice and `/v1` (e.g. `https://api.example.com/<model-name>/v1`). The deployed model name can be found by running `kubectl get apisixroutes` for a list of deployed models. Note: if using LiteLLM, the model name and `v1` are not needed. By default, LiteLLM is not used.
- `OPENAI_API_KEY` is the access token or key to access the model(s) on the server.
```bash
export BASE_URL="base_url_or_domain_of_node_or_cluster"
export OPENAI_API_KEY="contents_of_TOKEN"
```

Create a script `inference.py` with these contents. Change the model if needed. The commented-out code that lists the models will only work if the remote server is deployed with LiteLLM; otherwise, only the specified model from the `BASE_URL` will be shown. If the SSL certificate is self-signed, an HTTP client is created with the argument `verify=False` to bypass verification.
```python
from openai import OpenAI
import os
import httpx

# Create a custom HTTP client with SSL verification disabled and custom headers
http_client = httpx.Client(
    verify=False,
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
)

client = OpenAI(
    base_url=os.environ["BASE_URL"],
    http_client=http_client,
)

# For remote servers using LiteLLM only: list out available models from the endpoint
# models = client.models.list()
# print("Available models: %s" % models)

# Run inference with the model
print("Running inference with selected model:")
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
)
print(completion.choices[0].message)
```

Run the script. The output should be the response to the query.
```bash
python3 inference.py
```

The model can be customized to any model deployed on the node or cluster, and the prompt can be changed in the `messages` argument.
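For example, a small variation of the script (the `MODEL_NAME` environment variable and the prompt here are illustrative placeholders, not part of the deployment):

```python
from openai import OpenAI
import os

# Hypothetical customization of inference.py: choose the model via a
# MODEL_NAME env var and ask a different question. The API key is read
# automatically from OPENAI_API_KEY; the self-signed-certificate
# workaround from the script above is omitted here for brevity.
client = OpenAI(base_url=os.environ["BASE_URL"])

completion = client.chat.completions.create(
    model=os.environ.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain in one sentence what an inference endpoint does."},
    ],
)
# .message is the whole message object; .content is just the reply text.
print(completion.choices[0].message.content)
```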
Congratulations! Now use Intel® AI for Enterprise Inference to power other GenAI applications!
Return to the Post Deployment section for additional resources and tasks to try.