# Ollama Service
Ollama provides high-performance LLM inference with GPU acceleration.

## Overview

| Property | Value |
|---|---|
| Type | LLM Inference |
| Default Port | 11434 |
| GPU Required | Yes |
| Container | docker://ollama/ollama:latest |

## Quick Start

```bash
# Start Ollama service
python main.py --recipe recipes/services/ollama.yaml

# Check status
python main.py --status

# Run benchmark (replace ollama_xxx with the service ID reported by --status)
python main.py --recipe recipes/clients/ollama_benchmark.yaml --target-service ollama_xxx
```

## Recipe Configuration

### Basic Recipe

```yaml
# recipes/services/ollama.yaml
service:
  name: ollama
  description: "Ollama LLM inference server"

  container:
    docker_source: docker://ollama/ollama:latest
    image_path: $HOME/containers/ollama_latest.sif

  resources:
    nodes: 1
    ntasks: 1
    cpus_per_task: 4
    mem: "32G"
    time: "04:00:00"
    partition: gpu
    qos: default
    gres: "gpu:1"

  environment:
    OLLAMA_HOST: "0.0.0.0:11434"
    OLLAMA_MODELS: "$HOME/.ollama/models"
    OLLAMA_NUM_PARALLEL: "4"

  ports:
    - 11434
```
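
Once the job is in the RUNNING state, a quick sanity check against the bound port confirms the API is up; `/api/tags` returns the list of locally available models (empty on a fresh install):

```bash
# Run on the allocated node, or replace localhost with its hostname
curl http://localhost:11434/api/tags
```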

### With Monitoring

```yaml
# recipes/services/ollama_with_cadvisor.yaml
service:
  name: ollama
  # ... same as above ...
  enable_cadvisor: true
  cadvisor_port: 8080
```
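
With cAdvisor enabled, per-container resource metrics are served on `cadvisor_port`. A quick check, assuming the port is reachable from where you run the command and the service landed on the node `mel2073` used in the API examples below:

```bash
# cAdvisor exposes Prometheus-format metrics under /metrics
curl -s http://mel2073:8080/metrics | head -n 20
```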

## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `OLLAMA_HOST` | Bind address and port | `0.0.0.0:11434` |
| `OLLAMA_MODELS` | Model storage directory | `$HOME/.ollama/models` |
| `OLLAMA_NUM_PARALLEL` | Concurrent requests | `4` |
| `OLLAMA_NUM_GPU` | GPUs to use | All available |
| `OLLAMA_GPU_LAYERS` | Layers to offload to GPU | All |

## Supported Models

Models are pulled automatically on first use:
| Model | Size | Description |
|---|---|---|
| `llama2` | 3.8GB | Meta's Llama 2 |
| `llama2:13b` | 7.4GB | Llama 2 13B |
| `codellama` | 3.8GB | Code-focused Llama |
| `mistral` | 4.1GB | Mistral 7B |
| `qwen2.5:0.5b` | 0.4GB | Qwen 0.5B (fast) |
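
A model can also be pulled ahead of time through the API so the first generate request does not block on the download. A minimal example; current Ollama releases accept `model` here (older ones used `name`):

```bash
# Pre-pull Mistral; "stream": false returns a single response once the download finishes
curl http://mel2073:11434/api/pull -d '{
  "model": "mistral",
  "stream": false
}'
```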

## API Endpoints

Once running, Ollama exposes:
| Endpoint | Method | Description |
|---|---|---|
| `/api/generate` | POST | Generate text completion |
| `/api/chat` | POST | Chat completion |
| `/api/tags` | GET | List available models |
| `/api/pull` | POST | Pull a model |
| `/api/embeddings` | POST | Generate embeddings |

### Example: Generate Text

```bash
curl http://mel2073:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What is machine learning?",
  "stream": false
}'
```

### Example: Chat

```bash
curl http://mel2073:11434/api/chat -d '{
  "model": "llama2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```

## Benchmark Client

The Ollama benchmark client tests inference performance:
```yaml
# recipes/clients/ollama_benchmark.yaml
client:
  name: ollama_benchmark
  type: ollama_benchmark

  parameters:
    model: "llama2"
    num_requests: 50
    concurrent_requests: 5
    prompt_file: "prompts.txt"  # Optional
    output_file: "$HOME/results/ollama_benchmark.json"

  resources:
    cpus_per_task: 2
    mem: "4G"
    time: "00:30:00"
    partition: cpu
```

### Benchmark Metrics

| Metric | Description |
|---|---|
| `requests_per_second` | Throughput |
| `tokens_per_second` | Generation speed |
| `latency_mean` | Average response time |
| `latency_p95` | 95th percentile latency |
| `latency_p99` | 99th percentile latency |
| `success_rate` | Percentage of successful requests |
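
The client writes these metrics to the `output_file` set in the recipe. A quick way to pull out the headline numbers, assuming the metrics appear as top-level keys in that JSON (adjust the paths if the client nests them differently):

```bash
jq '{rps: .requests_per_second, tok_s: .tokens_per_second, p95: .latency_p95}' \
  "$HOME/results/ollama_benchmark.json"
```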

## Multi-GPU Configuration

For larger models or higher throughput:
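
The relevant knobs already appear above: the `gres` request in `resources` and `OLLAMA_NUM_GPU` from the environment-variable table. A sketch, assuming two GPUs on a single node and the same recipe schema as the basic example; GPU counts, memory, and partition names depend on the cluster:

```yaml
# Hypothetical multi-GPU variant of recipes/services/ollama.yaml
service:
  name: ollama
  # ... container and ports as in the basic recipe ...
  resources:
    nodes: 1
    ntasks: 1
    cpus_per_task: 8
    mem: "64G"
    time: "04:00:00"
    partition: gpu
    gres: "gpu:2"            # request two GPUs
  environment:
    OLLAMA_HOST: "0.0.0.0:11434"
    OLLAMA_NUM_PARALLEL: "8"
    OLLAMA_NUM_GPU: "2"      # use both requested GPUs
```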

## Troubleshooting

### Model Not Loading

```bash
# Check GPU availability
nvidia-smi

# Check Ollama logs
cat slurm-*.out | grep -i error

# Verify model exists
curl http://localhost:11434/api/tags
```

### Out of Memory

- Reduce `OLLAMA_NUM_PARALLEL`
- Use a smaller model
- Request more GPU memory in `resources`

### Connection Refused

- Verify `OLLAMA_HOST` is set to `0.0.0.0:11434`
- Check firewall/network settings
- Ensure the service is in the RUNNING state

See also: Services Overview | Benchmark Examples