# Ollama Service
Ollama provides high-performance LLM inference with GPU acceleration.

## Overview

| Property | Value |
|---|---|
| Type | LLM Inference |
| Default Port | 11434 |
| GPU Required | Yes |
| Container | docker://ollama/ollama:latest |

## Quick Start

```bash
# Start Ollama service
python main.py --recipe recipes/services/ollama.yaml

# Check status
python main.py --status

# Run benchmark (replace ollama_xxx with the service ID reported by --status)
python main.py --recipe recipes/clients/ollama_benchmark.yaml --target-service ollama_xxx
```

## Recipe Configuration

### Basic Recipe

```yaml
# recipes/services/ollama.yaml
service:
  name: ollama
  description: "Ollama LLM inference server"

  container:
    docker_source: docker://ollama/ollama:latest
    image_path: $HOME/containers/ollama_latest.sif

  resources:
    nodes: 1
    ntasks: 1
    cpus_per_task: 4
    mem: "32G"
    time: "04:00:00"
    partition: gpu
    qos: default
    gres: "gpu:1"

  environment:
    OLLAMA_HOST: "0.0.0.0:11434"
    OLLAMA_MODELS: "$HOME/.ollama/models"
    OLLAMA_NUM_PARALLEL: "4"

  ports:
    - 11434
```
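
Once the job is in the RUNNING state, a quick sanity check against the bound port confirms the API is up; `/api/tags` returns the list of locally available models (empty on a fresh install):

```bash
# Run on the allocated node, or replace localhost with its hostname
curl http://localhost:11434/api/tags
```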

### With Monitoring

```yaml
# recipes/services/ollama_with_cadvisor.yaml
service:
  name: ollama
  # ... same as above ...
  enable_cadvisor: true
  cadvisor_port: 8080
```
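
With cAdvisor enabled, per-container resource metrics are served on `cadvisor_port`. A quick check, assuming the port is reachable from where you run the command and the service landed on the node `mel2073` used in the API examples below:

```bash
# cAdvisor exposes Prometheus-format metrics under /metrics
curl -s http://mel2073:8080/metrics | head -n 20
```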

## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `OLLAMA_HOST` | Bind address and port | `0.0.0.0:11434` |
| `OLLAMA_MODELS` | Model storage directory | `$HOME/.ollama/models` |
| `OLLAMA_NUM_PARALLEL` | Concurrent requests | `4` |
| `OLLAMA_NUM_GPU` | GPUs to use | All available |
| `OLLAMA_GPU_LAYERS` | Layers to offload to GPU | All |

## Supported Models

Models are pulled automatically on first use:
| Model | Size | Description |
|---|---|---|
| `llama2` | 3.8GB | Meta's Llama 2 |
| `llama2:13b` | 7.4GB | Llama 2 13B |
| `codellama` | 3.8GB | Code-focused Llama |
| `mistral` | 4.1GB | Mistral 7B |
| `qwen2.5:0.5b` | 0.4GB | Qwen 0.5B (fast) |
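
A model can also be pulled ahead of time through the API so the first generate request does not block on the download. A minimal example; current Ollama releases accept `model` here (older ones used `name`):

```bash
# Pre-pull Mistral; "stream": false returns a single response once the download finishes
curl http://mel2073:11434/api/pull -d '{
  "model": "mistral",
  "stream": false
}'
```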

## API Endpoints

Once running, Ollama exposes:
| Endpoint | Method | Description |
|---|---|---|
| `/api/generate` | POST | Generate text completion |
| `/api/chat` | POST | Chat completion |
| `/api/tags` | GET | List available models |
| `/api/pull` | POST | Pull a model |
| `/api/embeddings` | POST | Generate embeddings |

### Example: Generate Text

```bash
curl http://mel2073:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What is machine learning?",
  "stream": false
}'
```

### Example: Chat

```bash
curl http://mel2073:11434/api/chat -d '{
  "model": "llama2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
```

## Benchmark Client

The Ollama benchmark client tests inference performance:
```yaml
# recipes/clients/ollama_benchmark.yaml
client:
  name: ollama_benchmark
  type: ollama_benchmark

  parameters:
    model: "llama2"
    num_requests: 50
    concurrent_requests: 5
    prompt_file: "prompts.txt"  # Optional
    output_file: "$HOME/results/ollama_benchmark.json"

  resources:
    cpus_per_task: 2
    mem: "4G"
    time: "00:30:00"
    partition: cpu
```

### Benchmark Metrics

| Metric | Description |
|---|---|
| `requests_per_second` | Throughput |
| `tokens_per_second` | Generation speed |
| `latency_mean` | Average response time |
| `latency_p95` | 95th percentile latency |
| `latency_p99` | 99th percentile latency |
| `success_rate` | Percentage of successful requests |
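
The client writes these metrics to the `output_file` set in the recipe. A quick way to pull out the headline numbers, assuming the metrics appear as top-level keys in that JSON (adjust the paths if the client nests them differently):

```bash
jq '{rps: .requests_per_second, tok_s: .tokens_per_second, p95: .latency_p95}' \
  "$HOME/results/ollama_benchmark.json"
```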

## Multi-GPU Configuration

For larger models or higher throughput:
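
The relevant knobs already appear above: the `gres` request in `resources` and `OLLAMA_NUM_GPU` from the environment-variable table. A sketch, assuming two GPUs on a single node and the same recipe schema as the basic example; GPU counts, memory, and partition names depend on the cluster:

```yaml
# Hypothetical multi-GPU variant of recipes/services/ollama.yaml
service:
  name: ollama
  # ... container and ports as in the basic recipe ...
  resources:
    nodes: 1
    ntasks: 1
    cpus_per_task: 8
    mem: "64G"
    time: "04:00:00"
    partition: gpu
    gres: "gpu:2"            # request two GPUs
  environment:
    OLLAMA_HOST: "0.0.0.0:11434"
    OLLAMA_NUM_PARALLEL: "8"
    OLLAMA_NUM_GPU: "2"      # use both requested GPUs
```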

## Troubleshooting

### Model Not Loading

```bash
# Check GPU availability
nvidia-smi

# Check Ollama logs
cat slurm-*.out | grep -i error

# Verify model exists
curl http://localhost:11434/api/tags
```

### Out of Memory

- Reduce `OLLAMA_NUM_PARALLEL`
- Use a smaller model
- Request more GPU memory in `resources`

### Connection Refused

- Verify `OLLAMA_HOST` is set to `0.0.0.0:11434`
- Check firewall/network settings
- Ensure the service is in the RUNNING state

See also: Services Overview | Benchmark Examples