
Service Recipes

Service recipes define how AI and database services are deployed as containerized SLURM jobs on the HPC cluster.

Available Service Recipes

| Recipe | Service | GPU | Monitoring |
|--------|---------|-----|------------|
| ollama.yaml | Ollama LLM | Yes | No |
| ollama_with_cadvisor.yaml | Ollama LLM | Yes | Yes |
| redis.yaml | Redis | No | No |
| redis_with_cadvisor.yaml | Redis | No | Yes |
| chroma.yaml | Chroma | No | No |
| chroma_with_cadvisor.yaml | Chroma | No | Yes |
| mysql.yaml | MySQL | No | No |
| mysql_with_cadvisor.yaml | MySQL | No | Yes |
| prometheus_with_cadvisor.yaml | Prometheus | No | Yes |
| grafana.yaml | Grafana | No | No |

Recipe Fields Reference

Required Fields

| Field | Type | Description |
|-------|------|-------------|
| name | string | Unique service identifier |
| container.docker_source | string | Docker image source |
| container.image_path | string | Local SIF path |
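
The two container fields work as a pair: docker_source names the upstream image, and image_path is where its converted SIF file lives. A rough manual equivalent, assuming the framework pulls the image with Apptainer when the SIF does not yet exist:

# Manual equivalent of the container fields (assumption: the framework
# runs apptainer pull when image_path is missing)
apptainer pull $HOME/containers/ollama_latest.sif docker://ollama/ollama:latest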

Resource Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| resources.nodes | int | 1 | Number of nodes |
| resources.ntasks | int | 1 | Number of tasks |
| resources.cpus_per_task | int | 1 | CPUs per task |
| resources.mem | string | "4G" | Memory allocation |
| resources.time | string | "01:00:00" | Time limit |
| resources.partition | string | "cpu" | SLURM partition |
| resources.qos | string | "default" | Quality of service |
| resources.gres | string | - | GPU resources (e.g. "gpu:1") |
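
Each resources field maps one-to-one onto an #SBATCH directive in the generated script (see Generated SLURM Script below). For example, a GPU service might request:

resources:
  partition: gpu       # becomes #SBATCH --partition=gpu
  gres: "gpu:1"        # becomes #SBATCH --gres=gpu:1
  mem: "32G"           # becomes #SBATCH --mem=32G
  time: "04:00:00"     # becomes #SBATCH --time=04:00:00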

Optional Fields

| Field | Type | Description |
|-------|------|-------------|
| description | string | Human-readable description |
| environment | dict | Environment variables |
| ports | list | Exposed ports |
| enable_cadvisor | bool | Enable monitoring |
| cadvisor_port | int | cAdvisor port (default: 8080) |
| command | string | Override container command |
| args | list | Command arguments |
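
Only the three required fields are mandatory; everything else falls back to the defaults above. A minimal sketch (the service name and image are placeholders):

# recipes/services/minimal.yaml (hypothetical)
service:
  name: myservice                              # required: unique identifier
  container:
    docker_source: docker://myimage:latest     # required: upstream image
    image_path: $HOME/containers/myimage.sif   # required: local SIF path
  # resources omitted: 1 node, 1 task, 1 CPU, 4G, 01:00:00, cpu partition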

Example: Ollama Service

# recipes/services/ollama.yaml
service:
  name: ollama
  description: "Ollama LLM inference server with GPU acceleration"

  container:
    docker_source: docker://ollama/ollama:latest
    image_path: $HOME/containers/ollama_latest.sif

  resources:
    nodes: 1
    ntasks: 1
    cpus_per_task: 4
    mem: "32G"
    time: "04:00:00"
    partition: gpu
    qos: default
    gres: "gpu:1"

  environment:
    OLLAMA_HOST: "0.0.0.0:11434"
    OLLAMA_MODELS: "$HOME/.ollama/models"
    OLLAMA_NUM_PARALLEL: "4"
    OLLAMA_KEEP_ALIVE: "5m"

  ports:
    - 11434
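
Once the job is running, the service answers on the exposed port with the standard Ollama HTTP API. For example, from the compute node (the model name is a placeholder for one you have pulled):

# List models available under $HOME/.ollama/models
curl http://localhost:11434/api/tags

# Request a completion (assumes a model such as llama3 is present)
curl http://localhost:11434/api/generate \
    -d '{"model": "llama3", "prompt": "Hello", "stream": false}'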

Example: Redis with Monitoring

# recipes/services/redis_with_cadvisor.yaml
service:
  name: redis
  description: "Redis in-memory database with cAdvisor monitoring"

  container:
    docker_source: docker://redis:latest
    image_path: $HOME/containers/redis_latest.sif

  resources:
    nodes: 1
    cpus_per_task: 4
    mem: "8G"
    time: "02:00:00"
    partition: cpu

  environment:
    REDIS_PORT: "6379"
    REDIS_BIND: "0.0.0.0"

  ports:
    - 6379

  # Enable cAdvisor sidecar
  enable_cadvisor: true
  cadvisor_port: 8080
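
With enable_cadvisor set, a cAdvisor sidecar exports container metrics in Prometheus format on cadvisor_port. A quick sanity check from the node running the service:

# cAdvisor serves Prometheus-format metrics on /metrics
curl -s http://localhost:8080/metrics | grep container_cpu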

Example: Prometheus

# recipes/services/prometheus_with_cadvisor.yaml
service:
  name: prometheus
  description: "Prometheus metrics collection"

  container:
    docker_source: docker://prom/prometheus:latest
    image_path: $HOME/containers/prometheus.sif

  # Services to monitor (resolved at runtime)
  monitoring_targets:
    - service_id: "ollama_abc123"
      job_name: "ollama-cadvisor"
      port: 8080
    - service_id: "redis_xyz789"
      job_name: "redis-cadvisor"
      port: 8080

  resources:
    cpus_per_task: 2
    mem: "4G"
    time: "02:00:00"
    partition: cpu

  environment:
    PROMETHEUS_RETENTION_TIME: "15d"

  ports:
    - 9090
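
Each monitoring_targets entry is resolved into a Prometheus scrape job pointing at the cAdvisor endpoint of the named service. A sketch of what the generated configuration might look like (the resolved hostnames are placeholders):

# Hypothetical prometheus.yml derived from monitoring_targets
scrape_configs:
  - job_name: "ollama-cadvisor"
    static_configs:
      - targets: ["<node-of-ollama_abc123>:8080"]
  - job_name: "redis-cadvisor"
    static_configs:
      - targets: ["<node-of-redis_xyz789>:8080"]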

Using Service Recipes

# Basic usage
python main.py --recipe recipes/services/ollama.yaml

# With verbose output
python main.py --verbose --recipe recipes/services/redis.yaml

# Check what's running
python main.py --status
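
Services run on compute nodes, so reaching them from outside the cluster usually requires an SSH tunnel through the login node. A sketch, with all host and user names as placeholders:

# Forward a local port to the Ollama service on its compute node
ssh -L 11434:<compute-node>:11434 <user>@<login-node>

# Then query it from your machine
curl http://localhost:11434/api/tags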

Generated SLURM Script

From a service recipe, the framework generates a SLURM batch script like the following:

#!/bin/bash
#SBATCH --job-name=ollama_a1b2c3d4
#SBATCH --account=p200981
#SBATCH --partition=gpu
#SBATCH --qos=default
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --gres=gpu:1
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

module purge
module load Apptainer/1.2.4-GCCcore-12.3.0

# Service setup
export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_MODELS=$HOME/.ollama/models
mkdir -p $HOME/.ollama/models

# Container execution
apptainer exec --nv \
    --bind $HOME/.ollama:/root/.ollama \
    $HOME/containers/ollama_latest.sif \
    ollama serve &

# Health check
sleep 10
curl -s http://localhost:11434/api/tags

wait
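
After submission, standard SLURM tooling shows whether the service came up:

# Check job state and the assigned node
squeue -u $USER

# Follow the service log (job ID from squeue)
tail -f slurm-<jobid>.out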

Next: Client Recipes | Writing Custom Recipes