Service Recipes

Service recipes are YAML files that define how AI, database, and monitoring services are deployed on the HPC cluster.

Available Service Recipes

Recipe	Service	GPU	Monitoring
`ollama.yaml`	Ollama LLM	Yes	No
`ollama_with_cadvisor.yaml`	Ollama LLM	Yes	Yes
`redis.yaml`	Redis	No	No
`redis_with_cadvisor.yaml`	Redis	No	Yes
`chroma.yaml`	Chroma	No	No
`chroma_with_cadvisor.yaml`	Chroma	No	Yes
`mysql.yaml`	MySQL	No	No
`mysql_with_cadvisor.yaml`	MySQL	No	Yes
`prometheus_with_cadvisor.yaml`	Prometheus	No	Yes
`grafana.yaml`	Grafana	No	No

Recipe Fields Reference

Required Fields

Field	Type	Description
`name`	string	Unique service identifier
`container.docker_source`	string	Docker image source
`container.image_path`	string	Local SIF path
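
The required fields can be checked before a job is submitted. The following is a minimal sketch, not the project's actual validation code: `missing_required_fields` is a hypothetical helper, and it assumes the recipe has already been parsed and that the mapping under the top-level `service:` key is passed in.

```python
# Hypothetical helper for illustration; the real loader may validate differently.
REQUIRED_FIELDS = ("name", "container.docker_source", "container.image_path")


def missing_required_fields(service: dict) -> list[str]:
    """Return the dotted paths of required fields absent from a service mapping."""
    missing = []
    for path in REQUIRED_FIELDS:
        node = service
        # Walk the dotted path one key at a time, e.g. container -> image_path.
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing


# A recipe that forgot to set container.image_path.
recipe = {
    "name": "redis",
    "container": {"docker_source": "docker://redis:latest"},
}
print(missing_required_fields(recipe))
```

Reporting every missing path at once, rather than stopping at the first, lets a recipe author fix all omissions in one pass.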

Resource Fields

Field	Type	Default	Description
`resources.nodes`	int	1	Number of nodes
`resources.ntasks`	int	1	Number of tasks
`resources.cpus_per_task`	int	1	CPUs per task
`resources.mem`	string	"4G"	Memory allocation
`resources.time`	string	"01:00:00"	Time limit
`resources.partition`	string	"cpu"	SLURM partition
`resources.qos`	string	"default"	Quality of service
`resources.gres`	string	-	GPU resources, e.g. "gpu:1" (no default)
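
To illustrate how the defaults in this table could combine with a recipe's `resources` block, here is a short sketch. The default values are copied from the table; the merge logic and the `sbatch_directives` name are assumptions made for illustration, not the project's actual generator.

```python
# Defaults copied from the Resource Fields table. gres has no default,
# so it only appears when a recipe sets it explicitly.
RESOURCE_DEFAULTS = {
    "nodes": 1,
    "ntasks": 1,
    "cpus_per_task": 1,
    "mem": "4G",
    "time": "01:00:00",
    "partition": "cpu",
    "qos": "default",
}


def sbatch_directives(resources: dict) -> list[str]:
    """Merge recipe resources over the defaults and render #SBATCH lines."""
    merged = {**RESOURCE_DEFAULTS, **resources}
    lines = []
    for key, value in merged.items():
        flag = key.replace("_", "-")  # cpus_per_task -> cpus-per-task
        lines.append(f"#SBATCH --{flag}={value}")
    return lines


# Resources from the Redis example, which omits ntasks and qos.
redis_resources = {
    "nodes": 1,
    "cpus_per_task": 4,
    "mem": "8G",
    "time": "02:00:00",
    "partition": "cpu",
}
for line in sbatch_directives(redis_resources):
    print(line)
```

In this sketch the omitted `ntasks` and `qos` fall back to `1` and `default`, and no `--gres` line is emitted for a CPU-only service.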

Optional Fields

Field	Type	Description
`description`	string	Human-readable description
`environment`	dict	Environment variables
`ports`	list	Exposed ports
`enable_cadvisor`	bool	Enable the cAdvisor monitoring sidecar
`cadvisor_port`	int	cAdvisor port (default: 8080)
`command`	string	Override container command
`args`	list	Command arguments
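
The `environment` dict ends up as shell variables in the job script. The sketch below shows one plausible way to render it; `export_lines` is a hypothetical name, and the quoting strategy is an assumption rather than the generator's confirmed behaviour.

```python
def export_lines(environment: dict[str, str]) -> list[str]:
    """Render an environment mapping as bash `export` statements."""
    lines = []
    for key, value in environment.items():
        # Double quotes protect spaces while still letting bash expand
        # references such as $HOME. Backslashes, double quotes and backticks
        # are escaped; $(...) is NOT neutralised, so values must be trusted.
        escaped = (
            str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("`", "\\`")
        )
        lines.append(f'export {key}="{escaped}"')
    return lines


env = {
    "OLLAMA_HOST": "0.0.0.0:11434",
    "OLLAMA_MODELS": "$HOME/.ollama/models",
}
for line in export_lines(env):
    print(line)
```

Single-quoting every value (for example with `shlex.quote`) would be safer against injection, but it would also stop `$HOME` from expanding on the compute node, which the recipes in this page rely on.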

Example: Ollama Service

# recipes/services/ollama.yaml
service:
  name: ollama
  description: "Ollama LLM inference server with GPU acceleration"

  container:
    docker_source: docker://ollama/ollama:latest
    image_path: $HOME/containers/ollama_latest.sif

  resources:
    nodes: 1
    ntasks: 1
    cpus_per_task: 4
    mem: "32G"
    time: "04:00:00"
    partition: gpu
    qos: default
    gres: "gpu:1"

  environment:
    OLLAMA_HOST: "0.0.0.0:11434"
    OLLAMA_MODELS: "$HOME/.ollama/models"
    OLLAMA_NUM_PARALLEL: "4"
    OLLAMA_KEEP_ALIVE: "5m"

  ports:
    - 11434

Example: Redis with Monitoring

# recipes/services/redis_with_cadvisor.yaml
service:
  name: redis
  description: "Redis in-memory database with cAdvisor monitoring"

  container:
    docker_source: docker://redis:latest
    image_path: $HOME/containers/redis_latest.sif

  resources:
    nodes: 1
    cpus_per_task: 4
    mem: "8G"
    time: "02:00:00"
    partition: cpu

  environment:
    REDIS_PORT: "6379"
    REDIS_BIND: "0.0.0.0"

  ports:
    - 6379

  # Enable cAdvisor sidecar
  enable_cadvisor: true
  cadvisor_port: 8080

Example: Prometheus

# recipes/services/prometheus_with_cadvisor.yaml
service:
  name: prometheus
  description: "Prometheus metrics collection"

  container:
    docker_source: docker://prom/prometheus:latest
    image_path: $HOME/containers/prometheus.sif

  # Services to monitor (resolved at runtime)
  monitoring_targets:
    - service_id: "ollama_abc123"
      job_name: "ollama-cadvisor"
      port: 8080
    - service_id: "redis_xyz789"
      job_name: "redis-cadvisor"
      port: 8080

  resources:
    cpus_per_task: 2
    mem: "4G"
    time: "02:00:00"
    partition: cpu

  environment:
    PROMETHEUS_RETENTION_TIME: "15d"

  ports:
    - 9090
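
The `monitoring_targets` entries are resolved at runtime, which presumably means mapping each `service_id` to the node where that service is running. The following sketch shows what such a resolution step might look like. The `build_scrape_configs` function, the `hosts` mapping, and the node names are all hypothetical; only the `job_name` / `static_configs` / `targets` layout is taken from Prometheus's scrape configuration format.

```python
def build_scrape_configs(targets: list[dict], hosts: dict[str, str]) -> list[dict]:
    """Turn recipe monitoring targets into Prometheus scrape_configs entries.

    `hosts` maps service_id -> hostname of the node running that service
    (assumed to come from the scheduler; how it is obtained is not shown).
    """
    configs = []
    for target in targets:
        host = hosts.get(target["service_id"])
        if host is None:
            continue  # Service is not running yet, so there is nothing to scrape.
        configs.append(
            {
                "job_name": target["job_name"],
                "static_configs": [{"targets": [f"{host}:{target['port']}"]}],
            }
        )
    return configs


# Targets from the recipe above; the hostname is a made-up placeholder.
targets = [
    {"service_id": "ollama_abc123", "job_name": "ollama-cadvisor", "port": 8080},
    {"service_id": "redis_xyz789", "job_name": "redis-cadvisor", "port": 8080},
]
hosts = {"ollama_abc123": "node-001"}  # redis_xyz789 is not running in this example
print(build_scrape_configs(targets, hosts))
```

Skipping unresolved services keeps Prometheus from scraping a target that does not exist; a stricter implementation could raise an error instead.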

Using Service Recipes

# Basic usage
python main.py --recipe recipes/services/ollama.yaml

# With verbose output
python main.py --verbose --recipe recipes/services/redis.yaml

# Check what's running
python main.py --status

Generated SLURM Script

Each service recipe is rendered into a SLURM batch script. For the Ollama recipe above, the generated script looks like this:

#!/bin/bash
#SBATCH --job-name=ollama_a1b2c3d4
#SBATCH --account=p200981
#SBATCH --partition=gpu
#SBATCH --qos=default
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --gres=gpu:1
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

module purge
module load Apptainer/1.2.4-GCCcore-12.3.0

# Service setup
export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_MODELS=$HOME/.ollama/models
mkdir -p $HOME/.ollama/models

# Container execution
apptainer exec --nv \
    --bind $HOME/.ollama:/root/.ollama \
    $HOME/containers/ollama_latest.sif \
    ollama serve &

# Health check
sleep 10
curl -s http://localhost:11434/api/tags

wait
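
The health check in the script waits a fixed 10 seconds before probing the service once. Where startup time varies (for example, on a cold node), polling until the endpoint responds can be more robust. The sketch below shows that idea in Python using only the standard library; `wait_until_ready` is a hypothetical helper and is not part of the generated script.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Not accepting connections yet, or returned an error status.
        time.sleep(interval)
    return False
```

For the Ollama service above, a call such as `wait_until_ready("http://localhost:11434/api/tags")` would return `True` as soon as the server responds, and `False` if it is still unreachable after a minute.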

Next: Client Recipes | Writing Custom Recipes