Services Overview¶
The orchestrator supports multiple AI and database services, each optimized for HPC deployment via Apptainer containers.
Available Services¶
| Service | Type | Default Port | GPU Required | Description |
|---|---|---|---|---|
| Ollama | LLM Inference | 11434 | Yes | High-performance LLM server |
| Redis | In-Memory DB | 6379 | No | Key-value store with persistence |
| Chroma | Vector DB | 8000 | No | Vector similarity search |
| MySQL | RDBMS | 3306 | No | Relational database |
| Prometheus | Monitoring | 9090 | No | Metrics collection |
| Grafana | Visualization | 3000 | No | Dashboard visualization |
Service Architecture¶
Each service follows a common pattern:
graph LR
subgraph SLURM["SLURM Job"]
subgraph Container["Apptainer Container"]
Service["Service Process"]
end
cAdvisor["cAdvisor<br/>(optional)"]
end
Client["Benchmark Client"] --> Service
cAdvisor --> Prometheus["Prometheus"]
Prometheus --> Grafana["Grafana"]
style Service fill:#C8E6C9
style cAdvisor fill:#FFE0B2
Common Configuration¶
All services share these common recipe fields:
service:
name: service_name
description: "Service description"
# Container configuration
container:
docker_source: docker://image:tag
image_path: $HOME/containers/image.sif
# SLURM resources
resources:
nodes: 1
ntasks: 1
cpus_per_task: 4
mem: "16G"
time: "02:00:00"
partition: gpu # or cpu
qos: default
# Environment variables
environment:
VAR_NAME: "value"
# Exposed ports
ports:
- 8080
# Optional monitoring
enable_cadvisor: true
cadvisor_port: 8080
Service Lifecycle¶
stateDiagram-v2
[*] --> Recipe: Load YAML
Recipe --> Script: Generate SLURM script
Script --> Submitted: sbatch
Submitted --> Pending: Queued
Pending --> Starting: Resources allocated
Starting --> Running: Container ready
Running --> Running: Health checks pass
Running --> Stopped: scancel or timeout
Stopped --> [*]
Monitoring Integration¶
Services can be deployed with cAdvisor for container metrics:
This enables collection of:
- CPU Usage:
container_cpu_usage_seconds_total - Memory Usage:
container_memory_usage_bytes - Network I/O:
container_network_*_bytes_total - Filesystem:
container_fs_usage_bytes
Quick Start Commands¶
# List available services
python main.py --list-services
# Start a service
python main.py --recipe recipes/services/ollama.yaml
# Check status
python main.py --status
# Stop a service
python main.py --stop-service <service_id>
Service-Specific Documentation¶
- Ollama (LLM) - GPU-accelerated language model inference
- Redis - In-memory database with benchmarking
- Chroma - Vector similarity search
- MySQL - Relational database
- Prometheus - Metrics collection and storage
- Grafana - Visualization and dashboards