# Getting Started
This guide will help you set up and run the HPC AI Benchmarking Orchestrator on the MeluXina supercomputer.
## Overview
The HPC AI Benchmarking Orchestrator is a Python framework for deploying, benchmarking, and monitoring containerized AI services on HPC clusters. It automates the complex workflow of SLURM job submission, container management, and metrics collection.
Key Capabilities:
- Deploy AI services (Ollama, Redis, Chroma, MySQL) via SLURM
- Run automated benchmark workloads with configurable parameters
- Collect real-time metrics via Prometheus and cAdvisor
- Visualize performance through Grafana dashboards
- Generate benchmark reports and analysis
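Services are described in YAML recipes. A minimal sketch of what such a recipe might look like (the schema and field names here are assumptions for illustration, not taken from this guide; see the Installation Guide for the real format):

```yaml
# Hypothetical recipe for an Ollama service -- keys are illustrative
service:
  name: ollama
  image: ollama.sif          # Apptainer image available on the cluster
  port: 11434
slurm:
  account: p200981           # your project allocation
  partition: gpu
  gres: gpu:1
  time: "01:00:00"
```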
## System Requirements

### Local Machine
- Python 3.9+
- SSH client with key-based authentication
- Git
### HPC Cluster (MeluXina)
- SLURM workload manager
- Apptainer/Singularity for containerization
- Access to GPU nodes (for Ollama)
- Project allocation (account `p200981`, or your own project)
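Once logged in to the cluster, the required tools can be verified with a short script (a sketch only; the tool names match the requirements above, but the orchestrator may perform this check differently):

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


# SLURM and Apptainer/Singularity must be available on the cluster
required = ["sbatch", "squeue", "apptainer"]
missing = missing_tools(required)
if missing:
    print("Missing prerequisites:", ", ".join(missing))
else:
    print("All cluster prerequisites found.")
```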
## Quick Overview
```mermaid
sequenceDiagram
    participant User
    participant CLI as main.py
    participant SSH as SSHClient
    participant SLURM
    participant Container as Apptainer
    User->>CLI: python main.py --recipe service.yaml
    CLI->>SSH: Connect to MeluXina
    SSH->>SLURM: sbatch job_script.sh
    SLURM->>Container: Launch on compute node
    Container-->>SLURM: Service running
    SLURM-->>SSH: Job ID + Node
    SSH-->>CLI: Service info
    CLI-->>User: Service started: abc123
```
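The "Job ID + Node" step in the diagram comes down to parsing SLURM's submission message. A minimal sketch of that step (the helper name is hypothetical; the real orchestrator may instead use `sbatch --parsable`, which prints only the job ID):

```python
import re


def parse_job_id(sbatch_output: str) -> str:
    """Extract the job ID from SLURM's standard submission message,
    e.g. 'Submitted batch job 123456'."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"Unexpected sbatch output: {sbatch_output!r}")
    return match.group(1)


print(parse_job_id("Submitted batch job 123456"))  # -> 123456
```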
## Workflow
1. **Configure** - Set up `config.yaml` with HPC credentials
2. **Deploy** - Start services using YAML recipes
3. **Benchmark** - Run client workloads against services
4. **Monitor** - View metrics in Grafana dashboards
5. **Analyze** - Download and process results
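The Configure step centers on `config.yaml`. A plausible shape for it, with every key an assumption made for illustration (the Installation Guide documents the actual schema):

```yaml
# Hypothetical config.yaml -- keys are illustrative, not the actual schema
hpc:
  host: login.lxp.lu          # MeluXina login node (example)
  user: your_username
  ssh_key: ~/.ssh/id_ed25519  # key-based SSH authentication is required
  account: p200981            # your project allocation
```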
## Next Steps
- Installation Guide - Detailed setup instructions
- Quick Start - Run your first benchmark in 5 minutes
- Architecture - Understand the system design
Continue to Installation →