Grafana Dashboards¶
The orchestrator provides three pre-configured Grafana dashboards for monitoring services and benchmarks.
Accessing Grafana¶
1. Create SSH Tunnel¶
```bash
# Find Grafana node from status or script output
python main.py --status
# Example: grafana_xxx | RUNNING | mel0164

# Create tunnel
ssh -i ~/.ssh/id_ed25519_mlux -L 3000:mel0164:3000 -N u103227@login.lxp.lu -p 8822
```
2. Open Browser¶
Navigate to http://localhost:3000
Default credentials: admin / admin
Overview Dashboard¶
URL: /d/overview/overview
System-wide view of all monitored containers.
Panels¶
| Panel | Description | Query |
|---|---|---|
| Active Targets | Count of UP scrape targets | count(up == 1) |
| Running Containers | Number of containers | count(container_last_seen{name=~".+"}) |
| Avg CPU % | Average CPU usage | avg(rate(container_cpu_usage_seconds_total[1m])) * 100 |
| Total Memory | Sum of memory usage | sum(container_memory_usage_bytes) |
| Network RX | Receive rate | sum(rate(container_network_receive_bytes_total[1m])) |
| Network TX | Transmit rate | sum(rate(container_network_transmit_bytes_total[1m])) |
| CPU Timeline | CPU usage over time | rate(container_cpu_usage_seconds_total{name=~".+"}[1m]) |
| Memory Timeline | Memory usage over time | container_memory_usage_bytes{name=~".+"} |
| Network Traffic | Bidirectional traffic | RX and TX combined |
| Target Status | Table of scrape targets | up |
Service Monitoring Dashboard¶
URL: /d/service-monitoring/service-monitoring
Detailed metrics for selected containers.
Variables¶
| Variable | Description |
|---|---|
| $container | Multi-select container filter |
| $job | Multi-select job filter |
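Grafana interpolates dashboard variables into PromQL using `$name` (or `${name}`) syntax. As a sketch of how the panels likely apply the filter (the exact panel queries may differ), a CPU timeline restricted to the selected containers would look like:

```promql
rate(container_cpu_usage_seconds_total{name=~"$container"}[1m]) * 100
```

With multi-select enabled, Grafana expands `$container` into a regex alternation (e.g. `ollama|prometheus`), which is why the label matcher uses `=~` rather than `=`.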
Panels¶
| Panel | Description |
|---|---|
| CPU Bar Gauge | Current CPU usage by container |
| Memory Bar Gauge | Current memory by container |
| Network RX Bar | Receive rate by container |
| Network TX Bar | Transmit rate by container |
| CPU Timeline | CPU over time with legend |
| Memory Total | Total memory timeline |
| Memory Breakdown | Working set + cache |
| Network Throughput | Bidirectional view |
| Cumulative I/O | Total bytes transferred |
| Filesystem Usage | Disk usage bars |
| Memory Limit % | Gauge showing % of limit |
Using the Dashboard¶
- Use the Container dropdown to filter by specific containers
- Use the Job dropdown to filter by Prometheus job
- Adjust time range in the top-right
- Click on legend items to show/hide series
Benchmark Dashboard¶
URL: /d/benchmarks/benchmarks
Performance-focused view during benchmark runs.
Panels¶
| Panel | Description |
|---|---|
| Summary Stats | Avg/Peak CPU and Memory |
| Live CPU Timeline | 30-second window for responsiveness |
| Live Memory Timeline | Current memory state |
| Network RX Rate | Receive throughput |
| Network TX Rate | Transmit throughput |
| CPU Heatmap | Visual CPU distribution |
| Avg CPU Bar | Average over benchmark period |
| Avg Memory Bar | Average over benchmark period |
| Target Health | Scrape target status |
Best for¶
- Watching benchmark progress in real-time
- Comparing resource usage between containers
- Identifying performance bottlenecks
Customizing Dashboards¶
Add a Panel¶
- Click Add panel button
- Choose visualization type
- Enter PromQL query
- Configure display options
- Save dashboard
Example Custom Panel¶
CPU usage gauge:
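A gauge of this kind is typically driven by the per-container CPU rate; following the pattern of the CPU queries elsewhere on this page (the `ollama` container name is illustrative):

```promql
rate(container_cpu_usage_seconds_total{name="ollama"}[1m]) * 100
```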
Memory percentage:
```promql
container_memory_usage_bytes{name="ollama"} /
container_spec_memory_limit_bytes{name="ollama"} * 100
```
Save Custom Dashboards¶
- Make changes
- Click Save (disk icon)
- Optionally export as JSON
Useful PromQL Queries¶
Per-Container CPU¶
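Mirroring the CPU Timeline query in the Overview dashboard, per-container CPU percentage is presumably:

```promql
rate(container_cpu_usage_seconds_total{name=~".+"}[1m]) * 100
```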
Memory Working Set¶
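cAdvisor exposes the working set (memory in active use, excluding reclaimable cache) as `container_memory_working_set_bytes`; a likely query here:

```promql
container_memory_working_set_bytes{name=~".+"}
```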
Network by Container¶
```promql
# Receive rate
rate(container_network_receive_bytes_total{name=~".+"}[1m])

# Transmit rate (negative for bidirectional view)
-rate(container_network_transmit_bytes_total{name=~".+"}[1m])
```
Container Count¶
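Matching the Running Containers panel in the Overview dashboard, the count query is presumably:

```promql
count(container_last_seen{name=~".+"})
```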
Troubleshooting¶
No Data Displayed¶
- Check Prometheus is running and accessible
- Verify datasource configuration
- Ensure cAdvisor targets are being scraped
- Check time range includes recent data
Connection Refused¶
- Verify SSH tunnel is active
- Check Grafana is running (`python main.py --status`)
- Confirm correct node in tunnel command
Datasource Error¶
- Go to Configuration → Data Sources
- Click on Prometheus datasource
- Verify URL matches Prometheus node
- Click Test to verify connection
See also: Monitoring Overview | Prometheus Metrics