Grafana Dashboards¶

The orchestrator provides three pre-configured Grafana dashboards for monitoring services and benchmarks.

Accessing Grafana¶

1. Create SSH Tunnel¶

# Find Grafana node from status or script output
python main.py --status
# Example: grafana_xxx | RUNNING | mel0164

# Create tunnel
ssh -i ~/.ssh/id_ed25519_mlux -L 3000:mel0164:3000 -N u103227@login.lxp.lu -p 8822

2. Open Browser¶

Navigate to http://localhost:3000

Default credentials: admin / admin

Overview Dashboard¶

URL: /d/overview/overview

System-wide view of all monitored containers.

Panels¶

Panel	Description	Query
Active Targets	Count of UP scrape targets	`count(up == 1)`
Running Containers	Number of containers	`count(container_last_seen{name=~".+"})`
Avg CPU %	Average CPU usage	`avg(rate(container_cpu_usage_seconds_total[1m])) * 100`
Total Memory	Sum of memory usage	`sum(container_memory_usage_bytes)`
Network RX	Receive rate	`sum(rate(container_network_receive_bytes_total[1m]))`
Network TX	Transmit rate	`sum(rate(container_network_transmit_bytes_total[1m]))`
CPU Timeline	CPU usage over time	`rate(container_cpu_usage_seconds_total{name=~".+"}[1m])`
Memory Timeline	Memory usage over time	`container_memory_usage_bytes{name=~".+"}`
Network Traffic	Bidirectional traffic	RX and TX combined
Target Status	Table of scrape targets	`up`

Service Monitoring Dashboard¶

URL: /d/service-monitoring/service-monitoring

Detailed metrics for selected containers.

Variables¶

Variable	Description
`$container`	Multi-select container filter
`$job`	Multi-select job filter

Panels¶

Panel	Description
CPU Bar Gauge	Current CPU usage by container
Memory Bar Gauge	Current memory by container
Network RX Bar	Receive rate by container
Network TX Bar	Transmit rate by container
CPU Timeline	CPU over time with legend
Memory Total	Total memory timeline
Memory Breakdown	Working set + cache
Network Throughput	Bidirectional view
Cumulative I/O	Total bytes transferred
Filesystem Usage	Disk usage bars
Memory Limit %	Gauge showing % of limit

Using the Dashboard¶

Use the Container dropdown to filter by specific containers
Use the Job dropdown to filter by Prometheus job
Adjust time range in the top-right
Click on legend items to show/hide series

Benchmark Dashboard¶

URL: /d/benchmarks/benchmarks

Performance-focused view during benchmark runs.

Panels¶

Panel	Description
Summary Stats	Avg/Peak CPU and Memory
Live CPU Timeline	30-second window for responsiveness
Live Memory Timeline	Current memory state
Network RX Rate	Receive throughput
Network TX Rate	Transmit throughput
CPU Heatmap	Visual CPU distribution
Avg CPU Bar	Average over benchmark period
Avg Memory Bar	Average over benchmark period
Target Health	Scrape target status

Best for¶

Watching benchmark progress in real-time
Comparing resource usage between containers
Identifying performance bottlenecks

Customizing Dashboards¶

Add a Panel¶

Click Add panel button
Choose visualization type
Enter PromQL query
Configure display options
Save dashboard

Example Custom Panel¶

CPU usage gauge:

rate(container_cpu_usage_seconds_total{name="ollama"}[1m]) * 100

Memory percentage:

container_memory_usage_bytes{name="ollama"} / 
container_spec_memory_limit_bytes{name="ollama"} * 100

Save Custom Dashboards¶

Make changes
Click Save (disk icon)
Optionally export as JSON

Useful PromQL Queries¶

Per-Container CPU¶

rate(container_cpu_usage_seconds_total{name=~".+"}[1m])

Memory Working Set¶

container_memory_working_set_bytes{name=~".+"}

Network by Container¶

# Receive rate
rate(container_network_receive_bytes_total{name=~".+"}[1m])

# Transmit rate (negative for bidirectional view)
-rate(container_network_transmit_bytes_total{name=~".+"}[1m])

Container Count¶

count(container_last_seen{name=~".+"})

Troubleshooting¶

No Data Displayed¶

Check Prometheus is running and accessible
Verify datasource configuration
Ensure cAdvisor targets are being scraped
Check time range includes recent data

Connection Refused¶

Verify SSH tunnel is active
Check Grafana is running (python main.py --status)
Confirm correct node in tunnel command

Datasource Error¶

Go to Configuration → Data Sources
Click on Prometheus datasource
Verify URL matches Prometheus node
Click Test to verify connection