Unified Observability: Migrating to Grafana LGTM Stack with Alloy
TL;DR
Today’s work focused on modernizing the observability stack for the Solana Trading System:
- Promtail → Alloy Migration: Replaced deprecated Promtail with modern Grafana Alloy for log collection
- LGTM Stack Upgrade: Integrated Mimir (metrics), Tempo (traces), and Pyroscope (profiling) alongside Loki and Grafana
- Events Dashboard Update: Enhanced the Grafana Events Dashboard with real-time NATS event monitoring and visual improvements
- Cloud-Ready Architecture: Positioned the stack for easy migration to Grafana Cloud in the future
Why Migrate to LGTM Stack?
The Deprecation Timeline
Promtail, Grafana’s log collection agent, enters Long-Term Support (LTS) on February 13, 2025. While it won’t disappear immediately, active development has ceased in favor of Grafana Alloy, the next-generation unified observability agent.
What is LGTM?
LGTM stands for Loki, Grafana, Tempo, Mimir - Grafana’s complete open-source observability stack:
- Loki: Log aggregation and querying
- Grafana: Visualization and dashboards
- Tempo: Distributed tracing
- Mimir: Long-term metrics storage (Prometheus-compatible)
By adding Pyroscope for continuous profiling, we get the LGTM+ stack - a complete observability solution.
Migration Part 1: Promtail to Grafana Alloy
What is Grafana Alloy?
Grafana Alloy is a unified observability agent that collects logs, metrics, traces, and profiles. It replaces multiple specialized agents (Promtail, Grafana Agent, OTel Collector) with a single, efficient binary.
Key Benefits:
- 20-30% lower memory usage compared to Promtail
- Built-in web UI for debugging and monitoring
- Dynamic configuration reload without restarts (see the example after this list)
- Future-proof with active development
- Unified collection for logs, metrics, traces, and profiles
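The dynamic reload point is worth illustrating: Alloy's HTTP server exposes a reload endpoint, so a changed configuration file can be picked up without restarting the container. A quick check, assuming the UI port used later in this setup:

# Ask a running Alloy instance to re-read its mounted configuration file
curl -X POST http://localhost:12345/-/reload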
Architecture: Before and After
Before (Promtail):
Docker Containers → Promtail (9080) → Loki → Grafana
Windows Host Logs → Promtail-Host (9081) → Loki → Grafana
After (Grafana Alloy):
Docker Containers → Alloy (12345) [UI] → Loki → Grafana
Windows Host Logs → Alloy-Host (12346) [UI] → Loki → Grafana
Alloy Configuration Highlights
Alloy uses the River configuration language (similar to HCL/Terraform), which is more readable than Promtail’s YAML:
Docker Container Log Collection (alloy-config.alloy):
// Discover Docker containers via socket
discovery.docker "containers" {
host = "unix:///var/run/docker.sock"
refresh_interval = "5s"
}
// Collect logs from containers
loki.source.docker "docker_logs" {
host = "unix:///var/run/docker.sock"
targets = discovery.relabel.docker_logs.output
forward_to = [loki.process.docker_json.receiver]
refresh_interval = "5s"
}
// Parse JSON logs and extract fields
loki.process "docker_json" {
forward_to = [loki.write.loki_endpoint.receiver]
stage.json {
expressions = {
level = "level",
service = "service_name",
event_type = "event_type",
trace_id = "traceId",
}
}
stage.timestamp {
source = "timestamp"
format = "RFC3339Nano"
}
}
// Write to Loki
loki.write "loki_endpoint" {
endpoint {
url = "http://loki:3100/loki/api/v1/push"
}
external_labels = {
cluster = "solana-trading-system",
source = "alloy-docker",
}
}
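The loki.source.docker component above takes its targets from discovery.relabel.docker_logs.output, a component not shown here. A minimal sketch of what that relabel stage could look like - the exact rules are an assumption, not the project's actual mapping:

// Map Docker metadata onto Loki labels (sketch; actual rules may differ)
discovery.relabel "docker_logs" {
  targets = discovery.docker.containers.targets

  // Strip the leading slash from the container name and expose it as "container"
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/(.*)"
    target_label  = "container"
  }

  // Use the Compose service label as the "service" label in Loki
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_service"]
    target_label  = "service"
  }
}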
Windows Host Log Collection (alloy-host-services.alloy):
// Discover log files in C:\logs
local.file_match "host_logs" {
path_targets = [{
__address__ = "localhost",
__path__ = "C:\\logs\\solana-trading-system\\*.log",
}]
}
// Read and tail log files
loki.source.file "host_service_logs" {
targets = local.file_match.host_logs.targets
forward_to = [loki.process.host_json.receiver]
}
// Parse Go service logs
loki.process "host_json" {
forward_to = [loki.write.loki_endpoint.receiver]
stage.json {
expressions = {
level = "level",
service = "service",
environment = "environment",
}
}
}
Migration Process
The migration was designed for zero downtime:
- Parallel Run: Started Alloy alongside Promtail
- Validation: Both agents sent logs to Loki simultaneously
- Verification: Confirmed identical log coverage (a sample comparison query is shown below)
- Cutover: Moved Promtail to the legacy profile (disabled by default)
- Monitoring: 24-hour observation period
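To verify coverage during the parallel run, log volume can be grouped by collection source in Loki. A comparison query along these lines works, assuming both agents attach the same cluster label and a distinguishing source label (the actual Promtail labels may differ):

sum by (source) (
  count_over_time({cluster="solana-trading-system"}[5m])
)

If both sources report similar counts per interval, coverage is equivalent and the cutover is safe.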
Docker Compose Changes:
# Alloy for Docker container logs
alloy:
image: grafana/alloy:latest
container_name: trading-system-alloy
ports:
- "12345:12345" # Alloy UI
volumes:
- /var/run/docker.sock:/var/run/docker.sock:ro
- ./deployment/monitoring/alloy/alloy-config.alloy:/etc/alloy/config.alloy:ro
- alloy_data:/var/lib/alloy
# Alloy for Windows host logs
alloy-host:
image: grafana/alloy:latest
container_name: trading-system-alloy-host
ports:
- "12346:12345" # Alloy UI (different port)
volumes:
- C:\logs:/logs:ro # Windows host mount
- ./deployment/monitoring/alloy/alloy-host-services.alloy:/etc/alloy/config.alloy:ro
- alloy_host_data:/var/lib/alloy
# Promtail moved to legacy profile (disabled by default)
promtail:
profiles: [legacy] # Start with: docker-compose --profile legacy up
# ... existing config
Alloy Web UI
One of Alloy’s killer features is the built-in web UI for debugging and monitoring:
- Docker Logs UI: http://localhost:12345
- Host Logs UI: http://localhost:12346
The UI provides:
- Real-time component pipeline visualization
- Live metrics and counters
- Configuration validation
- Debug logs and traces
- Component health status
Migration Part 2: Complete LGTM Stack
Architecture Overview
The LGTM stack provides unified observability across all signal types:
┌─────────────────────────────────────────────────────────┐
│ Trading System Services │
│ (Go, TypeScript, Rust) │
└──────────────┬──────────────┬──────────────┬────────────┘
│ │ │
Logs │ Metrics│ Traces │
│ │ │
┌──────▼────┐ ┌──────▼────┐ ┌─────▼──────┐
│ Alloy │ │Prometheus │ │ OTel │
│ (Agent) │ │ (Scrape) │ │ Collector │
└──────┬────┘ └──────┬────┘ └─────┬──────┘
│ │ │
┌──────▼────┐ ┌──────▼────┐ ┌─────▼──────┐
│ Loki │ │ Mimir │ │ Tempo │
│ (Logs) │ │ (Metrics) │ │ (Traces) │
└──────┬────┘ └──────┬────┘ └─────┬──────┘
│ │ │
└──────────────┴──────────────┘
│
┌──────▼────────┐
│ Grafana │
│ (Dashboards) │
└───────────────┘
Component Roles
1. Loki (Logs)
- Aggregates logs from Alloy
- Provides LogQL query language
- Indexes labels, not full text (cost-efficient)
- Retention: 7 days
2. Mimir (Metrics)
- Long-term Prometheus-compatible metrics storage
- Receives metrics via Prometheus remote_write
- Horizontally scalable
- Retention: 30+ days
3. Tempo (Traces)
- Distributed tracing storage
- Receives traces from OTel Collector
- Trace-to-logs correlation
- Retention: 24 hours
4. Pyroscope (Profiling)
- Continuous profiling (CPU, memory, goroutines)
- Integration with Go, Rust, TypeScript services
- Flame graph visualization
- Status: Ready, not yet instrumented (a Go sketch follows below)
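When profiling is instrumented, the Go services can push profiles to this Pyroscope instance with the grafana/pyroscope-go SDK. A minimal sketch for quote-service - the application name and server address are assumptions based on this stack, not final values:

package main

import (
	"log"

	"github.com/grafana/pyroscope-go"
)

func main() {
	// Continuously ship CPU, memory, and goroutine profiles to the local Pyroscope instance.
	_, err := pyroscope.Start(pyroscope.Config{
		ApplicationName: "quote-service",         // assumed service name
		ServerAddress:   "http://localhost:4040", // assumed Pyroscope endpoint for a host-run service
		ProfileTypes: []pyroscope.ProfileType{
			pyroscope.ProfileCPU,
			pyroscope.ProfileAllocObjects,
			pyroscope.ProfileAllocSpace,
			pyroscope.ProfileInuseObjects,
			pyroscope.ProfileInuseSpace,
			pyroscope.ProfileGoroutines,
		},
	})
	if err != nil {
		log.Printf("pyroscope profiling disabled: %v", err)
	}

	// ... rest of quote-service startup continues here
}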
Data Sources Configuration
Grafana now has four primary data sources:
# datasource.yml
apiVersion: 1
datasources:
# Logs
- name: Loki
type: loki
uid: loki
url: http://loki:3100
isDefault: false
# Metrics (new, primary)
- name: Mimir
type: prometheus
uid: mimir
url: http://mimir:9009/prometheus
isDefault: true
# Metrics (legacy, via Prometheus scrape)
- name: Prometheus (Legacy)
type: prometheus
uid: prometheus
url: http://prometheus:9090
# Traces
- name: Tempo
type: tempo
uid: tempo
url: http://tempo:3200
# Profiles
- name: Pyroscope
type: pyroscope
uid: pyroscope
url: http://pyroscope:4040
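The trace-to-logs correlation mentioned under Tempo is configured on the Tempo data source itself via Grafana's tracesToLogsV2 options. A sketch of the extra provisioning fields - the time shifts and filter settings here are assumptions, not the exact values in use:

- name: Tempo
  type: tempo
  uid: tempo
  url: http://tempo:3200
  jsonData:
    tracesToLogsV2:
      datasourceUid: loki         # jump from a span straight to matching Loki logs
      spanStartTimeShift: "-5m"   # widen the log search window around the span
      spanEndTimeShift: "5m"
      filterByTraceID: true       # only show log lines carrying the same trace ID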
Prometheus Remote Write to Mimir
Prometheus now writes metrics to Mimir for long-term storage:
prometheus.yml:
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'quote-service'
static_configs:
- targets: ['host.docker.internal:8080']
- job_name: 'otel-collector'
static_configs:
- targets: ['otel-collector:8888']
# Remote write to Mimir
remote_write:
- url: http://mimir:9009/api/v1/push
queue_config:
capacity: 10000
max_samples_per_send: 5000
batch_send_deadline: 5s
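A quick way to confirm remote_write is keeping up is to check Prometheus's own remote-storage counters (standard Prometheus metric names; -G with --data-urlencode keeps the PromQL brackets intact):

# Samples successfully shipped to Mimir, per second
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(prometheus_remote_storage_samples_total[5m])' | jq

# Samples dropped after exhausting retries (should stay at zero)
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(prometheus_remote_storage_samples_failed_total[5m])' | jq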
Why Local LGTM Instead of Grafana Cloud?
We're currently running the LGTM stack locally with Docker Compose, but the architecture is designed for easy cloud migration:
Current Benefits:
- No cost for unlimited metrics/logs/traces
- Full control over data and retention
- Low latency (all local)
- Development flexibility (easy to experiment)
Future Grafana Cloud Migration:
- Simplified operations (no infrastructure management)
- Automatic scaling (handle production load)
- Global availability (99.9% SLA)
- Advanced features (ML anomaly detection, alerting, etc.)
The migration path is simple - just change endpoint URLs in Alloy, Prometheus, and OTel Collector configs.
Events Dashboard Update
Overview
Updated the Trading System Events Dashboard to monitor real-time NATS events flowing through the system. This dashboard is critical for observing arbitrage opportunities, market data updates, and system health.

Dashboard Sections
1. Event Overview
- Events by Type: Bar chart showing distribution (PriceUpdate, SwapRoute, SlotUpdate, PoolStateChange, etc.)
- Event Rate: Time series showing events per second with mean/max statistics
2. Arbitrage Opportunities
- Opportunities by DEX Pair: Track which trading pairs show arbitrage potential
- Profit Estimates: Monitor estimated profit in USD
- Triangular Arbitrage: Detect multi-hop arbitrage paths
3. Market Data Events
- Price Updates by Source: Track price changes from different DEXes
- Liquidity Updates by DEX: Monitor pool liquidity across protocols
- Large Trades by DEX: Alert on significant transactions
- Spread Updates: Track price spreads between exchanges
- Volume Spikes: Detect unusual trading activity
4. System Health Events
- System Lifecycle: Service starts, stops, connections
- Connection Events: NATS, RPC, WebSocket connectivity
- Drift Events: Clock sync and validator drift monitoring
- Stall Events: Transaction processing delays
5. Recent Events
- Live Event Stream: Real-time log viewer with JSON inspection
- Critical Events: Filtered view of errors and warnings
LogQL Queries
The dashboard uses Loki’s LogQL to query events from the event-logger-service:
Event Rate Calculation:
sum by (event_type) (
count_over_time(
{service="event-logger-service"}
| json
| event_type != ""
[$__interval]
)
)
Arbitrage Profit Tracking:
{service="event-logger-service"}
| json
| event_type="ArbitrageOpportunity"
Price Updates by Source:
sum by (source) (
count_over_time(
{service="event-logger-service"}
| json
| event_type="PriceUpdate"
[$__interval]
)
)
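For the Profit Estimates panel, an unwrapped range aggregation over the estimated profit works well. A representative query - the dex_pair and profit_usd field names are assumptions about the event payload:

sum by (dex_pair) (
  sum_over_time(
    {service="event-logger-service"}
    | json
    | event_type="ArbitrageOpportunity"
    | unwrap profit_usd
    [$__interval]
  )
)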
Critical Event Filtering:
{service="event-logger-service"}
| json
| level=~"error|fatal|critical"
| line_format "{{.timestamp}} [{{.level}}] {{.service}}: {{.message}}"
Dashboard Features
- Auto-refresh: 5-second refresh for real-time monitoring
- Time ranges: Adjustable from 5 minutes to 24 hours
- Interactive: Click events to see full JSON payload
- Color-coded: Visual distinction between event types
- Alerting-ready: Thresholds for profit opportunities and error rates
Data Flow to Dashboard
Services (Go/Rust/TypeScript)
↓ Publish events
NATS JetStream
↓ Subscribe
Event-Logger-Service
↓ Structured logging
Alloy
↓ Log collection
Loki
↓ LogQL queries
Grafana Events Dashboard
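For this pipeline to work, the event-logger-service only needs to emit one JSON object per line with the fields Alloy's stage.json block extracts (level, service_name, event_type, traceId) plus a parseable timestamp. A minimal Go sketch using the standard library's slog - field names mirror the Alloy config, but the real service may use a different logging library:

package main

import (
	"log/slog"
	"os"
)

func main() {
	opts := &slog.HandlerOptions{
		// Alloy's stage.timestamp reads a "timestamp" field, so rename slog's default "time" key.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if a.Key == slog.TimeKey {
				a.Key = "timestamp"
			}
			return a
		},
	}
	// One JSON object per line on stdout; the Docker log driver hands these lines to Alloy.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, opts))

	logger.Info("event received",
		slog.String("service_name", "event-logger-service"),
		slog.String("event_type", "PriceUpdate"),
		slog.String("traceId", "4bf92f3577b34da6a3ce929d0e0e4736"),
		slog.Float64("price", 142.37),
	)
}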
Performance Characteristics
Resource Usage Comparison
Promtail vs Alloy:
| Metric | Promtail | Alloy | Improvement |
|---|---|---|---|
| Memory | ~80 MB | ~50-60 MB | 20-30% ↓ |
| CPU | 1-2% | 1-2% | Similar |
| Startup Time | ~5s | ~3s | 40% faster |
LGTM Stack (Local Docker):
| Service | Memory | CPU | Purpose |
|---|---|---|---|
| Loki | ~200 MB | 2-3% | Log aggregation |
| Mimir | ~300 MB | 3-5% | Metrics storage |
| Tempo | ~150 MB | 1-2% | Trace storage |
| Pyroscope | ~100 MB | 1% | Profile storage |
| Grafana | ~150 MB | 1-2% | Visualization |
| Total | ~900 MB | 8-13% | Full stack |
Throughput
- Log Processing: 10,000-50,000 logs/sec (Alloy)
- Metrics Ingestion: 1M samples/sec (Mimir)
- Trace Ingestion: 10,000 spans/sec (Tempo)
- Latency: < 100ms from event to visualization
Configuration Files
All configuration files are version-controlled in the repository:
Alloy:
- deployment/monitoring/alloy/alloy-config.alloy
- deployment/monitoring/alloy/alloy-host-services.alloy
LGTM Stack:
- deployment/monitoring/mimir/mimir.yaml
- deployment/monitoring/tempo/tempo.yaml
- deployment/monitoring/prometheus/prometheus.yml
Grafana:
- deployment/monitoring/grafana/provisioning/dashboards/events-dashboard.json
- deployment/monitoring/grafana/provisioning/datasources/datasource.yml
Docker Compose:
Quick Start Guide
Local Development Setup
Start LGTM Stack:
cd deployment/docker
docker-compose up -d
Access Services:
- Grafana: http://localhost:3000
- Alloy UI (Docker): http://localhost:12345
- Alloy UI (Host): http://localhost:12346
- Loki: http://localhost:3100
- Mimir: http://localhost:9009
- Tempo: http://localhost:3200
- Prometheus: http://localhost:9090
Verify Data Flow:
# Check Alloy is collecting logs
curl http://localhost:12345/metrics
# Query Loki for recent logs
curl -sG 'http://localhost:3100/loki/api/v1/query' --data-urlencode 'query={service=~".+"}' | jq
# Query Mimir for metrics
curl -s 'http://localhost:9009/prometheus/api/v1/query?query=up' | jq
# Check event logs
curl -sG 'http://localhost:3100/loki/api/v1/query' --data-urlencode 'query={service="event-logger-service"}' | jq
Validation
The repository includes a comprehensive validation script:
cd deployment/monitoring/alloy
.\validate-alloy.ps1
# Expected output:
# ✅ Alloy containers running
# ✅ Alloy UI accessible
# ✅ Logs flowing to Loki
# ✅ No critical errors
# ✅ Resource usage healthy
See the Quick Start Guide for detailed instructions.
Impact and Benefits
Immediate Benefits
- Modern Tooling: Using actively developed Grafana Alloy instead of deprecated Promtail
- Unified Observability: Single stack (LGTM) for logs, metrics, traces, and profiles
- Better Performance: 20-30% lower resource usage with Alloy
- Enhanced Debugging: Built-in Alloy UI for troubleshooting
- Real-Time Event Monitoring: Updated Events Dashboard for NATS event visibility
Long-Term Benefits
- Future-Proof: LGTM stack is Grafana’s strategic direction
- Cloud-Ready: Easy migration path to Grafana Cloud when needed
- Cost-Efficient: Local LGTM for development, cloud for production
- Scalability: Mimir and Tempo scale horizontally
- Standardization: Industry-standard observability stack
Developer Experience
- Single Dashboard: All observability signals in Grafana
- Correlation: Logs ↔ Metrics ↔ Traces ↔ Profiles
- Fast Queries: Optimized for real-time analysis
- Alerting: Rule-based alerts on metrics and logs
- Visualization: Rich, customizable dashboards
Next Steps
Phase 1: Stability
- Migrate Promtail to Alloy
- Integrate Mimir, Tempo, Pyroscope
- Update Events Dashboard
- Monitor resource usage and performance
- Update remaining dashboards to use Mimir
Phase 2: Instrumentation
- Add OpenTelemetry SDK to TypeScript services
- Add OpenTelemetry SDK to Go services (quote-service)
- Implement NATS trace propagation (context in event headers; see the sketch after this list)
- Add Pyroscope profiling to quote-service
- Verify end-to-end trace collection
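For the NATS trace propagation step, the usual pattern is to inject the active span context into NATS message headers on publish and extract it on the subscriber side. A Go sketch using the OpenTelemetry propagation API and nats.go - function names and wiring are illustrative, not the project's actual code:

package events

import (
	"context"

	"github.com/nats-io/nats.go"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// W3C trace context (traceparent/tracestate headers); assumes nothing else registers a propagator.
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// natsCarrier adapts nats.Header to OpenTelemetry's TextMapCarrier interface.
type natsCarrier nats.Header

func (c natsCarrier) Get(key string) string { return nats.Header(c).Get(key) }
func (c natsCarrier) Set(key, value string) { nats.Header(c).Set(key, value) }
func (c natsCarrier) Keys() []string {
	keys := make([]string, 0, len(c))
	for k := range c {
		keys = append(keys, k)
	}
	return keys
}

// PublishWithTrace injects the current trace context into the event headers before publishing.
func PublishWithTrace(ctx context.Context, nc *nats.Conn, subject string, payload []byte) error {
	msg := nats.NewMsg(subject)
	msg.Data = payload
	otel.GetTextMapPropagator().Inject(ctx, natsCarrier(msg.Header))
	return nc.PublishMsg(msg)
}

// ExtractTrace restores the publisher's trace context on the subscriber side.
func ExtractTrace(ctx context.Context, msg *nats.Msg) context.Context {
	return otel.GetTextMapPropagator().Extract(ctx, natsCarrier(msg.Header))
}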
Phase 3: Advanced Features
- Configure Grafana alerting rules
- Set up SLO (Service Level Objectives) tracking
- Implement exemplars (metrics → traces linking)
- Create unified dashboards with all signal types
- Performance profiling optimization
Phase 4: Production
- Evaluate Grafana Cloud migration
- Configure long-term retention policies
- Set up multi-region observability
- Implement advanced ML anomaly detection
- Create runbooks and alerting workflows
Troubleshooting
Alloy Not Collecting Logs
Check Alloy status:
docker-compose logs alloy | grep -i error
curl http://localhost:12345/-/ready
Verify Docker socket access:
docker-compose exec alloy ls -la /var/run/docker.sock
No Data in Mimir
Check Prometheus remote_write:
docker-compose logs prometheus | grep remote_write
curl 'http://localhost:9009/prometheus/api/v1/query?query=up'
Restart Prometheus:
docker-compose restart prometheus
Events Not Showing in Dashboard
Check event-logger is running:
docker-compose logs event-logger-service --tail 20
Query Loki directly:
curl -sG 'http://localhost:3100/loki/api/v1/query' --data-urlencode 'query={service="event-logger-service"}' | jq
Check time range in Grafana:
- Try “Last 15 minutes” or “Last hour”
- Wait 30-60 seconds for indexing
Conclusion
Today’s work modernized the Solana Trading System’s observability stack with a comprehensive migration to the Grafana LGTM platform. By replacing deprecated Promtail with Grafana Alloy and integrating Mimir, Tempo, and Pyroscope, we now have a unified, cloud-ready observability solution that provides deep visibility into logs, metrics, traces, and profiles.
The updated Events Dashboard gives real-time insight into NATS events, arbitrage opportunities, and system health - critical for monitoring a high-frequency trading system. The architecture is positioned for easy migration to Grafana Cloud when production scale demands it, while maintaining cost efficiency during development with a local Docker deployment.
Key Achievements:
- ✅ Zero-downtime migration from Promtail to Alloy
- ✅ Complete LGTM stack integration (Loki, Grafana, Tempo, Mimir)
- ✅ Enhanced Events Dashboard with real-time NATS monitoring
- ✅ Cloud-ready architecture for future scaling
- ✅ 20-30% reduction in log collection resource usage
The next phase will focus on instrumenting services with OpenTelemetry for distributed tracing and adding continuous profiling to identify performance bottlenecks.
Related Posts:
- Day’s Work: Observability and Monitoring for Solana Trading System
- Day’s Work: NATS Event Publishing and TypeScript Scanner Service Foundation
- Getting Started: Building a Solana Trading System from Prototypes
Technical Documentation:
Connect
- GitHub: github.com/guidebee
- LinkedIn: linkedin.com/in/guidebee
This is post #7 in the Solana Trading System development series. Follow along as I document the journey from working prototypes to production HFT system.
