Unified Observability: Migrating to Grafana LGTM Stack with Alloy

13 minute read

Published: December 10, 2025

TL;DR

Today’s work focused on modernizing the observability stack for the Solana Trading System:

Promtail → Alloy Migration: Replaced deprecated Promtail with modern Grafana Alloy for log collection
LGTM Stack Upgrade: Integrated Mimir (metrics), Tempo (traces), and Pyroscope (profiling) alongside Loki and Grafana
Events Dashboard Update: Enhanced the Grafana Events Dashboard with real-time NATS event monitoring and visual improvements
Cloud-Ready Architecture: Positioned the stack for easy migration to Grafana Cloud in the future

Why Migrate to LGTM Stack?

The Deprecation Timeline

Promtail, Grafana’s log collection agent, enters Long-Term Support (LTS) on February 13, 2025. While it won’t disappear immediately, active development has ceased in favor of Grafana Alloy, the next-generation unified observability agent.

What is LGTM?

LGTM stands for Loki, Grafana, Tempo, Mimir - Grafana’s complete open-source observability stack:

Loki: Log aggregation and querying
Grafana: Visualization and dashboards
Tempo: Distributed tracing
Mimir: Long-term metrics storage (Prometheus-compatible)

Adding Pyroscope for continuous profiling, we get the LGTM+ stack - a complete observability solution.

Migration Part 1: Promtail to Grafana Alloy

What is Grafana Alloy?

Grafana Alloy is a unified observability agent that collects logs, metrics, traces, and profiles. It replaces multiple specialized agents (Promtail, Grafana Agent, OTel Collector) with a single, efficient binary.

Key Benefits:

20-30% lower memory usage compared to Promtail
Built-in web UI for debugging and monitoring
Dynamic configuration reload without restarts
Future-proof with active development
Unified collection for logs, metrics, traces, and profiles

Architecture: Before and After

Before (Promtail):

Docker Containers → Promtail (9080) → Loki → Grafana
Windows Host Logs → Promtail-Host (9081) → Loki → Grafana

After (Grafana Alloy):

Docker Containers → Alloy (12345) [UI] → Loki → Grafana
Windows Host Logs → Alloy-Host (12346) [UI] → Loki → Grafana

Alloy Configuration Highlights

Alloy uses the River configuration language (similar to HCL/Terraform), which is more readable than Promtail’s YAML:

Docker Container Log Collection (alloy-config.alloy):

// Discover Docker containers via socket
discovery.docker "containers" {
  host             = "unix:///var/run/docker.sock"
  refresh_interval = "5s"
}

// Collect logs from containers
loki.source.docker "docker_logs" {
  host             = "unix:///var/run/docker.sock"
  targets          = discovery.relabel.docker_logs.output
  forward_to       = [loki.process.docker_json.receiver]
  refresh_interval = "5s"
}

// Parse JSON logs and extract fields
loki.process "docker_json" {
  forward_to = [loki.write.loki_endpoint.receiver]

  stage.json {
    expressions = {
      level       = "level",
      service     = "service_name",
      event_type  = "event_type",
      trace_id    = "traceId",
    }
  }

  stage.timestamp {
    source = "timestamp"
    format = "RFC3339Nano"
  }
}

// Write to Loki
loki.write "loki_endpoint" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
  external_labels = {
    cluster = "solana-trading-system",
    source  = "alloy-docker",
  }
}

Windows Host Log Collection (alloy-host-services.alloy):

// Discover log files in C:\logs
local.file_match "host_logs" {
  path_targets = [{
    __address__ = "localhost",
    __path__    = "C:\\logs\\solana-trading-system\\*.log",
  }]
}

// Read and tail log files
loki.source.file "host_service_logs" {
  targets    = local.file_match.host_logs.targets
  forward_to = [loki.process.host_json.receiver]
}

// Parse Go service logs
loki.process "host_json" {
  forward_to = [loki.write.loki_endpoint.receiver]

  stage.json {
    expressions = {
      level       = "level",
      service     = "service",
      environment = "environment",
    }
  }
}

Migration Process

The migration was designed for zero downtime:

Parallel Run: Started Alloy alongside Promtail
Validation: Both agents sent logs to Loki simultaneously
Verification: Confirmed identical log coverage
Cutover: Moved Promtail to legacy profile (disabled by default)
Monitoring: 24-hour observation period

Docker Compose Changes:

# Alloy for Docker container logs
alloy:
  image: grafana/alloy:latest
  container_name: trading-system-alloy
  ports:
    - "12345:12345"  # Alloy UI
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock:ro
    - ./deployment/monitoring/alloy/alloy-config.alloy:/etc/alloy/config.alloy:ro
    - alloy_data:/var/lib/alloy

# Alloy for Windows host logs
alloy-host:
  image: grafana/alloy:latest
  container_name: trading-system-alloy-host
  ports:
    - "12346:12345"  # Alloy UI (different port)
  volumes:
    - C:\logs:/logs:ro  # Windows host mount
    - ./deployment/monitoring/alloy/alloy-host-services.alloy:/etc/alloy/config.alloy:ro
    - alloy_host_data:/var/lib/alloy

# Promtail moved to legacy profile (disabled by default)
promtail:
  profiles: [legacy]  # Start with: docker-compose --profile legacy up
  # ... existing config

Alloy Web UI

One of Alloy’s killer features is the built-in web UI for debugging and monitoring:

Docker Logs UI: http://localhost:12345 Host Logs UI: http://localhost:12346

The UI provides:

Real-time component pipeline visualization
Live metrics and counters
Configuration validation
Debug logs and traces
Component health status

Migration Part 2: Complete LGTM Stack

Architecture Overview

The LGTM stack provides unified observability across all signal types:

┌─────────────────────────────────────────────────────────┐
│                   Trading System Services                │
│                (Go, TypeScript, Rust)                   │
└──────────────┬──────────────┬──────────────┬────────────┘
               │              │              │
        Logs   │       Metrics│       Traces │
               │              │              │
        ┌──────▼────┐  ┌──────▼────┐  ┌─────▼──────┐
        │   Alloy   │  │Prometheus │  │ OTel       │
        │  (Agent)  │  │  (Scrape) │  │ Collector  │
        └──────┬────┘  └──────┬────┘  └─────┬──────┘
               │              │              │
        ┌──────▼────┐  ┌──────▼────┐  ┌─────▼──────┐
        │   Loki    │  │   Mimir   │  │   Tempo    │
        │  (Logs)   │  │ (Metrics) │  │  (Traces)  │
        └──────┬────┘  └──────┬────┘  └─────┬──────┘
               │              │              │
               └──────────────┴──────────────┘
                              │
                       ┌──────▼────────┐
                       │    Grafana    │
                       │ (Dashboards)  │
                       └───────────────┘

Component Roles

1. Loki (Logs)

Aggregates logs from Alloy
Provides LogQL query language
Indexes labels, not full text (cost-efficient)
Retention: 7 days

2. Mimir (Metrics)

Long-term Prometheus-compatible metrics storage
Receives metrics via Prometheus remote_write
Horizontally scalable
Retention: 30+ days

3. Tempo (Traces)

Distributed tracing storage
Receives traces from OTel Collector
Trace-to-logs correlation
Retention: 24 hours

4. Pyroscope (Profiling)

Continuous profiling (CPU, memory, goroutines)
Integration with Go, Rust, TypeScript services
Flame graph visualization
Status: Ready, not yet instrumented

Data Sources Configuration

Grafana now has four primary data sources:

# datasource.yml
apiVersion: 1

datasources:
  # Logs
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    isDefault: false

  # Metrics (new, primary)
  - name: Mimir
    type: prometheus
    uid: mimir
    url: http://mimir:9009/prometheus
    isDefault: true

  # Metrics (legacy, via Prometheus scrape)
  - name: Prometheus (Legacy)
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090

  # Traces
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200

  # Profiles
  - name: Pyroscope
    type: pyroscope
    uid: pyroscope
    url: http://pyroscope:4040

Prometheus Remote Write to Mimir

Prometheus now writes metrics to Mimir for long-term storage:

prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'quote-service'
    static_configs:
      - targets: ['host.docker.internal:8080']

  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8888']

# Remote write to Mimir
remote_write:
  - url: http://mimir:9009/api/v1/push
    queue_config:
      capacity: 10000
      max_samples_per_send: 5000
      batch_send_deadline: 5s

Why Local LGTM Instead of Grafana Cloud?

Currently running the LGTM stack locally with Docker Compose, but the architecture is designed for easy cloud migration:

Current Benefits:

No cost for unlimited metrics/logs/traces
Full control over data and retention
Low latency (all local)
Development flexibility (easy to experiment)

Future Grafana Cloud Migration:

Simplified operations (no infrastructure management)
Automatic scaling (handle production load)
Global availability (99.9% SLA)
Advanced features (ML anomaly detection, alerting, etc.)

The migration path is simple - just change endpoint URLs in Alloy, Prometheus, and OTel Collector configs.

Events Dashboard Update

Overview

Updated the Trading System Events Dashboard to monitor real-time NATS events flowing through the system. This dashboard is critical for observing arbitrage opportunities, market data updates, and system health.

Events Dashboard

Dashboard Sections

1. Event Overview

Events by Type: Bar chart showing distribution (PriceUpdate, SwapRoute, SlotUpdate, PoolStateChange, etc.)
Event Rate: Time series showing events per second with mean/max statistics

2. Arbitrage Opportunities

Opportunities by DEX Pair: Track which trading pairs show arbitrage potential
Profit Estimates: Monitor estimated profit in USD
Triangular Arbitrage: Detect multi-hop arbitrage paths

3. Market Data Events

Price Updates by Source: Track price changes from different DEXes
Liquidity Updates by DEX: Monitor pool liquidity across protocols
Large Trades by DEX: Alert on significant transactions
Spread Updates: Track price spreads between exchanges
Volume Spikes: Detect unusual trading activity

4. System Health Events

System Lifecycle: Service starts, stops, connections
Connection Events: NATS, RPC, WebSocket connectivity
Drift Events: Clock sync and validator drift monitoring
Stall Events: Transaction processing delays

5. Recent Events

Live Event Stream: Real-time log viewer with JSON inspection
Critical Events: Filtered view of errors and warnings

LogQL Queries

The dashboard uses Loki’s LogQL to query events from the event-logger-service:

Event Rate Calculation:

sum by (event_type) (
  count_over_time(
    {service="event-logger-service"}
    | json
    | event_type != ""
    [$__interval]
  )
)

Arbitrage Profit Tracking:

{service="event-logger-service"}
| json
| event_type="ArbitrageOpportunity"
| line_format ""

Price Updates by Source:

sum by (source) (
  count_over_time(
    {service="event-logger-service"}
    | json
    | event_type="PriceUpdate"
    [$__interval]
  )
)

Critical Event Filtering:

{service="event-logger-service"}
| json
| level=~"error|fatal|critical"
| line_format " [] : "

Dashboard Features

Auto-refresh: 5-second refresh for real-time monitoring
Time ranges: Adjustable from 5 minutes to 24 hours
Interactive: Click events to see full JSON payload
Color-coded: Visual distinction between event types
Alerting-ready: Thresholds for profit opportunities and error rates

Data Flow to Dashboard

Services (Go/Rust/TypeScript)
    ↓ Publish events
NATS JetStream
    ↓ Subscribe
Event-Logger-Service
    ↓ Structured logging
Alloy
    ↓ Log collection
Loki
    ↓ LogQL queries
Grafana Events Dashboard

Performance Characteristics

Resource Usage Comparison

Promtail vs Alloy:

Metric	Promtail	Alloy	Improvement
Memory	~80 MB	~50-60 MB	20-30% ↓
CPU	1-2%	1-2%	Similar
Startup Time	~5s	~3s	40% faster

LGTM Stack (Local Docker):

Service	Memory	CPU	Purpose
Loki	~200 MB	2-3%	Log aggregation
Mimir	~300 MB	3-5%	Metrics storage
Tempo	~150 MB	1-2%	Trace storage
Pyroscope	~100 MB	1%	Profile storage
Grafana	~150 MB	1-2%	Visualization
Total	~900 MB	8-13%	Full stack

Throughput

Log Processing: 10,000-50,000 logs/sec (Alloy)
Metrics Ingestion: 1M samples/sec (Mimir)
Trace Ingestion: 10,000 spans/sec (Tempo)
Latency: < 100ms from event to visualization

Configuration Files

All configuration files are version-controlled in the repository:

Alloy:

LGTM Stack:

Grafana:

Docker Compose:

deployment/docker/docker-compose.yml

Quick Start Guide

Local Development Setup

Start LGTM Stack:

cd deployment/docker
docker-compose up -d

Access Services:

Grafana: http://localhost:3000
Alloy UI (Docker): http://localhost:12345
Alloy UI (Host): http://localhost:12346
Loki: http://localhost:3100
Mimir: http://localhost:9009
Tempo: http://localhost:3200
Prometheus: http://localhost:9090

Verify Data Flow:

# Check Alloy is collecting logs
curl http://localhost:12345/metrics

# Query Loki for recent logs
curl -s 'http://localhost:3100/loki/api/v1/query?query={service=~".+"}' | jq

# Query Mimir for metrics
curl -s 'http://localhost:9009/prometheus/api/v1/query?query=up' | jq

# Check event logs
curl -s 'http://localhost:3100/loki/api/v1/query?query={service="event-logger-service"}' | jq

Validation

The repository includes a comprehensive validation script:

cd deployment/monitoring/alloy
.\validate-alloy.ps1

# Expected output:
# ✅ Alloy containers running
# ✅ Alloy UI accessible
# ✅ Logs flowing to Loki
# ✅ No critical errors
# ✅ Resource usage healthy

See the Quick Start Guide for detailed instructions.

Impact and Benefits

Immediate Benefits

Modern Tooling: Using actively developed Grafana Alloy instead of deprecated Promtail
Unified Observability: Single stack (LGTM) for logs, metrics, traces, and profiles
Better Performance: 20-30% lower resource usage with Alloy
Enhanced Debugging: Built-in Alloy UI for troubleshooting
Real-Time Event Monitoring: Updated Events Dashboard for NATS event visibility

Long-Term Benefits

Future-Proof: LGTM stack is Grafana’s strategic direction
Cloud-Ready: Easy migration path to Grafana Cloud when needed
Cost-Efficient: Local LGTM for development, cloud for production
Scalability: Mimir and Tempo scale horizontally
Standardization: Industry-standard observability stack

Developer Experience

Single Dashboard: All observability signals in Grafana
Correlation: Logs ↔ Metrics ↔ Traces ↔ Profiles
Fast Queries: Optimized for real-time analysis
Alerting: Rule-based alerts on metrics and logs
Visualization: Rich, customizable dashboards

Next Steps

Phase 1: Stability

Migrate Promtail to Alloy
Integrate Mimir, Tempo, Pyroscope
Update Events Dashboard
Monitor resource usage and performance
Update remaining dashboards to use Mimir

Phase 2: Instrumentation

Add OpenTelemetry SDK to TypeScript services
Add OpenTelemetry SDK to Go services (quote-service)
Implement NATS trace propagation (context in event headers)
Add Pyroscope profiling to quote-service
Verify end-to-end trace collection

Phase 3: Advanced Features

Configure Grafana alerting rules
Set up SLO (Service Level Objectives) tracking
Implement exemplars (metrics → traces linking)
Create unified dashboards with all signal types
Performance profiling optimization

Phase 4: Production

Evaluate Grafana Cloud migration
Configure long-term retention policies
Set up multi-region observability
Implement advanced ML anomaly detection
Create runbooks and alerting workflows

Troubleshooting

Alloy Not Collecting Logs

Check Alloy status:

docker-compose logs alloy | grep -i error
curl http://localhost:12345/ready

Verify Docker socket access:

docker-compose exec alloy ls -la /var/run/docker.sock

No Data in Mimir

Check Prometheus remote_write:

docker-compose logs prometheus | grep remote_write
curl 'http://localhost:9009/prometheus/api/v1/query?query=up'

Restart Prometheus:

docker-compose restart prometheus

Events Not Showing in Dashboard

Check event-logger is running:

docker-compose logs event-logger-service --tail 20

Query Loki directly:

curl -s 'http://localhost:3100/loki/api/v1/query?query={service="event-logger-service"}' | jq

Check time range in Grafana:

Try “Last 15 minutes” or “Last hour”
Wait 30-60 seconds for indexing

Conclusion

Today’s work modernized the Solana Trading System’s observability stack with a comprehensive migration to the Grafana LGTM platform. By replacing deprecated Promtail with Grafana Alloy and integrating Mimir, Tempo, and Pyroscope, we now have a unified, cloud-ready observability solution that provides deep visibility into logs, metrics, traces, and profiles.

The updated Events Dashboard gives real-time insight into NATS events, arbitrage opportunities, and system health - critical for monitoring a high-frequency trading system. The architecture is positioned for easy migration to Grafana Cloud when production scale demands it, while maintaining cost efficiency during development with a local Docker deployment.

Key Achievements:

✅ Zero-downtime migration from Promtail to Alloy
✅ Complete LGTM stack integration (Loki, Grafana, Tempo, Mimir)
✅ Enhanced Events Dashboard with real-time NATS monitoring
✅ Cloud-ready architecture for future scaling
✅ 20-30% reduction in log collection resource usage

The next phase will focus on instrumenting services with OpenTelemetry for distributed tracing and adding continuous profiling to identify performance bottlenecks.

Related Posts:

Technical Documentation:

Connect

GitHub: github.com/guidebee
LinkedIn: linkedin.com/in/guidebee

This is post #7 in the Solana Trading System development series. Follow along as I document the journey from working prototypes to production HFT system.

Share on

Bluesky Facebook LinkedIn Mastodon X (formerly Twitter)

James Shen