Observability · DevOps · 7 min read

You Can't Fix What You Can't See

Full OpenTelemetry instrumentation, correlation IDs across every layer, and .NET Aspire dashboards — observability as infrastructure, not an afterthought.

WebPrefer Engineering
March 2026

The diagnostic problem

A player's deposit is slow. Support gets the ticket. The first question is obvious: where did the time go? Was it the payment provider? The bonus evaluation that fires after a successful deposit? A database query that hit a lock? The SignalR push that notifies the player's browser?

Without instrumentation, the answer is "check the logs." And the logs are in different files, on different servers, written by different services, with no shared identifier connecting them. The developer opens four log files, eyeballs timestamps, and tries to reconstruct what happened by matching times that are close enough. This is log file archaeology — and it's the default diagnostic method for most platforms.

It doesn't scale. When you have one service, you can grep a log file. When you have an API layer, a business logic layer, a database, a message queue, three external integrations, and a real-time push layer, grepping log files is not diagnostics. It's guesswork with extra steps.

PAM was built with the assumption that every production issue would eventually need to be traced across multiple services. That assumption shaped the architecture. Observability isn't a feature we added — it's infrastructure that was there from the beginning.

OpenTelemetry in PAM

Every API request that enters PAM carries a trace ID. That trace ID is not a custom header we invented — it follows the W3C Trace Context standard used by OpenTelemetry. It propagates automatically through every layer of the stack: from the API controller, through business logic services, into database queries, across RabbitMQ messages, and out to external integration calls.
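As a rough sketch, this kind of propagation comes almost for free from the official OpenTelemetry .NET SDK once the instrumentation packages are registered. The service name "pam-api" and the use of an OTLP exporter here are illustrative, not a description of PAM's actual configuration:

```csharp
// Program.cs — minimal OpenTelemetry wiring for an ASP.NET Core service.
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("pam-api"))   // illustrative service name
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()   // incoming requests: W3C Trace Context is read here
        .AddHttpClientInstrumentation()   // outgoing calls to external integrations
        .AddSqlClientInstrumentation()    // database queries become child spans
        .AddOtlpExporter());              // ship spans to a collector or the Aspire dashboard

var app = builder.Build();
app.MapGet("/health", () => Results.Ok());
app.Run();
```

With this in place, an incoming `traceparent` header is honored automatically, and outgoing HTTP calls carry it onward without any per-call code.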

The instrumentation covers everything that matters: incoming API requests, business logic services, database queries, RabbitMQ publish and consume, external integration calls, and the SignalR pushes that reach the player's browser.

The result is end-to-end visibility. A single trace ID reveals the complete lifecycle of any operation — from the moment the request hits the API to the moment the player sees the result in their browser. No log file archaeology required.

Correlation IDs

A trace ID tells you what happened during a single operation. A correlation ID tells you what happened across a player's entire session, or across all operations related to a specific transaction.

In PAM, correlation IDs flow from the player's browser through every layer of the stack. When a player initiates a deposit, the correlation ID is generated at the API boundary and attached to every log entry, every database operation, every message queue event, and every external integration call that results from that deposit.
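A minimal sketch of what "generated at the API boundary and attached to every log entry" can look like in ASP.NET Core middleware. The header name `X-Correlation-Id` and the property name are illustrative conventions, not PAM's actual ones:

```csharp
// Middleware that reuses the caller's correlation ID if present, else mints one,
// and exposes it to Log4Net so every log line in the request carries it.
public class CorrelationIdMiddleware
{
    private const string HeaderName = "X-Correlation-Id"; // illustrative
    private readonly RequestDelegate _next;

    public CorrelationIdMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        var correlationId = context.Request.Headers.TryGetValue(HeaderName, out var value)
            ? value.ToString()
            : Guid.NewGuid().ToString("N");

        // LogicalThreadContext flows across async/await, so downstream
        // services logging in this request all pick up the same ID.
        log4net.LogicalThreadContext.Properties["CorrelationId"] = correlationId;
        context.Response.Headers[HeaderName] = correlationId;

        await _next(context);
    }
}

// Registration in Program.cs: app.UseMiddleware<CorrelationIdMiddleware>();
```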

This matters most for support. When a support agent asks "what happened to this transaction?", the answer is a correlation ID lookup — not a manual search across multiple systems. The agent enters the transaction ID in the back office. The system returns a complete timeline: the API request, the payment provider call, the wallet update, the bonus evaluation, the notification push. Every step, in order, with timing.

Correlation IDs also solve the cross-service problem that trace IDs alone cannot. When a deposit triggers an asynchronous bonus evaluation via RabbitMQ, the trace ID from the original deposit request is different from the trace ID of the bonus processing. But the correlation ID is the same. It ties the entire business operation together, even when the technical operations span multiple traces.
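Carrying the ID across the asynchronous hop is the key move. AMQP has a first-class `correlation-id` message property for exactly this, which the RabbitMQ .NET client exposes directly. A sketch, with queue names invented for illustration:

```csharp
// Publisher side: stamp the business correlation ID onto the message,
// so the bonus-evaluation consumer logs under the same ID as the deposit.
using RabbitMQ.Client;

void PublishDepositCompleted(IModel channel, string correlationId, byte[] body)
{
    var props = channel.CreateBasicProperties();
    props.CorrelationId = correlationId;  // survives the queue hop
    props.Persistent = true;

    channel.BasicPublish(exchange: "",
                         routingKey: "bonus-evaluation",  // illustrative queue name
                         basicProperties: props,
                         body: body);
}

// Consumer side, before any business logic runs:
//   log4net.LogicalThreadContext.Properties["CorrelationId"] =
//       ea.BasicProperties.CorrelationId;
```

The new trace started by the consumer is technically unrelated to the deposit's trace, but every log line it writes now shares the deposit's correlation ID.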

Every log entry in PAM includes the correlation ID. Log4Net is configured to write it as a structured field, which means log aggregation tools can filter by correlation ID across all services simultaneously. The days of searching five log files with grep and hoping the timestamps align are over.
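In Log4Net terms, writing a context property as a structured field is a pattern-layout concern. A sketch of the relevant appender fragment, assuming the property name matches whatever the API boundary writes into `LogicalThreadContext`:

```xml
<!-- Illustrative appender: %property{CorrelationId} emits the ID on every line. -->
<appender name="FileAppender" type="log4net.Appender.RollingFileAppender">
  <file value="logs/pam.log" />
  <layout type="log4net.Layout.PatternLayout">
    <conversionPattern
      value="%utcdate [%property{CorrelationId}] %-5level %logger - %message%newline" />
  </layout>
</appender>
```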

What a trace looks like

Consider a concrete example: a player deposits 100 EUR. The trace breaks the operation into spans, one per step, each with its own timing.

Total time: 1,311ms. The payment provider took 1,247ms of that. Without the trace, a developer would have spent an hour narrowing down whether the slowness was in the API, the database, or the provider. With the trace, the answer is visible in seconds.

.NET Aspire

Production observability is essential. But the fastest way to reduce production incidents is to catch problems before they get to production. That's where .NET Aspire comes in.

PAM uses Aspire as the local development orchestration host. When a developer starts the solution locally, Aspire spins up the full service graph — the API, the back office, the behavior engine, the cron service, the event queue — along with their dependencies: SQL Server, Redis, RabbitMQ. Everything runs locally, everything is connected, and everything is observable.
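In Aspire, that service graph is declared in an AppHost project. The sketch below uses the `Aspire.Hosting` builder APIs; the resource and project names (`Projects.Pam_Api` and so on) are invented for illustration:

```csharp
// AppHost Program.cs — compose the local service graph and its dependencies.
var builder = DistributedApplication.CreateBuilder(args);

var sql    = builder.AddSqlServer("sql").AddDatabase("pam");
var redis  = builder.AddRedis("redis");
var rabbit = builder.AddRabbitMQ("rabbitmq");

var api = builder.AddProject<Projects.Pam_Api>("pam-api")   // illustrative project names
    .WithReference(sql)
    .WithReference(redis)
    .WithReference(rabbit);

builder.AddProject<Projects.Pam_BackOffice>("pam-backoffice")
    .WithReference(api);

builder.Build().Run();
```

`WithReference` is what makes the graph "connected": connection strings and service endpoints are injected into each project automatically, and every resource shows up in the Aspire dashboard.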

The Aspire dashboard provides the same observability locally that OpenTelemetry provides in production: live traces, structured logs, and per-service metrics, all in one place.

Aspire's impact

The Aspire dashboard mirrors production observability in the local development environment. Bugs that would previously surface as production incidents — race conditions between services, unexpected message ordering, slow queries under realistic load — are now identified and fixed during development. The gap between "works on my machine" and "works in production" shrinks because the local environment is structurally identical to the production environment, and the observability tooling is the same.

Why this matters operationally

Observability changes incident response fundamentally. Without it, the incident response process is: get alerted, check dashboards, check logs, form a hypothesis, test the hypothesis, repeat. The time from alert to diagnosis is measured in minutes or hours, and it depends heavily on whether the right person — the one who knows which log file to check — is available.

With full instrumentation, the incident response process is: get alerted, open the trace, see the problem. The diagnostic path is the trace itself. A slow payment provider call is visible as a long span. A database deadlock is visible as a failed span with the exception attached. A message queue backup is visible as a growing gap between publish time and consume time.
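The "growing gap between publish time and consume time" is measurable directly. One way to sketch it: the publisher stamps the standard AMQP timestamp, and the consumer records the gap as a histogram via `System.Diagnostics.Metrics` (which OpenTelemetry can export). Meter and instrument names here are illustrative:

```csharp
// Consumer-side queue-lag metric: how long did this message sit in the queue?
using System.Diagnostics.Metrics;
using RabbitMQ.Client.Events;

public static class QueueLagMetrics
{
    private static readonly Meter Meter = new("Pam.Messaging"); // illustrative name
    private static readonly Histogram<double> QueueLagMs =
        Meter.CreateHistogram<double>("queue.lag", unit: "ms");

    public static void OnMessageReceived(object? sender, BasicDeliverEventArgs ea)
    {
        // AMQP timestamps are seconds since the Unix epoch, set at publish time.
        var publishedAt =
            DateTimeOffset.FromUnixTimeSeconds(ea.BasicProperties.Timestamp.UnixTime);
        QueueLagMs.Record((DateTimeOffset.UtcNow - publishedAt).TotalMilliseconds);
    }
}
```

A backed-up queue then shows up as a rising histogram on a dashboard, rather than as a mystery discovered after the fact.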

This changes who can diagnose problems. When the diagnostic path is a trace, any engineer can follow it — not just the one who wrote the code. Knowledge isn't locked in someone's head; it's in the instrumentation.

It also changes how we think about system health. When monitoring is quiet, it means something. It means every trace is completing within expected bounds, every dependency is responding normally, and every message is being consumed on time. Silence isn't the absence of information — it's a positive signal that the system is healthy.

Today

PAM ships with OpenTelemetry instrumentation across every service, correlation IDs in every log entry, and .NET Aspire orchestration for local development. Traces propagate from API requests through business logic, database queries, message queues, external integrations, and real-time pushes. The median time from incident alert to root cause identification has dropped from hours of log searching to minutes of trace inspection. Every new service and integration is instrumented from day one — observability is not optional, it's part of the definition of done.

The principle

You can't fix what you can't see. That sounds like a motivational poster, but it's an engineering constraint. Systems without observability aren't systems you understand — they're systems you hope are working. Hope is not a monitoring strategy.

The investment in instrumentation pays for itself the first time a production issue is diagnosed in five minutes instead of five hours. It pays for itself again every time a developer catches a performance regression locally before it ships. And it pays for itself continuously in the confidence it gives operators: when the dashboards are green, the system is genuinely healthy — not just quietly failing where nobody is looking.
