Analytics & Attribution · Marketing Intelligence

The Compass

The Challenge

A multi-brand platform was operating with fragmented monitoring — different dashboards, different alert thresholds, different tooling for different properties.

The Approach

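The core insight was that monitoring and observability are not the same thing. Monitoring reports whether a known failure condition has been hit; observability lets engineers work out why from the signals the system emits.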

The Build

Datadog APM & Distributed Tracing

Instrumented all application services with Datadog APM, enabling distributed tracing across request lifecycles. Service maps were built for every major user journey (sign-up, checkout, content delivery), so an anomaly in one service could be correlated immediately with its upstream and downstream impact.
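To make the upstream and downstream correlation concrete, here is a minimal sketch of the idea behind a service map, written as a plain dependency graph. It does not use Datadog's API, and the service names and the CALLS graph are hypothetical, chosen only for illustration. "Downstream" here means the dependencies a service calls; "upstream" means the callers that may feel the impact.

```python
from collections import deque

# Hypothetical call graph: each key calls the services in its list.
CALLS = {
    "web": ["auth", "checkout", "content"],
    "checkout": ["payments", "inventory"],
    "content": ["cdn-origin"],
    "auth": [],
    "payments": [],
    "inventory": [],
    "cdn-origin": [],
}


def reachable(graph, start):
    """Return every node reachable from start, excluding start itself."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(graph):
    """Flip edge direction so callers can be looked up from a callee."""
    inverted = {node: [] for node in graph}
    for caller, callees in graph.items():
        for callee in callees:
            inverted.setdefault(callee, []).append(caller)
    return inverted


def blast_radius(graph, service):
    """List the services an anomaly in `service` should be correlated with."""
    return {
        "downstream": sorted(reachable(graph, service)),
        "upstream": sorted(reachable(invert(graph), service)),
    }


radius = blast_radius(CALLS, "checkout")
print(radius)
```

In a real deployment the graph would be derived from trace data rather than written by hand; the traversal is the part this sketch is meant to show.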

AWS CloudWatch Unified Integration

Centralized CloudWatch metrics and logs across all AWS accounts and regions into a single aggregation layer. Infrastructure signals (EC2 CPU, RDS connections, Lambda cold starts, S3 request errors) fed into the same observability plane as application traces.
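One way to picture the single aggregation layer is a common record type that every source is flattened into. The sketch below is an assumption about how that could look, not the platform's actual schema: the Signal class, its field names, and the account ID are hypothetical. The input dictionary follows the shape of a CloudWatch GetMetricData response (MetricDataResults entries with Label, Timestamps, and Values), hand-written here so the example runs without AWS credentials.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Signal:
    """Hypothetical common record shared by infrastructure and application signals."""
    source: str        # e.g. "cloudwatch" or "apm"
    account: str
    region: str
    name: str
    timestamp: datetime
    value: float


def signals_from_cloudwatch(response, account, region):
    """Flatten a GetMetricData-style response into Signal records, oldest first."""
    signals = []
    for result in response.get("MetricDataResults", []):
        for ts, value in zip(result["Timestamps"], result["Values"]):
            signals.append(
                Signal("cloudwatch", account, region, result["Label"], ts, float(value))
            )
    return sorted(signals, key=lambda s: s.timestamp)


# Hand-written sample in the GetMetricData response shape (newest datapoint first).
sample_response = {
    "MetricDataResults": [
        {
            "Id": "cpu",
            "Label": "CPUUtilization",
            "Timestamps": [
                datetime(2024, 1, 1, 0, 5, tzinfo=timezone.utc),
                datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
            ],
            "Values": [41.0, 37.5],
            "StatusCode": "Complete",
        }
    ]
}

signals = signals_from_cloudwatch(sample_response, account="111111111111", region="us-east-1")
for s in signals:
    print(s.timestamp.isoformat(), s.name, s.value)
```

Tagging each record with its account and region is what lets signals from many AWS accounts sit next to application traces in one queryable stream.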

Custom Anomaly Scoring Model

Built a custom scoring model that evaluated incoming signals against rolling historical baselines rather than static thresholds. Seasonal traffic patterns, deployment-correlated spikes, and known-good variance windows were all factored in. The model assigned anomaly scores rather than binary alerts.
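The scoring model itself is not published, so the following is only a sketch of the general technique under stated assumptions: a robust z-score (median and MAD) computed against a rolling baseline kept per hour-of-week bucket, so a Monday 09:00 value is compared with earlier Monday 09:00 values. The class name, the window sizes, the relative floor, and the deployment discount are all hypothetical parameters picked for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from statistics import median


class SeasonalBaseline:
    """Rolling per-(weekday, hour) baseline that returns a score, not a yes/no alert."""

    def __init__(self, window=8, min_samples=4, rel_floor=0.05, deploy_discount=0.5):
        self.min_samples = min_samples
        self.rel_floor = rel_floor              # assumed known-good variance, as a fraction of the baseline
        self.deploy_discount = deploy_discount  # assumed damping while a deployment is in progress
        self._history = defaultdict(lambda: deque(maxlen=window))

    @staticmethod
    def _bucket(ts):
        return (ts.weekday(), ts.hour)

    def observe(self, ts, value):
        """Add a value to the rolling baseline for its hour-of-week bucket."""
        self._history[self._bucket(ts)].append(value)

    def score(self, ts, value, deploy_active=False):
        """Return a robust z-score; 0.0 when the bucket has too little history."""
        history = self._history[self._bucket(ts)]
        if len(history) < self.min_samples:
            return 0.0
        center = median(history)
        mad = median(abs(x - center) for x in history)
        # 1.4826 scales MAD to be comparable with a standard deviation.
        scale = max(1.4826 * mad, self.rel_floor * abs(center), 1e-9)
        raw = abs(value - center) / scale
        return raw * self.deploy_discount if deploy_active else raw


baseline = SeasonalBaseline()
monday_9am = datetime(2024, 1, 1, 9, 0)  # 2024-01-01 is a Monday
for week, value in enumerate([100, 102, 98, 101, 99, 103]):
    baseline.observe(monday_9am + timedelta(weeks=week), value)

now = monday_9am + timedelta(weeks=6)
normal_score = baseline.score(now, 101)
spike_score = baseline.score(now, 180)
deploy_score = baseline.score(now, 180, deploy_active=True)
print(round(normal_score, 2), round(spike_score, 2), round(deploy_score, 2))
```

Because the output is a continuous score, downstream automation can apply different actions at different score levels instead of reacting to a single fixed threshold.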

The Outcome

The 99.98% uptime SLA held consistently across all brand properties. 94% of issues were auto-resolved before a human was ever notified. Mean time to resolution dropped 40%. The platform stopped finding out about problems from users.
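For a sense of scale, an uptime percentage can be converted into a downtime budget. The helper below is illustrative arithmetic only (the function name is hypothetical): 99.98% leaves roughly 105 minutes of allowed downtime in a 365-day year, or under 9 minutes in a 30-day month.

```python
def downtime_budget_minutes(sla_percent, period_days):
    """Minutes of downtime permitted by an uptime SLA over a period."""
    return (1 - sla_percent / 100) * period_days * 24 * 60


yearly = downtime_budget_minutes(99.98, 365)
monthly = downtime_budget_minutes(99.98, 30)
print(f"{yearly:.2f} min/year, {monthly:.2f} min/30 days")
```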

The platform went from reactive firefighting to proactive infrastructure management. The observability layer was no longer something engineers checked during incidents. It became something the system acted on continuously.
