Napkin Math #19: Metrics For Your Web Application's Dashboards
At the beginning of the year, I was helping readwise.io get some of their observability up to snuff. A few weeks later, "What should I monitor?" came up on another call, so I decided to list the metrics I expect from a great dashboard:
- Web Backend (e.g. Django, Node, Rails, Go, ..)
  - Response Time (p50, p90, p99, sum, avg)
  - Throughput by HTTP status
  - Worker Utilization
  - Request Queuing Time
  - Service calls
    - Database(s), caches, internal services, third-party APIs, ..
    - Enqueued jobs are important!
    - [Circuit Breaker tripping][cb] (/min)
    - Errors, throughput, latency (p50, p90, p99)
  - Throttling
  - Cache hits and misses (%)
  - CPU and Memory Utilization
  - Exception counts (/min)
- Job Backend (e.g. Sidekiq, Celery, Bull, ..)
  - Job Execution Time (p50, p90, p99, sum, avg)
  - Throughput by Job Status {error, success, retry}
  - Worker Utilization
  - Time in Queue
  - Queue Sizes
    - Don't forget scheduled jobs and retries!
  - Service calls (p50, p90, p99, count, by type)
  - Throttling
  - CPU and Memory Utilization
  - Exception counts (/min)
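To make a few of the entries above concrete, here is a minimal Python sketch of how the latency summary (p50, p90, p99, sum, avg), worker utilization, and cache hit percentage could be computed from raw numbers for one time window. The function names (`percentile`, `summarize`, `worker_utilization`, `cache_hit_rate`) and the input shapes are my own assumptions for illustration; they are not taken from any particular monitoring tool, and the utilization formula is just one reasonable definition.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile (hypothetical helper): the smallest sample
    such that at least q percent of all samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer arithmetic before the division avoids float rounding surprises.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations_ms):
    """Return the p50/p90/p99/sum/avg summary for one window of durations."""
    total = sum(durations_ms)
    return {
        "p50": percentile(durations_ms, 50),
        "p90": percentile(durations_ms, 90),
        "p99": percentile(durations_ms, 99),
        "sum": total,
        "avg": total / len(durations_ms),
    }


def worker_utilization(busy_seconds, workers, window_seconds):
    """Assumed definition: fraction of available worker time spent busy."""
    return busy_seconds / (workers * window_seconds)


def cache_hit_rate(hits, misses):
    """Hits as a percentage of all cache lookups (0.0 if there were none)."""
    lookups = hits + misses
    return 100.0 * hits / lookups if lookups else 0.0


if __name__ == "__main__":
    response_times_ms = list(range(1, 101))  # made-up sample data
    print(summarize(response_times_ms))
    print(worker_utilization(busy_seconds=90, workers=4, window_seconds=60))
    print(cache_hit_rate(hits=950, misses=50))
```

Sorting raw samples like this is fine for a napkin-sized data set. In practice, a metrics system will more likely approximate percentiles from histograms or sketches, so treat this as a way to reason about what the dashboard numbers mean rather than how to produce them at scale.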
More details about what these all mean in the latest napkin post!
Any favourites of yours missing? Let me know.
P.S. On Thursday night, eastern time, I'll be doing a short talk about napkin math on memory bandwidth.