Series: QA Leadership · Article 3 of 9
It was a Friday evening. The team's DDR was 91%. Regression passed beautifully. Confidence Score: 89% - GO. The release shipped. And forty minutes later the alerts started rolling in.
Friday · 18:47 · Production
18:47 ALERT: Timeouts on connections to the external payments API - 503 for 34% of requests
18:51 DevOps: checking the logs... this isn't our code. Something with the SSL config in the new environment.
19:03 QA Lead: but it wasn't a bug - everything passed in the tests.
19:04 PM: the client just wrote in. They haven't been able to process transactions for 17 minutes.
19:22 Rollback complete. Downtime: 35 minutes. The production SSL certificate differed from staging.
The next day at the retrospective, one question came up: “How is this possible - DDR 91%, and the client couldn’t pay for half an hour?”
The answer is both simple and painful: because DDR measured only bugs in the code. And the problem was in the infrastructure configuration. And that is exactly the gap this article is about.
The client doesn't distinguish whether the service went down because of a code bug, a bad SSL certificate, or a wrong feature flag. To them - and to your business - it's all the same thing: production is down.
Escaped Problem - a broader definition
In the previous article we talked about DDR - the metric for defect detection effectiveness. DDR asks: how many bugs do we catch before they reach production? But that definition assumes the only problems are bugs in the application code.
Reality is different. An Escaped Problem is any problem discovered by a customer or by monitoring after deployment - regardless of its source. Four categories, four entirely different ways of arising, four different ways of preventing them.
Four types - one shared consequence
Before you start measuring, you need to know what you’re measuring. Here is the full taxonomy of escaped problems with the typical percentage share in the organizations I’ve worked with.
🐛 Code defects (~55% of cases)
The classic bug - incorrect application behavior caused by an error in the programming logic. This is exactly what DDR from article 2 measures.
Examples: wrong price calculation after a discount; NullPointerException on an edge case; incorrect form validation.

⚙️ Infrastructure problems (~20% of cases)
The production environment behaves differently from the test one. The code is correct - but it doesn't work in the target context.
Examples: SSL certificate differs from staging; insufficient server resources under load; library version mismatch between environments.

🔗 Integration failures (~15% of cases)
External APIs, third-party systems, internal microservices - something that worked in tests fails in production because of a different call context.
Examples: payments API returns a different format in prod; a timeout different from staging; missing permissions in a service integration.

↩️ Post-deployment regressions (~10% of cases)
A feature worked before the release - after deployment it stopped. The cause: an unexpected interaction with new changes or configuration changes.
Examples: a feature flag overrode production settings; cache wasn't cleared after deployment; a database migration changed the behavior of old records.
The shares are rounded - a few percent of cases are mixed situations, hard to classify cleanly, so treat the split as approximate. The proportions will differ in your organization - but the taxonomy itself is almost universal.
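The taxonomy can be written down as a small data model, which makes the later counting and reporting mechanical. A minimal sketch in Python: the names `ProblemType`, `TYPICAL_SHARE` and `combined_share` are illustrative, and the shares are the approximate figures quoted above, not universal constants.

```python
from enum import Enum


class ProblemType(Enum):
    """The four escaped-problem categories from the taxonomy above."""
    CODE = "code"
    INFRA = "infra"
    INTEGRATION = "integration"
    REGRESSION = "regression"


# Approximate shares quoted in this article; replace them with your own data.
TYPICAL_SHARE = {
    ProblemType.CODE: 0.55,
    ProblemType.INFRA: 0.20,
    ProblemType.INTEGRATION: 0.15,
    ProblemType.REGRESSION: 0.10,
}


def combined_share(shares, *types):
    """Sum the shares of the given problem types."""
    return sum(shares[t] for t in types)


# Share of escaped problems that comes from infra and integrations together.
print(round(combined_share(TYPICAL_SHARE, ProblemType.INFRA, ProblemType.INTEGRATION), 2))  # 0.35
```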
Code vs infra vs integration - the key differences
Each type of escaped problem has a different source, a different warning signal and a different prevention method. The table below is your navigation map.
| Type | Who owns it | Where to look for signals | How to prevent it |
| --- | --- | --- | --- |
| Code | Dev + QA | Jira, automated tests, code review | Test coverage, DDR, definition of done |
| Infra | DevOps + QA | Monitoring, environment diffs, IaC review | Environment parity, infrastructure-as-code tests |
| Integrations | Dev + QA + vendor | API logs, contract tests, alerting | Contract tests, mocking with prod-like data |
| Regressions | QA + DevOps | Post-deployment monitoring, smoke tests | Post-deploy smoke suite, canary deployments |
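"Environment diffs" in the infra row can start very small. A minimal sketch, assuming each environment's settings can be exported as a flat dictionary; the keys and values below are hypothetical, loosely modelled on the incident from the introduction.

```python
def config_diff(staging, production):
    """Return {key: (staging_value, production_value)} for every mismatch.

    A key missing on one side is reported with None on that side, so a
    missing key and an explicit None value look the same in this sketch.
    """
    keys = set(staging) | set(production)
    return {
        key: (staging.get(key), production.get(key))
        for key in sorted(keys)
        if staging.get(key) != production.get(key)
    }


# Hypothetical settings for illustration only.
staging = {"tls_cert_fingerprint": "AA:11", "payments_timeout_s": 30, "lib_version": "2.4.1"}
production = {"tls_cert_fingerprint": "BB:22", "payments_timeout_s": 30, "lib_version": "2.4.1"}

for key, (stg, prod) in config_diff(staging, production).items():
    print(f"{key}: staging={stg!r} production={prod!r}")
```

Run before a release, a non-empty diff is a prompt for a conversation with DevOps, not automatically a blocker: some differences between environments are intentional.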
[Chart: Distribution of escaped problem types - a sample year, Q1-Q4. Code dominates, but infra and integrations together account for ~35% of problems and are often left out of reports.]
How to collect and categorize - a practical guide
Most teams collect only bugs from Jira. That’s like measuring the temperature in one room and claiming you know the climate of the whole building. Here’s what to add and how to connect it.
Data sources
🗂️ Jira / tracker (mandatory)
Code defects reported by QA and devs. An "environment" field or a "production" tag lets you isolate the escaped ones.
📡 Alert monitoring (mandatory)
PagerDuty, Datadog, Grafana. Production incidents with a timestamp - the source for infra and integrations.
🎧 Support tickets (important)
Freshdesk, Zendesk. Problems reported by customers that never reach Jira as a bug.
🔖 Post-deploy logs (important)
The first 30 minutes after deployment is the regression window. Splunk, ELK, CloudWatch - logs from that window.
💬 Slack / Teams (supplementary)
The #incidents or #prod-issues channel. This is often where problems land before anyone logs them officially.
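The 30-minute regression window is easy to turn into a filter over timestamped events. A minimal sketch, assuming you have the deployment time and the event time as `datetime` values; the function name and the sample times are hypothetical, and the window length is a parameter you should tune to your own release rhythm.

```python
from datetime import datetime, timedelta

REGRESSION_WINDOW = timedelta(minutes=30)  # the window suggested above


def in_regression_window(event_time, deploy_time, window=REGRESSION_WINDOW):
    """True if the event happened within `window` after the deployment."""
    return timedelta(0) <= event_time - deploy_time <= window


# Hypothetical timestamps for illustration.
deploy = datetime(2025, 3, 7, 14, 0)
print(in_regression_window(datetime(2025, 3, 7, 14, 12), deploy))  # True
print(in_regression_window(datetime(2025, 3, 7, 15, 30), deploy))  # False
```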
The categorization process - step by step
1. Collect every production event from the week / sprint
One log - regardless of source. Date, short description, downtime or user impact. At this stage you don't categorize - you only collect.
2. Assign a type to each event
Code / infra / integration / regression. One event - one type. If you're not sure - pick the most likely one and mark it "to verify".
3. Map it to a release
Which deployment brought the problem in? Sometimes it's obvious - an incident 30 minutes after deployment. Sometimes you have to look at the change history. Without this step you lose the ability to tie escaped problems to specific releases (the metric from article 5).
4. Compute the cost and log the resolution time
Time to detect, time to fix, who was involved. Even an approximation (DevOps ~3h, Dev ~1h) is enough - we cover the cost details in the next section.
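The four steps map naturally onto one record per event plus a lookup against the deployment history. A minimal sketch under stated assumptions: the `EscapedProblem` fields, the `map_to_release` helper and the release names are all hypothetical, and "most recent deployment before the event" is only a first guess for step 3 that the change history may overrule.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class EscapedProblem:
    occurred_at: datetime                # step 1: collected for every event
    description: str                     # step 1
    problem_type: Optional[str] = None   # step 2: code / infra / integration / regression
    to_verify: bool = False              # step 2: set when the type is a best guess
    release: Optional[str] = None        # step 3
    hours_spent: float = 0.0             # step 4: rough total across roles


def map_to_release(problem, deployments):
    """Attach the most recent deployment at or before the event (step 3).

    `deployments` is a list of (deployed_at, release_name) sorted by time.
    """
    times = [deployed_at for deployed_at, _ in deployments]
    index = bisect_right(times, problem.occurred_at) - 1
    problem.release = deployments[index][1] if index >= 0 else None
    return problem


# Hypothetical deployment history and event.
deployments = [
    (datetime(2025, 3, 3, 10, 0), "release-41"),
    (datetime(2025, 3, 7, 18, 0), "release-42"),
]
event = EscapedProblem(datetime(2025, 3, 7, 18, 40), "Payments API timeouts",
                       problem_type="infra", hours_spent=4.0)
print(map_to_release(event, deployments).release)  # release-42
```

Keeping `problem_type` optional mirrors step 1, where events are collected first and typed later.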
Implementation checklist
Check which data sources you already have connected in your team.
Jira - "environment" field or "production" tag configured
Lets you filter bugs found in production down to the release.
Monitoring alerts land in one place (Slack / PagerDuty)
Every production alert should leave a trace you can analyze later.
Support tickets linked to Jira or logged separately
Without this you lose problems the customer reports directly - often the most serious ones.
Deployment history with exact dates and times
Essential for attributing incidents to specific releases.
Smoke tests run automatically after every deployment
They catch regressions in the first minutes - before they reach the customer.
A weekly incident review with type classification
A 15-minute ritual that turns raw data into a categorized history.
Cost
How much does each escaped problem type cost?
Each type of escaped problem has a different cost profile - a different detection time, a different fix time, different people involved. Below are estimates based on the median from typical enterprise organizations. Your numbers will differ - but the proportions are surprisingly consistent.
Code: application code defect (5-6h per incident · risk: medium)
Dev: 2-3h analysis + fix
QA: 1h verification
DevOps: 1h hotfix deploy
PM: 0.5h coordination
The most common type. A well-defined fix process. Lower escalation cost.

Infra: infrastructure / configuration problem (8-12h per incident · risk: high)
DevOps: 3-5h diagnosis + fix
Dev: 1h support
QA: 1h environment verification
PM: 1h + client communication
Often: a rollback of the whole release
Harder to diagnose. Often requires a rollback - not just a fix.

Integration: external integration failure (8-16h per incident · risk: critical)
Dev: 2-4h diagnosis + workaround
DevOps: 2h configuration
PM: 2-3h vendor communication
Often: SLA breach with an external vendor
Part of the problem sits with the vendor. Resolution time depends on an external SLA.

Regression: post-deployment regression (7-10h per incident · risk: high)
QA: 2h scope identification
Dev: 2-3h interaction analysis
DevOps: 2h rollback or hotfix
Often: impact on several features at once
Insidious - because "the previous version worked". Requires deeper root-cause analysis.
Average cost across all types: ~8h per single escaped problem
Most expensive type: Integration (8-16h + external SLA)
Most common type: Code (~55% of all cases)
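One way to sanity-check the ~8h average is to weight each type's cost by its share of cases. The sketch below is an assumption about how such a figure could be derived, not the article's stated method: it takes the midpoint of each hour range and the approximate shares from the taxonomy.

```python
# (share of cases, low estimate in hours, high estimate in hours)
# Shares and ranges are the approximate figures quoted in this article.
COST_PROFILE = {
    "code": (0.55, 5, 6),
    "infra": (0.20, 8, 12),
    "integration": (0.15, 8, 16),
    "regression": (0.10, 7, 10),
}


def weighted_average_hours(profile):
    """Average cost per escaped problem, weighting each type by its share."""
    return sum(share * (low + high) / 2 for share, low, high in profile.values())


print(f"{weighted_average_hours(COST_PROFILE):.1f}h")  # 7.7h
```

With these inputs the result lands just under 8 hours, consistent with the rounded figure above. Swap in your own shares and ranges to get the number for your organization.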
Data that says more than a single counter
Instead of one number “escaped bugs = 12” - two charts that give a completely different level of insight into what’s really going on.
[Chart: Escaped problems by type - quarterly trend, Q1-Q4 2025 (series: Code, Infra, Integrations, Regressions). Code shrinks faster - because it's better tested. Infra and integrations hold steady - they need different actions.]
[Chart: Cost by type - Q4 2025, in work hours. Integrations are only 15% of cases - but they consume disproportionately more time and budget.]
How to present this to the business
The number of escaped problems alone stops being enough once you have the type distribution and the cost of each. Here’s how to turn that data into a narrative.
Instead of: *"we had 8 escaped bugs."* Say: *"we had 8 escaped problems - 5 code defects, 2 configuration problems and 1 integration failure. Total cost: about 68 hours. Infra and integrations need a separate strategy."*
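Once events are categorized and costed, a sentence like this can be generated straight from the log. A minimal sketch: the `summarize` helper is hypothetical, and the per-incident hours are invented so that they fall inside the ranges from the cost section and add up to the 68 hours in the example.

```python
from collections import Counter


def summarize(problems):
    """Build a short summary from (problem_type, hours) pairs."""
    counts = Counter(problem_type for problem_type, _ in problems)
    total_hours = sum(hours for _, hours in problems)
    breakdown = ", ".join(f"{n} {kind}" for kind, n in counts.most_common())
    return (f"We had {len(problems)} escaped problems ({breakdown}). "
            f"Total cost: about {total_hours:.0f} hours.")


# Invented sample data: 5 code defects, 2 infra problems, 1 integration failure.
problems = [
    ("code", 6), ("code", 5), ("code", 6), ("code", 5), ("code", 6),
    ("infra", 12), ("infra", 12),
    ("integration", 16),
]
print(summarize(problems))
# We had 8 escaped problems (5 code, 2 infra, 1 integration). Total cost: about 68 hours.
```

The generated sentence is only the factual skeleton. The recommendation that follows it ("infra and integrations need a separate strategy") still has to come from you.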
Sprint review
"This sprint we had 3 escaped problems: 2 code defects and 1 environment configuration problem. Cost: about 22 hours. The configuration problem was the most expensive - and we have a plan to not repeat it."
1:1 with EM
"Looking at the trend - code defects are dropping. But infra and integration problems hold at a steady level. That needs a different intervention than more testing - we need better environment parity and contract tests."
Board
"In Q4 we had 8 escaped problems at a total cost of about 68 work hours. For comparison - in Q1 there were 18 at about 160 hours. The biggest saving came from contract tests rolled out in Q2."
What the full taxonomy changes
Once you start categorizing escaped problems instead of just counting them - the conversation changes fundamentally. You stop saying how many and start saying what and why.
4
types of escaped problems to track
up to 3×
cost difference: code vs integration (5-6h vs 8-16h per incident)
35%
problems missed when you measure only bugs in the code
15 min
a weekly review is enough for full categorization
The client doesn't report a problem labeled "type: infrastructure". To them - and to your business - one thing matters: does it work. Measure everything that can stop working.
In the next article
Article four covers Issues per Release - a code-maturity metric that reshapes the conversation with the Engineering Manager. It doesn’t ask how many bugs you found - it asks how clean the code you received for testing was.
Spoiler: this is the metric that often reveals the problem lies not with QA but with the development process - and it gives you the data to have that conversation from a position of facts, not opinions.
Series: QA metrics the business wants to hear
01 · Diagnosis, three pillars, five metrics, the QA → KPI mapping model
02 · Formula, thresholds, historical data, seasonality, pitfalls
03 · Escaped Bugs & Problems (you are here): Taxonomy, data collection, the cost of each type, how to report
04 · How this metric reshapes the conversation with the Engineering Manager
05 · Pinpointing problems, not just watching trends
06 · Number of Releases - the context metric: Why 3 bugs with 2 releases is a disaster, and with 15 - a success
07 · Release Confidence Score step by step: Three calculation models, rollout, concrete examples from practice
08 · Storytelling with metrics - building a narrative: How to turn a table of numbers into a business argument
09 · 3 anti-patterns that destroy QA credibility: Too many metrics, no context, jargon - and how to avoid each