Series: QA Leadership · Article 3 of 9

It was a Friday evening. The team's DDR was 91%. Regression passed beautifully. Confidence Score: 89% - GO. The release shipped. And forty minutes later the alerts started rolling in.

Friday · 18:47 · Production
18:47 ALERT: Timeouts on connections to the external payments API - 503 for 34% of requests
18:51 DevOps: checking the logs... this isn't our code. Something with the SSL config in the new environment.
19:03 QA Lead: but it wasn't a bug - everything passed in the tests.
19:04 PM: the client just wrote in. They haven't been able to process transactions for 17 minutes.
19:22 Rollback complete. Downtime: 35 minutes. The production SSL certificate differed from staging.

The next day at the retrospective, one question came up: “How is this possible - DDR 91%, and the client couldn’t pay for half an hour?”

The answer is both simple and painful: DDR measured only bugs in the code, and the problem was in the infrastructure configuration. That is exactly the gap this article is about.

The client doesn't distinguish whether the service went down because of a code bug, a bad SSL certificate, or a wrong feature flag. To them - and to your business - it's all the same thing: production is down.

Escaped Problem - a broader definition

In the previous article we talked about DDR - the metric for defect detection effectiveness. DDR asks: how many bugs do we catch before they reach production? But that definition assumes the only problems are bugs in the application code.

Reality is different. An Escaped Problem is any problem discovered by a customer or by monitoring after deployment - regardless of its source. Four categories, four entirely different ways of arising, four different ways of preventing them.

Four types - one shared consequence

Before you start measuring, you need to know what you’re measuring. Here is the full taxonomy of escaped problems with the typical percentage share in the organizations I’ve worked with.

🐛 Code defects (~55% of cases)
The classic bug - incorrect application behavior caused by an error in the programming logic. This is exactly what DDR from article 2 measures.
Examples: wrong price calculation after a discount; NullPointerException on an edge case; incorrect form validation.

⚙️ Infrastructure problems (~20% of cases)
The production environment behaves differently from the test one. The code is correct - but it doesn't work in the target context.
Examples: SSL certificate differs from staging; insufficient server resources under load; library version mismatch between environments.

🔗 Integration failures (~15% of cases)
External APIs, third-party systems, internal microservices - something that worked in tests fails in production because of a different call context.
Examples: payments API returns a different format in prod; a timeout different from staging; missing permissions in a service integration.

↩️ Post-deployment regressions (~10% of cases)
A feature worked before the release - after deployment it stopped. The cause: an unexpected interaction with new changes or configuration changes.
Examples: a feature flag overrode production settings; cache wasn't cleared after deployment; a database migration changed the behavior of old records.

The shares are approximate: a few percent of cases are mixed situations that are hard to classify cleanly, so your own totals may not sum to exactly 100%. The proportions will differ in your organization - but the taxonomy itself is almost universal.

Code vs infra vs integration - the key differences

Each type of escaped problem has a different source, a different warning signal and a different prevention method. The table below is your navigation map.

Type | Who owns it | Where to look for signals | How to prevent it
Code | Dev + QA | Jira, automated tests, code review | Test coverage, DDR, definition of done
Infra | DevOps + QA | Monitoring, environment diffs, IaC review | Environment parity, infrastructure-as-code tests
Integrations | Dev + QA + vendor | API logs, contract tests, alerting | Contract tests, mocking with prod-like data
Regressions | QA + DevOps | Post-deployment monitoring, smoke tests | Post-deploy smoke suite, canary deployments

[Chart: Distribution of escaped problem types - a sample year, Q1-Q4. Code dominates, but infra and integrations are ~35% of problems combined, often left out of reports.]
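
The ownership table above can also live next to your incident log as a small lookup, so every categorized event immediately points at an owner and a prevention method. A minimal Python sketch: the names `ProblemType`, `TypeProfile`, `TAXONOMY` and `owners_for` are hypothetical, while the contents are copied from the table.

```python
from dataclasses import dataclass
from enum import Enum


class ProblemType(Enum):
    CODE = "code"
    INFRA = "infra"
    INTEGRATION = "integration"
    REGRESSION = "regression"


@dataclass(frozen=True)
class TypeProfile:
    owners: tuple[str, ...]
    signals: tuple[str, ...]
    prevention: tuple[str, ...]


# Contents mirror the table in the article; the structure is only an illustration.
TAXONOMY = {
    ProblemType.CODE: TypeProfile(
        owners=("Dev", "QA"),
        signals=("Jira", "automated tests", "code review"),
        prevention=("test coverage", "DDR", "definition of done"),
    ),
    ProblemType.INFRA: TypeProfile(
        owners=("DevOps", "QA"),
        signals=("monitoring", "environment diffs", "IaC review"),
        prevention=("environment parity", "infrastructure-as-code tests"),
    ),
    ProblemType.INTEGRATION: TypeProfile(
        owners=("Dev", "QA", "vendor"),
        signals=("API logs", "contract tests", "alerting"),
        prevention=("contract tests", "mocking with prod-like data"),
    ),
    ProblemType.REGRESSION: TypeProfile(
        owners=("QA", "DevOps"),
        signals=("post-deployment monitoring", "smoke tests"),
        prevention=("post-deploy smoke suite", "canary deployments"),
    ),
}


def owners_for(problem_type: ProblemType) -> str:
    """Return the owning roles for a problem type, e.g. 'DevOps + QA'."""
    return " + ".join(TAXONOMY[problem_type].owners)


print(owners_for(ProblemType.INFRA))
```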

How to collect and categorize - a practical guide

Most teams collect only bugs from Jira. That’s like measuring the temperature in one room and claiming you know the climate of the whole building. Here’s what to add and how to connect it.

Data sources

🗂️ Jira / tracker (mandatory)
Code defects reported by QA and devs. An "environment" field or a "production" tag lets you isolate the escaped ones.

📡 Alert monitoring (mandatory)
PagerDuty, Datadog, Grafana. Production incidents with a timestamp - the source for infra and integrations.

🎧 Support tickets (important)
Freshdesk, Zendesk. Problems reported by customers that never reach Jira as a bug.

🔖 Post-deploy logs (important)
The first 30 minutes after deployment is the regression window. Splunk, ELK, CloudWatch - logs from that window.

💬 Slack / Teams (supplementary)
The #incidents or #prod-issues channel. This is often where problems land before anyone logs them officially.

The categorization process - step by step

1. Collect every production event from the week / sprint
One log - regardless of source. Date, short description, downtime or user impact. At this stage you don't categorize - you only collect.

2. Assign a type to each event
Code / infra / integration / regression. One event - one type. If you're not sure - pick the most likely one and mark it "to verify".

3. Map it to a release
Which deployment brought the problem in? Sometimes it's obvious - an incident 30 minutes after deployment. Sometimes you have to look at the change history. Without this step you lose the ability to tie escaped problems to specific releases (the metric from article 5).

4. Compute the cost and log the resolution time
Time to detect, time to fix, who was involved. Even an approximation (DevOps ~3h, Dev ~1h) is enough - the next section covers the cost details.
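
Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` and `Deployment` records, the function names, the dates and the release labels are all invented, and "latest deployment before the event" is just a first guess that the change history may overrule. The 30-minute window is the regression window mentioned under post-deploy logs.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# The article calls the first 30 minutes after deployment the regression window.
REGRESSION_WINDOW = timedelta(minutes=30)


@dataclass(frozen=True)
class Deployment:
    release: str              # hypothetical release label
    deployed_at: datetime


@dataclass(frozen=True)
class Event:
    occurred_at: datetime
    description: str
    problem_type: str         # "code", "infra", "integration" or "regression"
    to_verify: bool = False   # step 2: set when the type is only a best guess


def map_to_release(event: Event, deployments: list[Deployment]) -> Optional[Deployment]:
    """Step 3, first guess: the latest deployment at or before the event."""
    ordered = sorted(deployments, key=lambda d: d.deployed_at)
    times = [d.deployed_at for d in ordered]
    index = bisect_right(times, event.occurred_at)
    return ordered[index - 1] if index else None


def in_regression_window(event: Event, deployment: Deployment) -> bool:
    """True if the event happened within 30 minutes after the deployment."""
    delay = event.occurred_at - deployment.deployed_at
    return timedelta(0) <= delay <= REGRESSION_WINDOW


# Invented example data; the 40-minute gap mirrors the opening story.
deployments = [
    Deployment("release-A", datetime(2025, 1, 3, 17, 30)),
    Deployment("release-B", datetime(2025, 1, 10, 18, 7)),
]
incident = Event(datetime(2025, 1, 10, 18, 47), "503s from payments API", "infra")

suspect = map_to_release(incident, deployments)
print(suspect.release, in_regression_window(incident, suspect))
```

An alert 40 minutes after deployment falls outside a strict 30-minute window, so the window works better as a filter for regression candidates than as a cutoff for release mapping.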

Implementation checklist

Check which data sources you already have connected in your team.

  • Jira - "environment" field or "production" tag configured
    Lets you filter bugs found in production down to the release.
  • Monitoring alerts land in one place (Slack / PagerDuty)
    Every production alert should leave a trace you can analyze later.
  • Support tickets linked to Jira or logged separately
    Without this you lose problems the customer reports directly - often the most serious ones.
  • Deployment history with exact dates and times
    Essential for attributing incidents to specific releases.
  • Smoke tests run automatically after every deployment
    They catch regressions in the first minutes - before they reach the customer.
  • A weekly incident review with type classification
    A 15-minute ritual that turns raw data into a categorized history.

How much does each escaped problem type cost?

Each type of escaped problem has a different cost profile - a different detection time, a different fix time, different people involved. Below are estimates based on the median from typical enterprise organizations. Your numbers will differ - but the proportions are surprisingly consistent.

Code - Application code defect
5-6h per incident · risk: medium
Dev: 2-3h analysis + fix; QA: 1h verification; DevOps: 1h hotfix deploy; PM: 0.5h coordination
The most common type. A well-defined fix process. Lower escalation cost.

Infra - Infrastructure / configuration problem
8-12h per incident · risk: high
DevOps: 3-5h diagnosis + fix; Dev: 1h support; QA: 1h environment verification; PM: 1h + client communication. Often: a rollback of the whole release
Harder to diagnose. Often requires a rollback - not just a fix.

Integration - External integration failure
8-16h per incident · risk: critical
Dev: 2-4h diagnosis + workaround; DevOps: 2h configuration; PM: 2-3h vendor communication. Often: SLA breach with an external vendor
Part of the problem sits with the vendor. Resolution time depends on an external SLA.

Regression - Post-deployment regression
7-10h per incident · risk: high
QA: 2h scope identification; Dev: 2-3h interaction analysis; DevOps: 2h rollback or hotfix. Often: impact on several features at once
Insidious - because "the previous version worked". Requires deeper root-cause analysis.

Average cost across all types: ~8h per single escaped problem
Most expensive type: Integration (8-16h + external SLA)
Most common type: Code (~55% of all cases)
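
The ~8h average can be roughly reproduced by weighting the midpoint of each cost range by that type's share of cases. A short Python sketch: the shares and ranges come from the article, but using range midpoints is an assumption made here for illustration, not necessarily how the figure was derived.

```python
# (share of cases, (low, high) hours per incident), as quoted in the article.
PROFILES = {
    "code": (0.55, (5, 6)),
    "infra": (0.20, (8, 12)),
    "integration": (0.15, (8, 16)),
    "regression": (0.10, (7, 10)),
}


def expected_hours(profiles):
    """Share-weighted hours per type, using the midpoint of each range (assumption)."""
    return {
        name: share * (low + high) / 2
        for name, (share, (low, high)) in profiles.items()
    }


def weighted_average_cost(profiles):
    """Average hours per escaped problem across all types."""
    total_share = sum(share for share, _ in profiles.values())
    return sum(expected_hours(profiles).values()) / total_share


def cost_shares(profiles):
    """Fraction of total expected hours consumed by each type."""
    hours = expected_hours(profiles)
    total = sum(hours.values())
    return {name: value / total for name, value in hours.items()}


average = weighted_average_cost(PROFILES)
shares = cost_shares(PROFILES)
print(f"{average:.1f}h per escaped problem")
print(f"integration: 15% of cases, {shares['integration']:.0%} of hours")
```

With these inputs the average comes out at about 7.7h, in line with the ~8h above, and integrations account for roughly 23% of the hours despite being 15% of cases.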

Data that says more than a single counter

Instead of one number “escaped bugs = 12” - two charts that give a completely different level of insight into what’s really going on.

[Chart: Escaped problems by type - quarterly trend, Q1-Q4 2025. Series: Code, Infra, Integrations, Regressions. Code shrinks faster - because it's better tested. Infra and integrations hold steady - they need different actions.]

[Chart: Cost by type - Q4 2025, in work hours. Integrations are only 15% of cases - but they consume disproportionately more time and budget.]
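
The data behind a trend chart like this is a count of events per quarter and type. A minimal Python sketch, assuming each logged event is a `(date, type)` pair; the helper names and the sample events are invented for illustration and are not the article's data.

```python
from collections import Counter
from datetime import date


def quarter(day: date) -> str:
    """Label a date with its calendar quarter, e.g. 'Q1 2025'."""
    return f"Q{(day.month - 1) // 3 + 1} {day.year}"


def trend(events):
    """Count escaped problems per (quarter, type)."""
    return Counter((quarter(day), problem_type) for day, problem_type in events)


# Invented sample events, only to show the shape of the output.
events = [
    (date(2025, 2, 3), "code"),
    (date(2025, 2, 17), "code"),
    (date(2025, 3, 9), "infra"),
    (date(2025, 11, 4), "integration"),
]

for (label, problem_type), count in sorted(trend(events).items()):
    print(label, problem_type, count)
```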

How to present this to the business

The number of escaped problems alone stops being enough once you have the type distribution and the cost of each. Here’s how to turn that data into a narrative.

Instead of: "we had 8 escaped bugs." Say: "we had 8 escaped problems - 5 code defects, 2 configuration problems and 1 integration failure. Total cost: about 68 hours. Infra and integrations need a separate strategy."

Sprint review: "This sprint we had 3 escaped problems: 2 code defects and 1 environment configuration problem. Cost: about 22 hours. The configuration problem was the most expensive - and we have a plan to not repeat it."

1:1 with EM: "Looking at the trend - code defects are dropping. But infra and integration problems hold at a steady level. That needs a different intervention than more testing - we need better environment parity and contract tests."

Board: "In Q4 we had 8 escaped problems at a total cost of about 68 work hours. For comparison - in Q1 there were 18 at about 160 hours. The biggest saving came from contract tests rolled out in Q2."
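
If events are already categorized and costed, the opening sentence of such a report can be generated instead of written by hand. A small Python sketch, assuming each event is a `(type, hours)` pair; the `summarize` function, the labels and the per-event hours are hypothetical, chosen so the sample adds up to the 22 hours of the sprint-review example.

```python
from collections import Counter

# Singular and plural labels per type (wording is an assumption).
LABELS = {
    "code": ("code defect", "code defects"),
    "infra": ("configuration problem", "configuration problems"),
    "integration": ("integration failure", "integration failures"),
    "regression": ("post-deployment regression", "post-deployment regressions"),
}


def summarize(events):
    """Turn (type, hours) pairs into a one-line report sentence."""
    if not events:
        return "We had no escaped problems."
    counts = Counter(problem_type for problem_type, _ in events)
    total_hours = sum(hours for _, hours in events)
    parts = []
    for problem_type, (singular, plural) in LABELS.items():
        n = counts.get(problem_type, 0)
        if n:
            parts.append(f"{n} {singular if n == 1 else plural}")
    if len(parts) > 1:
        breakdown = ", ".join(parts[:-1]) + " and " + parts[-1]
    else:
        breakdown = parts[0]
    noun = "problem" if len(events) == 1 else "problems"
    return (
        f"We had {len(events)} escaped {noun}: {breakdown}. "
        f"Total cost: about {round(total_hours)} hours."
    )


# Invented hours that sum to the 22h of the sprint-review example.
sprint = [("code", 6), ("code", 5), ("infra", 11)]
print(summarize(sprint))
```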

What the full taxonomy changes

Once you start categorizing escaped problems instead of just counting them, the conversation changes fundamentally. You stop saying how many and start saying what and why.

4 - types of escaped problems to track
Cost difference, code vs integration: 5-6h vs 8-16h per incident
35% - problems missed when you measure only bugs in the code
15 min - a weekly review is enough for full categorization

The client doesn't report a problem labeled "type: infrastructure". To them - and to your business - one thing matters: does it work. Measure everything that can stop working.

In the next article

Article four covers Issues per Release - a code-maturity metric that reshapes the conversation with the Engineering Manager. It doesn’t ask how many bugs you found - it asks how clean the code you received for testing was.

Spoiler: this is the metric that often reveals the problem lies not with QA but with the development process - and it gives you the data to have that conversation from a position of facts, not opinions.

Series: QA metrics the business wants to hear
  • 01
    Diagnosis, three pillars, five metrics, the QA → KPI mapping model
  • 02
    Formula, thresholds, historical data, seasonality, pitfalls
  • 03
    Escaped Bugs & Problems (you are here)
    Taxonomy, data collection, the cost of each type, how to report
  • 04
    How this metric reshapes the conversation with the Engineering Manager
  • 05
    Pinpointing problems, not just watching trends
  • 06
    Number of Releases - the context metric
    Why 3 bugs with 2 releases is a disaster, and with 15 - a success
  • 07
    Release Confidence Score step by step
    Three calculation models, rollout, concrete examples from practice
  • 08
    Storytelling with metrics - building a narrative
    How to turn a table of numbers into a business argument
  • 09
    3 anti-patterns that destroy QA credibility
    Too many metrics, no context, jargon - and how to avoid each