It starts like most outages do: with a tiny wobble no one can quite name yet. By the time someone does, DORA, the EU’s Digital Operational Resilience Act, has already set the script.

09:03 — The wobble
Nina, the on-call SRE, sees card auth latency creep from 180ms to 900ms. It’s payday Friday; chat pings multiply. She triggers the major-incident channel. The first question isn’t “Why?” It’s “What severity?” Under DORA, words map to actions. High means the clock starts, owners are assigned, and every step needs an artifact you can show later.
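What “words map to actions” can look like in practice is a small policy table the tooling reads. A minimal sketch in Python; the severity names, roles, and four-hour timer are illustrative assumptions, not text from DORA or this team’s stack:

```python
from datetime import datetime, timedelta, timezone

# Illustrative policy table: severities, roles, and timers are assumptions
# to be mapped onto your own classification rules.
SEVERITY_ACTIONS = {
    "high": {
        "page_roles": ["incident_lead", "comms", "vendor_mgmt"],
        "report_clock": timedelta(hours=4),  # hypothetical internal deadline
    },
    "medium": {
        "page_roles": ["incident_lead"],
        "report_clock": None,
    },
}

def open_incident(severity: str, summary: str) -> dict:
    """Start the clock, assign owners, and open the evidence trail."""
    policy = SEVERITY_ACTIONS[severity]
    started = datetime.now(timezone.utc)
    due = policy["report_clock"]
    return {
        "summary": summary,
        "severity": severity,
        "started_at": started.isoformat(),
        "owners": policy["page_roles"],
        "report_due": (started + due).isoformat() if due else None,
        "evidence": [],  # every later step appends an artifact here
    }

incident = open_incident("high", "card auth latency 180ms -> 900ms, EU-West")
```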
09:07 — The call you didn’t want
Mateo from Vendor Management joins. The payment gateway’s EU-West region is unhealthy; failover is supposed to be automatic. “Supposed to” doesn’t count today. Nina opens the runbook titled Auth Spiral — Cloud Region Degradation. It’s no longer a dusty PDF. It’s code, dashboards, comms templates—the whole muscle memory.
09:12 — Paper meets product
DORA’s boring lines come alive: identify, protect, detect, respond, recover. The team speaks in verbs. Detect: confirm blast radius. Respond: rebalance to EU-North; slow retry storms with back-pressure. Recover: drain queues, verify reconciliations. Meanwhile, Comms drafts the external note. There’s no heroism in silence anymore; transparency is a requirement with timers.
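“Slow retry storms with back-pressure” is the most code-shaped verb in that list. One common way to do it is capped exponential backoff with full jitter; a minimal sketch, where `request_fn` and `TransientError` are stand-ins rather than anything from the team’s actual runbook:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a timeout or 5xx from the degraded gateway region."""

def call_with_backpressure(request_fn, max_attempts=5, base_delay=0.2, cap=10.0):
    """Capped exponential backoff with full jitter: each client waits a
    random slice of a growing window, so retries spread out instead of
    synchronizing into the storm the runbook is trying to prevent."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
```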
09:37 — The drop
Latency returns to baseline. Finance asks if payouts were delayed. Ops confirms: some merchants saw 11–17 minutes of lag; nothing lost, everything traced. The evidence pack grows: screenshots, timestamps, runbook diffs, who-did-what-when. Under DORA, if you can’t show it, it didn’t happen.
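An evidence pack that grows itself needs an append-only trail, not a shared doc someone remembers to update. A minimal sketch, assuming a JSONL file as the system of record; in practice this is your ticketing system or SIEM, and the dashboard link is illustrative:

```python
import json
from datetime import datetime, timezone

def record_evidence(path: str, actor: str, action: str, artifact: str) -> None:
    """Append one who-did-what-when entry to an append-only JSONL trail."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "artifact": artifact,  # link to a screenshot, dashboard, or runbook diff
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

record_evidence("evidence.jsonl", "nina", "rebalanced traffic to EU-North",
                "grafana://auth-latency?from=09:03&to=09:37")
```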
10:02 — The after
The incident lead shifts from “fix it” to “prove it.” Classification, impact, root cause hypothesis, mitigations, partner notifications. Legal checks the threshold for “major incident” reporting. The checklist is calm, almost clinical. That’s the point: when it’s chaos outside, procedure is the life raft.
11:40 — The vendor mirror
A cloud architect from the gateway joins. Mateo is polite but precise: “Share your detection time, time-to-mitigation, time-to-recovery; last TLPT findings on this path; RTO/RPO for our slice; sub-outsourcers that touched it.” A year ago, this would have felt awkward. Today, it’s just table stakes. DORA didn’t only regulate you; it pulled your suppliers into the light.
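The numbers Mateo asks for are not new telemetry; they fall out of timestamps the incident record already holds. A minimal sketch; the 08:58 onset and the date are illustrative, only the 09:03, 09:12, and 09:37 marks come from the timeline above:

```python
from datetime import datetime

def incident_metrics(onset, detected, mitigated, recovered):
    """The three durations every vendor review now starts with."""
    return {
        "time_to_detect": detected - onset,
        "time_to_mitigate": mitigated - detected,
        "time_to_recover": recovered - detected,
    }

m = incident_metrics(
    onset=datetime.fromisoformat("2025-01-31T08:58"),      # first bad datapoint
    detected=datetime.fromisoformat("2025-01-31T09:03"),   # Nina sees the creep
    mitigated=datetime.fromisoformat("2025-01-31T09:12"),  # rebalance to EU-North
    recovered=datetime.fromisoformat("2025-01-31T09:37"),  # back to baseline
)
# time_to_detect: 0:05:00, time_to_mitigate: 0:09:00, time_to_recover: 0:34:00
```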
13:15 — The test you schedule on purpose
The CTO green-lights a game-day: simulate the same failure next sprint and record the metrics. The board doesn’t want traffic-light colors; it wants trends. Detection median down from 9 to 6 minutes. Blast radius cut in half. Evidence captured automatically. “Resilience is a feature,” the COO says. No one smirks anymore.
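Scheduling the test on purpose means scripting it. A minimal harness sketch; `inject_fault` and `alert_fired` are hypothetical hooks into whatever chaos tooling and alerting API you actually run:

```python
import time

def run_gameday(inject_fault, alert_fired, timeout_s=900, poll_s=5):
    """Inject the rehearsed fault, then poll until the alert fires.
    The return value is the detection time the board wants trended;
    a timeout is not a failed drill, it is the most important finding."""
    start = time.monotonic()
    inject_fault()
    while not alert_fired():
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("alert never fired within the drill window")
        time.sleep(poll_s)
    return time.monotonic() - start  # seconds from injection to detection
```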
14:30 — The contract you wish you had last year
Procurement pushes a DORA annex: reporting SLAs, forensic access, sub-vendor transparency, exit support, portability drills. This is where architecture meets law. Redundancy isn’t just more servers; it’s obligations that hold when the internet forgets to be polite.
16:00 — The quiet metric
Finance updates a number only they watch: churn risk after incidents. When outages come with fast detection, plain-language comms, and honest post-mortems, complaint volume drops. Compliance, strangely, becomes part of retention.
What changed (and why it sticks)
- One language for risk. Risk, engineering, and legal use the same severity matrix and the same evidence trail.
- Incidents as rehearsals. Game-days and TLPT (threat-led penetration testing) are not “someday.” They go in the sprint like any backlog item.
- Vendors as extensions of you. If they fail, you answer; if you answer, you need their proof. Contracts now read like runbooks.
- Boards ask better questions. Not “are we compliant?” but “how fast did we detect, contain, and recover—trendline, please.”
Your next 90 days (the story you’ll be proud to tell)
- Map the truth. Inventory critical services down to regions and queues; tag owners, RTO/RPO, and signals (a sketch of one such record follows this list).
- Automate the trail. If your timeline is a manual doc, it will be wrong. Pipe alerts, actions, and notes into a system of record.
- Renegotiate reality. Add DORA clauses to key suppliers; request their last-year evidence and store it—emails don’t count.
- Practice, on purpose. Run one live failover drill and one TLPT scoping exercise; publish before/after metrics to the board.
- Make it human. Write outage comms in plain language; add a merchant-first FAQ. Clarity reduces tickets more than any chatbot.
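The first item, mapping the truth, pays off fastest when each critical service is a structured record rather than a wiki page. A minimal sketch; the field names and the card-auth values are assumptions for illustration, not a schema DORA mandates:

```python
from dataclasses import dataclass

@dataclass
class CriticalService:
    """One row of the inventory: owners, recovery objectives, and health
    signals become queryable instead of tribal knowledge."""
    name: str
    owner: str            # the team paged first when it wobbles
    regions: list[str]    # e.g. ["eu-west", "eu-north"]
    queues: list[str]     # async paths that can silently lag
    rto_minutes: int      # recovery time objective
    rpo_minutes: int      # recovery point objective
    signals: list[str]    # dashboards and alerts that prove health

card_auth = CriticalService(
    name="card-auth",
    owner="payments-sre",
    regions=["eu-west", "eu-north"],
    queues=["auth-retry", "reconciliation"],
    rto_minutes=15,
    rpo_minutes=0,
    signals=["grafana://auth-latency", "alert://auth-p99-over-500ms"],
)
```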