Debugging an API Performance Crisis
Level: Principal

Saturday, January 17, 2026

Lead the investigation and resolution of a critical production incident where P95 latency spiked 40x after a deployment.

Incident Response · Database Performance · Query Optimization · Lock Contention · Connection Pool Management · Production Debugging · Postgres Internals

The Situation

You're interviewing for a Principal Engineer role at a fintech company. During the interview, they present you with a real production incident they recently experienced.

The Situation:

  • A critical API endpoint (/api/transactions) suddenly started experiencing severe performance degradation
  • P95 latency went from 200ms to 8 seconds, degrading gradually over 2 hours as the connection pool saturated
  • The endpoint serves transaction history for mobile and web clients (queries by user_id with optional category filters)
  • Traffic volume is normal (no spike detected)
  • The degradation started after a routine deployment
  • The on-call team rolled back the application code, but the database migration was NOT rolled back (the new index remains)
  • Database CPU is at 40% (normal), but I/O wait is elevated at 35% and active connections are at 95% of pool capacity
  • No errors in logs, just slow responses (requests are queuing, not failing); a wait-state sketch follows this list
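
Before forming hypotheses, it is worth confirming at the database what those near-saturated connections are actually waiting on. A minimal diagnostic sketch, assuming Postgres 10+ (for the wait_event columns) and direct access to the primary; everything referenced here is a standard system view, nothing specific to this team's setup:

```sql
-- Group sessions by state and wait event to see whether the pool is stuck
-- on I/O, on locks, or on something else entirely.
SELECT state,
       wait_event_type,
       wait_event,
       count(*)                 AS sessions,
       max(now() - query_start) AS longest_running
FROM   pg_stat_activity
WHERE  datname = current_database()
GROUP  BY state, wait_event_type, wait_event
ORDER  BY sessions DESC;
```

A pile of sessions waiting on IO would line up with the elevated I/O wait; a pile waiting on Lock would point the investigation somewhere else.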

The deployment that was rolled back included:

  • A new feature to filter/display transactions by category (adds a WHERE clause on transaction_category)
  • A bug fix for date filtering (changed date range query logic)
  • A dependency update (Rails 6.1 → 7.0), a major version upgrade
  • Database migration to add an index on transaction_category (on a 2TB+ transactions table)

Critical Detail: The index creation on the 2TB table took 45 minutes to complete during the deployment window. The new feature's queries use this index via the category filter.
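
Given that detail, one early check is whether the surviving index changed the plan that the endpoint's core query now gets. A hedged sketch: the table and column names come from the scenario, but the exact query shape behind /api/transactions, the created_at column, and the literal values are assumptions:

```sql
-- Hypothetical shape of the transaction-history query; compare the current
-- plan and buffer counts against any pre-incident baseline that exists.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   transactions
WHERE  user_id = 12345                                -- hypothetical user id
  AND  created_at >= now() - interval '90 days'       -- assumed date filter
ORDER  BY created_at DESC
LIMIT  50;
```

If the plan, row estimates, or buffer reads look materially different from how this query behaved before the deployment, the migration that survived the rollback becomes the prime suspect.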

The team is stuck. They've rolled back the code, but the problem persists. Users are complaining, and the business is losing money.

💭 The key clue here is that the rollback didn't fix the issue. What does this tell you about where to look?

1. Incident Assessment (5 min)

Before diving into technical debugging, establish incident command and assess the situation.
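
If part of that first assessment is a quick look at the database, one cheap data-gathering step (assuming the pg_stat_statements extension is installed, which the scenario does not state) is to see which statements are actually slow right now:

```sql
-- Top statements by mean execution time. Column names are for Postgres 13+;
-- older versions expose mean_time / total_time instead.
SELECT calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 0) AS total_ms,
       left(query, 80)                    AS query_head
FROM   pg_stat_statements
ORDER  BY mean_exec_time DESC
LIMIT  10;
```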

💭 Think about this first: What's your first action when you arrive at this incident?

2. Root Cause Investigation (20 min)

Use the available clues to form and test hypotheses about the root cause. Multiple factors could be at play—systematically investigate each.
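
Several hypotheses can be tested cheaply and directly against the database. A sketch of three such checks, using only standard system views plus the transactions table name from the scenario:

```sql
-- Hypothesis: lock contention. Who is blocked, and by whom? (Postgres 9.6+)
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM   pg_stat_activity blocked
JOIN   pg_stat_activity blocking
  ON   blocking.pid = ANY (pg_blocking_pids(blocked.pid));

-- Hypothesis: planner statistics on the 2TB table went stale around the
-- migration, so query plans regressed.
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'transactions';

-- Hypothesis: the new index is influencing plans. Is it being scanned at all?
SELECT indexrelname, idx_scan, idx_tup_read
FROM   pg_stat_user_indexes
WHERE  relname = 'transactions';
```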

💭 Think about this first: The code rollback didn't fix it, but the database migration wasn't rolled back. What are your hypotheses?
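
One migration-specific check worth having on that hypothesis list: a long index build that failed or was retried (particularly with CREATE INDEX CONCURRENTLY) can leave an invalid index behind, which adds write overhead without helping reads. Whether that applies here is unknown; this sketch only shows how to rule it out:

```sql
-- List every index on the transactions table with its validity flags.
SELECT c.relname AS index_name,
       i.indisvalid,
       i.indisready
FROM   pg_index i
JOIN   pg_class c ON c.oid = i.indexrelid
JOIN   pg_class t ON t.oid = i.indrelid
WHERE  t.relname = 'transactions';
```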

3. Resolution & Prevention (15 min)

Apply the appropriate fix based on diagnosis and establish measures to prevent recurrence.
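
Whatever the diagnosis turns out to be, the plausible fixes are small and scriptable. The sketch below lists common candidates; each is conditional on what the investigation actually found, and the index names are assumptions since the scenario never states them:

```sql
-- If plans regressed because statistics went stale around the bulk index build:
ANALYZE transactions;

-- If the new single-column index is doing more harm than good, drop it
-- without blocking reads or writes (index name assumed):
DROP INDEX CONCURRENTLY IF EXISTS index_transactions_on_transaction_category;

-- If the category feature ships again, a composite index matching the real
-- access pattern (user_id first, then category) is a plausible better shape:
CREATE INDEX CONCURRENTLY index_transactions_on_user_id_and_transaction_category
    ON transactions (user_id, transaction_category);
```

Note that DROP INDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so a Rails migration issuing them needs disable_ddl_transaction!.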

💭 Think about this first: Based on your diagnosis, what's the fix? And how do you prevent this class of issue?
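
For the prevention half of that question, one class of guard-rail is making large schema changes and runaway queries unable to take the endpoint down. A sketch under assumed names and illustrative values, not this team's actual configuration:

```sql
-- How the original migration could have been run more defensively: fail fast
-- on locks instead of queuing, and never block the table while building.
SET lock_timeout = '5s';
CREATE INDEX CONCURRENTLY IF NOT EXISTS index_transactions_on_transaction_category
    ON transactions (transaction_category);

-- Refresh planner statistics immediately after large physical changes rather
-- than waiting for autoanalyze to reach a 2TB table.
ANALYZE transactions;

-- Cap runaway queries at the application role so one bad plan cannot saturate
-- the connection pool for hours (role name assumed).
ALTER ROLE app_user SET statement_timeout = '5s';
```

Process measures matter at least as much: load-test migrations against a production-sized replica, capture EXPLAIN baselines for the hottest queries, and keep index changes reversible independently of code deploys.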