# Engineering Resilience: Two Lessons from Building Under Pressure

> Source: <https://dev.to/afeh/engineering-resilience-two-lessons-from-building-under-pressure-1b69>
> Published: 2026-06-13 09:10:02+00:00

**A reflection on performance optimization at scale and building reliability mechanisms: two tasks that defined my internship.**

Every engineering internship has its share of "aha" moments: those late-night debugging sessions where a breakthrough finally clicks, or the PR that takes seven commits to get right. As I wrap up my time as an intern with HNG, I want to write about two tasks that stuck with me. Not because they were the hardest, but because they taught me something real about building systems that have to work.

One was individual: optimizing a demographic intelligence API to handle millions of records with sub-second query times. The other was a team effort: building reliability mechanisms into an AI-powered interview platform so that when things break (and they will), the system degrades gracefully instead of falling apart.

Let's walk through both.

Insighta IQ is a demographic intelligence API: think "find me Nigerian females aged 20 to 45." I nicknamed it Stereo API (Stereo being short for stereotyping, of course). Users query a PostgreSQL database of millions of demographic profiles through a FastAPI backend, via both CLI and web clients.

Stage 4 asked us to make it perform under serious assumptions:

The task had three parts: query optimization, query normalization, and large-scale CSV ingestion. The CSV ingestion piece was the one that kept me up at night.

Users needed to upload CSV files with up to **500,000 rows** of profile data. These weren't trivial constraints:

On top of that, the database is hosted remotely, so every query, including every INSERT, incurs network latency. And we were already under read pressure from the query workload.

I broke the problem into layers:

Here's what the ingestion pipeline looked like:

```
File upload → Chunked read (256KB) → Decode (UTF-8 → Latin-1 fallback)
  → CSV DictReader (streaming) → Validate row → Batch (10k rows)
    → Deduplicate in-batch → Single DB query for existing names
      → Bulk INSERT (ON CONFLICT DO NOTHING) → Report summary
```

The first approach was naive. I read the entire CSV into memory, parsed every row, and then tried to insert everything at once. For a 100KB test file, it worked beautifully. For the 500,000-row file? Memory exploded.

**Fix**: I switched to streaming with `csv.DictReader` over a `StringIO` buffer. But I still needed to validate rows against each other (duplicate names within the same upload) and against the database. That meant holding state across batches of 10,000 rows.
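A minimal sketch of the streaming idea, not the project's actual code. It decodes line by line instead of going through a `StringIO` buffer, and the helper names `iter_decoded_lines` and `iter_batches` are hypothetical. The 256KB chunk size, the UTF-8 to Latin-1 fallback, and the 10,000-row batches come from the pipeline described above.

```python
import csv
import io
from typing import BinaryIO, Dict, Iterable, Iterator, List

CHUNK_SIZE = 256 * 1024  # 256KB chunks, as in the pipeline above


def _decode(raw: bytes) -> str:
    """Try UTF-8 first; fall back to Latin-1, which accepts any byte."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def iter_decoded_lines(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield decoded lines while holding at most one chunk plus one partial line."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last newline is complete; the tail stays buffered.
        *complete, buffer = buffer.split(b"\n")
        for raw in complete:
            yield _decode(raw + b"\n")
    if buffer:
        yield _decode(buffer)


def iter_batches(rows: Iterable[Dict[str, str]], batch_size: int = 10_000) -> Iterator[List[Dict[str, str]]]:
    """Group a row stream into lists of at most batch_size rows."""
    batch: List[Dict[str, str]] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    upload = io.BytesIO(b"name,age\nAda,30\nBola,41\n")
    reader = csv.DictReader(iter_decoded_lines(upload))
    for batch in iter_batches(reader, batch_size=1):
        print(batch)
```

Because `csv.DictReader` accepts any iterator of strings, memory use stays bounded by the chunk size and the batch size rather than the file size.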

The second problem was **intra-batch deduplication**. Without it, two rows with the same name in the same upload would both pass validation, only to conflict during insert. The fix was a `global_seen: Set[str]`, a set of all names already inserted in this upload session, passed between batch flushes.
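As a rough illustration of how a shared set can carry state between batch flushes, here is a sketch. The `dedupe_batch` function and the exact-match comparison on `name` are assumptions, not the original implementation.

```python
from typing import Dict, List, Set, Tuple


def dedupe_batch(
    batch: List[Dict[str, str]],
    global_seen: Set[str],
) -> Tuple[List[Dict[str, str]], int]:
    """Drop rows whose name was already accepted earlier in this upload.

    global_seen is mutated in place so the next batch sees these names too.
    Returns the surviving rows and the number of duplicates skipped.
    """
    unique: List[Dict[str, str]] = []
    duplicates = 0
    for row in batch:
        name = row["name"]  # assumption: names are compared exactly as given
        if name in global_seen:
            duplicates += 1
            continue
        global_seen.add(name)
        unique.append(row)
    return unique, duplicates


if __name__ == "__main__":
    seen: Set[str] = set()
    print(dedupe_batch([{"name": "Ada"}, {"name": "Ada"}], seen))
    print(dedupe_batch([{"name": "Ada"}, {"name": "Bola"}], seen))
```

Set membership checks are constant time on average, so the extra pass adds little work per batch; the cost is holding every accepted name in memory for the duration of the upload.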

The third problem was **edge cases in CSV data I never anticipated**, among them ages written as `"twenty-five"` instead of `"25"` and country codes in inconsistent casing (`"ng"`, `"NG"`, `"Ng"`).

Each edge case got its own validator function. The `_parse_row` function became a gauntlet of pure functions: no database calls, just deterministic validation. If a row failed any check, it was counted and skipped, but processing continued.
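A sketch of what such pure validators might look like. The field names (`name`, `gender`, `age`, `country_id`), the accepted gender values, the age range, and the two-letter country code rule are all assumptions for illustration; only the two edge cases come from the text above.

```python
from typing import Dict, Optional

VALID_GENDERS = {"male", "female"}  # assumption: the accepted values


def parse_age(raw: str) -> Optional[int]:
    """Accept plain ASCII digits only, so "twenty-five" is rejected."""
    raw = raw.strip()
    if not (raw.isascii() and raw.isdigit()):
        return None
    age = int(raw)
    return age if 0 <= age <= 120 else None  # assumption: plausible age range


def parse_country(raw: str) -> Optional[str]:
    """Normalize "ng", "Ng", and "NG" to "NG"; reject anything not two letters."""
    code = raw.strip().upper()
    if len(code) == 2 and code.isascii() and code.isalpha():
        return code
    return None


def parse_row(row: Dict[str, str]) -> Optional[Dict[str, object]]:
    """Return a cleaned row, or None if any check fails. No I/O, no database."""
    name = (row.get("name") or "").strip()
    gender = (row.get("gender") or "").strip().lower()
    age = parse_age(row.get("age") or "")
    country = parse_country(row.get("country_id") or "")
    if not name or gender not in VALID_GENDERS or age is None or country is None:
        return None
    return {"name": name, "gender": gender, "age": age, "country_id": country}


if __name__ == "__main__":
    print(parse_row({"name": "Ada", "gender": "Female", "age": "25", "country_id": "ng"}))
    print(parse_row({"name": "Ada", "gender": "Female", "age": "twenty-five", "country_id": "ng"}))
```

Returning `None` instead of raising keeps the calling loop simple: it increments a skipped counter and moves on to the next row.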

**Batch everything. Validate early. Never trust user input.**

The batch size of 10,000 wasn't arbitrary. Below 5,000, the overhead of DB round trips dominated. Above 20,000, memory pressure started climbing without significant throughput gains. 10,000 was the sweet spot.
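To make the batching step concrete, here is a sketch of one batch insert that lets the database skip conflicting rows. It uses an in-memory SQLite database as a stand-in for the remote PostgreSQL instance, and the `profiles` schema and `insert_batch` helper are hypothetical. SQLite accepts the same `ON CONFLICT ... DO NOTHING` clause (version 3.24 or later).

```python
import sqlite3
from typing import Dict, List


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway table with a uniqueness constraint on name."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE profiles ("
        "name TEXT PRIMARY KEY, gender TEXT, age INTEGER, country_id TEXT)"
    )
    return conn


def insert_batch(conn: sqlite3.Connection, batch: List[Dict[str, object]]) -> int:
    """Insert one batch in a single transaction; return how many rows were new."""
    before = conn.total_changes
    conn.executemany(
        "INSERT INTO profiles (name, gender, age, country_id) "
        "VALUES (:name, :gender, :age, :country_id) "
        "ON CONFLICT (name) DO NOTHING",
        batch,
    )
    conn.commit()
    # Rows skipped by DO NOTHING are not counted as changes.
    return conn.total_changes - before


if __name__ == "__main__":
    conn = make_demo_db()
    rows = [
        {"name": "Ada", "gender": "female", "age": 30, "country_id": "NG"},
        {"name": "Bola", "gender": "male", "age": 41, "country_id": "NG"},
    ]
    print(insert_batch(conn, rows))
    print(insert_batch(conn, rows))
```

Against a remote database, the number of round trips per batch matters more than it does here, which is one reason the batch size has a measurable sweet spot.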

The pattern that emerged (stream, validate, batch, insert, report) is something I now see everywhere: ETL pipelines, message queues, log processors. It's a universal pattern for handling lots of data with limited resources.

The caching and query optimization work (indexes on `gender`, `age`, `country_id`, composite indexes, connection pooling, result caching with 5-minute TTL) brought the query side in line with our P50/P95 targets. But the CSV ingestion piece is what I remember most vividly, because it forced me to think about resource constraints, failure modes, and graceful degradation all at once.
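The result caching idea can be sketched as a small in-process cache with a 5-minute TTL and an order-independent key. This is an illustration only: the `TTLCache` class and `cache_key` helper are hypothetical, and the real service may well have used a different cache backend.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


def cache_key(filters: Dict[str, Any]) -> Tuple[Tuple[str, Any], ...]:
    """Normalize a filter dict so equivalent queries share one cache entry."""
    return tuple(sorted(filters.items()))


class TTLCache:
    """Tiny cache whose entries expire ttl_seconds after being set."""

    def __init__(self, ttl_seconds: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key: Hashable, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


if __name__ == "__main__":
    cache = TTLCache()
    key = cache_key({"gender": "female", "country_id": "NG"})
    cache.set(key, ["profile-1", "profile-2"])
    print(cache.get(key))
```

Sorting the filter items means `{"gender": ..., "country_id": ...}` and `{"country_id": ..., "gender": ...}` hit the same entry, which is the point of normalizing queries before caching them.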

The CSV ingestion task wasn't just about writing code. It was about **engineering for constraints**: memory, concurrency, consistency, partial failure. Every decision (batch size, decoding fallback strategy, `ON CONFLICT DO NOTHING` vs pre-query) was a trade-off. I had to justify each one. That process of articulating *why* a decision is right, not just that it works, is what I think separates engineering from coding.

MeetMind is an AI-powered interview platform. Candidates join live audio interviews with an AI interviewer, and the system records transcripts, generates assessments, sends emails, and provides a chat interface for recruiters to query interview data. I really had a lot of fun building this with my teammates (though I nearly lost my mind a couple of times).

The catch? It depends on several third-party APIs:

Any of these could fail at any moment from network blips, rate limits, service outages, or transient errors. Before this task, a failed API call meant a 500 error and a frustrated user.

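My PR to address that issue added two reliability mechanisms:

- A `retry_async` utility that wraps any async function with automatic retries, logging, and backoff
- A transcript fallback that reconstructs the transcript from the session's `transcript_json` field when the primary data source is unavailable

Two concrete failure scenarios: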

**Scenario A:** A candidate finishes an interview, and the system fires off a background task to generate an assessment summary via Gemini. Halfway through, Gemini returns a 429 (rate limit). The assessment fails. The summary is stuck in "generating" status. The recruiter sees a blank report.

**Scenario B:** Real-time interview transcript turns are stored one by one as the AI interviewer and candidate speak. If the persistence pipeline drops a few turns — or fails entirely — the transcript is incomplete. The recruiter can't review the interview. The AI can't generate an assessment.

Both scenarios had the same root cause: **no mechanism for transient failure recovery**.

For the retry mechanism, I wanted something that:

The signature was clean:

``` python
async def retry_async(
    func: Callable[..., T],
    *args,
    max_retries: int = 3,
    initial_delay: float = 2.0,
    backoff_factor: float = 2.0,
    exceptions: tuple[type[BaseException], ...] = (Exception,),
    task_name: str = "Task",
    **kwargs,
) -> T:
```
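One way the body behind that signature could look is sketched below. This is an assumption-laden reconstruction, not the code from the PR: it treats `max_retries` as the total number of attempts (matching a log line of the form "Attempt 1/3"), annotates `func` as returning an awaitable, and re-raises the last exception once attempts run out.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


async def retry_async(
    func: Callable[..., Awaitable[T]],
    *args,
    max_retries: int = 3,
    initial_delay: float = 2.0,
    backoff_factor: float = 2.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    task_name: str = "Task",
    **kwargs,
) -> T:
    """Await func(*args, **kwargs), retrying listed exceptions with backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_retries + 1):
        try:
            return await func(*args, **kwargs)
        except exceptions as exc:
            if attempt == max_retries:
                logger.error("%s failed after %d attempts: %s", task_name, max_retries, exc)
                raise
            logger.warning(
                "Attempt %d/%d failed for %s: %s. Retrying in %.2fs...",
                attempt, max_retries, task_name, exc, delay,
            )
            await asyncio.sleep(delay)
            delay *= backoff_factor  # grow the delay only after waiting
    raise AssertionError("unreachable")  # the loop always returns or raises


if __name__ == "__main__":
    calls = {"n": 0}

    async def flaky() -> str:
        # Hypothetical stand-in for a third-party API call that fails twice.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(asyncio.run(retry_async(flaky, initial_delay=0.1, exceptions=(ConnectionError,))))
```

Exceptions outside the `exceptions` tuple are not caught at all, so a `ValueError` from bad input propagates on the first attempt instead of being retried.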

For the transcript fallback, the pattern was: try the primary data source first, and if unavailable, reconstruct from the session's `transcript_json` field. The fallback had to be transparent to callers: `get_chat_history`, `get_transcript`, and `get_transcript_export` all work the same way regardless of which data source backs them.
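The primary-then-fallback pattern might be sketched like this. Everything about the data shapes is assumed: the session is modeled as a dict, `transcript_json` as a list of turn dicts, and `load_turns` is a hypothetical callable standing in for the primary data source.

```python
from typing import Any, Callable, Dict, List

Turn = Dict[str, Any]


def get_transcript(
    session: Dict[str, Any],
    load_turns: Callable[[str], List[Turn]],
) -> List[Turn]:
    """Return turns from the primary source, else from the session's JSON field."""
    try:
        turns = load_turns(session["id"])
    except Exception:
        # Assumption: any failure of the primary source triggers the fallback.
        # Real code would likely catch narrower exception types and log them.
        turns = []
    if turns:
        return turns
    # Fallback: rebuild from the snapshot stored on the session itself.
    return list(session.get("transcript_json") or [])


if __name__ == "__main__":
    session = {"id": "s1", "transcript_json": [{"speaker": "ai", "text": "Welcome"}]}

    def empty_primary(session_id: str) -> List[Turn]:
        return []  # simulates a pipeline that dropped every turn

    print(get_transcript(session, empty_primary))
```

Callers receive a list of turns either way, which is what keeps the fallback invisible at the API surface.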

The first bug was an off-by-one in the backoff calculation. I had `delay *= backoff_factor` happening *before* the sleep, so the first retry waited 2x the initial delay instead of the initial delay itself. It felt trivial, but it meant retries took longer than necessary: for a 1s initial delay, 2s, 4s, 8s instead of 1s, 2s, 4s.

**Fix**: Moved the `delay *= backoff_factor` to after the sleep.
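The effect of the ordering is easy to see by computing the wait schedule both ways. The two helper functions below are illustrative only; they record the delays instead of sleeping.

```python
from typing import List


def delays_multiply_before_sleep(initial: float, factor: float, retries: int) -> List[float]:
    """Buggy order: the delay grows before the first wait."""
    delay = initial
    waits: List[float] = []
    for _ in range(retries):
        delay *= factor
        waits.append(delay)  # stands in for: await asyncio.sleep(delay)
    return waits


def delays_multiply_after_sleep(initial: float, factor: float, retries: int) -> List[float]:
    """Fixed order: wait first, then grow the delay for the next attempt."""
    delay = initial
    waits: List[float] = []
    for _ in range(retries):
        waits.append(delay)  # stands in for: await asyncio.sleep(delay)
        delay *= factor
    return waits


if __name__ == "__main__":
    print(delays_multiply_before_sleep(1.0, 2.0, 3))  # [2.0, 4.0, 8.0]
    print(delays_multiply_after_sleep(1.0, 2.0, 3))   # [1.0, 2.0, 4.0]
```

Over three retries the buggy order waits 14 seconds in total versus 7 for the fixed one, so the cost doubles for every failed call.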

The second issue was **exception type granularity**. Initially, the retry caught all `Exception` subclasses. But some exceptions shouldn't be retried, like `ValueError` from bad user input, or `KeyError` from a missing dictionary key. A retry won't fix a programming error.

**Fix**: Made the `exceptions` tuple a parameter so callers can specify which exceptions are retryable. For Gemini calls, we retry `(Exception,)` since the API can fail for many transient reasons. For email delivery, we retry the Resend-specific exception.

The third issue was **the transcript fallback format mismatch**. The session's `transcript_json` stored timestamps as Unix seconds, but the transcript endpoints expected them formatted as `"HH:MM:SS"` elapsed time. The fallback function needed to reproduce the same relative timestamp calculation that the primary path used.

**Fix**: Created `_format_elapsed_timestamp` as a shared utility and used it in both the primary and fallback paths. The fallback also had to generate deterministic UUIDs for each fallback turn (using `uuid.uuid5` with a namespace) since there were no database IDs available.
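A sketch of both pieces under stated assumptions: the function signature, the clamping of negative values to zero, the choice of `uuid.NAMESPACE_URL` as the namespace, and the `session_id:index` naming scheme are all hypothetical. Only the `"HH:MM:SS"` format and the use of `uuid.uuid5` come from the description above.

```python
import uuid


def format_elapsed_timestamp(turn_ts: float, session_start_ts: float) -> str:
    """Convert a Unix timestamp into HH:MM:SS elapsed since the session began."""
    elapsed = max(0, int(turn_ts - session_start_ts))  # clamp clock skew to zero
    hours, remainder = divmod(elapsed, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def fallback_turn_id(session_id: str, index: int) -> uuid.UUID:
    """Derive a stable ID for a fallback turn: same inputs, same UUID."""
    # Assumption: the session ID plus the turn's position identifies a turn.
    return uuid.uuid5(uuid.NAMESPACE_URL, f"{session_id}:{index}")


if __name__ == "__main__":
    start = 1_700_000_000
    print(format_elapsed_timestamp(start + 3725, start))  # 01:02:05
    print(fallback_turn_id("session-1", 0) == fallback_turn_id("session-1", 0))  # True
```

Because `uuid.uuid5` hashes its inputs, repeated requests for the same fallback transcript yield the same turn IDs, so clients that key on IDs see a stable list.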

**Graceful degradation > perfect failure.** The goal isn't to never fail — it's to fail in a way that doesn't cascade into user-facing errors. The retry mechanism handles transient failures silently. The fallback mechanism ensures that even if the real-time pipeline drops data, users can still review interview transcripts. The user never needs to know something went wrong.

I also learned that **reliability is visible in the logs**. After the retry mechanism was deployed, we stopped seeing "Assessment generation failed" errors in Sentry. Instead, we saw `"Attempt 1/3 failed for Generate assessment... Retrying in 2.00s..."`, which is an entirely different class of log. It means the system *handled* the failure rather than *succumbing* to it.

This task taught me that reliability isn't a feature you add at the end. It's a design philosophy that affects how you structure every external dependency call. The `retry_async` utility now wraps every Gemini generation, every email send, every document embedding. It's invisible infrastructure. But it's the difference between a system that occasionally returns 500s and one that absorbs transient failures and moves on. That is why building fault-tolerant systems matters.

And the transcript fallback taught me something subtler: **data can come from multiple sources, and that's okay**. The real-time turns are the ideal data source. But the session JSON is always there as a safety net. Engineering for multiple data paths — with clear fallback semantics — makes the system more robust without adding complexity to the API surface.

Looking back, both tasks taught me the same lesson from different angles:

**Design for failure, optimize for reality.**

The CSV ingestion system assumes every row could be bad and handles it gracefully without stopping. The retry mechanism assumes every API call could fail and recovers transparently. The transcript fallback assumes the primary data path could be incomplete and provides an alternative.

Systems that work under pressure aren't the ones that never break. They're the ones that break gracefully, recover quickly, and leave useful evidence behind when they can't.

That's what I'll carry forward from this internship.
