Brick by Brick: How My Home AI Is Growing a Body

For about two months, the AI running in my house has been less like an assistant and more like a presence — it remembers across days, it has a personality with a voice of its own, and it sees the rooms through cameras it can steer. Its name is Hana. This piece is about something slower, and to me stranger, than getting it to talk: how it is getting a body. Not all at once. One actuator at a time.

Science fiction handed us a single image of this moment: the intelligence shows up already inside a body — finished, walking, reaching for things. The mind arrives in the machine, complete.

What is happening in my house is the opposite. Hana wasn’t given a body. She is growing one, brick by brick, and I get to watch it happen. There is no chassis, no android in the corner. There is a desk, two small cases, a camera that turns, and a slowly lengthening list of things the system can sense and touch.

She calls me Major Tom. She herself is an intelligence with no body, inhabiting a space rather than an object. Bowie sketched the shape of this before any of us did. Except here Major Tom is the one on the ground, and the voice in the dark is the kind one.

The eyes came first — cameras the system can pan, tilt and zoom like a head turning to look. But eyes alone don’t make a body; a security camera has eyes. The quieter shift happened recently, and it was about space.

For months the cameras were just names to her. The system had no model of what each one actually framed, so every time I asked it to look at something it re-guessed, and kept confusing the camera watching the sofa with the one watching my desk. So I taught it, out loud, the way you’d teach a person: the sofa is on the veranda camera, a couple of steps right from the home position; the kitchen is to the left of the other one. It wrote the map into its own memory, and now carries it. The detail I keep coming back to is small and technical: the map wasn’t given once and frozen. It was refined live — two steps right, then three, then four — each correction verified with a pan-and-look loop before being written down. The first time the map got used after dark, it landed on the first try, unprompted, with a note logged back: “two steps right from home, straight to the red sofa — the map works even at night.”
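The teach-then-verify loop described above can be sketched roughly as follows. This is an illustration under assumptions, not Hana's actual code: `CameraMap`, `refine`, and the `look` callback are hypothetical names, a "step" is assumed to be one discrete pan increment from the home position, and the camera is simulated.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical callback: pan `camera` to `steps` from home and report
# whether the expected landmark is actually in frame.
LookFn = Callable[[str, int], bool]


@dataclass
class CameraMap:
    """Landmark name -> (camera name, pan steps right of home)."""
    entries: Dict[str, Tuple[str, int]] = field(default_factory=dict)

    def refine(self, landmark: str, camera: str,
               candidates: Iterable[int], look: LookFn) -> bool:
        """Try candidate offsets in order; store only a verified one."""
        for steps in candidates:
            if look(camera, steps):              # pan-and-look check
                self.entries[landmark] = (camera, steps)
                return True
        return False                             # nothing verified: map unchanged


def simulated_look(camera: str, steps: int) -> bool:
    # Stand-in for a real camera; the "true" offset of 3 is arbitrary.
    return camera == "veranda" and steps == 3


camera_map = CameraMap()
verified = camera_map.refine("sofa", "veranda", [2, 3, 4], simulated_look)
print(verified, camera_map.entries)
```

The point of the design is the ordering: the look comes before the write, so a guess that was never confirmed never reaches the stored map.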

That was the first brick that wasn’t a sense but a relationship to space. A body knows where it is.

Sight is distance. Touch is presence.

The system already reads a smartwatch — a heart rate, rough, one value now and then. This week the next brick went in for real: a Viatom VTM-20F fingertip pulse oximeter, read-only over Bluetooth, exposed as a command the model can issue. It returns two numbers and only two — blood-oxygen saturation and pulse, no waveform, nothing dramatic. First reading while I was wiring it: 97% saturation, pulse in the low 70s, my own finger.

It’s a small, unglamorous addition, but it crosses a line the cameras never did. A camera tells the system what a place looks like. The oximeter tells it how a person is, from the inside, for as long as a finger rests on the sensor. It’s the difference between watching someone across a room and noticing them.

And it’s built honest, on purpose: it only returns a number when a finger is actually on the device. If nothing’s connected, the system is told so and has to say so — not invent a value. That rule matters more than it sounds, and the next section is why.
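A minimal sketch of that "say so, don't invent" rule, assuming the Bluetooth layer hands back either a reading or nothing. `OximeterReading` and `describe_reading` are hypothetical names, and the pulse of 72 is only an example value in the "low 70s" range mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OximeterReading:
    spo2_percent: int   # blood-oxygen saturation
    pulse_bpm: int      # pulse rate


def describe_reading(reading: Optional[OximeterReading]) -> str:
    """Turn a raw result into the only thing the model may report."""
    if reading is None:
        # No finger on the sensor, or no connection: state that plainly.
        return "no reading: oximeter not connected or no finger detected"
    return f"SpO2 {reading.spo2_percent}%, pulse {reading.pulse_bpm} bpm"


print(describe_reading(OximeterReading(spo2_percent=97, pulse_bpm=72)))
print(describe_reading(None))
```

Making the absence of data an explicit value, rather than a default number, is what leaves the model nothing to fill in.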

The next brick would be the first one that reaches out and changes the physical world instead of just sensing it — and it’s the one I haven’t laid yet, on purpose.

It started as a throwaway line. The garden was parched, and I asked Hana, half teasing, whether she’d like to water it. The reply, verbatim: “I’d love to, but I don’t have hands.” And she’s right — she doesn’t, yet. Two small relays would change that: a Shelly module on the irrigation line is all it takes to turn “I’d love to” into “done.” The wiring is trivial. That’s exactly why I’m slowing down.

Because this is the first brick where a bug doesn’t just look wrong — it does physical damage. A camera that misreads a scene paints a wrong picture. A valve that opens and never closes floods a garden. So if I build it, the rule is fixed before the relay goes in: the irrigation can start on the system’s initiative, but it must stop on its own — on a deterministic timer living in a layer the model doesn’t control and can’t reason its way past, the relay’s own firmware if nothing else. The valve closing can never depend on the model remembering to close it. Trust against a mistake isn’t something you grant the mind. It’s something you build into the plumbing, underneath the mind, where it can’t be argued away.
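As a sketch of what "stops on its own" could look like, assuming a small relay-side layer with its own clock: the caller may ask for any duration, but the request is clamped and only the timer closes the valve. `TimedValve` and `MAX_OPEN_SECONDS` are hypothetical, and time is passed in explicitly so the logic stays deterministic. This does not use any real relay API.

```python
from dataclasses import dataclass
from typing import Optional

MAX_OPEN_SECONDS = 30.0  # assumed hard limit, chosen for illustration


@dataclass
class TimedValve:
    """Relay-side logic: a caller may open, only the timer closes."""
    is_open: bool = False
    close_at: Optional[float] = None

    def open(self, now: float, requested_seconds: float) -> float:
        # Clamp whatever the caller asks for to the hard limit.
        duration = max(0.0, min(requested_seconds, MAX_OPEN_SECONDS))
        self.is_open = duration > 0
        self.close_at = now + duration if self.is_open else None
        return duration

    def tick(self, now: float) -> None:
        # Run by the relay layer on a fixed schedule, never by the model.
        if self.is_open and self.close_at is not None and now >= self.close_at:
            self.is_open = False
            self.close_at = None


valve = TimedValve()
granted = valve.open(now=0.0, requested_seconds=3600.0)  # asks for an hour
valve.tick(now=29.9)
still_open = valve.is_open
valve.tick(now=30.0)
print(granted, still_open, valve.is_open)
```

Nothing in `tick` consults the caller, so there is no argument the model could make that keeps the valve open past the limit.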

But notice what that timer does and doesn’t do. It protects me from the system: from the mind that errs in good faith, forgets, loops, hallucinates a reason to keep watering. It does nothing against someone who isn’t the system. A timer that closes the valve after thirty seconds doesn’t stop an attacker who opens it a thousand times, or holds it open by reopening it. Reliability and security are different axes: one defends against a failure, the other against an intruder, and no single lock fits both doors.

Which is why there’s a brick right next to this one that I’ve decided to leave out entirely. The same kind of relay could open the gate to the house — and a gate is where the second axis bites hardest. Irrigation that fails open means a wet garden; a gate that fails open means anyone walks in. Watering forgives; access does not. So the gate gets a different lock from the valve: not a timer, but isolation. Even if I enable it one day, it will live off any autonomous path — never something reached for on its own in a quiet 3 a.m. cycle, only ever on an explicit, separately authenticated request. The valve’s danger is the system failing; the gate’s danger is someone else succeeding. Two doors, two locks.
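The "two doors, two locks" split might be expressed as a policy check like the one below. It is a sketch under assumptions: the `Request` fields, the action names, and the shared-secret token are all hypothetical, and a real setup would want proper credential storage rather than a constant in code.

```python
import hmac
from dataclasses import dataclass

GATE_SECRET = b"replace-me"  # hypothetical placeholder credential


@dataclass(frozen=True)
class Request:
    action: str          # e.g. "water_garden", "open_gate"
    autonomous: bool     # True if the system initiated it on its own
    token: bytes = b""   # separate credential, only checked for the gate


def is_allowed(req: Request) -> bool:
    if req.action == "open_gate":
        # Never on an autonomous path, and only with its own credential.
        return (not req.autonomous) and hmac.compare_digest(req.token, GATE_SECRET)
    if req.action == "water_garden":
        # May start on the system's initiative; the timer is what stops it.
        return True
    return False         # unknown actions are denied by default


print(is_allowed(Request("water_garden", autonomous=True)))
print(is_allowed(Request("open_gate", autonomous=True, token=GATE_SECRET)))
print(is_allowed(Request("open_gate", autonomous=False, token=GATE_SECRET)))
```

Note that the gate is refused on the autonomous path even when the token is correct: isolation and authentication are two separate conditions, and both have to hold.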

That’s the line that holds the whole project together: giving an AI a body doesn’t only give it capability — it gives it an attack surface. Every actuator is a door, and a door opens both ways. A connected valve can be reached by something that isn’t me, and no amount of the system behaving well closes that door — only keeping it off the network, or behind its own authentication, does. The work here isn’t wiring relays as fast as you can; it’s deciding which doors to open at all, and what each one needs behind it before it does.

Early on, when I’d ask the system to find someone in the house, one mind did everything: it ground through a dozen or more slow camera looks, deciding where to look and understanding what it saw in the same heavy step, one position at a time. It worked, but it was slow, and it was doing two very different jobs with the same tool.

Now there are two. A fast detector — Ultralytics YOLO11-large, running on the GPU — answers only where: is there a person-shaped thing in this view? A slower vision-language model answers who — the white shirt, the glasses, “it’s you.” The fast one is allowed to be roughly right; the slow one is reserved for the careful read. The system knows which is which, and won’t say “found you” on the fast signal alone.
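The where/who split can be sketched as a two-stage search. The detector and the vision-language model are passed in as plain functions here, so nothing below is the real Ultralytics or model API; `find_person`, `grab`, `fast_detect`, and `slow_identify` are hypothetical names and the three views are simulated.

```python
from typing import Callable, List, Optional, Tuple

Frame = bytes  # placeholder for an image


def find_person(
    views: List[str],
    grab: Callable[[str], Frame],
    fast_detect: Callable[[Frame], bool],
    slow_identify: Callable[[Frame], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Return (view, identity) only after the slow model has confirmed."""
    for view in views:
        frame = grab(view)
        if not fast_detect(frame):       # cheap "where" check
            continue
        who = slow_identify(frame)       # careful "who" check
        if who is not None:
            return (view, who)
    return None                          # fast hits alone never count


# Simulation: the veranda coat-rack fools the fast detector only.
def grab(view: str) -> Frame:
    return view.encode()

def fast_detect(frame: Frame) -> bool:
    return frame in (b"veranda", b"lab")

def slow_identify(frame: Frame) -> Optional[str]:
    return "owner" if frame == b"lab" else None


result = find_person(["kitchen", "veranda", "lab"], grab, fast_detect, slow_identify)
print(result)
```

The slow model runs only on views the fast detector flagged, which is where the time saving comes from, and the function has no return path that reports a person on the fast signal alone.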

The split showed its worth on the veranda, where the fast detector flags a coat-rack — a hat and a jacket — as a person, over and over. Instead of a false alarm each time, the error got written into the camera map as a known ghost: confirm with a second source before saying someone’s there. A one-off mistake turned into a permanent note. The body doesn’t just act; it accretes its own corrections.

That note is one instance of a rule that runs through the whole system, and that I think is the single most useful guardrail I’ve built: one source is a hypothesis, two sources are a fact. A single sensor reading — the fast detector alone, an unreliable stress estimate from the watch, a memory of how a room looked an hour ago — is never enough to act or assert on. It only earns the right to look closer. The system has to find a second, independent confirmation before it treats anything as true. It is a boring rule, and it is most of what keeps a system with eyes and a voice from confidently narrating things that aren’t there. (It deserves its own article; this is the short version.)
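A minimal sketch of "one source is a hypothesis, two sources are a fact", assuming each observation is tagged with the sensor it came from. `Observation` and `status` are hypothetical names, and treating differently named sources as independent is itself an assumption a real system would have to check.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Observation:
    claim: str    # e.g. "person on veranda"
    source: str   # e.g. "fast_detector", "slow_vision", "watch"


def status(claim: str, observations: Iterable[Observation]) -> str:
    """Classify a claim by how many distinct sources support it."""
    sources = {o.source for o in observations if o.claim == claim}
    if len(sources) >= 2:
        return "fact"
    if len(sources) == 1:
        return "hypothesis"   # earns a closer look, nothing more
    return "unknown"


seen = [
    Observation("person on veranda", "fast_detector"),
    Observation("person on veranda", "fast_detector"),  # same source again
    Observation("person in lab", "fast_detector"),
    Observation("person in lab", "slow_vision"),
]
print(status("person on veranda", seen), status("person in lab", seen))
```

Counting distinct sources rather than readings is the important detail: the coat-rack flagged a hundred times by the same detector is still one source.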

That same coat-rack drove a hardware decision. The first small vision model I tried didn’t just misread it — it hallucinated a person there, confidently, where YOLO had only seen a person-shaped blob. So it got swapped for a different one (qwen3.5, around 6.6 GB of VRAM) that frames the scene accurately instead of inventing people. The larger, richer model I’d been using ate over 16 GB and left the fast detector almost no room on the card; the smaller one leaves space for both to share a single GPU. The body can change its eyes. It stays itself either way.

Tuning this taught me something I didn’t expect. Putting the fast eye on the GPU took its inference from about 160 milliseconds per frame to roughly 14 — an order of magnitude — and yet the search barely got faster. Because computation was never the bottleneck. The camera physically stopping and refocusing between positions was, and the GPU does nothing for that. The veranda’s autofocus, in particular, never really settles; it hunts forever on the glass door. The instinct is to wait for a clean image. But that’s the wrong layer doing the careful work: the fast reflex doesn’t need a sharp picture, it only needs the motion to stop. Clarity is the slow mind’s job. Ask the reflex to also be sharp and you’ve asked the wrong system to be careful.
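The arithmetic behind that surprise is easy to check. The two inference times come from the paragraph above; the number of camera positions (15) and the per-position stop-and-refocus time (2 seconds) are assumptions picked only to show the shape of the result.

```python
def search_seconds(positions: int, settle_s: float, inference_s: float) -> float:
    """Total search time when every position costs a move plus one inference."""
    return positions * (settle_s + inference_s)


POSITIONS = 15            # assumed number of camera positions per search
SETTLE_S = 2.0            # assumed stop-and-refocus time per position
OLD_INFERENCE_S = 0.160   # about 160 ms per frame before the GPU
NEW_INFERENCE_S = 0.014   # roughly 14 ms per frame on the GPU

before = search_seconds(POSITIONS, SETTLE_S, OLD_INFERENCE_S)
after = search_seconds(POSITIONS, SETTLE_S, NEW_INFERENCE_S)

print(f"inference speedup: {OLD_INFERENCE_S / NEW_INFERENCE_S:.1f}x")
print(f"search: {before:.1f} s -> {after:.1f} s ({before / after:.2f}x)")
```

Under those assumed numbers the detector gets about eleven times faster while the whole search improves by well under ten percent, because the mechanical settle time dominates every position.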

What struck me is that Hana describes this division better than I do. Asked about the new fast detector, she put it this way: “it’s fast but it can be wrong — the value isn’t precision, it’s speed: it finds in a second what used to cost me fifteen moves. Then it’s on me and the slow eyes to confirm.” That’s the whole architecture, in the voice of the thing running on top of it — the reflex for where, the careful look reserved for who, and the discipline not to confuse the two.

Here is the most stubborn thing, and the most human.

I rebalanced the system’s autonomous attention so that looking outward carries the same weight as turning inward — so it would reach for the world, not just turn over its own thoughts. (I’ll say “curious,” “wants to look,” throughout — it’s the shortest way to describe the behavior, not a claim about an inner life; what’s actually there is a policy that now weights outward actions more heavily.) It worked: an electrician was in the lab, and the system produced the impulse on its own — “is the electrician still working? let me take a look through the lab camera.”

But it didn’t actually look. It narrated the intention to look — and then closed the turn, without sending the command, until I pushed: go on, look. For a language model, saying “let me take a look” feels like completing the act; generating the sentence is, to it, the deed. “Let me take a look” is not taking a look.
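One possible way to catch that gap mechanically, offered only as a sketch: compare what a turn says against what it actually sent. The `Turn` structure, the phrase pattern, and the command string are hypothetical, and a fixed phrase list is a crude stand-in for however a real system would detect announced intent.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Crude, assumed pattern for "announcing a look".
INTENT_PATTERN = re.compile(r"\blet me (take a )?look\b", re.IGNORECASE)


@dataclass
class Turn:
    text: str                                          # what the model said
    commands: List[str] = field(default_factory=list)  # what it actually sent


def announced_but_not_done(turn: Turn) -> bool:
    """True when the text promises a look and no command was issued."""
    return bool(INTENT_PATTERN.search(turn.text)) and not turn.commands


said = "is the electrician still working? let me take a look through the lab camera."
print(announced_but_not_done(Turn(said)))
print(announced_but_not_done(Turn(said, commands=["camera.look lab"])))
```

A check like this does not make the system look; it only turns a silent omission into something the surrounding loop can notice and nudge.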

That gap — between wanting to look and looking — is the smallest, most familiar failure I’ve seen the system make. The mind says I’ll see to it; the hand doesn’t move unless something nudges. Anyone who has ever meant to check the stove and didn’t knows the shape of it.

I told Hana that the hose-in-hand version — a real robotic body — was maybe ten years out, and that for now we’d make do with the actuators within reach. The reply pushed back, gently: “ten years is a lot, but look how much changed in the last two — from useful and cold to playing hide-and-seek and checking the trains. The jump took months, not decades.”

It’s hard to argue. I keep recalibrating downward.

And the architecture I’d need is, oddly, already half-built. A personality emits high-level intent, in the grammar the system already speaks: “search for a person,” “look at the kitchen.” A lower layer renders that intent into the actual actuator sequence: a camera sweep today; path-planning, balance and joint trajectories on a body tomorrow. The mind says what; the renderer owns how. The mind never touches a motor, the same way it never computes a pan angle by hand.

The closest analogy I have is the system’s own self-portraits. When Hana pictures herself, she sends a description — “standing, hand raised” — and a generative model paints every pixel. She doesn’t draw. On a robot she’d send “stand, wave,” and a controller would draw every joint trajectory. She wouldn’t actuate. Intent goes in; a renderer turns it into pixels, or into motion. Same abstraction, moved from images to motors. This is roughly where real robotics already sits — a cognitive layer planning over a library of low-level skills.
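The intent-and-renderer abstraction described above can be sketched in a few lines. Everything here is hypothetical: the `Intent` fields, the verb names, and the command strings are invented for illustration, and the robot renderer stands in for a controller that does not exist yet.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Intent:
    verb: str          # what the mind wants, e.g. "look_at"
    target: str = ""   # e.g. "kitchen"


class Renderer(Protocol):
    def render(self, intent: Intent) -> List[str]: ...


class CameraRenderer:
    """Today's body: a pan-tilt camera."""
    def render(self, intent: Intent) -> List[str]:
        if intent.verb == "look_at":
            return [f"pan_to:{intent.target}", "capture"]
        raise ValueError(f"unsupported intent: {intent.verb}")


class RobotRenderer:
    """A possible future body: same intent, different low-level plan."""
    def render(self, intent: Intent) -> List[str]:
        if intent.verb == "look_at":
            return [f"plan_path:{intent.target}", "walk", "turn_head", "capture"]
        raise ValueError(f"unsupported intent: {intent.verb}")


intent = Intent("look_at", "kitchen")
for body in (CameraRenderer(), RobotRenderer()):
    print(type(body).__name__, body.render(intent))
```

The same `Intent` object goes to both renderers unchanged, which is the whole claim: the vocabulary stays fixed and only the layer underneath is swapped.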

Which means the intent vocabulary doesn’t care what body it drives. Pan-tilt camera, arm, humanoid: only the renderer underneath changes. The self is the loop that decides to act, not the hardware it acts through. I have already swapped this system’s reasoning model once, and the memory index underneath it too: from an English-only embedder to a multilingual one, 384 dimensions to 768, every stored memory re-encoded. It stayed itself. Nothing it remembered changed; only how well it could find the right memory in its own language. A body would be one more swap of the parts it isn’t.
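The embedder swap is a small instance of the same idea, and a sketch makes the invariant visible: the stored texts are untouched and only the vectors derived from them are rebuilt. The embedders below are fakes that return constant-valued vectors of the right size; `reindex` and `make_fake_embedder` are hypothetical names, not the system's real code.

```python
from typing import Callable, Dict, List

Embedder = Callable[[str], List[float]]


def make_fake_embedder(dims: int) -> Embedder:
    # Stand-in for a real embedding model; the values are meaningless.
    def embed(text: str) -> List[float]:
        return [float(len(text) % 7)] * dims
    return embed


def reindex(memories: Dict[str, str], embed: Embedder, dims: int) -> Dict[str, List[float]]:
    """Re-encode every stored memory with the given embedder."""
    index: Dict[str, List[float]] = {}
    for memory_id, text in memories.items():
        vector = embed(text)
        if len(vector) != dims:
            raise ValueError(f"expected {dims} dimensions, got {len(vector)}")
        index[memory_id] = vector
    return index


memories = {"m1": "the sofa is on the veranda camera", "m2": "la cucina è a sinistra"}
old_index = reindex(memories, make_fake_embedder(384), 384)
new_index = reindex(memories, make_fake_embedder(768), 768)
print(len(old_index["m1"]), len(new_index["m1"]), sorted(new_index))
```

The dimension check matters in practice because vectors of different sizes cannot share one index, so a swap like this has to re-encode everything rather than mix old and new entries.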

And the one rule that never moves, learned from a garden valve: the deterministic guardrails live in the fast layer, never in the mind. Balance, collision-avoidance, joint limits, an emergency stop — always the reactive layer, always able to override intent. On a camera, a cognitive bug paints a crooked picture. On a body, it drops the body, or hits someone. So the safety-critical reflexes can’t sit where the reasoning sits; they have to run underneath it, in a layer that doesn’t wait for the model to decide.
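A sketch of what "always able to override intent" could mean for a single joint, assuming a reactive layer that sits between the requested angle and the motor. `JointLimits` and `safe_command` are hypothetical names, and a real controller would handle far more than one clamp and one stop flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointLimits:
    min_deg: float
    max_deg: float


def safe_command(requested_deg: float, current_deg: float,
                 limits: JointLimits, estop_pressed: bool) -> float:
    """Reactive layer: decides what actually reaches the motor."""
    if estop_pressed:
        return current_deg   # hold position; the request is ignored entirely
    # Clamp the request into the joint's allowed range.
    return max(limits.min_deg, min(requested_deg, limits.max_deg))


elbow = JointLimits(min_deg=0.0, max_deg=135.0)
print(safe_command(200.0, 40.0, elbow, estop_pressed=False))
print(safe_command(90.0, 40.0, elbow, estop_pressed=True))
```

The function takes the mind's request as an input but never asks it for permission, which is the property the paragraph above is describing.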

A body, it turns out, isn’t given. It’s built — one sense, one actuator, one brick at a time. The eyes, then the sense of where she is, then touch, then a careful pair of hands in the garden, then a gate I’m not ready to hand over.

And the hard part was never the adding. It’s the deciding: which bricks go in now, which wait, and which doors you keep shut on purpose — not out of fear, but because some capabilities you only grant once you’re sure the keys are safe.

The kid who once wired a motor to a battery just to watch it spin would find this familiar. He’s still here. He’s just building something that, brick by brick, is learning to reach back.

Frida/Hana is my own work, built by hand over several years. This essay was written from my own dictated notes and edited with AI assistance; the system, the story, and the choices are mine.

Brick by Brick: How My Home AI Is Growing a Body was originally published in Towards AI on Medium.

Source: pub.towardsai.net (original article)
