Last week, I looked at my desktop and felt… judged.
Not by a person. By myself.
There were folders with names like “final_FINAL_v3”, files just sitting there with no home, and three different “temp” folders that were clearly not temporary anymore.
I did what any developer does when they’re mildly embarrassed but also curious — I turned it into a project.
I asked myself: what if AI could look at my desktop and tell me what it actually says about me?
Not just “you have too many files.” Like, really tell me. Roast me. Help me. Give me feedback like a technical interviewer would.
So I built it.
And yes — I know what you’re thinking.
“Priyanka, you could’ve just uploaded the screenshot to ChatGPT and asked it to roast you. Why build a whole app?”
Fair point. Completely valid. You’re not wrong.
But here’s the thing — I’m not just using AI. I’m learning it. Every part of it. The vision models, the prompt engineering, the API calls, the middleware, the way a system prompt completely changes how a model behaves on the same image.
I want to touch every piece myself. Because that’s the only way I’ll actually understand it.
And honestly? I think you’d do the same. If you’re the kind of person who reads a post like this instead of just Googling the answer — you get it. You don’t want to just consume AI. You want to know how it works from the inside.
That’s why I built it.
The app takes a screenshot of your desktop and lets you choose how you want it analyzed: roast, serious, or interview.
Three prompts. One image. Wildly different outputs.
And honestly? All three were useful — just in very different ways.
Stack: Node.js, axios for the HTTP call, and NVIDIA NIM serving meta/llama-3.2-11b-vision-instruct.
No frontend yet. Just a clean REST API you hit with Postman.
The endpoint is simple:
POST /analyze
Body: { image: <file>, mode: "roast" | "serious" | "interview" }
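If you would rather script the request than click through Postman, here is a minimal sketch of the same call. It assumes Node 18 or newer (for the built-in fetch, FormData and Blob), a multipart form upload, and a server at http://localhost:3000, which is only a placeholder. The /analyze path and the image and mode fields come from the endpoint above; the helper names are made up for illustration.

```javascript
// Hypothetical client for POST /analyze. Only the path and the two field
// names ("image", "mode") come from the post; everything else is assumed.
const MODES = ["roast", "serious", "interview"];

// Build the multipart form body, rejecting modes the API does not list.
function buildAnalyzeForm(imageBytes, mode, filename = "desktop.png") {
  if (!MODES.includes(mode)) {
    throw new Error(`Unknown mode "${mode}". Expected one of: ${MODES.join(", ")}`);
  }
  const form = new FormData();
  form.append("image", new Blob([imageBytes], { type: "image/png" }), filename);
  form.append("mode", mode);
  return form;
}

// Send the form. baseUrl is a placeholder for wherever the API is running.
async function analyzeDesktop(imageBytes, mode, baseUrl = "http://localhost:3000") {
  const res = await fetch(`${baseUrl}/analyze`, {
    method: "POST",
    body: buildAnalyzeForm(imageBytes, mode),
  });
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return res.text(); // the response format is not specified in the post
}
```

Checking the mode on the client side means a typo fails immediately instead of after a slow round trip to the model.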
Here’s the part I found genuinely interesting to build.
The whole app runs through one core idea: vision + prompt engineering = very different outputs from the same image.
When you upload a screenshot, it gets converted to base64 and sent to a vision model hosted on NVIDIA NIM, together with the prompt for the mode you picked. The model then “sees” the image and responds based on whatever personality I gave it.
The call itself is dead simple:
// visionService.js
const image = imageBuffer.toString("base64");
const response = await axios.post(
  "https://integrate.api.nvidia.com/v1/chat/completions",
  {
    model: "meta/llama-3.2-11b-vision-instruct",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: getPrompt(mode) },
          { type: "image_url", image_url: { url: `data:image/png;base64,${image}` } }
        ]
      }
    ]
  },
  { headers: { Authorization: `Bearer ${process.env.NVIDIA_API_KEY}` } }
);
That’s it. The model receives the prompt and the image together, as two parts of a single user message. The prompt decides the personality. The image gives it something to work with.
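The call above labels every upload as image/png. If you also want to accept JPEG screenshots, one option is to sniff the file signature before building the data URL. The sketch below is not the code from the repo: detectImageMime, buildVisionPayload and extractReply are illustrative names, and the reply path (choices[0].message.content) assumes the endpoint returns an OpenAI-style chat completion body, which the /v1/chat/completions path suggests but the post does not show.

```javascript
// Sketch: build the request payload as a pure function so it can be tested
// without calling the API. All function names here are illustrative.
const PNG_MAGIC = Buffer.from([0x89, 0x50, 0x4e, 0x47]);
const JPEG_MAGIC = Buffer.from([0xff, 0xd8, 0xff]);

// Identify PNG or JPEG from the first bytes of the file.
function detectImageMime(buffer) {
  if (buffer.subarray(0, 4).equals(PNG_MAGIC)) return "image/png";
  if (buffer.subarray(0, 3).equals(JPEG_MAGIC)) return "image/jpeg";
  throw new Error("Unsupported image type: expected PNG or JPEG");
}

// Same payload shape as the call above, with the MIME type detected.
function buildVisionPayload(imageBuffer, promptText) {
  const mime = detectImageMime(imageBuffer);
  const dataUrl = `data:${mime};base64,${imageBuffer.toString("base64")}`;
  return {
    model: "meta/llama-3.2-11b-vision-instruct",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: promptText },
          { type: "image_url", image_url: { url: dataUrl } },
        ],
      },
    ],
  };
}

// Pull the reply text out of an OpenAI-style response body (assumed shape).
function extractReply(responseBody) {
  const text = responseBody?.choices?.[0]?.message?.content;
  if (typeof text !== "string") throw new Error("No text reply in response");
  return text;
}
```

With axios, the payload would be passed as the second argument to axios.post, and extractReply would be given response.data.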
The prompt service just switches between three files based on mode:
// promptService.js
function getPrompt(mode) {
  if (mode === "serious") return seriousPrompt;
  if (mode === "roast") return roastPrompt;
  if (mode === "interview") return interviewPrompt;
}
Clean, modular, easy to extend. Want a “therapist mode” that diagnoses your digital anxiety? Add one more prompt file. Done.
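One thing to watch: as written, getPrompt returns undefined for a mode it does not recognize, and that undefined would end up in the request as the prompt text. Here is a sketch of a lookup-table variant that fails loudly instead. The prompt strings are placeholders, not the real prompts, and registerPrompt is a hypothetical helper.

```javascript
// Sketch of a lookup-table getPrompt. The strings are placeholders standing
// in for the contents of the three prompt files.
const prompts = new Map([
  ["roast", "ROAST PROMPT PLACEHOLDER"],
  ["serious", "SERIOUS PROMPT PLACEHOLDER"],
  ["interview", "INTERVIEW PROMPT PLACEHOLDER"],
]);

// Return the prompt for a mode, or throw if the mode is not registered.
function getPrompt(mode) {
  if (!prompts.has(mode)) {
    const known = [...prompts.keys()].join(", ");
    throw new Error(`Unsupported mode "${mode}". Known modes: ${known}`);
  }
  return prompts.get(mode);
}

// Hypothetical helper: adding a mode is a single call.
function registerPrompt(mode, text) {
  prompts.set(mode, text);
}

registerPrompt("therapist", "THERAPIST PROMPT PLACEHOLDER");
```

The if-chain and the table behave the same for the three known modes. The table only changes what happens for everything else.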
This is where the real engineering happened. Not in the code. In the prompts.
Each mode has a completely different prompt that shapes how the model reasons about the same image.
Roast Mode tells the model it’s a “sharp-witted internet roast comedian.” It’s instructed to write 8–12 short punchy lines — each one capable of standing alone as a meme or tweet — and end with a “finishing blow” that ties the whole roast together.
Serious Mode gives the model the persona of a productivity consultant. For each observation, it must output three things: what it sees, how that affects your workflow, and a specific actionable fix. No vague suggestions. Real recommendations.
Interview Mode is the one I’m most proud of. The model plays an engineering manager doing a mock interview assessment. It separates observations (what’s actually visible) from inferences (what that might suggest) and even assigns a confidence level (High / Medium / Low) to each one. It ends with an overall impression — the kind of thing an interviewer might say after seeing your setup.
That structure matters. It forces the model to be honest about uncertainty instead of making stuff up confidently. Which, honestly, more AI outputs should do.
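If you want to enforce that structure in code rather than trust the model to follow it, one option is to ask for JSON and validate what comes back. This post does not say what format the interview output uses, so the shape below (an array of objects with observation, inference and confidence) and the function name are assumptions for illustration only.

```javascript
// Sketch: validate interview-mode findings, assuming the prompt asks the
// model for a JSON array. The field names are hypothetical.
const CONFIDENCE_LEVELS = ["High", "Medium", "Low"];

function parseInterviewFindings(rawText) {
  const findings = JSON.parse(rawText); // throws on malformed JSON
  if (!Array.isArray(findings)) {
    throw new Error("Expected a JSON array of findings");
  }
  return findings.map((finding, index) => {
    // Each finding must keep what is visible apart from what is inferred.
    for (const key of ["observation", "inference"]) {
      if (typeof finding?.[key] !== "string" || finding[key].trim() === "") {
        throw new Error(`Finding ${index}: missing "${key}"`);
      }
    }
    if (!CONFIDENCE_LEVELS.includes(finding.confidence)) {
      throw new Error(`Finding ${index}: confidence must be High, Medium, or Low`);
    }
    return {
      observation: finding.observation,
      inference: finding.inference,
      confidence: finding.confidence,
    };
  });
}
```

A reply that skips the confidence level or blends observation and inference into one field gets rejected, so you can retry or flag it instead of showing it as-is.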
Okay. I ran it on my actual desktop. Roast mode. POST /analyze, mode: roast, my real screenshot attached.
Status: 200 OK. Time: 1m 43s. Size: 3.28 KB of pure humiliation.
Here’s what came back (real output, zero edits):
“Alright, folks, gather ‘round. We’ve got a real treat for you today — a desktop that’s so cluttered, it’s like a digital tornado swept through and left a trail of chaos in its wake.”
It then went after my wallpaper:
“It’s like someone took a screenshot of a Windows 11 tutorial and said, ‘You know what would be great? If we made this the background of our entire computer.’ It’s like a Windows 11-themed party, and everyone’s invited… except for organization.”
Then it found my file names. A folder called Accomplish a START. Another called onePage4P.... And my personal favourite — DS4 roadmap f...:
“I’m pretty sure that’s just a placeholder for the real file, which is probably something like ‘DS4 roadmap for my life after I quit my job and became a professional desk organizer.’”
It clocked my browser tabs:
“It’s like a never-ending list of ‘I’ll just quickly check this one thing…’ that turns into a 10-minute rabbit hole of YouTube videos and cat memes. I’m pretty sure the average person doesn’t need 27 tabs open at once.”
And the finishing blow:
“A desktop so cluttered, it’s like a digital tornado swept through and left a trail of chaos in its wake. But hey, at least it’s a good story to tell.”
I sat there for a full minute just staring at the screen.
The AI wasn’t wrong.
I went in thinking this would just be a fun weekend project.
I came out with three things I didn’t expect:
1. The same image, analyzed differently, gives you genuinely different value. Roast mode made me laugh. Serious mode made me uncomfortable. Interview mode made me think. None of them were wrong.
2. Prompt engineering is product design. The three prompts in this app are doing all the heavy lifting. Swap them out and you have a completely different product. The model is just the engine — the prompt is the steering wheel.
3. Vision models are underrated for developer tools. We talk a lot about text-based LLMs. But a model that can look at a screenshot and give structured feedback? That’s a whole category of tools we haven’t fully explored yet.
A few things I want to add:
If you want to build this yourself, the full code is on my GitHub: desktop-analyzer
You’ll need a free NVIDIA NIM API key to run it.
Your desktop is a snapshot of how your brain is working right now.
It shows what you’re actively thinking about, what you’ve been avoiding, what you care about enough to keep visible. An AI with a vision model can read that — not perfectly, but surprisingly well.
And sometimes, the most useful thing isn’t a code review or a mock interview question.
It’s someone (or something) looking at your actual working environment and saying: here’s what this tells me about you.
Even if that something starts by roasting your wallpaper.
I’m Priyanka, a Senior Software Engineer, builder of AI things, and apparently someone who needs to clean their desktop. I write about AI every day in this series. If this was useful, follow along for Day 15.
Technical details in this post are based on my own implementation and publicly available documentation. Results from AI vision models may vary depending on image content and model version.
What Does My Desktop Say About Me? I Built an AI to Find Out. was originally published in Towards AI on Medium.