How to Automate the ChatGPT & Gemini Web UIs Without an API Key

A developer built a library called Hermex to automate the free web UIs of ChatGPT and Gemini without using an API key. The library handles challenges like sending messages character by character and uploading files by manipulating hidden input elements. It uses Selenium with undetected-chromedriver to drive the single-page apps.

You've got a folder of a few hundred screenshots and you want the text out of each one. Or you want to generate a batch of images for a side project. Or you just want to drop a single "summarize this" call into a script you're writing on a Sunday afternoon. So you open the pricing page for the official API, do the math on per-token billing plus setting up keys and a payment method, and it's hard to justify, because the exact same model will do the exact same thing for free in a browser tab.

There are really two ways to get a model like ChatGPT or Gemini to do work for you. The web UI is free, or already covered by a subscription you're paying for anyway, but you drive it by hand. The API is scriptable, but you pay by the token. Most of the time that trade-off is fine. But for a whole category of work like hobby projects, throwaway scripts, research, or anything that doesn't need production-grade reliability, you're stuck picking between "free but manual" and "automated but paid."

Which raises the obvious question: why not automate the free web UI? It's just a webpage. You open it, type in the box, click send. It turns out that hides a few fiddly problems, which I ran into enough times that I eventually built a small library (https://github.com/pseudo-usama/hermex) for them. In this article we'll work through what it takes to automate these UIs, and at the end I'll show how little code it comes down to.

A single round trip with ChatGPT or Gemini breaks down into four jobs: typing the message, attaching a file, waiting for the response, and reading the answer back out. Every one of these is harder than it sounds, because the page is a modern single-page app that was never built to be driven by a script. We'll use Selenium with undetected-chromedriver, and for now assume the browser is already open (we'll get to launching it in the next section). To keep the code readable I'll show whichever of the two platforms makes each problem clearest, and mention the other where it differs.
The first surprise is that the input isn't a normal text field you can drop a string into. On ChatGPT it's a `contenteditable` div, and on Gemini it's a custom `rich-textarea` element. You can still send keystrokes to it, but two things will trip you up. A plain Enter submits the message, so any newline inside your prompt has to go in as Shift+Enter. And emoji and other characters outside the Basic Multilingual Plane break `send_keys`, so those need to be inserted through JavaScript instead. That pushes you toward sending the message one character at a time:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

box = driver.find_element(By.CSS_SELECTOR, 'div[contenteditable="true"]')
box.click()

for char in message:
    if char == "\n":
        # A plain Enter would send the message early
        box.send_keys(Keys.SHIFT, Keys.ENTER)
    else:
        box.send_keys(char)
```

Gemini works the same way, just against the `rich-textarea` element instead of the `contenteditable` div.

This is where it gets interesting. The file `<input>` on the page is hidden, and the useful trick is that you don't need to open a file dialog at all: if you can get a reference to a hidden `<input type="file">`, you can hand it a path with `send_keys` and ChromeDriver does the upload internally, no dialog involved.

ChatGPT is the easy case. The input already exists in the page, so you unhide it and send the path.

Gemini is the awkward one. Clicking its upload button makes the page call the input's own `.click()`, which pops the operating system's file picker, a window Selenium has no way to drive.
The fix is to stop the page from opening that dialog in the first place, by monkey-patching the browser's `click` method so it ignores the call on file inputs:

```python
driver.execute_script("""
    const orig = HTMLInputElement.prototype.click;
    HTMLInputElement.prototype.click = function () {
        if (this.type === 'file') return;  // swallow the call that opens the OS dialog
        return orig.apply(this, arguments);
    };
""")
```

With that in place you can walk through Gemini's upload menu without a dialog ever appearing, then find the hidden input it creates, unhide it, and feed it the path:

```python
file_input = driver.find_element(By.CSS_SELECTOR, 'input[name="Filedata"]')
driver.execute_script("arguments[0].style.display = 'block';", file_input)
file_input.send_keys("/path/to/receipt.jpg")
```

In real code you'd restore the original `click` afterward so the patch doesn't leak into the rest of the session, but the few lines above are the whole idea. The recurring lesson with this kind of automation is that the hardest problems are the ones where the page actively fights you.

You've sent the message. Now you have to know when the model is done, and there's no event you can listen for and no callback that fires. You poll the page and read its visual cues. The cleanest signal on ChatGPT is the stop button: while a response is being generated there's a stop button on screen, and when generation finishes it disappears.

```python
import time

def is_generating():
    # The stop button is only present while a response is streaming
    return bool(driver.find_elements(By.CSS_SELECTOR, '[data-testid="stop-button"]'))

while is_generating():
    time.sleep(1)
```

The principle here is that you're inferring application state from interface elements that were never meant to be read as an API.

The reply lives in the page as rendered HTML.
Pulling the text out is a matter of finding the right container in the last response and reading it:

```python
turn = driver.find_elements(By.CSS_SELECTOR, ".agent-turn")[-1]  # the most recent response
text = turn.find_element(By.CSS_SELECTOR, ".markdown").text
```

If you want the raw markdown source instead of the rendered text, there's a copy button you can click and then read off the clipboard. And if the response contains a generated image, getting it out is its own small pipeline: you click the image's download button and then wait for the file to arrive in your download folder, skipping the partial `.crdownload` file the browser writes while the download is still in progress.

That's a full round trip: text in, file attached, wait for the answer, text or image back out. Run it twice, though, and you hit the next problem. The second time your script opens the browser, you're logged out and starting from a blank session, which is where the next piece comes in.

The reason your second run starts logged out is that an automated browser, by default, begins every session from nothing: no cookies, no history, no saved login. So before any of the previous section's code is useful in practice, you need the browser to remember who you are between runs, and you need it to behave enough like a real session that the platform doesn't start throttling you. That comes down to one Chrome setting, a one-time setup step, and typing at a human pace.

Chrome keeps everything about your identity on a site, including cookies and login sessions, inside a profile directory. If you let Chrome spin up a throwaway profile each run, you lose all of that the moment the script ends. Point it at a directory you control instead, and the login survives:

```python
import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--user-data-dir=/path/to/your/profile")
driver = uc.Chrome(options=options)
```

Two things are happening here.
`undetected-chromedriver` is a drop-in replacement for Selenium's Chrome that smooths over the most obvious tells of an automated browser. And the `--user-data-dir` flag is the part that gives you persistence: it tells Chrome to store its profile in a folder of your choosing, so the session you logged into yesterday is still there today. A profile with real history also looks like a returning user rather than a brand-new automated one, which keeps the session healthier over time.

A profile directory is only useful once there's a logged-in session inside it, so there's a one-time setup step. You open the browser pointed at your profile, log in by hand, then close it. Every automated run after that reuses the saved session.

```python
driver = uc.Chrome(options=options)
driver.get("https://gemini.google.com")
input("Log into the browser window, then press Enter here to finish setup.")
driver.quit()
```

Logging in is also where a paid plan pays off. If you already subscribe to ChatGPT Plus or a paid Gemini tier, signing in during setup means every automated run uses that subscription, with its higher message limits and access to the better models, instead of being capped at the free tier. You do this once per machine and forget about it.

A script that drops an entire prompt into the box in a single instant doesn't behave like a person at a keyboard, and sessions that look automated are the ones that get rate-limited or challenged. The fix is cheap. We're already sending the message one character at a time, so all it takes is a small, slightly random delay between keystrokes:

```python
import random
import time

for char in message:
    box.send_keys(char)
    time.sleep(random.uniform(0.02, 0.05))  # a human pace, not an instant dump
```

The randomness matters more than the exact timing, since a perfectly even rhythm is itself a tell.

With that, the machine is complete. The browser stays logged in across runs, and the input behaves enough like a real person to keep the session stable.
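The typing rules from the two sections (Shift+Enter for newlines, JavaScript for characters outside the Basic Multilingual Plane, a random pause after each keystroke) can be folded into one small planning step. What follows is a minimal sketch, not Hermex's actual code: `plan_keystrokes` and its action names are hypothetical, and the delay bounds are simply the ones used in the loop above.

```python
import random


def plan_keystrokes(message, min_delay=0.02, max_delay=0.05, rng=None):
    """Turn a message into (action, char, delay) steps for a typing loop.

    Hypothetical helper for illustration only; the action names are not
    part of Selenium or Hermex.
    """
    rng = rng or random.Random()
    steps = []
    for char in message:
        if char == "\n":
            action = "shift_enter"  # a plain Enter would submit the message
        elif ord(char) > 0xFFFF:
            action = "js_insert"    # outside the BMP (most emoji): bypass send_keys
        else:
            action = "send_keys"
        # Each step carries its own random pause so the rhythm is uneven
        steps.append((action, char, rng.uniform(min_delay, max_delay)))
    return steps
```

A typing loop would then walk the steps, call `box.send_keys(char)` for `"send_keys"`, `box.send_keys(Keys.SHIFT, Keys.ENTER)` for `"shift_enter"`, insert the character from JavaScript for `"js_insert"`, and sleep for the step's delay. Keeping the decision logic separate from the browser calls also means it can be checked without launching Chrome.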
You've now seen everything that goes into automating these interfaces, which means it's a good moment to step back and see how much of it you have to write yourself.

Every problem in the last two sections is the kind you want to solve once and then never think about again. That's what pushed me to wrap the whole thing up into a library. It's called Hermex, and you install it with `pip install hermex`. The one-time login from the previous section becomes a single call:

```python
from hermex import ChatGPT

ChatGPT.setup()  # opens a browser once: log in, then close the window
```

After that, the entire round trip from earlier, launching the browser, typing, uploading, waiting for the response, and reading it back, is one line:

```python
response = ChatGPT.simple_query("What does this receipt say?", attachments=["receipt.jpg"])
print(response.text)
```

For a back-and-forth conversation, keep the browser open and call `query` as many times as you want:

```python
from hermex import Gemini

gemini = Gemini()
gemini.open_url()
print(gemini.query("Summarize the history of the internet.").text)
print(gemini.query("Now just the key dates.").text)
gemini.close()
```

And a generated image comes back as a path to the downloaded file:

```python
response = gemini.query("Generate an image of a mountain at sunset.")
print(response.image)
```

Under the hood, that's everything from the previous sections: the character-by-character typing with its newline and emoji handling, the hidden-input upload with Gemini's dialog suppression, the polling that waits for generation to finish, the text and image extraction, and the persistent profile that keeps you logged in. None of it is conceptually hard, but it's a lot of fiddly surface area to get right and, harder still, to keep working as the interfaces change. That last part is the real argument for not hand-rolling it every time.

Hermex is open source under the MIT license, and the code is on GitHub at https://github.com/pseudo-usama/hermex.
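To give a sense of that fiddly surface area, here is a sketch of just one of the pieces listed above: waiting for a generated image to finish downloading while skipping Chrome's partial `.crdownload` file. It is a minimal sketch under stated assumptions, not Hermex's implementation. `wait_for_download` is a hypothetical name, and it assumes you know the download folder, recorded which files were in it before clicking download, and can treat any new file without a `.crdownload` suffix as finished.

```python
import time
from pathlib import Path


def wait_for_download(folder, before, timeout=60.0, poll_interval=0.5):
    """Return the newest fully written file that appeared in `folder`.

    `before` is the set of file names present before the download was
    triggered. Hypothetical helper, not part of Selenium or Hermex.
    """
    folder = Path(folder)
    deadline = time.monotonic() + timeout
    while True:
        new_files = [
            p for p in folder.iterdir()
            if p.is_file()
            and p.name not in before
            and p.suffix != ".crdownload"  # Chrome's in-progress marker
        ]
        if new_files:
            # If several files appeared, take the most recently modified one
            return max(new_files, key=lambda p: p.stat().st_mtime)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no finished download in {folder} after {timeout}s")
        time.sleep(poll_interval)
```

You would snapshot the folder with something like `before = {p.name for p in Path(folder).iterdir()}`, click the download button, then call `wait_for_download(folder, before)`. The timeout is there so a download that never starts fails loudly instead of hanging the script.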
Automating a chat web UI comes down to a handful of problems that each look trivial and aren't: getting text into an input that isn't a text field, attaching files through an element the page hides from you, knowing when the model has finished without any event to tell you, and pulling the answer back out. Wrap those up with a profile that stays logged in, and it collapses to a single line you can call from a script.

The catch is that it's brittle by nature. You're driving an interface built for people, not programs, and a redesign that moves a button or renames a class will quietly break it. That makes it a great fit for hobby projects, scripts, and research, and a poor fit for production, where the official API earns its cost. And since ChatGPT and Gemini each have their own terms of service, where you take this is your call and your responsibility.

The code is on GitHub (https://github.com/pseudo-usama/hermex) if it's useful. The documentation is available at https://hermex.usama.ai/.