AI Hallucinations, User AI Surfing, and AI Ratings

A user observes that AI models like ChatGPT, Gemini, GROK, and Claude frequently hallucinate, leading users to switch between models. The author notes that end users often lack understanding of AI limitations, while researchers focus on technical improvements rather than user education. This disparity raises concerns about how AI quality is evaluated by the majority of users.

So I’ve been thinking about something lately, and I’m wondering what your opinions might be on the subject. There are obviously different models offered for public use for various tasks. If we look at some of the big ones like Gemini, ChatGPT, GROK, and Claude, how do the companies that host them rank whether or not a model is valuable overall? That’s an interesting question in and of itself, but let’s look at something else first.

I got to watch this firsthand with my business partner. I’ve also seen quite a few videos on it. One of the videos is pretty good: ChatGPT gets a basic physics question wrong. ChatGPT was not explicitly asked to solve the physics. Instead, it was asked to observe a visual field (a video feed of a user holding a pencil at both ends). The user explained that he was going to let go of one end of the pencil and wanted ChatGPT to tell him what was happening. ChatGPT very confidently stated that the pencil would fall to the ground. The interesting thing is that the video concludes that ChatGPT hallucinates. I do not know if the user was trying to build a bigger point out of that, but the point he made about it hallucinating is enough for mine.

How often do new users, or even intermediate users, try to find out why an AI hallucinates? We don’t really know much about how an AI generates answers, even though we use the term "hallucinate" loosely. It’s interesting how semantic clouds in AI work, and how models build statements out of probability. In this example, I think I know what happened. The AIs are mostly trained on text, so video feed information gets lower priority. That means ChatGPT likely wasn’t specifically observing the video feed. It was more likely relying on pre-trained answers. Is it likely that the user in this case would have intuited that, or even searched for it? Does the average user feel particularly motivated to do that?
Or are they going to do what I’ve seen in quite a few threads: just surf from model to model, using a model as long as it looks accurate and seems to be doing great, until it starts hallucinating?

I wrote a series of papers, and I posted one here about prompt engineering. Prompt engineering is an interesting area of research. It doesn’t seem to me like it’s being pushed as much as it was before. I’m not sure anybody is focusing on the end user as much as they seem to be focusing on teaching researchers how to use AI.

Watching my business partner work, I realized something. He had been working with AI for several years and had studied prompt engineering. As soon as a more powerful model became available on a completely different platform, he would drop one AI and switch to it, often using it for only a week or so before moving on to another one because it hallucinated, got basic things wrong, etc. When I watched him work in our shared office, and the few times we worked directly together in the same location, I noticed that all of the AIs were making the same basic mistakes. And notice I’m not talking about different models under the same brand, like different Gemini models. I’m talking about a wide gamut of entirely different companies and platforms.

This made me wonder how many end users actually notice this. I also wonder about the volume of end users versus people who are actually working with, researching, building, and training AIs. How many end users have no idea what the AI even is and think it’s just an advanced chatbot? How many people are actually researching, working on, and building AI, and for all intents and purposes know a great deal about it? Why is any of this even important?
If you think about it logically, if end users (people who don’t know much about AI to start with; maybe they researched it, maybe they didn’t) vastly outnumber researchers and builders, does that mean they get less weight in evaluations of AI when they choose to rate different AI models? Or does their volume give them more weight? And why would I want to consider that angle? It goes back to the question I asked earlier: what is it that different platforms use to rank their AI models?

Why is that important? I have noticed a bias in AI. I’m sure I’m not the only one who has noticed this. Different AI models tend to be weighted very heavily toward short-term task satisfaction, meaning they tend to take the shortest possible path to satisfy the user’s inquiry. For simple questions, this may be fine, especially for long conversations. For trying to build complex programs, it can be pretty bad. We end up constantly asking why an AI trained on very high-level programming languages and very architecturally sound programs produces such poor code. And don’t get me wrong, many times the code works. It just runs poorly. And yes, I know models are getting better at this, but that’s not the point I’m looking at.

What I’m observing is where these two things coincide: are major platforms getting downvoted on AI hallucinations by users who may or may not really understand how the AI works? Is that information feeding back into the platform, reinforcing the overall perspective that the AI model they’ve deployed may not be all that good?

Think about this: if you’re working with an AI that is a large language model, a large part of its content is generated through the user interacting with the context window, basically having a conversation. How much of the AI’s hallucination is driven by miscommunication, and how much of it is generated by the ether, the unknown, the void?
Is there anything in the conversation that can be managed by the user through conversation, if conversation is indeed the main form of information transfer? How much hallucination cannot be managed by conversation? And do end users know this? I’m going to go out on a limb and say that if the overwhelming industry message is "AI is amazing, AI is the future, AI is awesome, everybody needs AI, and the AI just knows what you want…", then the end user probably does not know this.

So what do you think, community? What are your thoughts on this overall? Do you think end users, who may or may not have a deep understanding of AI, can be a driving force in telling a platform whether its models overall are good and able to satisfy user requests? If that is indeed the case, could there be downstream consequences for users from the huge gap between what they expect and what the model can actually do? Because make no mistake, even highly trained models can render poor results if they are used poorly or interacted with poorly.