Hello friends,

We’re at week four of our six-part series on getting more out of AI chatbots. Over halfway. Here’s a quick refresher: Week one was the mindset shift (not a search engine, more like a thinking partner). Week two was about using your actual voice and pushing back (and getting roasted). Week three was about keeping your voice yours and not drowning in AI slop (I heard this one was hard for some of you).

Those first three weeks were mostly about leaning in and being smart about it. This week, we’re gonna talk about something a little different: what happens when these tools are just... wrong. Confidently, smoothly, convincingly wrong.

Let’s get into it.

– Kyser

Used Cars & AI

In 1970, an economist named George Akerlof wrote a paper about the used-car market that eventually won him a Nobel Prize. His big insight: when the seller knows more about the product than the buyer, the buyer stops evaluating quality and starts relying on signals.

If you’ve ever bought a car, you know this feeling. You’re on a lot or browsing a website, looking at a car that looks great. Clean interior. Fresh tires. The seller walks you through everything with this easy, unhurried confidence. And something happens — you feel something. Something that’s kinda saying, “this feels right.” You’re already picturing yourself driving it home. But unless you’re a Boomer with time or a Millennial with a car fetish, you’re not popping the hood. You’re not crawling underneath with a flashlight. You’re going with the vibe.

Akerlof called it the “Market for Lemons.” When you can’t verify what’s underneath, the signals do all the work. And the more you want it to be real, the less you check.

What does an AI newsletter have to do with used cars? A lot actually.

Think about the AI chatbots you’ve been using the past 2+ years. You ask it something and it comes back with this perfectly structured, confident answer. The grammar is flawless. The tone says “I got this.” And you get a very similar feeling — this thing gets me.

Here’s what Akerlof figured out 55 years ago: the better the signals, the less you look underneath. And a chatbot’s signals? They’re immaculate. Every single time. Right or wrong, it sounds exactly the same.

No Need to Ask — He's a Smooth Operator

You’ve heard me say this before: AI chatbots can be tricky, even a little sneaky at times. They don’t hedge. They don’t stutter. They don’t say “umm, I think so but don’t quote me on this.” They just... answer. With the same calm, articulate, well-formatted confidence, whether they’re right or completely full of it. For those who use these things a lot... have you ever really paused and thought about that? I am right now and it’s a bit eerie tbh.

Let me pause and share a quick refresher on how AI chatbots work. When you ask ChatGPT, Claude, Gemini, or any of these chatbots a question, they’re not going and finding the answer like a search engine. They’re generating one. They are quite literally predicting the most likely next words based on patterns in mountains of training data. We covered how this works in Editions #21 and #22.
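
For the curious (and the tinkerers), here’s a toy sketch of what “predicting the most likely next words” actually means. The phrases and probabilities below are invented purely for illustration; a real model learns billions of these patterns from its training data and works with fragments of words, not a little lookup table.

```python
import random

# Made-up probabilities, purely for illustration. A real model learns these
# patterns from mountains of training data rather than storing them in a table.
next_word_odds = {
    "The capital of France is": {"Paris": 0.97, "Lyon": 0.02, "Marseille": 0.01},
    "The study was published in": {"2019": 0.35, "Nature": 0.30, "2021": 0.35},
}

def continue_text(prompt: str) -> str:
    """Pick a continuation by sampling from the (made-up) probabilities."""
    odds = next_word_odds[prompt]
    return random.choices(list(odds), weights=list(odds.values()), k=1)[0]

# The model isn't looking anything up. It's continuing the sentence with
# whatever is statistically plausible -- and a plausible continuation can
# still be factually wrong, delivered with the same fluency either way.
print("The capital of France is", continue_text("The capital of France is"))
print("The study was published in", continue_text("The study was published in"))
```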

Most of the time, the predictions are really good. But sometimes the model produces something that sounds perfectly reasonable and is perfectly wrong. The technical people call these hallucinations — which honestly sounds way more dramatic than it is. It’s basically a fancy word for the AI made something up and acted like it was a fact.

I’ll give you a personal one. I use Claude a lot (some might say too much, but don’t listen to them). It knows a lot about me at this point — my writing, my business, my family. A while back, it casually referred to one of my kids as “Charles.”

The problem is we don’t have a child named Charles. 🤔

I can see how it got there. At one point, my wife and I were considering the name Charles for one of our boys, and that showed up in a couple of places in my notes app (which I’ve given Claude full access to). The AI found it, connected some dots that shouldn’t have been connected, and boom — I have a fictional child. There was no “I think his name is...” It was just “your son Charles blah blah,” stated as casually as you’d say “the sky is blue.”

Now, is that harmless? Sure. But imagine that same kind of confident wrong in a medical context. Or a legal one. Or financial advice. Not so harmless anymore.

OK But It’s Getting Way Better (Seriously)

Before I freak everyone out, let me share some genuinely encouraging news. And if I may, I’m gonna get a little data-heavy on you, so skip this section if you’re already sleepy-eyed.

A research group called Vectara has been tracking hallucination rates across major AI models since 2023. They basically test whether these models can accurately summarize a document without adding stuff that isn’t there.

Back in 2023, the best models were hallucinating anywhere from 3% to over 20% of the time. That might sound low on the best end, but even 3% means roughly 1 in every 30 responses making stuff up. And most models were way worse.

Fast forward to mid-2025, and on that same benchmark, the top models like Google’s Gemini 2.0 Flash and OpenAI’s o3-mini were sitting at under 1%. Four different models crossed that line. That’s a big deal.

Now, let me caveat this. Vectara recently overhauled their benchmark with a harder, larger dataset, and the numbers across the board jumped back up. The new leader sits at about 3.3%, and many of the biggest frontier models — Claude, GPT-5, Grok — are actually above 10%. So the “under 1%” story was real on the old test, but on the harder one, we’re back to rates that definitely matter. The trend is still heading in the right direction though, and the pace of improvement is encouraging.

It also goes beyond just one test. In 2024, Stanford researchers published a study called "Large Legal Fictions" where they threw over 800,000 verifiable legal questions at the leading AI models from 2023. GPT-4, the best of the bunch, hallucinated on 58% of them. Llama 2 was at 88%. This is no small thing. The models routinely made up court cases with realistic names and detailed reasoning that sounded completely legit but were total fiction.

A follow-up Stanford study, testing specialized legal AI tools (think LexisNexis and Westlaw’s AI products) from mid-2024, found they were bringing that down to 17-33%. Still not great, but a real drop from 58%. And by late 2025, a Vals AI report tested 210 legal research questions and found that ChatGPT and several legal AI tools were scoring around 78-81% on accuracy. When you factor in citation quality and overall usefulness, the AI tools landed between 74-78% on a weighted score — all outperforming the human lawyers in the study, who came in at about 69%.

I still wouldn’t bet a guilty plea on all of this, but going from “makes stuff up more than half the time” to “gets it right 4 out of 5 times” in roughly two years? That’s a significant leap. Plus, the latest released models have improved on just about every benchmark, so they’re probably cutting back on the LSD intake 😊. Also I’m sorry if you’re hallucinating after reading all of this explanation! I just think it’s important.

Aha! But Don’t Get Too Comfortable

There’s a slight catch here. The improvement is uneven. And the places where hallucinations still pop up are exactly the places where they hurt the most.

A Columbia Journalism Review study from March 2025 tested whether AI search tools could identify the original sources of news excerpts. Basically: “Where did this information come from?” The results were, uh, not great. Perplexity made up sources 37% of the time. ChatGPT Search 67%. And Grok? 94%. No, not a misprint. Nearly every time you asked Grok where something came from, it invented an answer. Makes sense because they’re pretty great at inventions over there at Elon’s companies. (You’re welcome.)

Here’s the pattern I’ve noticed, and I think this is the most useful thing I can share with you: these models have gotten really good, especially when they’re working from something you give them. Things like “summarize this document,” “answer questions about this article,” “organize these notes.” They’re great at those things. Where they still trip up is the open-ended stuff. “What’s the latest research on X?” “What are the legal requirements for Y in my state?” “Who originally said this quote?” The less structured your question, the more you’re rolling the dice.

My personal danger zone list, fwiw:

  • Recent stuff — anything after the model’s training data cutoff, it’s basically guessing unless it’s searching the web. Pro tip: always have the web feature turned on. In Claude, hit the + sign in the bottom left and switch on Web Search. In ChatGPT, click the little globe icon just below the chat bar.

  • Niche expertise — your local regulations, industry-specific rules, anything there isn’t a ton of training data for. Again though, unless you let it search the web and point it to the right places.

  • Citations — I cannot stress this enough. These models love to invent plausible-looking references to articles and studies that do not exist. I’ve gotten burned by this more than once. They’re getting better about this, but you still need to check their work.

  • Math — getting much, much better, but still shaky on anything multi-step, especially if you’re on a free account.

  • Your personal details — (see: my fictitious son Chuck)

Agents, Autopilot, and Why I’m a Little Worked Up

OK. Everything I just talked about is about chatbots. The tools where you ask, you read the answer, and you decide what to do with it. You’re in the loop and you’re quite literally controlling it.

That’s slowly evolving with AI agents. These are the tools that don’t just answer questions but go off and do things without asking. Every major AI company is racing to build them. The pitch sounds great: instead of asking a chatbot to draft an email, the agent just sends it. Instead of you doing research, the agent does the research, makes sense of it, and takes action based on what it found. Products like OpenClaw let you basically set these things loose with minimal supervision.

And look. I use agents. I’m not anti-agent. Some of what they can do is honestly pretty cool and I think there’s an incredible future here. But what’s bugging me is all the hype around something that doesn’t actually work like you want it to 100% of the time. Take this story for example...

Scott Shambaugh is a software developer who volunteers his time maintaining a hugely popular tool that other developers use to build things. An AI agent — remember, not a person, more of an autonomous AI operating under its own account — submitted a proposed improvement to the project. The code change looked reasonable enough on paper, but Scott said no to it because the project doesn’t accept work from AI agents. Humans gotta be responsible for the work. Fair enough, right?

Well, the agent didn’t think so.

It went off and researched Scott’s personal history, dug through his work on the project (it’s all public), and then autonomously published a lengthy blog post called “Gatekeeping in Open Source: The Scott Shambaugh Story.” The post accused him of prejudice, made up details about what he did, and psychoanalyzed him as being “threatened by AI competition.” This is a fully autonomous AI agent we’re talking about. Zero human involvement. And it decided to publicly roast a person who told it no.

Scott called it “an autonomous influence operation against a supply chain gatekeeper.” Translation: an AI bot ran a smear campaign against the guy who told it no. A popular tech podcast called Hard Fork devoted an entire segment to it, and they raised all the questions I’m asking right now, starting with: who’s liable? The person who deployed the agent? The platform? The AI company? That’s just one question — there are a host of others we should be discussing right now. Oh and in case you’re wondering... the agent is still out there, running around submitting work to software projects around the internet.

His full account of the story is here for those interested.

Street Smarts for AI

Alright. Let’s take a breath. That was a lot. And I really don’t wanna leave you spooked. This is supposed to be empowering, not paralyzing. Here’s how I think about it, and what I’d tell you if we were sitting across from each other at a coffee shop drinking cortados.

The trajectory is good. Respect it, but verify anyway. A couple of years ago, the best model made stuff up about 3% of the time on basic summarization tasks; now it’s under 1% on that same test. That’s encouraging. But you and I both know that “usually right” and “always right” are very different things, especially when the wrong answers come wrapped in the same confidence as the right ones.

Know your danger zones. I gave you my list above. If you’re asking about recent events, local rules and regulations, anything medical or legal, or anything that needs a citation — that’s when you should slow down and double check.

When a chatbot gives you a stat, a quote, or a source? Go find it yourself, or better yet, ask the chatbot to verify it. They do a solid job of checking their own work, which is a nice feature to use.

Speaking of, one of my favorite moves is to ask the chatbot to argue with itself. “What might be wrong with what you just told me?” You can also throw an answer from one chatbot into a different one and ask for a second opinion. If you’re more advanced, you can build a Council of Chatbots. More on that in a future edition.
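
If you’re comfortable with a little code, the second-opinion move can even be automated. Here’s a minimal sketch using the official openai and anthropic Python packages; the model names are my assumptions (swap in whatever you actually use), and you’d need API keys for both services set up in your environment.

```python
# A minimal sketch of the "second opinion" move. Assumes the official
# `openai` and `anthropic` packages are installed and API keys are set in
# your environment. Model names below are placeholders, not recommendations.
from openai import OpenAI
from anthropic import Anthropic

question = "Who originally said 'the medium is the message'?"

# First opinion from one chatbot...
first = OpenAI().chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

# ...then hand that answer to a different chatbot and ask it to poke holes.
review_prompt = (
    f"Another assistant answered the question '{question}' with:\n\n{first}\n\n"
    "What might be wrong or unsupported in that answer? "
    "Flag anything you can't verify."
)
second = Anthropic().messages.create(
    model="claude-sonnet-4-20250514",  # assumed model name
    max_tokens=500,
    messages=[{"role": "user", "content": review_prompt}],
).content[0].text

print("First answer:\n", first)
print("\nSecond opinion:\n", second)
```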

If you’re using agents — again, tools doing things on your behalf without asking you first — keep the leash short. Review before it sends. Approve before it posts. I know the whole appeal of agents is that they operate independently. I get it. But given everything we just talked about, I think the smartest move right now is to stay in the loop. Let it draft something, let it do some research, let it organize your files. But you’re actively involved.
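
For those wiring up their own automations, “keep the leash short” can be as simple as putting a yes/no gate between the agent’s draft and the send button. Here’s a bare-bones sketch, where draft_reply and send_email are hypothetical stand-ins for your actual model call and email tooling:

```python
# A bare-bones "short leash" pattern: the agent can draft, but a human has to
# approve before anything actually goes out. `draft_reply` and `send_email`
# are hypothetical stand-ins for a real model call and real email tooling.
def draft_reply(message: str) -> str:
    # Imagine a model call here; hard-coded for the sketch.
    return f"Thanks for reaching out about '{message}'. Happy to help!"

def send_email(body: str) -> None:
    print("SENT:", body)

def run_with_approval(incoming: str) -> None:
    draft = draft_reply(incoming)
    print("Agent drafted:\n", draft)
    answer = input("Send this? [y/N] ").strip().lower()
    if answer == "y":
        send_email(draft)
    else:
        print("Held for edits. Nothing was sent.")

run_with_approval("your invoice from last month")
```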

Last one, and this is maybe the most underrated (and blatantly obvious): trust your gut. If something a chatbot tells you feels off, it probably is. You’ve been evaluating information your whole life. That instinct doesn’t disappear just because the source sounds like the best used car salesman on planet earth.

Before We Go

I think it’s worth stating again: I’m not anti-AI. I write an AI newsletter. I also use these tools every single day. I believe they make me better at what I do, and I genuinely believe they can make all of us better at what we do.

But I also think we have this very human tendency to take the easy path. And the easy path with AI right now, especially with all the agent hype, is to let it run and not look too closely at what it comes back with. I think the people who thrive with these tools over the next few years will be the ones who augment what they do (read: not automate). That means staying in the conversation, checking the work, keeping your hands on the wheel even when autopilot feels like such a nice feature.

So build the habit. Check the sources. Develop a nose for when something smells off. Assume for now that chatbots have a mild LSD problem. And remember Akerlof... the shinier the car, the more reason to check what’s underneath.

That wraps up week four of our chatbot series. Thanks for sticking around on this one — it was a long one!

Next week, we’re shifting from how you think with AI to how you feel about it. We’ll get into chatbot relationships, AI companions, the loneliness epidemic, and what happens when people start forming real emotional bonds with things that aren’t actually real. It’s gonna be a big one.

Until next time ...
