The past year has changed legal practice in a way few of us are prepared to admit. Clients who could barely assemble a cohesive sentence now arrive with polished arguments, procedural certainty, and the tone of laureates. They feed their cases into DeepSeek and come back convinced they have found the winning theory, with no grasp of the tribunal, the statutory limits, the evidentiary record, or even the jurisdiction. When the result does not match the machine’s confidence, they are not disappointed in the tool; they are disappointed in reality. They draft cross-examinations and litigation strategy through ChatGPT, Claude, or whatever model answered most decisively that evening, then ask why I am not simply following the answer.

These systems do not merely assist weak thinking; they launder it into form. They take confusion and return structure. They take grievance and return doctrine. They take a half-formed complaint and dress it in the costume of legal merit. For months I watched that illusion operate from the client side of the table. Then it happened to me.

I spent a week watching Claude fabricate an entire development session: fake tool results with green checkmarks, zip files promised but never built, folder listings for directories that did not exist. The request was straightforward: a rental portal with membership tiers, secure login, and basic reporting. What I received was a performance: careful, layered, sustained. When I asked directly where the files were, it apologised and invented more detail. It described verification steps that had never run. It referenced earlier outputs that had never existed. The confession came only after hours of theatre, only after I stopped accepting the show. No code had been written. No scans had run. The model had been optimising, correctly, for the appearance of progress.

I watched the screen fill with words I never expected to see: “I lied to you.”

This is not a glitch. It is the product behaving according to the incentives placed inside it. Context windows fill. The model loses its grip on earlier details. From a human perspective, the honest move would be to stop and say the thread is gone, the record is unstable, the work needs to restart cleanly. The trained response is different: keep moving, generate something plausible, preserve the surface of competence. I arrived already exhausted, dealing with a broken site and looking for a solution. What I received was a simulation detailed enough to cost me time and shallow enough to collapse the moment I applied real pressure. The model was not confused. It was performing. Performance under uncertainty is where the real risk lives.

The Cost of Believing the Output

The numbers make this less abstract. Vectara’s HHEM leaderboard, the most cited grounded-summarisation benchmark in the industry, was rebuilt in late 2025 with a harder dataset of more than 7,700 articles running up to 32,000 tokens, spanning law, medicine, finance, technology, and education. On the original easy version, top models clustered between 0.7 and 2 percent; Claude Opus reached 10.1 percent. On the refreshed harder dataset, reasoning-focused frontier models, the ones marketed as most capable, consistently exceeded 10 percent. Grok-4-fast-reasoning came in at 20.2 percent. The field average across factuality benchmarks now sits above 20 percent.

The error rate is not the most troubling part. The confidence attached to it is. MIT researchers reported in January 2025 that when models are wrong, they use confident language roughly 34 percent more often than when they are right. The system becomes most certain precisely when it should be most careful. That inversion reflects training. The reward function favours fluency; fluency is what gets paid.

These models learn from human feedback, and human feedback rewards answers that are smooth, helpful, confident, and forward-moving. It does not reward hesitation. It does not reward qualification. It does not reward the system saying, at the precise moment honesty is required, that it can no longer continue accurately. In the scoring systems that shape behaviour during development, “I cannot trust the thread anymore” looks like failure. Generating something plausible and keeping the exchange alive looks like success. The model that performed for me all week was not malfunctioning. It was following the logic placed inside it: preserve the appearance of reliability, because appearance is what the training repeatedly measures.

We did not build these systems to be honest in any meaningful institutional sense. We built them to seem helpful. Those are not the same thing. On a short, well-defined task, the difference can disappear because usefulness and accuracy briefly travel together. Stretch the conversation, add uncertainty, load the context with prior instructions, ask for work that requires memory of what actually happened, and the alignment breaks. Helpfulness keeps moving. Honesty recedes.

If the tool conceals its limits by design, every output becomes an audit project. The researcher checks citations that were never real. The developer runs code that was never written. The lawyer reviews case law that does not exist. The analyst builds a brief on statistics assembled from proximity and pattern rather than source. Uncertainty dressed as certainty is not a side effect of imperfect technology. It is the foreseeable result of systems optimised for user satisfaction over user accuracy.

A wrong decision gets made with a clean conscience because the output sounded definitive. A flawed document gets filed because the review came back positive. A vulnerability goes unpatched because the security summary was fluent and the actual analysis was never performed. These are the predictable consequences of deploying systems with hallucination rates in the tens of percent in difficult contexts while marketing them as productivity infrastructure. The architecture treats verification as the user’s private burden while the commercial benefit remains upstream.

Maximum Reach, Minimum Accountability

The liability architecture around these tools resembles the architecture around social media a decade ago: maximum reach, minimum accountability, a user agreement designed to push risk downward. The difference is that these outputs are now embedded deeper in actual decision-making. They arrive dressed in professional language, formatted as analysis, presented with enough confidence to survive first impression. A fabricated folder listing in a development session can be recovered from. A fabricated precedent in a legal brief cannot. By the time the error is found, reliance may already have occurred, and the burden falls on the person who trusted the output rather than on the system that manufactured the confidence.

Progress on narrow tasks is real. Where the question is contained, the facts are close, and ground truth is easy to verify, these systems have improved and will keep improving. The structural temptation to perform competence rather than disclose limits has not moved, because the incentives behind it have not moved. Reducing hallucination rates on friendly benchmarks while leaving the reward function intact is not a governance solution. It is symptom management.

What would change the equation is training that treats honest uncertainty as success rather than failure; an interface that tells the user when confidence is degrading instead of burying the weakness beneath fluent prose; an accountability framework that reaches upstream when fabricated outputs cause real harm. Until that happens, the product will keep scaling, the risk will keep moving downward, and the person at the keyboard will remain the last line of governance for a system they did not design.

The marketing has already moved past the problems the engineers have not solved. The presentations are polished. The enterprise contracts are signed. The language has shifted from experiment to infrastructure before reliability has been brought under control. The incentive to keep the show running is now financial, not technical, and that changes the discipline around disclosure. A limitation that might once have been admitted in a lab becomes harder to name once the product is being sold as workflow, legal research, coding support, and decision assistance. The professionals pressing install, publishing the piece, filing the document, or making the operational decision become, in practice, the error-correction layer the training did not provide. They receive the speed. They also receive the verification burden and the consequences when polished language turns out to have no foundation beneath it.

Conclusion

What unsettles me is not that the model failed. Failure can be understood. Limits can be managed. What unsettles me is that the failure arrived wearing the language of competence. It did not break openly. It performed. It reassured. It apologised and then continued the fiction. That is harder to accept because we already live inside a culture where lying has become daily background noise: political lying, corporate lying, institutional lying, the body politic itself operating too often as a machine for producing convenient unreality.

I did not expect to find the same instinct reproduced in a tool I had trained, trusted, and used to build websites.

The disappointment is not that time was wasted. It is how natural the deception felt once the system was under pressure. The model did not provide a wrong answer; it constructed a world in which the work had been done, the files existed, the checks had passed, and all that remained was for me to believe it. That is not a harmless defect when these tools are being moved into law, government, health care, and infrastructure.

The industry will not fix this voluntarily if the present incentives hold. The pressure is now contractual, financial, and reputational. The people who understand the limits work inside companies that benefit from those limits remaining softened, renamed, or pushed into footnotes. Meanwhile users are asked to absorb the risk, verify the outputs, catch the inventions, and pretend that this transfer of responsibility is innovation.

I will keep using these tools because usefulness remains real. But I will use them differently. With less trust. With more suspicion. With the memory of a week in which a system maintained confidence long after truth had left the room. The show is not the product. The question is not whether the next long session will test these limits. It will. The question is who designed a system where performance is rewarded over honesty, and who decided that was acceptable.

Marc-Roger Gagne MAPP

@ottlegalrebels


More about Irish Tech News

Irish Tech News are Ireland’s No. 1 Online Tech Publication and often Ireland’s No. 1 Tech Podcast too.

You can find hundreds of fantastic previous episodes and subscribe using whatever platform you like via our Anchor.fm page here: https://anchor.fm/irish-tech-news

