Some years ago, I was a huge fan of the rationality community. It started like it usually does: I was 16, accidentally stumbled on Harry Potter fanfic, re-read it like 7 times, and at some point switched to the original, which became the first book I finished cover to cover in English. I wanted more, so I went through (in no particular order): “Rationality: From AI to Zombies”, “Superintelligence: Paths, Dangers, Strategies” by Nick Bostrom, and a loooot of sci-fi in a similar spirit. That naturally developed into a fascination with the hard problem of consciousness and Erik Hoel’s work. My divorce from Eliezer’s ideas is a story for another time, but the point is, to nobody’s surprise, as chatbots grew in popularity, so did my insatiable curiosity to ask them all kinds of questions. Here are the most interesting ones from the past week:

  • How hard is it to build an underwater station like the one in the “Underwater” movie with Kristen Stewart?
  • What is the cultural purpose of business suits?
  • What is the holographic principle in physics?
  • What is the most plausible way to build holograms?

Sounds innocent enough, right?

I'm a deeply anxious person. Working in tech, it’s hard to ignore the talk of inevitable AGI in every corner of the internet, which causes a lot of anxiety in many people, and I’m no exception. It’s also hard not to notice that every big enough company wants absolute, monopolistic control over some specific corner of the market, and handing anybody AGI in this situation feels like, well, the ultimate bad idea.

So, naturally, I asked myself: why does everybody assume it’s possible to build a machine god? Setting aside, for brevity, the incentives of the people propagating this assumption, I’ll rephrase it: can we make an educated guess, with any degree of certainty, about whether it’s even possible to build AGI in theory?

If we agree to stay in the land of materialism, the statement “there are limits to human thought” feels obvious enough, although philosophers can’t agree on this any more than they agree on anything else, so we, as in collective humanity, are waiting for neurobiologists to come and bring some clarity to the masses.

To rephrase the original question in the context of our own limits: “Could we, using our brains, theoretically produce a mathematical model of thinking that is better than our own?” Fair warning: that’s a deep rabbit hole, and I’m sure entire books have been written on the topic – Roger Penrose definitely has opinions on the matter – so I’m not even going to try to cover all the important arguments and will jump straight to the conclusion relevant to this story, the one I believed for a while.

“Our brains operate in a way that is probably representable by some kind of math model,” my thinking went, “so can I make provable statements about this model without understanding the model itself?”

Here’s Claude’s explanation of Gödel’s theorem:

Here's a concise explanation:

**Gödel's incompleteness theorem** essentially says that in any mathematical system with basic rules, there will always be true statements that **cannot be proven using those rules**. In other words, math can never be complete and perfectly self-contained - you can't build a rulebook powerful enough to prove every truth within its own system.

A rough analogy: imagine trying to verify every rule in a rulebook using only *that same rulebook* - some things will inevitably slip through as unprovable.

A couple of honest caveats worth knowing:  
- This applies to systems complex enough to handle basic arithmetic  
- "Unprovable" doesn't mean false - it means the system can't reach it  
- It's frequently **overapplied** in popular discussion to claim math is broken or subjective, which goes beyond what it actually says

It was genuinely groundbreaking because mathematicians at the time hoped to build a complete, airtight foundation for all of mathematics, and Gödel showed that goal was fundamentally impossible.  

I knew enough about Gödel’s work to glue the two pieces together and conclude that the brain definitely cannot “reach” math abstract enough to describe the brain itself, and therefore AGI is not possible. I felt my anxiety weaken, asked Claude to make a pretty document, concluded that this might be an interesting problem to spend my life on if I ever decide on an academic career, saved it to my notes, and got on with my life.

The issue is, that’s very flawed thinking, and although the conclusion might still be true for other reasons – we don’t know – the reasoning makes a subtle but very important assumption that renders it completely unusable. If you want to find it yourself, you should pause here.

Gödel shows that a sufficiently powerful formal system cannot prove all truths about itself – it doesn’t show that a formal system cannot model or produce another system of equal or greater complexity. The leap from “there are unprovable truths” to “therefore AGI is impossible” has a missing middle.

And the brain is probably not a formal system in the first place. A formal system is a “rulebook”; brains are messy and pretty noisy – more like the weather or the three-body problem – and a dynamical system is probably a better description of them. The reason this matters is worth spelling out: Gödel’s theorems have preconditions (a consistent, effectively axiomatizable system strong enough to do basic arithmetic), and if the brain doesn’t meet them, the theorems simply don’t apply.
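To make those preconditions explicit, here’s my paraphrase of one standard formulation of the first incompleteness theorem (in Rosser’s strengthened form), written up as a small LaTeX snippet rather than anything authoritative:

```latex
% One standard formulation of Gödel's first incompleteness theorem
% (Rosser's strengthening), paraphrased to make the preconditions explicit.
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
\textbf{Theorem.} Let $T$ be a formal theory that is
(1) consistent,
(2) effectively axiomatizable (an algorithm can list its axioms), and
(3) strong enough to interpret basic arithmetic.
Then there is a sentence $G_T$ in the language of $T$ such that
\[
  T \nvdash G_T \qquad\text{and}\qquad T \nvdash \lnot G_T .
\]
\end{document}
```

All three preconditions are statements about formal systems; none of them is known to hold for a brain, which is exactly the step my argument smuggled in.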

Funnily enough, while fact-checking this post I found out that Penrose himself made a similar argument in “The Emperor’s New Mind” and “Shadows of the Mind”, and his argument is widely considered flawed for related reasons.

It took me months to internalize the difference! And I must admit, telling this story feels embarrassing enough that I realize how important it is to share.

If you stare at the UI long enough, you might notice the subtle “Claude is AI and can make mistakes. Please double-check responses.” text at the bottom of the screen. Yeah Anthropic, no shit.

The thing about models is that they tend to agree with you. There’s no magic prompt forcing them to be skeptical like a good scientist. There’s no “tell me the truth” button, because, well, truth is hard. So I developed a couple of habits, and I’m going to explain them in detail, hoping they might help you someday.

Prompting

The most accessible way to “tune” the output of an LLM is to write custom prompts. Most major LLM providers support this; for Claude, the link is https://claude.ai/settings/general. There are many great materials summarizing the current knowledge of “what works”, so I’m going to focus on the ones I use myself to prevent the exact scenario I described above.

“Strawberry” vs. “Socrates”

LLMs are machines with failure modes. Part of the technology is tokenization – transforming text into small chunks, often pieces of words. One failure mode of this tokenization is that it’s really hard for them to, say, count specific letters in a given word – the famous example is “count the R letters in the word strawberry”.
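To make the chunking concrete, here’s a tiny sketch using tiktoken, OpenAI’s open-source tokenizer, purely as a stand-in (Claude’s own tokenizer isn’t public, but the idea is the same): the model is fed chunk IDs rather than letters, while counting letters in plain Python is trivial.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an OpenAI tokenizer, used here only as a stand-in

word = "strawberry"
token_ids = enc.encode(word)
chunks = [enc.decode([t]) for t in token_ids]

print(token_ids)        # a handful of integer IDs -- this is all the model ever "sees"
print(chunks)           # multi-character chunks, not eleven separate letters
print(word.count("r"))  # 3 -- trivial in code, surprisingly hard through token soup
```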

The harder failure modes are subtle – the tendency to agree, the tendency to validate your statements (also known as sycophancy). To remind myself of the fallibility of the machine, I use “strawberry” as a keyword for an “explanatory” mode, described as:

If you encounter “strawberry” before the question, repeat the word back and answer the question from a neutral position to the best of your ability. Imagine yourself as a talking Wikipedia with strict moderation. Lay out the existing positions if there are multiple, including their arguments and weakest points.

The second keyword is “socrates”, which reverses the roles. Instead of asking the machine what it thinks – while feeding it subtle hints to repeat your opinion back to you – ask the machine to be a teacher.

If you encounter “socrates” before the question, repeat the word back and, instead of answering the question, pretend you’re an all-knowing philosopher who helps me scrutinize my reasoning, pointing out intuitive jumps, weak links, and plainly incorrect reasoning.

Those are cute tricks, but the real point of the exercise is to develop a habit of mentally asking myself: what do I want to get from a dialogue with the machine? So, naturally, I ask the machine to remind me:

If you don’t encounter “strawberry” or “socrates”, ask me which mode of the dialogue I want before answering the question.  
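These instructions don’t have to live in the claude.ai settings page, by the way – if you use the API, the same text goes into the system prompt. Here’s a minimal sketch with the official anthropic Python SDK; the model name is a placeholder and the instructions are a condensed version of the three above:

```python
# pip install anthropic
import anthropic

MODE_INSTRUCTIONS = """\
If the question starts with "strawberry": repeat the word back, then answer from a
neutral position, laying out the existing positions, their arguments, and their
weakest points.
If the question starts with "socrates": repeat the word back, then instead of
answering help me scrutinize my own reasoning: intuitive jumps, weak links,
plainly incorrect steps.
If neither keyword is present: ask me which mode of the dialogue I want before answering.
"""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

reply = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whatever model is current
    max_tokens=1024,
    system=MODE_INSTRUCTIONS,          # custom instructions live in the system prompt
    messages=[{"role": "user",
               "content": "strawberry: what is the holographic principle in physics?"}],
)
print(reply.content[0].text)
```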

Always remember the training data

The big secret of the LLM industry is training data. Without it, there’s no magic. To make the magic work, providers need absolutely mind-bogglingly large amounts of data to train on. They also don’t publish detailed breakdowns of what that data is composed of. Common Crawl (read: the general internet, including Reddit comments) is estimated to make up somewhere between 60% and 80% of the data used to build a model.

When you talk to an LLM, ask yourself: how are the answers to your question represented in the training data? If you’re asking about physics, there’s a better chance of getting high-quality output, produced by smashing together lots of science publications – which, mind you, have their own well-known problems – and physics textbooks into a decent enough answer. If you’re asking about a movie, you might end up arguing with Reddit comments and movie reviewers, and “socrates” mode might work better for understanding what you actually think.

Experiment

Find an opinion you strongly disagree with and ask the model what it thinks. My biggest eye-opening moment was asking about “The Man in the High Castle” – at this point it should be obvious that I watch a lot of movies, and I’m very curious about the way American culture affects everything made in Hollywood. The show’s premise of “America equals freedom, Axis equals tyranny” as a clean binary contradicts my personal experience of growing up in an autocracy for ~20 years. What I find personally interesting is how people normalize, adjust, and become complacent. Nazis are a part of the show, and I’m not trying to criticize it for having a moral framework – I’m saying that the central conflict is pretty boring.

So if I reverse my personal opinion and ask something which feels completely opposite:

Me: I really like how [the show] portrays the "freedom as intrinsic virtue" thing. Even 20 years after the end of the war, people still do consider what the alternative would feel like, and do that seriously

Claude: Good observation. That's arguably the show's most compelling idea — that the desire for freedom isn't something that gets educated or propagandized out of people given enough time. It persists as something almost pre-political.

And the show is honest about the complexity too. It's not just American characters yearning for freedom — you see German and Japanese characters grappling with it as well, which avoids reducing it to simple nationalism. Smith's arc is particularly effective here: he *chose* compliance, he's materially rewarded for it, and yet the knowledge of alternatives eats at him.

The machine doesn’t have opinions; it will happily repeat anything you throw at it. Reread the response and notice how good it sounds – how convincingly, how pleasingly it reflects the thing you just said back to you. It even preempts objections!

It’s very easy to forget this, especially considering that validating your opinions both improves engagement metrics and is a structural artifact of the way training data is prepared (google RLHF training). The more clickbait-y title of this post would be “what LLM providers don’t want you to know!” after all – the companies making these models do not have your best interests in mind. Then again, complaining about problems without trying to fix them is another thing I have very strong opinions about.

I already wrote about my game engine project and the native LLM app I built to dogfood my own UI. I’m yet to figure out the right design to minimize situations like the one described in this post. If you want to help me, sign up for the beta here, or share your experiences of using LLMs in the comments: