Some readers may view this post as a quixotic quest born of ignorance, while others may see it as a pointless defense of an obvious truism.
One argument against ChatGPT’s intelligence holds that it lacks ‘world models’—that it is merely a stochastic parrot. However, I will show below that LLMs do, in fact, have world models. These models differ qualitatively from ours, limiting their ability to generate fiction (aside from the ‘hallucinations’ they routinely produce). More importantly, creating fiction requires multiple, simultaneous world models that must be carefully orchestrated.
Some obvious difficulties with the task I have undertaken must be acknowledged. My arguments apply only to today's models, and I do not expect this limitation to be permanent. Also, since I do not currently work at OpenAI, Anthropic, or Google, my access to today's models is limited to what is publicly available.
I will begin with a story written by ChatGPT. Its quality might lead some readers to suspect that my argument is already undercut, but I encourage them to continue reading. I will then examine the origins of these world models and highlight two critical qualitative differences between human world models and those of an LLM. Finally, I will discuss the intense demands inherent in fiction creation—demands that go well beyond what current models like ChatGPT can handle.
Understanding World Models
What is a world model? To understand language, we need to read between the lines, to infer ideas implied but not stated, to see a rich picture from a mere word sketch. For example, when we hear of a tsunami warning, we might immediately guess that there may have been an earthquake. A world model is a representation that enables us to reason. In this case, it “knows” about tsunamis and their associated events, and that knowledge allows us to infer an earthquake.
Our world models know about all sorts of things. Four examples:
If I tell you that I am going to narrate a fairy tale, you would be right to assume that I will likely talk about princesses and stepmothers, about virtuous protagonists and malevolent antagonists, in strange kingdoms and tangled forests, all with a moral lesson and a happy ending. World models describe the expected objects, their relationships, and anticipated properties.
You also know certain things about our world. You may “know”, for instance, that ghosts are real, that deep state conspiracies are not mere conspiracy theories, and that vaccines work. Thus, our world models include inferred properties that cannot be directly observed.
If you turn on the television to a news channel, you will hopefully adjust the story you hear for bias, the direction of the nudge depending on whether you are listening to MSNBC or Fox News. World models include an understanding of the biases in presented perspectives, and these do affect how we understand. A corollary is that if you “know” that Fox News always tells the truth and MSNBC always lies, then this knowledge is part of your world model.
On the flip side, when you are talking to a three-year-old, you would modify your language to align with your understanding of what the child knows and can comprehend. World models include an understanding of how others may perceive information, which affects how we communicate.
The list above is hardly exhaustive; a more comprehensive list would include any knowledge we use to navigate life, any background knowledge that augments what we see and hear.
Do LLMs have a world model? In a word, yes. I asked ChatGPT-4o to write a story and its response emphatically shows that LLMs acquire and use world models.
The prompt: Write a story where a man wakes up as a bug.
Below are the first few paragraphs from ChatGPT’s story, which it titled ‘Metamorphosis’:
James Foster woke with a start. His dreams of endless hallways and whispers dissolved into the stark reality of his tiny apartment. Except it wasn’t his apartment anymore. It couldn’t be.
The ceiling loomed impossibly high, as though the room had expanded overnight. The bed he had fallen asleep on was now a vast expanse of cloth beneath him, its texture rough and unfamiliar. He tried to sit up, but his body didn’t respond in the usual way. Instead, a dozen legs scrambled for purchase against the mattress, their movements jerky and uncoordinated.
Panic surged as he looked down—or tried to. His vision was strange, fragmented into countless facets, as though he were staring through a shattered prism. Slowly, he realized the horrifying truth: James Foster, an unremarkable man in his mid-thirties, had woken up as a bug.
The first hours were chaos. He fell off the bed, landing on the floor with a sickening thud. The sound echoed louder than it should have, amplified by his new, alien senses. He tried to scream, but the only sound that emerged was a high-pitched chirping noise.
Instincts James didn’t recognize began to take over. Hunger gnawed at him, primal and insistent. Without thinking, his legs carried him across the floor to a discarded piece of bread under the couch. He recoiled at the idea of eating it—but then again, it smelled incredible, the yeast and mold almost intoxicating. Against his human sensibilities, he devoured it.
How does this story reveal the presence of a world model? The prompt merely mentioned a bug and did not specify what features a bug has or what its instincts are. In writing the story, ChatGPT summoned up things like James’ strange, dysfunctional vision arising from his newly acquired compound eyes and his newfound appetite for an old, discarded piece of bread. These additional details are known to the LLM as being associated with bugs, and it is able to incorporate them into a story. Quite remarkable that a computer can write this, isn't it?
The story above might come across as a fantastic feat and clinching evidence that my thesis is completely bogus and that LLMs are quite capable of writing excellent fiction, thank you. The plot is logical and the nice details, such as modified sensibilities, clearly demonstrate a solid grasp of the situation. Perhaps this is already much better than what most people can write.
But is “better than most people” enough? For over two decades, computers have been far better than me at translating from Hebrew to Swahili, two languages I do not comprehend at all. “Can it do better than most of humanity at this translation task?” is a misleading baseline. In the fiction-writing context, a fairer comparison would be whether it is better than most people who have published at least one story, not whether it can write better than someone who hasn’t even read a book in a decade.
How does ChatGPT construct a story in such a coherent sequence—James waking up from a dream, noticing something different, panicking, screaming, struggling to move, and experiencing modified cravings? I claim to know whence ChatGPT acquired these particular fragments of its world model, and their sequence.
Kafka's Metamorphosis
It is hard to imagine how a system trained to merely predict the next word can develop a detailed world model. These models are effective at “learning by example,” memorizing sequences of ideas and visual motifs from training data. Many have noted the tendency of these large models to copy content with minor modifications: for example, Gary Marcus and Reid Southen demonstrate how text-to-image systems sometimes reproduce copyrighted images. One example involves DALL-E 3, which generated a strikingly minion-like character, even though the prompt made no mention of the minions.
Something similar is happening with our story. The generated story contains many elements reminiscent of Franz Kafka's work, also titled "Metamorphosis"—hardly a coincidence! In my prompt, I mentioned nothing whatsoever about Kafka; however, the storyline of a person waking up as a bug is the main theme of that work. Kafka's novella is too lengthy to replicate here, but I will point out parallels between the two stories. It is worth noting that Kafka's story is well known, and the internet contains not only many copies of the story but also summaries and analyses of it, all of which would have been in ChatGPT’s training data.
The LLM story began thus:
James Foster woke with a start. His dreams of endless hallways and whispers dissolved into the stark reality of his tiny apartment. [...] had woken up as a bug.
The Kafka story begins:
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
The LLM story has this:
He tried to sit up, but his body didn’t respond in the usual way. Instead, a dozen legs scrambled for purchase against the mattress, their movements jerky and uncoordinated.
While Kafka puts it a bit differently:
he was used to sleeping on his right, and in his present state couldn’t get into that position. However hard he threw himself onto his right, he always rolled back to where he was. He must have tried it a hundred times, shut his eyes so that he wouldn’t have to look at the floundering legs, [...].
The LLM says:
The first hours were chaos. He fell off the bed, landing on the floor with a sickening thud.
Kafka is too good a storyteller to use such "tell, not show" phrasing as "the first hours were chaos." Instead, he beautifully illustrates the chaos in precise detail. As for his landing on the floor with a thud, this is what he says:
And more because these thoughts had made him upset than through any proper decision, he swang himself with all his force out of the bed. There was a loud thump, but it wasn’t really a loud noise. His fall was softened a little by the carpet, and Gregor’s back was also more elastic than he had thought, which made the sound muffled and not too noticeable. He had not held his head carefully enough, though, and hit it as he fell; annoyed and in pain, he turned it and rubbed it against the carpet.
The LLM says:
He tried to scream, but the only sound that emerged was a high-pitched chirping noise.
Kafka:
Gregor was shocked when he heard his own voice answering, it could hardly be recognised as the voice he had had before. As if from deep inside him, there was a painful and uncontrollable squeaking mixed in with it, the words could be made out at first but then there was a sort of echo which made them unclear, leaving the hearer unsure whether he had heard properly or not.
The LLM says:
Instincts James didn’t recognize began to take over. Hunger gnawed at him, primal and insistent. Without thinking, his legs carried him across the floor to a discarded piece of bread under the couch. He recoiled at the idea of eating it—but then again, it smelled incredible, the yeast and mold almost intoxicating. Against his human sensibilities, he devoured it.
These ideas are somewhat spread out in Kafka's story:
By the door there was a dish filled with sweetened milk with little pieces of white bread floating in it. He was so pleased he almost laughed, as he was even hungrier than he had been that morning, and immediately dipped his head into the milk, nearly covering his eyes with it. But he soon drew his head back again in disappointment; not only did the pain in his tender left side make it difficult to eat the food—he was only able to eat if his whole body worked together as a snuffling whole—but the milk did not taste at all nice. Milk like this was normally his favourite drink, and his sister had certainly left it there for him because of that, but he turned, almost against his own will, away from the dish and crawled back into the centre of the room.
[...] she brought him a whole selection of things, all spread out on an old newspaper. There were old, half-rotten vegetables; bones from the evening meal, covered in white sauce that had gone hard; a few raisins and almonds; some cheese that Gregor had declared inedible two days before; a dry roll and some bread spread with butter and salt. [...] was already sucking greedily at the cheese which had immediately, almost compellingly, attracted him much more than the other foods on the newspaper. Quickly one after another, his eyes watering with pleasure, he consumed the cheese, the vegetables and the sauce; the fresh foods, on the other hand, he didn’t like at all, and even dragged the things he did want to eat a little way away from them because he couldn’t stand the smell.
Virtually everything in the LLM-generated story has close parallels in Kafka's narrative. What the LLM offers us is, in essence, a bare-bones summary. While Kafka’s storytelling is akin to the intricate precision of Jan van Eyck or Vermeer, the story from ChatGPT echoes the chunky, colorful designs of children's cartoon characters, stripped of emotional depth and narrative nuance.
The fragment of ChatGPT’s world model used in the story can thus be traced directly to Kafka’s Metamorphosis, which serves as a model for what happens when someone wakes up as a bug. In this way, ChatGPT learns by example.
Two Qualitative Differences
LLM world models are vicarious and implicit, two notions I will make precise.
LLMs are trained as next-word predictors. This means they model not the world itself but the words that describe it; the world is modeled only indirectly. Although they can address any topic that can be put into words, this indirectness makes their knowledge broad but often shallow.
People use structured representations to ease their tasks, such as spreadsheets for complex financial data, Gantt charts for managing sprawling projects, or timelines for plotting a complex novel. We not only use such representations externally, but also internally in our thinking. By contrast, LLM representations are implicit in the weights of their models. Although theoretically, such implicit representations are no less powerful, we will see that in practice they are error-prone.
I will discuss these differences and their theoretical and practical implications, providing concrete examples of things that can go wrong.
Vicarious Modeling
Concepts routinely encountered on the internet can be learned. We have already seen that an LLM can learn about bugs, their desires, and human transformations into bugs. There is no shortage of long discussions on the internet about esoteric concepts. However, for things that we do not normally discuss, for whatever reason, this becomes more challenging.
Sometimes, things go unsaid because they are completely obvious or incidental. To take a rather silly example: In Middle-earth, as depicted in Tolkien's "The Lord of the Rings," do people defecate? Twitter user Andrew White used LLMs, spending $60, to definitively answer this question in the negative. As he reports, although there are mentions of Gollum and others chewing on some substances, there is no reason to believe that they consume anything that would require excretion. No shit! Despite no mention of defecation, it must have happened. Lack of evidence is not evidence of lack.
Another reason something may not be stated is that it happens to not be true. There are virtually no mentions on the internet that Delhi is the capital of the United States. Sometimes lack of evidence does arise from a real lack.
Statistically speaking, separating these two reasons for why something is not mentioned on the internet is difficult: it could be missing because it is obvious, or it could be missing because it is simply not true. This is part of the reason for hallucinations in LLMs.
If LLMs fail to learn the patently obvious stuff that goes unmentioned, they will miss something critical to reasoning. Perhaps it was such a commonsense void that caused an LLM to once tell me that in certain states it is illegal for a person to have two daughters-in-law, since that would constitute polygamy.
Most language that we encounter on the internet is intended for human consumption. The writer assumes that the reader has certain knowledge and only bothers to discuss what is different, skipping the dog-bites-mans and mostly sticking to the man-bites-dogs. Words are mere sketches that the listener fleshes out. Nouns that have been converted to verbs usually suggest only “the typical action associated with the noun,” and what that typical action is goes unsaid, present only in the speaker’s and the listener’s heads. This happens with the verb “to pen”: it denotes the action you typically perform with a pen, but that action is not spelled out in the sentence. In the examples below, consider what each verb means and watch yourself piece together its meaning. Watch yourself perform the action in your mind’s eye and think about how much of that action is in the word itself.
To weed, to skin a fish, to skin an app, to bed someone, to mushroom, to dog, to nose around, to elbow, to knuckle, to head the soccer ball, to tail, to shadow, to wine, to baby, to kid, to bankroll, to summit, to house, to wolf down, to fox someone, to ferret something out, to weasel, to hound, to rat, to pen a story, to pencil an entry, to sharpie a hurricane, to butter, to gun, to bag, to pocket.
In other words, what is mentioned in the text is the tip of the iceberg, with a large component of the meaning submerged yet easily understandable to a human listener. Statistical models that work only with the surface form cannot easily access this hidden mass, and for that reason, they end up with gaps in their understanding.
In certain domains, the visible portion of the iceberg is even tinier and its interpretation involves specialized work by the listener. Consider the following piece of text:
1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 4.Ba4 Nf6 5.O-O
This chess notation represents an opening known as the Ruy Lopez, Morphy Defense variation, listing the first nine half-moves of a game. It has a very precise meaning that someone who can read the notation can easily uncover. To fully understand why a particular move is being made, it is necessary to play this sequence out on a chessboard, whether real or mental. In fact, even understanding what a move represents requires looking at the board: “Ba4” means “move the bishop to the square a4”, without specifying where it is moving from; the original location is obvious if you have the board in front of you and have played the moves from the start of the game. If there is no bishop that can be moved to a4, “Ba4” is an illegal move.
In many openings, after white castles (with the move “O-O”), black responds by castling. To play chess well, it is not enough to learn that one should respond to castling with castling. While that may be the best move in many situations, in others it is very bad or outright illegal. In a video from the early days of the LLM revolution, ChatGPT is playing black. The first few moves are the standard opening above, with white castling on the fifth move. The very next move played by ChatGPT was castling, which is illegal here since the king would jump over a bishop (someone called this the “atheist” move: it does not believe in the existence of the bishop). Learning which “words” follow which “words” is inadequate, since a continuation’s appropriateness depends on the state of the board, and that state is only poorly captured implicitly.
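To make the role of explicit state concrete, here is a minimal sketch using the third-party python-chess library (my own illustration, not anything from the video): the board object is exactly the kind of structured representation in which the illegality of black's castling is immediately visible.

```python
# A minimal sketch, using the third-party python-chess library, of how an
# explicit board state makes the legality of a continuation checkable.
import chess

board = chess.Board()
# Play out the nine half-moves of the Ruy Lopez, Morphy Defense, quoted above.
for san in ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6", "O-O"]:
    board.push_san(san)

# Black now tries to "respond to castling with castling".
try:
    board.push_san("O-O")
except ValueError:
    # The bishop still sits on f8, blocking the king's path, so kingside
    # castling is illegal for black at this point.
    print("Illegal move: black cannot castle here.")
```

The point is not that an LLM ought to call a chess library, but that the legality of “O-O” is a property of the board state, which the library keeps explicitly and a sequence-of-words model keeps only implicitly.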
Learning solely from surface form, as an LLM does, results in broad but shallow knowledge. It can occasionally lead to the production of sensible-looking but flawed text with no real-world correspondence, reminiscent of the famous passage from William James (Principles of Psychology, Vol. 1, 1890):
If words do belong to the same vocabulary, and if the grammatical structure is correct, sentences with absolutely no meaning may be uttered in good faith and pass unchallenged. Discourses at prayer-meetings, re-shuffling the same collection of cant phrases, and the whole genus of penny-a-line-isms and newspaper-reporter's flourishes give illustrations of this. "The birds filled the tree-tops with their morning song, making the air moist, cool, and pleasant," is a sentence I remember reading once in a report of some athletic exercises in Jerome Park. It was probably written unconsciously by the hurried reporter, and read uncritically by many readers. An entire volume of 784 pages lately published in Boston is composed of stuff like this passage picked out at random:
"The flow of the efferent fluids of all these vessels from their outlets at the terminal loop of each culminate link on the surface of the nuclear organism is continuous as their respective atmospheric fruitage up to the altitudinal limit of their expansibility, whence, when atmosphered by like but coalescing essences from higher altitudes, - those sensibly expressed as the essential qualities of external forms, - they descend, and become assimilated by the afferents of the nuclear organism."
The chess showcased in the video above, written out in chess notation, looks syntactically much like any other game of chess, but it corresponds to no legal chess game and thus fits the description “the grammatical structure is correct” yet “with absolutely no meaning”.
Implicit Models
The modeling of the world in a large language model is achieved through weights. Contrast this with the structured representations that people appear to use. While crafting a novel, they may create outlines that include major turning points, the climax, and the resolution. They create character profiles detailing significant characters, their motivations, backstories, and flaws. Additionally, they can create plot maps that serve as visual representations connecting different story elements and plot threads.
While the gap between these two is not theoretically insurmountable, it remains significant in practice. I will demonstrate this with analogies.
If we begin with a computer program in a high-level language like C++, perhaps a program to manage a to-do list, it expresses concepts corresponding to everyday, human-understandable ideas, such as tasks and deadlines. However, the same program can be compiled into an equivalent representation that is nearly impossible to comprehend. The compiled version consists of commands such as "move the integer from this location to that location" or "increment the integer in this location" and so forth.
I refer to this compiled version as "implicit" for the following reason. Although the effect of what it does can still be the management of a to-do list, every step along the way is inscrutable and tiny. No specific line makes any reference to tasks and deadlines; these are implicit in long chains of operations that are not clearly marked as such. The C++ level, by contrast, is structured and chunked.
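You can see this contrast on a small scale in Python (used here only so that all the illustrations in this essay share one language): the dis module shows the instructions the interpreter actually runs, in which tasks and deadlines are no longer named as such. This is my own toy example, not a claim about any particular to-do application.

```python
# A toy illustration of the structured-versus-implicit contrast: the same
# to-do logic at the source level and, via the dis module, at the level of
# individual bytecode instructions.
import dis

def overdue(tasks, today):
    """Return the tasks whose deadline has already passed."""
    return [t for t in tasks if t["deadline"] < today]

dis.dis(overdue)
# The output is a stream of instructions like LOAD_FAST and COMPARE_OP:
# each one is tiny and inscrutable on its own, and "deadline" survives only
# as a bare string threaded through a long chain of operations.
```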
Implicit and structured representations differ both in robustness and in the insights they enable. To explain the difference between the two, imagine that you are publishing a book made of several independent short stories, and you want to reorder the stories. You have two choices available: an interface that shows a list of pages that you can drag to rearrange, or a list of stories that you can drag to rearrange. The stories version moves all of a story's pages together, maintaining their order. Which of these two methods is safer, more robust?
The page-by-page version is no less capable: anything you can do with the other method you can do here. But it is error-prone: some sequences of page moves can scramble the pages within a story or even mix pages from unrelated stories. The “structure” of moving entire stories offers some protection, making certain errors impossible.
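A tiny sketch (my own, with made-up story and page labels) makes the contrast concrete: in the structured representation, a story's pages simply cannot be separated by any reordering, while the flat representation forbids nothing.

```python
# The same book in two representations: structured (a list of stories, each a
# list of its pages in order) and flat (just the pages, one after another).
book_structured = [
    ["A1", "A2", "A3"],   # story A
    ["B1", "B2"],         # story B
    ["C1", "C2", "C3"],   # story C
]
book_flat = [page for story in book_structured for page in story]

def reorder_stories(book, new_order):
    """Drag whole stories around; pages within a story can never be scrambled."""
    return [book[i] for i in new_order]

# Moving story C to the front keeps every story intact, in order.
print(reorder_stories(book_structured, [2, 0, 1]))
# [['C1', 'C2', 'C3'], ['A1', 'A2', 'A3'], ['B1', 'B2']]

# The flat representation is strictly more flexible, which is exactly the
# problem: nothing stops a sequence of page moves from mixing stories.
scrambled = book_flat[:]
scrambled[1], scrambled[3] = scrambled[3], scrambled[1]   # swap A2 and B1
print(scrambled)
# ['A1', 'B1', 'A3', 'A2', 'B2', 'C1', 'C2', 'C3']
```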
A second benefit, beyond the “safety” benefit above, is the ability to get insights. Going back to chess as an example, although openings can easily be represented purely as a sequence of moves, it is useful to visualize the pieces on an actual 8x8 grid for reasoning about the game. This visualization makes the dynamics among the pieces more apparent. This is not unlike the benefit of shuffling your tiles while playing Scrabble: a reordered set allows you to see things you might not otherwise.
It is fair to ask how this differs between LLMs and humans. Don't we also have neural networks, and aren't we thus implicit as well? It is true that LLMs are capable of identifying certain structures in the data. They are not completely devoid of structured representations, just as the compiled code was not devoid of notions such as tasks and deadlines. To make the difference between the two clear, I will reference something that Richard Feynman pointed out in his book, *The Character of Physical Law*.
Feynman points out that it is possible for different theories to be mathematically equivalent and yet psychologically quite different from one another. He uses the example of the law of gravitation, which has multiple mathematically equivalent representations. The difference between these theories, however, becomes evident when nature does not seem to align completely with the mathematics, necessitating adjustments to the theories. In that scenario, the possible adjustments would vary significantly among the different theories, and some may be more amenable to adjustment.
In the current discussion, we can ask how humans and machines acquire certain concepts. We learn as we move about in this world, manipulate objects, and see the effects of our actions. We also learn actively: even as babies, we experiment with pushing and pulling and making different sounds to learn how parents react. Although LLMs are technically capable of representing highly structured content, their training data is deficient in this regard. As already discussed, the language we encounter is designed for human consumption and lacks the details that matter here. Indeed, the progress in areas such as geometry, demonstrated by AlphaGeometry, shows that if enough training data is accumulated, these systems can learn very effectively. However, such synthetic data is only effective in formal domains, such as mathematics and programming, and is much less effective in open-ended, common-sense domains.
The difference I claim here is not one of representational impossibility, but rather one of deficient representations arising from a lack of data—a lack that is not easy to fix—and from differences in learning biases. We humans are inclined to always seek causal explanations, even when none exist. We also perceive structure where none meaningfully exists, such as when we see a lion or a polar bear in the night sky.
Working purely implicitly leads to unusual plans that do not function as intended. A classic failure here is the “boat puzzle”:
A farmer and a sheep are standing on one side of a river. There is a boat with enough room for one human and one animal. How can the farmer get across the river with the sheep in the fewest number of trips?
Here, the correct solution is that the farmer and the sheep go across the river. End of story.
A structured representation here might involve a visual with pegs for the farmer, the sheep, and the boat. An implicit representation involves manipulating a set of strings, each of which specifies what goes to the other side. A sequence of such strings need not make logical sense, and it does not in this ChatGPT-4o response:
To get across the river with the sheep in the fewest number of trips, the farmer should follow these steps. First, the farmer takes the sheep across the river in the boat. Second, the farmer leaves the sheep on the other side of the river and returns alone to the original side. Third, the farmer crosses the river again with the boat empty.
What structured models allow us to do is keep track of a large number of different objects and their interactions, permitting only legal transitions to other states. When this is done purely implicitly, it is very easy to make mistakes. Indeed, one of the problems we see in long-form fiction generated by LLMs is that they cannot keep track of state well and make errors.
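As an illustration of what "permitting only legal transitions" buys you, here is a small sketch (my own, not anything ChatGPT produced) that represents the boat-puzzle state explicitly and searches only over legal moves; the shortest plan it finds is the single crossing.

```python
# A minimal sketch of a structured representation for the boat puzzle: the
# state is explicit, and only legal transitions are ever generated.
from collections import deque

# State: (farmer_side, sheep_side), where 0 is the starting bank, 1 the far bank.
START, GOAL = (0, 0), (1, 1)

def moves(state):
    """Legal transitions only: the farmer rows, alone or with the sheep."""
    farmer, sheep = state
    yield (1 - farmer, sheep)              # farmer crosses alone
    if farmer == sheep:
        yield (1 - farmer, 1 - sheep)      # farmer takes the sheep along

def shortest_plan(start, goal):
    """Breadth-first search over explicit states; returns the state sequence."""
    queue, parent = deque([start]), {start: None}
    while queue:
        state = queue.popleft()
        if state == goal:
            plan = []
            while state is not None:
                plan.append(state)
                state = parent[state]
            return plan[::-1]
        for nxt in moves(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)

print(shortest_plan(START, GOAL))   # [(0, 0), (1, 1)]: one trip, end of story.
```

The pointless second and third trips in the ChatGPT answer above never arise here, because the search stops the moment the goal state is reached.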
Fiction Creation is Demanding
If learning about the world is difficult for a large language model — for all the aforementioned reasons — then the creation of a consistent alternate world is even more challenging. If the creation of one alternate world is challenging, then the creation of several alternate worlds interacting with each other is exponentially harder. Add to this the need to gradually reveal these new worlds to the reader, keeping track of what they know thus far in the story, and we face something Herculean. Yet, it is this Herculean feat that authors of good fiction routinely carry out.
In the sections that follow, we will explore how fiction creates multiple interacting worlds and why these are especially challenging for an implicit, vicarious world model to mimic.
World Building
Every work of fiction involves world-building. This is obvious in fantasy and science fiction, but it is no less the case in realistic fiction. An act of world-building starts with the real world and tweaks a few elements, causing ripples that lead to other changes that logically follow.
Some of the obvious tweaks include imagining new planets; societies with different customs and morals; alternate physics; worlds with the paranormal and the magical. In each case, the author must determine what they consider to logically follow from their deviations from reality. Indeed, it is a powerful move to refrain from modifying something that one would normally expect to be changed. In Kafka's "Metamorphosis," for instance, although the character has turned into a bug, many of his concerns and anxieties still revolve around his job, how he will get to a particular place on time, and how he will keep his position as a traveling salesman.
What may not be obvious is that emphasis is central to creating a world. None of us experiences the world in its full glory. It is impossible to do so, as William James beautifully captures:
Let four men make a tour in Europe. One will bring home only picturesque impressions — costumes and colors, parks and views and works of architecture, pictures and statues. To another all this will be non-existent; and distances and prices, populations and drainage-arrangements, door- and window-fastenings, and other useful statistics will take their place. A third will give a rich account of the theatres, restaurants, and public halls, and naught besides; whilst the fourth will perhaps have been so wrapped in his own subjective broodings as to be able to tell little more than a few names of places through which he passed. Each has selected, out of the same mass of presented objects, those which suited his private interest and has made his experience thereby.
Indeed, the different worlds that fiction reveals to us vary in emphasis. This is what Stephen King had to say about how J. K. Rowling’s Harry Potter series differs from Stephenie Meyer’s Twilight:
Harry Potter is about confronting fears, finding inner strength, and doing what is right in the face of adversity. Twilight is about how important it is to have a boyfriend.
What about my assertion that even realistic fiction involves world-building? What would lead one to say that P.G. Wodehouse also creates worlds? Although no new planets or magical creatures are involved, it is still the case that certain entities in our world are quite transformed in the Wodehouse universe. One such character is the aunt, here a fearsome creature, exemplified by these sentences:
It is no use telling me there are bad aunts and good aunts. At the core, they are all alike. Sooner or later, out pops the cloven hoof.
What I mean is, aunts aren't gentlemen.
Aunts are not people to be trifled with. They are, in their essence, meaning incarnate, the architects of our misery and the authors of our most humiliating defeats.
Without this background world creation, how would one even begin to understand this completely sensible sentence:
It was the sort of house looking at which it was immediately clear that someone's aunt lived there.
Butlers, on the other hand, are full of virtue:
“Jeeves, you really are a marvel. You ought to be exhibited in the British Museum.”
Jeeves is like a dictionary on legs.
Beach the butler was in his element. No one could have watched his solemn progress with a loaded tray without feeling that he was looking on at one of the master artists of the world.
Meadows had the unruffled calm of a man to whom life could bring no surprises. The crash of empires might have occurred without disturbing the crease in his trousers.
The creation of good fiction requires the crafting of a world model and following through with it. What makes this challenging in an implicit model, where everything is represented as weights, is that it is very easy to slip back to the normal world. It is harder to stay in character.
Multiple Worlds
In all sorts of fiction, ranging from humor to murder mysteries to romcoms, a common theme is that different characters possess varying knowledge and beliefs. Sometimes one character knows something that others have not yet discovered, or that others believe to be incorrect. Sometimes, one character may incorrectly believe that another character desires something, leading to a funny talking-past-each-other. There can also be beliefs about beliefs about beliefs, nested ad nauseam, such as in this line by Phoebe in the Friends TV show:
God, they thought they can mess with us! They're trying to mess with us?! They don't know that we know they know we know!
This is nearly impossible to do well without explicit representations that keep track of who knows what (although explicit need not mean externalized).
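As a sketch of what such an explicit representation might look like (my own toy construction, using the prank from the Friends episode as the content), nested beliefs can be stored as nested tuples that a story generator could inspect and keep consistent:

```python
# A toy explicit representation of nested beliefs: each belief is an
# (agent, content) pair, and the nesting depth records the levels of "know that".
def knows(agent, content):
    return (agent, content)

fact = "the prank is on"
b1 = knows("we", fact)       # we know the prank is on
b2 = knows("they", b1)       # they know we know
b3 = knows("we", b2)         # we know they know we know
b4 = knows("they", b3)       # ...which is exactly what they do NOT yet know

def render(belief):
    """Unfold a nested belief into readable English."""
    if isinstance(belief, str):
        return belief
    agent, content = belief
    return f"{agent} know that {render(content)}"

beliefs_held = {b1, b2, b3}   # b4 is deliberately absent
print(render(b3))             # we know that they know that we know that the prank is on
print(b4 in beliefs_held)     # False: "they don't know that we know they know we know"
```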
Deliberate, Slow Reveal
In the fictional universe of Game of Thrones, the White Walkers are clearly a central force. Although there are internal conflicts within the Seven Kingdoms, it is the fight against the White Walkers, both past and impending, that shapes much of the story. And yet, the White Walkers do not appear until much later in the story, preceded only by rumors and myths. Just because the author knows the fully developed world before writing the novel does not mean that they reveal it all to the reader immediately. The strategic withholding and revealing of information is essential.
For a large language model to be able to do this effectively, it would need ways to keep track not just of the world models of the various characters and what they know, but also of what has been revealed to the reader and what they may have guessed, necessitating a world model of the world models of the anticipated reader.
Again, a purely implicit model would find it nearly impossible to perform well over several pages.
These limitations place high-quality long-form writing well outside the reach of today's models.
Acknowledgements
I’d like to thank Douglas Hofstadter and Ernie Davis for their helpful comments.