Welcome to the Quanta Science Podcast. Each episode, we bring you stories about developments in science and mathematics. I'm Susan Valot.
Certain grammatical rules never appear in any known language. By constructing artificial languages that have these rules, linguists can use neural networks to explore how people learn. That's next. Quanta Magazine is an editorially independent online publication supported by the Simons Foundation to enhance public understanding of science.
Learning a language can't be that hard. Every baby in the world manages to do it in a few years. Figuring out how the process works is another story. Linguists have devised elaborate theories to explain it, but recent advances in machine learning have added a new wrinkle.
When computer scientists began building the language models that power modern chatbots like ChatGPT, they set aside decades of research in linguistics, and their gamble seemed to pay off. But are their creations really learning?
Tal Linzen is a computational linguist at New York University. It's hard to know what to conclude from the behavior of those models. Even if they do something that looks like what a human does, they might be doing it for very different reasons. It's not just a matter of quibbling about definitions. If language models really are learning language, researchers may need new theories to explain how they do it.
But if the models are doing something more superficial, then perhaps machine learning has no insights to offer linguistics.
Noam Chomsky, a titan of the field of linguistics, has publicly argued for the latter view. In a scathing 2023 New York Times opinion piece, he and two co-authors laid out many arguments against language models, including one that at first sounds contradictory: language models are irrelevant to linguistics because they learn too well.
Specifically, the authors claimed that models can master impossible languages, ones governed by rules unlike those of any known human language, just as easily as possible ones. Recently, five computational linguists put Chomsky's claim to the test. They modified an English text database to generate a dozen impossible languages. Language models then had more difficulty learning these languages than ordinary English.
Their paper, titled Mission: Impossible Language Models, was awarded a best paper prize at the 2024 conference of the Association for Computational Linguistics. Adele Goldberg is a linguist at Princeton University. It was absolutely needed. People were dismissing these large language models as useless.
First, they weren't supposed to be able to learn language. That was the going statement for decades. And now it's that, well, they don't learn them the way humans learn them, or they could learn anything. This is a counterargument to the idea that they could learn anything. So I think it's a great paper. I think it's absolutely timely and important. The results suggest that language models might be useful tools after all for researchers seeking to understand the babbles of babies.
During the first half of the 20th century, most linguists were concerned with cataloging the world's languages. Then, in the late 1950s, Chomsky spearheaded an alternative approach. He drew on ideas from theoretical computer science and mathematical logic in an ambitious attempt to uncover universal structure underlying all languages.
Chomsky argued that humans must have innate mental machinery devoted specifically to language processing. That would explain many big mysteries in linguistics, including the observation that some simple grammatical rules never appear in any known language.
Chomsky reasoned that if language learning worked the same way as other kinds of learning, it wouldn't favor some grammatical rules over others. But if language really is special, this is what you'd expect: any specialized language processing system would necessarily predispose humans toward certain languages, making others impossible.
Tim Hunter is a linguist at the University of California, Los Angeles. It doesn't really make sense to say that humans are hardwired to learn certain things without saying that they're also hardwired not to learn other things. Chomsky's approach quickly became the dominant strain of theoretical linguistics research. It remained so for a half century. Then came the machine learning revolution.
Language models are based on mathematical structures called neural networks, which process data according to the connections between their constituent neurons. The strength of each connection is quantified by a number called its weight. To build a language model, researchers first choose a specific type of neural network.
They then randomly assign weights to the connections. That makes the language model spew nonsense at first. Researchers then train the model to predict, one word at a time, how sentences will continue. They do this by feeding the model large troves of text. Each time the model sees a block of text, it spits out a prediction for the next word, then compares this output to the actual text and tweaks connections between neurons to improve its predictions.
After enough tiny tweaks, it learns to generate eerily fluent sentences. Language models and humans differ in obvious ways. For one, state-of-the-art models must be trained on trillions of words, far more than any human sees in a lifetime. Even so, language models might provide a novel test case for language learning, one that sidesteps ethical constraints on experiments with human babies.
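To make that training loop concrete, here's a minimal sketch in Python using PyTorch. It's a deliberately tiny stand-in, a single embedding layer feeding a linear layer rather than a full transformer, and the toy corpus is invented for illustration. But the loop itself is the one described above: predict the next word, measure the error, and nudge the weights.

import torch
import torch.nn as nn

# A toy corpus; real models train on trillions of words.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
vocab = sorted(set(corpus))
stoi = {w: i for i, w in enumerate(vocab)}
ids = torch.tensor([stoi[w] for w in corpus])

class TinyNextWordModel(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # connection weights start out random
        self.out = nn.Linear(dim, vocab_size)       # a score for every possible next word

    def forward(self, x):
        return self.out(self.embed(x))

model = TinyNextWordModel(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

inputs, targets = ids[:-1], ids[1:]      # each word is asked to predict the word that follows it
for step in range(200):
    logits = model(inputs)               # the model's guesses for the next word
    loss = loss_fn(logits, targets)      # how far off were the guesses?
    optimizer.zero_grad()
    loss.backward()                      # work out how each connection should change
    optimizer.step()                     # apply the tiny tweaks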
Isabel Papadimitriou is a computational linguist at Harvard University and a co-author of the new paper. She tells reporter Ben Brubaker that there's no animal model of language. Language models are the first thing that we can experiment on in any kind of interventional way about the structure and nature of human language.
The fact that language models work at all is proof that something resembling language learning can happen without any of the specialized machinery Chomsky proposed. Systems based on neural networks have been wildly successful at many tasks that are totally unrelated to language processing, and their training procedure ignores everything linguists have learned about the intricate structure of sentences.
Jeff Mitchell, a computational linguist at the University of Sussex, says it's a very linear way of looking at language. This model of learning where you're just trying to learn the next word. So you're just saying, I've seen these words, what comes next? In 2020, Mitchell and Jeffrey Bowers, a psychologist at the University of Bristol, set out to study how language models' unusual way of learning would affect their ability to master impossible languages.
Inventing a new language from scratch would introduce too many uncontrolled variables. If a model was better or worse at learning the artificial language, it would be hard to pinpoint why.
Instead, Mitchell and Bowers devised a control for their experiment by manipulating an English text dataset in different ways to create three unique artificial languages governed by bizarre rules. For instance, to construct one language, they split every English sentence in two at a random position and flipped the order of the words in the second part. Mitchell and Bowers started with four identical copies of an untrained language model.
They then trained each one on a different dataset: the three impossible languages and unmodified English. Finally, they gave each model a grammar test involving new sentences from the language it was trained on. The models trained on impossible languages were unfazed by the convoluted grammar. They were nearly as accurate as the one trained on English.
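For a sense of what that manipulation looks like, here's a rough sketch of the partial-reverse transformation in Python. It's one plausible reading of the construction described above, not Mitchell and Bowers' actual code, and the example sentence is invented.

import random

def partial_reverse(sentence: str, rng: random.Random) -> str:
    # Split the sentence at a random position and reverse the word order
    # of everything after the split.
    words = sentence.split()
    if len(words) < 2:
        return sentence
    cut = rng.randint(1, len(words) - 1)
    return " ".join(words[:cut] + words[cut:][::-1])

rng = random.Random(0)
print(partial_reverse("the quick brown fox jumps over the lazy dog", rng))
# The words after the random cut point come out in reverse order.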
The impossible was possible for language models, it seemed. Chomsky and his co-authors cited these results in their 2023 article, arguing that language models were inherently incapable of distinguishing between possible languages and even the most cartoonishly impossible ones. So that was it. Case closed, right?
Julie Kallini wasn't so sure. It was August of 2023, and she'd just started graduate school in computer science at Stanford University. Chomsky's critiques of language models came up often in informal discussions among her fellow students. But when Kallini looked into the literature, she realized there'd been no empirical work on impossible languages since Mitchell and Bowers' paper three years earlier.
She found the paper fascinating, but thought Chomsky's sweeping claim required more evidence. It was supposed to apply to all language models, but Mitchell and Bowers had only tested an older type of neural network that's less popular today. To Kallini, the mission was obvious: test Chomsky's claim with modern models.
Kallini met with her advisor, Christopher Potts, and proposed a thorough study of impossible language acquisition in so-called transformer networks, which are at the heart of today's leading language models. Potts initially thought it sounded too ambitious. I remember being a little bit discouraging because it seemed like we would have to train a huge number of language models and also sort out a bunch of hard conceptual questions.
And so I kind of felt like for the first project in your PhD, maybe for the whole PhD, this could be the project. Julie was pretty relentless about it. That's actually Kallini laughing in that recording. So Kallini and Potts agreed that she would take charge of training the models. But first, they had to work out which specific transformer models to test and which languages to study.
For that, they roped in Papadimitriou and two other computational linguists, Richard Futrell at the University of California, Irvine, and Kyle Mahowald at the University of Texas at Austin. The team decided to use relatively small transformer networks modeled after GPT-2, a 2019 predecessor of the language model that powers ChatGPT.
Smaller networks need less training data, so they're a little more human-like. Perhaps they'd also resemble humans by favoring possible languages over impossible ones?
Kallini soon learned that not everyone thought so. Her peers in Stanford's computer science department were hardly machine learning skeptics, but many still came down on Chomsky's side in the impossible language debate. Just talking to other computer science students at Stanford while this work was in progress, lots of people were betting that the transformer can just learn anything.
Anything. So the team constructed a dozen impossible languages, most of them based on different procedures for shuffling words within each sentence of an ordinary English dataset. In one extreme case, the shuffling was random, but in all the others, it followed a simple pattern. For example, dividing each sentence into groups of three adjacent words and swapping the second and third words in each group.
They also included the partial reverse language that Mitchell and Bowers had studied, as well as a full reverse language which they generated by reversing every sentence in the training data.
Their last language, dubbed "word hop," was the closest to ordinary English. It differed only in how it marked whether a verb was singular or plural. Instead of using a suffix, like the "s" in "runs," it used a special character placed four words after the verb.
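Here are similarly rough sketches of two of those manipulations, the three-word swap and word hop. These are toy versions for illustration only: the marker character is a stand-in, the verb position is supplied by hand, and the paper's actual preprocessing operates on properly tokenized and annotated text.

def local_swap(sentence: str) -> str:
    # In each group of three adjacent words, swap the second and third.
    words = sentence.split()
    for i in range(0, len(words) - 2, 3):
        words[i + 1], words[i + 2] = words[i + 2], words[i + 1]
    return " ".join(words)

def word_hop(sentence: str, verb_index: int, marker: str = "§") -> str:
    # Toy version of word hop: strip the verb's "s" suffix and put a marker
    # token four positions after the verb instead. verb_index must point at
    # a third-person singular verb like "runs".
    words = sentence.split()
    words[verb_index] = words[verb_index].removesuffix("s")
    words.insert(min(verb_index + 4, len(words)), marker)
    return " ".join(words)

print(local_swap("the cat quietly sat on the mat today"))
# -> "the quietly cat sat the on mat today"
print(word_hop("she runs to the store every single morning", verb_index=1))
# -> "she run to the store § every single morning"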
The team was especially curious to see how models handled word hop, since it was inspired by classic examples from the linguistics literature. Here's Hunter. In any sort of general computational terms, it doesn't seem like there's anything particularly complicated about saying, "Put this word four words downstream from this one," that kind of thing. That seems like exactly the kind of thing which you might expect to show up in a human language, if completely general computational principles were the only guiding factors. And yet, we look at language after language after language, and we don't see any of those. No human language seems to follow that kind of pattern. All of the impossible languages disrupted the linguistic structure of English to varying degrees. But apart from the random shuffle, they all communicated the same information, at least in a specific theoretical sense.
Here's Futrell. In theory, in principle, an omnipowerful predictor would have no more difficulty with the impossible language than the possible one. Kallini and her colleagues started with multiple copies of a transformer network and trained each one on a different language. They'd periodically pause the training procedure to test each model's word prediction chops. The models all got better over time.
Even in the extreme case of random shuffling, the model could still learn that "the" is a more common word than "impossible." But the model trained on unaltered English text learned much faster and performed better at the end than all the others, with one exception.
The model trained on word hop, which replaces certain verb suffixes with a special character four words away, fared approximately as well. That wasn't surprising. After all, the subtle distinction between this language and ordinary English doesn't matter for most word predictions. But when they compared the models trained on these two languages using a test designed to pinpoint that distinction, they saw a clear difference.
Once again, the impossible language was much harder for the model to master. It was a classic plot twist. Language models weren't so omnipotent after all. The results show that language models, like humans, prefer to learn some linguistic patterns over others. Their preferences bear some resemblance to human preferences, but they're not necessarily identical. It's still possible that aspects of Chomsky's theories play a role in how humans learn.
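The word-prediction checks described here are typically scored with a measure called perplexity: lower numbers mean better predictions on held-out text. Below is a rough sketch of that kind of measurement, using an off-the-shelf GPT-2 model from the Hugging Face transformers library as a stand-in for the much smaller models trained in the study; the example sentence is invented, not drawn from the paper's test sets.

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels equal to the inputs, the model reports its average
        # next-token prediction error over the whole sequence.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

# Researchers compare scores like this across models and across minimally
# different sentences to see which patterns a model has actually learned.
print(perplexity("The cats on the table run across the room."))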
Human brains and neural networks are each so complicated that understanding how they differ, especially when it comes to a task as subtle as language learning, can seem hopeless. The paper's title, Mission: Impossible Language Models, is fitting in more than one way.
But like action heroes, researchers have a habit of accepting seemingly impossible missions and finding creative ways to make progress. Kallini and her co-authors pinpointed a simple principle called information locality that explains why their models found some of the impossible languages harder than others. That principle might also be relevant to human language acquisition. Their results have already prompted several concrete proposals for follow-up studies.
Ryan Nefdt is a philosopher of cognitive science at the University of Cape Town in South Africa. That's what I really like about the Kallini et al. paper, because what they do is they take this highly theoretical claim that comes from the theoretical linguistics literature, with all of its baggage, and they investigate its parts in as neutral a way as they can by teasing apart this continuum. That's, I think, useful outside of the paper. Even if you don't believe the results, or you question the results or the methodology, I think that the impossible languages continuum is super useful.
So I think in that way it's just so fruitful, because it opens up so many different avenues and questions. One promising approach is to study how impossible language learning depends on the details of neural network design. The negative results from Mitchell and Bowers' earlier experiments already indicate that different kinds of networks can have very different behavior.
Language model researchers typically refine their models by tweaking the underlying networks and seeing which tweaks make the models better at learning ordinary languages. It may be more fruitful to instead search for tweaks that make models even worse at learning impossible ones. Here's Potts. And that would be a fascinating project. It's kind of what we're doing for Mission Impossible 2. Like many sequels, that second mission will also feature a subplot inspired by Hunter's response to the team's results.
He proposed comparing word hop to a new artificial language that he suspects will give networks more trouble, even though it's more like real languages.
Hunter remains most sympathetic to the Chomskyan approach to linguistics, but he's glad that claims about language learning in neural networks are being tested directly. I would love to see more research trying to do exactly these kinds of experiments. I think the question it's tackling is spot on. Kallini and her colleagues hope that their results also inspire other researchers to study impossible languages. It's a rich field with enough material for many more missions.
♪
Arlene Santana helped with this episode. I'm Susan Valot. For more on this story, read Ben Brubaker's full article, Can AI Models Show Us How People Learn? Impossible Languages Point a Way, on our website, quantamagazine.org. Make sure to tell your friends about the Quanta Science Podcast and give us a positive review or follow where you listen. It helps people find this podcast. From PRX.