Corpus Cosmology

The generative theory of language (see the previous post for details) is mathematically sound and intellectually appealing. What’s more, it’s well suited for computer processing: for many generative grammars, it’s relatively easy to write a computer program that analyzes or produces texts that match that particular grammar.
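To give a feel for how little code this takes, here is a minimal Python sketch. The toy grammar, its vocabulary and the function names are my own assumptions for illustration, not something taken from the generative literature, and the grammar is deliberately non-recursive so that all of its sentences can be listed.

```python
from itertools import product

# Hypothetical toy grammar: each symbol maps to a list of alternatives,
# and each alternative is a sequence of further symbols or plain words.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "N": [["linguist"], ["corpus"]],
    "V": [["reads"], ["sleeps"]],
}

def expand(symbol):
    """Return every word sequence a symbol can produce (non-recursive grammars only)."""
    if symbol not in GRAMMAR:  # not a rule name, so it is an actual word
        return [[symbol]]
    results = []
    for alternative in GRAMMAR[symbol]:
        # Expand each part, then combine the expansions in every possible way.
        parts = [expand(part) for part in alternative]
        for combo in product(*parts):
            results.append([word for piece in combo for word in piece])
    return results

def sentences():
    """Produce all sentences of the grammar as strings."""
    return [" ".join(words) for words in expand("S")]

def matches(text):
    """Analyze a text in the crudest way: is it one of the grammar's sentences?"""
    return text in sentences()
```

Calling `sentences()` produces texts, and `matches("the linguist reads the corpus")` analyzes one. A realistic grammar is recursive and needs a proper parser, which this sketch does not attempt.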

At the time generative linguistics was introduced (1957), its computer applications were politically motivated, too: intelligence services hoped for an automatic translation facility that would quickly help them read Russian scientific papers, for example. We know this from the ALPAC report, which brought about the temporary demise of machine translation and the ascent of machine-assisted human translation.

Besides, the theory was so sound and so appealing that generative linguists really believed that it simply had to be the way language works in the human mind. Alas, the appeal of a hypothesis or a theory doesn’t make it established science: that requires irrefutable experimental proof, through unambiguously described and repeatable observations. Now this is exactly what generative grammatical theory doesn’t have. Not only does actual human output resist being computed through generative grammars; the generative approach has also become heavily contested by other schools of study such as cognitive linguistics. Cognitive linguistics says that the ability of the mind to learn and build up language, though probably truly born with us, cannot be separated from the rest of cognitive functionality in the human brain.

Nevertheless, all theories require observable evidence, and when you say something about language – from general things like generative grammars, to more specific ones like the way Hungarian case endings are similar to, say, English prepositions – you need observable language, that is, spoken or written text, to support your statement. To draw general conclusions about language, you will need text from a sufficiently large sample of the population to prove that people actually say the things you think they say, and that they say them the way you think they say them.

This means you need to collect an immense amount of text from a formidable number of sources before you can even start your experiment. Wikipedia says that the first such experiment was done by American linguists Henry Kučera and W. Nelson Francis in the 1960s. They worked on a ‘Computational Analysis of Present-Day American English’ (such was the title of their study published in 1967), and they compiled a digital stockpile of written text called the Brown corpus, one million words in size, a huge amount of data for computing devices at the time.

In a way, creating computer programs to translate written text from one language into another, using generative grammars, is an experiment to test if the underlying theory is correct. In science, you can say there are no failed experiments: even when a theory is proved incorrect or inaccurate, the negative proof itself is solid knowledge that we can build upon. Not to mention the potential side effects that often turn out to be real treasure: Turing’s universal machine is a by-product of his work on Hilbert’s decision problem (the Entscheidungsproblem), for which he proved that no general solution exists. Yet the by-product is one of the greatest achievements of contemporary mathematics – and modern engineering – enough to make him famous and revered through the ages.


In the previous post, I wrote that generative linguistics had grown out of the structuralist movement, which later became criticized by behaviorist psychology. Now behaviorist psychology insists that one must study observable events only, and draw conclusions from them – not speculate about some sort of theory, and have your experiments twisted by your ambition to prove it true. In this spirit, corpus linguistics – the study of language through large bodies of text – is true behaviorist linguistics, while generative linguistics, no matter where its roots are, is not.

Behaviorism goes further, in fact: it claims that there are no unseen inner states in a system or an organism – all states are visible in external behavior in one way or another.

I wish it was this simple. But it’s not my place to judge one scientific approach against another – both have justification to exist. Take the two most complex – and, at the same time, most exciting – fields of physics: cosmology and quantum mechanics. Next to health care, these are probably the most publicly exposed fields of research that work along the strict rules of science. If you look at what’s happening there, you’ll see a whirling cycle of activities: after observing some natural phenomenon, researchers think up a theory that fits in mathematically with existing theories, and then they work out a reproducible experiment that they can use to prove the new theory. Every now and then, it happens that the new theory doesn’t entirely fit in with existing ones – in this case, first they check the observations, and if they are correct, they try and find where the previous theories need to be modified so that the complex theory of everything is consistent again.

Many times, physicists also look at potential consequences of their theories, and start looking for actual phenomena that the theories predict – if they find them, they get closer to proving the theory. One such hypothesis was that a strong field of gravity (around a large celestial body like the Sun) should bend the path of light, a consequence of Einstein’s theory of general relativity. The first test of this phenomenon happened in 1919, done by Arthur Eddington and his peers, and it has been repeated several times since (remember, a scientific experiment must be reproducible).

For setting up reproducible experiments, science has very powerful means called modeling and simulation. For example, you can’t go and see what happens to various particles inside a star. But you can simulate the environment in an earthly device such as a particle accelerator.

The point in every model is that you cannot be sure if it accurately matches the original thing. In astronomy, you can’t see the actual events that happen in a distant star – all you see is radiation. From the radiation itself, and your previous experience, you make an educated guess, and build your model accordingly. If your model is not accurate, you will find out when you discover another similar phenomenon that you can’t possibly replicate with what you have.

In linguistics, corpora (where ‘corpora’ is the plural of ‘corpus’) play the role of the radiation from the distant celestial body. The distant celestial body is the human mind – we can’t see the actual cognitive mechanism behind it, but we can make pretty good guesses.


In a corpus, we see a lot of text. We can tear up the text into smaller units, words mostly, and count words – and other words that occur together with them. We can use such statistical methods to find regularities about language. For example, we can discover that some verbs often go together with certain prepositions, while the same verbs don’t like other prepositions so much. Thus we can find out about phrasal verbs in English, even with little or no previous grammatical knowledge. (The construct ‘find out about’ is in fact a splendid example.)
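A minimal Python sketch of this kind of counting follows. The five-sentence corpus and the function name are hypothetical stand-ins of mine; a real study would run the same idea over millions of words and would use a proper statistical association measure rather than raw counts.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus, made up for this illustration.
CORPUS = [
    "we find out about the rules",
    "they find out what happened",
    "you look at the data",
    "we look at the text and find out more",
    "they look for patterns",
]

def count_followers(lines):
    """For every word, count which words come immediately after it."""
    followers = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        # Pair each word with its right-hand neighbor.
        for first, second in zip(words, words[1:]):
            followers[first][second] += 1
    return followers

FOLLOWERS = count_followers(CORPUS)
```

No grammatical knowledge went into this, yet `FOLLOWERS["find"]` already shows that in this tiny sample ‘find’ is always followed by ‘out’, while ‘look’ prefers ‘at’ over ‘for’.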

After finding some regularities, you set up a theory. Although the real process was more speculative than that, let’s say, for the sake of argument, that you invent the generative theory that way.

In reality, the generative theory was invented long years before the first proper – computerized – corpus was put together. Again, in reality, the theories linguists arrive at have a much narrower scope; in most cases, they don’t even model the entire text – but they usually follow the rules of a phrase structure grammar, derived from generative theory. Unless they form a probability model, an approach that is gaining a lot of popularity these days and can be completely oblivious of any grammatical theory.

To create the model from your theory, you write a computer program that uses the theory to analyze text, and run the program to process new text, that is, text that doesn’t exist in the corpus where you originally discovered the regularity you are testing.


Science never exists for its own sake, and our utilitarian world would never allow that anyway. Engineers – practitioners of natural language processing – often take a shortcut here. Even though the theory doesn’t fit all text, and the model doesn’t work correctly on all new text thrown at it, they go and create applications that actual non-scientific people (for example, translators) use in actual non-scientific work (for example, translation). When they see a theory and a model, engineers have a way to say that it will be sufficient for a purpose.

When you see a text-processing tool such as a spelling checker, or, to think of a more complex one, a machine translation service, you see such models in action. These models all try to mimic human language in one way or another. In science, we say they approximate human language. To see how good this approximation is, you need to process a lot of new text with the model, and employ human judges to count the places where the model got it right – and where it was wrong. The percentage of times your model was correct is called the accuracy of the model.
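The score described above is a plain proportion. Here is a minimal Python sketch of it; the list of verdicts and the function name are hypothetical, and real evaluations usually report more detailed measures as well.

```python
def share_correct(judgments):
    """Return the percentage of items the human judges marked as correct."""
    if not judgments:
        raise ValueError("need at least one judgment")
    # True counts as 1 and False as 0, so the sum is the number of correct items.
    return 100.0 * sum(judgments) / len(judgments)

# Hypothetical verdicts from human judges: True means the model got it right.
JUDGMENTS = [True, True, False, True, False, True, True, True]
SCORE = share_correct(JUDGMENTS)
```

With six correct items out of eight, `SCORE` comes to 75.0.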

In the previous post, I found today’s language technology researchers somewhat cynical – or fatigued – about language models. What made me think this? Well – I have reason to think that linguists still don’t have a very good idea how language works in the mind. We suspect that language is more deeply intertwined with our thinking than we originally thought – but that only makes modeling all the more difficult. The point is, researchers no longer seem to care much. Language models behind everyday text-processing tools no longer claim to model the human mind.

Scholars of artificial intelligence turned to machine learning instead. Machine learning uses very sophisticated forms of statistics to build up decision trees, association rules and other data structures that computers can use to draw conclusions from new input. Machine learning algorithms are also used to train artificial neural networks, working models of small clusters of nerve cells, similar to those in the human brain.

In a way, this approach seems more honest, because the technology you use doesn’t claim to be something it is not. Unattended machine translation, the best-known – or most wished-for? – flavor of language technology, is not artificial intelligence from the aspect of modeling human intelligence. As you would expect, today’s most-used machine translation programs are based on statistics: they guess the next two or three words in the translation, given the source-language text and the translation produced so far, using statistical models that they compute from an immense amount of source-translation text pairs (that is, a parallel corpus). There are sophisticated twists and tricks in the procedure, but on the whole, this is all there is. In certain cases, this approach produces intelligible, sometimes even acceptable results – it all depends on the original parallel corpus the machine is trained from. And up to a certain point, the more text you use to train, the better the results get – but then the improvement stops, and where it stops is still far from the equivalent of human output.
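To make the guessing step concrete, here is a heavily simplified Python sketch. Every number in it is invented for the example (a real system would estimate them from a parallel corpus), the German-English word pairs and all names are my own assumptions, and it goes greedily one word at a time, whereas real systems work with phrases, reordering and a search over many alternatives.

```python
# Hypothetical figures: how likely each target word is as a translation
# of a given source word.
TRANSLATION = {
    "das": {"the": 0.7, "that": 0.3},
    "haus": {"house": 0.8, "home": 0.2},
    "ist": {"is": 1.0},
    "alt": {"old": 0.9, "ancient": 0.1},
}

# Hypothetical figures: how likely a target word is to follow the previous
# target word ("<s>" marks the start of the sentence).
FOLLOWS = {
    ("<s>", "the"): 0.5, ("<s>", "that"): 0.2,
    ("the", "house"): 0.4, ("the", "home"): 0.1,
    ("house", "is"): 0.6,
    ("is", "old"): 0.3, ("is", "ancient"): 0.05,
}

UNSEEN = 0.01  # small fallback score for word pairs that were never observed

def translate(source_words):
    """Pick, word by word, the candidate that best fits source and context."""
    output = []
    previous = "<s>"
    for word in source_words:
        candidates = TRANSLATION[word]
        # Score = fit to the source word times fit to the translation so far.
        best = max(
            candidates,
            key=lambda t: candidates[t] * FOLLOWS.get((previous, t), UNSEEN),
        )
        output.append(best)
        previous = best
    return output
```

With these made-up numbers, `translate(["das", "haus", "ist", "alt"])` picks ‘the house is old’. Nothing in the procedure knows what a house is; it only multiplies frequencies.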

They (for example, Ray Kurzweil of Google) say there is such a huge amount of text and other data collected over the decades, making machine-learning algorithms so efficient that eventually, they will evolve some kind of intelligence. But as far as I know, there was no claim that this intelligence would be modeled after the human mind.

In fact, Kurzweil says more: that our civilization has already spawned a hybrid human-machine intelligence. While this might be true on some level (digital gadgets can augment our abilities), it’s completely different from intelligence and volition together in an autonomous being (or in technical equipment) that can decide and act on its own, without human instruction.

In a way, I think this is deeply twisted: it’s as if we gave up on the wish to learn how our own mind works, and, in pursuit of short-term returns, traded this wish for creating something that is not human, that we don’t understand and can’t control. This seems a very Faustian deal to me, and I’ll have to go on about it another day, in another post.

And by the way, this also looks like we are abandoning the very sort of rationalism that scientists profess to cherish – in exchange for a different sort of rationalism claiming that the human mind is not much more complex than the algorithms we use today to approximate it.


The downside of corpus linguistics is this: it assumes that text corpora are something they are not; that they represent human communication. They don’t: they contain the encoded signal that passes between speaker and listener, writer and reader. This is the same debate as the one that goes on between generative and cognitive linguistics: how much can language be separated from the mind? Or you could ask, how much can the signal be treated independently from the speaker and the listener?

The most established model of language communication doesn’t accept this separation. From Saussure’s model of the speech circuit, we linguists all learned that, when emitting a signal (speaking or writing), the speaker assumes some previous information (knowledge) on the listener’s part. In turn, the listener will have previous information, which is almost always – ever so slightly – different from what the speaker assumes. The result of the communication – the understanding – is the sum of the listener’s previous information and the signal received. As you can imagine, the previous information is several orders of magnitude larger than the actual signal we can observe.

When you look at text from a corpus (such as the one accumulated in Google’s search servers), you don’t see two crucial pieces of information: the listener’s previous knowledge, and the speaker’s assumption of the same knowledge. We don’t know how such knowledge is represented in the mind, and we have absolutely no idea how to represent it in a computer. The so-called computational semantics models are approximations, just like the grammatical regularities we can detect.

It seems a far-fetched idea to look for the inner workings of the mind in the text it produces. But it’s not much more far-fetched than looking for the theory of life, the universe and everything in radiation emitted by stars and other objects millions of light-years away. I can understand why this dampens the spirits of some researchers, so much so that they start seeking easier paths, especially when they are hard pressed to produce return on investment. People do this in so many different walks of life, too. But maybe, just maybe, we should first listen to the universe, to our own minds – before we rush forward to replicate something we didn’t care to understand well enough.

Don’t get me wrong: I’m all for profiting from whatever language technology we have – I was trained as an engineer originally, and I’ve been doing this my entire life. But I also say that short- or mid-term profit is not the ultimate purpose of language technology research, or, for that matter, any research. Luckily, we will always have dreamers who look further than this, and pursue the maximum of knowledge we can gather about the universe and ourselves.
