Corpus is a word with many meanings. It comes from the Latin word for 'body', which could also mean a 'collection of facts or things'. It carries much the same meanings today: the Oxford English Dictionary defines it in three ways. The first is 'the body of a man or animal'. The second is 'a structure of a special character or function in the animal body'. The last is 'A body or complete collection of writings or the like; the whole body of literature on any subject'. From that last definition comes 'the body of written or spoken material upon which a linguistic analysis is based'. That is the corpus I wish to write about, as it is a fascinating subject.
A linguistic corpus is an information bank (usually digital) holding millions of words from one language. There are a number of corpora (the plural of corpus) around the world, each built to collect the words of a given language so that they can be analyzed. English-language corpora include the American National Corpus, the Oxford English Corpus and the British National Corpus. There are also corpora for many other languages. Two examples are the Russian National Corpus and the Hanshari Corpus.
The British National Corpus is one of the largest and best-known corpora of the English language. Work on it began in the early 1990s in a collaboration that included Oxford University and the British Library. It holds over 100 million words that can be searched by computer, which lets researchers analyze the various linguistic nuances of the English language. It contains both written and spoken English.
It is mind-boggling to think that a single collection holds 100 million words of British English. That figure counts every occurrence of every word in the collected texts, so common words are counted many times over, and it also takes in words rarely spoken today. Storing all of these words in one place is a great way to better understand our language and the way we speak. Digitizing the corpora makes the information even easier to access. Researchers can easily pull up multiple words at once to analyze, and a computer can compare words far faster than a human can.
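To give a feel for what a computer does with a corpus, here is a minimal sketch in Python that counts words in a tiny text. The sample sentences and the function name `count_words` are invented for illustration; a real corpus such as the British National Corpus comes with its own tools and far more careful rules for splitting text into words.

```python
import re
from collections import Counter

def count_words(text):
    """Return a Counter mapping each lowercased word to how often it occurs."""
    # Simplifying assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# A made-up three-sentence "corpus" (hypothetical sample text).
sample = "The cat sat on the mat. The dog sat too. A cat is not a dog."

counts = count_words(sample)
total_words = sum(counts.values())   # every occurrence counted (running words)
distinct_words = len(counts)         # each different word counted once

print(total_words, distinct_words)
print(counts.most_common(3))
```

Even in this tiny sample the number of running words is larger than the number of different words, which is why the size of a corpus is not the same thing as the size of a language's vocabulary.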
The corpora hold not only written words but also spoken ones. Spoken corpora are fascinating. Groups of volunteers are recorded speaking words or even having casual conversations with each other. It is interesting to think that in several hundred years, future societies might be able to listen to how we talked. Future generations would be able to study our speech patterns and structure. This is something we can't do with ancient languages today, but it is exciting to think we can provide it for future scientists.
There are corpora that are specially designated for ancient languages. Their archives contain the surviving written texts of languages like Ancient Greek or Egyptian. As with corpora for modern languages, researchers can easily compare words. For instance, if an unknown word doesn't make sense in one piece of text, it may make sense in another text on a completely different topic.
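Looking up one word across several texts can be sketched in a few lines of Python. This is only an illustration: the two short "texts" are invented English stand-ins for ancient documents, and `find_in_context` is a hypothetical helper, not part of any real corpus software.

```python
import re

def find_in_context(texts, word, window=3):
    """Return (title, snippet) pairs showing `word` with up to `window` words either side."""
    results = []
    for title, text in texts.items():
        tokens = re.findall(r"\w+", text.lower())
        for i, token in enumerate(tokens):
            if token == word.lower():
                start = max(0, i - window)
                snippet = " ".join(tokens[start:i + window + 1])
                results.append((title, snippet))
    return results

# Invented sample texts in which the puzzling word "garum" appears.
texts = {
    "letter": "Send two jars of garum with the next ship",
    "recipe": "Season the fish with garum, a sauce made from salted fish",
}

hits = find_in_context(texts, "garum")
for title, snippet in hits:
    print(title, "->", snippet)
```

In the letter the word is hard to pin down, but the recipe shows it next to "sauce" and "fish", which is the kind of clue a researcher hopes to find by comparing texts.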
1. Oxford English Dictionary. "Corpus". 2012. http://www.oed.com/view/Entry/41873?redirectedFrom=corpus#eid. Accessed November 21 2012.
2. Wikipedia. "Text Corpus". Last modified November 10 2012. http://en.wikipedia.org/wiki/Text_corpus. Accessed November 21 2012.
3. British National Corpus. "About the BNC". 2010. http://www.natcorp.ox.ac.uk. Accessed November 21 2012.
4. Online Etymology Dictionary. "Corpus". 2012. http://www.etymonline.com/index.php?term=corpus. Accessed November 21 2012.