Description: Professor Mark Davies of Linguistics creates a 2-billion-word Corpus del Español and a 1-billion-word Corpus do Português. Check them out at http://corpus.byu.edu.
- Start: May 1, 2015
- End: December 31, 2017
- Sponsor: National Endowment for the Humanities
- Principal Investigator: Mark Davies
- Website: http://corpus.byu.edu
What is a corpus? A corpus (plural: corpora) is a collection of words, texts, or other works. You can think of it almost like a giant database of words and phrases. Fifteen years ago, Professor Davies received a grant from the NEH to create corpora for two languages: one for Spanish and one for Portuguese. Although these corpora were leading resources in their time, they focused on historical texts dating back to the 1200s and have since become outdated. Professor Davies has been awarded a new NEH grant to expand the Spanish and Portuguese corpora, focusing more on current language use and the comparison of dialects. The updated Spanish corpus has 2 billion words, 100 times as much data as the original. The Portuguese corpus has grown by a factor of 50 to include 1 billion words.
Together, the two corpora include data from 25 countries. The Portuguese corpus draws from Brazil, Portugal, Angola, and Mozambique, while the Spanish corpus includes data from 21 Spanish-speaking countries, including Mexico, Colombia, Spain, and Chile. Both corpora were recently completed and published at http://corpus.byu.edu.
The new corpora produced by Professor Davies are anticipated to be the largest Spanish and Portuguese corpora ever created. They improve on the corpora created over a decade ago in three main ways. First, the corpora are substantially larger. Such large corpora are needed because researchers often need large amounts of data to make linguistic claims and to study language patterns. Second, the updated corpora draw on contemporary sources rather than historical data. Finally, the updates allow researchers, teachers, and other users to directly compare dialects across different countries.
The Corpus del Español and the Corpus do Português are only two of the corpora created by Professor Davies. Several English-based corpora are also hosted on the corpus website, some of which also have 1-2 billion words and are updated daily. Professor Davies has received several grants in the past to create and improve his corpora, including grants from the National Science Foundation and the National Endowment for the Humanities.
The corpora located at http://corpus.byu.edu are presently used by at least 130,000 individuals each month, and all corpora on the website are free to the public. The new Spanish and Portuguese corpora are geared toward students and teachers, giving them the ability to analyze various dialects in each language. Professor Davies and his team will continue to work on the Corpus del Español and the Corpus do Português to make improvements and fix any bugs. The corpora allow users to search the most commonly used words in each language, analyze nearby words and the contexts in which words are used, and study usage by genre or country. This project has aided and will continue to aid linguists, teachers, students, and other researchers in their work.
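For readers curious what these kinds of queries look like in practice, here is a minimal sketch in Python. It is not how the corpora at corpus.byu.edu are implemented. The three-sentence toy corpus and the function names (`word_frequencies`, `collocates`, `frequency_by_country`) are hypothetical and exist only to illustrate word frequency, nearby words, and comparison by country.

```python
from collections import Counter

# Hypothetical toy corpus of (country, text) pairs.
# This is illustrative data, not drawn from the real corpora.
TOY_CORPUS = [
    ("Mexico", "el carro rojo es nuevo"),
    ("Spain", "el coche rojo es nuevo"),
    ("Mexico", "mi carro es viejo"),
]


def word_frequencies(corpus):
    """Count how often each word appears across all texts."""
    counts = Counter()
    for _country, text in corpus:
        counts.update(text.lower().split())
    return counts


def collocates(corpus, target, window=2):
    """Count words that appear within `window` tokens of `target`."""
    counts = Counter()
    for _country, text in corpus:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                start = max(0, i - window)
                nearby = tokens[start:i] + tokens[i + 1:i + 1 + window]
                counts.update(nearby)
    return counts


def frequency_by_country(corpus, target):
    """Count occurrences of `target` separately for each country."""
    counts = Counter()
    for country, text in corpus:
        counts[country] += text.lower().split().count(target)
    return counts


if __name__ == "__main__":
    print(word_frequencies(TOY_CORPUS).most_common(3))
    print(collocates(TOY_CORPUS, "carro", window=1))
    print(frequency_by_country(TOY_CORPUS, "carro"))
```

In this toy data, "carro" appears only in the texts labeled Mexico and "coche" only in the text labeled Spain, which is the kind of dialect contrast a country-by-country comparison is meant to surface. A real corpus would also need proper tokenization, lemmatization, and indexing to handle billions of words.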