Monday, February 11, 2013

Big Data, meet Historical Dutch

Welcome to the CoHD project, or the Corpus of Historical Dutch. We are four linguists on three continents, and we're working together to record, sort, and investigate the Dutch language as far back as the written record allows. As one man said, we're "making the most of bad data." Our plan is to compile a set of digital corpus-building tools for transcribing archival documents, and then to overlay the linguistic data from the resulting corpus with detailed sociohistorical and demographic data. In this blog, we aim to chronicle the methods we use, the mistakes we make, and the new things we discover along the way. And as we have time, we plan to cover interesting historical and demographic topics relevant to our project and the field of (socio)historical linguistics.

With the massive amount of information available on the Internet today, people are beginning to find novel ways to use data to improve our lives and deepen our understanding of ourselves. This recent piece from the BBC highlights the impact Big Data is having on our world today. However, most of this Big Data only encompasses real-time information or data from the very recent past, ignoring the vast sources passed down primarily in the form of archival materials. With enough historical data, we can form a nuanced and much more complete view of the past, and in this way, 'Big Data' can impact how we view ourselves within a historical perspective. In fields like historical linguistics, being able to search large corpora has been groundbreaking, but most languages and historical time periods are scarcely represented by databases. The CoHD project hopes to address this by compiling and then utilizing 'Big Data' to do 'Big Research' on language in the past.

TEI markup language for 14th and 15th century texts.
To do this we need data... Gobs and gobs of data. The more data the better - in all its most complex and grotesque forms (and yes, data is a singular noun). Accordingly, one of the primary goals of the project is to amass large quantities of linguistic data from Dutch speakers as far back as we can go. First and foremost, this means expanding the amount of data currently available to historical linguists, which in the case of Dutch has traditionally too often been limited to published works alone. What's more, many studies were undertaken with few tools for doing quantitative analyses using the power of computers. 

We want to harness the advantages that computers can offer us as historical linguists. To that end, we encode texts in the corpus in XML following the Text Encoding Initiative (TEI) guidelines, a standard for complex document-oriented data like historical texts. After transcription, we annotate the texts with linguistic information using GATE, or the General Architecture for Text Engineering (whew, I'm tired after typing all that).
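For readers who have never seen TEI, here is a minimal sketch in Python of what reading a TEI-encoded transcription might look like. It is not our actual pipeline: the sample document, the placeholder Dutch sentence, and the helper names `expanded_text` and `paragraphs` are all invented for illustration. The only things taken from the standard are the TEI namespace and the `<choice>`, `<abbr>`, and `<expan>` elements, which TEI uses to record an abbreviation alongside its expansion.

```python
import xml.etree.ElementTree as ET

# The official TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample: a placeholder sentence, NOT an archival transcription.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Placeholder document</title></titleStmt>
      <publicationStmt><p>Unpublished sketch</p></publicationStmt>
      <sourceDesc><p>Invented sample, not an archival source</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Dit is een <choice><abbr>vb.</abbr><expan>voorbeeld</expan></choice></p>
    </body>
  </text>
</TEI>"""


def expanded_text(elem):
    """Return the text of elem, using <expan> wherever a <choice> appears."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == f"{{{TEI_NS}}}choice":
            expan = child.find(f"{{{TEI_NS}}}expan")
            parts.append(expan.text if expan is not None and expan.text else "")
        else:
            parts.append(expanded_text(child))
        # Text that follows the child element inside the parent.
        parts.append(child.tail or "")
    return "".join(parts)


def paragraphs(tei_xml):
    """Return the normalized text of every <p> in the TEI <body>."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI_NS}}}body")
    if body is None:
        return []
    # Collapse runs of whitespace so line breaks in the XML do not matter.
    return [" ".join(expanded_text(p).split()) for p in body.iter(f"{{{TEI_NS}}}p")]


if __name__ == "__main__":
    for line in paragraphs(SAMPLE):
        print(line)
```

The point of the sketch is that the markup keeps both the scribe's abbreviation and the editor's expansion, so a search tool can choose which reading to use without losing the other.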

Joan Huydecoper's journal (1684)
Most importantly, we're building a web-based tool for taking pictures of documents found in the many public archives in the Netherlands. These archival materials represent a completely new set of data for Dutch historical research. Moreover, with such a corpus, we'll be able to take various text types into account, which helps round out our view of the language in different contexts. In concert with some of the groundbreaking work that social historians have been doing, we can then trace linguistic changes alongside demographic shifts in the past. We have already seen positive results in this last area. For example, in our first peek at a small set of pilot data, we found that case loss tended to follow massive in-migration into two cities: Leiden and Brugge. This has already shown us that with a large corpus of historical Dutch at our disposal, we can make significant and fundamental contributions to our understanding of Dutch as it developed from the Middle Ages to the end of the Early Modern Period.
