With the massive amount of information available on the Internet today, people are finding novel ways to use data to improve our lives and deepen our understanding of ourselves. A recent piece from the BBC highlights the impact Big Data is having on our world. However, most of this Big Data covers only real-time information or data from the very recent past, ignoring the vast sources passed down to us primarily as archival materials. With enough historical data, we can form a nuanced and much more complete view of the past, and in this way 'Big Data' can change how we see ourselves in historical perspective. In fields like historical linguistics, the ability to search large corpora has been groundbreaking, but most languages and historical periods are scarcely represented in databases. The CoHD project hopes to address this by compiling and then using 'Big Data' to do 'Big Research' on language in the past.
[Image: TEI markup for 14th- and 15th-century texts.]
We want to harness the advantages that computers can offer us as historical linguists. To that end, we encode the texts in the corpus in XML following the Text Encoding Initiative (TEI) guidelines, a standard for complex document-oriented data like historical texts. After transcription, we annotate the texts with linguistic information using GATE, or the General Architecture for Text Engineering (whew, I'm tired after typing all that).
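To give a feel for what that buys us, here is a minimal sketch in Python of reading word-level annotation back out of a TEI file. Everything in it is an assumption for illustration: the sample document, its made-up tokens, the tag values, and the helper name `extract_tokens` are hypothetical and are not taken from the CoHD corpus or its actual encoding scheme. It only relies on the standard TEI namespace and the `<w>` element with `lemma` and `pos` attributes.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; the "tei" prefix is just a local alias.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document: invented tokens and tags, not corpus data.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Hypothetical sample text</title></titleStmt>
      <publicationStmt><p>Unpublished illustration</p></publicationStmt>
      <sourceDesc><p>Invented text, not from the corpus</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <w lemma="de" pos="DET">die</w>
        <w lemma="man" pos="N">man</w>
        <w lemma="komen" pos="V">quam</w>
      </p>
    </body>
  </text>
</TEI>"""


def extract_tokens(tei_xml):
    """Return (form, lemma, pos) for every <w> element in a TEI string."""
    root = ET.fromstring(tei_xml)
    return [
        (w.text, w.get("lemma"), w.get("pos"))
        for w in root.iterfind(".//tei:w", TEI_NS)
    ]


tokens = extract_tokens(SAMPLE)
for form, lemma, pos in tokens:
    print(form, lemma, pos)
```

Once every text is encoded this way, the same few lines can run over the whole corpus, which is exactly the kind of search that is impractical to do by hand.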
[Image: Joan Huydecoper's journal (1684).]


