Representativeness in Corpora of Literary Texts: Introducing the C18P Project
Very few specialised corpora of literary texts are currently tailored to the needs of literary critics interested in corpus stylistic analyses of prose fiction. Many existing corpora that include literary texts were compiled for linguistic research and are often unsuitable for corpus stylistic purposes. This paper addresses three of the main problems: the absence of labelling of texts for literary genre, the use of extracts, and the prevalence of linguistic periodisation schemes. C18P is a corpus of prose fiction designed specifically to address these issues. It traces the early development of the novel from 1700 up to the Victorian era. It can, for instance, be used to analyse the characteristic linguistic features of individual literary genres and forms. The paper introduces the design of the corpus as well as some of its potential uses.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
MATLIT embraces online publishing and open access to back issues. Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0), which allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal. The article can be quoted but not changed or presented differently.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
- CC licensing information in a machine-readable format is embedded in all articles published by MATLIT.