Representativeness in Corpora of Literary Texts: Introducing the C18P Project

  • Iris Gemeinböck, Department of English and American Studies, University of Vienna


Very few specialised corpora of literary texts are currently tailored to the needs of literary critics interested in corpus stylistic analyses of prose fiction. Many existing corpora that include literary texts were compiled for linguistic research and are often unsuitable for corpus stylistic purposes. This paper addresses three of the main problems: the lack of genre labelling for the texts, the use of extracts, and the prevalence of linguistic periodisation schemes. C18P is a corpus of prose fiction designed specifically to address these issues. It traces the early development of the novel from 1700 up to the Victorian era and can be used, for instance, to analyse the characteristic linguistic features of individual literary genres and forms. The paper introduces the design of the corpus as well as some of its potential uses.


Author Biography

Iris Gemeinböck, Department of English and American Studies - University of Vienna
University of Vienna, PhD candidate


How to Cite
GEMEINBÖCK, Iris. Representativeness in Corpora of Literary Texts: Introducing the C18P Project. MATLIT: Materialities of Literature, [S.l.], v. 4, n. 2, p. 29-48, July 2016. ISSN 2182-8830. Date accessed: 12 Dec. 2018.
Thematic Section


corpus analysis; corpus stylistics; corpus building; eighteenth century; prose fiction; representativeness