Empowering Your Writing with Corpus Tools

Apr 16, 2020

By Vera Dugartsyrenova, Ph.D., Associate Professor at HSE University and EFL instructor, who completed the School of Trainers and ran a course at the Academic Writing Center.

With the pressures to publish internationally ever so real, many find that their experiences of learning English are taking on new meaning. As they settle into their role as writers for a global audience, the urge to find that old English grammar textbook re-emerges with new vitality. So does the interest in learning the kind of discipline-specific, relevant, and natural vocabulary that would help invite rather than discourage editors’ attention to their writing.

Using corpus tools in research writing

While getting published in English may seem like a dream come true, it requires the ability to navigate the demands and discourse expectations of the international academic community. To the novice academic writer, developing this ability means becoming increasingly knowledgeable not only about what fellow researchers around the globe are writing, but also about HOW they are doing this language-wise. Computer-based corpora have emerged as an indispensable learning aid that does just that: they provide deliberate exposure to authentic language usage in target texts.

Corpora (the plural of “corpus”) can be defined as collections of samples of real-life language produced by both native and non-native speakers in oral and written contexts. These collections are stored in computer databases and often marked up with additional linguistic and sociolinguistic information known as metadata (e.g., medium, variety of English, register, genre, user gender, age, nativeness, etc.).

Thousands of language/research centers, instructors, and writers compile and use corpora for their own needs and contexts. For example, corpora of academic texts are often developed based on the following criteria:

size of the corpus (from as few as 10 to hundreds of texts)
relevance to a specific genre of research writing (a research article, a PhD dissertation, etc.)
relevance to a specific discipline(s) (computer sciences, philosophy, economics, etc.)
currency of texts (how recently the texts have been published)
choice of specific research topics
publication venue (e.g., the kind of journal where an article was published).

These specially selected texts thus have important qualities that make them good models to analyze and follow:

They reflect the rhetorical tradition representative of a certain genre (e.g., a research article) and discipline (e.g., mathematics). As such, they can give an idea of the discourse expectations for texts in this genre/discipline — what content to include in various parts of the text, how to organize the text in various sections at the sentence and paragraph levels, and what kind of vocabulary and syntax would be the most natural to use.
They reflect the current expectations of the target audience (journal editors, academic supervisors, and other types of readers).
They can help address specific language queries, such as “How frequent, if at all, is the use of ‘researches’ in international academic discourse in my field?”, “Is it OK to say ‘the research object/subject’?”, and “Does the X phrase make sense?”

With their capacity to store, arrange, and present raw language data in meaningful ways, computer-based corpora of target texts have made it possible to do a range of things in just a few clicks:

explore typical examples of actual language usage in a variety of contexts
work out rules regarding the meaning and use of language items based on your observations
check the frequency of words and phrases used in certain discourses
pick up commonly used phrases that are relevant to your needs
compile your own wordlists (of collocations, prepositional phrases, verb patterns, etc.).

Here is an example of what a corpus query looks like.

At some point, some of you may have wondered how international scholars would use the word “data”: as a singular or plural noun, and with what adjectives/participles. A simple corpus search in the Hong Kong Polytechnic University’s 2014 corpus of journal articles will generate interesting results. We get 8,318 hits — the number of times that the word “data” is used in the corpus (Figure 1). The corpus displays all these uses as a numbered list (concordance). As we can note from the first 40 concordance lines, the word “data” can feature both as a singular noun (“data that surrounds” in line 30) and a plural noun (“data do not allow us”, line 12; “data were collected”, line 37). A complete list of all uses of “data” can be accessed by clicking “All” in the upper right-hand corner of the screen (see “Show all instances”).

We also note that some authors have used “data” with a number of adjectives and participles. For example, here are some participle + “data” combinations: “labeled data” (line 2), “independently generated data” (line 4), “transferred data” (line 14), “observed data” (line 18), “existing data” (line 31). If you want to see how many times a specific word combination is used, you can make a new query by entering this combination in the search field at the end of the concordance list or on the corpus’s main page.

Figure 1. Corpus search results for “data”
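To make the idea of a concordance more concrete, here is a minimal Python sketch of a keyword-in-context listing. It is not how the Hong Kong Polytechnic University corpus works internally; the function name `concordance`, the `width` parameter, and the sample text are all invented for illustration.

```python
import re

def concordance(text, keyword, width=30):
    """Return keyword-in-context lines for every hit of `keyword` in `text`."""
    lines = []
    # \b keeps "data" from matching inside "database"; the search ignores case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in one column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text, used only to illustrate the output format.
sample = (
    "The data were collected over two years. "
    "Existing data that surrounds this topic is limited. "
    "Our database stores the observed data."
)

hits = concordance(sample, "data")
for number, line in enumerate(hits, start=1):
    print(number, line)
print(len(hits), "hits")
```

Reading down the aligned keyword column is what makes patterns such as “data were” versus “data that surrounds” easy to spot.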

Using open-access corpora

A good way to become more familiar with corpora of academic texts is to start with readily available ones. Among these, the following corpora have been extensively used and cited in academic literature:

1. the 2007 Corpus of Research Articles (CRA), an expert corpus from the Hong Kong Polytechnic University that contains 780 research articles from high-impact journals in 39 disciplines (20 papers per discipline). This is a large corpus of 5,609,407 words, which can be searched by using advanced search features: sorting papers by discipline (Anthropology, Applied Biology and Chemical Technology, Applied Linguistics, Applied Mathematics, Applied Physics, etc.) and by section (Abstracts, Introductions, Literature Reviews, Methods, Results, Discussions, Applications, Implications, Directions, Limitations, Recommendations, Conclusions, Summaries). Those wishing to learn more about the linguistic nuances of writing a research article and its sections will find this resource especially useful.

2. the 2014 Corpus of Journal Articles (CJA), a more recent selection of research articles from the same institution. CJA contains 721 new research articles in 38 disciplines and offers similar features. One drawback of both this and the above corpus is that only limited segments of separate articles (2–3 sentences) are displayed during corpus searches.

3. the Michigan Corpus of Upper-Level Student Papers (MICUSP), a learner-produced corpus that hosts 830 grade A papers in 16 disciplines written by Bachelor’s and Master’s students of the University of Michigan. The student papers fall into seven “paper types” (or genres): argumentative essay, creative writing, critique/evaluation, proposal, report, research paper, and response paper. The papers can be searched and sorted by various parameters (student’s level: senior undergraduate, 1st, 2nd, or 3rd-year Master’s student; nativeness: native or non-native English speaker; textual features: part of the text; paper type; and discipline). Complete versions of the papers are fully viewable online and can be downloaded as PDF files.

Try your hand at doing some simple corpus tasks using any of the corpora above:

Is the noun “evidence” used in the singular or plural form? What prepositions typically follow this noun? Is there a difference in how these prepositions are used?
What verbs is “this hypothesis” typically used with?
Which phrase tends to occur more frequently: “results show” or “results reveal”?
What tenses is “extensively studied” used with?
What verbs/verb phrases commonly follow “these findings”? Write down at least 10 different verbs/verb phrases.

As a follow-up, make up some queries of your own. You can search within the whole corpus or only a selected part of the corpus (e.g., the Introduction section in the corpora of journal articles).
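If you have saved some texts on your own computer, a task such as finding the verbs that follow “these findings” can also be approximated with a few lines of Python. The sketch below is only an illustration: the function name `words_after` and the sample sentences are invented, and it counts whatever word comes next without checking that it is a verb.

```python
import re
from collections import Counter

def words_after(text, phrase):
    """Count the word that immediately follows each occurrence of `phrase`."""
    pattern = re.compile(
        r"\b" + re.escape(phrase) + r"\s+([A-Za-z'-]+)", re.IGNORECASE
    )
    # findall returns only the captured follower word for each hit.
    return Counter(word.lower() for word in pattern.findall(text))

# Invented sample sentences, not taken from any of the corpora above.
sample = (
    "These findings suggest a link. These findings indicate a trend. "
    "However, these findings suggest caution. These findings are preliminary."
)

followers = words_after(sample, "these findings")
for word, count in followers.most_common():
    print(word, count)
```

Sorting the followers by frequency gives a quick wordlist to check against the concordance lines themselves.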

Creating your own corpora

Instead of using readily available corpora, many may find it more useful to compile their own corpora for personal needs (genre of research text, disciplinary focus, etc.). This can be done with the help of corpus management tools, such as AntConc (freely available), WordSmith Tools (paid), and MonoConc Pro (paid). These computer applications allow users to upload their own selections of texts, combine these into corpora, and run various kinds of analysis on texts within the self-compiled corpora. Being freely available and easy to use, AntConc (developed by Professor Laurence Anthony) seems the most popular choice.

Here are a few steps to get you ready to create a corpus with AntConc:

Find and select the texts you want to include in your corpus (journal articles in your field or their parts, research proposals, abstracts, etc.).
Convert all files to plain-text (.txt) format (the application only works with this file type).
Save the files you want to appear in the same corpus in one folder (directory) on your computer. These files can also be organized into subfolders (for example, selections of texts by discipline) to generate sub-corpora.
Download the application from here and install it on your computer.
Run the application and select “Open File(s)” from the “File” menu to load the files.
Select the files from your folder/subfolder(s) and click “Open”.
Start your searches in your own corpus! Use the “Search Term” field to enter keywords for corpus search.
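For readers who prefer scripting, the folder layout described in these steps (one folder of .txt files, with optional subfolders as sub-corpora) can also be read directly in Python. This is a rough sketch and not a replacement for AntConc: the function name `load_corpus`, the “(root)” label, and the two sample files created in a temporary folder are assumptions made for the example.

```python
from pathlib import Path
import tempfile

def load_corpus(root):
    """Map each sub-corpus name (subfolder) to a dict of {file name: text}."""
    corpus = {}
    for path in sorted(Path(root).rglob("*.txt")):
        # Files directly in the root folder go to a sub-corpus called "(root)".
        relative_parent = path.parent.relative_to(root)
        name = str(relative_parent) if relative_parent.parts else "(root)"
        corpus.setdefault(name, {})[path.name] = path.read_text(encoding="utf-8")
    return corpus

# Build a tiny invented corpus in a temporary folder so the sketch runs as is.
with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    (root / "economics").mkdir()
    (root / "linguistics").mkdir()
    (root / "economics" / "paper1.txt").write_text(
        "This study examines prices.", encoding="utf-8"
    )
    (root / "linguistics" / "paper2.txt").write_text(
        "This study explores corpora.", encoding="utf-8"
    )
    corpus = load_corpus(root)

for name, files in corpus.items():
    words = sum(len(text.split()) for text in files.values())
    print(name, len(files), "file(s),", words, "words")
```

To use it on your own texts, you would call `load_corpus` with the path to your corpus folder instead of the temporary one.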

Imagine that you have just created your own corpus of journal papers from five target fields. Let’s find out how often the authors would use “this study explores” as opposed to “this study examines” to introduce the focus of their studies (Figure 2).

Figure 2. Corpus search results for “this study explores”

The results for the first query suggest that the phrase is used six times (6 concordance hits, against 11 hits for “this study examines”). The column on the right (under “File” in the top menu) indicates the names of the files that mention the phrase. Of the six instances, two came from one and the same paper (by Doocy et al., 2016). To see the context in which the phrase was used in a particular text, click on the phrase in the concordance line. This will open a new window under “File View” (see the top menu), where the phrase will be highlighted. More detailed instructions on how to use AntConc can be found in this manual from Laurence Anthony’s website. Alternatively, you can watch this short video on how to start using AntConc or a more detailed one here.
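The same kind of comparison, with hits broken down by file, can be sketched in Python as follows. The file names and sentences are invented and do not reproduce the corpus shown in Figure 2; `phrase_hits` is a hypothetical helper, not an AntConc feature.

```python
import re
from collections import Counter

def phrase_hits(files, phrase):
    """Count hits of `phrase` per file; `files` maps file names to their text."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    counts = Counter()
    for name, text in files.items():
        n = len(pattern.findall(text))
        if n:  # keep only files that actually contain the phrase
            counts[name] = n
    return counts

# Invented mini-corpus: three "papers" held in memory.
files = {
    "paper_a.txt": "This study explores X. Later, this study explores Y.",
    "paper_b.txt": "This study examines Z.",
    "paper_c.txt": "This study examines W, and this study explores V.",
}

results = {}
for phrase in ("this study explores", "this study examines"):
    results[phrase] = phrase_hits(files, phrase)
    print(phrase, sum(results[phrase].values()), dict(results[phrase]))
```

The per-file counts show whether a phrase is widely used or simply repeated by one author, which is the point made above about two of the six hits coming from the same paper.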

Given their ease of design and use, and their versatile ability to embrace individual needs, corpora have appealed to both novice writers and writing gurus alike. Some may prove to be your next favorite aid on a pathway to excellence. Some may not live up to the promise. If still in doubt, try one out.

Further literature

If you are interested in learning more about the practical applications of corpora for research writing purposes, here are some recommended readings.

Journal articles on the use of corpus tools in teaching research writing

Bianchi, F., & Pazzaglia, R. (2007). Student writing of research articles in a foreign language: Metacognition and corpora. In R. Facchinetti (Ed.), Corpus linguistics 25 years on (pp. 259–287). Amsterdam: Rodopi. https://brill.com/view/title/30112

Cai, J. (2016). Exploratory study on an integrated genre-based approach for the instruction of academic lexical phrases. Journal of English for Academic Purposes, 24, 58–74.

Charles, M. (2007). Reconciling top-down and bottom-up approaches to graduate writing: Using a corpus to teach rhetorical functions. Journal of English for Academic Purposes, 6(4), 289–302.

Cotos, E., Huffman, S., & Link, S. (2017). A move/step model for methods sections: Demonstrating rigour and credibility. English for Specific Purposes, 46, 90–106.

Diani, G. (2012). Text and corpus work, EAP writing and language learners. In R. Tang (Ed.), Academic writing in a second or foreign language (pp. 45–66). London: Continuum.

Dugartsyrenova, V. A. (2020). Supporting genre instruction with an online academic writing tutor: Insights from novice L2 writers. Journal of English for Academic Purposes, 100830.

Eriksson, A. (2012). Pedagogical perspectives on bundles: Teaching bundles to doctoral students of biochemistry. In J. Thomas & A. Boulton (Eds.), Input, process and products: Developments in teaching and language corpora (pp. 195–211). Brno, Czech Republic: Masaryk University Press.

Flowerdew, L. (2015). Using corpus-based research and online academic corpora to inform writing of the discussion section of a thesis. Journal of English for Academic Purposes, 20, 58–68.

Lee, D., & Swales, J. (2006). A corpus-based EAP course for NNS doctoral students: Moving from available specialized corpora to self-compiled corpora. English for Specific Purposes, 25, 56–75.

Journal articles on the integration of corpus tools in online writing platforms

Chang, C.-F., & Kuo, C.-H. (2011). A corpus-based approach to online materials development for writing research articles. English for Specific Purposes, 30, 222–234.

Kuo, C.-H. (2008). Designing an online writing system: Learning with support. RELC Journal, 39(3), 285–299.

Lin, C.-C., Liu, G.-Z., & Wang, T.-I. (2017). Development and usability test of an e-Learning tool for engineering graduates to develop academic writing in English: A case study. Educational Technology & Society, 20(4), 148–161.

Lo, H.-Y., Liu, G.-Z., & Wang, T.-I. (2014). Learning how to write effectively for academic journals: A case study investigating the design and development of a genre-based writing tutorial system. Computers & Education, 78, 250–267.

Sun, Y.-C. (2007). Learner perceptions of a concordancing tool for academic writing. Computer Assisted Language Learning, 20(4), 323–343.


Written by HSE Academic Writing Center
