Part One. Early Corpus Linguistics and the Chomskyan Revolution.

Early Corpus Linguistics

"Early corpus linguistics" is a term we use here to describe linguistics before the advent of Chomsky. Field linguists, for example Boas (1940) who studied American-Indian languages, and later linguists of the structuralist tradition all used a corpus-based methodology. However, that does not mean that the term "corpus linguistics" was used in texts and studies from this era. Below is a brief overview of some interesting corpus-based studies predating 1950.

Language acquisition

The studies of child language in the diary studies period of language acquisition research (roughly 1876-1926) were based on carefully composed parental diaries recording the child's locutions. These primitive corpora are still used as sources of normative data in language acquisition research today, e.g. Ingram (1978). Corpus collection continued and diversified after the diary studies period: large sample studies covered the period roughly from 1927 to 1957, in which data were gathered from a large number of children with the express aim of establishing norms of development. Longitudinal studies have been dominant from 1957 to the present - again based on collections of utterances, but this time from a smaller sample of children (typically around three) who are studied over long periods of time (e.g. Brown (1973) and Bloom (1970)).

Spelling conventions

Kading (1897) used a large corpus of German - 11 million words - to collate frequency distributions of letters and sequences of letters in German. The corpus, by size alone, is impressive for its time, and compares favourably in terms of size with modern corpora.

Language pedagogy

Fries and Traver (1940) and Bongers (1947) are examples of linguists who used the corpus in research on foreign language pedagogy. Indeed, as noted by Kennedy (1992), the corpus and second language pedagogy had a strong link in the early half of the twentieth century, with vocabulary lists for foreign learners often being derived from corpora. The word counts derived from such studies as Thorndike (1921) and Palmer (1933) were important in defining the goals of the vocabulary control movement in second language pedagogy.

Other examples

Comparative linguistics, and syntax and semantics can be read about in Chapter 1, page 3 of "Corpus Linguistics".


Chomsky

Chomsky changed the direction of linguistics away from empiricism and towards rationalism in a remarkably short space of time. In doing so he apparently invalidated the corpus as a source of evidence in linguistic enquiry. Chomsky suggested that the corpus could never be a useful tool for the linguist, as the linguist must seek to model language competence rather than performance.

Competence is best described as our tacit, internalised knowledge of a language.

Performance is external evidence of language competence, and is usage on particular occasions when, crucially, factors other than our linguistic competence may affect its form.

Competence both explains and characterises a speaker's knowledge of a language. Performance, however, is a poor mirror of competence. For example, factors as diverse as short-term memory limitations or whether or not we have been drinking can alter how we speak on any particular occasion. This brings us to the nub of Chomsky's initial criticism: a corpus is by its very nature a collection of externalised utterances - it is performance data and is therefore a poor guide to modelling linguistic competence.

Further to that, if we are unable to measure linguistic competence, how do we determine from any given utterance what are linguistically relevant performance phenomena? This is a crucial question, for without an answer to this, we are not sure that what we are discovering is directly relevant to linguistics. We may easily be commenting on the effects of drink on speech production without knowing it.

However, this was not the only criticism that Chomsky had of the early corpus linguistics approach.




The non-finite nature of language

All the work of early corpus linguistics was underpinned by two fundamental, yet flawed assumptions:
  • The sentences of a natural language are finite.
  • The sentences of a natural language can be collected and enumerated.
The corpus was seen as the sole source of evidence in the formation of linguistic theory - "This was when linguists...regarded the corpus as the sole explicandum of linguistics" (Leech, 1991).

To be fair, not all linguists at the time made such bullish statements - Harris (1951) is probably the most enthusiastic exponent of this point, while Hockett (1948) did make weaker claims for the corpus, suggesting that the purpose of the linguist working in the structuralist tradition "is not simply to account for utterances which comprise his corpus" but rather to "account for utterances which are not in his corpus at a given time."

The number of sentences in a natural language is not merely arbitrarily large - it is potentially infinite. This is because of the sheer number of choices, both lexical and syntactic, which are made in the production of a sentence. Also, sentences can be recursive. Consider the sentence "The man that the cat saw that the dog ate that the man knew that the..." This type of construct is referred to as centre embedding and can give rise to sentences of potentially infinite length. (This topic is discussed in further detail in "Corpus Linguistics" Chapter 1, pages 7-8).

The only way to account for the grammar of a language is by describing its rules - not by enumerating its sentences. It is the syntactic rules of a language that Chomsky considers finite, and these rules in turn give rise to infinite numbers of sentences.


The value of introspection

Even if language were a finite construct, would corpus methodology still be the best method of studying language? Why bother waiting for the sentences of a language to enumerate themselves, when by the process of introspection we can delve into our own minds and examine our own linguistic competence? At times intuition can save us time in searching a corpus.

Without recourse to introspective judgements, how can ungrammatical utterances be distinguished from ones that simply haven't occurred yet? If our finite corpus does not contain the sentence:

*He shines Tony books

how do we conclude that it is ungrammatical? Indeed, there may be persuasive evidence in the corpus to suggest that it is grammatical if we see sentences such as:

He gives Tony books
He lends Tony books
He owes Tony books

Introspection seems a useful tool for cases such as this, yet early corpus linguistics denied its use.

Also, ambiguous structures can only be identified and resolved with some degree of introspective judgement; observation of physical form alone seems inadequate. Consider the sentences:

Tony and Fido sat down - he read a book of recipes.
Tony and Fido sat down - he ate a can of dog food.

It is only with introspection that this pair of ambiguous sentences can be resolved e.g. we know that Fido is the name of a dog and it was therefore Fido who ate the dog food, and Tony who read the book.



Other criticisms of corpus linguistics

Apart from Chomsky's theoretical criticisms, there were problems of practicality with corpus linguistics. Abercrombie (1963) summed up the corpus-based approach as being composed of "pseudo-procedures". Can you imagine searching through an 11-million-word corpus such as that of Kading (1897) using nothing more than your eyes? The whole undertaking becomes prohibitively time consuming, not to say error-prone and expensive.

Whatever Chomsky's criticisms were, Abercrombie's were undoubtedly correct. Early corpus linguistics required data processing abilities that were simply not available at that time.

The impact of the criticisms levelled at early corpus linguistics in the 1950s was immediate and profound. Corpus linguistics was largely abandoned during this period, although it never totally died.


Chomsky re-examined

Although Chomsky's criticisms did discredit corpus linguistics, they did not stop all corpus-based work. For example, in the field of phonetics, naturally observed data remained the dominant source of evidence, with introspective judgements never making the impact they did on other areas of linguistic enquiry. Also, in the field of language acquisition the observation of naturally occurring evidence remained dominant. Introspective judgements are not available to the linguist/psychologist who is studying child language acquisition - try asking an eighteen-month-old child whether the word "moo-cow" is a noun or a verb! Introspective judgements are only available to us when our meta-linguistic awareness has developed, and there is no evidence that a child at the one-word stage has meta-linguistic awareness. Even Chomsky (1964) cautioned against the rejection of performance data as a source of evidence for language acquisition studies.


Benefits of corpus data

  1. Leech (1992) argues that the corpus is a more powerful methodology from the point of view of the scientific method, as it is open to objective verification of results.

  2. Is language production really a poor reflection of language competence, as Chomsky argued? Labov (1969) showed that "the great majority of utterances in all contexts are grammatical". We are not saying that all sentences in a corpus are grammatically acceptable, but it seems probable that Chomsky's (1968: 88) claim that performance data is 'degenerate' is an exaggeration (see Ingram 1989: 223 for further criticisms of this view).

  3. Quantitative data is of use to linguistics. For example, Svartvik's (1966) study of passivisation used quantitative data extracted from a corpus. Elsewhere, all successful approaches to automated part-of-speech analysis rely on quantitative data from corpora. The proof of the pudding is in the eating.

  4. Abercrombie's observations that corpus research is time-consuming, expensive and error-prone are no longer applicable thanks to the development of powerful computers and software able to perform complex calculations in seconds, without error.

The revival of corpus linguistics

It is a common belief that corpus linguistics was abandoned entirely in the 1950s, and then adopted once more almost as suddenly in the early 1980s. This is simply untrue, and does a disservice to those linguists who continued to pioneer corpus-based work during this interregnum.

For example, Quirk (1960) planned and executed the construction of his ambitious Survey of English Usage (SEU) which he began in 1961. In the same year, Francis and Kucera began work on the now famous Brown corpus, a work which was to take almost two decades to complete. These researchers were in a minority, but they were not universally regarded as peculiar and others followed their lead. In 1975 Jan Svartvik started to build on the work of the SEU and the Brown corpus to construct the London-Lund corpus.

During this period the computer slowly started to become the mainstay of corpus linguistics. Svartvik computerised the SEU, and as a consequence produced what some, including Leech (1991), still believe to be "to this day an unmatched resource for studying spoken English".

The availability of the computerised corpus and the wider availability of institutional and private computing facilities do seem to have provided a spur to the revival of corpus linguistics. The table below (from Johansson, 1991) shows how corpus linguistics grew during the latter half of the twentieth century.

Date         Studies
To 1965      10
1966-1970    20
1971-1975    30
1976-1980    80
1981-1985    160
1985-1991    320



The machine readable corpus

The term corpus is almost synonymous with the term machine-readable corpus. Interest in the computer for the corpus linguist comes from the ability of the computer to carry out various processes which, when required of humans, ensured that they could only be described as pseudo-techniques. The type of analysis that Kading waited years for can now be achieved in a few moments on a desktop computer.

Processes

Considering the marriage of machine and corpus, it seems worthwhile to consider in slightly more detail what these processes that allow the machine to aid the linguist are. The computer has the ability to search for a particular word, sequence of words, or perhaps even a part of speech in a text. So if we are interested, say, in the usages of the word however in the text, we can simply ask the machine to search for this word in the text. The computer's ability to retrieve all examples of this word, usually in context, is a further aid to the linguist.

The machine can find the relevant text and display it to the user. It can also calculate the number of occurrences of the word so that information on the frequency of the word may be gathered. We may then be interested in sorting the data in some way - for example, alphabetically on words appearing to the right or left. We may even sort the list by searching for words occurring in the immediate context of the word. We may take our initial list of examples of however presented in context (usually referred to as a concordance), and extract from this another list, say of all the examples of however followed closely by the word we, or followed by a punctuation mark.

The processes described above are often included in a concordance program. This is the tool most often implemented in corpus linguistics to examine corpora. Whatever philosophical advantages we may eventually see in a corpus, it is the computer which allows us to exploit corpora on a large scale with speed and accuracy.
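To make these processes concrete, here is a minimal sketch in Python of what such a concordance program might do. It is only a toy illustration, not any particular package: it finds a search word in a tokenised text, reports its frequency, and prints each hit in context (a simple KWIC display) sorted on the right-hand context.

def concordance(tokens, word, window=4):
    """Return the frequency of `word` and a list of KWIC context lines."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left:>30} [{token}] {right}")
    return len(hits), hits

text = "However , the results were clear . The method , however , was slow ."
freq, lines = concordance(text.split(), "however")
print(f"'however' occurs {freq} times")
for line in sorted(lines, key=lambda l: l.split("] ")[1].lower()):  # sort on right context
    print(line)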




Goals and conclusion

In this section we have
  • seen the failure of early corpus linguistics
  • examined Chomsky's criticisms
  • seen the failings of introspective data
  • seen how corpus linguistics was revived
In the remaining sections we will see -

  • how corpus linguists study syntactic features (Section 2)
  • how corpus linguistics balances enumeration with introspection (Section 3)

  • how corpora can be used in language studies (Section 4)
http://www.lancs.ac.uk/fss/courses/ling/corpus/Corpus1/1FRA1.HTM


Part Two. What is a Corpus, and What is in it?

Definition of a corpus

The concept of carrying out research on written or spoken texts is not restricted to corpus linguistics. Indeed, individual texts are often used for many kinds of literary and linguistic analysis - the stylistic analysis of a poem, or a conversation analysis of a TV talk show. However, the notion of a corpus as the basis for a form of empirical linguistics is different from the examination of single texts in several fundamental ways.

In principle, any collection of more than one text can be called a corpus, (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus" when used in the context of modern linguistics tends most frequently to have more specific connotations than this simple definition. The following list describes the four main characteristics of the modern corpus.



Sampling and Representativeness

Often in linguistics we are not merely interested in an individual text or author, but a whole variety of language. In such cases we have two options for data collection:
  • We could analyse every single utterance in that variety - however, this option is impracticable except in a few cases, for example with a dead language which only has a few texts. Usually, however, analysing every utterance would be an unending and impossible task.
  • We could construct a smaller sample of that variety. This is a more realistic option.
As discussed in lecture 1, one of Chomsky's criticisms of the corpus approach was that language is infinite - therefore, any corpus would be skewed. In other words, some utterances would be excluded because they are rare, others which are much more common might be excluded by chance, and extremely rare utterances might also be included several times. Although nowadays modern computer technology allows us to collect much larger corpora than those that Chomsky was thinking about, his criticisms still must be taken seriously. This does not mean that we should abandon corpus linguistics, but instead try to establish ways in which a less biased and more representative corpus may be constructed.

We are therefore interested in creating a corpus which is maximally representative of the variety under examination, that is, which provides us with as accurate a picture as possible of the tendencies of that variety, as well as their proportions. What we are looking for is a broad range of authors and genres which, when taken together, may be considered to "average out" and provide a reasonably accurate picture of the entire language population in which we are interested.



Finite Size

The term "corpus" also implies a body of text of finite size, for example, 1,000,000 words. This is not universally so - for example, at Birmingham University, John Sinclair's COBUILD team have been engaged in the construction and analysis of a monitor corpus. This "collection of texts" as Sinclair's team prefer to call them, is an open-ended entity - texts are constantly being added to it, so it gets bigger and bigger. Monitor corpora are of interest to lexicographers who can trawl a stream of new texts looking for the occurence of new words, or for changing meanings of old words. Their main advantages are:
  • They are not static - new texts can always be added, unlike the synchronic "snapshot" provided by finite corpora.
  • Their scope - they provide for a large and broad sample of language.
Their main disadvantage is:
  • They are not such a reliable source of quantitative data (as opposed to qualitative data) because they are constantly changing in size and are less rigorously sampled than finite corpora.

With the exception of monitor corpora, it should be noted that it is more often the case that a corpus consists of a finite number of words. Usually this figure is determined at the beginning of a corpus-building project. For example, the Brown Corpus contains 1,000,000 running words of text. Unlike the monitor corpus, when a corpus reaches its grand total of words, collection stops and the corpus is not increased in size. (An exception is the London-Lund corpus, which was increased in the mid-1970s to cover a wider variety of genres.)



Machine-readable form

Nowadays the term "corpus" nearly always implies the additional feature "machine-readable". This was not always the case as in the past the word "corpus" was only used in reference to printed text.

Today few corpora are available in book form - one which does exist in this way is "A Corpus of English Conversation" (Svartvik and Quirk 1980) which represents the "original" London-Lund corpus. Corpus data (not excluding context-free frequency lists) is occasionally available in other forms of media. For example, a complete key-word-in-context concordance of the LOB corpus is available on microfiche, and with spoken corpora copies of the actual recordings are sometimes available - this is the case with the Lancaster/IBM Spoken English Corpus but not with the London-Lund corpus.

Machine-readable corpora possess the following advantages over written or spoken formats:

  • They can be searched and manipulated at speed. (This is something which we covered at the end of Part One).
  • They can easily be enriched with extra information. (We will examine this in detail later.)

If you haven't already done so you can now read about other characteristics of the modern corpus.



A standard reference

There is often a tacit understanding that a corpus constitutes a standard reference for the language variety that it represents. This presupposes that it will be widely available to other researchers, which is indeed the case with many corpora - e.g. the Brown Corpus, the LOB corpus and the London-Lund corpus.

  • One advantage of a widely available corpus is that it provides a yardstick by which successive studies can be measured. So long as the methodology is made clear, new results on related topics can be directly compared with already published results without the need for re-computation.
  • Also, a standard corpus means that a continuous base of data is being used. This implies that any variation between studies is less likely to be attributed to differences in the data and more to the adequacy of the assumptions and methodology contained in the study.


Text Encoding and Annotation

A corpus is said to be unannotated if it appears in its existing raw state of plain text, whereas an annotated corpus has been enhanced with various types of linguistic information. Unsurprisingly, the utility of the corpus is increased when it has been annotated: it is no longer a body of text where linguistic information is only implicitly present, but one which may be considered a repository of linguistic information. The implicit information has been made explicit through the process of annotation.

For example, the form "gives" contains the implicit part-of-speech information "third person singular present tense verb" but it is only retrieved in normal reading by recourse to our pre-existing knowledge of the grammar of English. However, in an annotated corpus the form "gives" might appear as "gives_VVZ", with the code VVZ indicating that it is a third person singular present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker and easier to retrieve and analyse information about the language contained in the corpus.

Leech (1993) describes 7 maxims which should apply in the annotation of text corpora.



Textual and extra-textual information

The most basic type of additional information is that which tells us what text or texts we are looking at. A computer file name may give us a clue to what the file contains, but in many cases filenames can only provide us with a tiny amount of information.

Information about the nature of the text can often consist of much more than a title and an author. Click here for an example of a document header.

These information fields provide the document with a whole document header which can be used by retrieval programs to search and sort on particular variables. For example, we might only be interested in looking at texts in a corpus that were written by women, so we could ask a computer program to retrieve texts where the author's gender variable is equal to "FEMALE".
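A hedged sketch of this kind of retrieval is given below. The header fields and values are invented for illustration - real corpora define their own header schemes - but the principle of filtering texts on a header variable is the same.

documents = [
    {"title": "Text A", "author_gender": "FEMALE", "genre": "fiction",   "text": "..."},
    {"title": "Text B", "author_gender": "MALE",   "genre": "reportage", "text": "..."},
    {"title": "Text C", "author_gender": "FEMALE", "genre": "reportage", "text": "..."},
]

# Retrieve only the texts whose header records a female author.
female_authored = [d for d in documents if d["author_gender"] == "FEMALE"]
print([d["title"] for d in female_authored])   # ['Text A', 'Text C']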




Orthography

It might be thought that converting a written or spoken text into machine-readable form is a relatively simple typing or optical scanning task, but even with a basic machine-readable text, issues of encoding are vital, although to English speakers their extent may not be apparent at first.

In languages other than English, accented characters and non-Roman alphabets such as Greek, Russian and Japanese present a problem. IBM-compatible computers are capable of handling accented characters, but many other mainframe computers are unable to do this. Therefore, for maximum interchangeability, accented characters need to be encoded in other ways. Various strategies have been adopted by native speakers of languages which contain accents when using computers or typewriters which lack these characters. For example, French speakers may omit the accent entirely, writing Hélène as Helene. To handle the umlaut, German speakers either introduce an extra letter "e" or place a double quote mark before the relevant letter, so Frühling would become Fruehling or Fr"uhling. However, these strategies cause additional problems - in the case of the French, information is lost, while in the German extraneous information is added.

In response to this the TEI has suggested that these characters are encoded as TEI entities, using the delimiting characters & and ;. Thus, ü would be encoded by the TEI as &uumlaut;.

Read about the handling of non-Roman alphabets and the transcription of spoken data in Corpus Linguistics, chapter 2, pages 34-36.
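A small sketch of this entity-encoding strategy is given below. The entity names follow the one quoted in the text plus two additions chosen for illustration; they are not presented as a complete or authoritative TEI entity table.

# Replace accented characters with named entities delimited by "&" and ";".
entity_table = {
    "ü": "&uumlaut;",   # entity name as used in the text above
    "é": "&eacute;",    # illustrative additions
    "è": "&egrave;",
}

def encode_entities(text, table=entity_table):
    for char, entity in table.items():
        text = text.replace(char, entity)
    return text

print(encode_entities("Frühling"))   # Fr&uumlaut;hling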


Types of annotation

Certain kinds of linguistic annotation, which involve the attachment of special codes to words in order to indicate particular features, are often known as "tagging" rather than annotation, and the codes which are assigned to features are known as "tags". These terms will be used in the sections which follow:

Multilingual Corpora

Not all corpora are monolingual, and an increasing amount of work is being carried out on the building of multilingual corpora, which contain texts of several different languages.

First we must make a distinction between two types of multilingual corpora: the first can really be described as small collections of individual monolingual corpora, in the sense that the same procedures and categories are used for each language, but each contains completely different texts in those several languages. For example, the Aarhus corpus of Danish, French and English contract law consists of a set of three monolingual law corpora; it is not made up of translations of the same texts.

The second type of multilingual corpora (and the one which receives the most attention) is parallel corpora. This refers to corpora which hold the same texts in more than one language. The parallel corpus dates back to mediaeval times when "polyglot bibles" were produced which contained the biblical texts side by side in Hebrew, Latin and Greek etc.

A parallel corpus is not immediately user-friendly. For the corpus to be useful it is necessary to identify which sentences in the sub-corpora are translations of each other, and which words are translations of each other. A corpus which shows these identifications is known as an aligned corpus as it makes an explicit link between the elements which are mutual translations of each other. For example, in a corpus the sentences "Das Buch ist auf dem Tisch" and "The book is on the table" might be aligned to one another. At a further level, specific words might be aligned, e.g. "Das" with "The". This is not always a simple process, however, as often one word in one language might be equal to two words in another language, e.g. the German word "raucht" would be equivalent to "is smoking" in English.
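As a sketch of what alignment information might look like in practice, the structure below stores each sentence pair together with explicit word links, including the one-to-many case of "raucht" and "is smoking". The representation is an illustrative assumption, not the format of any particular aligned corpus.

aligned_sentences = [
    {
        "de": "Das Buch ist auf dem Tisch".split(),
        "en": "The book is on the table".split(),
        # word alignment as (German index, English index) pairs
        "links": [(0, 0), (1, 1), (2, 2), (3, 3), (5, 5)],
    },
    {
        "de": "Er raucht".split(),
        "en": "He is smoking".split(),
        # one German word aligned to two English words
        "links": [(0, 0), (1, 1), (1, 2)],
    },
]

for pair in aligned_sentences:
    for i, j in pair["links"]:
        print(pair["de"][i], "<->", pair["en"][j])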

At present there are few cases of annotated parallel corpora, and those which exist tend to be bilingual rather than multilingual. However, two EU-funded projects (CRATER and MULTEXT) are aiming to produce genuinely multilingual parallel corpora. The Canadian Hansard corpus is annotated, and contains parallel texts in French and English, but it only covers a restricted range of text types (proceedings of the Canadian Parliament). However, this is an area of growth, and the situation is likely to change dramatically in the near future.

Click here to see an example of a bilingual corpus




Conclusion

In this section we have -

  • seen what the term "corpus" entails

  • learnt about the standards of representing information in texts

  • learnt about headers and orthography

  • learnt about the types of annotation a corpus can be given

  • seen how a corpus can be bilingual or multilingual
http://www.lancs.ac.uk/fss/courses/ling/corpus/Corpus2/2FRA1.HTM



Part Three. Quantitative Data.

Introduction

In this session we'll be looking at the techniques used to carry out corpus analysis. We'll re-examine Chomsky's argument that corpus linguistics will result in skewed data, and see the procedures used to ensure that a representative sample is obtained. We'll also be looking at the relationship between quantitative and qualitative research. Although the majority of this session is concerned with statistical procedures which can be said to be quantitative, it is important not to ignore the importance of qualitative analyses.

With the statistical part of this session two points should be made.

  • First, that this section is of necessity incomplete. Space precludes the coverage of all of the techniques which can be used on corpus data.
  • Second, we do not aim here to provide a "step-by-step" guide to statistics. Many of the techniques used are very complex and to explain the mathematics in full would require a separate session for each one. Other books, notably Language and Computers and Statistics for Corpus Linguistics (Oakes, M. - forthcoming) present these methods in more detail than we can give here.

Tony McEnery, Andrew Wilson, Paul Baker.


Qualitative vs Quantitative analysis

Corpus analysis can be broadly categorised as consisting of qualitative and quantitative analysis. In this section we'll look at both types and see the pros and cons associated with each. You should bear in mind that these two types of data analysis form different, but not necessarily incompatible, perspectives on corpus data.

Qualitative analysis: Richness and Precision.

The aim of qualitative analysis is a complete, detailed description. No attempt is made to assign frequencies to the linguistic features which are identified in the data, and rare phenomena receive (or should receive) the same amount of attention as more frequent phenomena. Qualitative analysis allows for fine distinctions to be drawn because it is not necessary to shoehorn the data into a finite number of classifications. Ambiguities, which are inherent in human language, can be recognised in the analysis. For example, the word "red" could be used in a corpus to signify the colour red, or as a political categorisation (e.g. socialism or communism). In a qualitative analysis both senses of red in the phrase "the red flag" could be recognised.

The main disadvantage of qualitative approaches to corpus analysis is that their findings cannot be extended to wider populations with the same degree of certainty that quantitative analyses can. This is because the findings of the research are not tested to discover whether they are statistically significant or due to chance.

Quantitative analysis: Statistically reliable and generalisable results.

In quantitative research we classify features, count them, and even construct more complex statistical models in an attempt to explain what is observed. Findings can be generalised to a larger population, and direct comparisons can be made between two corpora, so long as valid sampling and significance techniques have been used. Thus, quantitative analysis allows us to discover which phenomena are likely to be genuine reflections of the behaviour of a language or variety, and which are merely chance occurrences. The more basic task of just looking at a single language variety allows one to get a precise picture of the frequency and rarity of particular phenomena, and thus their relative normality or abnormality.

However, the picture of the data which emerges from quantitative analysis is less rich than that obtained from qualitative analysis. For statistical purposes, classifications have to be of the hard-and-fast (so-called "Aristotelian") type: an item either belongs to class x or it doesn't. So in the above example about the phrase "the red flag" we would have to decide whether to classify "red" as "politics" or "colour". As can be seen, many linguistic terms and phenomena do not belong to simple, single categories: rather they are more consistent with the recent notion of "fuzzy sets", as in the red example. Quantitative analysis is therefore an idealisation of the data in some cases. Also, quantitative analysis tends to sideline rare occurrences. To ensure that certain statistical tests (such as chi-squared) provide reliable results, it is essential that minimum frequencies are obtained - meaning that categories may have to be collapsed into one another, resulting in a loss of data richness.

A recent trend

From this brief discussion it can be appreciated that both qualitative and quantitative analyses have something to contribute to corpus study. There has been a recent move in social science towards multi-method approaches which tend to reject the narrow analytical paradigms in favour of the breadth of information which the use of more than one method may provide. In any case, as Schmied (1993) notes, a stage of qualitative research is often a precursor for quantitative analysis, since before linguistic phenomena can be classified and counted, the categories for classification must first be identified. Schmied demonstrates that corpus linguistics could benefit as much as any field from multi-method research.


Corpus Representativeness

As we saw in Session One, Chomsky criticised corpus data as being only a small sample of a large and potentially infinite population, and argued that it would therefore be skewed and hence unrepresentative of the population as a whole. This is a valid criticism, and it applies not just to corpus linguistics but to any form of scientific investigation which is based on sampling. However, the picture is not as drastic as it first appears, as there are many safeguards which may be applied in sampling to ensure maximum representativeness.

First, it must be noted that at the time of Chomsky's criticisms, corpus collection and analysis was a long and painstaking task, carried out by hand, with the result that the finished corpus had to be of a manageable size for hand analysis. Although size is not a guarantee of representativeness, it does enter significantly into the factors which must be considered in the production of a maximally representative corpus. Thus, Chomsky's criticisms were at least partly true of those early corpora. However, today we have powerful computers which can store and manipulate many millions of words, and the issue of size is no longer the problem that it used to be.

Random sampling techniques are standard to many areas of science and social science, and these same techniques are also used in corpus building. But there are additional caveats which the corpus builder must be aware of.

Biber (1993) emphasises that we need to define as clearly as possible the limits of the population which we wish to study before we can define sampling procedures for it. This means that we must rigorously define our sampling frame - the entire population of texts from which we take our samples. One way to do this is to use a comprehensive bibliographical index - this was the approach taken by the builders of the Lancaster-Oslo/Bergen corpus, who used the British National Bibliography and Willing's Press Guide as their indices. Another approach is to define the sampling frame as all the books and periodicals in a particular library which refer to your particular area of interest - for example, all the German-language books in Lancaster University library that were published in 1993. This approach is one which was used in building the Brown corpus.

Read about a different kind of approach, which was used in collecting the spoken parts of the British National Corpus, in Corpus Linguistics, chapter 3, page 65.

Biber (1993) also points out the advantage of determining beforehand the hierarchical structure (or strata) of the population. This refers to defining the different genres, channels etc. that it is made up of. For example, written German could be made up of genres such as:

  • newspaper reporting
  • romantic fiction
  • legal statutes
  • scientific writing
  • poetry
  • and so on....

Stratificational sampling is never less representative than pure probabilistic sampling, and is often more representative, as it allows each individual stratum to be subjected to probabilistic sampling. However, these strata (like corpus annotation) are an act of interpretation on the part of the corpus builder, and others may argue that genres are not naturally inherent within a language. Genre groupings have a lot to do with the theoretical perspective of the linguist who is carrying out the stratification.
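A minimal sketch of stratified random sampling is given below. The genre strata, text identifiers and sampling fraction are invented for illustration; a real corpus project would define its sampling frame and proportions far more carefully.

import random

sampling_frame = {
    "newspaper reporting": [f"news_{i}" for i in range(1000)],
    "romantic fiction":    [f"fiction_{i}" for i in range(400)],
    "legal statutes":      [f"law_{i}" for i in range(200)],
    "scientific writing":  [f"science_{i}" for i in range(300)],
}

def stratified_sample(frame, fraction=0.05, seed=0):
    """Draw the same fraction of texts from each stratum independently."""
    rng = random.Random(seed)
    return {stratum: rng.sample(texts, max(1, round(len(texts) * fraction)))
            for stratum, texts in frame.items()}

for stratum, texts in stratified_sample(sampling_frame).items():
    print(stratum, len(texts))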

Read about optimal lengths and number of sample sizes, and the problems of using standard statistical equations to determine these figures in Corpus Linguistics, chapter 3, page 66.




Frequency Counts

This is the most straight-forward approach to working with quantitative data. Items are classified according to a particular scheme and an arithmetical count is made of the number of items (or tokens) within the text which belong to each classification (or type) in the scheme.

For instance, we might set up a classification scheme to look at the frequency of the four major parts of speech: noun, verb, adjective and adverb. These four classes would constitute our types. Another example involves the simple one-to-one mapping of form onto classification. In other words, we count the number of times each word appears in the corpus, resulting in a list which might look something like:


abandon: 5
abandoned: 3
abandons: 2
ability: 5
able: 28
about: 128
etc.....

More often, however, the use of a classification scheme implies a deliberate act of categorisation on the part of the investigator. Even in the case of word frequency analysis, variant forms of the same lexeme may be lemmatised before a frequency count is made. For instance, in the example above, abandon, abandons and abandoned might all be classed as the lexeme ABANDON. Very often the classification scheme used will correspond to the type of linguistic annotation which will have already been introduced into the corpus at some earlier stage (see Session 2). An example of this might be an analysis of the incidence of different parts of speech in a corpus which had already been part-of-speech tagged.
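A short sketch of such a frequency count, with and without lemmatisation, is given below. The tiny lemma table is an assumption made for the example; a real study would use a proper lemmatiser or an already-annotated corpus.

from collections import Counter

tokens = ("abandon abandoned abandons ability able able about "
          "abandoned about about").split()

# Raw word-form frequencies: a one-to-one mapping of form onto classification.
print(Counter(tokens))

# Collapse variant forms of the same lexeme before counting.
lemma_table = {"abandon": "ABANDON", "abandoned": "ABANDON", "abandons": "ABANDON"}
lemmas = [lemma_table.get(t, t.upper()) for t in tokens]
print(Counter(lemmas))   # ABANDON now groups all three variant forms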



Working with Proportions

Frequency counts are useful, but they have certain disadvantages, particularly when one wishes to compare one data set with another - for example a corpus of spoken language with a corpus of written language. Frequency counts simply give the number of occurrences of each type; they do not indicate the prevalence of a type in terms of a proportion of the total number of tokens in the text. This is not a problem when the two corpora being compared are of the same size, but when they are of different sizes frequency counts are little more than useless. The following example compares two such corpora, looking at the frequency of the word boot.

Type of corpus     Number of words     Number of instances of boot
English Spoken     50,000              50
English Written    500,000             500

A brief look at the table seems to show that boot is more frequent in written than in spoken English. However, if we calculate the frequency of occurrence of boot as a percentage of the total number of tokens in the corpus (the total size of the corpus) we get:

spoken English: 50/50,000 X 100 = 0.1%
written English: 500/500,000 X 100 = 0.1%

Looking at these figures it can be seen that the frequency of boot in our made-up example is the same (0.1%) for both the written and spoken corpora.

Even where disparity of size is not an issue, it is often better to use proportional statistics to present frequencies, since most people find them easier to understand than comparing fractions of unusual numbers like 53,000. The most basic way to calculate the ratio between the size of the sample and the number of occurrences of the type under investigation is:

ratio = number of occurrences of the type / number of tokens in the entire sample

This result can be expressed as a fraction, or more commonly as a decimal. However, if that results in an unwieldy looking small number (in the above example it would be 0.001) the ratio can then be multiplied by 100 and represented as a percentage.
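The calculation above can be written as a couple of lines of code. The figures are the made-up ones from the boot example; the helper function is simply the ratio formula with an optional conversion to a percentage.

def relative_frequency(occurrences, corpus_size, as_percentage=True):
    ratio = occurrences / corpus_size
    return ratio * 100 if as_percentage else ratio

print(relative_frequency(50, 50_000))     # spoken English: 0.1 (per cent)
print(relative_frequency(500, 500_000))   # written English: 0.1 (per cent)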


Significance Testing

Significance tests allow us to determine whether or not a finding is the result of a genuine difference between two (or more) items, or whether it is just due to chance. For example, suppose we are examining the Latin versions of the Gospel of Matthew and the Gospel of John and we are looking at how third person singular speech is represented. Specifically we want to compare how often the present tense form of the verb "to say" is used ("dicit") with how often the perfect form of the verb is used ("dixit"). A simple count of the two verb forms in each text produces the following results:

Text       No. of occurrences of dicit     No. of occurrences of dixit
Matthew    46                              107
John       118                             119

From these figures it looks as if John uses the present form ("dicit") proportionally more often than Matthew does, but to be more certain that this is not just due to coincidence, we need to perform a further calculation - the significance test.

There are several types of significance test available to the corpus linguist: the chi squared test, the [Student's] t-test, Wilcoxon's rank sum test and so on. Here we will only examine the chi-squared test as it is the most commonly used significance test in corpus linguistics. This is a non-parametric test which is easy to calculate, even without a computer statistics package, and can be used with data in 2 X 2 tables, such as the example above. However, it should be noted that the chi-squared test is unreliable where very small numbers are involved and should not therefore be used in such cases. Also, proportional data (percentages etc) can not be used with the chi-squared test.

The test compares the difference between the actual frequencies (the observed frequencies in the data) with those which one would expect if no factor other than chance had been operating (the expected frequencies). The closer these two results are to each other, the greater the probability that the observed frequencies are influenced by chance alone.

Having calculated the chi-squared value (we will omit this here and assume it has been done with a computer statistical package) we must look in a set of statistical tables to see how significant our chi-squared value is (usually this is also carried out automatically by computer). We also need one further value - the number of degrees of freedom which is simply:

(number of columns in the frequency table - 1) x (number of rows in the frequency table - 1)
In the example above this is equal to (2-1) x (2-1) = 1.

We then look at the table of chi-square values in the row for the relevant number of degrees of freedom until we find the nearest chi-square value to the one which is calculated, and read off the probability value for that column. The closer to 0 the value, the more significant the difference is - i.e. the more unlikely that it is due to chance alone. A value close to 1 means that the difference is almost certainly due to chance. In practice it is normal to assign a cut-off point which is taken to be the difference between a significant result and an "insignificant" result. This is usually taken to be 0.05: probability values of less than 0.05 are written as "p < 0.05" and reported as significant.

In our example about the use of dicit and dixit above we calculate a chi-squared value of 14.843. The table below shows the significant p values for the first 3 degrees of freedom:

Degrees of Freedom    p = 0.05    p = 0.01    p = 0.001
1                     3.84        6.63        10.83
2                     5.99        9.21        13.82
3                     7.81        11.34       16.27

The number of degrees of freedom in our example is 1, and our result is higher than 10.83 (see the final column in the table), so the probability value for this chi-square value is less than 0.001. Thus, the difference between Matthew and John can be said to be significant at p < 0.001.
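For readers who want to reproduce the figures, the sketch below recalculates the dicit/dixit comparison with the chi2_contingency function from the SciPy library; this assumes SciPy is available and is only one of many ways to run the test. Setting correction=False switches off Yates' continuity correction, which is what reproduces the chi-squared value of roughly 14.84 quoted above.

from scipy.stats import chi2_contingency

observed = [[46, 107],    # Matthew: dicit, dixit
            [118, 119]]   # John:    dicit, dixit

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi-squared = {chi2:.3f}, degrees of freedom = {dof}, p = {p:.5f}")
# chi-squared is about 14.84 with 1 degree of freedom, so p < 0.001:
# the difference between Matthew and John is unlikely to be due to chance.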

In depth: You can also read about Type I and Type II errors in the glossary.



Collocations

The idea of collocations is an important one in many areas of linguistics. Kjellmer (1991) has argued that our mental lexicon is made up not only of single words, but also of larger phraseological units, both fixed and more variable. Information about collocations is important for dictionary writing, natural language processing and language teaching. However, it is not easy to determine which co-occurrences are significant collocations, especially if one is not a native speaker of a language or language variety.

Given a text corpus it is possible to determine empirically which pairs of words have a substantial amount of "glue" between them. Two of the most commonly encountered formulae are mutual information and the Z-score. Both tests provide similar data, comparing the probability that two words occur together as a joint event (i.e. because they belong together) with the probability that they occur together simply as the result of chance. For example, the words riding and boots may occur as a joint event by reason of their belonging to the same multiword unit (riding boots), while the words formula and borrowed may simply occur because of a one-off juxtaposition and have no special relationship. For each pair of words a score is given - the higher the score, the greater the degree of collocality.
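As a hedged illustration of the idea, the sketch below computes pointwise mutual information, PMI = log2(P(x,y) / (P(x)P(y))), over adjacent word pairs in a toy token list. This is the standard mutual information formula; it is not claimed to be the exact implementation used by any particular corpus tool mentioned here.

import math
from collections import Counter

tokens = "he wore riding boots and he wore a hat and a scarf and riding boots".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(w1, w2):
    """Pointwise mutual information of an adjacent word pair."""
    p_xy = bigrams[(w1, w2)] / (n - 1)
    p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi("riding", "boots"), 2))   # about 3.0: strong "glue"
print(round(pmi("boots", "and"), 2))      # about 1.4: a much weaker association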

Mutual information and the Z-score are useful in the following ways:

  • They enable us to extract multiword units from corpus data, which can be used in lexicography and particularly specialist technical translation.

  • We can group similar collocates of words together to help to identify different senses of the word. For example, bank might collocate with words such as river, indicating the landscape sense of the word, and with words like investment indicating the financial use of the word.

  • We can discriminate the differences in usage between words which are similar. For example, Church et al (1991) looked at collocations of strong and powerful in a corpus of press reports. Although these two words have similar meanings, their mutual information scores for associations with other words revealed interesting differences. Strong collocated with northerly, showings, believer, currents, supporter and odor, while powerful collocated with words such as tool, minority, neighbour, symbol, figure, weapon and post. Such information about the delicate differences in collocation between the two words has a potentially important role, for example in helping students who learn English as a foreign language.

Read about the use of mutual information in parallel aligned corpora in Corpus Linguistics, Chapter 3, page 73.



Multiple Variables

The tests that we have looked at so far can only pick up differences between particular samples (i.e. texts and corpora) on particular variables (i.e. linguistic features), but they cannot provide a picture of the complex interrelationship of similarity and difference between a large number of samples and a large number of variables. To perform such comparisons we need to consider multivariate techniques. Those most commonly encountered in linguistic research are:
  • factor analysis
  • principal components analysis
  • multidimensional scaling
  • cluster analysis

The aim of multivariate techniques is to summarise a large set of variables in terms of a smaller set on the basis of statistical similarities between the original variables, whilst at the same time losing the minimal amount of information about their differences.

Although we will not attempt to explain the complex mathematics behind these techniques, it is worth taking time to understand the stages by which they work: All the techniques begin with a basic cross-tabulation of the variables and samples.

For factor analysis an intercorrelation matrix is then calculated from the cross-tabulation, which is used to attempt to "summarise" the similarities between the variables in terms of a smaller number of reference factors which the technique extracts. The hypothesis is that the many variables which appear in the original frequency cross-tabulation are in fact masking a smaller number of variables (the factors) which can better explain why the observed frequency differences occur.

Each variable receives a loading on each of the factors which are extracted, signifying its closeness to that factor. For example, in analysing a set of word frequencies across several texts one might find that words in a certain conceptual field (e.g. religion) received high loadings on one factor, whereas those in another field (e.g. government) loaded highly on another factor.

Follow this link for an example of factor analysis.
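The toy sketch below illustrates the stages just described, using the FactorAnalysis class from scikit-learn (an assumption about tooling; any factor analysis routine would do). The word-by-text cross-tabulation is invented so that three "religion" words and three "government" words pattern differently across six hypothetical texts, and the loading of each word on the extracted factors is printed.

import numpy as np
from sklearn.decomposition import FactorAnalysis

words = ["priest", "prayer", "temple", "minister", "parliament", "vote"]

# Cross-tabulation: one row per text, one column per word (the variables).
freq = np.array([
    [9, 8, 7, 1, 0, 1],
    [8, 9, 6, 0, 1, 0],
    [7, 7, 8, 1, 1, 1],
    [1, 0, 1, 9, 8, 7],
    [0, 1, 1, 8, 9, 8],
    [1, 1, 0, 7, 8, 9],
], dtype=float)

fa = FactorAnalysis(n_components=2, random_state=0).fit(freq)

# fa.components_ holds the loading of each word (variable) on each factor.
for word, loadings in zip(words, fa.components_.T):
    print(f"{word:<12}", " ".join(f"{value:+.2f}" for value in loadings))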

Correspondence analysis is similar to factor analysis, but it differs in the basis of its calculations.

Multidimensional scaling (MDS) also makes use of an intercorrelation matrix, which is then converted to a matrix in which the correlation coefficients are replaced with rank order values - for example, the highest correlation value receives a rank order of 1, the next highest receives a rank order of 2, and so on. MDS then attempts to plot and arrange these variables so that the more closely related items are plotted closer together than the less closely related items.

Cluster analysis involves assembling the variables into unique groups or "clusters" of similar items. A matrix is created, in a similar fashion to factor analysis (although this may be a distance matrix showing the degree of difference rather than similarity between the pairs of variables in the cross-tabulation). The matrix is then used to group the variables contained within it.

Read more about cluster analysis in Corpus Linguistics, Chapter 3, pages 76, 78 and 79.




Log-linear Models

Here we will consider a different technique which deals with the interrelationships of several variables. As linguists, we often want to go beyond the simple description of a phenomenon, and explain what it is that causes the data to behave in a particular way. A loglinear analysis allows us to take a standard frequency cross-tabulation and find out which variables seem statistically most likely to be responsible for a particular effect.

For example, let us imagine that we are interested in the factors which influence whether the word for is present or omitted from phrases of duration such as She studied [for] three years in Munich. We may hypothesise several factors which could have an effect on this, e.g. the text genre, the semantic category of the main verb and whether or not the verb is separated by an adverb from the phrase of duration. Any one of these factors might be solely responsible for the omission of for, or it might be the case that a combination of factors are culpable. Finally, all the factors working together could be responsible for the presence/omission of for. A loglinear analysis provides us with a number of models which take these points into account.

The way that we test the models in loglinear analysis is first to test the significance of associations in the most complex model - that is, the model which assumes that all of the variables are working together. Then we take away one variable at a time from the model and see whether significance is maintained in each case, until we reach the model with the lowest possible dimensions. So in the above example, we would start with a model that posited three variables (e.g. genre, verb class and adverb separation) and test the significance of a three-variable model. Then we would test each of the two-variable models (taking away one variable in each case) and finally each of the three one-variable models. The best model would be taken to be the one with the fewest variables which still retained statistical significance.

Read about variable rule analysis and probabilistic language modelling in Corpus Linguistics, Chapter 3, pages 83-84.


Conclusion

In this section we have -

  • Discussed the roles of qualitative and quantitative analysis

  • Examined the notion of a representative corpus

  • Looked at frequency counts and the importance of proportionally representative data

  • Considered statistical significance testing and looked at a number of statistical techniques that can be carried out on corpora, namely collocation measures, factor analysis and loglinear models.




Part Four. The Use of Corpora in Language Studies.

Introduction

In this session we will examine the roles which corpora may play in the study of language. The importance of corpora to language study is aligned to the importance of empirical data. Empirical data enable the linguist to make objective statements, rather than those which are subjective, or based upon the individual's own internalised cognitive perception of language. Empirical data also allow us to study language varieties, such as dialects or earlier periods of a language, for which it is not possible to take a rationalist, introspection-based approach.

It is important to note that although many linguists may use the term "corpus" to refer to any collection of texts, when it is used here it refers to a body of text which is carefully sampled to be maximally representative of the language or language variety. Corpus linguistics, proper, should be seen as a subset of the activity within an empirical approach to linguistics. Although corpus linguistics entails an empirical approach, empirical linguistics does not always entail the use of a corpus.

In the following pages we'll consider the roles which corpora use may play in a number of different fields of study related to language. We will focus on the conceptual issues of why corpus data are important to these areas, and how they can contribute to the advancement of knowledge in each, providing real examples of corpus use. In view of the huge amount of corpus-based linguistic research, the examples are necessarily selective - you can consult further reading for additional examples.

Corpora in Speech Research

A spoken corpus is important because of the following useful features:
  • It provides a broad sample of speech, extending over a wide selection of variables such as:
    • speaker gender
    • speaker age
    • speaker class
    • genre (e.g. newsreading, poetry, legal proceedings etc)
    This allows generalisations to be made about spoken language as the corpus is as wide and as representative as possible. It also allows for variations within a given spoken language to be studied.

  • It provides a sample of naturalistic speech rather than speech elicited under artificial conditions. The findings from the corpus are therefore more likely to reflect language as it is spoken in "real life", since the data is less likely to be subject to production monitoring by the speaker (such as trying to suppress a regional accent).

  • Because the (transcribed) corpus has usually been enhanced with prosodic and other annotations it is easier to carry out large scale quantitative analyses than with fresh raw data. Where more than one type of annotation has been used it is possible to study the interrelationships between say, phonetic annotations and syntactic structure.

Prosodic annotation of spoken corpora

Because much phonetic corpus annotation has been at the level of prosody, this has been the focus of most of the phonetic and phonological research in spoken corpora. This work can be divided roughly into three types:
  1. How do prosodic elements of speech relate to other linguistic levels?
  2. How does what is actually perceived and transcribed relate to the actual acoustic reality of speech?
  3. How does the typology of the text relate to the prosodic patterns in the corpus?

Read more about prosodic annotation in spoken corpora in detail in Corpus Linguistics, Chapter 4, pages 89-90.


Corpora in Lexical Studies

Empirical data was used in lexicography long before the discipline of corpus linguistics was invented. Samuel Johnson, for example, illustrated his dictionary with examples from literature, and in the 19th century the Oxford English Dictionary used citation slips to study and illustrate word usage. Corpora, however, have changed the way in which linguists can look at language.

A linguist who has access to a corpus, or other (non-representative) collection of machine readable text can call up all the examples of a word or phrase from many millions of words of text in a few seconds. Dictionaries can be produced and revised much more quickly than before, thus providing up-to-date information about language. Also, definitions can be more complete and precise since a larger number of natural examples are examined.

Follow this link for an example of the benefits of corpus linguistics in lexicography

Examples extracted from corpora can be easily organised into more meaningful groups for analysis - for example, by sorting the right-hand context of the word alphabetically so that it is possible to see all instances of a particular collocate together. Furthermore, because corpus data contains a rich amount of textual information - regional variety, author, date, genre, part-of-speech tags etc. - it is easier to tie down usages of particular words or phrases as being typical of particular regional varieties, genres and so on.

The open-ended (constantly growing) monitor corpus has its greatest role in dictionary building, as it enables lexicographers to keep on top of new words entering the language, or existing words changing their meanings, or the balance of their use according to genre etc. However, finite corpora also have an important role in lexical studies - in the area of quantification. It is possible to rapidly produce reliable frequency counts and to subdivide these counts across various dimensions according to the varieties of language in which a word is used.

Finally, the ability to call up word combinations rather than individual words, and the existence of mutual information tools which establish relationships between co-occurring words (see Session 3), mean that we can treat phrases and collocations more systematically than was previously possible. A phraseological unit may constitute a piece of technical terminology or an idiom, and collocations are important clues to specific word senses.
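
In the form most often used in collocation studies, the mutual information score compares how often two words actually co-occur with how often they would be expected to co-occur by chance: MI = log2( (f(x,y) × N) / (f(x) × f(y)) ), where N is the corpus size. The sketch below simply shows the arithmetic with invented counts.

    # Mutual information for a word pair: MI = log2( f_xy * N / (f_x * f_y) ).
    # The counts below are invented purely to illustrate the arithmetic.
    import math

    N = 1_000_000      # corpus size in tokens
    f_x = 250          # frequency of the first word (e.g. "strong")
    f_y = 180          # frequency of the second word (e.g. "tea")
    f_xy = 30          # times they co-occur within the chosen window

    mi = math.log2(f_xy * N / (f_x * f_y))
    print(f"MI = {mi:.2f}")
    # Scores well above about 3 are conventionally taken as evidence of
    # a genuine collocation rather than chance co-occurrence.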

Read about corpus-based work on morphology in Corpus Linguistics, Chapter 4, page 92.


Corpora and Grammar

Grammatical (or syntactic) studies have, along with lexical studies, been the most frequent types of research to use corpora. Corpora make a useful tool for syntactic research because of:
  • The potential for the representative quantification of a whole language variety.
  • Their role as empirical data for the testing of hypotheses derived from grammatical theory.

Many smaller-scale studies of grammar using corpora have included quantitative data analysis (for example, Schmied's 1993 study of relative clauses). There is now a greater interest in the more systematic study of grammatical frequency - for example, Oostdijk and de Haan (1994a) are aiming to analyse the frequency of the various English clause types.

Since the 1950s the rational-theory based/empiricist-descriptive division in linguistics (see Session One) has often meant that these two approaches have been viewed as separate and in competition with each other. However, there is a group of researchers who have used corpora in order to test essentially rationalist grammatical theory, rather than use it for pure description or the inductive generation of theory.

At Nijmegen University, for instance, primarily rationalist formal grammars are tested on real-life language found in computer corpora (Aarts 1991). The formal grammar is first devised by reference to introspective techniques and to existing accounts of the grammar of the language. The grammar is then loaded into a computer parser and is run over a corpus to test how far it accounts for the data in the corpus. The grammar is then modified to take account of those analyses which it missed or got wrong.
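
A toy sketch of this test-and-revise cycle is given below. It is not the Nijmegen system itself: a small hand-written context-free grammar stands in for the formal grammar, NLTK's chart parser stands in for the corpus parser, and three invented sentences stand in for the corpus.

    # Running a hand-written grammar over "corpus" sentences and measuring
    # its coverage; sentences it fails on point to rules needing revision.
    # A toy illustration only, not the Nijmegen system. Assumes NLTK.
    import nltk

    grammar = nltk.CFG.fromstring("""
        S   -> NP VP
        NP  -> Det N | N
        VP  -> V NP | V
        Det -> 'the' | 'a'
        N   -> 'linguist' | 'corpus' | 'grammar'
        V   -> 'tests' | 'parses'
    """)
    parser = nltk.ChartParser(grammar)

    sentences = [                                # stand-ins for corpus data
        "the linguist tests a grammar".split(),
        "a corpus parses".split(),
        "the grammar fails here".split(),        # contains unknown words
    ]

    covered = 0
    for sent in sentences:
        try:
            analyses = list(parser.parse(sent))
        except ValueError:                       # words the grammar does not cover
            analyses = []
        if analyses:
            covered += 1
        else:
            print("no analysis:", " ".join(sent))

    print(f"coverage: {covered}/{len(sentences)} sentences")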


Corpora and Semantics

The main contribution of corpus linguistics to semantics has been to help establish an approach to semantics which is objective and which takes account of indeterminacy and gradience. Mindt (1991) demonstrates how a corpus can be used to provide objective criteria for assigning meanings to linguistic terms. Mindt points out that, frequently in semantics, meanings of terms are described by reference to the linguist's own intuitions - the rationalist approach that we mentioned in the section on Corpora and Grammar. Mindt argues that semantic distinctions are associated in texts with characteristic observable contexts - syntactic, morphological and prosodic - and that, by considering the environments of the linguistic entities concerned, an empirical, objective indicator of a particular semantic distinction can be arrived at.

Another role of corpora in semantics has been in establishing more firmly the notions of fuzzy categories and gradience. In theoretical linguistics, categories are usually seen as being hard and fast - either an item belongs to a category or it does not. However, psychological work on categorisation suggests that cognitive categories are not usually "hard and fast" but instead have fuzzy boundaries, so it is not so much a question of whether an item belongs to one category or the other, but how often it falls into one category as opposed to the other one. In looking empirically at natural language in corpora it is clear that this "fuzzy" model accounts better for the data: clear-cut boundaries do not exist; instead there are gradients of membership which are connected with frequency of inclusion.

For examples of the above, read Corpus Linguistics, Chapter 4, pages 96-97.


Corpora in Pragmatics and Discourse Analysis

The amount of corpus-based research in pragmatics and discourse analysis has been relatively small up to now. This is partly because these fields rely on context (Myers 1991), and the small samples of texts used in corpora tend to be somewhat removed from their social and textual contexts. Sometimes relevant social information (gender, class, region) is encoded within the corpus, but it is still not always possible to infer context from corpus texts.

Much of the work that has been carried out in this area has used the London-Lund corpus, which was until recently the only truly conversational corpus. The main contribution of such research has been to the understanding of how conversation works, with respect to lexical items and phrases which have conversational functions. Stenström (1984) correlated discourse items such as well, sort of and you know with pauses in speech and showed that such correlations related to whether or not the speaker expects a response from the addressee. Another study by Stenström (1987) examined "carry-on signals" such as right, right-o and all right. These signals were classified according to the typology of their various functions, e.g.:

  • right was used in all functions, but especially as a response, to evaluate a previous response or terminate an exchange.
  • all right was used to mark a boundary between two stages in discourse.
  • that's right was used as an emphasiser.
  • it's alright and that's alright were responses to apologies.

The availability of new conversational corpora, such as the spoken part of the BNC (British National Corpus), should provide a greater incentive both to extend and to replicate such studies, since both the amount of conversational data available and the social and geographical range of the people recorded have increased. At present, issues in pragmatics have been poorly served by quantitative corpus-based analysis; hopefully this is one area which will be exploited by linguists in the near future.


Corpora and Stylistics

Stylistics researchers are usually interested in individual texts or authors rather than the more general varieties of a language and tend not to be large-scale users of corpora. Nevertheless, some stylisticians are interested in investigating broader issues such as genre, and others have found corpora to be important sources of data in their research.

In order to define an author's particular style, we must, in part, examine the degree to which the author leans towards different ways of putting things (technical vs non-technical vocabulary, long sentences vs short sentences and so on). This task requires comparisons to be made not only internally within the author's own work, but also with other authors or with the norms of the language or variety as a whole. As Leech and Short (1981) point out, stylistics often demands the use of quantification to back up judgements which may otherwise appear subjective rather than objective. This is where corpora can play a useful role.

Another type of stylistic variation is the more general variation between genres and channels - for example, one of the most common uses of corpora has been in looking at the differences between spoken and written language. Altenberg (1984) examined differences in the ordering of cause-result constructions, while Tottie (1991) looked at differences in negation strategies. Other work has looked at variation between genres, using subsamples of corpora as a database. For example, Wilson (1992) used sections from the LOB and Kolhapur corpora, the Augustan Prose Sample and a sample of modern English conversation to examine the usage of since, and found that causal since had evolved from being the main causal connective in late seventeenth-century writing to being characteristic of formal learned writing in the twentieth century.

Read about stylistic work carried out by Biber and Wikberg in Corpus Linguistics, Chapter 4, pages 102-103.


Corpora in the Teaching of Languages and Linguistics

Resources and practices in the teaching of languages and linguistics tend to reflect the division between the empirical and rationalist approaches. Many textbooks contain only invented examples and their descriptions are based upon intuition or second-hand accounts. Other books, however, are explicitly empirical and use examples and descriptions from corpora or other sources of real-life language data.

Corpus examples are important in language learning as they expose students to the kinds of sentences that they will encounter when using the language in real life situations. Students who are taught with traditional syntax textbooks which contain sentences such as Steve puts his money in the bank are often unable to analyse more complex sentences such as The government has welcomed a report by an Australian royal commission on the effects of Britain's atomic bomb testing programme in the Australian desert in the fifties and early sixties (from the Spoken English Corpus).

Apart from being a source of empirical teaching data, corpora can be used to look critically at existing language teaching materials. Kennedy (1987a, 1987b) has looked at ways of expressing quantification and frequency in ESL (English as a second language) textbooks. Holmes (1988) has examined ways of expressing doubt and certainty in ESL textbooks, while Mindt (1992) has looked at future time expressions in German textbooks of English. These studies have similar methodologies: they analyse the relevant constructions or vocabulary both in the sample textbooks and in standard English corpora, and then compare the findings from the two sets. Most studies found considerable differences between what textbooks teach and how native speakers actually use language as evidenced in the corpora. Some textbooks gloss over important aspects of usage, or foreground less frequent stylistic choices at the expense of more common ones. The general conclusion from these studies is that non-empirically based teaching materials can be misleading, and that corpus studies should be used to inform the production of materials so that the more common choices of usage are given more attention than those which are less common.

Read about language teaching for "special purposes" in Corpus Linguistics, Chapter 4, pages 104-105.

Corpora have also been used in the teaching of linguistics. Kirk (1994) requires his students to base their projects on corpus data which they must analyse in the light of a model such as Brown and Levinson's politeness theory or Grice's co-operative principle. In taking this approach, Kirk is using corpora not only as a way of teaching students about variation in English but also to introduce them to the main features of a corpus-based approach to linguistic analysis.

A further application of corpora in this field is their role in computer-assisted language learning. Recent work at Lancaster University has looked at the role of corpus-based computer software for teaching undergraduates the rudiments of grammatical analysis (McEnery and Wilson 1993). This software - Cytor - reads in an annotated corpus (either part-of-speech tagged or parsed) one sentence at a time, hides the annotation and asks the student to annotate the sentence him- or herself. Students can call up help in the form of a list of tag mnemonics, a frequency lexicon or concordances of examples. McEnery, Baker and Wilson (1995) carried out an experiment over the course of a term to determine how effective Cytor was at teaching part-of-speech analysis, comparing two groups of students - one taught with Cytor, and another taught via traditional lecturer-based methods. In general the computer-taught students performed better than the human-taught students throughout the term.
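
The sketch below illustrates the general idea of such a drill; it is not the Cytor program itself. It reads part-of-speech tagged sentences from NLTK's copy of the Brown corpus, hides the tags, asks for the learner's guesses and reports a score; the corpus, tagset and number of sentences are arbitrary choices.

    # A toy part-of-speech drill in the spirit described above (not the
    # actual Cytor software). Assumes NLTK and the Brown corpus data:
    # python -m nltk.downloader brown
    from nltk.corpus import brown

    for sent in brown.tagged_sents(categories="news")[:3]:
        words = [w for w, tag in sent]
        print("\nTag this sentence:", " ".join(words))
        correct = 0
        for word, gold_tag in sent:
            guess = input(f"  tag for '{word}': ").strip().upper()
            if guess == gold_tag:
                correct += 1
            else:
                print(f"    corpus annotation: {gold_tag}")
        print(f"  score: {correct}/{len(sent)} tags correct")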



Corpora and Historical Linguistics

Historical linguistics can be seen as a species of corpus linguistics, since the texts of a historical period or a "dead" language form a closed corpus of data which can only be extended by the (re-)discovery of previously unknown manuscripts or books. In some cases it is possible to use (almost) all of the closed corpus of a language for research - something which can be done for ancient Greek, for example, using the Thesaurus Linguae Graecae corpus, which contains most extant ancient Greek literature. However, in practice historical linguistics has not tended to follow a strict corpus linguistic paradigm, instead taking a selective approach to empirical data, looking for evidence of particular phenomena and making rough estimates of frequency. No real attempts were made to produce samples that were representative.

In recent years, however, some historical linguists have changed their approach, resulting in an upsurge in strictly corpus-based historical linguistics and the building of corpora for this purpose. The most widely known English historical corpus is the Helsinki corpus.

The Helsinki corpus contains approximately 1.6 million words of English dating from the earliest Old English period (before AD 850) to the end of the Early Modern English period (1710). It is divided into three main periods - Old English, Middle English and Early Modern English - and each period is subdivided into a number of 100-year subperiods (or 70-year subperiods in some cases). The Helsinki corpus is representative in that it covers a range of genres, regional varieties and sociolinguistic variables such as gender, age, education and social class. The Helsinki team have also produced "satellite" corpora of early Scots and early American English.

Other examples of English historical corpora in development are the Zürich Corpus of English Newspapers (ZEN), the Lampeter Corpus of Early Modern English Tracts (a sample of English pamphlets from between 1640 and 1740) and the ARCHER corpus (a corpus of British and American English from 1650-1990).

The work carried out on historical corpora is qualitatively similar to that carried out on modern language corpora, although it is also possible to study the evolution of language through time. For example, Peitsara (1993) used four subperiods from the Helsinki corpus and calculated the frequencies of different prepositions introducing agent phrases. She found that the most common prepositions of this type were of and by, which were of almost equal frequency at the beginning of the period; by the fifteenth century, however, by was three times as common as of, and by 1640 it was eight times as common.

Studies like this have particular importance in the context of Halliday's (1991) conception of language evolution as a motivated change in the probabilities of the grammar. However, it is important to be aware of the limitations of corpus linguistics, as Rissanen (1989) pointed out. Rissanen identifies three main problems associated with using historical corpora:

  1. The "philologist's dilemma" - the danger that the use of a corpus and a computer may supplant the in-depth knowledge of language history which is to be gained from the study of original texts in their context.
  2. The "God's truth fallacy" - the danger that a corpus may be used to provide representative conclusions about the entire language period, without understanding its limitations in the terms of which genres it does and does not cover.
  3. The "mystery of vanishing reliability" - the more variables which are used in sampling and coding the corpus (periods, genres, age, gender etc) the harder it is to represent each one fully and achieve statistical reliability. The most effective way of solving this problem is to build larger corpora of course.
Rissanen's reservations are valid and important, but they should not diminish the value of corpus-based linguistics; rather, they should serve as warnings of possible pitfalls which scholars need to take on board, since with appropriate care they are surmountable.


Corpora in Dialectology and Variation Studies

In this section we are concerned with geographical variation - corpora have long been recognised as a valuable source of comparison between language varieties, as well as for the description of those varieties themselves. Certain corpora have tried to follow as far as possible the same sampling procedures as other corpora in order to maximise the degree of comparability. For example, the LOB corpus contains roughly the same genres and sample sizes as the Brown corpus and is sampled from the same year (i.e. 1961). The Kolhapur Indian corpus is also broadly parallel to Brown and LOB, although the sampling year is 1978.

One of the earliest pieces of work using the LOB and Brown corpora in tandem was the production of a word frequency comparison of American and British written English. These corpora have also been used as the basis of more complex aspects of language such as the use of the subjunctive (Johansson and Norheim 1988).

One role for corpora in national variation studies has been as a testbed for two theories of language variation: Quirk et al's (1985) "common core" hypothesis, and Braj Kachru's conception of national varieties as forming many unique "Englishes" which differ in important ways from one another. Most work on lexis and grammar comparing the Kolhapur Indian corpus with Brown and LOB has supported the common core hypothesis (Leitner 1991). However, there is still scope for the extension of such work.

Few examples of dialect corpora exist at present - two of which are the Helsinki corpus of English dialects and Kirk's Northern Ireland Transcribed Corpus of Speech (NITCS). Both corpora consist of conversations with a fieldworker - in Kirk's corpus from Northern Ireland, and in the Helsinki corpus from several English regions. Dialectology is an empirical field of linguistics, although it has tended to rely on elicitation experiments and less controlled sampling rather than on corpora. Such elicitation experiments tend to focus on vocabulary and pronunciation, neglecting other aspects of language such as syntax. Dialect corpora allow these other aspects to be studied, and because the corpora are sampled so as to be representative, quantitative as well as qualitative conclusions can be drawn about the target population as a whole.

Read about comparisons using dialect data in Corpus Linguistics, Chapter 4, page 110.


Corpora and Psycholinguistics

Although psycholinguistics is inherently a laboratory subject, measuring mental processes such as the length of time it takes to position a syntactic boundary in reading or how eye movements change, corpora can still have a part to play in this field. One important use is as a source of data from which materials for laboratory experiments can be developed. Schreuder and Kerkman (1987) point out that frequency is an important consideration in a number of cognitive processes, including word recognition. The psycholinguist should not go blindly into experiments in areas such as this with only a vague notion of frequency to guide the selection and analysis of materials. Sampled corpora can provide psycholinguists with more concrete and reliable information about frequency, including the frequencies of different senses and parts of speech of ambiguous words (if the corpora are annotated).

A more direct example of the role of corpora in psycholinguistics can be seen in Garnham et al's (1981) study, which used the London-Lund corpus to examine the occurrence of speech errors in natural conversational English. Before the study was carried out nobody knew how frequent speech errors were in everyday language, because such an analysis required adequate amounts of natural conversation, while previous work on speech errors had been based on the gradual ad hoc accumulation of data from many different sources. The spoken corpus, however, was able to provide exactly the kind of data that was required. Garnham's study was able to classify and count the frequencies of different error types and hence provide some estimate of the general frequency of these errors in relation to speakers' overall output.

A third role for corpora lies in the analysis of language pathologies, where an accurate picture of abnormal data must be constructed before it is possible to hypothesise and test what may be wrong with the human language processing system. Although little work has been done with sampled corpora to date, it is important to stress their potential for these analyses. Studies of the language of linguistically impaired people, and of the language of children who are developing their (normal) linguistic skills, have tended to lack quantified, representative descriptions. In the last decade, however, there has been a move towards the empirical analysis of machine-readable data in these areas. For example, the Polytechnic of Wales (POW) corpus is a corpus of children's language; a corpus of impaired and normal language development has been collected at Reading University; and the CHILDES database contains a large amount of impaired and normal child language in several languages.




Corpora and Cultural Studies

It is only recently that the role of a corpus in telling us about culture has really begun to be explored. After the completion of the LOB corpus of British English, one of the earliest pieces of work to be carried out was a comparison of its vocabulary with the vocabulary of the American Brown corpus (Hofland and Johansson 1982). This revealed interesting differences which went beyond the purely linguistic ones such as spelling (colour/color) or morphology (got/gotten).

Leech and Fallon (1992) used the results of these earlier studies, along with KWIC concordances of the two corpora, to check the senses in which words were being used. They then grouped the differences which were statistically significant into fifteen broad categories. The frequencies of concepts in these categories revealed differences between the two countries which were primarily cultural rather than linguistic. For example, travel words were more frequent in American English than in British English, perhaps suggestive of the larger size of the United States. Words in the domains of crime and the military were also more common in the American data, as was "violent crime" within the crime category, perhaps suggestive of the American "gun culture". In general, the findings seemed to suggest a picture of American culture at the time of the two corpora (1961) that was more macho and dynamic than British culture. Although this work is in its infancy and requires methodological refinement, it seems to be an interesting and promising area of study, which could also integrate more closely work in language learning with that in national cultural studies.
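
A sketch of how one such frequency difference might be tested for statistical significance is given below, using a 2x2 chi-squared test; the word counts are invented, and the corpus sizes simply echo the roughly one million words each of Brown and LOB.

    # Testing whether a word's frequency differs significantly between two
    # corpora with a 2x2 chi-squared test. The counts are invented and the
    # corpus sizes roughly echo the ~1 million words of Brown and LOB.
    from scipy.stats import chi2_contingency

    word_in_brown, brown_size = 140, 1_000_000
    word_in_lob, lob_size = 60, 1_000_000

    table = [
        [word_in_brown, brown_size - word_in_brown],
        [word_in_lob, lob_size - word_in_lob],
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi-squared = {chi2:.1f}, p = {p:.3g}")
    # A small p-value suggests the difference is unlikely to be due to
    # chance; such items would then be grouped into the kind of broad
    # cultural categories described above.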



Corpora and Social Psychology

Although linguists are the main users of corpora, they are not the sole users. Researchers in other fields which make use of language data have also recently taken an interest in the exploitation of corpus data - perhaps the most important of these have been social psychologists.

Social psychologists require access to naturalistic data which cannot be reproduced in laboratory conditions (unlike many other psychology-related fields), while at the same time they are under pressure to quantify and test their theories rather than rely on qualitative data. This places them in a curious position.

One area of research in social psychology is that of how and why people attempt to explain things. Explanations (or attributions) are important to the psychologist because they reveal the ways in which people regard their environment. To obtain data for studying explanations, researchers have relied on naturally occurring texts such as newspapers, diaries, company reports and so on. However, these are written texts, and most everyday human interaction takes place through the medium of speech. To solve this problem Antaki and Naji (1987) used the London-Lund corpus (of spoken language) as a source of data for explanations in everyday conversation. They took 200,000 words of conversation and retrieved all instances of the commonest causal conjunction because (and its variant cos). An analysis of a pilot sample was used to derive a classification scheme for the data, which was then used to classify all the explanations according to what was being explained - for example, "actions of speaker or speaker's group", "general states of affairs" and so on. A frequency analysis of the explanation types in the corpus showed that explanations of general states of affairs were the most common type of explanation (33.8%), followed by actions of speaker and speaker's group (28.8%) and actions of others (17.7%). This refuted previous theories that the prototypical type of explanation is the explanation of a person's single action. Work such as Antaki and Naji's shows clearly the potential of corpora to test and modify theory in subjects which require naturalistic, quantifiable language data, and one may expect other social psychologists to make use of corpora in the future.
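
The retrieval step and the final frequency analysis of such a study might look something like the sketch below; the miniature stretch of "conversation" and the hand-assigned categories are invented stand-ins for the London-Lund data and for Antaki and Naji's manual classification.

    # A minimal sketch of retrieving causal "because"/"cos" and reporting
    # the proportions of explanation types. The text and the categories
    # are invented for illustration only.
    import re
    from collections import Counter

    conversation = (
        "I stayed in because it was raining. "
        "He left early cos he was tired. "
        "Things are bad because the economy is weak."
    )

    hits = re.findall(r"\b(?:because|cos)\b", conversation, flags=re.IGNORECASE)
    print(f"{len(hits)} explanations retrieved")

    # In the real study each instance was classified by hand; here the
    # categories are simply assigned to show the frequency analysis.
    assigned = ["speaker's own action", "general state of affairs",
                "general state of affairs"]
    counts = Counter(assigned)
    total = sum(counts.values())
    for category, n in counts.most_common():
        print(f"{category:28s} {n / total:5.1%}")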



Conclusion

In this session we have seen how a number of areas of language study have benefited from exploiting corpus data. To summarise, the main advantages of corpora are:

  • Sampling and quantification. Because a corpus is sampled to represent the population as closely as possible, findings taken from the corpus can be generalised to the larger population. Hence quantification in corpus linguistics is more meaningful than other forms of linguistic quantification, because it can tell us about a variety of language as a whole, not just about the texts actually analysed.

  • Ease of access. As all of the data collection has been dealt with by someone else, the researcher does not have to deal with the issues of sampling, collection and encoding. The majority of corpora are readily available, either free or at low cost. Once the corpora have been obtained, it is usually easy to access the data within them, e.g. by using a concordance program.

  • Enriched data. Many corpora have already been enriched with additional linguistic information such as part-of-speech annotation, parsing and prosodic transcription. Hence data retrieval from annotated corpora can be easier and more specific than with unannotated data.

  • Naturalistic data. Corpus data are not always completely unmonitored, in the sense that the people producing the spoken or written texts are not always unaware, until after the fact, that their language is being collected for a corpus. But for the most part the data are largely naturalistic, unmonitored and the product of real social contexts. The corpus thus provides one of the most reliable sources of naturally occurring data that can be examined.

References

Aarts, J. (1991) "Intuition-based and observation-based grammars", in Aijmer and Altenberg 1991, pp 44-62.

Aarts, J. and Meijs, W. (eds) (1986) Corpus Linguistics II, Amsterdam: Rodopi.

Aijmer, K. and Altenberg, B. (eds) (1991) English Corpus Linguistics: Studies in Honour of Jan Svartvik, London: Longman.

Altenberg, B. (1984) "Causal linking in spoken and written English", Studia Linguistica 38: 20-69.

Antaki, C. and Naji, S. (1987) "Events explained in conversational "because" statements", British Journal of Social Psychology 26: 119-126.

Atkins, B. T. S. and Levin, B. (1995). "Building on a corpus: a linguistic and lexicographical look at some near-synonyms", International Journal of Lexicography 8:2, 85-114.

Garnham, A., Shillock, R., Brown, G., Mill, A. and Cutler, A. (1981) "Slips of the tongue in the London-Lund corpus of spontaneous conversation", Linguistics 19: 805-17.

Halliday, M. (1991) "Corpus studies and probabilistic grammar", in Aijmer and Altenberg 1991, pp 30-43.

Hofland, K. and Johansson, S. (1982) Word Frequencies in British and American English, Bergen: Norwegian Computing Centre for the Humanities.

Holmes, J. (1988) "Doubt and certainty in ESL textbooks", Applied Linguistics 9: 21-44.

Holmes, J. (1994) "Inferring language change from computer corpora: some methodological problems", ICAME Journal 18: 27-40.

Johansson, S. and Norheim, E. (1988) "The subjunctive in British and American English", ICAME Journal 12: 27-36.

Johansson, S. and Stenström, A-B. (eds) (1991) English Computer Corpora: Selected Papers and Research Guide, Berlin: Mouton de Gruyter.

Kennedy, G. (1987) "Expressing temporal frequency in academic English", TESOL Quarterly 21: 69-86.

Kennedy, G. (1987) "Quantification and the use of English: a case study of one aspect of the learner's task", Applied Linguistics 8: 264-86.

Kirk, J. (1994) "Teaching and language corpora: the Queen's approach", in Wilson and McEnery 1994, pp 29-51.

Kjellmer, G. (1986) ""The lesser man": observations on the role of women in modern English writings", in Aarts and Meijs 1986, pp 163-76.

Kytö, M., Rissanen, M. and Wright, S. (eds) (1994) Corpora across the Centuries, Amsterdam: Rodopi.

Leech, G. and Fallon, R. (1992) "Computer corpora - what do they tell us about culture?", ICAME Journal 16: 29-50.

Leech, G. and Short, M. (1981) Style in Fiction, London: Longman.

Leitner, G. (1991) "The Kolhapur corpus of Indian English: intravarietal description and/or intervarietal comparison", in Johansson and Stenström 1991, pp 215-32.

McEnery, A. and Wilson, A. (1993) "The role of corpora in computer-assisted language learning", Computer Assisted Language Learning 6(3): 233-48.

McEnery, A., Baker, P. and Wilson, A. (1995) "A statistical analysis of corpus based computer vs traditional human teaching methods of part of speech analysis.", Computer Assisted Language Learning 8(2/3): 259-74.

Meijs, W. (ed) (1987) Corpus Linguistics and Beyond, Amsterdam: Rodopi.

Mindt, D. (1991) "Syntactic evidence for semantic distinctions in English", in Aijmer and Altenberg 1991, pp 182-96.

Mindt, D. (1992) Zeitbezug im Englischen: eine didaktische Grammatik des englischen Futurs, Tübingen: Gunter Narr.

Myers, G. (1991) "Pragmatics and corpora", talk given at Corpus Linguistics Research Group, Lancaster University.

O'Connor, J. and Arnold, G. (1961) Intonation of Colloquial English, London: Longman.

Oostdijk, N. and de Haan, P. (1994a) "Clause patterns in modern British English: a corpus-based (quantitative) study", ICAME Journal 18: 41-79.

Oostdijk, N. and de Haan, P. (eds) (1994b) Corpus Based Research into Language, Amsterdam: Rodopi.

Peitsara, K. (1993) "On the development of the by-agent in English", in Rissanen, Kytö and Palander-Collin 1993, pp 217-33.

Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J. (1985) A Comprehensive Grammar of the English Language, London: Longman.

Rissanen, M. (1989) "Three problems connected with the use of diachronic corpora", ICAME Journal 13: 16-19.

Rissanen, M., Kytö, M. and Palander-Collin, M. (eds) (1993) Early English in the Computer Age, Berlin: Mouton de Gruyter.

Schreuder, R. and Kerkman, H. (1987) "On the use of a lexical database in psycholinguistic research", in Meijs 1987, pp 295-302.

Stenst&oumlm, A-B. (1984) "Discourse items and pauses", Paper presented at Fifth ICAME Conference, Windermere. Abstract in ICAME News 9 (1985): 11.

Stenst&oumlm, A-B. (1987) "Carry-on signals in English Conversatoin", in Meijs 1987, pp 87-119.

Tottie, G. (1991) Negation in English Speech and Writing: A Study in Variation, San Diego: Academic Press.

Wilson, A. (1992) The Usage of Since: A Quantitative Comparison of Augustan, Modern British and Modern Indian English, Lancaster Papers in Linguistics 80.

Wilson, A. and McEnery, A. (eds) (1994) Corpora in Language Education and Research: A Selection of Papers from Talc94, Unit for Computer Research on the English Language Technical Papers 4 (special issue), Lancaster University.



http://www.lancs.ac.uk/fss/courses/ling/corpus/Corpus4/4FRA1.HTM


Text corpus

From Wikipedia, the free encyclopedia


In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules on a specific universe.

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.

In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
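
As a small illustration of what POS-tagging adds, the snippet below tags one sentence with NLTK's default tagger; the tagger and its Penn-Treebank-style tagset are just one possible scheme, and particular corpora use their own.

    # Part-of-speech tagging: each word is paired with a tag. Uses NLTK's
    # default tagger; real corpora may use different tagsets and tools.
    # Assumes: python -m nltk.downloader punkt averaged_perceptron_tagger
    import nltk

    sentence = "Corpora are annotated with part-of-speech tags."
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
    # -> a list of (word, tag) pairs, e.g. ('annotated', 'VBN')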

Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around 1 to 3 million words. Other levels of structured linguistic analysis are possible, including annotations for morphology, semantics and pragmatics.

Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part of speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.

Archaeological corpora

Text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. Some archaeological corpora can be of such short duration that they provide a snapshot in time. One of the shortest corpora in time may be the 15-30 year Amarna letters texts (c. 1350 BC). The corpus of an ancient city (for example the "Kültepe Texts" of Turkey) may go through a series of corpora, determined by their find-site dates.


http://en.wikipedia.org/wiki/Text_corpus
