Word lists by frequency

From Wikipedia, the free encyclopedia

 

Word lists by frequency are lists of a language's words grouped by frequency of occurrence within a given text corpus, either by levels or as a ranked list, and are used to support vocabulary acquisition. A word list by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort" (Nation 1997), but is mainly intended for course writers, not directly for learners. Some major pitfalls are the corpus content, the corpus register, and the definition of "word". Word counting is a thousand years old, and enormous analyses were still being done by hand in the mid-20th century, but electronic natural language processing of large corpora such as movie subtitles (the SUBTLEX megastudy) has accelerated research in the field.

In computational linguistics, a frequency list is a sorted list of words (word types) together with their frequency, where frequency usually means the number of occurrences in a given corpus. The rank, which is less meaningful, can be derived from the frequency.

Type              Occurrences  Rank
the               3789654      1st
he                2098762      2nd
[...]
king              57897        1,356th
boy               56975        1,357th
[...]
stringyfy         5            34,589th
[...]
transducionalify  1            123,567th
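
As a rough illustration of how such a table can be produced, the sketch below counts word types in a token sequence and derives the rank from the counts. The function name `frequency_list` and the sample sentence are assumptions made for this example, and ties are broken alphabetically, which is only one possible convention.

```python
from collections import Counter

def frequency_list(tokens):
    """Return (type, occurrences, rank) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, n, rank) for rank, (word, n) in enumerate(ordered, start=1)]

# Hypothetical toy corpus, already split into tokens.
sample = "the king and the boy saw the king".split()
rows = frequency_list(sample)

for word, n, rank in rows:
    print(f"{word:<6}{n:>4}{rank:>5}")
```

The rank carries no information beyond the ordering of the counts, which is why it is derived last.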

Methodology

Factors

Nation (1997) noted how much computing capabilities have helped, making corpus analysis far easier. He cited several key issues which influence the construction of frequency lists:

Corpora

Traditional written corpus
Main article: Text corpus

Most currently available studies are based on written texts.

SUBTLEX movement

However, New et al. 2007 proposed tapping into the large number of subtitles available online in order to analyse large amounts of speech. Brysbaert & New 2009 made a long critical evaluation of the traditional textual analysis approach, and supported a move toward speech analysis and the analysis of film subtitles available online. This has since been followed by a handful of follow-up studies, providing valuable frequency counts for various languages. Within five years, the SUBTLEX movement completed full studies for French (New et al. 2007), American English (Brysbaert & New 2009; Brysbaert, New & Keuleers 2012), Dutch (Keuleers & New 2010), Chinese (Cai & Brysbaert 2010), Spanish (Cuetos et al. 2011), Greek (Dimitropoulou et al. Carreiras), Vietnamese (Pham, Bolger & Baayen 2011), and Polish.[1]

Lexical unit

In any case, the basic "word" unit must be defined. For Latin scripts, words are usually one or several characters separated by spaces or punctuation. Exceptions can arise, however, such as English "can't", French "aujourd'hui", or idioms. It may also be preferable to group the words of a word family under its base word. Thus, possible, impossible, and possibility are words of the same word family, represented by the base word *possib*. For statistical purposes, all these words are summed under the base word form *possib*, so that the concept as a whole can be ranked as well as its individual forms. Moreover, other languages may present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a given string of several characters can be interpreted either as a phrase of single-character words or as a single multi-character word.
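
The two decisions described above, where a word ends and which forms count together, can be sketched as follows. This is a minimal illustration under stated assumptions: the regular expression keeps word-internal apostrophes so that "can't" and "aujourd'hui" stay whole, and the `FAMILY` table is a hand-written, hypothetical mapping to the base form possib; a real study would rely on a lemmatizer or a curated family list, and would need a dedicated segmenter for Chinese.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by an internal apostrophe (straight or curly).
TOKEN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

# Hypothetical word-family table, written by hand for this example only.
FAMILY = {
    "possible": "possib",
    "impossible": "possib",
    "possibility": "possib",
}

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return TOKEN.findall(text.lower())

def family_counts(tokens, families=FAMILY):
    """Count tokens, summing members of a family under their base form."""
    return Counter(families.get(token, token) for token in tokens)

text = "It can't be impossible: every possibility is possible aujourd'hui."
tokens = tokenize(text)
counts = family_counts(tokens)
print(counts["possib"])
```

Idioms and other multi-word units are not handled here; they would need a phrase list applied before counting.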

Statistics

It seems that Zipf's law holds for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.
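
Zipf's law states that the frequency of a word is roughly inversely proportional to its rank, so a plot of log frequency against log rank is close to a straight line with slope near -1. One way to check a frequency list against it is sketched below; the function name `zipf_slope` and the idealized counts are assumptions made for this example, and an ordinary least-squares fit is only a crude diagnostic.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Idealized counts where frequency is exactly 1200 / rank.
ideal = [1200, 600, 400, 300, 240]
slope = zipf_slope(ideal)
print(round(slope, 6))
```

Real corpora deviate from the ideal line, especially at the very top and in the long tail of words that occur once.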

German linguists define the Häufigkeitsklasse (frequency class) N of an item in the list using the base 2 logarithm of the ratio between its frequency and the frequency of the most frequent item. The most common item belongs to frequency class 0 (zero) and any item that is approximately half as frequent belongs in class 1. Taking the example list above, whose most frequent item occurs 3,789,654 times, a misspelled word such as outragious with 76 occurrences has a ratio of 76/3789654 and belongs in class 16.

N=\left\lfloor0.5-\log_2\left(\frac{\text{Frequency of this item}}{\text{Frequency of most common item}}\right)\right\rfloor

where \lfloor\ldots\rfloor is the floor function.
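
The formula translates directly into code. The sketch below is a straightforward transcription, with the function name `frequency_class` chosen for this example; the counts are the ones quoted in this section.

```python
import math

def frequency_class(freq, top_freq):
    """Häufigkeitsklasse: floor(0.5 - log2(freq / top_freq))."""
    return math.floor(0.5 - math.log2(freq / top_freq))

TOP = 3789654  # occurrences of the most frequent item, "the"

print(frequency_class(TOP, TOP))      # 0
print(frequency_class(2098762, TOP))  # 1
print(frequency_class(76, TOP))       # 16
```

Adding 0.5 before taking the floor rounds the negated logarithm to the nearest integer, so each class covers roughly a factor of two in frequency.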

Frequency lists, together with semantic networks, are used to identify the least common, specialized terms to be replaced by their hypernyms in a process of semantic compression.
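
A toy version of this idea is sketched below. Everything in it is an assumption made for illustration: the reference counts, the threshold, and the `HYPERNYMS` table, which stands in for a real semantic network.

```python
# Hypothetical reference frequency list and hypernym table.
REFERENCE_COUNTS = {"the": 1000, "and": 900, "dog": 120, "chased": 40,
                    "spaniel": 2, "terrier": 3}
HYPERNYMS = {"spaniel": "dog", "terrier": "dog"}

def compress(tokens, counts=REFERENCE_COUNTS, hypernyms=HYPERNYMS, min_count=10):
    """Replace tokens rarer than min_count by their hypernym, when one is known."""
    result = []
    for token in tokens:
        if counts.get(token, 0) < min_count and token in hypernyms:
            result.append(hypernyms[token])
        else:
            result.append(token)
    return result

tokens = "the dog chased the spaniel and the terrier".split()
compressed = compress(tokens)
print(" ".join(compressed))
```

Rare words with no entry in the table are left unchanged, so the result depends entirely on the coverage of the semantic network.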

Pedagogy

These lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors (Nation 1997). Paul Nation's summary of modern language teaching encourages teachers first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion" (Nation 2006la).

Effects of word frequency

Word frequency is known to have various effects (Brysbaert et al. Bölte; Rudell 1993). Memorization is positively affected by higher word frequency, likely because the learner is exposed to the word more often (Laufer 1997). Lexical access is also positively influenced by high word frequency (Segui, Mehler, Frauenfelder & Morton 1982).

Languages

Below is a review of available resources.

English

Word counting dates back to Hellenistic times. Thorndike & Lorge, assisted by their colleagues, counted 18,000,000 running words to provide the first large-scale frequency list in 1944, before modern computers made such projects far easier (Nation 1997).

Traditional lists

The Teacher's Word Book of 30,000 Words (Thorndike and Lorge, 1944)

The TWB contains 30,000 lemmas or ~13,000 word families (Goulden, Nation and Read, 1990). A corpus of 18,000,000 written words was analysed by hand. The size of its corpus increased its usefulness, but its age and subsequent language change have reduced its applicability (Nation 1997).

The General Service List (West, 1953)

The GSL contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5,000,000 written words was analysed in the 1940s. The rate of occurrence (%) is provided for the different meanings and parts of speech of each headword, and the list also reflects a careful application of criteria other than frequency and range. Thus, despite its age, some errors, and its solely written base, it is still an excellent database (word frequency, frequency of meanings, reduction of noise) (Nation 1997).

The American Heritage Word Frequency Book (Carroll, Davies and Richman, 1971)

This list is based on a corpus of 5,000,000 running words from written texts used in United States schools (various grades, various subject areas). Its value lies in its focus on school teaching materials and in its tagging of words, namely the frequency of each word in each school grade level and in each subject area (Nation 1997).

The Brown (Francis and Kucera, 1982), LOB and related corpora

These now consist of 1,000,000-word written corpora representing different dialects of English. These sources are used to produce frequency lists (Nation 1997).

French

Traditional datasets

A review has been made by New & Pallier 3.01. An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list with 1,500 high-frequency words, completed by a later F.F.2 list with 1,700 mid-frequency words, and the most used syntax rules.[2] It is claimed that 70 grammatical words constitute 50% of a communicative sentence,[3] while 3,680 words give about 95 to 98% coverage.[4] A list of 3,000 frequent words is available.[5]
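
Coverage figures of this kind are cumulative shares: the number of running words accounted for by the k most frequent items, divided by the total number of running words. A minimal sketch follows, using invented counts rather than any of the lists cited here; the function name `coverage` is likewise an assumption.

```python
def coverage(counts, k):
    """Share of all running words covered by the k most frequent items."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical occurrence counts for six word types (100 running words in total).
counts = [50, 20, 10, 10, 5, 5]

print(coverage(counts, 1))
print(coverage(counts, 3))
```

Because frequencies fall off steeply with rank, the first few items contribute most of the coverage and each additional item contributes less.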

The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, compiled by the lexicologist Étienne Brunet.[6] Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain".[7]

More recently, the Lexique 3 project provided a list of 135,000 French words, with orthography, phonetics, syllabification, part of speech, gender, number, frequency, associated lexemes, etc., available under an open-source license.[8]

SUBTLEX

New et al. 2007 produced an entirely new count based on online film subtitles.

Spanish

There have been several studies of Spanish word frequency (Cuetos et al. 2011).[9]

Chinese

As a frequency toolkit, Da (Da 1998) and the Taiwanese Ministry of Education (TME 1997) provided large databases with frequency ranks for characters and words. The HSK list of 8,848 high and medium frequency words in the People's Republic of China, and the Republic of China (Taiwan)'s TOP list of about 8,600 common traditional Chinese words are two other lists displaying common Chinese words and characters. Following the SUBTLEX movement, Cai & Brysbaert 2010 recently made a rich study of Chinese word and character frequencies.

References

  1. ^ http://www.ncbi.nlm.nih.gov/pubmed/24942246
  2. ^ "Le français fondamental".
  3. ^ Ouzoulias, André (2004), Comprendre et aider les enfants en difficulté scolaire: Le Vocabulaire fondamental, 70 mots essentiels, Retz  - Citing V.A.C Henmon
  4. ^ "Generalities". 
  5. ^ "PDF 3000 French words". 
  6. ^ "Maitrise de la langue à l'école: Vocabulaire". Ministère de l'éducation nationale. 
  7. ^ Baudot, J. (1992), Fréquences d'utilisation des mots en français écrit contemporain, Presses de L'Université, ISBN 2-7606-1563-4 
  8. ^ http://www.lexique.org/
  9. ^ "Spanish word frequency lists". Vocabularywiki.pbworks.com. 

See also

Sources

Theoretical concepts
Written texts-based databases
SUBTLEX movement