Corpora in language variation studies

Corpus Linguistics

Richard Xiao

lancsxiaoz@googlemail.com


Aims of this session

  • Lecture

    • Biber’s (1988) MF/MD approach

    • Xiao’s (2009) enhanced MDA model

    • Case study of world Englishes

  • Lab session

    • Using Xaira to explore distribution of passives across genres in FLOB


Corpora vs. register and genre analysis

  • “Register” and “genre” are two terms that are often used interchangeably

  • The corpus-based approach is well suited for the study of register variation and genre analysis

    • A corpus is created using external criteria, which define different registers and genres

    • Corpora, especially balanced sample corpora, typically cover a wide range of registers or genres

  • Biber’s (1988) MF/MD analytical framework is the most powerful tool for approaching register variation and genre analysis


Biber’s MF/MD approach

Established in Biber (1988): Variation across Speech and Writing (CUP)

Factor analysis of 67 functionally related linguistic features

481 text samples, amounting to 960,000 running words

LOB

London-Lund corpus

Brown corpus

A collection of professional and personal letters


Factor analysis

The key to the multidimensional analysis approach

A common data reduction method available in many standard statistics packages

e.g. SPSS: “Analyze – Data reduction – Factor analysis”

Reducing a large number of variables to a manageable set of underlying “factors” (“dimensions”)

e.g. questions + 1st/2nd person pronouns vs. passives + nominalization

Extensively used in social sciences to identify clusters of inter-related variables


Methodological overview

  • Collect texts with register information

  • Collect a set of potential (functionally related) linguistic features to analyze (usually based on literature review)

  • Automatically tag texts with linguistic features, post-editing where necessary

  • Identify co-occurrence patterns among linguistic features by running factor analysis on their frequencies

    • Functional interpretation of co-occurrence patterns (i.e. dimensions of variation) through analysis of co-occurring features

  • Sum the factor scores of features on each dimension

    • Mean dimension scores for each register are used to analyze similarities and differences

  • Two ways of doing MDA in genre analysis

    • Following Biber’s model and factor scores

    • Establishing your own MDA model
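
The scoring step in the overview can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature names, the frequencies and the split into positive and negative features are invented, and the score is computed as the sum of standardised frequencies of positive-loading features minus those of negative-loading features, which is how the procedure is usually described for Biber's model.

```python
from statistics import mean, pstdev

# Hypothetical normalised frequencies (per 1,000 words) for four texts.
texts = [
    {"register": "conversation", "pronoun_1_2": 60.0, "question": 8.0, "passive": 2.0, "nominalisation": 5.0},
    {"register": "conversation", "pronoun_1_2": 50.0, "question": 6.0, "passive": 4.0, "nominalisation": 7.0},
    {"register": "academic", "pronoun_1_2": 5.0, "question": 1.0, "passive": 20.0, "nominalisation": 40.0},
    {"register": "academic", "pronoun_1_2": 10.0, "question": 1.0, "passive": 18.0, "nominalisation": 35.0},
]

# Hypothetical feature sets for one dimension (assumed, not taken from the slides).
POSITIVE = ["pronoun_1_2", "question"]
NEGATIVE = ["passive", "nominalisation"]


def standardise(rows, features):
    """Convert each feature's frequencies to z-scores across all texts."""
    z_rows = [dict() for _ in rows]
    for feature in features:
        values = [row[feature] for row in rows]
        mu, sd = mean(values), pstdev(values)
        for z_row, row in zip(z_rows, rows):
            z_row[feature] = (row[feature] - mu) / sd if sd else 0.0
    return z_rows


def dimension_scores(rows, positive, negative):
    """Per text: sum of positive-feature z-scores minus negative-feature z-scores."""
    z_rows = standardise(rows, positive + negative)
    return [
        sum(z[f] for f in positive) - sum(z[f] for f in negative)
        for z in z_rows
    ]


def register_means(rows, scores):
    """Mean dimension score for each register."""
    grouped = {}
    for row, score in zip(rows, scores):
        grouped.setdefault(row["register"], []).append(score)
    return {register: mean(values) for register, values in grouped.items()}


scores = dimension_scores(texts, POSITIVE, NEGATIVE)
print(register_means(texts, scores))
```

With this toy data the conversation texts come out with a positive mean score and the academic texts with a negative one, which is the kind of contrast the mean dimension scores are meant to expose.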


    How does factor analysis work?

    • Build a correlation matrix of all variables (i.e. linguistic features)

    • From this, determine the loading (or weight) of each linguistic feature

      • Loading tells us to what degree we can generalize from this factor to the linguistic feature

      • Positive loading = positive correlation (likewise for negative)

      • A higher absolute loading = the feature is more representative of a factor/dimension or register/genre

    • Biber discarded features with absolute loadings under the cut-off point of 0.35

      • Each feature is kept only on the factor where it has its highest loading (even if it loads above 0.35 on two or more factors): one feature, one factor/dimension
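
The steps on this slide can be illustrated with a small NumPy sketch. It is not Biber's procedure: the loadings here come from a plain eigendecomposition of the correlation matrix (principal-component style, unrotated), whereas the studies discussed use factor extraction with rotation. The data are simulated and the feature names are assumptions; only the 0.35 cut-off and the "one feature, one factor" rule come from the slide.

```python
import numpy as np


def unrotated_loadings(freqs, n_factors):
    """Loadings from the eigendecomposition of the feature correlation matrix."""
    corr = np.corrcoef(freqs, rowvar=False)        # features are columns
    eigvals, eigvecs = np.linalg.eigh(corr)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_factors]    # indices of the largest ones
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


def assign_features(loadings, names, cutoff=0.35):
    """Keep each feature only on its highest-loading factor, if above the cut-off."""
    assignment = {}
    for name, row in zip(names, loadings):
        best = int(np.argmax(np.abs(row)))
        if abs(row[best]) >= cutoff:
            assignment[name] = (best, float(row[best]))
    return assignment


# Simulated frequencies for 200 texts: two features rise with a latent
# "involved" tendency, two fall with it, and one is unrelated noise.
rng = np.random.default_rng(0)
n_texts = 200
latent = rng.normal(size=n_texts)


def noisy(signal):
    """Add independent noise to a simulated feature."""
    return signal + rng.normal(scale=0.3, size=n_texts)


names = ["questions", "pronouns_1_2", "passives", "nominalisations", "unrelated"]
freqs = np.column_stack(
    [noisy(latent), noisy(latent), noisy(-latent), noisy(-latent), rng.normal(size=n_texts)]
)

loadings = unrotated_loadings(freqs, n_factors=2)
for name, (factor, loading) in assign_features(loadings, names).items():
    print(f"{name}: factor {factor + 1}, loading {loading:+.2f}")
```

Questions and pronouns should load on the first factor with one sign, and passives and nominalisations with the opposite sign. The sign of an eigenvector is arbitrary, so only the contrast between the two groups is meaningful.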


    Biber’s MF/MD approach

    Biber’s seven factors / dimensions

    1) Informational vs. involved production

    2) Narrative vs. non-narrative concerns

    3) Explicit vs. situation-dependent reference

    4) Overt expression of persuasion

    5) Abstract vs. non-abstract information

    6) Online informational elaboration

    7) Academic hedging


    Biber’s MF/MD approach

    • Factors 1, 3 and 5 are associated with “oral” and “literate” differences in English

    • The spoken vs. written distinction is too broad

      • Spoken and written registers can be similar in some dimensions but differ in others

    • “Each dimension is associated with a different set of underlying communicative functions, and each defines a different set of similarities and differences among genres. Consideration of all dimensions is required for an adequate description of the relations among spoken and written texts.” (Biber 1988: 169)


    Biber’s MF/MD approach

    • The primary motivations for the MDA approach are two assumptions (Biber 1995)

      • Generalizations about register variation in a language must be based on analysis of the full range of spoken and written registers

      • No single linguistic parameter is adequate in itself to capture the range of similarities and differences among spoken and written registers


    Biber’s MF/MD approach

    Biber’s MF/MD approach has been well received as it establishes a link between form and function

    Influential and widely used

    Synchronic analysis of specific registers / genres and author styles

    Diachronic studies describing the evolution of registers

    Register studies of non-Western languages and contrastive analyses

    Research of University English and materials development

    Move analysis and study of discourse structure

    Biber’s initial MDA model is largely confined to lexical and grammatical categories


    The enhanced MDA model

    Xiao (2009) seeks to enhance Biber’s MDA by incorporating semantic components with grammatical categories

    Wmatrix = CLAWS + USAS

    A total of 141 linguistic features investigated

    109 features retained in the final model

    Five million words in 2,500 text samples, with one million words in 500 samples for each of the 5 varieties of English

    ICE – GB, HK, India, Singapore, the Philippines

    300 spoken + 200 written samples

    12 registers ranging from private conversation to academic writing

    [Xiao, R. (2009) Multidimensional analysis and the study of world Englishes. World Englishes 28(4): 421-450.]



    141 linguistic features covered

    A) Nouns: 21 categories, e.g.

    nominalisation, other nouns; 19 semantic classes of nouns (e.g. evaluations, speech acts)

    B) Verbs: 28 categories, e.g.

    do as pro-verb, be as main verb, tense and aspect markers, modals, passives, 16 semantic categories of verbs

    C) Pronouns: 10 categories, e.g.

    person, case, demonstrative

    D) Adjectives: 11 categories, e.g.

    attributive vs. predicative use, 9 semantic categories


    141 linguistic features covered

    E) Adverbs: 7 categories

    F) Prepositions (2 categories)

    G) Subordination (3 categories)

    H) Coordination (2 categories)

    I) WH-questions / clauses (2 categories)

    J) Nominal post-modifying clauses (5 categories)

    K) THAT-complement clauses (3 categories)

    L) Infinitive clauses (3 categories)

    M) Participle clauses (2 categories)

    N) Reduced forms and dispreferred structures (4 categories)

    O) Lexical and structural complexity (3 categories)


    141 linguistic features covered

    P) Quantifiers (4 categories)

    Q) Time expressions (11 categories)

    R) Degree expressions (8 categories)

    S) Negation (2 categories)

    T) Power relationship (4 categories)

    U) Definiteness (2 categories)

    V) Helping/hindrance (2 categories)

    X) Linear order (1 category)

    Y) Seem / Appear (1 category)

    Z) Discourse bin (1 category)


    Procedure of data analysis

    1) Data clean-up

    2) Grammatical and semantic tagging with Wmatrix

    3) Extracting the frequencies of 141 linguistic features from 2,500 corpus files

    4) Building a profile of normalised frequencies (per 1,000 words) for each linguistic feature

    5) Factor analysis

    Factor extraction (Principal Factor Analysis)

    Factor rotation (Promax)

    Optimum structure: 9 factors

    6) Interpreting extracted factors in functional terms

    7) Computing factor scores of various dimensions/factors

    8) Using the enhanced MDA model in exploration of variation across registers and language varieties
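
Step 4 of the procedure is simple arithmetic, sketched below. The function names and the raw counts are hypothetical; the 2,000-word sample size follows from the corpus design described earlier (one million words in 500 samples).

```python
def per_thousand(count, n_words):
    """Normalise a raw count to a frequency per 1,000 words."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return count * 1000 / n_words


def build_profile(raw_counts, n_words):
    """Normalised frequency profile for one corpus file."""
    return {feature: per_thousand(count, n_words) for feature, count in raw_counts.items()}


# Hypothetical raw counts for one 2,000-word text sample.
raw = {"passive": 30, "nominalisation": 84, "pronoun_1_2": 12}
print(build_profile(raw, n_words=2000))
```

Normalising before factor analysis keeps texts of different lengths comparable: 30 passives in a 2,000-word sample become 15 per 1,000 words.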


    The enhanced MDA model

    Nine factors established in the new model

    1) Interactive casual discourse vs. informative elaborate discourse

    2) Elaborative online evaluation

    3) Narrative concern

    4) Human vs. object description

    5) Future projection

    6) Subjective impression and judgement

    7) Lack of temporal / locative focus

    8) Concern with degree and quantity

    9) Concern with reported speech

    Robustness of the model in register analysis


    1) Interactive casual discourse vs. informative elaborate discourse

    Private conversation is most interactive and casual

    Academic writing is most informative and elaborate

    Spoken registers are generally more interactive and less elaborate than written registers

    ANOVA:

    F=775.86

    p<0.0001

    R2=77.4%
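
The F and R2 figures reported for each factor can be read with the help of a small worked example. The sketch below computes a one-way ANOVA by hand on invented factor scores for three registers. It assumes that R2 on these slides is the share of variance in factor scores accounted for by register (between-group sum of squares over total sum of squares). The p-value is left out because it needs the F distribution, which a statistics package would supply.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, R2) for a dict mapping group name to a list of scores."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = mean(all_scores)
    ss_between = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2 for scores in groups.values()
    )
    ss_within = sum(
        (x - mean(scores)) ** 2 for scores in groups.values() for x in scores
    )
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    r_squared = ss_between / (ss_between + ss_within)
    return f_value, r_squared


# Invented factor scores, three texts per register.
factor_scores = {
    "private conversation": [1.0, 2.0, 3.0],
    "news reportage": [4.0, 5.0, 6.0],
    "academic writing": [7.0, 8.0, 9.0],
}
f_value, r_squared = one_way_anova(factor_scores)
print(f"F={f_value:.2f}, R2={r_squared:.1%}")
```

For this toy data the between-group sum of squares is 54 and the within-group sum is 6, giving F = 27 and R2 = 90%. A high R2 means that register membership accounts for most of the variation along the dimension.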


    2) Elaborative online evaluation

    Public dialogue (e.g. broadcast discussion / interview, parliamentary debate) has the most prominent focus on elaborative online evaluation

    Unscripted monologue also involves a high level of elaborative online evaluation

    Persuasive writing (e.g. press editorials) may relate to elaborative evaluation but is not restricted by real-time production

    Private conversation is least elaborative even if the evaluation is made online

    Evaluation is not a concern in creative writing

    F=102.20

    p<0.0001

    R2=31.1%


    3) Narrative concern

    Unscripted monologue (e.g. demonstrations, presentations, sports commentaries) has a narrative concern

    Unsurprisingly, creative writing is also narrative

    Narrative is not a concern in academic writing, non-professional writing (student essays and exam scripts), and instructional writing (argumentation, instruction)

    F=134.50

    p<0.0001

    R2=37.3%


    4) Human vs. object description

    Private conversation is most likely to have a focus on people

    Correspondence (social letters and business letters) also involves human description

    Instructional writing tends to give concrete descriptions of objects

    Academic and non-academic writings can also be concrete when an object or substance is described

    F=44.03

    p<0.0001

    R2=16.3%


    5) Future projection

    Persuasive writing (e.g. press editorials, trying to influence people’s future attitudes and actions) has the most prominent focus on future projection

    Correspondence and public dialogue also involve future projection to varying extents

    Academic writing is least concerned with future projection (timeless truth?)

    F=28.10

    p<0.0001

    R2=11.1%


    6) Subjective impression / judgement

    The factor score of creative writing is far greater than that of any other register

    Frequent use of possessive and reflexive pronouns, as well as adjectives of judgement / appearance

    Scripted and unscripted monologue, public dialogue and news reportage tend to avoid expressions of subjective impression and judgement (trying to appear/sound objective and impartial as far as possible)

    Instructional writing, private conversation, and student essays display low scores in this dimension

    They do not have a focus on personal impression and judgement

    F=126.22

    p<0.0001

    R2=35.8%


    7) Lack of temporal / locative focus

    Student essays and persuasive writing (argumentation and persuasion) do not have a temporal / locative focus (not concerned with concepts such as when, how long, and where)

    Such specific information is of vital importance in correspondence (social and business letters)

    F=89.55

    p<0.0001

    R2=28.4%


    8) Concern with degree / quantity

    Non-academic popular writing (e.g. popular science writing) has the greatest concern with degree and quantity

    Persuasive writing also displays a high propensity for expressions of degree and quantity

    In contrast, such expressions tend to be avoided in instructional writing (e.g. administrative documents) and correspondence

    F=19.33

    p<0.0001

    R2=7.9%


    9) Concern with reported speech

    News reportage has the greatest concern with reported speech (both direct and indirect speech)

    Reported speech is also very common in creative writing (fictional dialogue)

    Instructional writing and academic prose do not appear to have a concern with reported speech

    F=80.02

    p<0.0001

    R2=26.1%


    12 registers along 9 factors

    Factor 1 is the dimension along which the 12 registers demonstrate the sharpest contrasts

    Interactive casual discourse vs. informative elaborate discourse: a fundamental aspect of variation across registers

    Robustness of the model


    Case study summary

    Summary

    Seeking to enhance Biber’s MDA model with semantic components

    Introducing the new model in research of World Englishes

    Cao, Y. & Xiao, R. (2013) “A multidimensional contrastive study of English abstracts by native and nonnative writers”. Corpora, 8 (1-2)

    Lab session: Exploring distribution of passives in the FLOB corpus

    Andrew H. and Xiao R. (2005) Introduction to Xaira. UCREL Corpus Research Group, Lancaster, November 2005.

    Part 1. All about Xaira: www.lancs.ac.uk/staff/xiaoz/papers/crg_xaira_part1.ppt

    Part 2. Using Xaira to explore corpora: www.lancs.ac.uk/staff/xiaoz/papers/crg_xaira_part2.ppt








    Open subcorpora




    Query builder



    Define 1st search node

    Select all tags starting with VB


    Define 2nd search node

    Select all tags starting with VVN


    Define link type

    [For demonstration purposes, only passives with the verb BE followed immediately by a past participle will be included]
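
Outside Xaira, the same two-node query can be approximated in a few lines of Python. This is a sketch only: it assumes the text is already available as (word, tag) pairs with CLAWS-style tags, and the sample sentence and its tags are made up for illustration. The prefixes VB and VVN follow the search nodes defined above.

```python
def count_be_passives(tagged_tokens):
    """Count a VB* tag immediately followed by a VVN* tag."""
    return sum(
        1
        for (_, first_tag), (_, second_tag) in zip(tagged_tokens, tagged_tokens[1:])
        if first_tag.startswith("VB") and second_tag.startswith("VVN")
    )


# Hypothetical tagged sentence: "The report was written and has been quickly revised"
sample = [
    ("The", "AT"), ("report", "NN1"), ("was", "VBDZ"), ("written", "VVN"),
    ("and", "CC"), ("has", "VHZ"), ("been", "VBN"), ("quickly", "RR"),
    ("revised", "VVN"),
]

hits = count_be_passives(sample)
rate = hits * 1000 / len(sample)  # normalised per 1,000 tokens
print(hits, round(rate, 1))
```

Only "was written" is counted here. "been quickly revised" is missed because an adverb intervenes, which is exactly the limitation the note on this slide accepts for demonstration purposes.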


    Random sampling



    Sorted by %

