SCIENTEXT: A Corpus of French & English Academic & Scientific Texts


SCIENTEXT: A Corpus of French & English Academic & Scientific Texts. Alice Henderson (for the Scientext team), LLS research group, Université de Savoie, Chambéry, France. BAAL, University of Newcastle, September 3-5, 2009.


A Corpus of French & English

Academic & Scientific Texts

Alice Henderson (for the Scientext team)

LLS research group, Université de Savoie

Chambéry, France

BAAL, University of Newcastle, September 3-5, 2009




[Diagram: "Ecrits universitaires" (university writing) vs. "Ecrits de recherche" (research writing) = scientifique (scientific)]

  • General overview of the Scientext project
  • End product & applications
  • Goals of the linguistic study
  • Details of the corpus & tagging
  • Presentation of the beta version
General Overview
  • Goals:
    • Create a freely-available corpus of scientific & academic writing in French & English
    • Devise tools for studying linguistic markers of stance/positioning AND reasoning
  • Intended Users: Linguists, epistemologists, information retrieval specialists, scientists, language teachers.
  • Long-Term Applications:
    • L1 & FL/L2 teaching
    • Lexicography & writing aids
    • Information retrieval in scientific & technical fields
General Overview
  • Draws on several branches of linguistics:
    • Corpus linguistics: creation & study of a large corpus of scientific & academic texts
    • Natural Language Processing: processing & study of a corpus using a syntactic dependency parser (Bourigault’s Syntex).
    • Traditional branches of linguistics: discourse analysis, lexicology, enunciation, syntax and semantics
  • Project coordinated by the LIDILEM research group (F. Grossmann, A. Tutin); 3 teams, multidisciplinary
    • LIDILEM (Grenoble) : F. Grossmann, A. Tutin, F. Boch, C. Cavalla, O. Kraif, M. Florez, I. Novakova, M.L. Nguyen, F. Rinck.
    • LLS (Chambéry) : J. Osborne, A. Henderson, R. Barr.
    • LiCorN (Lorient) : G. Williams, H. Maury, C. Ropers.
End Product & Applications
  • Web site with several ways of selecting sub-parts of texts.
    • Query search (complex & simple) and text view
    • Search for traces of stance/positioning and reasoning using local pre-established grammars
    • Downloading of XML corpus (for authors who gave permission, Creative Commons)
    • Downloading of search results (zip format, CSV format for statistics)
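
As a rough illustration of how downloaded results might feed into statistics, the sketch below counts hits per text genre and per lemma from a CSV export. The column names (`genre`, `lemma`, `left`, `right`) and the sample rows are invented for the example; the actual Scientext export format is not described in these slides.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV export: column names and rows are invented for illustration.
SAMPLE_CSV = """genre,lemma,left,right
article,hypothèse,Cette,est confirmée
thesis,hypothèse,Notre,semble plausible
article,notion,La,reste floue
"""

def count_by(csv_text, column):
    """Count how many result rows carry each value of `column`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)

genre_counts = count_by(SAMPLE_CSV, "genre")
lemma_counts = count_by(SAMPLE_CSV, "lemma")
print(genre_counts.most_common())
print(lemma_counts.most_common())
```

Counting by genre rather than by field mirrors the project's hypothesis that phraseology varies more with genre, but any column could be passed in.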
End Product & Applications
  • Website allowing selection of sub-parts of texts
  • Teaching applications for both L1 and L2 learners: research into university writing, second language production, etc.
  • Lexicographical applications including assistance with encoding strategies using reference corpora.
  • Targeted information retrieval in scientific and technical fields.
The Linguistic Study
  • Focus on 2 essential features of the texts:
    • Authors use stance to situate themselves in relation to previous and contemporary research whilst demonstrating what is specific to their work and the choices made.
    • The intellectual process upon which findings and deductions are based can be revealed via the analysis of authorial reasoning.
  • Test two hypotheses:
    • Stance is expressed by a phraseology that is shared (partly? largely?) across fields
    • This phraseology is more characteristic of genres than of fields
The Linguistic Study
  • Distinguish between 3 main parameters: field, text genre (and sub-genres), text section
  • Scientific sub-genres:
    • Scientific articles
    • Conference proceedings
    • PhD theses, HDR (habilitation à diriger des recherches)
  • Academic sub-genres (learner corpus):
    • 2nd year English majors, long essays
    • 3rd year English majors, language policy analyses
Details: French scientific corpus

234 texts (1997-2008), 5 million words

Details: English corpora
  • Academic (learner) corpus (Chambéry, 1997-2007)
    • 1.1 million words, 300 texts, 4000-5000 words long
  • Scientific corpus (Lorient)
    • 33 million words “hoovered” from BMC Corpus of Biology and Medical Texts
    • POS-tagged & lemmatised
    • Theoretical analysis of meaning transfers for the analysis of diachronic & synchronic meaning changes in context through collocational resonance
    • Creation of a bottom-up dictionary of verb patterns with corpus-driven thematic and conceptual groupings for NNS scientists
Corpus Tagging (French sci. + Eng academic/learner)
  • XML format (Text Encoding Initiative)
  • Tagged elements
    • Header:
      • Type of tagging, information about the text, availability of the text
    • Text Structure (semi-automatic tagging):
      • Identification of text sections: abstract, introduction, body of the text, conclusion, notes, references.
      • Lay-out (when available): bold, italics, structure of lists
    • Linguistic Tagging (automatic):
      • Morpho-syntactic tagging & identification of syntactic dependencies (Bourigault’s Syntex, 2007 version)
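
To make the tagging scheme more concrete, here is a minimal sketch of reading section-level structure from a TEI-like file with Python's standard library. The element and attribute names (`div`, `type`, `hi`, `rend`) follow common TEI usage but are assumptions, as is the sample document; the real Scientext files may use a TEI namespace and a different layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like document: element and attribute names are assumptions.
SAMPLE_XML = """<TEI>
  <teiHeader>
    <title>Sample article</title>
    <availability status="free"/>
  </teiHeader>
  <text>
    <div type="abstract"><p>We test one hypothesis.</p></div>
    <div type="introduction"><p>Previous work is <hi rend="italics">limited</hi>.</p></div>
    <div type="conclusion"><p>The hypothesis holds.</p></div>
  </text>
</TEI>"""

def sections(xml_text):
    """Map each section type to its plain text, flattening inline layout tags."""
    root = ET.fromstring(xml_text)
    result = {}
    for div in root.iter("div"):
        # itertext() walks nested elements such as <hi>, so layout markup is dropped.
        raw = "".join(div.itertext())
        result[div.get("type")] = " ".join(raw.split())
    return result

parsed = sections(SAMPLE_XML)
print(parsed["introduction"])
```

Keying the result by section type is what would let a search be restricted to, say, introductions or conclusions only.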
Presentation of the beta version
  • Web site available on-line:
  • Interface created by Achille Falaise, using the query language Concquest developed by Olivier Kraif (Université Grenoble 3)
Step 2 : Searching in the texts
  • 3 search modes
    • Simple interface, with scroll-menus and predefined values
    • Complex query language, so grammars can be created/written
    • Local grammars, involving stance/positioning or reasoning
      • Example: grammar of scientific affiliation
Example of a simple query
  • Selection of predicate adjectives used with the noun policy.
Examples of predefined searches
  • Verbs of feeling: hate, love, feel, like, …
  • Verbs of opinion: consider, think, find, …
  • Evaluative adjectives: true, great, important, best, new, right, …
Example of a complex query (advanced search)
  • Search for syntactic dependency + co-occurrence:

<hypothèse,#1><>*<cat=V,#2> :: (SUJ,#2,#1);

Verbs which come after the lemma hypothèse, where hypothèse is the subject of the verb.

  • Search for a disjunction of lemmas + syntactic dependency:

<lemma=/(hypothèse|notion|concept)/,#1> && <cat=V,#2> && <cat=A,#3> :: (SUJ,#2,#1) AND (ADJ,#1,#3) ;

The lemmas hypothèse, notion or concept functioning as subjects & accompanied by an adjective.
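
The logic of the two queries above can be sketched in Python on a toy parsed sentence. This is not the project's query engine: the `Token` class, the sample sentence, the reading of each relation as (relation, governor, dependent), and the treatment of `&&` as order-free co-occurrence within a sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    idx: int    # position in the sentence
    form: str
    lemma: str
    cat: str    # coarse part of speech: N, V, A, D, ...

# Toy parsed sentence: "Cette hypothèse intéressante reste fragile"
TOKENS = [
    Token(0, "Cette", "ce", "D"),
    Token(1, "hypothèse", "hypothèse", "N"),
    Token(2, "intéressante", "intéressant", "A"),
    Token(3, "reste", "rester", "V"),
    Token(4, "fragile", "fragile", "A"),
]
# Dependencies as (relation, governor index, dependent index); the order is an assumption.
DEPS = {("SUJ", 3, 1), ("ADJ", 1, 2), ("ATTS", 3, 4)}

def subject_verb_pairs(tokens, deps, lemmas, ordered=True):
    """Pairs (noun, verb) where the noun's lemma is in `lemmas` and it is the verb's subject."""
    pairs = []
    for n in tokens:
        if n.lemma not in lemmas:
            continue
        for v in tokens:
            if v.cat != "V" or ("SUJ", v.idx, n.idx) not in deps:
                continue
            if ordered and v.idx <= n.idx:
                continue  # first query: the verb must come after the noun
            pairs.append((n, v))
    return pairs

def subject_verb_adjective(tokens, deps, lemmas):
    """Second query: subject nouns that are also modified by an adjective."""
    triples = []
    for n, v in subject_verb_pairs(tokens, deps, lemmas, ordered=False):
        for a in tokens:
            if a.cat == "A" and ("ADJ", n.idx, a.idx) in deps:
                triples.append((n, v, a))
    return triples

q1 = [v.lemma for _, v in subject_verb_pairs(TOKENS, DEPS, {"hypothèse"})]
q2 = [(n.lemma, v.lemma, a.lemma)
      for n, v, a in subject_verb_adjective(TOKENS, DEPS, {"hypothèse", "notion", "concept"})]
print(q1)
print(q2)
```

Note that "fragile" is not returned by the second query, because it is attached to the verb by an ATTS relation rather than to the noun by ADJ.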

Example of a local grammar (to write an advanced search)
  • Using variables
  • Re-defining a relation
    • Ex : (ATTSUJ,#2,#1) = (ATTS,#3,#1) AND (SUJ,#3,#2)
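
The relation re-definition above amounts to joining two relations on their shared verb. A minimal Python sketch, assuming each dependency is stored as a (relation, governor, dependent) triple and using lemmas as identifiers for readability (both are illustrative choices, not the project's data model):

```python
def derive_attsuj(deps):
    """Build ATTSUJ(subject, attribute) from ATTS(verb, attribute) and SUJ(verb, subject)."""
    derived = set()
    for rel_a, verb_a, attribute in deps:
        if rel_a != "ATTS":
            continue
        for rel_s, verb_s, subject in deps:
            # Join on the shared verb (#3 in the slide's notation).
            if rel_s == "SUJ" and verb_s == verb_a:
                derived.add(("ATTSUJ", subject, attribute))
    return derived

# Toy dependencies for "l'hypothèse reste fragile".
SAMPLE_DEPS = {("SUJ", "rester", "hypothèse"), ("ATTS", "rester", "fragile")}
attsuj = derive_attsuj(SAMPLE_DEPS)
print(attsuj)
```

A derived relation of this kind would let a query such as "predicate adjectives used with the noun policy" be stated in one step instead of two.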
Step 3 : Display
  • KWIC display, can be customised
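
For readers unfamiliar with KWIC (keyword in context), the sketch below shows the general idea on a plain token list. The function name, the `width` parameter and the sample sentence are illustrative only and say nothing about how the Scientext interface builds its display.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

TEXT = "This policy is important because the policy seems effective in practice"
hits = kwic(TEXT.split(), "policy", width=2)
for left, key, right in hits:
    # Right-align the left context so the keyword forms a central column.
    print(f"{left:>20}  {key}  {right}")
```

Changing `width` is the simplest form of customisation: it controls how many tokens of context appear on each side of the keyword.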
Displaying a wider context
  • Display of a wider context
  • Project still running (through early 2010)
    • Building & tagging the corpus: LONG … & tedious
    • Interface still being developed
    • Linguistic model still needs finalising
    • More grammars need to be developed
    • Teaching materials need developing & piloting
  • Issues: interface between lexis & rhetorical functions
  • Future Research
    • Linguistic study of markers :
      • “positioned” citations
      • markers of scientific affiliation
    • Teaching materials need piloting & evaluating

Thank you!!

(and please try it out)

Publications & resources linked to the Scientext project
  • Boch F., Grossmann F. (2002). “Se référer au discours d’autrui : quelques éléments de comparaison entre experts et néophytes”. L’écrit dans l’enseignement supérieur. Enjeux, Brussels, pp. 41-51.
  • Boch F., Grossmann F., Rinck F. (2007). “‘Conformément à nos attentes ...’ ou l’étude des marqueurs de convergence/divergence dans l’article scientifique”. Revue Française de Linguistique Appliquée, Vol. XII-2, pp. 109-122.
  • Bourigault D. (2007). SYNTEX, analyseur syntaxique opérationnel. Mémoire d’habilitation à diriger des recherches, Université Toulouse Le Mirail.
  • Chavez I. (2008). La démarcation dans les écrits scientifiques - Les collocations transdisciplinaires comme aide à l’écrit universitaire auprès des étudiants étrangers. Mémoire de Master 2 Français Langue Etrangère Recherche, C. Cavalla (supervisor), Université Stendhal-Grenoble3: Grenoble.
  • Garcia P.P. (2008). Etude des marques de la filiation dans les écrits scientifiques. Master 1 thesis, F. Grossmann and A. Tutin (supervisors), Université Stendhal-Grenoble3.
  • Grossmann F., Tutin A. (2008). “Evidential Markers in French Scientific Writing: the Case of the French Verb voir”. Evidentiality Workshop, Bamberg, 27-29 February 2008.
  • Henderson A., Barr R. (2009). “Corpus-based L2 Writing Instruction: Raising Awareness of Authorial Stance”. Journal of Writing Research (forthcoming).
  • Rinck F. (2006). L’article de recherche en Sciences du Langage et en Lettres, Figure de l’auteur et approche disciplinaire du genre. Doctoral thesis, Sciences du Langage, F. Boch and F. Grossmann (supervisors), Université Stendhal-Grenoble3: Grenoble.
  • Rinck F., Boch F., Grossmann F. (2007). “Quelques lieux de variation du positionnement énonciatif dans l’article de recherche”, in Lambert P., Millet A., Rispail M., Trimaille C. (eds). Variations au coeur et aux marges de la sociolinguistique. L’Harmattan, Espaces Discursifs, Paris.
  • Tutin A. (2008). “Evaluative adjectives in academic writings”. Interpersonality in written academic discourse: perspectives across languages and cultures, 11-13 December, Zaragoza, Spain.
  • Tutin A. (2007a) (ed.). “Lexique et écrits scientifiques”. Revue Française de Linguistique Appliquée, Vol. XII-2, December 2007.
  • Tutin A. (2007b). “Modélisation linguistique et annotation des collocations : application au lexique transdisciplinaire des écrits scientifiques”, in S. Koeva, D. Maurel, M. Silberztein (eds). Formaliser les langues avec l’ordinateur. Presses universitaires de Franche-Comté: Besançon.
  • Williams G., Millon C. (2009). “The General and the Specific: Collocational resonance of scientific language”. Proceedings Corpus Linguistics 2009, University of Liverpool (forthcoming).
  • Williams G., Millon C. (2009). “Les verbes et la science: la construction d’un dictionnaire organique”. Actes des Journées de la Linguistique de Corpus 2009. Texte et Corpus (forthcoming).
  • Williams G. (2008). “Le Corpus et le dictionnaire dans les langues de spécialité”, in Maniez et al. (eds). Corpus et dictionnaires de langues de spécialité. Presses Universitaires de Grenoble, pp. 135-151.
  • Williams G. (2008). “Verbs of Science and the Learner’s Dictionary”. Proceedings Thirteenth EURALEX International Congress, Barcelona, Spain, 15-19 July 2008, pp. 797-806.