
Compiling a corpus of transcribed speech




  1. Compiling a corpus of transcribed speech
     Guy Aston, guy@sslmit.unibo.it

  2. Anyqs
     • A corpus for classroom use in training interpreters
     • Transcribed spontaneous speech (hard to come by)
     • Understandable without detailed contextual information (standard format)
     • Contemporary
     • Quite a lot (currently 1.2M words)
     • Easy to encode in TEI and to index with XAIRA

  3. No way is this publicly available
     • The BBC site contains transcripts of all Any Questions programmes from the last 3 years, which you can download freely for personal non-commercial use.
     • But/and you cannot adapt, alter or create a derivative work except for your own personal, non-commercial use.

  4. What the BBC’s original looks like …

     PRESENTER: Jonathan Dimbleby
     PANELLISTS: Lord Falconer
     Malcolm Rifkind
     Anne McElvoy
     Chris Huhne
     FROM: Medical Women's Federation, Central London

     DIMBLEBY
     Welcome to London where we are on the edge of Regent's Park at the Royal College of Obstetricians and Gynaecologists. Our host here is the Medical Women's Federation, which is holding its 90th anniversary conference here. With its origins in the late 19th Century the federation was in 1917 formed with an initial membership of 190 women doctors. Subjects at the top of their agenda then: Medical women engaged in war and the contemporary challenges of venereal diseases, prostitution, maternity and infant welfare. Plus ca change. Except that today more than half the present crop of medical students are women and the federation's main aim is to keep women doctors active in the medical workforce with all that that implies for part-time training and child welfare.
     On our panel: the former Lord Chancellor Charlie Falconer. Lord Falconer there have been scurrilous reports in some of the newspapers to the effect that you're not happy with your pension and that you want it to be doubled, it's £52,000 a year, we can presume that you are quite happy yes?

     FALCONER
     I think I'd rather not talk about that, if you don't mind Jonathan.

     DIMBLEBY
     You're entirely free not to talk about that which suggests that it's unresolved.
     The former Foreign Secretary, Sir Malcolm Rifkind; Chris Huhne who wants to be the next leader of the Liberal Democrats - do you like being the underdog?

     HUHNE
     I'm not sure, I think - I'm working on it, I'm ambitious not to be the underdog Jonathan.

     DIMBLEBY
     And Anne McElvoy, executive editor and columnist at the Evening Standard.
     [CLAPPING]
     Our first question please.

     HICKS
     Tom Hicks. Should Ian Blair resign?

  5. Marking it up in XML…
     • In the Header
       • Programme details
       • Participants and roles
       • Setting
     • In the Text
       • Topic boundaries (new question)
       • Utterance boundaries and their speakers
       • Sentence boundaries (based on punctuation in transcript)
       • Non-verbal events (clapping, laughter, coughs)
       • Tokenisation - ’s
       • POS tagging – maybe some day …
       • Alignment with audio – maybe some day ???
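The text-level steps listed above (utterance boundaries, punctuation-based sentence boundaries, non-verbal events) can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used for the corpus: it assumes that a speaker label is a line written entirely in capitals and that events appear in square brackets, as in the BBC sample, and the function name `markup` and its `first_n` parameter are hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed input convention (not confirmed by the slides): a speaker label is a
# line in capitals on its own, and events are bracketed, e.g. [CLAPPING].
SPEAKER = re.compile(r"^[A-Z][A-Z' -]+$")
EVENT = re.compile(r"\[([A-Z ]+)\]")
# Naive sentence boundary: whitespace that follows ., ? or !
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def markup(lines, first_n=1):
    """Turn speaker-labelled transcript lines into <u>/<s>/<event> markup."""
    out, n, open_u = [], first_n, False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if SPEAKER.match(line):
            # A new speaker label closes the previous utterance, if any.
            if open_u:
                out.append("</u>")
            out.append(f"<u who={quoteattr(line)}>")
            open_u = True
            continue
        # Split around bracketed events; odd-numbered pieces are event names.
        for i, part in enumerate(EVENT.split(line)):
            part = part.strip()
            if not part:
                continue
            if i % 2 == 1:
                out.append(f"<event desc={quoteattr(part.lower())}/>")
                continue
            for sentence in SENTENCE_END.split(part):
                out.append(f'<s n="{n}">{escape(sentence)}</s>')
                n += 1
    if open_u:
        out.append("</u>")
    return out


sample = [
    "DIMBLEBY",
    "And Anne McElvoy, executive editor and columnist at the Evening "
    "Standard. [CLAPPING]Our first question please.",
    "HICKS",
    "Tom Hicks. Should Ian Blair resign?",
]
print("\n".join(markup(sample, first_n=13)))
```

A short all-capitals text line (for example "OK") would be mistaken for a speaker label, and abbreviations ending in a full stop would be split as sentences, so real transcripts would need manual checking.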

  6. Overall document structure
     <TEI>
       <teiHeader>
         <fileDesc>
           <titleStmt>
             <title>Any questions <date>[Date]</date></title>
           </titleStmt>
         </fileDesc>
         <profileDesc>
           [Profile]
         </profileDesc>
       </teiHeader>
       <text>
         [Text]
       </text>
     </TEI>

  7. Profile
     <profileDesc>
       <particDesc>
         <listPerson>
           <person who="name" sex="f | m" role="presenter | questioner | party | profession">
             <para>fullname</para>
           </person>
         </listPerson>
       </particDesc>
       <settingDesc>
         <setting>wherefrom</setting>
       </settingDesc>
     </profileDesc>

  8. Text
     <text>
       <div type="intro">
         <u who="DIMBLEBY">
           <s n="1">Welcome to London …</s>
           …
           <s n="13">And Anne McElvoy, executive editor and columnist at the Evening Standard.</s>
           <event desc="clapping"/>
           <s n="14">Our first question please.</s>
         </u>
       </div>
       <div type="question">
         <u who="HICKS">
           <s n="15">Tom Hicks.</s>
           <s n="16">Should Ian Blair resign?</s>
         </u>
         …
       </div>
       …
     </text>
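Once utterances and sentences are marked up this way, per-speaker counts fall out of the structure. The sketch below reads a fragment shaped like the slide above with Python's standard `xml.etree.ElementTree`; it assumes no XML namespace is declared and that `who` sits directly on each `<u>`, and the function name `count_by_speaker` is illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Fragment modelled on the slide; elided material ("…") is simply left out.
SAMPLE = """<text>
  <div type="intro">
    <u who="DIMBLEBY">
      <s n="13">And Anne McElvoy, executive editor and columnist at the Evening Standard.</s>
      <event desc="clapping"/>
      <s n="14">Our first question please.</s>
    </u>
  </div>
  <div type="question">
    <u who="HICKS">
      <s n="15">Tom Hicks.</s>
      <s n="16">Should Ian Blair resign?</s>
    </u>
  </div>
</text>"""


def count_by_speaker(xml_string):
    """Return (utterances, sentences) Counters keyed by the who attribute."""
    root = ET.fromstring(xml_string)
    utterances, sentences = Counter(), Counter()
    for u in root.iter("u"):
        who = u.get("who", "unknown")
        utterances[who] += 1
        # Only <s> children are counted; <event/> elements are skipped.
        sentences[who] += len(u.findall("s"))
    return utterances, sentences


utterances, sentences = count_by_speaker(SAMPLE)
print(dict(utterances), dict(sentences))
```

Grouping by role or sex instead of by speaker would additionally need a lookup from each `who` value to the matching `<person>` entry in the header.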

  9. Things to do with it (1): emphasis
     Agreement (most frequent adverb collocates 1L)
     • Agree (773)
       • Entirely / actually / rather / completely / absolutely / broadly / strongly / totally / certainly / quite
     • Disagree (110)
       • Fundamentally / profoundly / strongly / completely
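A "1L" collocate is the word immediately to the left of the node word. The sketch below counts such words with a plain tokeniser; it is not the XAIRA query behind the figures above, it cannot restrict results to adverbs because the corpus is not POS-tagged, and the example sentences are invented for illustration.

```python
import re
from collections import Counter


def left_collocates(sentences, node):
    """Count the word immediately to the left (1L) of each occurrence of node."""
    counts = Counter()
    for sentence in sentences:
        # Crude tokenisation: lower-cased runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i in range(1, len(tokens)):
            if tokens[i] == node:
                counts[tokens[i - 1]] += 1
    return counts


# Invented sentences, not taken from the corpus.
examples = [
    "I entirely agree with that.",
    "Well, I entirely agree.",
    "I strongly disagree.",
    "I agree.",
]
print(left_collocates(examples, "agree").most_common())
print(left_collocates(examples, "disagree").most_common())
```

Without part-of-speech information the output also contains pronouns such as "I", so the adverbs would have to be picked out by hand or with a word list.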

  10. Things to do with it (2): subjunctives in speech
      • It were (189)
        • As it were (152)
        • If it were (30)
        • I wish it were (3)
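Counts of this kind can be approximated with simple phrase searches over the sentence text. The following is a rough sketch using regular expressions rather than the indexing tool used for the corpus; the pattern table `PATTERNS` and the sample sentences are made up for illustration.

```python
import re

# Phrase patterns matched against lower-cased text; \b marks word boundaries.
PATTERNS = {
    "it were": r"\bit were\b",
    "as it were": r"\bas it were\b",
    "if it were": r"\bif it were\b",
    "I wish it were": r"\bi wish it were\b",
}


def count_patterns(sentences):
    """Count occurrences of each phrase across a list of sentence strings."""
    counts = dict.fromkeys(PATTERNS, 0)
    for sentence in sentences:
        text = sentence.lower()
        for label, pattern in PATTERNS.items():
            counts[label] += len(re.findall(pattern, text))
    return counts


# Invented sentences, not taken from the corpus.
examples = [
    "That is, as it were, the point.",
    "If it were up to me I would.",
    "I wish it were simpler.",
]
print(count_patterns(examples))
```

The three sub-patterns are subsets of "it were", so their counts need not add up to its total: other contexts for "it were" are possible.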

  11. Things to do with it (3): As it were
      A particularly Any Questions feature? A particularly male one?
      • Any Questions
        • Male speakers: 146 (4.1 / 1000 sentences)
        • Female speakers: 6 (0.7 / 1000 sentences)
      • BNC
        • Male speakers: 291 (0.6 / 1000 sentences)
        • Female speakers: 68 (0.2 / 1000 sentences)

  12. Things to do with it (4): Preferred lexis of patriotism?

      Term             Party    Occurrences   Per 1000 <s>
      UK               Lab       40             6 ‰
      UK               Con       12             2 ‰
      UK               Lib       22             5 ‰
      UK               (Ukip)     1             7 ‰
      United Kingdom   Lab       17             2 ‰
      United Kingdom   Con       22             3 ‰
      United Kingdom   Lib       12             3 ‰
      United Kingdom   (Ukip)     0             -
      Britain          Lab       94            14 ‰
      Britain          Con      129            20 ‰
      Britain          Lib       50            12 ‰
      Britain          (Ukip)     8            58 ‰
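The per-mille figures appear to be occurrences divided by the number of sentences spoken by each party's panellists. The sketch below assumes the denominators are the per-party sentence totals given on the final slide (Lab 6845, Con 6309, Lib 4098); that pairing is an inference, not something the slides state. Ukip is left out because no sentence total is given for it.

```python
# Sentence totals per party, taken from the Utterances / Sentences slide.
# Using them as the denominators here is an assumption.
SENTENCES = {"Lab": 6845, "Con": 6309, "Lib": 4098}

# Occurrence counts as given on the slide above.
OCCURRENCES = {
    "UK": {"Lab": 40, "Con": 12, "Lib": 22},
    "United Kingdom": {"Lab": 17, "Con": 22, "Lib": 12},
    "Britain": {"Lab": 94, "Con": 129, "Lib": 50},
}


def per_thousand(hits, total):
    """Normalised frequency: occurrences per 1000 sentences."""
    return 1000 * hits / total


def rates(occurrences, sentences):
    """Rounded per-mille rate for every term and party."""
    return {
        term: {
            party: round(per_thousand(hits, sentences[party]))
            for party, hits in by_party.items()
        }
        for term, by_party in occurrences.items()
    }


for term, by_party in rates(OCCURRENCES, SENTENCES).items():
    print(term, by_party)
```

Under that assumption the rounded values reproduce the Lab, Con and Lib figures in the table. Normalising by sentences rather than by words keeps the calculation simple but makes the rates sensitive to how long each party's sentences are.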

  13. Thank you!
      • for any answers on how to get permission …

  14. Utterances / Sentences

                     Utterances   Sentences
      Role
        Lab                2141        6845
        Con                1787        6309
        Lib                1096        4098
        Presenter          7936       13318
        Questioner         1180        2241
        Other              3144       11535
      Sex
        Male              14670       35981
        Female             2575        8295
        Unknown              39          70
      Total               17284       44346
