The Games Corpus

Design, implementation and annotation

Agustín Gravano

Spoken Language Processing Group

Columbia University


The Games Corpus

  • Design and Implementation

  • Annotation

"The Games Corpus" - Agustín Gravano - Columbia University


Experiment Design

  • Goal: Study the relation between the down-stepped contour and

    • Information status

    • Syntactic position

    • Discourse position

  • Spontaneous speech

  • Both monologue and dialogue


Experiment Design

  • Three computer games.

  • Two players, each on a different computer.

  • They collaborate to perform a common task.

  • Totally unrestricted speech.


Cards Game #1

Player 1 (Describer)

Player 2 (Searcher)

  • Short monologues

  • Vary frequency and order of occurrence of objects on the cards.


Cards Game #2

Player 1 (Describer)

Player 2 (Searcher)

  • Dialogue

  • Vary frequency and order of occurrence of objects on the cards.


Objects Game

Player 1 (Describer)

Player 2 (Searcher)

  • Dialogue

  • Vary target and surrounding objects (subject and object position).


Games Session

  • Repeat 3 times:

    • Cards Game #1

    • Cards Game #2

  • Short break (optional)

  • Repeat 3 times:

    • Objects Game

  • Each subject participated in 2 sessions.

  • 12 sessions


Subjects

  • Postings:

    • Columbia’s webpage for temporary job ads.

    • Craigslist

      • http://www.craigslist.org

      • Category: Gigs → Event gigs

  • Problem:

    • People are unreliable

    • ~50% did not show up, or cancelled on short notice.


Subjects

  • Possible solutions:

    • Give precise instructions to e-mail ALL required info:

      • Name, native speaker?, hearing impairments?, etc.

    • Ask for a phone number.

    • Call them and explain why it is so important for us that they show up (or cancel with adequate notice).

    • Increase the pay after each session.

      • Example: $5, $10, $15 instead of $10, $10, $10.


Recording

  • Sound-proof booth

    • 2 subjects + 1 or 2 confederates.

    • Head-mounted mics.

    • Digital Audio Tape (DAT): one channel per speaker.

  • Wav files

    • One mono file per speaker.

    • Sample rate: 48000 Hz

    • Downsampled to 16000 Hz (but kept the original files!)

    • ~20 hours of speech → 2.8 GB (at 16 kHz)


Logs

  • Log everything the subjects do to a text file.

  • Example:

    17:03:55:234 BEGIN_EXECUTION

    17:04:04:868 NEXT_TURN

    17:04:31:837 RESULTS 97 points awarded.

    17:04:38:426 NEXT_TURN

    17:05:03:873 RESULTS 92 points awarded.

    ...

  • Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
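
As a rough illustration of that last point, the sketch below splits a log into the stretches between consecutive NEXT_TURN events. The timestamp layout (HH:MM:SS:mmm) is read off the example above. Treating NEXT_TURN as a task boundary is an assumption, and the function names are hypothetical.

```python
from datetime import timedelta


def parse_timestamp(stamp):
    """Parse 'HH:MM:SS:mmm' (the layout in the example log) into a timedelta."""
    hours, minutes, seconds, millis = (int(part) for part in stamp.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis)


def split_into_turns(log_lines):
    """Return (start, end) pairs, one per stretch between consecutive NEXT_TURN events.

    Assumption: NEXT_TURN marks the boundary between two tasks.
    """
    turn_starts = []
    for line in log_lines:
        line = line.strip()
        if not line or line == "...":
            continue  # skip blank lines and the elision marker
        stamp, _, event = line.partition(" ")
        if event.startswith("NEXT_TURN"):
            turn_starts.append(parse_timestamp(stamp))
    # Pair each boundary with the one that follows it.
    return list(zip(turn_starts, turn_starts[1:]))
```

The resulting time pairs could then be used to cut the matching wav and label files into per-task pieces.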


The Games Corpus

  • Design and Implementation

  • Annotation


Speech Processing Tools

  • Praat

    • http://www.praat.org

  • WaveSurfer

    • http://www.speech.kth.se/wavesurfer

  • Transcriber

    • http://trans.sourceforge.net


Orthographic Tier - Method 1

  • Problems

    • Very stressful

    • Time consuming

  • Separate transcription from alignment.


Orthographic Tier - Method 2

  • Transcribe chunks using a web interface.

  • Align each chunk automatically.

  • Concatenate all chunks.

  • Correct the alignment by hand using Praat, WaveSurfer or similar.
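
The concatenation step can be sketched as follows, assuming the chunking step recorded where each chunk starts in the full file and the aligner reports word times relative to the chunk. The data layout and the function name are hypothetical, not the tools actually used.

```python
def concatenate_chunks(chunks):
    """Merge per-chunk alignments into one alignment in whole-file time.

    chunks: list of (chunk_start_seconds, [(start, end, word), ...]),
    where the word times are relative to the start of their chunk.
    """
    merged = []
    for offset, words in sorted(chunks, key=lambda chunk: chunk[0]):
        for start, end, word in words:
            # Shift chunk-relative times by the chunk's position in the file.
            merged.append((round(offset + start, 3), round(offset + end, 3), word))
    return merged
```

Rounding to milliseconds keeps the shifted times tidy. The hand-correction pass is still needed afterwards.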


Orthographic Tier - Method 2

  • Advantages

    • Transcription task is very comfortable.

    • Most of the alignment task is done automatically. Only fine-grained hand corrections are needed.

  • Problems

    • Overhead: chunking, automatic alignment, concatenation.

    • Error-prone! Easy for humans to overlook errors in the automatic alignment.


Orthographic Tier - Method 3

  • Transcribe the whole file, using:

    • a regular audio player (e.g., Windows Media Player), and

    • a regular plain-text editor (e.g., Notepad).

  • Use WaveSurfer to align the words.

    • “Load text labels” function

    • Check out:

      • Spectrogram settings

      • Customizable shortcuts


Orthographic Tier

  • Transcription guidelines

    • capital letters

    • abbreviations

    • disfluencies

    • mmhm, uhhuh, gotcha, etc.

  • Alignment guidelines

    • boundaries

  • http://www.cs.columbia.edu/~agus/games

    • username/password = speech/lions


Too many cooks…

  • Concurrency problem

  • File locking webpage

    • Annotators lock a file before working on it, and release it when done.
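
The slide describes a web page. The logic behind such a page might look like the sketch below, which assumes one lock file per data file in a shared directory. The directory layout and both function names are hypothetical.

```python
import os


def lock_file(lock_dir, data_file, annotator):
    """Try to lock data_file for annotator; return False if it is already locked."""
    lock_path = os.path.join(lock_dir, data_file + ".lock")
    try:
        # O_CREAT | O_EXCL fails if the lock file already exists,
        # so two annotators cannot both acquire the same lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(annotator)  # remember who holds the lock
    return True


def release_file(lock_dir, data_file, annotator):
    """Release the lock, but only if annotator is the one holding it."""
    lock_path = os.path.join(lock_dir, data_file + ".lock")
    try:
        with open(lock_path) as handle:
            owner = handle.read()
    except FileNotFoundError:
        return False
    if owner != annotator:
        return False
    os.remove(lock_path)
    return True
```

Storing the owner's name in the lock file lets the page show who is working on what, and prevents one annotator from releasing another's lock by accident.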


Annotation: Cue Words

  • okay, mmhm, uhhuh, right, etc.

  • Acknowledgment, Backchannel, Segment Beginning, Segment End, etc.

  • Developed an ad-hoc application in Java.

    • Bad idea!!! Development took too long.

  • Instead, use Praat (or other general-purpose tool).

    • For simple, specific tasks, Praat is not difficult to learn.

    • Create a file with empty point labels at the midpoint of each word that needs to be labeled.

    • Annotators only label those words, safely ignoring the rest.
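
A minimal sketch of the midpoint step, assuming the orthographic tier is available as (start, end, word) triples. The cue-word set holds only the four words named on the slide, and the function name is hypothetical. Writing the points out in Praat's own file format is left out here.

```python
# Only the cue words named on the slide; the real list ("etc.") is longer.
CUE_WORDS = {"okay", "mmhm", "uhhuh", "right"}


def cue_word_points(word_intervals, cue_words=CUE_WORDS):
    """Return (time, label) points with empty labels, one per cue word.

    Each point sits at the midpoint of the word's interval, ready for an
    annotator to fill in the label.
    """
    points = []
    for start, end, word in word_intervals:
        if word.lower() in cue_words:
            points.append(((start + end) / 2, ""))
    return points
```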


Other Annotations

  • Turn switches

    • Smooth switches, interruptions, backchannels, etc.

    • The labeler received a Praat file with empty turns.

  • Prosody

    • ToBI Labeling Conventions: Tones and Break Indices.

  • Questions

    • Identification, form and function.


Guidelines for Guidelines

  • Web based (password protected)

  • Highlight recent changes

  • Avoid long lists: categorize, use trees.


Files

  • games/data/session_NN/sNN.GAME.P.Y.ext

    • NN = 01..12

    • GAME = {cards, objects}

    • P = 0..3 if GAME=cards, 0..1 if GAME=objects

    • Y = {A, B}

    • ext = {wav, words, tones, breaks, misc, turns, …}
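
The naming scheme above can be checked mechanically. The sketch below is one way to do it. The field names p and y simply mirror the slide, which does not say what P and Y stand for, and the function and class names are hypothetical.

```python
import re
from typing import NamedTuple


class CorpusFile(NamedTuple):
    session: int  # NN
    game: str     # GAME
    p: int        # P (meaning not stated on the slide)
    y: str        # Y (meaning not stated on the slide)
    ext: str


NAME_PATTERN = re.compile(
    r"s(?P<session>\d{2})\.(?P<game>cards|objects)\.(?P<p>\d)\.(?P<y>[AB])\.(?P<ext>\w+)"
)


def parse_corpus_filename(filename):
    """Split a name like 's08.cards.3.B.wav' into its fields, checking the ranges."""
    match = NAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a corpus file name: {filename!r}")
    session = int(match["session"])
    p = int(match["p"])
    max_p = 3 if match["game"] == "cards" else 1  # 0..3 for cards, 0..1 for objects
    if not 1 <= session <= 12 or p > max_p:
        raise ValueError(f"field out of range in {filename!r}")
    return CorpusFile(session, match["game"], p, match["y"], match["ext"])
```

The extension is accepted as any word, since the list of extensions on the slide is open-ended.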


Files

  • Examples:

    games/data/session_08/s08.cards.3.B.wav

    s08.cards.3.B.words

    s08.cards.3.B.misc

    s08.objects.1.A.wav

    s08.objects.1.A.words

    s08.objects.1.A.misc

    games/data/session_11/…


Files Format

  • All files (except *.wav) are saved as plain text, with the WaveSurfer format:

    • Start End Value (for interval tiers)

    • Time Value (for point tiers)

  • Advantages

    • Human-readable.

    • Very easy to process.

  • Problems

    • Consistency

    • Rounding
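
To illustrate how easy the format is to process, and one possible reading of the consistency and rounding problems, the sketch below parses an interval tier and flags intervals that are empty or overlap the previous one. The tolerance value and the function names are assumptions.

```python
def parse_interval_tier(text):
    """Parse 'Start End Value' lines into (start, end, value) triples."""
    intervals = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(None, 2)  # the value may itself contain spaces
        start, end = float(fields[0]), float(fields[1])
        value = fields[2].strip() if len(fields) > 2 else ""
        intervals.append((start, end, value))
    return intervals


def find_inconsistencies(intervals, tolerance=1e-3):
    """Return (index, problem) pairs for suspicious intervals.

    The tolerance absorbs small rounding differences between the end of one
    interval and the start of the next.
    """
    problems = []
    previous_end = None
    for index, (start, end, _value) in enumerate(intervals):
        if end <= start:
            problems.append((index, "end not after start"))
        if previous_end is not None and start < previous_end - tolerance:
            problems.append((index, "overlaps previous interval"))
        previous_end = end
    return problems
```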


Files Format

  • Other formats:

    • XML

      • General-purpose mark-up language.

      • <TAG attribute="value"> … </TAG>

      • Solves problems like consistency and rounding.

      • Not human-readable, harder to process.

    • Praat

      • Not human-readable, hard to process.

      • Also has the consistency problem.


Scripts

  • So far, we have needed dozens of Perl scripts.

  • Examples:

    • Convert between Praat and WaveSurfer formats.

    • Create a Praat file with empty CW labels, turns, etc.

    • Find typos, missing labels, and other errors.

    • Unify notation (e.g., “mm-hmm” → “mmhm”).

    • Check consistency of files.
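
The original scripts were written in Perl. As an illustration only, the notation-unifying example could be sketched in Python as below. The "mm-hmm" mapping comes from the slide, the "uh-huh" entry is a made-up extra example, and the function name is hypothetical.

```python
import re

# "mm-hmm" -> "mmhm" is from the slide; the "uh-huh" entry is a made-up example.
VARIANTS = {"mm-hmm": "mmhm", "uh-huh": "uhhuh"}


def unify_notation(text, variants=VARIANTS):
    """Replace each known spelling variant with its canonical form."""
    alternatives = "|".join(
        re.escape(variant) for variant in sorted(variants, key=len, reverse=True)
    )
    # The lookarounds keep the match from starting or ending inside a longer token.
    pattern = re.compile(rf"(?<![\w-])({alternatives})(?![\w-])", re.IGNORECASE)
    return pattern.sub(lambda match: variants[match.group(1).lower()], text)
```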


Back-up!

  • Back up wav files only once (they are too large), but in several places (DVD, 3+ computers).

  • Back up everything else (plain text, small) periodically and automatically.

    • Configure “cron” to make a backup copy every 8 hours.
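
A sketch of what such a periodic job might run, assuming the annotation files live under one data directory and backups go to a separate directory outside it. The script path in the crontab comment, the directory names and the function name are all hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical crontab entry that runs this script every 8 hours:
#   0 */8 * * * python3 /path/to/backup_annotations.py


def backup_text_files(data_root, backup_root, now=None):
    """Copy every non-wav file under data_root into a timestamped backup folder.

    backup_root should lie outside data_root, or old backups get copied again.
    """
    data_root = Path(data_root)
    now = now or datetime.now()
    target = Path(backup_root) / now.strftime("%Y%m%d-%H%M%S")
    copied = 0
    for source in data_root.rglob("*"):
        if source.is_file() and source.suffix.lower() != ".wav":
            destination = target / source.relative_to(data_root)
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, destination)  # copy2 also keeps modification times
            copied += 1
    return target, copied
```

Skipping the wav files keeps each snapshot small, in line with backing up the audio only once.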


Timeline

  • Orthographic tier first!

[Timeline figure: phases along a time axis: design+implem., orthographic tier, prosody (ToBI), cue words, turn switches]