
The Games Corpus

Design, implementation and annotation

Agustín Gravano

Spoken Language Processing Group

Columbia University


The Games Corpus

  • Design and Implementation

  • Annotation

"The Games Corpus" - Agustín Gravano - Columbia University



Experiment Design

  • Goal: Study the relation between the down-stepped contour and

    • Information status

    • Syntactic position

    • Discourse position

  • Spontaneous speech

  • Both monologue and dialogue

Experiment Design

  • Three computer games.

  • Two players, each on a different computer.

  • They collaborate to perform a common task.

  • Totally unrestricted speech.

Cards Game #1

Player 1 (Describer)

Player 2 (Searcher)

  • Short monologues

  • Vary frequency and order of occurrence of objects on the cards.

Cards Game #2

Player 1 (Describer)

Player 2 (Searcher)

  • Dialogue

  • Vary frequency and order of occurrence of objects on the cards.

Objects Game

Player 1 (Describer)

Player 2 (Searcher)

  • Dialogue

  • Vary target and surrounding objects (subject and object position).

Games Session

  • Repeat 3 times:

    • Cards Game #1

    • Cards Game #2

  • Short break (optional)

  • Repeat 3 times:

    • Objects Game

  • Each subject participated in 2 sessions.

  • 12 sessions

Subjects

  • Postings:

    • Columbia’s webpage for temporary job ads.

    • Craigslist

      • http://www.craigslist.org

      • Category: Gigs → Event gigs

  • Problem:

    • People are unreliable

    • ~50% did not show up, or cancelled with short notice.

Subjects

  • Possible solutions:

    • Give precise instructions to e-mail ALL required info:

      • Name, native speaker?, hearing impairments?, etc.

    • Ask for a phone number.

    • Call them and explain why it is so important for us that they show up (or cancel with adequate notice).

    • Increase the pay after each session.

      • Example: $5, $10, $15 instead of $10, $10, $10.

Recording

  • Sound-proof booth

    • 2 subjects + 1 or 2 confederates.

    • Head-mounted mics.

    • Digital Audio Tape (DAT): one channel per speaker.

  • Wav files

    • One mono file per speaker.

    • Sample rate: 48000 Hz

    • Downsampled to 16000 Hz (but kept original files!)

    • ~20 hours of speech → 2.8 GB (16 kHz)

Logs

  • Log everything the subjects do to a text file.

  • Example:

    17:03:55:234 BEGIN_EXECUTION

    17:04:04:868 NEXT_TURN

    17:04:31:837 RESULTS 97 points awarded.

    17:04:38:426 NEXT_TURN

    17:05:03:873 RESULTS 92 points awarded.

    ...

  • Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
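
A minimal sketch of how such a log could be divided into turns, written in Python for illustration. The line layout (a timestamp HH:MM:SS:mmm, an upper-case event name, then an optional message) is inferred from the example above, and the function names are hypothetical.

```python
import re

# Assumed layout: HH:MM:SS:mmm, an upper-case event name, then an optional message.
# \s* lets the pattern accept lines with or without a separator between fields.
LOG_LINE = re.compile(r"^\s*(\d{2}):(\d{2}):(\d{2}):(\d{3})\s*([A-Z_]+)\s*(.*)$")

def parse_log_line(line):
    """Return (seconds_since_midnight, event, message), or None if the line does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    h, mi, s, ms = (int(g) for g in m.groups()[:4])
    return h * 3600 + mi * 60 + s + ms / 1000.0, m.group(5), m.group(6).strip()

def split_into_turns(lines):
    """Group events into turns; a new turn starts at every NEXT_TURN event."""
    turns, current = [], None
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue  # skip lines that do not look like log entries
        if parsed[1] == "NEXT_TURN":
            current = [parsed]
            turns.append(current)
        elif current is not None:
            current.append(parsed)
    return turns
```

The start time of each turn can then be used to cut the matching stretch out of the session recording.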


The Games Corpus

  • Design and Implementation

  • Annotation

Speech Processing Tools

  • Praat

    • http://www.praat.org

  • WaveSurfer

    • http://www.speech.kth.se/wavesurfer

  • Transcriber

    • http://trans.sourceforge.net

Orthographic Tier - Method 1

  • Problems

    • Very stressful

    • Time consuming

  • Separate transcription from alignment.

Orthographic Tier - Method 2

  • Transcribe chunks using a web interface.

  • Align each chunk automatically.

  • Concatenate all chunks.

  • Correct the alignment by hand using Praat, Wavesurfer or similar.
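
The concatenation step can be sketched as follows, in Python for illustration. It assumes that each chunk's alignment is timed relative to the start of that chunk and that the chunk's offset within the full recording is known. The slides specify neither, and the function name is hypothetical.

```python
def concatenate_chunks(chunks):
    """Merge per-chunk word alignments into one list of (start, end, word) tuples.

    `chunks` is a list of (chunk_start, words) pairs. chunk_start is the chunk's
    offset in seconds within the full recording; `words` holds (start, end, word)
    tuples timed relative to the start of the chunk.
    """
    merged = []
    # Process chunks in temporal order so the output is sorted by time.
    for chunk_start, words in sorted(chunks, key=lambda c: c[0]):
        for start, end, word in words:
            # Shift chunk-relative times to file-level times (rounded to milliseconds).
            merged.append((round(chunk_start + start, 3), round(chunk_start + end, 3), word))
    return merged
```

The merged list is what an annotator would then load into Praat or WaveSurfer for hand correction.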


Orthographic Tier - Method 2

  • Advantages

    • Transcription task is very comfortable.

    • Most of the alignment task is done automatically. Only fine-grained hand corrections are needed.

  • Problems

    • Overhead: chunking, automatic alignment, concat.

    • Error prone! Easy for humans to overlook errors in the automatic alignment.

Orthographic Tier - Method 3

  • Transcribe the whole file, using:

    • a regular audio player (e.g., Windows Media Player), and

    • a regular plain-text editor (e.g., Notepad).

  • Use Wavesurfer to align the words.

    • “Load text labels” function

    • Check out:

      • Spectrogram settings

      • Customizable shortcuts

Orthographic Tier

  • Transcription guidelines

    • capital letters

    • abbreviations

    • disfluencies

    • mmhm, uhhuh, gotcha, etc.

  • Alignment guidelines

    • boundaries

  • http://www.cs.columbia.edu/~agus/games

    • username/password = speech/lions

Too many cooks…

  • Concurrency problem

  • File locking webpage

    • Annotators lock a file before working on it, and release it when done.
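
The slides do not say how the locking webpage was implemented. As a sketch of the lock/release logic only, in Python for illustration, one common approach is a lock file created atomically next to the annotation file. The ".lock" suffix and the function names are assumptions.

```python
import os

def acquire_lock(path, annotator):
    """Try to lock `path` for `annotator`; return True on success, False if already locked."""
    lock_path = path + ".lock"
    try:
        # O_CREAT | O_EXCL makes creation fail if the lock file already exists,
        # so two annotators cannot both obtain the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(annotator + "\n")  # record who holds the lock
    return True

def release_lock(path, annotator):
    """Release the lock on `path` if `annotator` holds it; return True if released."""
    lock_path = path + ".lock"
    try:
        with open(lock_path) as f:
            owner = f.read().strip()
    except FileNotFoundError:
        return False
    if owner != annotator:
        return False  # only the lock holder may release it
    os.remove(lock_path)
    return True
```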


Annotation: Cue Words

  • okay, mmhm, uhhuh, right, etc.

  • Acknowledgment, Backchannel, Segment Beginning, Segment End, etc.

  • Developed an ad-hoc application in Java.

    • Bad idea!!! Development took too long.

  • Instead, use Praat (or other general-purpose tool).

    • For simple, specific tasks, Praat is not difficult to learn.

    • Create a file with empty points at the middle point of the words that need to be labeled.

    • Annotators only label those words, safely ignoring the rest.
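
A sketch of the "empty points" step, in Python for illustration. It takes word intervals as (start, end, word) tuples and returns one unlabeled point at the midpoint of each cue word. The cue-word set holds only the examples named on the slide, and writing the points out as a Praat file is left aside. The function name is hypothetical.

```python
# Cue words named on the slide; the complete list is not given there.
CUE_WORDS = {"okay", "mmhm", "uhhuh", "right"}

def empty_cue_word_points(word_intervals, cue_words=CUE_WORDS):
    """Return (time, label) points with empty labels at the midpoint of each cue word."""
    points = []
    for start, end, word in word_intervals:
        if word.lower() in cue_words:
            # The annotator later fills in the empty label for this point.
            points.append((round((start + end) / 2.0, 3), ""))
    return points
```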


Other Annotations

  • Turn switches

    • Smooth switches, interruptions, backchannels, etc.

    • The labeler received a Praat file with empty turns.

  • Prosody

    • ToBI Labeling Conventions: Tones and Break Indices.

  • Questions

    • Identification, form and function.

Guidelines for Guidelines

  • Web based (password protected)

  • Highlight recent changes

  • Avoid long lists: group items into categories or trees.

Files

  • games/data/session_NN/sNN.GAME.P.Y.ext

    • NN= 01..12

    • GAME = {cards, objects}

    • P = 0..3 if GAME=cards, 0..1 if GAME=objects

    • Y = {A, B}

    • ext = {wav, words, tones, breaks, misc, turns, …}
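
The naming scheme can be checked mechanically. A minimal sketch in Python, for illustration; the function name and the dictionary keys are hypothetical, and the ranges are those listed above.

```python
import re

# Pattern for sNN.GAME.P.Y.ext as described above.
FILE_NAME = re.compile(r"^s(\d{2})\.(cards|objects)\.(\d)\.([AB])\.(\w+)$")

def parse_file_name(name):
    """Split a corpus file name into its fields; return None if it breaks the scheme."""
    m = FILE_NAME.match(name)
    if m is None:
        return None
    session, game, p, y, ext = m.groups()
    session, p = int(session), int(p)
    max_p = 3 if game == "cards" else 1  # P = 0..3 for cards, 0..1 for objects
    if not (1 <= session <= 12 and 0 <= p <= max_p):
        return None
    return {"session": session, "game": game, "p": p, "y": y, "ext": ext}
```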


Files

  • Examples:

    games/data/session_08/s08.cards.3.B.wav

    s08.cards.3.B.words

    s08.cards.3.B.misc

    s08.objects.1.A.wav

    s08.objects.1.A.words

    s08.objects.1.A.misc

    games/data/session_11/…

Files Format

  • All files (except *.wav) are saved as plain text, in the WaveSurfer format:

    • Start End Value (for interval tiers)

    • Time Value (for point tiers)

  • Advantages

    • Human-readable.

    • Very easy to process.

  • Problems

    • Consistency

    • Rounding
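
A sketch of reading an interval tier and flagging possible problems, in Python for illustration. The slide does not spell out what "Consistency" and "Rounding" mean. This sketch assumes one plausible reading: adjacent intervals should share a boundary, and the same boundary may have been written with different rounding. The function names and the tolerance are assumptions.

```python
def parse_interval_tier(text):
    """Parse 'Start End Value' lines into (start, end, value) tuples."""
    intervals = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split(None, 2)  # the value itself may contain spaces
        start, end = float(parts[0]), float(parts[1])
        value = parts[2].strip() if len(parts) > 2 else ""
        intervals.append((start, end, value))
    return intervals

def find_inconsistencies(intervals, tolerance=1e-6):
    """Report intervals of non-positive length and gaps or overlaps between neighbors."""
    problems = []
    for i, (start, end, _value) in enumerate(intervals):
        if end <= start:
            problems.append((i, "non-positive duration"))
        if i > 0 and abs(start - intervals[i - 1][1]) > tolerance:
            problems.append((i, "does not start where the previous interval ends"))
    return problems
```

Raising `tolerance` to a few milliseconds would accept boundaries that differ only by rounding.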


Files Format

  • Other formats:

    • XML

      • General-purpose mark-up language.

      • <TAG attribute="value"> … </TAG>

      • Solves problems like consistency and rounding.

      • Not human-readable, harder to process.

    • Praat

      • Not human-readable, hard to process.

      • Also has the consistency problem.

Scripts

  • So far, we have needed dozens of Perl scripts.

  • Examples:

    • Convert between Praat and WaveSurfer formats.

    • Create a Praat file with empty CW labels, turns, etc.

    • Find typos, missing labels, and other errors.

    • Unify notation (e.g., “mm-hmm” → “mmhm”).

    • Check consistency of files.
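
The original scripts were written in Perl and are not shown in the slides. As an illustration of the "unify notation" task, here is a minimal sketch in Python. Only the mapping "mm-hmm" to "mmhm" comes from the slide; the second entry and the function name are hypothetical.

```python
import re

# Variant table: only "mm-hmm" -> "mmhm" is taken from the slide,
# the "uh-huh" entry is a hypothetical addition.
CANONICAL = {"mm-hmm": "mmhm", "uh-huh": "uhhuh"}

def unify_notation(transcript, table=CANONICAL):
    """Replace known spelling variants with their canonical form, whole tokens only."""
    def replace(match):
        token = match.group(0)
        return table.get(token.lower(), token)
    # A token is a run of letters, apostrophes and hyphens.
    return re.sub(r"[A-Za-z'-]+", replace, transcript)
```

Matching whole tokens keeps longer words that merely contain a variant from being rewritten.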


Back-up!

  • Back up the wav files only once (they are too large), but in several places (DVD, 3+ computers).

  • Back up everything else (plain text, so small) periodically and automatically.

    • Configure “cron” to make a backup copy every 8 hours.
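
A sketch of such a periodic backup, in Python for illustration. It archives every file except the wav files into a timestamped tarball. The function name, archive name and script path are hypothetical, and the backup directory is assumed to lie outside the data directory.

```python
import os
import tarfile
import time

def backup_annotations(data_dir, backup_dir):
    """Archive every non-wav file under data_dir into a timestamped .tar.gz; return its path."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = os.path.join(backup_dir, "annotations-" + stamp + ".tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        for root, _dirs, files in os.walk(data_dir):
            for name in files:
                if name.lower().endswith(".wav"):
                    continue  # audio is backed up separately, only once
                full_path = os.path.join(root, name)
                archive.add(full_path, arcname=os.path.relpath(full_path, data_dir))
    return archive_path

# A crontab entry along these lines would run a script that calls this function
# every 8 hours (the script path is hypothetical):
#   0 */8 * * * python3 /path/to/backup_annotations.py
```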


Timeline

  • Orthographic tier first!

[Figure: timeline along a time axis with the stages design + implementation, orthographic tier, prosody (ToBI), cue words, turn switches]