Automatic Mapping of Disfluency Annotations for corrected version of Switchboard

vickyzayats/switchboard_corrected_reannotated

Switchboard reannotated dataset

We provide a new version of the Switchboard corpus with disfluency annotations for the careful speech transcripts.

The columns in the data correspond to:
sentence - list of words for each sentence in the Penn Treebank transcript
ms_sentence - list of words for each sentence in the Ms-State transcript
comb_sentence - combination of the two versions of the sentence
names - word ids for sentence
ms_names - word ids for ms_sentence
comb_ann - tags for comb_sentence that indicate which words have to be inserted, deleted, or substituted to get from the Ms-State version to the Treebank version
tags - BIO tags for sentence (Penn Treebank annotation)
ms_disfl - BIO tags for ms_sentence (silver annotation)
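
The column descriptions above suggest that each tag or id column runs parallel to a word column. A minimal sketch of a sanity check for that, assuming each row has already been loaded as a dict of lists (the loading step, the `misaligned_columns` helper, and the toy word ids are hypothetical and not part of this repository):

```python
from typing import Dict, List, Tuple

# Column pairs that are assumed to be token-aligned, inferred from the
# column descriptions in this README (an assumption, not a documented guarantee).
ALIGNED_COLUMNS: List[Tuple[str, str]] = [
    ("sentence", "names"),
    ("sentence", "tags"),
    ("ms_sentence", "ms_names"),
    ("ms_sentence", "ms_disfl"),
    ("comb_sentence", "comb_ann"),
]


def misaligned_columns(row: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return the column pairs whose lists differ in length.

    Pairs with a column missing from the row are skipped.
    """
    return [
        (words, labels)
        for words, labels in ALIGNED_COLUMNS
        if words in row and labels in row and len(row[words]) != len(row[labels])
    ]


# Toy row with made-up word ids, for illustration only.
row = {
    "sentence": ["i", "i", "think", "so"],
    "names": ["id1", "id2", "id3", "id4"],
    "tags": ["BE_IP", "C", "O", "O"],
    "ms_sentence": ["i", "i", "think", "so"],
    "ms_names": ["id1", "id2", "id3", "id4"],
    "ms_disfl": ["BE_IP", "C", "O", "O"],
}

# Same row with one tag dropped, to show what a failure looks like.
broken_row = dict(row, tags=row["tags"][:-1])

print(misaligned_columns(row))
print(misaligned_columns(broken_row))
```

An empty result means every checked pair has matching lengths; any pair that is returned points at a row worth inspecting before training on it.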

The BIO tags are as follows:
BE - beginning of the reparandum
IE - inside the reparandum
IP - the last word before the interruption point
BE_IP - single-token reparandum
C - repair (correction)
O - non-disfluency
C_IE - the word is in both the reparandum and the repair, but is not the last word before the interruption point (in nested disfluencies)
C_IP - the word is in both the reparandum and the repair, and is the last word before the interruption point (in nested disfluencies)
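
As an illustration of how the tag set above can be used, here is a minimal sketch that drops every word belonging to a reparandum and locates the interruption points. The helper names `remove_reparanda` and `interruption_points` and the toy sentence are assumptions made for this example; they are not part of the released data or code.

```python
from typing import List

# Tags that mark a word as part of a reparandum, per the tag list above.
# C_IE and C_IP are included because those words sit inside an outer reparandum.
REPARANDUM_TAGS = {"BE", "IE", "IP", "BE_IP", "C_IE", "C_IP"}

# Tags that mark the last word before an interruption point.
INTERRUPTION_TAGS = {"IP", "BE_IP", "C_IP"}


def remove_reparanda(words: List[str], tags: List[str]) -> List[str]:
    """Keep only the words that are not inside any reparandum."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return [w for w, t in zip(words, tags) if t not in REPARANDUM_TAGS]


def interruption_points(tags: List[str]) -> List[int]:
    """Return the indices of the last word before each interruption point."""
    return [i for i, t in enumerate(tags) if t in INTERRUPTION_TAGS]


# Toy example (made up): "to boston" is the reparandum, "to denver" the repair.
words = ["i", "want", "to", "boston", "to", "denver"]
tags = ["O", "O", "BE", "IP", "C", "C"]

print(remove_reparanda(words, tags))
print(interruption_points(tags))
```

Repair words (C) and non-disfluent words (O) are kept, so the output is the cleaned-up reading of the utterance under this tag scheme.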

Paper

You can find more details in our paper: https://arxiv.org/pdf/1904.04398.pdf.

@article{zayats2019disfluencies,
  title={Disfluencies and Human Speech Transcription Errors},
  author={Zayats, Vicky and Tran, Trang and Wright, Richard and Mansfield, Courtney and Ostendorf, Mari},
  journal={Interspeech},
  year={2019}
}

License

This dataset is an extension of the Switchboard corpus and is distributed under the LDC license.
