Speech corpus

A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition or speaker identification engine).^[1] In linguistics, spoken corpora are used to do research into phonetic, conversation analysis, dialectology and other fields.^[2]^[3]

A corpus is one such database. Corpora is the plural of corpus (i.e. it is many such databases).

There are two types of speech corpora:

Read Speech – which includes:
- Book excerpts
- Broadcast news
- Lists of words
- Sequences of numbers
Spontaneous Speech – which includes:
- Dialogs – between two or more people (includes meetings; one such corpus is the KEC);
- Narratives – a person telling a story (one such corpus is the Buckeye Corpus);
- Map-tasks – one person explains a route on a map to another;
- Appointment-tasks – two people try to find a common meeting time based on individual schedules.

A special kind of speech corpora are non-native speech databases that contain speech with a foreign accent.

References

^ Sarangi, Susanta; Sahidullah, Md; Saha, Goutam (September 2020). "Optimization of data-driven filterbank for automatic speaker verification". Digital Signal Processing. 104: 102795. arXiv:2007.10729. Bibcode:2020DSP...10402795S. doi:10.1016/j.dsp.2020.102795. S2CID 220665533.
^ Reece, Andrew; Cooney, Gus; Bull, Peter; Chung, Christine; Dawson, Bryn; Fitzpatrick, Casey; Glazer, Tamara; Knox, Dean; Liebscher, Alex; Marin, Sebastian (2022-03-01). "Advancing an Interdisciplinary Science of Conversation: Insights from a Large Multimodal Corpus of Human Speech". arXiv:2203.00674 [cs.CL].
^ "Santa Barbara Corpus of Spoken American English | Department of Linguistics - UC Santa Barbara". www.linguistics.ucsb.edu. Retrieved 2023-04-26.

Edwards, Jane / Lampert, Martin (eds.) (1992): Talking Data – Transcription and Coding in Discourse Research. Hillsdale: Erlbaum.
Leech, Geoffrey / Myers, Greg / Thomas, Jenny (eds.) (1995): Spoken English on Computer: Transcription, Markup and Application. Harlow: Longman.

External links

Santa Barbara Corpus of Spoken American English
Buckeye Corpus The Buckeye Corpus of Conversational Speech
The KEC -- The Karl Eberhards Corpus of spontaneously spoken southern German in dialogues - audio and articulatory recordings
Spoken Language Corpora at the Research Center on Multilingualism
The Spoken Turkish Corpus at METU Ankara
Spoken Corpus Klient with the Corp-Oral Corpus at ILTEC Lisbon
VoxForge – open source speech corpora
OLAC: Open Language Archives Community
BAS Bavarian Archive for Speech Signals
Simmortel Speech Recognition Corpus for Indian English and Hindi
ELRA: the European Language Resources Association
The PELCRA Conversational Corpus of Polish
The Arabic Speech Corpus
Corpus of Political Speeches : Free access to political speeches by American and Chinese politicians, developed by Hong Kong Baptist University Library
Large Multimodal Corpus of Human Speech

Natural language processing

General terms

Text analysis

Argument mining
Collocation extraction
Concept mining
Coreference resolution
Deep linguistic processing
Distant reading
Information extraction
Named-entity recognition
Ontology learning
Parsing
- Semantic parsing
- Syntactic parsing
Part-of-speech tagging
Semantic analysis
Semantic role labeling
Semantic decomposition
Semantic similarity
Sentiment analysis

Text segmentation	Compound-term processing Lemmatisation Lexical analysis Text chunking Stemming Sentence segmentation Word segmentation

Automatic summarization

Machine translation

Distributional semantics models

Language resources,
datasets and corpora

Types and standards	Corpus linguistics Lexical resource Linguistic Linked Open Data Machine-readable dictionary Parallel text PropBank Semantic network Simple Knowledge Organization System Speech corpus Text corpus Thesaurus (information retrieval) Treebank Universal Dependencies
Data	BabelNet Bank of English DBpedia FrameNet Google Ngram Viewer UBY WordNet