Speech corpus
A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition or speaker identification engine).[1] In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.[2][3]
A corpus is a single such database; corpora is the plural form and refers to several such databases.
There are two types of speech corpora:
- Read Speech – which includes:
  - Book excerpts
  - Broadcast news
  - Lists of words
  - Sequences of numbers
- Spontaneous Speech – which includes:
  - Dialogs – between two or more people (includes meetings; one such corpus is the KEC);
  - Narratives – a person telling a story (one such corpus is the Buckeye Corpus);
  - Map-tasks – one person explains a route on a map to another;
  - Appointment-tasks – two people try to find a common meeting time based on individual schedules.
Non-native speech databases, which contain speech with a foreign accent, are a special kind of speech corpus.
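As an illustration of the definition above, the pairing of audio files with text transcriptions can be sketched as a small data structure. This is a minimal sketch under stated assumptions, not the layout of any particular corpus: the tab-separated manifest format, the field names (`audio_path`, `speaker_id`, `style`), the file paths and the example sentences are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file paired with its transcription."""
    audio_path: str   # hypothetical relative path to the recording
    transcript: str   # text transcription of the audio
    speaker_id: str   # hypothetical speaker label
    style: str        # "read" or "spontaneous"


def parse_manifest(lines):
    """Parse tab-separated manifest lines into Utterance records.

    Assumed (hypothetical) column order:
    audio_path, speaker_id, style, transcript.
    """
    utterances = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        audio_path, speaker_id, style, transcript = line.split("\t", 3)
        utterances.append(Utterance(audio_path, transcript, speaker_id, style))
    return utterances


def count_by_style(utterances):
    """Count how many utterances belong to each speech type."""
    return Counter(u.style for u in utterances)


# Invented example data; no real corpus is quoted here.
MANIFEST = [
    "# audio_path\tspeaker_id\tstyle\ttranscript",
    "wav/s01_0001.wav\ts01\tread\tone two three four",
    "wav/s02_0001.wav\ts02\tspontaneous\tso then you turn left at the bridge",
]

corpus = parse_manifest(MANIFEST)
style_counts = count_by_style(corpus)
```

Real corpora typically carry more metadata (for example time alignments or recording conditions), and the `style` field here only mirrors the read/spontaneous distinction drawn above.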
See also
- Arabic Speech Corpus
- Common Voice
- EXMARaLDA
- Lingua Libre, an online libre tool
- List of children's speech corpora
- Non-native speech database
- Praat
- Spoken English Corpus
- The BABEL Speech Corpus
- TIMIT
- Transcriber
- Transcription (linguistics)
References
- [1] Sarangi, Susanta; Sahidullah, Md; Saha, Goutam (September 2020). "Optimization of data-driven filterbank for automatic speaker verification". Digital Signal Processing. 104: 102795. arXiv:2007.10729. Bibcode:2020DSP...10402795S. doi:10.1016/j.dsp.2020.102795. S2CID 220665533.
- [2] Reece, Andrew; Cooney, Gus; Bull, Peter; Chung, Christine; Dawson, Bryn; Fitzpatrick, Casey; Glazer, Tamara; Knox, Dean; Liebscher, Alex; Marin, Sebastian (2022-03-01). "Advancing an Interdisciplinary Science of Conversation: Insights from a Large Multimodal Corpus of Human Speech". arXiv:2203.00674 [cs.CL].
- [3] "Santa Barbara Corpus of Spoken American English | Department of Linguistics - UC Santa Barbara". www.linguistics.ucsb.edu. Retrieved 2023-04-26.
- Edwards, Jane / Lampert, Martin (eds.) (1992): Talking Data – Transcription and Coding in Discourse Research. Hillsdale: Erlbaum.
- Leech, Geoffrey / Myers, Greg / Thomas, Jenny (eds.) (1995): Spoken English on Computer: Transcription, Markup and Application. Harlow: Longman.
External links
- Santa Barbara Corpus of Spoken American English
- Buckeye Corpus: The Buckeye Corpus of Conversational Speech
- The KEC -- The Karl Eberhards Corpus of spontaneously spoken southern German in dialogues - audio and articulatory recordings
- Spoken Language Corpora at the Research Center on Multilingualism
- The Spoken Turkish Corpus at METU Ankara
- Spoken Corpus Klient with the Corp-Oral Corpus at ILTEC Lisbon
- VoxForge – open source speech corpora
- OLAC: Open Language Archives Community
- BAS Bavarian Archive for Speech Signals
- Simmortel Speech Recognition Corpus for Indian English and Hindi
- ELRA: the European Language Resources Association
- The PELCRA Conversational Corpus of Polish
- The Arabic Speech Corpus
- Corpus of Political Speeches: free access to political speeches by American and Chinese politicians, developed by Hong Kong Baptist University Library
- Large Multimodal Corpus of Human Speech