Audio-speech data

Datasets of audio recordings, labeled speech, and audio clips covering different accents and languages.

Common Voice

Common Voice is a multilingual dataset of voice recordings, contributed by volunteers from around the world. It aims to provide a wide variety of speech data for developing and training speech recognition systems. The dataset includes diverse accents, dialects, and languages, making it a valuable resource for researchers and developers working on voice technology and natural language processing.

Google Speech Commands

Google Speech Commands is a dataset of thousands of labeled audio recordings, each about one second long, of short spoken command words such as "yes", "no", "up", and "down". It is designed for training machine learning models on keyword-spotting and other limited-vocabulary speech recognition tasks. The commands are spoken by many different speakers, enabling the development of voice-activated applications and systems. It is widely used in research and development as a benchmark for small, efficient speech recognition models.
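
As a rough illustration of how labeled recordings like these are usually turned into training examples, the sketch below derives a label and a speaker identifier from a file path. It assumes the layout of the public release, where each command word has its own folder and files are named `<speaker-id>_nohash_<index>.wav`. The example path and the function name `parse_command_path` are hypothetical.

```python
from pathlib import PurePosixPath


def parse_command_path(path: str) -> tuple[str, str, int]:
    """Split 'word/speaker_nohash_N.wav' into (label, speaker_id, utterance_index)."""
    p = PurePosixPath(path)
    label = p.parent.name  # the folder name is the spoken word
    speaker_id, sep, index = p.stem.partition("_nohash_")
    if not sep or not index.isdigit():
        raise ValueError(f"unexpected file name: {p.name}")
    return label, speaker_id, int(index)


# Hypothetical path following the assumed layout.
print(parse_command_path("yes/0a7c2a8d_nohash_0.wav"))
```

Keeping the speaker identifier alongside the label makes it possible to split data by speaker, so that the same voice does not appear in both training and test sets.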

OpenSLR 96

OpenSLR 96 is a dataset that provides a collection of speech recordings designed for training and evaluating automatic speech recognition systems. It includes a diverse range of speakers and acoustic environments, making it suitable for developing robust models that can perform well in real-world conditions. The dataset is openly available for research and development purposes, supporting advancements in speech technology.

VoxCeleb 1

VoxCeleb 1 is a large-scale speaker recognition dataset consisting of over 100,000 utterances from 1,251 celebrities, sourced from YouTube videos. It features a diverse set of speakers from various backgrounds, making it suitable for training and evaluating models in speaker identification and verification tasks. The dataset includes variations in acoustic conditions, providing a comprehensive resource for research in speaker recognition technology.

OpenSLR 12

OpenSLR 12, better known as LibriSpeech, is a dataset designed for automatic speech recognition research, featuring roughly 1,000 hours of read English speech derived from LibriVox audiobooks. The recordings are sampled at 16 kHz and come with aligned transcripts, providing a rich resource for developing and testing speech recognition models. The dataset is openly available to facilitate research and development in the field of speech technology.

TEDLIUM

TEDLIUM is a dataset derived from TED Talks, featuring a collection of audio recordings along with their corresponding transcripts. This dataset is designed for training and evaluating automatic speech recognition systems and is characterized by diverse speakers, topics, and speaking styles. The rich content of TED Talks provides a valuable resource for research in speech technology and natural language processing.

Urban Sound 8K

Urban Sound 8K is a dataset consisting of 8,732 labeled audio excerpts of urban sounds, each up to 4 seconds long and taken from field recordings. It is designed for the development and evaluation of models in sound classification and environmental sound recognition. The excerpts are grouped into 10 classes, such as car horn, dog bark, siren, and street music, making it a valuable resource for research in audio processing and machine learning applications related to urban environments.

DARPA TIMIT

DARPA TIMIT is a widely used dataset for acoustic-phonetic research and automatic speech recognition. It consists of recorded speech from 630 speakers of American English, with a diverse range of dialects and accents. The dataset includes phonetically balanced sentences and their corresponding transcriptions, providing valuable resources for training and evaluating speech recognition models and conducting linguistic analysis.

FMA (Free Music Archive)

FMA (Free Music Archive) is a dataset that provides a large collection of music tracks across various genres, all available for free and open use. It includes metadata such as artist information, track titles, and genre classifications, making it a valuable resource for music information retrieval, analysis, and machine learning applications. The dataset is widely used for research in audio processing, music recommendation systems, and classification tasks.

Google AudioSet

Google AudioSet is a large-scale dataset designed for audio event classification. It contains over 2 million human-labeled 10-second audio clips from YouTube videos, covering a wide variety of sound events across multiple categories, such as music, speech, environmental sounds, and animal sounds. This diverse dataset is widely used for training and evaluating machine learning models for sound recognition and audio classification.
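
To give a sense of scale, the figures above can be turned into a rough lower bound on total audio duration. This is only back-of-the-envelope arithmetic: it treats every clip as a full 10 seconds and uses 2 million as the clip count, since the description only says "over 2 million".

```python
CLIP_SECONDS = 10
MIN_CLIPS = 2_000_000  # "over 2 million", so this is a lower bound

total_seconds = CLIP_SECONDS * MIN_CLIPS
total_hours = total_seconds / 3600

print(f"roughly {total_hours:,.0f} hours of audio or more")
```

At this size the audio is usually processed in streaming fashion or as precomputed features rather than loaded into memory at once.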

VoxForge

VoxForge is an open-source speech corpus that provides a collection of transcribed audio recordings contributed by volunteers from around the world. It is designed to support the development of speech recognition systems in various languages and dialects. The dataset includes diverse speech samples, making it a valuable resource for researchers and developers working on speech technology and natural language processing applications.

REVERB Challenge

The REVERB Challenge dataset is designed for research in dereverberation and reverberation-robust speech recognition. It consists of simulated and real recordings of speech in rooms with different levels of reverberation. This dataset is used to evaluate algorithms for dereverberation and to improve the performance of speech recognition systems in challenging acoustic conditions. The REVERB Challenge promotes advancements in speech enhancement and robust recognition technologies.

RAVDESS

RAVDESS (the Ryerson Audio-Visual Database of Emotional Speech and Song) is a dataset of emotional speech and song recordings, designed for research in emotion recognition. It consists of 24 professional actors expressing various emotions, including happiness, sadness, anger, and fear, through spoken phrases and singing. The dataset includes both audio and video recordings, making it a valuable resource for developing and evaluating models for emotion detection in speech and audio processing applications.
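
Emotion labels for these recordings are usually read directly from the file names. The sketch below assumes the seven-field, hyphen-separated naming convention documented with the dataset release (modality, vocal channel, emotion, intensity, statement, repetition, actor). The function name `parse_ravdess_name` is hypothetical, and the code tables should be checked against the release notes before use.

```python
# Assumed code tables, following the naming convention described in the release.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
VOCAL_CHANNELS = {"01": "speech", "02": "song"}


def parse_ravdess_name(filename: str) -> dict:
    """Decode the fields of a RAVDESS-style file name needed for emotion labeling."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    fields = stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"expected 7 hyphen-separated fields, got {len(fields)}")
    _modality, channel, emotion, _intensity, _statement, _repetition, actor = fields
    return {
        "vocal_channel": VOCAL_CHANNELS[channel],
        "emotion": EMOTIONS[emotion],
        "actor": int(actor),
    }


print(parse_ravdess_name("03-01-06-01-02-01-12.wav"))
```

Returning the actor number as well makes it straightforward to hold out whole actors for evaluation, which avoids testing on voices seen during training.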

NSynth (Neural Synth)

NSynth (Neural Synth) is a dataset created by Google that contains over 300,000 musical notes from about 1,000 different instruments. Each note is a four-second monophonic audio snippet sampled at 16 kHz and annotated with its pitch, velocity, and instrument, allowing for rich audio synthesis and machine learning applications. NSynth is designed for training neural networks to generate new sounds and explore the possibilities of sound synthesis, making it a valuable resource for researchers and developers in the fields of music technology and audio processing.

ESC-50

ESC-50 is a dataset for environmental sound classification, containing 2,000 labeled audio recordings of 50 different sound classes. Each class includes 40 recordings, featuring sounds from nature, human activities, and man-made environments. The dataset is designed to facilitate research in sound recognition and machine learning applications, making it a valuable resource for developing and evaluating models for environmental sound classification.
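
Because the classes are perfectly balanced, the numbers above also fix the chance-level baseline that any classifier should beat. A minimal check of that arithmetic, using only the figures stated in the description:

```python
NUM_CLASSES = 50
CLIPS_PER_CLASS = 40

total_clips = NUM_CLASSES * CLIPS_PER_CLASS

# With balanced classes, always predicting a single class is right
# for exactly one class's worth of clips.
chance_accuracy = CLIPS_PER_CLASS / total_clips

print(total_clips, f"{chance_accuracy:.1%}")
```

A reported accuracy is only meaningful relative to this 1-in-50 baseline, which is why balanced datasets like this one are convenient for comparing models.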

IEMOCAP (Interactive Emotional Dyadic Motion Capture)

IEMOCAP (Interactive Emotional Dyadic Motion Capture) is a multimodal dataset designed for emotion recognition research. It includes audio, video, and motion capture data of actors performing scripted and improvised dialogues with varying emotional expressions. The dataset features multiple emotions such as happiness, sadness, anger, and frustration, providing a rich resource for developing and evaluating models for emotional analysis in speech and video processing applications.

VoxConverse

VoxConverse is a speaker diarisation dataset featuring multi-speaker clips of conversational speech extracted from YouTube videos. It contains diverse conversational topics and a wide range of speech styles, often with overlapping speech and background noise, making it suitable for research on determining who spoke when in a recording. The dataset provides a valuable resource for developing and evaluating models that analyze and understand conversational interactions.

AVSpeech

AVSpeech is a dataset designed for research in audiovisual speech recognition, consisting of paired audio and visual recordings of speakers. It includes a diverse range of speakers, languages, and contexts, allowing for the study of how visual cues, such as lip movements, enhance speech recognition accuracy. This dataset is valuable for developing and evaluating models that integrate both audio and visual information in speech processing applications.

Kazakh ASR Dataset

The Kazakh ASR Dataset is designed for automatic speech recognition research in the Kazakh language. It includes a collection of audio recordings from various speakers, covering a range of topics and speech styles. The dataset aims to provide valuable resources for training and evaluating speech recognition models tailored to the Kazakh language, facilitating advancements in speech technology and applications in natural language processing.

Kazakh Speech Corpus

The Kazakh Speech Corpus is a comprehensive dataset designed for speech recognition and linguistic research in the Kazakh language. It comprises a variety of audio recordings from native speakers, covering diverse speech styles, dialects, and topics. This corpus serves as a valuable resource for developing and testing automatic speech recognition systems, phonetic studies, and other applications in natural language processing, promoting advancements in Kazakh language technologies.

EmoReact

EmoReact is a dataset designed for emotion recognition in videos, featuring a collection of video clips annotated with various emotional responses. It includes diverse scenarios, expressions, and contexts, making it suitable for training models to detect and analyze emotions in visual media. This dataset provides a valuable resource for researchers and developers working on applications in affective computing, emotion analysis, and multimedia processing.

Common Voice 17.0

Common Voice 17.0 is a multilingual dataset of voice recordings collected from volunteers around the globe, aimed at improving speech recognition technology. It features a wide variety of spoken phrases in multiple languages, accompanied by diverse accents and dialects. This dataset is valuable for training and evaluating automatic speech recognition systems, making significant contributions to the development of inclusive and accurate voice technologies.
