Natural Language Processing (NLP)

Text corpora, datasets with emotion labels, contextual texts, dialogue recordings.

FineWeb

FineWeb is a large-scale dataset of cleaned and deduplicated English web text extracted from Common Crawl, designed primarily for pretraining large language models. It provides a valuable resource for researchers and developers in natural language processing, supporting work on language modeling, web content understanding, and information retrieval.
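A minimal sketch of how such a corpus can be inspected, assuming the dataset is published on the Hugging Face Hub as `HuggingFaceFW/fineweb` with a small `sample-10BT` subset; streaming avoids downloading the full corpus:

```python
# Minimal sketch (assumed Hub id and subset name): stream a few FineWeb records.
from datasets import load_dataset

fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)

for i, page in enumerate(fineweb):
    # Each record carries the extracted page text and its source URL.
    print(page["url"], page["text"][:80])
    if i == 2:
        break
```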

Read more

OpenOrca

OpenOrca is a dataset designed for instruction-tuning and evaluating open-domain conversational AI models. It consists of millions of instruction-response pairs built by augmenting the FLAN collection with completions from GPT-4 and GPT-3.5, covering a wide range of tasks and conversational scenarios. This dataset is valuable for researchers and developers working on natural language understanding, dialogue systems, and improving AI-driven interactions.
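A minimal sketch of reading one record, assuming the `Open-Orca/OpenOrca` repository on the Hugging Face Hub and its published column names:

```python
# Minimal sketch (assumed Hub id and column names): peek at one OpenOrca record.
from datasets import load_dataset

orca = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)
example = next(iter(orca))
print(example["system_prompt"])  # system message used when generating the response
print(example["question"])       # the instruction / user prompt
print(example["response"])       # the GPT-written answer
```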

Read more

C4 (Colossal Clean Crawled Corpus)

C4 (Colossal Clean Crawled Corpus) is a large-scale dataset designed for training language models, consisting of a cleaned and filtered version of Common Crawl web text. It includes billions of words across various topics and domains, providing a rich resource for natural language processing tasks, including text generation, understanding, and classification.
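A minimal sketch, assuming the `allenai/c4` mirror on the Hugging Face Hub with the English (`en`) configuration; streaming avoids downloading the multi-hundred-gigabyte corpus:

```python
# Minimal sketch (assumed Hub id): stream the English portion of C4.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
doc = next(iter(c4))
print(doc["url"])          # source URL of the crawled page
print(doc["text"][:200])   # cleaned web text
```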

Read more

Wikipedia

The Wikipedia dataset is a large collection of text data extracted from Wikipedia articles across various topics. It includes structured and unstructured content, making it valuable for natural language processing tasks such as text analysis, summarization, question answering, and knowledge extraction, supporting research and development in AI.

Read more

CCMatrix

CCMatrix is a large-scale multilingual dataset designed for training machine translation models. It consists of billions of sentence pairs extracted from the Common Crawl corpus, covering numerous languages. This dataset is valuable for researchers and developers aiming to improve translation quality and develop robust multilingual models in natural language processing.

Read more

The Pile

The Pile is a large-scale, diverse dataset designed for training language models. It consists of 825 GB of text data sourced from various domains, including books, websites, and academic papers. This dataset is valuable for developing advanced natural language processing models, enhancing tasks such as text generation, understanding, and dialogue systems.

Read more

WikiText

WikiText is a dataset derived from Wikipedia articles, designed for training and evaluating language models. It includes high-quality, processed text data with over 100 million tokens, providing a rich resource for natural language processing tasks such as text generation, language modeling, and understanding.
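A minimal sketch, assuming the `wikitext` dataset with the `wikitext-103-raw-v1` configuration on the Hugging Face Hub:

```python
# Minimal sketch (assumed Hub id and config): load WikiText-103 and inspect a line.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
print(wikitext)                        # train / validation / test splits
print(wikitext["train"][10]["text"])   # each record is a line of article text
```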

Read more

Topical Chat

Topical Chat is a dataset designed for training conversational AI models, containing knowledge-grounded dialogues on various topics. It includes roughly 11,000 multi-turn conversations (more than 200,000 utterances) between human annotators, allowing researchers to develop and evaluate models focused on engaging, context-aware interactions in natural language processing.

Read more

Persona Chat

Persona Chat is a dataset designed for building conversational agents with personality. It consists of dialogues in which each participant plays a persona defined by a few short profile sentences. The dataset includes over 10,000 conversations comprising more than 160,000 utterances, making it valuable for training and evaluating models that aim to create engaging and personalized interactions in natural language processing.

Read more

Blended Skill RU

Blended Skill RU is a dataset designed for developing conversational AI models in the Russian language. It includes dialogues that blend various skills, such as answering questions, providing recommendations, and engaging in small talk. This dataset is valuable for training and evaluating models that aim to create natural and effective interactions in Russian.

Read more

Blended Skill EN

Blended Skill EN is a dataset designed for developing conversational AI models in English. It includes dialogues that combine various skills, such as answering questions, providing recommendations, and engaging in casual conversation. This dataset is valuable for training and evaluating models aimed at creating natural and engaging interactions in English.

Read more

SQuAD

SQuAD (Stanford Question Answering Dataset) is a dataset for reading comprehension tasks, consisting of over 100,000 questions based on a set of Wikipedia articles. Each question is paired with a corresponding passage and a concise answer, making it valuable for training and evaluating models in natural language processing, especially for question answering systems.
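A minimal sketch, assuming the `squad` dataset (SQuAD 1.1) on the Hugging Face Hub:

```python
# Minimal sketch (assumed Hub id): inspect one SQuAD training example.
from datasets import load_dataset

squad = load_dataset("squad")
sample = squad["train"][0]
print(sample["question"])
print(sample["context"][:200])
print(sample["answers"])   # {"text": [...], "answer_start": [...]}
```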

Read more

SNLI

SNLI (Stanford Natural Language Inference) is a dataset for natural language inference tasks, containing 570,000 labeled sentence pairs. Each pair is annotated with one of three labels: entailment, contradiction, or neutral. This dataset is valuable for training and evaluating models in natural language processing, particularly for tasks related to understanding relationships between sentences.
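A minimal sketch, assuming the `snli` dataset on the Hugging Face Hub, where labels follow the convention 0 = entailment, 1 = neutral, 2 = contradiction (-1 marks unlabeled pairs):

```python
# Minimal sketch (assumed Hub id and label convention): read one SNLI sentence pair.
from datasets import load_dataset

snli = load_dataset("snli")
pair = snli["train"][0]
print(pair["premise"])
print(pair["hypothesis"])
print(pair["label"])
```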

Read more

MultiNLI

MultiNLI (Multi-Genre Natural Language Inference) is a dataset for natural language inference tasks, consisting of 433,000 sentence pairs across various genres. Each pair is labeled as entailment, contradiction, or neutral. This dataset is valuable for training and evaluating models in natural language processing, particularly for understanding sentence relationships in diverse contexts.

Read more

MS MARCO

MS MARCO (Microsoft MAchine Reading COmprehension) is a dataset designed for machine reading comprehension tasks, consisting of over 1 million real user queries and their corresponding passages from web documents. It includes questions, answers, and passage annotations, making it valuable for training and evaluating models in information retrieval and natural language processing.

Read more

NarrativeQA

NarrativeQA is a dataset designed for reading comprehension and question answering, consisting of over 1,500 stories and corresponding questions. Each question requires understanding the narrative context to generate answers, making it valuable for training and evaluating models in natural language processing, particularly for tasks that involve deeper comprehension of textual information.

Read more

Kazakh Wiki

Kazakh Wiki is a dataset derived from the Kazakh Wikipedia, containing a wide range of articles covering various topics in the Kazakh language. It provides a rich resource for natural language processing tasks, including text analysis, summarization, and language modeling, aiding researchers and developers in understanding and processing Kazakh text.

Read more

Kazakh Instruct

Kazakh Instruct is a dataset designed for training and evaluating models in instruction-based tasks in the Kazakh language. It includes a variety of tasks and prompts aimed at guiding users through different activities. This dataset is valuable for developing natural language processing applications that focus on user interaction and comprehension in Kazakh.

Read more

Alpaca

Alpaca is a dataset designed for instruction-tuning conversational AI models, consisting of about 52,000 instruction-following examples generated with OpenAI's text-davinci-003 via the self-instruct method. It includes diverse prompts and responses, making it valuable for developing and evaluating models that follow instructions and hold coherent conversations, enhancing applications in dialogue systems and virtual assistants.
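A minimal sketch, assuming the `tatsu-lab/alpaca` release on the Hugging Face Hub:

```python
# Minimal sketch (assumed Hub id): inspect one Alpaca instruction-following record.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")
rec = alpaca[0]
print(rec["instruction"])  # the task description
print(rec["input"])        # optional additional context (may be empty)
print(rec["output"])       # the model-written answer
```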

Read more

MDBKD

MDBKD (Multi-Domain Benchmark for Knowledge Discovery) is a dataset designed for evaluating knowledge discovery and retrieval tasks across multiple domains. It contains various data types and sources, including structured and unstructured information, making it valuable for researchers developing and testing algorithms in information retrieval, data mining, and machine learning.

Read more

KazNERD ISSAI

KazNERD ISSAI is a dataset designed for named entity recognition in the Kazakh language. It includes annotated texts across various domains, focusing on identifying entities such as people, organizations, and locations. This dataset is valuable for training and evaluating models in natural language processing, particularly for tasks related to information extraction and understanding.

Read more

KazNER

KazNER is a dataset designed for named entity recognition (NER) in the Kazakh language. It consists of annotated text data containing various entities such as names of people, organizations, and locations. This dataset is valuable for training and evaluating models in natural language processing, especially for tasks related to information extraction.

Read more

Kazakh Unsorted NITEC

Kazakh Unsorted NITEC is a dataset designed for natural language processing tasks in the Kazakh language. It contains a collection of unstructured text data from various sources, making it suitable for tasks such as text classification, sentiment analysis, and language modeling. This dataset is valuable for researchers and developers working on Kazakh language applications.

Read more

Kazakh Literature Collection

The Kazakh Literature Collection is a dataset comprising a wide range of literary works in the Kazakh language, including poetry, prose, and historical texts. This collection is valuable for natural language processing tasks such as text analysis, sentiment analysis, and machine learning applications, supporting research and the development of models focused on Kazakh literature.

Read more

Kazakh Dolly

Kazakh Dolly is a dataset designed for training and evaluating dialogue systems in the Kazakh language. It includes various conversational data, such as question-answer pairs and dialogues, aimed at enhancing natural language understanding and generation. This dataset is valuable for developing AI applications that require engaging and context-aware interactions.

Read more

Alpaca Kazakh TACO

Alpaca Kazakh TACO is a dataset designed for training conversational AI models in the Kazakh language, featuring a variety of task-oriented dialogues. It includes diverse prompts and responses that simulate user interactions across different tasks. This dataset is valuable for developing natural language processing applications focused on enhancing user experience and engagement.

Read more

RuBQ

RuBQ (Russian Knowledge Base Questions) is a dataset designed for question answering over a knowledge base in the Russian language. It contains questions paired with answers grounded in Wikidata, along with supporting evidence. This dataset is valuable for training and evaluating models in natural language processing, particularly for enhancing information retrieval and comprehension capabilities.

Read more

Gigaword

Gigaword is a large-scale dataset of news articles, providing a diverse collection of text data for natural language processing tasks. It includes billions of words from various news sources, making it valuable for training models in text summarization, language modeling, and information retrieval, enhancing the understanding of contemporary language usage.

Read more

XSum (Extreme Summarization)

XSum (Extreme Summarization) is a dataset designed for single-document extreme summarization tasks. It contains over 226,000 BBC articles paired with one-sentence summaries that capture the essence of the content. This dataset is valuable for training and evaluating models focused on generating concise and informative summaries in natural language processing.
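A minimal sketch, assuming the `EdinburghNLP/xsum` dataset on the Hugging Face Hub:

```python
# Minimal sketch (assumed Hub id): pair a BBC article with its one-sentence summary.
from datasets import load_dataset

xsum = load_dataset("EdinburghNLP/xsum")
item = xsum["train"][0]
print(item["document"][:200])  # the full article
print(item["summary"])         # the single-sentence abstractive summary
```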

Read more

RACE (ReAding Comprehension from Examinations)

RACE (ReAding Comprehension from Examinations) is a large-scale dataset designed for reading comprehension tasks, consisting of over 28,000 passages and 97,000 questions. The questions come from English exams for Chinese middle and high school students and require deep understanding and reasoning. This dataset is valuable for training and evaluating models in natural language processing, particularly for question answering.

Read more

Winograd WSC (Winograd Schema Challenge)

The Winograd WSC (Winograd Schema Challenge) dataset is designed for evaluating coreference resolution in natural language processing. It consists of sentences that require understanding the context to resolve ambiguous pronouns. This dataset is valuable for training and testing models focused on language comprehension and reasoning.

Read more

Sentiment140

Sentiment140 is a dataset for sentiment analysis, containing 1.6 million tweets labeled with positive and negative sentiments. It is designed to facilitate training and evaluating models in natural language processing, particularly for tasks related to sentiment classification and opinion mining in social media.

Read more

Google Natural Questions

Google Natural Questions is a dataset designed for training and evaluating models in natural language understanding and question answering. It contains real user questions paired with long answers from Wikipedia articles, providing a rich resource for developing systems that can comprehend and respond to natural language queries effectively.

Read more

KK-EN Corpora

KK-EN Corpora is a dataset designed for Kazakh-English language processing, containing parallel texts that facilitate translation and linguistic analysis. This corpus includes various domains, providing valuable resources for training and evaluating machine translation models and improving language understanding in bilingual applications.

Read more

IMDB Dataset of 50K Movie Reviews

The IMDB Dataset of 50K Movie Reviews is a collection of 50,000 film reviews, labeled as positive or negative. It is widely used for sentiment analysis and natural language processing tasks, providing valuable resources for training and evaluating models focused on understanding user opinions and enhancing text classification capabilities.
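A minimal sketch, assuming the `imdb` dataset on the Hugging Face Hub, where labels are 0 = negative and 1 = positive:

```python
# Minimal sketch (assumed Hub id and label convention): read one labeled review.
from datasets import load_dataset

imdb = load_dataset("imdb")
review = imdb["train"][0]
print(review["label"], review["text"][:200])
```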

Read more

Yelp Dataset

The Yelp Dataset contains a rich collection of reviews, ratings, and user information for various businesses listed on Yelp. It includes millions of reviews across different categories, making it valuable for sentiment analysis, recommendation systems, and natural language processing tasks focused on understanding consumer opinions and behavior.

Read more

Amazon Reviews

The Amazon Reviews dataset contains millions of product reviews from Amazon, including ratings, text feedback, and user information. It covers a wide range of categories and products, making it valuable for sentiment analysis, recommendation systems, and natural language processing tasks focused on understanding consumer sentiment and behavior.

Read more

Stanford Sentiment Treebank

The Stanford Sentiment Treebank is a dataset designed for sentiment analysis, containing fine-grained sentiment annotations for over 11,000 sentences drawn from movie reviews. Each sentence is represented as a parse tree with sentiment labels at every node, allowing for detailed sentiment classification at the phrase and sentence level and making it valuable for training and evaluating models in natural language processing.

Read more

Book Corpus

The Book Corpus is a dataset comprising over 11,000 books from various genres, providing a diverse range of text for natural language processing tasks. It is valuable for training language models, text generation, and understanding narrative structure, making it an essential resource for researchers and developers in the field of NLP.

Read more

Recipe 2M

The Recipe 2M dataset is a large-scale collection of over 2 million cooking recipes, including ingredients, instructions, and cooking times. It provides a rich resource for natural language processing tasks related to recipe generation, recommendation systems, and culinary analysis, helping researchers and developers enhance food-related applications.

Read more

XNLI (Cross-lingual Natural Language Inference)

XNLI (Cross-lingual Natural Language Inference) is a dataset for evaluating natural language inference models across multiple languages. It includes sentence pairs in 15 languages with labels for entailment, contradiction, or neutral. This dataset is valuable for training and assessing models that understand cross-lingual relationships and reasoning.
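A minimal sketch, assuming the `xnli` dataset on the Hugging Face Hub with per-language configurations (Russian shown here) and the usual NLI label convention (0 = entailment, 1 = neutral, 2 = contradiction):

```python
# Minimal sketch (assumed Hub id, config, and label convention): read one Russian pair.
from datasets import load_dataset

xnli_ru = load_dataset("xnli", "ru")
pair = xnli_ru["validation"][0]
print(pair["premise"])
print(pair["hypothesis"])
print(pair["label"])
```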

Read more

OpenCorpora Russian

OpenCorpora Russian is a linguistic dataset designed for natural language processing tasks in the Russian language. It includes a variety of annotated texts, covering different domains and genres. This dataset is valuable for training and evaluating models in tasks such as part-of-speech tagging, syntactic parsing, and named entity recognition.

Read more

RuSentiment

RuSentiment is a dataset for sentiment analysis in the Russian language, consisting of annotated texts from various sources, including social media and reviews. It contains labeled data for positive, negative, and neutral sentiments, making it valuable for training and evaluating models focused on understanding user opinions and emotions in natural language processing.

Read more

Lenta.Ru News Dataset

The Lenta.Ru News Dataset is a collection of news articles from the Russian news aggregator Lenta.ru, providing a rich source of text data for natural language processing tasks. It includes various articles across multiple categories, making it valuable for tasks such as topic modeling, sentiment analysis, and text classification.

Read more

RuDReC (Russian Drug Reaction Corpus)

RuDReC (Russian Drug Reaction Corpus) is a dataset of Russian-language consumer reviews about pharmaceutical products, annotated for entities such as drugs, adverse reactions, and symptoms, and for the relations between them. It provides valuable resources for training and evaluating models in natural language processing tasks focused on biomedical entity recognition, relation extraction, and information retrieval.

Read more

OpenSubtitles Parallel Corpora

OpenSubtitles Parallel Corpora is a multilingual dataset consisting of movie and TV show subtitles aligned in multiple languages. It provides a rich resource for training and evaluating machine translation models and linguistic studies, enabling researchers to analyze dialogue patterns and improve translation quality across languages.

Read more

Russian Poetry

The Russian Poetry dataset is a collection of poems written in the Russian language, encompassing various styles, authors, and periods. It serves as a valuable resource for natural language processing tasks, including text analysis, sentiment analysis, and model training focused on literary studies and cultural insights.

Read more

Kazakh TTS

Kazakh TTS is a dataset designed for training text-to-speech (TTS) models in the Kazakh language. It includes recorded audio samples along with corresponding text, providing a valuable resource for developing and evaluating speech synthesis systems that generate natural and intelligible Kazakh speech.

Read more

FineWeb-Edu

FineWeb-Edu is a subset of the FineWeb corpus filtered for educational content using a model-based quality classifier. It retains web pages with high educational value across various subjects, making it valuable for pretraining language models, developing educational tools, and enhancing content-based learning applications.

Read more

SmolLM Corpus

The SmolLM Corpus is a dataset designed for training small language models, containing a diverse collection of text data from various domains. It aims to provide resources for developing efficient and lightweight models suitable for applications with limited computational resources, making it valuable for researchers focusing on scalable natural language processing.

Read more

WildChat

WildChat is a dataset of real user conversations with ChatGPT collected in the wild, featuring diverse, informal, and spontaneous dialogues. It is valuable for developing and evaluating models that can engage in natural, context-aware interactions across different topics.

Read more

Dolma

Dolma is an open corpus of roughly three trillion tokens released by the Allen Institute for AI for training large language models. It includes a diverse mix of web text, academic papers, code, books, and encyclopedic content. This dataset is valuable for researchers and developers working on enhancing model performance and efficiency in various natural language processing tasks.

Read more

peS2o

peS2o is a dataset of open-access academic papers and abstracts derived from the Semantic Scholar Open Research Corpus (S2ORC), cleaned and filtered for language model pretraining. It provides a valuable resource for developing models that understand and generate scientific and scholarly text.

Read more

Wild Jailbreak

Wild Jailbreak is a dataset designed for evaluating the robustness of conversational AI models against adversarial prompts and manipulative queries. It includes various examples of inputs aimed at "jailbreaking" or bypassing model constraints, making it valuable for researchers focused on enhancing model security and understanding potential vulnerabilities.

Read more

AmberDatasets

AmberDatasets is a collection of datasets aimed at training and evaluating models for various natural language processing tasks. It includes annotated texts from multiple domains, providing resources for tasks such as sentiment analysis, text classification, and named entity recognition. This dataset is valuable for researchers and developers working on NLP applications.

Read more

Zyda

Zyda is a large-scale open dataset for language model pretraining, assembled by filtering and deduplicating several existing open text corpora. It provides a valuable resource for training and evaluating general-purpose language models in natural language processing.

Read more

MFAQ (Multilingual Frequently Asked Questions)

MFAQ (Multilingual Frequently Asked Questions) is a dataset that contains a collection of frequently asked questions across multiple languages. It is designed to support the development of multilingual question-answering systems, providing valuable resources for training models to understand and respond to user inquiries in various languages.

Read more

UpVoteWeb

UpVoteWeb is a dataset designed for evaluating the performance of recommendation systems and content moderation algorithms. It contains user interactions, including upvotes and downvotes on web content, making it valuable for training models that aim to improve user engagement and content relevance in online platforms.

Read more

OSCAR-2301

OSCAR-2301 is the January 2023 release of the OSCAR multilingual corpus, built from Common Crawl web data and covering more than 150 languages. It is valuable for training language models and for natural language processing tasks such as translation and cross-lingual understanding, providing a rich resource for researchers and developers working with multilingual applications.

Read more

CrossSum

CrossSum is a dataset designed for cross-lingual summarization tasks, containing pairs of articles and their corresponding summaries in multiple languages. It is valuable for training and evaluating models that aim to generate concise and coherent summaries across different languages, enhancing natural language processing applications in multilingual contexts.

Read more

StarCoderData

StarcoderData is a dataset designed for training and evaluating coding assistants and models in software development tasks. It includes a vast collection of code snippets, programming questions, and documentation across various programming languages. This dataset is valuable for enhancing natural language processing applications in coding and debugging.
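A minimal sketch, assuming the gated `bigcode/starcoderdata` repository on the Hugging Face Hub (access must be requested and an auth token configured), where `data_dir` selects a language subset:

```python
# Minimal sketch (assumed Hub id, gated access, and column name): stream Python files.
from datasets import load_dataset

starcoder = load_dataset("bigcode/starcoderdata", data_dir="python",
                         split="train", streaming=True)
file = next(iter(starcoder))
print(file["content"][:200])   # raw source-file text
```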

Read more

GlotCC-V1

GlotCC-V1 is a document-level multilingual corpus derived from Common Crawl, covering more than 1,000 languages with a particular focus on low-resource ones. It provides valuable resources for training and evaluating models in natural language processing that target broad multilingual understanding and interaction.

Read more