Natural Language Processing (NLP)

Text corpora, datasets with emotion labels, contextual texts, dialogue recordings.

FineWeb

FineWeb is a large-scale dataset of cleaned and deduplicated English web text extracted from Common Crawl, designed primarily for pretraining large language models. It provides a valuable resource for researchers and developers in natural language processing, supporting work on language modeling, web content understanding, and information retrieval.
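A minimal sketch of how such a corpus can be inspected, assuming the dataset is published on the Hugging Face Hub as `HuggingFaceFW/fineweb` with a small `sample-10BT` subset; streaming avoids downloading the full corpus:

```python
# Minimal sketch (assumed Hub id and subset name): stream a few FineWeb records.
from datasets import load_dataset

fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)

for i, page in enumerate(fineweb):
    # Each record carries the extracted page text and its source URL.
    print(page["url"], page["text"][:80])
    if i == 2:
        break
```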

Read more

OpenOrca

OpenOrca is a dataset designed for instruction-tuning and evaluating open-domain conversational AI models. It consists of millions of instruction-response pairs built by augmenting the FLAN collection with completions from GPT-4 and GPT-3.5, covering a wide range of tasks and conversational scenarios. This dataset is valuable for researchers and developers working on natural language understanding, dialogue systems, and improving AI-driven interactions.
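A minimal sketch of reading one record, assuming the `Open-Orca/OpenOrca` repository on the Hugging Face Hub and its published column names:

```python
# Minimal sketch (assumed Hub id and column names): peek at one OpenOrca record.
from datasets import load_dataset

orca = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)
example = next(iter(orca))
print(example["system_prompt"])  # system message used when generating the response
print(example["question"])       # the instruction / user prompt
print(example["response"])       # the GPT-written answer
```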

Read more

C4 (Colossal Clean Crawled Corpus)

C4 (Colossal Clean Crawled Corpus) is a large-scale dataset designed for training language models, consisting of a cleaned and filtered version of Common Crawl web text. It includes billions of words across various topics and domains, providing a rich resource for natural language processing tasks, including text generation, understanding, and classification.
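A minimal sketch, assuming the `allenai/c4` mirror on the Hugging Face Hub with the English (`en`) configuration; streaming avoids downloading the multi-hundred-gigabyte corpus:

```python
# Minimal sketch (assumed Hub id): stream the English portion of C4.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
doc = next(iter(c4))
print(doc["url"])          # source URL of the crawled page
print(doc["text"][:200])   # cleaned web text
```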

Read more

Wikipedia

The Wikipedia dataset is a large collection of text data extracted from Wikipedia articles across various topics. It includes structured and unstructured content, making it valuable for natural language processing tasks such as text analysis, summarization, question answering, and knowledge extraction, supporting research and development in AI.

Read more

CCMatrix

CCMatrix is a large-scale multilingual dataset designed for training machine translation models. It consists of billions of sentence pairs extracted from the Common Crawl corpus, covering numerous languages. This dataset is valuable for researchers and developers aiming to improve translation quality and develop robust multilingual models in natural language processing.

Read more

The Pile

The Pile is a large-scale, diverse dataset designed for training language models. It consists of 825 GB of text data sourced from various domains, including books, websites, and academic papers. This dataset is valuable for developing advanced natural language processing models, enhancing tasks such as text generation, understanding, and dialogue systems.

Read more

WikiText

WikiText is a dataset derived from Wikipedia articles, designed for training and evaluating language models. It includes high-quality, processed text data with over 100 million tokens, providing a rich resource for natural language processing tasks such as text generation, language modeling, and understanding.
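A minimal sketch, assuming the `wikitext` dataset with the `wikitext-103-raw-v1` configuration on the Hugging Face Hub:

```python
# Minimal sketch (assumed Hub id and config): load WikiText-103 and inspect a line.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
print(wikitext)                        # train / validation / test splits
print(wikitext["train"][10]["text"])   # each record is a line of article text
```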

Read more

Topical Chat

Topical Chat is a dataset designed for training conversational AI models, containing knowledge-grounded dialogues on various topics. It includes roughly 11,000 multi-turn conversations (more than 200,000 utterances) between human annotators, allowing researchers to develop and evaluate models focused on engaging, context-aware interactions in natural language processing.

Read more

Persona Chat

Persona Chat is a dataset designed for building conversational agents with personality. It consists of dialogues in which each participant plays a persona defined by a few short profile sentences. The dataset includes over 10,000 conversations comprising more than 160,000 utterances, making it valuable for training and evaluating models that aim to create engaging and personalized interactions in natural language processing.

Read more

Blended Skill RU

Blended Skill RU is a dataset designed for developing conversational AI models in the Russian language. It includes dialogues that blend various skills, such as answering questions, providing recommendations, and engaging in small talk. This dataset is valuable for training and evaluating models that aim to create natural and effective interactions in Russian.

Read more

Blended Skill EN

Blended Skill EN is a dataset designed for developing conversational AI models in English. It includes dialogues that combine various skills, such as answering questions, providing recommendations, and engaging in casual conversation. This dataset is valuable for training and evaluating models aimed at creating natural and engaging interactions in English.

Read more

SQuAD

SQuAD (Stanford Question Answering Dataset) is a dataset for reading comprehension tasks, consisting of over 100,000 questions based on a set of Wikipedia articles. Each question is paired with a corresponding passage and a concise answer, making it valuable for training and evaluating models in natural language processing, especially for question answering systems.
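A minimal sketch, assuming the `squad` dataset (SQuAD 1.1) on the Hugging Face Hub:

```python
# Minimal sketch (assumed Hub id): inspect one SQuAD training example.
from datasets import load_dataset

squad = load_dataset("squad")
sample = squad["train"][0]
print(sample["question"])
print(sample["context"][:200])
print(sample["answers"])   # {"text": [...], "answer_start": [...]}
```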

Read more

SNLI

SNLI (Stanford Natural Language Inference) is a dataset for natural language inference tasks, containing 570,000 labeled sentence pairs. Each pair is annotated with one of three labels: entailment, contradiction, or neutral. This dataset is valuable for training and evaluating models in natural language processing, particularly for tasks related to understanding relationships between sentences.
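A minimal sketch, assuming the `snli` dataset on the Hugging Face Hub, where labels follow the convention 0 = entailment, 1 = neutral, 2 = contradiction (-1 marks unlabeled pairs):

```python
# Minimal sketch (assumed Hub id and label convention): read one SNLI sentence pair.
from datasets import load_dataset

snli = load_dataset("snli")
pair = snli["train"][0]
print(pair["premise"])
print(pair["hypothesis"])
print(pair["label"])
```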

Read more

MultiNLI

MultiNLI (Multi-Genre Natural Language Inference) is a dataset for natural language inference tasks, consisting of 433,000 sentence pairs across various genres. Each pair is labeled as entailment, contradiction, or neutral. This dataset is valuable for training and evaluating models in natural language processing, particularly for understanding sentence relationships in diverse contexts.

Read more

MS MARCO

MS MARCO (Microsoft MAchine Reading COmprehension) is a dataset designed for machine reading comprehension tasks, consisting of over 1 million real user queries and their corresponding passages from web documents. It includes questions, answers, and passage annotations, making it valuable for training and evaluating models in information retrieval and natural language processing.

Read more

NarrativeQA

NarrativeQA is a dataset designed for reading comprehension and question answering, consisting of over 1,500 stories and corresponding questions. Each question requires understanding the narrative context to generate answers, making it valuable for training and evaluating models in natural language processing, particularly for tasks that involve deeper comprehension of textual information.

Read more

Kazakh Wiki

Kazakh Wiki is a dataset derived from the Kazakh Wikipedia, containing a wide range of articles covering various topics in the Kazakh language. It provides a rich resource for natural language processing tasks, including text analysis, summarization, and language modeling, aiding researchers and developers in understanding and processing Kazakh text.

Read more

Kazakh Instruct

Kazakh Instruct is a dataset designed for training and evaluating models in instruction-based tasks in the Kazakh language. It includes a variety of tasks and prompts aimed at guiding users through different activities. This dataset is valuable for developing natural language processing applications that focus on user interaction and comprehension in Kazakh.

Read more

Alpaca

Alpaca is a dataset designed for instruction-tuning conversational AI models, consisting of about 52,000 instruction-following examples generated with OpenAI's text-davinci-003 via the self-instruct method. It includes diverse prompts and responses, making it valuable for developing and evaluating models that follow instructions and hold coherent conversations, enhancing applications in dialogue systems and virtual assistants.
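A minimal sketch, assuming the `tatsu-lab/alpaca` release on the Hugging Face Hub:

```python
# Minimal sketch (assumed Hub id): inspect one Alpaca instruction-following record.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")
rec = alpaca[0]
print(rec["instruction"])  # the task description
print(rec["input"])        # optional additional context (may be empty)
print(rec["output"])       # the model-written answer
```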

Read more

MDBKD

MDBKD (Multi-Domain Benchmark for Knowledge Discovery) is a dataset designed for evaluating knowledge discovery and retrieval tasks across multiple domains. It contains various data types and sources, including structured and unstructured information, making it valuable for researchers developing and testing algorithms in information retrieval, data mining, and machine learning.

Read more

KazNERD ISSAI

KazNERD ISSAI is a dataset designed for named entity recognition in the Kazakh language. It includes annotated texts across various domains, focusing on identifying entities such as people, organizations, and locations. This dataset is valuable for training and evaluating models in natural language processing, particularly for tasks related to information extraction and understanding.

Read more

KazNER

KazNER is a dataset designed for named entity recognition (NER) in the Kazakh language. It consists of annotated text data containing various entities such as names of people, organizations, and locations. This dataset is valuable for training and evaluating models in natural language processing, especially for tasks related to information extraction.

Read more

Kazakh Unsorted NITEC

Kazakh Unsorted NITEC is a dataset designed for natural language processing tasks in the Kazakh language. It contains a collection of unstructured text data from various sources, making it suitable for tasks such as text classification, sentiment analysis, and language modeling. This dataset is valuable for researchers and developers working on Kazakh language applications.

Read more

Kazakh Literature Collection

The Kazakh Literature Collection is a dataset comprising a wide range of literary works in the Kazakh language, including poetry, prose, and historical texts. This collection is valuable for natural language processing tasks such as text analysis, sentiment analysis, and machine learning applications, supporting research and the development of models focused on Kazakh literature.

Read more

Kazakh Dolly

Kazakh Dolly is a dataset designed for training and evaluating dialogue systems in the Kazakh language. It includes various conversational data, such as question-answer pairs and dialogues, aimed at enhancing natural language understanding and generation. This dataset is valuable for developing AI applications that require engaging and context-aware interactions.

Read more

Alpaca Kazakh TACO

Alpaca Kazakh TACO is a dataset designed for training conversational AI models in the Kazakh language, featuring a variety of task-oriented dialogues. It includes diverse prompts and responses that simulate user interactions across different tasks. This dataset is valuable for developing natural language processing applications focused on enhancing user experience and engagement.

Read more

RuBQ

RuBQ (Russian Knowledge Base Questions) is a dataset designed for question answering over a knowledge base in the Russian language. It contains questions paired with answers grounded in Wikidata, along with supporting evidence. This dataset is valuable for training and evaluating models in natural language processing, particularly for enhancing information retrieval and comprehension capabilities.

Read more

Gigaword

Gigaword is a large-scale dataset of news articles, providing a diverse collection of text data for natural language processing tasks. It includes billions of words from various news sources, making it valuable for training models in text summarization, language modeling, and information retrieval, enhancing the understanding of contemporary language usage.

Read more

XSum (Extreme Summarization)

XSum (Extreme Summarization) is a dataset designed for single-document extreme summarization tasks. It contains over 226,000 BBC articles paired with one-sentence summaries that capture the essence of the content. This dataset is valuable for training and evaluating models focused on generating concise and informative summaries in natural language processing.
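A minimal sketch, assuming the `EdinburghNLP/xsum` dataset on the Hugging Face Hub:

```python
# Minimal sketch (assumed Hub id): pair a BBC article with its one-sentence summary.
from datasets import load_dataset

xsum = load_dataset("EdinburghNLP/xsum")
item = xsum["train"][0]
print(item["document"][:200])  # the full article
print(item["summary"])         # the single-sentence abstractive summary
```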

Read more

RACE (ReAding Comprehension from Examinations)

RACE (ReAding Comprehension from Examinations) is a large-scale dataset designed for reading comprehension tasks, consisting of over 28,000 passages and 97,000 questions. The questions come from English exams for Chinese middle and high school students and require deep understanding and reasoning. This dataset is valuable for training and evaluating models in natural language processing, particularly for question answering.

Read more

Winograd WSC (Winograd Schema Challenge)

The Winograd WSC (Winograd Schema Challenge) dataset is designed for evaluating coreference resolution in natural language processing. It consists of sentences that require understanding the context to resolve ambiguous pronouns. This dataset is valuable for training and testing models focused on language comprehension and reasoning.

Read more

Sentiment140

Sentiment140 is a dataset for sentiment analysis, containing 1.6 million tweets labeled with positive and negative sentiments. It is designed to facilitate training and evaluating models in natural language processing, particularly for tasks related to sentiment classification and opinion mining in social media.

Read more

Google Natural Questions

Google Natural Questions is a dataset designed for training and evaluating models in natural language understanding and question answering. It contains real user questions paired with long answers from Wikipedia articles, providing a rich resource for developing systems that can comprehend and respond to natural language queries effectively.

Read more

KK-EN Corpora

KK-EN Corpora is a dataset designed for Kazakh-English language processing, containing parallel texts that facilitate translation and linguistic analysis. This corpus includes various domains, providing valuable resources for training and evaluating machine translation models and improving language understanding in bilingual applications.

Read more

IMDB Dataset of 50K Movie Reviews

The IMDB Dataset of 50K Movie Reviews is a collection of 50,000 film reviews, labeled as positive or negative. It is widely used for sentiment analysis and natural language processing tasks, providing valuable resources for training and evaluating models focused on understanding user opinions and enhancing text classification capabilities.
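A minimal sketch, assuming the `imdb` dataset on the Hugging Face Hub, where labels are 0 = negative and 1 = positive:

```python
# Minimal sketch (assumed Hub id and label convention): read one labeled review.
from datasets import load_dataset

imdb = load_dataset("imdb")
review = imdb["train"][0]
print(review["label"], review["text"][:200])
```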

Read more

Yelp Dataset

The Yelp Dataset contains a rich collection of reviews, ratings, and user information for various businesses listed on Yelp. It includes millions of reviews across different categories, making it valuable for sentiment analysis, recommendation systems, and natural language processing tasks focused on understanding consumer opinions and behavior.

Read more

Amazon Reviews

The Amazon Reviews dataset contains millions of product reviews from Amazon, including ratings, text feedback, and user information. It covers a wide range of categories and products, making it valuable for sentiment analysis, recommendation systems, and natural language processing tasks focused on understanding consumer sentiment and behavior.

Read more

Stanford Sentiment Treebank

The Stanford Sentiment Treebank is a dataset designed for sentiment analysis, containing fine-grained sentiment annotations for over 11,000 sentences drawn from movie reviews. Each sentence is represented as a parse tree with sentiment labels at every node, allowing for detailed sentiment classification at the phrase and sentence level and making it valuable for training and evaluating models in natural language processing.

Read more

Book Corpus

The Book Corpus is a dataset comprising over 11,000 books from various genres, providing a diverse range of text for natural language processing tasks. It is valuable for training language models, text generation, and understanding narrative structure, making it an essential resource for researchers and developers in the field of NLP.

Read more

Recipe 2M

The Recipe 2M dataset is a large-scale collection of over 2 million cooking recipes, including ingredients, instructions, and cooking times. It provides a rich resource for natural language processing tasks related to recipe generation, recommendation systems, and culinary analysis, helping researchers and developers enhance food-related applications.

Read more

XNLI (Cross-lingual Natural Language Inference)

XNLI (Cross-lingual Natural Language Inference) is a dataset for evaluating natural language inference models across multiple languages. It includes sentence pairs in 15 languages with labels for entailment, contradiction, or neutral. This dataset is valuable for training and assessing models that understand cross-lingual relationships and reasoning.
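A minimal sketch, assuming the `xnli` dataset on the Hugging Face Hub with per-language configurations (Russian shown here) and the usual NLI label convention (0 = entailment, 1 = neutral, 2 = contradiction):

```python
# Minimal sketch (assumed Hub id, config, and label convention): read one Russian pair.
from datasets import load_dataset

xnli_ru = load_dataset("xnli", "ru")
pair = xnli_ru["validation"][0]
print(pair["premise"])
print(pair["hypothesis"])
print(pair["label"])
```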

Read more

OpenCorpora Russian

OpenCorpora Russian is a linguistic dataset designed for natural language processing tasks in the Russian language. It includes a variety of annotated texts, covering different domains and genres. This dataset is valuable for training and evaluating models in tasks such as part-of-speech tagging, syntactic parsing, and named entity recognition.

Read more

RuSentiment

RuSentiment is a dataset for sentiment analysis in the Russian language, consisting of annotated texts from various sources, including social media and reviews. It contains labeled data for positive, negative, and neutral sentiments, making it valuable for training and evaluating models focused on understanding user opinions and emotions in natural language processing.

Read more

Lenta.Ru News Dataset

The Lenta.Ru News Dataset is a collection of news articles from the Russian news aggregator Lenta.ru, providing a rich source of text data for natural language processing tasks. It includes various articles across multiple categories, making it valuable for tasks such as topic modeling, sentiment analysis, and text classification.

Read more

RuDReC (Russian Drug Reaction Corpus)

RuDReC (Russian Drug Reaction Corpus) is a dataset of Russian-language consumer reviews about pharmaceutical products, annotated for entities such as drugs, adverse reactions, and symptoms, and for the relations between them. It provides valuable resources for training and evaluating models in natural language processing tasks focused on biomedical entity recognition, relation extraction, and information retrieval.

Read more

OpenSubtitles Parallel Corpora

OpenSubtitles Parallel Corpora is a multilingual dataset consisting of movie and TV show subtitles aligned in multiple languages. It provides a rich resource for training and evaluating machine translation models and linguistic studies, enabling researchers to analyze dialogue patterns and improve translation quality across languages.

Read more

Russian Poetry

The Russian Poetry dataset is a collection of poems written in the Russian language, encompassing various styles, authors, and periods. It serves as a valuable resource for natural language processing tasks, including text analysis, sentiment analysis, and model training focused on literary studies and cultural insights.

Read more

Kazakh TTS

Kazakh TTS is a dataset designed for training text-to-speech (TTS) models in the Kazakh language. It includes recorded audio samples along with corresponding text, providing a valuable resource for developing and evaluating speech synthesis systems that generate natural and intelligible Kazakh speech.

Read more

FineWeb-Edu

FineWeb-Edu is a subset of the FineWeb corpus filtered for educational content using a model-based quality classifier. It retains web pages with high educational value across various subjects, making it valuable for pretraining language models, developing educational tools, and enhancing content-based learning applications.

Read more

SmolLM Corpus

The SmolLM Corpus is a dataset designed for training small language models, containing a diverse collection of text data from various domains. It aims to provide resources for developing efficient and lightweight models suitable for applications with limited computational resources, making it valuable for researchers focusing on scalable natural language processing.

Read more

WildChat

WildChat is a dataset of real user conversations with ChatGPT collected in the wild, featuring diverse, informal, and spontaneous dialogues. It is valuable for developing and evaluating models that can engage in natural, context-aware interactions across different topics.

Read more

Dolma

Dolma is an open corpus of roughly three trillion tokens released by the Allen Institute for AI for training large language models. It includes a diverse mix of web text, academic papers, code, books, and encyclopedic content. This dataset is valuable for researchers and developers working on enhancing model performance and efficiency in various natural language processing tasks.

Read more

peS2o

peS2o is a dataset of open-access academic papers and abstracts derived from the Semantic Scholar Open Research Corpus (S2ORC), cleaned and filtered for language model pretraining. It provides a valuable resource for developing models that understand and generate scientific and scholarly text.

Read more

Wild Jailbreak

Wild Jailbreak is a dataset designed for evaluating the robustness of conversational AI models against adversarial prompts and manipulative queries. It includes various examples of inputs aimed at "jailbreaking" or bypassing model constraints, making it valuable for researchers focused on enhancing model security and understanding potential vulnerabilities.

Read more

AmberDatasets

AmberDatasets is a collection of datasets aimed at training and evaluating models for various natural language processing tasks. It includes annotated texts from multiple domains, providing resources for tasks such as sentiment analysis, text classification, and named entity recognition. This dataset is valuable for researchers and developers working on NLP applications.

Read more

Zyda

Zyda is a large-scale open dataset for language model pretraining, assembled by filtering and deduplicating several existing open text corpora. It provides a valuable resource for training and evaluating general-purpose language models in natural language processing.

Read more

MFAQ (Multilingual Frequently Asked Questions)

MFAQ (Multilingual Frequently Asked Questions) is a dataset that contains a collection of frequently asked questions across multiple languages. It is designed to support the development of multilingual question-answering systems, providing valuable resources for training models to understand and respond to user inquiries in various languages.

Read more

UpVoteWeb

UpVoteWeb is a dataset designed for evaluating the performance of recommendation systems and content moderation algorithms. It contains user interactions, including upvotes and downvotes on web content, making it valuable for training models that aim to improve user engagement and content relevance in online platforms.

Read more

OSCAR-2301

OSCAR-2301 is the January 2023 release of the OSCAR multilingual corpus, built from Common Crawl web data and covering more than 150 languages. It is valuable for training language models and for natural language processing tasks such as translation and cross-lingual understanding, providing a rich resource for researchers and developers working with multilingual applications.

Read more

CrossSum

CrossSum is a dataset designed for cross-lingual summarization tasks, containing pairs of articles and their corresponding summaries in multiple languages. It is valuable for training and evaluating models that aim to generate concise and coherent summaries across different languages, enhancing natural language processing applications in multilingual contexts.

Read more

StarCoderData

StarcoderData is a dataset designed for training and evaluating coding assistants and models in software development tasks. It includes a vast collection of code snippets, programming questions, and documentation across various programming languages. This dataset is valuable for enhancing natural language processing applications in coding and debugging.
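A minimal sketch, assuming the gated `bigcode/starcoderdata` repository on the Hugging Face Hub (access must be requested and an auth token configured), where `data_dir` selects a language subset:

```python
# Minimal sketch (assumed Hub id, gated access, and column name): stream Python files.
from datasets import load_dataset

starcoder = load_dataset("bigcode/starcoderdata", data_dir="python",
                         split="train", streaming=True)
file = next(iter(starcoder))
print(file["content"][:200])   # raw source-file text
```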

Read more

GlotCC-V1

GlotCC-V1 is a document-level multilingual corpus derived from Common Crawl, covering more than 1,000 languages with a particular focus on low-resource ones. It provides valuable resources for training and evaluating models in natural language processing that target broad multilingual understanding and interaction.

Read more