Computer vision

Annotated image and video datasets for object, face, and scene recognition.

ImageNet-AB

ImageNet-AB is a dataset derived from the ImageNet collection, specifically designed to facilitate the study of object recognition and classification tasks in computer vision. It features a subset of labeled images that are organized into various categories, providing a rich resource for training and evaluating machine learning models. ImageNet-AB focuses on diverse object appearances and complex scenes, making it a valuable tool for advancing research in visual recognition technologies.

ImageNet 1K Resized 256

ImageNet 1K Resized 256 is a subset of the ImageNet dataset, containing 1,000 categories of images resized to 256x256 pixels. This dataset is widely used in computer vision research for tasks such as image classification and object detection. It provides a diverse range of labeled images, making it an essential resource for training and evaluating machine learning models in visual recognition tasks.

COCO (Common Objects in Context)

COCO (Common Objects in Context) is a large-scale dataset designed for object detection, segmentation, and captioning tasks in computer vision. It contains over 330,000 images, with more than 2.5 million labeled instances of objects belonging to 80 different categories. The dataset features complex scenes where objects are presented in their natural contexts, providing a rich resource for training and evaluating machine learning models in various visual recognition applications.
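
As a rough sketch of how COCO-style annotations are typically organized, the snippet below builds a tiny in-memory annotation dictionary with `images`, `categories`, and `annotations` lists, then groups the boxes by image. The file names, IDs, and box values are made up for illustration, and the helper names `xywh_to_xyxy` and `boxes_by_image` are hypothetical. The `[x, y, width, height]` box layout follows the published COCO annotation format.

```python
from collections import defaultdict

# Tiny, made-up annotation set in the COCO layout (all values are illustrative).
coco = {
    "images": [
        {"id": 1, "file_name": "kitchen.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "street.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 3, "name": "car"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [100, 50, 80, 200]},
        {"id": 11, "image_id": 2, "category_id": 3, "bbox": [20, 300, 150, 90]},
        {"id": 12, "image_id": 2, "category_id": 1, "bbox": [400, 120, 60, 180]},
    ],
}


def xywh_to_xyxy(bbox):
    """Convert a COCO [x, y, width, height] box to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def boxes_by_image(dataset):
    """Map each file name to a list of (category name, corner-format box)."""
    names = {c["id"]: c["name"] for c in dataset["categories"]}
    files = {img["id"]: img["file_name"] for img in dataset["images"]}
    grouped = defaultdict(list)
    for ann in dataset["annotations"]:
        grouped[files[ann["image_id"]]].append(
            (names[ann["category_id"]], xywh_to_xyxy(ann["bbox"]))
        )
    return dict(grouped)


if __name__ == "__main__":
    for file_name, boxes in boxes_by_image(coco).items():
        print(file_name, boxes)
```

Many detection libraries expect corner-format boxes, so a conversion like `xywh_to_xyxy` is a common first step when loading COCO annotations.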

Open Images

Open Images is a large-scale dataset for image recognition and object detection, consisting of approximately 9 million labeled images across more than 600 object categories. It includes a diverse array of real-world scenes and objects, annotated with bounding boxes, segmentation masks, and image-level labels. This dataset is widely used in computer vision research to train and evaluate models for various tasks, including image classification, object detection, and visual relationship recognition.

CelebA

CelebA (CelebFaces Attributes Dataset) is a large-scale dataset designed for facial recognition and attribute detection tasks. It contains over 200,000 celebrity images, each annotated with 40 different facial attributes, such as gender, age, and various facial features. CelebA is widely used in computer vision research to train and evaluate models for tasks like facial recognition, emotion detection, and generative modeling of faces, making it a valuable resource for advancing studies in image processing and artificial intelligence.

MNIST

MNIST (Modified National Institute of Standards and Technology) is a widely used dataset for handwritten digit recognition. It contains 70,000 images of handwritten digits (0-9), with 60,000 training images and 10,000 test images. Each image is a grayscale 28x28 pixel representation of a single digit. MNIST is commonly used in machine learning and computer vision research as a benchmark for developing and evaluating classification algorithms, making it an essential resource for beginners and experienced researchers alike.
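
MNIST is commonly distributed in the binary IDX format, so a minimal sketch of reading the image file may help. The example below assumes the usual IDX3 layout (a big-endian header holding the magic number 2051, the image count, the row count, and the column count, followed by one unsigned byte per pixel). It parses a synthetic buffer rather than a real download, and the function name `parse_idx_images` is hypothetical.

```python
import struct


def parse_idx_images(buf):
    """Parse an IDX3 unsigned-byte image buffer into a list of row-major images."""
    magic, count, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != 2051:  # 0x00000803: unsigned bytes, 3 dimensions
        raise ValueError(f"unexpected magic number {magic}")
    size = rows * cols
    images = []
    for i in range(count):
        start = 16 + i * size
        pixels = buf[start:start + size]
        if len(pixels) != size:
            raise ValueError("buffer is shorter than the header claims")
        # Split the flat pixel run into `rows` lists of `cols` integers each.
        images.append([list(pixels[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


# Synthetic stand-in for a real file: two blank 28x28 images.
header = struct.pack(">IIII", 2051, 2, 28, 28)
fake_buf = header + bytes(2 * 28 * 28)
images = parse_idx_images(fake_buf)
print(len(images), len(images[0]), len(images[0][0]))
```

With a real file, the buffer would come from reading the (usually gzip-compressed) image file into memory. Each parsed image flattens to 28 x 28 = 784 pixel values.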

CIFAR-10

CIFAR-10 is a widely used dataset for image classification tasks, consisting of 60,000 32x32 color images in 10 different classes, with 6,000 images per class. The classes include airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. CIFAR-10 is commonly used in machine learning and computer vision research as a benchmark for developing and evaluating classification algorithms, making it an essential resource for both beginners and advanced practitioners.

HKR Dataset

The HKR Dataset (Hong Kong Restaurant Dataset) is a collection of restaurant-related images and textual data designed for tasks such as image recognition, sentiment analysis, and recommendation systems. It features various types of images, including food, dining environments, and customer interactions, along with corresponding reviews and ratings. This dataset is valuable for researchers and developers in the fields of computer vision and natural language processing, providing insights into customer preferences and dining experiences.

KZ Mushrooms Dataset

The KZ Mushrooms Dataset is a collection of data related to various mushroom species found in Kazakhstan, designed for tasks such as classification and species identification. It includes features such as physical characteristics, habitat information, and edibility status, providing valuable insights for researchers and enthusiasts in mycology. This dataset is useful for developing machine learning models aimed at identifying and classifying mushroom species, promoting safe foraging practices.

Fashion MNIST

Fashion MNIST is a dataset for image classification, consisting of 70,000 grayscale images of clothing items in 10 categories: T-shirts/tops, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags, and ankle boots. Each image is 28x28 pixels. Fashion MNIST is designed as a drop-in replacement for the original MNIST dataset, with the same image size and the same split of 60,000 training and 10,000 test images, which makes it a popular benchmark for machine learning algorithms while providing a more challenging set of images for model training and evaluation.

CIFAR-100

CIFAR-100 is a widely used dataset for image classification tasks, consisting of 60,000 32x32 color images in 100 different classes, with 600 images per class. The classes are grouped into 20 superclasses, providing a rich and diverse set of labeled images. CIFAR-100 is commonly used in machine learning and computer vision research as a benchmark for developing and evaluating classification algorithms, making it an essential resource for researchers and practitioners in the field.
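
The figures in this description can be cross-checked with a few lines of arithmetic. The sketch below derives the per-superclass counts from the numbers given above. The 500/100 per-class training/test split is an assumption taken from the standard CIFAR-100 release, not from the text, and all variable names are illustrative.

```python
NUM_FINE_CLASSES = 100
NUM_SUPERCLASSES = 20
IMAGES_PER_CLASS = 600
TRAIN_PER_CLASS = 500  # assumption: standard 500 training / 100 test images per class

total_images = NUM_FINE_CLASSES * IMAGES_PER_CLASS
fine_per_superclass = NUM_FINE_CLASSES // NUM_SUPERCLASSES
images_per_superclass = fine_per_superclass * IMAGES_PER_CLASS
train_images = NUM_FINE_CLASSES * TRAIN_PER_CLASS
test_images = total_images - train_images
bytes_per_image = 32 * 32 * 3  # uncompressed 8-bit RGB pixels

print("total images:", total_images)
print("fine classes per superclass:", fine_per_superclass)
print("images per superclass:", images_per_superclass)
print("train / test:", train_images, "/", test_images)
print("raw bytes per image:", bytes_per_image)
```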

PASCAL VOC

PASCAL VOC (Visual Object Classes) is a benchmark dataset for object detection, segmentation, and classification in computer vision. It includes a variety of images from 20 object categories, such as person, car, dog, and horse, with corresponding annotations for object bounding boxes and pixel-level segmentation masks. The dataset provides a rich resource for training and evaluating machine learning models, making it essential for research in visual recognition and understanding tasks.
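
PASCAL VOC bounding boxes are stored as one XML file per image. The sketch below parses a small, made-up annotation in that layout using only the Python standard library. The file name and box coordinates are invented for illustration, and `parse_voc` is a hypothetical helper rather than part of any official toolkit.

```python
import xml.etree.ElementTree as ET

# Made-up annotation in the VOC XML layout (file name and boxes are illustrative).
VOC_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>370</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc(xml_text):
    """Return (file name, list of {'name', 'bbox'}) from a VOC-style XML string."""
    root = ET.fromstring(xml_text.strip())
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        corners = tuple(
            int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        objects.append({"name": obj.findtext("name"), "bbox": corners})
    return root.findtext("filename"), objects


filename, objects = parse_voc(VOC_XML)
print(filename, objects)
```

Unlike COCO, the boxes here are already in corner format (`xmin`, `ymin`, `xmax`, `ymax`), so no width/height conversion is needed.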

LFW (Labeled Faces in the Wild)

LFW (Labeled Faces in the Wild) is a dataset designed for face recognition research, consisting of over 13,000 labeled images of faces collected from the web. The images include a diverse range of individuals with various poses, expressions, and lighting conditions. LFW is commonly used to evaluate the performance of face recognition algorithms and is an essential resource for researchers working on biometric recognition, computer vision, and related fields.

ADE20K

ADE20K is a large-scale dataset for semantic segmentation, containing over 20,000 images annotated with pixel-level labels across 150 object categories. The dataset includes a diverse range of scenes from indoor and outdoor environments, making it suitable for training and evaluating models for scene understanding and object recognition. ADE20K is widely used in computer vision research to advance the development of segmentation algorithms and enhance the understanding of complex scenes.
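
Segmentation models trained on pixel-level labels like these are often scored with mean intersection-over-union (mIoU). The following is a minimal pure-Python sketch of that metric on a toy 3x3 label map. The masks are made up, and the choice to average only over classes that appear in either mask is an assumption, since benchmarks differ in how they treat absent classes.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel agrees: it counts once toward both intersection and union.
                intersection[p] += 1
                union[p] += 1
            else:
                # Pixel disagrees: it enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Toy 3x3 label maps with three classes (0, 1, 2).
target = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 2],
]
pred = [
    [0, 0, 1],
    [0, 0, 1],
    [2, 2, 1],
]

print(round(mean_iou(pred, target, num_classes=3), 4))
```

In this toy case the per-class IoU values are 3/4, 2/4, and 2/3, so the mean is about 0.64.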

Oxford Pets

The Oxford Pets dataset is designed for image classification and segmentation tasks, containing roughly 7,400 images of 37 different cat and dog breeds. Each image is annotated with a breed label and a pixel-level segmentation mask for precise object recognition. This dataset is a valuable resource for researchers and developers in computer vision, supporting the development and evaluation of models for identifying pet breeds and understanding pet characteristics.

SUN397

SUN397 is a large-scale dataset designed for scene recognition, consisting of over 100,000 images from 397 different scene categories. Each category contains a diverse set of images depicting various environments, such as beaches, forests, urban areas, and indoor spaces. The dataset is annotated with scene labels and provides a rich resource for training and evaluating machine learning models in the field of computer vision, particularly for tasks related to scene classification and understanding.

Indoor Scenes

The Indoor Scenes dataset is designed for scene recognition tasks, containing a collection of images depicting various indoor environments. It includes multiple categories such as kitchens, living rooms, bedrooms, and offices, with thousands of labeled images per category. This dataset is valuable for training and evaluating machine learning models aimed at understanding and classifying indoor scenes, making it a useful resource for research in computer vision and robotics.

Stanford Dogs

The Stanford Dogs dataset is designed for fine-grained image classification, containing over 20,000 images of 120 different dog breeds. Each breed is represented by a diverse set of images, capturing various poses, backgrounds, and conditions. The dataset is annotated with breed labels, providing a valuable resource for training and evaluating machine learning models in tasks such as breed identification and visual recognition, making it essential for research in computer vision and animal classification.

Cityscapes Depth and Segmentation

The Cityscapes Depth and Segmentation dataset is designed for urban scene understanding, focusing on depth estimation and semantic segmentation tasks. It includes high-resolution images collected from various urban environments, annotated with pixel-level segmentation masks for different object categories and depth information. This dataset is valuable for training and evaluating models in computer vision, particularly for applications related to autonomous driving, urban planning, and robotics, enabling a comprehensive understanding of complex cityscapes.

CBIS-DDSM (Curated Breast Imaging Subset of the Digital Database for Screening Mammography)

The CBIS-DDSM (Curated Breast Imaging Subset of the Digital Database for Screening Mammography) is a dataset specifically designed for breast cancer research and mammography image analysis. It contains a curated collection of mammographic images, including both normal and abnormal findings, along with associated annotations such as cancer diagnoses and lesion classifications. This dataset is valuable for developing and evaluating machine learning models for tasks such as automated cancer detection, improving diagnostic accuracy, and advancing research in medical imaging.

EuroSAT

EuroSAT is a dataset designed for land use and land cover classification, consisting of over 27,000 labeled images collected from Sentinel-2 satellite imagery. The dataset includes 10 different classes, such as forests, grassland, and urban areas, providing a diverse range of geographical features. EuroSAT is valuable for training and evaluating machine learning models in remote sensing applications, enabling advancements in environmental monitoring, urban planning, and agriculture.

iNaturalist 2017

iNaturalist 2017 is a large-scale dataset designed for species classification and biodiversity research, consisting of more than 675,000 training and validation images of plants, animals, and fungi. The dataset includes annotations for over 5,000 species, collected from various contributors and sources. iNaturalist 2017 provides a rich resource for training and evaluating machine learning models in computer vision, particularly for applications related to species identification, ecological research, and conservation efforts.

UCF101

UCF101 is a dataset designed for action recognition in video clips, containing 13,320 videos across 101 action categories. The dataset includes a wide variety of activities, such as sports, daily actions, and human interactions, providing diverse examples for training and evaluating machine learning models. UCF101 is widely used in computer vision research to benchmark and develop algorithms for recognizing human actions in videos, making it a valuable resource for applications in surveillance, robotics, and human-computer interaction.

YouTube-8M

YouTube-8M is a large-scale video classification dataset containing over 8 million YouTube video IDs with associated labels across diverse categories. It is used for training and evaluating machine learning models for action recognition and video understanding, advancing video analysis technologies.

KITTI

The KITTI dataset is a widely used resource for autonomous driving and robotics research. It includes stereo camera images, LiDAR point clouds, and GPS/IMU data from real-world driving scenarios. Annotations cover tasks like stereo matching, object detection, and scene flow estimation, essential for developing navigation systems.

Visual Genome

Visual Genome is a dataset for visual understanding, containing over 108,000 images with detailed annotations of objects, attributes, and relationships. It includes region descriptions and question-answer pairs, making it valuable for tasks such as object detection, scene understanding, and visual question answering in computer vision research.

Flickr8k

Flickr8k is a dataset for image captioning, consisting of 8,000 images, each paired with five different captions. The dataset features a variety of scenes, making it useful for training and evaluating models in natural language processing and computer vision, particularly for tasks related to image description generation.

CamVid

CamVid is a dataset for semantic segmentation, containing over 700 images of urban scenes captured from a moving vehicle. Each image is annotated with pixel-level labels for 32 object categories, making it suitable for training and evaluating models in autonomous driving and computer vision applications focused on scene understanding.

CheXpert

CheXpert is a large dataset for chest X-ray image analysis, containing over 224,000 chest radiographs from more than 65,000 patients. Each image is labeled for 14 observations, including pneumonia, pleural effusion, and cardiomegaly, making it valuable for training and evaluating machine learning models in medical imaging and diagnostics.

DeepFashion

DeepFashion is a large-scale dataset for fashion recognition and analysis, containing over 800,000 images across various clothing categories. It includes annotations for attributes, landmarks, and instance segmentation, making it suitable for tasks such as clothing retrieval, outfit recommendation, and fashion item classification in computer vision.

300W

The 300W dataset is designed for facial landmark detection, containing over 3,000 images with annotated facial landmarks. It includes diverse expressions, poses, and occlusions, making it suitable for training and evaluating models in facial alignment and recognition tasks, particularly in computer vision applications.

EMNIST (Extended MNIST)

EMNIST (Extended MNIST) is a dataset for handwritten character recognition, expanding the original MNIST by including letters and additional handwritten characters. It consists of several subsets, such as EMNIST ByClass and ByMerge, with a total of over 800,000 images. EMNIST is useful for training and evaluating machine learning models in optical character recognition and related tasks.

LabelMe

LabelMe is a dataset for image annotation, providing over 50,000 images with detailed object segmentation masks and polygonal annotations. It includes a diverse range of scenes and objects, making it useful for training and evaluating models in computer vision tasks such as object detection, segmentation, and scene understanding.

Human3.6M

Human3.6M is a large-scale dataset for human motion capture, containing over 3.6 million 3D human poses captured in various activities. It includes detailed annotations of body joints and corresponding video sequences, making it valuable for training and evaluating models in human pose estimation, action recognition, and motion analysis in computer vision.

CASIA Gait Database

The CASIA Gait Database is a comprehensive dataset designed for gait recognition research, containing over 11,000 video sequences of individuals walking under various conditions. It includes data captured from multiple viewpoints and different walking styles, making it valuable for developing and evaluating algorithms in biometric recognition and computer vision applications.

CIFAR-10H

CIFAR-10H is an extension of the CIFAR-10 dataset, designed for human-centered image classification tasks. It covers the 10,000 images of the CIFAR-10 test set, each labeled by many human raters, so that every image comes with a distribution over the 10 classes rather than a single label. These soft labels capture human uncertainty and help evaluate how closely classifiers match human perception in image recognition.
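
One way human annotations of this kind are often used is as soft labels: the rater votes for an image are normalized into a probability distribution, and a model's prediction is scored against that distribution instead of a single class. The sketch below illustrates the idea with made-up vote counts and a made-up model output. The function names are hypothetical and this is not the dataset authors' evaluation code.

```python
import math


def soft_label(counts):
    """Normalize per-class rater vote counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def cross_entropy(soft, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft human label."""
    return -sum(q * math.log(max(p, eps)) for q, p in zip(soft, predicted))


# Made-up votes for one image over the 10 CIFAR-10 classes:
# 40 raters chose class 3 (cat), 10 chose class 5 (dog).
votes = [0, 0, 0, 40, 0, 10, 0, 0, 0, 0]
soft = soft_label(votes)

# Made-up model output that also hedges between cat and dog.
predicted = [0.01, 0.01, 0.01, 0.70, 0.01, 0.22, 0.01, 0.01, 0.01, 0.01]

print(soft)
print(cross_entropy(soft, predicted))
```

A prediction that spreads its probability the way the raters did scores better under this measure than one that is fully confident in a single class.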

MPII Human Pose Dataset

The MPII Human Pose Dataset is designed for human pose estimation tasks, containing over 25,000 images with annotated body joints. It includes various activities and poses captured in real-world scenarios, making it valuable for training and evaluating models in human pose detection, action recognition, and related applications in computer vision.

Visual Dialog

Visual Dialog is a dataset designed for multi-turn visual question answering. It contains over 120,000 images, each paired with a dialogue in which a series of questions about the image is asked and answered. This dataset is valuable for training and evaluating models that understand and generate natural language in the context of visual content, enhancing applications in computer vision and conversational AI.

Google Landmark Dataset

The Google Landmark Dataset is a large-scale dataset designed for landmark recognition and retrieval tasks, containing over 2 million images of landmarks from around the world. Each image is labeled with a specific landmark name, making it valuable for training and evaluating machine learning models in computer vision, particularly for applications related to image search and geographical information systems.

MegaFace

MegaFace is a large-scale dataset for facial recognition, designed to evaluate algorithms under various conditions. It contains over 4.7 million facial images from more than 672,000 identities. The dataset includes challenging conditions such as variations in pose, illumination, and expression, making it valuable for training and benchmarking face recognition systems.

Visual Question Answering (VQA)

The Visual Question Answering (VQA) dataset is designed for the task of answering questions about images. It contains over 265,000 images and 1.1 million questions, with corresponding answers. This dataset is valuable for training and evaluating models that integrate visual understanding and natural language processing, enhancing applications in AI and computer vision.

Fashion 150K

The Fashion 150K dataset is a large-scale collection of images for fashion item recognition, containing 150,000 images across various categories such as clothing, accessories, and footwear. Each image is annotated with detailed labels, making it useful for training and evaluating machine learning models in fashion-related tasks, including item classification and retrieval.

Tate Dataset

The Tate Dataset is a collection of high-quality images of artworks from the Tate galleries, including paintings, sculptures, and photographs. It provides annotations for various attributes such as artist, title, and medium, making it useful for training and evaluating machine learning models in art recognition and classification tasks, as well as for research in cultural heritage.

BIGstockimage2M

BIGstockimage2M is a large-scale dataset containing 2 million stock images covering a wide range of categories and themes. Each image is tagged with relevant keywords and descriptions, making it suitable for training and evaluating machine learning models in image classification, retrieval, and content-based image analysis.

RDD 2020 (Reddit Data Dataset 2020)

RDD 2020 (Reddit Data Dataset 2020) is a large-scale dataset containing over 1.4 billion posts and comments from Reddit. It includes data across various subreddits and topics, making it valuable for natural language processing tasks, sentiment analysis, and social media research. The dataset supports the development and evaluation of models focused on understanding online interactions.
