Introduction
What are Datasets?
A dataset in Labellerr is a standalone collection of files (images, videos, audio, documents, or text) that can be created independently and attached to one or multiple projects. This modular approach allows you to:
- Reuse the same dataset across multiple annotation projects
- Manage your data separately from project configurations
- Connect cloud storage (AWS S3, Google Cloud Storage) for seamless data access
- Enable advanced features like multimodal indexing
Supported Data Types
| Data Type | Description | Supported Extensions |
|---|---|---|
| image | Image files for visual annotation | .jpg, .jpeg, .png, .bmp, .tiff |
| video | Video content for temporal annotation | .mp4 |
| audio | Audio files for sound annotation | .mp3, .wav |
| document | Document files for text analysis | |
| text | Plain text files for text annotation | .txt |
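Before uploading, it can help to check which data type a file falls under. The following is a minimal sketch that uses only the extensions listed in the table above; the helper name `infer_data_type` and the case-insensitive matching are assumptions for illustration, not part of the Labellerr SDK. The `document` type is left out because the table lists no extensions for it.

```python
from pathlib import Path
from typing import Optional

# Extension lists copied from the table above.
# "document" is omitted because the table lists no extensions for it.
EXTENSIONS_BY_TYPE = {
    "image": {".jpg", ".jpeg", ".png", ".bmp", ".tiff"},
    "video": {".mp4"},
    "audio": {".mp3", ".wav"},
    "text": {".txt"},
}


def infer_data_type(filename: str) -> Optional[str]:
    """Return the data type whose extension list contains the file's suffix, or None.

    Hypothetical helper: matching is case-insensitive by assumption.
    """
    suffix = Path(filename).suffix.lower()
    for data_type, extensions in EXTENSIONS_BY_TYPE.items():
        if suffix in extensions:
            return data_type
    return None
```

A file whose extension is not in the table returns `None`, which is a cue to check it manually before adding it to a dataset.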
Creating Datasets
Import Required Modules
Required Imports
Method 1: Create Dataset with Local Files
- Upload from Folder
- Upload Specific Files
Create Dataset from Folder
Create Dataset with Folder
Limitations:
- Maximum of 2,500 files per folder
- Total folder size should not exceed 2.5 GB
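A local pre-flight check can catch a folder that exceeds these limits before an upload is attempted. This is a sketch using only the Python standard library. It assumes that 1 GB means 1024³ bytes and that files in subfolders count toward the limits; neither is confirmed by this page, and `check_folder_limits` is a hypothetical helper, not an SDK function.

```python
import os

MAX_FILES = 2500                 # limit stated above
MAX_BYTES = int(2.5 * 1024**3)   # 2.5 GB, assuming 1 GB = 1024**3 bytes


def check_folder_limits(folder: str) -> dict:
    """Count files and total size under `folder` (including subfolders)
    and report whether both stay within the stated limits."""
    file_count = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(root, name))
    return {
        "file_count": file_count,
        "total_bytes": total_bytes,
        "within_limits": file_count <= MAX_FILES and total_bytes <= MAX_BYTES,
    }
```

If `within_limits` is `False`, splitting the folder into smaller batches is one way to stay under both limits.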
Method 2: Create Dataset with AWS S3 Connection
Connect AWS S3 Bucket
Create Dataset with AWS S3
The SDK creates a connection to your S3 bucket and links it to the dataset. Files are accessed directly from S3 without local downloads.
Method 3: Create Dataset with Google Cloud Storage
Connect GCS Bucket
Create Dataset with GCS
Method 4: Use Existing Cloud Connection
Reuse Connection
If you’ve already created a cloud connection, you can reuse it for new datasets:
Create Dataset with Existing Connection
Connection Reuse: You can create multiple datasets from the same connection by specifying different paths within your cloud storage.
Working with Datasets
Retrieve an Existing Dataset
Get Dataset by ID
Retrieve Dataset
Status Codes:
- 300: Dataset is ready and contains files
- 501: Dataset not found or invalid
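The status codes above can be turned into readable messages with a small lookup. This is an illustrative sketch: `describe_status` is a hypothetical helper, and it covers only the two codes listed here, reporting any other value as unknown rather than guessing its meaning.

```python
# Meanings copied from the status codes listed above.
STATUS_MEANINGS = {
    300: "ready: dataset is ready and contains files",
    501: "not found: dataset not found or invalid",
}


def describe_status(status_code: int) -> str:
    """Return a readable description for a documented status code."""
    return STATUS_MEANINGS.get(status_code, f"unknown status code {status_code}")
```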
Fetch Files from Dataset
List Dataset Files
Fetch Files
Enable Multimodal Indexing
Multimodal Indexing Feature
Turn on AI-powered multimodal indexing for your dataset to support semantic search and intelligent file organization.
Enable Multimodal Indexing
What is Multimodal Indexing?
Multimodal indexing uses AI to analyze and understand the content of your files (images, videos, audio, text), enabling:
- Natural language search across your dataset
- Semantic similarity detection
- Intelligent file grouping and recommendations
- Enhanced AI-assisted annotation workflows
Delete a Dataset
Delete Dataset
Sync Cloud Datasets
Synchronize Cloud Storage
For datasets connected to cloud storage (AWS S3 or GCS), you can sync to fetch newly added files:
Sync Dataset
Complete Workflow Example
End-to-End Dataset Creation
Complete Dataset Workflow
Error Handling
Best Practices for Error Handling
Error Handling Example
Troubleshooting Cloud Connections
Common Issues
Troubleshooting Reference
| Issue | Symptom | Solution |
|---|---|---|
| Dataset status 500 | “Dataset created successfully” but status shows 500/Failed | Connection lacks bucket permissions; test the connection first |
| “Dataset has no files” | Files exist in S3/GCS but dataset shows 0 files | IAM user/service account is missing read permissions |
| Internal server error | Error with tracking ID in UI | Check connection permissions, contact support with tracking ID |
Best Practice: Always Test Connection First
Test Connection Before Creating Datasets
Always test your connection before creating datasets to catch permission issues early:
Test Connection
Path Formats:
- AWS S3: s3://bucket-name/path/to/folder/
- GCS: gs://bucket-name/path/to/folder/
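A malformed path is easy to catch locally before it reaches the connection test. The sketch below checks that a path uses the scheme shown above for its connector and names a bucket. `validate_cloud_path` is a hypothetical helper; the connector names `aws` and `gcp` are taken from the `connector_type` values in the configuration reference on this page, and the check says nothing about whether the bucket exists or is readable.

```python
from urllib.parse import urlparse

# URL scheme expected for each connector type (from the path formats above).
SCHEME_BY_CONNECTOR = {"aws": "s3", "gcp": "gs"}


def validate_cloud_path(path: str, connector_type: str) -> bool:
    """Return True if `path` looks like scheme://bucket-name/... for the connector."""
    expected_scheme = SCHEME_BY_CONNECTOR.get(connector_type)
    if expected_scheme is None:
        return False
    parsed = urlparse(path)
    # netloc holds the bucket name; it must not be empty.
    return parsed.scheme == expected_scheme and bool(parsed.netloc)
```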
Checking Dataset Status After Creation
Verify Dataset Status
Always check dataset status after creation to ensure it processed successfully:
Check Dataset Status
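One way to structure this check is a polling loop around whatever call returns the dataset's status code. The sketch below is deliberately SDK-agnostic: `fetch_status` is a hypothetical callable you would supply, and the loop assumes that 300 means ready, that 500 and 501 mean failure (as listed elsewhere on this page), and that any other code means the dataset is still processing. That last assumption is not confirmed by this page.

```python
import time
from typing import Callable

READY_STATUS = 300            # "ready and contains files"
FAILED_STATUSES = {500, 501}  # failed / not found or invalid


def wait_until_ready(
    fetch_status: Callable[[], int],
    attempts: int = 10,
    delay_seconds: float = 5.0,
) -> bool:
    """Poll `fetch_status` until the dataset is ready.

    Returns True when the ready status is seen, False if attempts run out,
    and raises RuntimeError on a failure status.
    """
    for attempt in range(attempts):
        status = fetch_status()
        if status == READY_STATUS:
            return True
        if status in FAILED_STATUSES:
            raise RuntimeError(f"Dataset creation failed with status {status}")
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # assumed: other codes mean "still processing"
    return False
```

Passing the status lookup in as a callable keeps the loop independent of any particular client object, so it can be exercised with a stub before being wired to real calls.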
Common Use Cases
Reusable Training Data
Create a master dataset of training images that can be used across multiple annotation projects with different labeling requirements.
Cloud Storage Integration
Connect your existing AWS S3 or GCS buckets to avoid data duplication and manage files directly from cloud storage.
Multi-Project Workflows
Use the same dataset for different annotation tasks, such as object detection in one project and segmentation in another.
Incremental Data Addition
Sync cloud datasets to continuously add new data to ongoing projects without manual uploads.
Dataset Configuration Reference
DatasetConfig Parameters
| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| dataset_name | String | Yes | Name of the dataset | “Training Images 2024” |
| data_type | String | Yes | Type of data in dataset | “image”, “video”, “audio”, “document”, “text” |
| dataset_description | String | No | Description of dataset contents | “Customer-provided training data” |
| connector_type | String | No | Type of connector (default: “local”) | “local”, “aws”, “gcp” |
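The constraints in this table can be mirrored in a small validating structure. The class below is a stand-in written for illustration, not the SDK's own `DatasetConfig`: the name `DatasetConfigSketch` is hypothetical, and the empty-string default for `dataset_description` is an assumption. The field names, allowed values, and the `local` default come from the table.

```python
from dataclasses import dataclass

VALID_DATA_TYPES = {"image", "video", "audio", "document", "text"}
VALID_CONNECTOR_TYPES = {"local", "aws", "gcp"}


@dataclass
class DatasetConfigSketch:
    """Illustrative stand-in mirroring the parameter table; not the SDK class."""

    dataset_name: str                 # required
    data_type: str                    # required
    dataset_description: str = ""     # optional; empty default is an assumption
    connector_type: str = "local"     # optional; default per the table

    def __post_init__(self) -> None:
        if not self.dataset_name:
            raise ValueError("dataset_name is required")
        if self.data_type not in VALID_DATA_TYPES:
            raise ValueError(f"unsupported data_type: {self.data_type!r}")
        if self.connector_type not in VALID_CONNECTOR_TYPES:
            raise ValueError(f"unsupported connector_type: {self.connector_type!r}")
```

For example, `DatasetConfigSketch("Training Images 2024", "image")` passes validation and falls back to the `local` connector, while an unlisted `data_type` raises `ValueError`.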
Related Documentation
Create Projects
Learn how to create projects using standalone datasets
Retrieve Datasets
View and manage existing datasets and projects
Getting Started
SDK installation and initialization guide
For technical support, contact [email protected]

