
Introduction

What are Datasets?

A dataset in Labellerr is a standalone collection of files (images, videos, audio, documents, or text) that can be created independently and attached to one or multiple projects. This modular approach allows you to:
  • Reuse the same dataset across multiple annotation projects
  • Manage your data separately from project configurations
  • Connect cloud storage (AWS S3, Google Cloud Storage) for seamless data access
  • Enable advanced features like multimodal indexing

Supported Data Types

Data Type | Description | Supported Extensions
image | Image files for visual annotation | .jpg, .jpeg, .png, .bmp, .tiff
video | Video content for temporal annotation | .mp4
audio | Audio files for sound annotation | .mp3, .wav
document | Document files for text analysis | .pdf
text | Plain text files for text annotation | .txt
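
If you are routing mixed files programmatically, a small helper can map file extensions to the data_type values above. This is an illustrative sketch; EXTENSION_MAP and infer_data_type are not part of the SDK:
Infer Data Type from Extension
from pathlib import Path

# Mapping taken from the table above (illustrative, not an SDK constant)
EXTENSION_MAP = {
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".bmp": "image", ".tiff": "image",
    ".mp4": "video",
    ".mp3": "audio", ".wav": "audio",
    ".pdf": "document",
    ".txt": "text",
}

def infer_data_type(file_path: str) -> str:
    """Return the Labellerr data_type for a file, per the table above."""
    ext = Path(file_path).suffix.lower()
    if ext not in EXTENSION_MAP:
        raise ValueError(f"Unsupported extension: {ext}")
    return EXTENSION_MAP[ext]

print(infer_data_type("photo.JPG"))  # -> image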

Creating Datasets

Import Required Modules

Required Imports
from labellerr.client import LabellerrClient
from labellerr.core.schemas import DatasetConfig
from labellerr.core.datasets import (
    create_dataset_from_local,
    create_dataset_from_connection,
    LabellerrDataset
)

Method 1: Create Dataset with Local Files


Create Dataset from Folder

Create Dataset with Folder
from labellerr.client import LabellerrClient
from labellerr.core.schemas import DatasetConfig
from labellerr.core.datasets import create_dataset_from_local

client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

dataset = create_dataset_from_local(
    client=client,
    dataset_config=DatasetConfig(
        dataset_name="My Image Dataset",
        dataset_description="A collection of images for object detection",
        data_type="image"
    ),
    folder_to_upload="path/to/your/image/folder"
)

print(f"Dataset created with ID: {dataset.dataset_id}")
print(f"Total files: {dataset.files_count}")
Limitations:
  • Maximum of 2,500 files per folder
  • Total folder size should not exceed 2.5 GB
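
To catch oversized uploads early, you can check a folder against these limits before calling create_dataset_from_local. The helper below is an illustrative sketch, not an SDK function:
Pre-flight Upload Check
import os

# Documented limits: 2,500 files and 2.5 GB per folder
MAX_FILES = 2500
MAX_BYTES = int(2.5 * 1024**3)

def check_upload_limits(folder: str):
    paths = [os.path.join(root, name)
             for root, _, names in os.walk(folder) for name in names]
    total = sum(os.path.getsize(p) for p in paths)
    if len(paths) > MAX_FILES:
        raise ValueError(f"{len(paths)} files exceeds the {MAX_FILES}-file limit")
    if total > MAX_BYTES:
        raise ValueError(f"{total / 1024**3:.2f} GB exceeds the 2.5 GB limit")
    return len(paths), total

count, size = check_upload_limits("path/to/your/image/folder")
print(f"OK to upload: {count} files, {size / 1024**2:.1f} MB")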

Method 2: Create Dataset with AWS S3 Connection

Connect AWS S3 Bucket

Create Dataset with AWS S3
from labellerr.client import LabellerrClient
from labellerr.core.schemas import DatasetConfig, AWSConnectionParams, DatasetDataType, ConnectionType
from labellerr.core.datasets import create_dataset_from_connection
from labellerr.core.connectors import LabellerrS3Connection

client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

# Create AWS S3 connection
s3_connection = LabellerrS3Connection.create_connection(
    client=client,
    params=AWSConnectionParams(
        aws_access_key="your_aws_access_key",
        aws_secrets_key="your_aws_secret_key",
        path="s3://my-s3-bucket/path/to/data/",
        data_type=DatasetDataType.image,
        connection_type=ConnectionType._IMPORT,
        name="My S3 Import Connection",
        description="AWS S3 bucket for image datasets"
    )
)

# Create dataset using the connection
dataset = create_dataset_from_connection(
    client=client,
    dataset_config=DatasetConfig(
        dataset_name="S3 Image Dataset",
        dataset_description="Images stored in AWS S3",
        data_type="image"
    ),
    connection=s3_connection,
    path="path/to/data/in/bucket"  # Relative path within the bucket
)

print(f"Dataset created with S3 connection: {dataset.dataset_id}")
The SDK creates a connection to your S3 bucket and links it to the dataset. Files are accessed directly from S3 without local downloads.

Method 3: Create Dataset with Google Cloud Storage

Connect GCS Bucket

Create Dataset with GCS
from labellerr.client import LabellerrClient
from labellerr.core.schemas import DatasetConfig, GCSConnectionParams, DatasetDataType, ConnectionType
from labellerr.core.datasets import create_dataset_from_connection
from labellerr.core.connectors import LabellerrGCSConnection

client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

# Create GCS connection
gcs_connection = LabellerrGCSConnection.create_connection(
    client=client,
    params=GCSConnectionParams(
        svc_account_json="path/to/service-account-key.json",
        path="gs://my-gcs-bucket/path/to/data/",
        data_type=DatasetDataType.video,
        connection_type=ConnectionType._IMPORT,
        name="My GCS Import Connection",
        description="Google Cloud Storage bucket for video datasets"
    )
)

# Create dataset using the connection
dataset = create_dataset_from_connection(
    client=client,
    dataset_config=DatasetConfig(
        dataset_name="GCS Video Dataset",
        dataset_description="Videos stored in Google Cloud Storage",
        data_type="video"
    ),
    connection=gcs_connection,
    path="path/to/data/in/bucket"  # Relative path within the bucket
)

print(f"Dataset created with GCS connection: {dataset.dataset_id}")

Method 4: Use Existing Cloud Connection

Reuse Connection

If you’ve already created a cloud connection, you can reuse it for new datasets:
Create Dataset with Existing Connection
from labellerr.client import LabellerrClient
from labellerr.core.schemas import DatasetConfig
from labellerr.core.datasets import create_dataset_from_connection
from labellerr.core.connectors import LabellerrConnection

client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

# Get existing connection
connection = LabellerrConnection(client=client, connection_id="existing_connection_id_here")

# Create dataset using existing connection
dataset = create_dataset_from_connection(
    client=client,
    dataset_config=DatasetConfig(
        dataset_name="Reusing S3 Connection",
        data_type="image"
    ),
    connection=connection,
    path="path/to/data/in/bucket"
)

print(f"Dataset created using existing connection: {dataset.dataset_id}")
Connection Reuse: You can create multiple datasets from the same connection by specifying different paths within your cloud storage.
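
For example, reusing the client and connection objects from the snippet above, you can create one dataset per prefix in your bucket (the prefixes here are placeholders):
Create Multiple Datasets from One Connection
# One dataset per bucket prefix, all sharing the same connection
for name, prefix in [
    ("Train Split", "datasets/train/"),
    ("Validation Split", "datasets/val/"),
]:
    split_dataset = create_dataset_from_connection(
        client=client,
        dataset_config=DatasetConfig(dataset_name=name, data_type="image"),
        connection=connection,
        path=prefix,  # placeholder path within your bucket
    )
    print(f"{name}: {split_dataset.dataset_id}")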

Working with Datasets

Retrieve an Existing Dataset

Get Dataset by ID

Retrieve Dataset
from labellerr.client import LabellerrClient
from labellerr.core.datasets import LabellerrDataset

client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

dataset = LabellerrDataset(client=client, dataset_id="your_dataset_id")

# Access dataset properties
print(f"Dataset ID: {dataset.dataset_id}")
print(f"Data Type: {dataset.data_type}")
print(f"Files Count: {dataset.files_count}")
print(f"Status Code: {dataset.status_code}")
Status Codes:
  • 300: Dataset is ready and contains files
  • 501: Dataset not found or invalid
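
A minimal branch on these codes, reusing the dataset object above:
Check Status Code
# Branch on the documented status codes
if dataset.status_code == 300:
    print(f"Ready: {dataset.files_count} files available")
elif dataset.status_code == 501:
    print("Dataset not found or invalid - check the dataset ID")
else:
    print(f"Unexpected status: {dataset.status_code}")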

Fetch Files from Dataset

List Dataset Files

Fetch Files
from labellerr.client import LabellerrClient
from labellerr.core.datasets import LabellerrDataset

client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

dataset = LabellerrDataset(client=client, dataset_id="your_dataset_id")
files = dataset.fetch_files()

print(f"Retrieved {len(files)} files from dataset")
for file in files:
    print(f"File ID: {file['file_id']}, Name: {file['file_name']}")

Enable Multimodal Indexing

Multimodal Indexing Feature

Turn on AI-powered multimodal indexing for your dataset to enable semantic search and intelligent file organization.
Enable Multimodal Indexing
from labellerr.client import LabellerrClient
from labellerr.core.datasets import LabellerrDataset

client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

dataset = LabellerrDataset(client=client, dataset_id="your_dataset_id")
result = dataset.enable_multimodal_indexing(is_multimodal=True)

print(f"Multimodal indexing enabled: {result}")
What is Multimodal Indexing?
Multimodal indexing uses AI to analyze and understand the content of your files (images, videos, audio, text), enabling:
  • Natural language search across your dataset
  • Semantic similarity detection
  • Intelligent file grouping and recommendations
  • Enhanced AI-assisted annotation workflows

Delete a Dataset

Delete Dataset

Delete Dataset
from labellerr.client import LabellerrClient
from labellerr.core.datasets import LabellerrDataset

client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

dataset = LabellerrDataset(client=client, dataset_id="dataset_to_delete")
result = dataset.delete_dataset(dataset_id=dataset.dataset_id)

print(f"Dataset deleted: {result}")
Caution: Deleting a dataset will remove it permanently. Ensure it’s not attached to any active projects.
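
Because deletion is irreversible, you may want a confirmation guard around the call. A minimal sketch (delete_with_confirmation is not part of the SDK):
Guarded Delete
# Require the caller to re-type the dataset ID before deleting
def delete_with_confirmation(dataset):
    answer = input(f"Type '{dataset.dataset_id}' to confirm permanent deletion: ")
    if answer.strip() != dataset.dataset_id:
        print("Deletion aborted.")
        return None
    return dataset.delete_dataset(dataset_id=dataset.dataset_id)

result = delete_with_confirmation(dataset)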

Sync Cloud Datasets

Synchronize Cloud Storage

For datasets connected to cloud storage (AWS S3 or GCS), you can sync to fetch newly added files:
Sync Dataset
from labellerr.client import LabellerrClient
from labellerr.core.datasets import LabellerrDataset
from labellerr.core.connectors import LabellerrConnection

client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

dataset = LabellerrDataset(client=client, dataset_id="your_dataset_id")

# Reuse the cloud connection the dataset was created from
connection = LabellerrConnection(client=client, connection_id="your_connection_id")

# `project` must be the full project object the dataset is attached to,
# not a project ID (see the project documentation for how to obtain it)
result = dataset.sync_datasets(
    project=project,        # pass the entire project object
    path="path/in/bucket",
    data_type="image",
    connection=connection   # pass the entire connection object
)

print(f"Dataset synchronized: {result}")
Use this feature when new files are added to your cloud storage bucket and you want to make them available in your Labellerr dataset without creating a new dataset.
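
If your bucket receives files continuously, you could wrap the sync call in a simple schedule. A naive sketch, assuming the dataset, project, and connection objects from the snippet above (the one-hour interval is arbitrary):
Periodic Sync
import time

# Re-sync once an hour so newly uploaded files appear in the dataset
while True:
    result = dataset.sync_datasets(
        project=project,        # the attached project object, as above
        path="path/in/bucket",
        data_type="image",
        connection=connection,
    )
    print(f"Sync result: {result}")
    time.sleep(3600)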

Complete Workflow Example

End-to-End Dataset Creation

Complete Dataset Workflow
from labellerr.client import LabellerrClient
from labellerr.core.schemas import DatasetConfig
from labellerr.core.datasets import create_dataset_from_local, LabellerrDataset

client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

# Step 1: Create dataset with local files
dataset = create_dataset_from_local(
    client=client,
    dataset_config=DatasetConfig(
        dataset_name="Production Image Dataset",
        dataset_description="High-quality images for production annotation",
        data_type="image"
    ),
    folder_to_upload="path/to/images"
)

# Step 2: Wait for dataset processing to complete
print(f"Dataset ID: {dataset.dataset_id}")
dataset.status()  # Wait for dataset to be ready
print(f"Files uploaded: {dataset.files_count}")

# Step 3: Enable multimodal indexing
indexing_result = dataset.enable_multimodal_indexing(is_multimodal=True)
print(f"Multimodal indexing enabled: {indexing_result}")

# Step 4: Fetch files for verification
files = dataset.fetch_files()
print(f"Total files in dataset: {len(files)}")

# Now this dataset can be attached to one or more projects
print(f"Dataset {dataset.dataset_id} is ready to be used in projects!")

Error Handling

Best Practices for Error Handling

Error Handling Example
from labellerr.client import LabellerrClient
from labellerr.core.schemas import DatasetConfig
from labellerr.core.datasets import create_dataset_from_local, LabellerrDataset
from labellerr.core.exceptions import LabellerrError

client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

try:
    dataset = create_dataset_from_local(
        client=client,
        dataset_config=DatasetConfig(dataset_name="Test Dataset", data_type="image"),
        folder_to_upload="path/to/folder"
    )
    print(f"Dataset created successfully: {dataset.dataset_id}")
    
    # Wait for processing to complete
    dataset.status()
    print(f"Dataset is ready with {dataset.files_count} files")
except LabellerrError as e:
    print(f"Dataset creation failed: {str(e)}")

Troubleshooting Cloud Connections

Required Permissions: Before creating datasets from cloud storage, ensure your IAM user (S3) or service account (GCS) has the required permissions.

Common Issues

Troubleshooting Reference

Issue | Symptom | Solution
Dataset status 500 | "Dataset created successfully" but status shows 500/Failed | Connection lacks bucket permissions; test the connection first
"Dataset has no files" | Files exist in S3/GCS but the dataset shows 0 files | IAM user/service account is missing read permissions
Internal server error | Error with a tracking ID in the UI | Check connection permissions; contact support with the tracking ID

Best Practice: Always Test Connection First

Test Connection Before Creating Datasets

Always test your connection before creating datasets to catch permission issues early:
Test Connection
from labellerr.client import LabellerrClient
from labellerr.core.connectors import LabellerrConnection
from labellerr.core.schemas import ConnectionType, DatasetDataType

# Initialize client
client = LabellerrClient(
    api_key='your_api_key',
    api_secret='your_api_secret',
    client_id='your_client_id'
)

# Get your existing connection
connection = LabellerrConnection(client=client, connection_id="your_connection_id")

# Test the connection on your specific path
test_result = connection.test(
    path="s3://your-bucket/path/to/data/",  # or gs:// for GCS
    connection_type=ConnectionType._IMPORT,
    data_type=DatasetDataType.image
)
print(f"Connection test: {test_result}")
Path Formats:
  • AWS S3: s3://bucket-name/path/to/folder/
  • GCS: gs://bucket-name/path/to/folder/
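
A quick sanity check on these formats before calling connection.test can save a round trip. Illustrative helper, not part of the SDK:
Validate Cloud Path
# Validate the cloud path shape before hitting the API
def validate_cloud_path(path: str) -> str:
    if not path.startswith(("s3://", "gs://")):
        raise ValueError("Path must start with s3:// (AWS) or gs:// (GCS)")
    if not path.endswith("/"):
        raise ValueError("Path should reference a folder and end with '/'")
    return path

validate_cloud_path("s3://your-bucket/path/to/data/")  # OK
validate_cloud_path("gs://your-bucket/path/to/data/")  # OK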

Checking Dataset Status After Creation

Verify Dataset Status

Always check dataset status after creation to ensure it processed successfully:
Check Dataset Status
from labellerr.core.datasets import create_dataset_from_connection

# Create dataset
dataset = create_dataset_from_connection(
    client=client,
    dataset_config=dataset_config,
    connection=connection,
    path="s3://your-bucket/path/to/data/"
)
print(f"Dataset ID: {dataset.dataset_id}")

# Check status - this waits for processing to complete
dataset.status()

# Verify status code
print(f"Status Code: {dataset.status_code}")
print(f"Files Count: {dataset.files_count}")

# Status codes:
# 300 = Ready (success)
# 500 = Failed (usually permissions issue)
If status_code is 500, the dataset creation failed. This is usually due to:
  1. Missing bucket permissions on the IAM user/service account
  2. Invalid path format
  3. Bucket doesn’t exist or is inaccessible
Check the connection permissions and test the connection before retrying.
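
Putting the two checks together, a failed creation can be followed immediately by a connection test on the same path to narrow down the cause (reusing the dataset and connection objects from the snippets above):
Diagnose a Failed Dataset
from labellerr.core.schemas import ConnectionType, DatasetDataType

# Wait for processing, then re-test the connection if creation failed
dataset.status()
if dataset.status_code == 500:
    test_result = connection.test(
        path="s3://your-bucket/path/to/data/",
        connection_type=ConnectionType._IMPORT,
        data_type=DatasetDataType.image,
    )
    print(f"Connection test after failure: {test_result}")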

Common Use Cases

Reusable Training Data

Create a master dataset of training images that can be used across multiple annotation projects with different labeling requirements.

Cloud Storage Integration

Connect your existing AWS S3 or GCS buckets to avoid data duplication and manage files directly from cloud storage.

Multi-Project Workflows

Use the same dataset for different annotation tasks - object detection in one project, segmentation in another.

Incremental Data Addition

Sync cloud datasets to continuously add new data to ongoing projects without manual uploads.

Dataset Configuration Reference

DatasetConfig Parameters

Parameter | Type | Required | Description | Example Value
dataset_name | String | Yes | Name of the dataset | "Training Images 2024"
data_type | String | Yes | Type of data in the dataset | "image", "video", "audio", "document", "text"
dataset_description | String | No | Description of dataset contents | "Customer-provided training data"
connector_type | String | No | Type of connector (default: "local") | "local", "aws", "gcp"
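
For reference, a config that sets every documented parameter (connector_type may be omitted to accept the "local" default):
Full DatasetConfig Example
from labellerr.core.schemas import DatasetConfig

config = DatasetConfig(
    dataset_name="Training Images 2024",
    data_type="image",
    dataset_description="Customer-provided training data",
    connector_type="aws",  # optional; defaults to "local"
)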

For technical support, contact [email protected]