File Management Guide

This guide covers uploading, processing, and managing files in Satori enclaves.

Supported File Types

Satori accepts the following file types:

Documents

PDF: application/pdf
Text: text/plain, text/csv, text/tsv
Word: .docx, .doc
Excel: .xlsx, .xls
PowerPoint: .pptx, .ppt
OpenDocument: .odt, .ods, .odp
Other: JSON, XML, RTF

Images

JPEG, PNG, GIF, WebP, SVG, TIFF, BMP

Video (with transcription)

MP4, MPEG, AVI, MOV, WMV, WebM, MKV, FLV

Audio (with transcription)

MP3, WAV, OGG, M4A, AAC, MIDI
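
A client-side check can catch unsupported files before an upload is attempted. The following is a minimal sketch, not part of the Satori API: the extension sets are an assumed mapping from the type names listed above (for example, .jpg for JPEG and .mid for MIDI), and the helper name file_category is hypothetical. The server remains the authority on what is accepted.

```python
from pathlib import Path

# Assumed mapping from the listed types to common file extensions.
# The exact set the API accepts may differ; treat this as a hint only.
SUPPORTED_EXTENSIONS = {
    "documents": {".pdf", ".txt", ".csv", ".tsv", ".docx", ".doc", ".xlsx",
                  ".xls", ".pptx", ".ppt", ".odt", ".ods", ".odp", ".json",
                  ".xml", ".rtf"},
    "images": {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".tiff",
               ".tif", ".bmp"},
    "video": {".mp4", ".mpeg", ".mpg", ".avi", ".mov", ".wmv", ".webm",
              ".mkv", ".flv"},
    "audio": {".mp3", ".wav", ".ogg", ".m4a", ".aac", ".mid", ".midi"},
}


def file_category(file_path):
    """Return the category for a path, or None if its extension is not listed."""
    ext = Path(file_path).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if ext in extensions:
            return category
    return None
```

A None result suggests the upload would likely be rejected with 415 Unsupported Media Type.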

Uploading Files

Basic Upload

curl -X POST "{api_host}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "file=@/path/to/document.pdf"

Upload with Metadata

curl -X POST "{api_host}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "file=@document.pdf" \
  -F 'metadata={"author": "John Doe", "category": "research", "date": "2025-01-15"}'

Upload with Webhook

curl -X POST "{api_host}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>" \
  -F "file=@document.pdf" \
  -F "webhook_url=https://your-server.com/webhook/file-processed"

Python Example

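import json

import requests

# BASE_URL, JWT_TOKEN, and tenant_id are assumed to be defined elsewhere

def upload_file(file_path, enclave_id, metadata=None, webhook_url=None):
    """Upload a file to an enclave and return the parsed JSON response."""
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/"
    headers = {"Authorization": f"Bearer {JWT_TOKEN}"}

    data = {}
    if metadata:
        # Metadata is sent as a JSON-encoded form field
        data["metadata"] = json.dumps(metadata)
    if webhook_url:
        data["webhook_url"] = webhook_url

    # Open the file in a context manager so the handle is always closed
    with open(file_path, "rb") as f:
        response = requests.post(url, headers=headers, files={"file": f}, data=data)
    response.raise_for_status()
    return response.json()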

# Usage
file_info = upload_file(
    "document.pdf",
    enclave_id,
    metadata={"author": "John Doe", "category": "research"},
    webhook_url="https://myapp.com/webhook"
)
print(f"File uploaded: {file_info['id']}, Status: {file_info['status']}")

JavaScript/TypeScript Example

// tenantId and token are assumed to be in scope (for example, from your auth context)
async function uploadFile(
  file: File,
  enclaveId: string,
  metadata?: Record<string, unknown>,
  webhookUrl?: string
) {
  const formData = new FormData();
  formData.append("file", file);

  if (metadata) {
    formData.append("metadata", JSON.stringify(metadata));
  }
  if (webhookUrl) {
    formData.append("webhook_url", webhookUrl);
  }

  const response = await fetch(
    `/api/tenants/${tenantId}/enclaves/${enclaveId}/files/`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
      },
      body: formData,
    }
  );

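  // Surface HTTP errors instead of parsing an error body as a file record
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
  return response.json();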
}

File Processing Pipeline

Files go through several processing stages:

pending → File uploaded, queued for processing
processing → Content extraction in progress
clearing_artifacts → Cleaning up temporary files
building_artifacts → Creating vector embeddings
classifying → AI classification (optional)
ready → File ready for queries
failed → Processing failed (check logs)
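
When tracking a file through these stages, it helps to distinguish terminal states from in-progress ones. The sketch below models the stage names listed above as an enum; FileStatus and is_terminal are hypothetical helper names, not part of the Satori API.

```python
from enum import Enum


class FileStatus(str, Enum):
    """Processing stages, using the status strings listed above."""
    PENDING = "pending"
    PROCESSING = "processing"
    CLEARING_ARTIFACTS = "clearing_artifacts"
    BUILDING_ARTIFACTS = "building_artifacts"
    CLASSIFYING = "classifying"
    READY = "ready"
    FAILED = "failed"


# No further stage follows either of these states
TERMINAL_STATUSES = {FileStatus.READY, FileStatus.FAILED}


def is_terminal(status):
    """Return True if processing has finished, successfully or not.

    Raises ValueError for a status string that is not in the list above.
    """
    return FileStatus(status) in TERMINAL_STATUSES
```

A polling loop or webhook handler can stop waiting as soon as is_terminal returns True.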

Processing Times

Small PDFs (< 10MB): 30-60 seconds
Large PDFs (> 100MB): 2-5 minutes
Videos: 1-10 minutes (depends on length)
Audio: 30 seconds - 3 minutes
Images: 10-30 seconds

Monitoring File Status

Check Single File Status

curl -X GET "{api_host}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

List All Files

curl -X GET "{api_host}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

Polling for Ready Status

import time

import requests

def wait_for_file_ready(file_id, enclave_id, max_wait=300, poll_interval=5):
    """Poll until the file is ready; raise on failure or timeout."""
    # BASE_URL, JWT_TOKEN, and tenant_id are assumed to be defined elsewhere
    url = f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}"
    start_time = time.monotonic()

    while time.monotonic() - start_time < max_wait:
        response = requests.get(url, headers={"Authorization": f"Bearer {JWT_TOKEN}"})
        response.raise_for_status()
        file_info = response.json()

        if file_info["status"] == "ready":
            return file_info
        elif file_info["status"] == "failed":
            raise RuntimeError(f"File processing failed: {file_id}")

        time.sleep(poll_interval)

    raise TimeoutError(f"File not ready within {max_wait} seconds")

Webhooks

Webhooks notify your server when file processing finishes, whether it succeeds (ready) or fails (failed).

Webhook Payload

{
  "event": "file.status_changed",
  "file_id": "850e8400-e29b-41d4-a716-446655440000",
  "status": "ready",
  "tenant_id": "550e8400-e29b-41d4-a716-446655440000",
  "enclave_id": "750e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2025-01-15T10:05:30Z",
  "metadata": {
    "file_name": "contract.pdf",
    "size_bytes": 245000
  }
}
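
Parsing the payload into a typed object makes handler code easier to read and validates that the expected fields are present. This is a minimal sketch based only on the fields in the sample payload above; FileStatusEvent and parse_event are hypothetical names, and real payloads may carry additional fields.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class FileStatusEvent:
    """Typed view of the webhook payload fields shown in the sample."""
    event: str
    file_id: str
    status: str
    tenant_id: str
    enclave_id: str
    timestamp: datetime
    metadata: dict = field(default_factory=dict)


def parse_event(raw):
    """Parse a JSON payload string; raises KeyError if a field is missing."""
    data = json.loads(raw)
    # Normalize the trailing "Z" so fromisoformat works on older Python versions
    timestamp = datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00"))
    return FileStatusEvent(
        event=data["event"],
        file_id=data["file_id"],
        status=data["status"],
        tenant_id=data["tenant_id"],
        enclave_id=data["enclave_id"],
        timestamp=timestamp,
        metadata=data.get("metadata", {}),
    )
```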

Webhook Implementation

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhook/file-processed")
async def handle_file_webhook(request: Request):
    payload = await request.json()

    # process_ready_file and handle_failed_upload are placeholders for your own logic
    if payload["status"] == "ready":
        file_id = payload["file_id"]
        # File is ready - start querying
        await process_ready_file(file_id)
    elif payload["status"] == "failed":
        # Handle failure
        await handle_failed_upload(payload["file_id"])

    return {"status": "received"}

Webhook Requirements

HTTPS only: Webhook URLs must use HTTPS
Retry logic: Satori retries failed webhooks (3 attempts with exponential backoff)
Response: Your endpoint should return 200 OK
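
Because non-HTTPS webhook URLs are rejected, it can be worth validating the URL before sending the upload request. A minimal sketch using only the standard library follows; is_valid_webhook_url is a hypothetical helper, and the server may apply stricter rules than this check.

```python
from urllib.parse import urlparse


def is_valid_webhook_url(url):
    """Return True if the URL uses the https scheme and names a host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.netloc)
```

For example, is_valid_webhook_url("http://myapp.com/webhook") returns False, so the caller can fix the URL before the API responds with 400 Bad Request.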

File Metadata

Adding Metadata

Metadata is stored as JSON and can include any key-value pairs:

metadata = {
    "author": "John Doe",
    "date": "2025-01-15",
    "category": "research",
    "department": "engineering",
    "project": "project-alpha",
    "version": "1.0",
    "tags": ["important", "reviewed"]
}

Best Practices

Keep under 10KB: Large metadata can slow processing
Use searchable fields: Include fields you might want to filter by
Consistent structure: Use the same fields across similar files
Include timestamps: Track when files were created/uploaded
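
The 10KB guideline can be checked before upload by measuring the serialized metadata. The sketch below assumes that 1KB means 1024 bytes and that metadata is serialized with json.dumps as in the Python upload example; check_metadata is a hypothetical helper name.

```python
import json

# Assumption: the 10KB guideline is interpreted as 10 * 1024 bytes
MAX_METADATA_BYTES = 10 * 1024


def check_metadata(metadata):
    """Return the serialized size in bytes, or raise ValueError if too large."""
    size = len(json.dumps(metadata).encode("utf-8"))
    if size > MAX_METADATA_BYTES:
        raise ValueError(
            f"Metadata is {size} bytes; keep it under {MAX_METADATA_BYTES}"
        )
    return size
```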

Retrieving Metadata

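# file_id, enclave_id, and tenant_id are assumed to be defined elsewhere
response = requests.get(
    f"{BASE_URL}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}",
    headers={"Authorization": f"Bearer {JWT_TOKEN}"}
)
file_info = response.json()
metadata = file_info.get("file_meta", {})
print(f"Author: {metadata.get('author')}")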

File Limits

Size Limits

Maximum file size: 512MB
Recommended: Keep files under 100MB for faster processing
Large files: Consider splitting into multiple files
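
These limits can be enforced with a preflight check before any bytes are sent. The following is a sketch under the assumption that 1MB means 1024 * 1024 bytes; classify_size and preflight are hypothetical helper names.

```python
import os

# Assumption: MB is interpreted as 1024 * 1024 bytes
MAX_UPLOAD_BYTES = 512 * 1024 * 1024
RECOMMENDED_BYTES = 100 * 1024 * 1024


def classify_size(size_bytes):
    """Classify a size as 'ok', 'large' (allowed but slow), or 'too_large'."""
    if size_bytes > MAX_UPLOAD_BYTES:
        return "too_large"
    if size_bytes > RECOMMENDED_BYTES:
        return "large"
    return "ok"


def preflight(file_path):
    """Classify a file on disk by its size."""
    return classify_size(os.path.getsize(file_path))
```

A "too_large" result means the upload would be rejected with 413 Payload Too Large, so the file should be split or compressed first.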

Handling Large Files

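import os
from pypdf import PdfReader, PdfWriter  # assumption: any PDF library would do

def split_large_pdf(file_path, max_size_mb=100, pages_per_chunk=50):
    """Split a large PDF into page-range chunks and return the chunk paths."""
    if os.path.getsize(file_path) / (1024 * 1024) <= max_size_mb:
        return [file_path]  # Small enough to upload as-is
    reader, chunk_paths = PdfReader(file_path), []
    for start in range(0, len(reader.pages), pages_per_chunk):
        writer = PdfWriter()
        for i in range(start, min(start + pages_per_chunk, len(reader.pages))):
            writer.add_page(reader.pages[i])
        chunk_paths.append(f"{file_path}.part{len(chunk_paths) + 1}.pdf")
        with open(chunk_paths[-1], "wb") as out:
            writer.write(out)
    return chunk_paths  # Upload each chunk separately; lower pages_per_chunk if chunks are still too large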

Getting Transcripts

For video and audio files, retrieve transcripts:

curl -X GET "{api_host}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}/transcript" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

Response:

{
  "file_id": "850e8400-e29b-41d4-a716-446655440000",
  "filename": "meeting_recording.mp4",
  "content_type": "video/mp4",
  "transcript": "Welcome everyone to today's meeting...",
  "keywords": ["quarterly results", "revenue increase"],
  "created_at": "2025-01-15T10:05:00Z",
  "updated_at": "2025-01-15T10:05:00Z"
}
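
The same request can be made from Python. The sketch below uses the standard library's urllib rather than requests so that it has no dependencies; transcript_url and get_transcript are hypothetical helper names, and the response shape is assumed to match the sample above.

```python
import json
import urllib.request


def transcript_url(api_host, tenant_id, enclave_id, file_id):
    """Build the transcript endpoint URL shown in the curl example."""
    return (
        f"{api_host}/api/tenants/{tenant_id}"
        f"/enclaves/{enclave_id}/files/{file_id}/transcript"
    )


def get_transcript(api_host, tenant_id, enclave_id, file_id, token):
    """Fetch and decode the transcript JSON for a video or audio file."""
    request = urllib.request.Request(
        transcript_url(api_host, tenant_id, enclave_id, file_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    # urlopen raises urllib.error.HTTPError on 4xx and 5xx responses
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The returned dictionary would then expose fields such as "transcript" and "keywords".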

Deleting Files

⚠️ Warning: Deletion is permanent and cannot be undone.

curl -X DELETE "{api_host}/api/tenants/{tenant_id}/enclaves/{enclave_id}/files/{file_id}" \
  -H "Authorization: Bearer <YOUR_JWT_TOKEN>"

What gets deleted:

File record from database
Object storage object
Vector embeddings
Transcripts
All processing artifacts

Duplicate Handling

Files are deduplicated by SHA-256 hash:

Uploading the same file twice returns the existing file
Use a different file_id to force a new upload
Duplicate detection happens automatically
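
Computing the hash locally makes it possible to spot duplicates before uploading. The sketch below hashes a file in chunks so that large files are not loaded into memory at once. It assumes the server hashes the raw file bytes, which the guide does not state explicitly; the function names are hypothetical.

```python
import hashlib


def sha256_of_chunks(chunks):
    """Return the hex SHA-256 digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(file_path, chunk_size=1024 * 1024):
    """Hash a file on disk, reading chunk_size bytes at a time."""
    with open(file_path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes
        return sha256_of_chunks(iter(lambda: f.read(chunk_size), b""))
```

Keeping a local set of digests for files already uploaded lets a client skip re-sending identical content.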

Error Handling

Common Errors

413 Payload Too Large

File exceeds 512MB limit
Solution: Split or compress the file

415 Unsupported Media Type

File type not allowed
Solution: Check supported file types list

400 Bad Request

Invalid metadata JSON
Invalid webhook URL (must be HTTPS)
Missing required fields

404 Not Found

File doesn't exist
Solution: Check file_id and enclave_id
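
Client code can translate these status codes into actionable messages. The mapping below is a sketch that restates the causes and solutions listed above; ERROR_HINTS and explain_upload_error are hypothetical names, and other status codes may also occur.

```python
# Hints restating the common errors documented above
ERROR_HINTS = {
    400: "Bad request: check the metadata JSON, webhook URL (must be HTTPS), "
         "and required fields.",
    404: "Not found: check file_id and enclave_id.",
    413: "File exceeds the 512MB limit: split or compress it.",
    415: "File type not allowed: check the supported file types list.",
}


def explain_upload_error(status_code):
    """Return a human-readable hint for an HTTP status code."""
    return ERROR_HINTS.get(
        status_code, f"Unexpected status {status_code}; see the API reference."
    )
```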

Best Practices

✅ DO:

Use webhooks for async processing
Add meaningful metadata
Monitor file processing status
Check file sizes against the 512MB limit before uploading
Use appropriate file types

❌ DON'T:

Assume a file is ready without checking its status
Upload duplicate files unnecessarily
Upload files larger than 512MB
Ignore failed processing status
Use insecure webhook URLs (HTTP)

Next Steps

Querying Documents Guide - Query your uploaded files
Best Practices Guide - General usage best practices
API Reference - Full API documentation
