
What is an Output Schema?

An output schema is a self-contained, reproducible extraction recipe attached to a document. It bundles three things:
| Component | Purpose |
| --- | --- |
| Schema | The shape of the output (JSON Schema) |
| Prompt | Extraction instructions sent to the LLM |
| Model | Which LLM runs the extraction |
Once defined, the SDK extracts data from the document, validates it against the schema, and materializes the result — storing it permanently alongside a full audit trail.
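As a rough illustration of the three-part recipe, the sketch below models it as a TypeScript type with a structural check. The names `OutputProfile` and `isOutputProfile` are hypothetical and are not taken from the SDK; only the three fields (schema, prompt, model) come from the table above.

```typescript
// Hypothetical types: the SDK's real type names are not shown in this page.
type JsonSchema = { [key: string]: unknown };

interface OutputProfile {
  schema: JsonSchema; // the shape of the output (JSON Schema)
  prompt: string;     // extraction instructions sent to the LLM
  model: string;      // which LLM runs the extraction
}

// Structural check that a value carries all three components of the recipe.
function isOutputProfile(value: unknown): value is OutputProfile {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const schema = v["schema"];
  const prompt = v["prompt"];
  const model = v["model"];
  return (
    typeof schema === "object" && schema !== null &&
    typeof prompt === "string" && prompt.length > 0 &&
    typeof model === "string" && model.length > 0
  );
}
```

A check like this only confirms that the recipe is complete. It does not validate that `schema` is itself a well-formed JSON Schema.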

Why Output Schemas?

Reproducibility. Every output records exactly what produced it: which model, what prompt, and the raw LLM response before parsing. You can always trace back from a result to its source.

Zero-cost reads. Materialized outputs are written to R2 on creation. Public reads are served directly from R2, so the Durable Object never wakes and reads incur no compute cost.

Composability. A single document can have many output schemas: invoice, receipt, contract_terms, compliance_flags. Each is an independent extraction with its own recipe.

How It Works

      SDK extracts data
              │
              ▼
  ┌─────────────────────────┐
  │   Durable Object        │
  │   ┌───────────────────┐ │
  │   │ output_profiles   │ │  ← recipe (schema + prompt + model)
  │   │ materialized_data │ │  ← result + audit trail
  │   └───────────────────┘ │
  │           │             │
  │     write to R2         │
  └─────────────────────────┘
              │
              ▼
  ┌─────────────────────────┐
  │   R2 (data only)        │  ← public reads, no DO wake
  │   /o_invoice/data.json  │
  └─────────────────────────┘
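The flow in the diagram can be sketched with in-memory stand-ins. Everything here is an assumption for illustration: `materialize`, the two `Map` objects standing in for Durable Object storage and R2, and the key layout are hypothetical, and the validation step checks only the top-level type where a real implementation would presumably run a full JSON Schema validator.

```typescript
interface OutputProfile {
  schema: { type?: string; [key: string]: unknown };
  prompt: string;
  model: string;
}

interface AuditRecord {
  model: string;
  prompt: string;
  raw_response: string;
  created_at: string; // assumed ISO 8601; the real format is not documented here
}

interface MaterializedRecord {
  data: unknown;
  audit: AuditRecord;
}

// doState stands in for the Durable Object's materialized_data table,
// r2 stands in for the public R2 bucket (data only, no audit fields).
function materialize(
  name: string,
  profile: OutputProfile,
  rawResponse: string,
  doState: Map<string, MaterializedRecord>,
  r2: Map<string, string>,
): unknown {
  // Parse the raw LLM output; JSON.parse throws on malformed JSON.
  const data: unknown = JSON.parse(rawResponse);

  // Minimal validation only: check the top-level type declared by the schema.
  const isPlainObject =
    typeof data === "object" && data !== null && !Array.isArray(data);
  if (profile.schema.type === "object" && !isPlainObject) {
    throw new Error(`Output "${name}" does not match schema type "object"`);
  }

  // Store the result together with its audit trail on the private side.
  doState.set(name, {
    data,
    audit: {
      model: profile.model,
      prompt: profile.prompt,
      raw_response: rawResponse,
      created_at: new Date().toISOString(),
    },
  });

  // Publish only the validated data to the public side.
  r2.set(`/o_${name}/data.json`, JSON.stringify(data));
  return data;
}
```

The point of the split is that the audit record never reaches the public store: readers of `/o_{name}/data.json` see the validated data and nothing else.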

Use Cases

Invoice Processing

Extract vendor, total, line items, and dates from uploaded invoices. Attach the output schema once, then every invoice in the collection gets the same extraction recipe applied.
const profile = {
  schema: {
    type: 'object',
    properties: {
      vendor: { type: 'string' },
      invoice_number: { type: 'string' },
      date: { type: 'string', format: 'date' },
      total: { type: 'number' },
      currency: { type: 'string' },
      line_items: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            description: { type: 'string' },
            quantity: { type: 'number' },
            unit_price: { type: 'number' },
            amount: { type: 'number' },
          },
        },
      },
    },
  },
  prompt: 'Extract all invoice fields. For line items, include description, quantity, unit price, and line total.',
  model: 'claude-sonnet-4-5-20250929',
};
After materialization, any system can read the structured invoice data at:
GET /v1/documents/{id}/o_invoice/data.json
No API key needed. No server wake. Just JSON.
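A consumer might read the materialized invoice like this. This is a sketch under assumptions: the base URL is a placeholder, `invoiceUrl`, `fetchInvoice`, and `lineItemsTotal` are hypothetical helpers, and every field is typed as optional because the schema above declares no `required` list.

```typescript
interface LineItem {
  description?: string;
  quantity?: number;
  unit_price?: number;
  amount?: number;
}

interface Invoice {
  vendor?: string;
  invoice_number?: string;
  date?: string;
  total?: number;
  currency?: string;
  line_items?: LineItem[];
}

// Build the public URL from the pattern shown above; baseUrl is a placeholder.
function invoiceUrl(baseUrl: string, documentId: string): string {
  const base = baseUrl.replace(/\/+$/, "");
  return `${base}/v1/documents/${encodeURIComponent(documentId)}/o_invoice/data.json`;
}

// Plain GET with no credentials, since the public path needs no API key.
async function fetchInvoice(baseUrl: string, documentId: string): Promise<Invoice> {
  const res = await fetch(invoiceUrl(baseUrl, documentId));
  if (!res.ok) {
    throw new Error(`Invoice read failed with status ${res.status}`);
  }
  return (await res.json()) as Invoice;
}

// Sum the line-item amounts, e.g. to sanity-check them against `total`.
function lineItemsTotal(invoice: Invoice): number {
  return (invoice.line_items ?? []).reduce(
    (sum, item) => sum + (item.amount ?? 0),
    0,
  );
}
```

Comparing `lineItemsTotal(invoice)` with `invoice.total` is a cheap downstream check, since JSON Schema validation alone cannot confirm that the extracted numbers agree with each other.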

Compliance Screening

Flag regulatory risks in financial filings. The schema defines the flags, the prompt instructs what to look for, the model does the analysis.
const profile = {
  schema: {
    type: 'object',
    properties: {
      risk_level: { type: 'string', enum: ['low', 'medium', 'high', 'critical'] },
      flags: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            category: { type: 'string' },
            description: { type: 'string' },
            page: { type: 'number' },
            severity: { type: 'string' },
          },
        },
      },
      summary: { type: 'string' },
    },
  },
  prompt: 'Analyze this filing for regulatory compliance risks. Flag material weaknesses, related party transactions, going concern language, and restatement disclosures.',
  model: 'claude-sonnet-4-5-20250929',
};

Contract Term Extraction

Pull key terms from legal documents for deal review dashboards.
const profile = {
  schema: {
    type: 'object',
    properties: {
      parties: { type: 'array', items: { type: 'string' } },
      effective_date: { type: 'string', format: 'date' },
      termination_date: { type: 'string', format: 'date' },
      governing_law: { type: 'string' },
      payment_terms: { type: 'string' },
      auto_renewal: { type: 'boolean' },
      non_compete_months: { type: 'number' },
      liability_cap: { type: 'string' },
    },
  },
  prompt: 'Extract key contract terms including parties, dates, governing law, payment terms, renewal clauses, non-compete duration, and liability caps.',
  model: 'claude-sonnet-4-5-20250929',
};

Resume Parsing

Structure candidate data from uploaded resumes for ATS integrations.
const profile = {
  schema: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      email: { type: 'string' },
      phone: { type: 'string' },
      skills: { type: 'array', items: { type: 'string' } },
      experience: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            company: { type: 'string' },
            title: { type: 'string' },
            start_date: { type: 'string' },
            end_date: { type: 'string' },
          },
        },
      },
      education: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            institution: { type: 'string' },
            degree: { type: 'string' },
            year: { type: 'number' },
          },
        },
      },
    },
  },
  prompt: 'Extract structured candidate information from this resume.',
  model: 'claude-sonnet-4-5-20250929',
};

Audit Trail

Every materialized output stores a full audit record alongside the data:
| Field | Description |
| --- | --- |
| model | The model that ran the extraction |
| prompt | The exact prompt that was sent |
| raw_response | The raw LLM output before JSON parsing |
| created_at | Timestamp of materialization |
Access the audit trail at:
GET /document/{id}/output/{name}/audit
This is authenticated and never exposed publicly — the public R2 path only serves the validated data.
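The audit record and an authenticated request for it could be modeled as below. The field names come from the table above; the rest is assumed for illustration. In particular, the page does not say how authentication works, so the `Authorization: Bearer` header, the `auditRequest` helper, and the string type for `created_at` are guesses.

```typescript
interface AuditRecord {
  model: string;        // the model that ran the extraction
  prompt: string;       // the exact prompt that was sent
  raw_response: string; // the raw LLM output before JSON parsing
  created_at: string;   // assumed to be a string timestamp
}

interface AuditRequest {
  url: string;
  headers: Record<string, string>;
}

// Build the request for the audit endpoint. The bearer-token header is an
// assumption; substitute whatever scheme the API actually uses.
function auditRequest(
  baseUrl: string,
  documentId: string,
  outputName: string,
  apiKey: string,
): AuditRequest {
  const base = baseUrl.replace(/\/+$/, "");
  const id = encodeURIComponent(documentId);
  const name = encodeURIComponent(outputName);
  return {
    url: `${base}/document/${id}/output/${name}/audit`,
    headers: { Authorization: `Bearer ${apiKey}` },
  };
}

// Runtime check that a response body has the four documented fields.
function isAuditRecord(value: unknown): value is AuditRecord {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return ["model", "prompt", "raw_response", "created_at"].every(
    (key) => typeof v[key] === "string",
  );
}
```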

Public URL Pattern

Materialized outputs are available at a predictable, cacheable URL:
GET /v1/documents/{id}/o_{name}/data.json
The o_ prefix tells the worker to read from R2 directly. The Durable Object never wakes. Combine with the t_ transform prefix for provider-specific extractions:
GET /v1/documents/{id}/t_llamaparse/o_invoice/data.json
Response headers include Cache-Control: public, max-age=3600 and Access-Control-Allow-Origin: * for easy embedding. The t_ and o_ URL segments are inspired by Cloudinary’s URL-as-API pattern — encode transforms in the path so results are cacheable, embeddable, and readable without an SDK.
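To make the path convention concrete, here is one way the `t_` and `o_` segments could be parsed. `parseOutputPath` is a hypothetical helper, not the worker's actual routing code, and it assumes at most one `t_` segment appears before the `o_` segment, as in the examples above.

```typescript
interface ParsedOutputPath {
  documentId: string;
  transform?: string; // value after the t_ prefix, if present
  output: string;     // value after the o_ prefix
}

// Matches /v1/documents/{id}/[t_{transform}/]o_{name}/data.json
const OUTPUT_PATH =
  /^\/v1\/documents\/([^/]+)\/(?:t_([^/]+)\/)?o_([^/]+)\/data\.json$/;

function parseOutputPath(path: string): ParsedOutputPath | null {
  const match = OUTPUT_PATH.exec(path);
  if (match === null) return null;

  const documentId = match[1];
  const transform = match[2];
  const output = match[3];
  if (documentId === undefined || output === undefined) return null;

  // Omit the transform key entirely when the path has no t_ segment.
  return transform === undefined
    ? { documentId, output }
    : { documentId, transform, output };
}
```

For example, `parseOutputPath("/v1/documents/abc/t_llamaparse/o_invoice/data.json")` yields a document ID of `abc`, a transform of `llamaparse`, and an output name of `invoice`, while a path without an `o_` segment returns `null`.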