Turn any web page into clean, structured data. The Contents API extracts content from URLs with batch processing, AI-powered summaries, and structured outputs.

What You Can Do

  • Feed your AI - Clean data without noise
  • Aggregate content - Extract structured data from multiple sources
  • Transform content - Convert web pages into usable formats
  • Automate research - Pull key information from articles, papers, and reports

Features

Batch Processing

Submit up to 10 URLs synchronously, or up to 50 URLs with async mode.

AI-Powered Structuring

Use JSON schemas to extract specific data points.

Smart Summarisation

Generate tailored summaries with custom instructions.

Pay-per-Success

Only pay for URLs that are successfully processed.

Getting Started

Basic Extraction

from valyu import Valyu

valyu = Valyu()  # Uses VALYU_API_KEY from env

data = valyu.contents(
    urls=[
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
    ],
    response_length="medium",
    extract_effort="auto",
)
print(data["results"][0]["content"][:500])
Returns clean markdown content for each URL.

Response Length

Length         | Characters      | Use for
short          | 25,000          | Summaries, key points
medium         | 50,000          | Articles, blog posts
large          | 100,000         | Academic papers, long-form content
max            | Unlimited       | Full document extraction
Custom integer | 1,000-1,000,000 | Specific requirements
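
The presets above can be validated locally before making a request. This is a hypothetical helper (not part of the SDK) that maps the named presets to their character budgets and checks custom integers against the documented 1,000-1,000,000 range:

```python
# Hypothetical local helper (not part of the SDK): resolves a
# response_length value to a character budget using the table above.
def resolve_response_length(value):
    presets = {"short": 25_000, "medium": 50_000, "large": 100_000, "max": None}
    if isinstance(value, str):
        if value not in presets:
            raise ValueError(f"Unknown preset: {value!r}")
        return presets[value]  # None means unlimited
    if isinstance(value, int):
        if not 1_000 <= value <= 1_000_000:
            raise ValueError("Custom length must be between 1,000 and 1,000,000")
        return value
    raise TypeError("response_length must be a preset name or an integer")
```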

Extract Effort

Effort | Description
normal | Standard speed and quality (default)
high   | Better quality, slower
auto   | Automatically chooses the right level

Screenshot Capture

Capture visual screenshots of pages alongside content extraction:
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="auto",
    screenshot=True,
)
print(data["results"][0]["screenshot_url"])
Screenshots are captured during page rendering and returned as pre-signed URLs. PDF files do not support screenshots.

Advanced Features

Summary Options

The summary field accepts four types of values:

No AI Processing (false)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="normal",
    summary=False,
)
print(data["results"][0]["content"][:300])

Basic Summary (true)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="auto",
    summary=True,
)
print(data["results"][0]["content"])

Custom Instructions (string)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/research-paper"],
    extract_effort="auto",
    summary="Summarise the methodology, key findings, and practical applications in 2-3 paragraphs",
)
print(data["results"][0]["content"])

Structured Extraction (object)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/product-page"],
    extract_effort="auto",
    summary={
        "type": "object",
        "properties": {
            "product_name": {"type": "string", "description": "Name of the product"},
            "price": {"type": "number", "description": "Product price in USD"},
            "features": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 5,
                "description": "Key product features",
            },
            "availability": {
                "type": "string",
                "enum": ["in_stock", "out_of_stock", "preorder"],
                "description": "Product availability status",
            },
        },
        "required": ["product_name", "price"],
    },
)
print(data["results"][0]["content"])

JSON Schema Reference

For structured extraction, you can use any valid JSON Schema. See the JSON Schema Type Reference for details. Limits:
  • 5,000 characters max
  • 3 levels deep max
  • 20 properties per object max
Common types:
  • string - Text with optional format validation
  • number / integer - Numbers with optional min/max
  • boolean - True/false
  • array - Lists with optional size limits
  • object - Nested structures
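The three limits above can be checked client-side before submitting a schema. This is a sketch of a pre-flight check (not part of the SDK), assuming depth is counted from the root object:

```python
import json

# Hypothetical pre-flight check (not part of the SDK) enforcing the
# documented limits: 5,000 serialised characters, at most 3 nesting
# levels, and at most 20 properties per object.
def check_schema_limits(schema: dict) -> None:
    if len(json.dumps(schema)) > 5_000:
        raise ValueError("Schema exceeds 5,000 characters")

    def depth_of(node, level=1):
        if node.get("type") == "object":
            props = node.get("properties", {})
            if len(props) > 20:
                raise ValueError("More than 20 properties in one object")
            return max((depth_of(p, level + 1) for p in props.values()), default=level)
        if node.get("type") == "array":
            return depth_of(node.get("items", {}), level + 1)
        return level

    if depth_of(schema) > 3:
        raise ValueError("Schema is nested more than 3 levels deep")
```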

Async Processing

For large-scale extraction (11-50 URLs) or non-blocking workflows, use async mode. Submit URLs, get a job_id back immediately, then either poll for results or receive them via webhook.

When to use async

  • More than 10 URLs — required for batches of 11-50 URLs
  • Non-blocking workflows — submit and continue processing while extraction runs in the background
  • Webhook-driven architectures — receive results via webhook instead of polling
  • Extended processing — async mode provides 120s timeout per URL vs 25s for sync
Async mode is required when submitting more than 10 URLs. For 1-10 URLs, you can optionally use async mode for non-blocking workflows.

Quick start

from valyu import Valyu

valyu = Valyu()

# Submit async job — returns immediately with a job_id
job = valyu.contents(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        # ... up to 50 URLs
    ],
    async_mode=True,
    webhook_url="https://your-app.com/webhooks/valyu",  # optional
)

print(f"Job ID: {job['job_id']}")

# Option 1: Block until complete (SDK handles polling)
result = valyu.wait_for_contents_job(
    job["job_id"],
    poll_interval=5,        # seconds between polls
    max_wait_time=3600,     # max seconds to wait
    on_progress=lambda s: print(f"  {s['status']} — batch {s.get('current_batch', '?')}/{s.get('total_batches', '?')}"),
)

# Option 2: pass wait=True when submitting to auto-poll
result = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2"],
    async_mode=True,
    wait=True,  # blocks until job completes
)

for r in result["results"]:
    print(f"{r['title']}: {r['length']} characters")
print(f"Total cost: ${result['actual_cost_dollars']}")
The webhook_secret is only returned in the initial 202 response. Store it immediately — you cannot retrieve it later.

Initial response (HTTP 202)

{
  "success": true,
  "job_id": "cj_a1b2c3d4e5f6g7h8",
  "status": "pending",
  "urls_total": 25,
  "poll_url": "/contents/jobs/cj_a1b2c3d4e5f6g7h8",
  "tx_id": "tx_async-1234-5678-abcd-ef0123456789",
  "webhook_secret": "a1b2c3d4e5f6..."
}
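
Since webhook_secret appears only in this initial response, it is worth persisting as soon as the job is created. A minimal sketch, where save_secret is a stand-in for your own storage layer:

```python
# Sketch of persisting the one-time webhook secret from the 202 response.
# `save_secret` is a hypothetical stand-in for your own secrets store.
def handle_job_created(response_body: dict, save_secret) -> str:
    job_id = response_body["job_id"]
    secret = response_body.get("webhook_secret")
    if secret:
        save_secret(job_id, secret)  # store now; it cannot be retrieved later
    return job_id
```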

Manual polling

If you prefer to poll manually instead of using the SDK convenience methods:
import time

while True:
    status = valyu.get_contents_job(job["job_id"])
    print(f"Status: {status['status']} ({status.get('current_batch', '?')}/{status.get('total_batches', '?')} batches)")

    if status["status"] in ("completed", "partial", "failed"):
        break
    time.sleep(2)

if status["status"] in ("completed", "partial"):
    for result in status["results"]:
        print(f"{result['title']}: {result['length']} characters")

if status["status"] in ("partial", "failed"):
    print(f"Error: {status.get('error', 'Unknown')}")

Job lifecycle

Jobs progress through these statuses:
Status     | Description
pending    | Job created, not yet started
processing | URLs being processed in batches of 5
completed  | All URLs processed successfully
partial    | Finished with some URL failures
failed     | All URLs failed

Job status fields

Field               | Type   | Description                        | Present
job_id              | string | Job identifier (prefixed cj_)      | Always
status              | string | Current job status (see above)     | Always
urls_total          | number | Total URLs submitted               | Always
urls_processed      | number | URLs successfully processed        | Always
urls_failed         | number | URLs that failed processing        | Always
created_at          | number | Job creation time (ms since epoch) | Always
updated_at          | number | Last update time (ms since epoch)  | Always
current_batch       | number | Current batch being processed      | processing only
total_batches       | number | Total number of batches            | processing only
results             | array  | Extraction results                 | completed/partial
actual_cost_dollars | number | Total cost in dollars              | completed/partial
error               | string | Error description                  | partial/failed only

Webhooks

When you provide a webhook_url, Valyu sends a POST request to your endpoint when the job completes (or partially completes/fails). Headers:
Header              | Value
Content-Type        | application/json
User-Agent          | Valyu-Contents/1.0
X-Webhook-Signature | sha256={hex_digest}
X-Webhook-Timestamp | Unix timestamp in seconds
Payload:
{
  "job_id": "cj_a1b2c3d4e5f6g7h8",
  "status": "completed",
  "urls_total": 25,
  "urls_processed": 25,
  "urls_failed": 0,
  "results": [...],
  "actual_cost_dollars": 0.025,
  "error": null
}
Retries: up to 5 attempts with exponential backoff (1s, 2s, 4s, 8s between attempts). 4xx responses are not retried. Timeouts: 5s connect, 15s read.
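
The delivery schedule described above can be reconstructed as a simple backoff function, useful if you want to mirror the same policy when replaying missed webhooks yourself (an illustration, not SDK behaviour):

```python
# Illustrative reconstruction of the retry schedule: up to 5 attempts,
# with exponential delays of 1s, 2s, 4s, 8s between consecutive attempts.
def retry_delays(max_attempts: int = 5, base: float = 1.0) -> list:
    # One delay between each consecutive pair of attempts.
    return [base * (2 ** i) for i in range(max_attempts - 1)]
```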

Verifying webhook signatures

The payload is signed using HMAC-SHA256. The signed payload is the timestamp and JSON body joined by a period: "{timestamp}.{json_payload}".
import hmac
import hashlib

def verify_webhook(
    payload: bytes,
    signature_header: str,
    timestamp_header: str,
    webhook_secret: str
) -> bool:
    signed_payload = f"{timestamp_header}.{payload.decode('utf-8')}"

    expected = hmac.new(
        webhook_secret.encode("utf-8"),
        signed_payload.encode("utf-8"),
        hashlib.sha256
    ).hexdigest()

    received = signature_header.removeprefix("sha256=")
    return hmac.compare_digest(expected, received)
The TypeScript SDK exports a verifyContentsWebhookSignature() helper that handles this for you.
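
To sanity-check the scheme end-to-end, you can sign a payload the way the server does and verify it the way a receiver would. This sketch inlines the same HMAC construction; the secret, job_id, and timestamp are made-up example values:

```python
import hashlib
import hmac
import json

# Round-trip check of the "{timestamp}.{json_payload}" signing scheme.
# All values here are illustrative, not real credentials.
secret = "a1b2c3d4e5f6"
payload = json.dumps({"job_id": "cj_demo", "status": "completed"})
timestamp = "1700000000"

# Sender side: sign timestamp + "." + body with HMAC-SHA256.
signed = f"{timestamp}.{payload}".encode("utf-8")
signature = "sha256=" + hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()

# Receiver side: recompute and compare in constant time.
received = signature.removeprefix("sha256=")
expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
assert hmac.compare_digest(expected, received)
```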

Limits and pricing

Detail                       | Value
Max URLs (sync)              | 10
Max URLs (async)             | 50
Batch size                   | 5 URLs per batch
Timeout per URL (sync)       | 25 seconds
Timeout per URL (async)      | 120 seconds
Base pricing                 | $0.001 per URL
AI features (summary/schema) | +$0.001 per URL
Job expiry (TTL)             | 7 days
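
From the pricing rows above you can estimate a batch's cost up front. A rough sketch, not an official calculator; the response's actual_cost_dollars field remains the source of truth:

```python
# Illustrative cost estimate from the pricing table: $0.001 per URL,
# plus $0.001 per URL when AI features (summary or schema) are enabled.
def estimate_cost_dollars(url_count: int, ai_features: bool = False) -> float:
    per_url = 0.001 + (0.001 if ai_features else 0.0)
    return round(url_count * per_url, 6)
```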

Examples

News Aggregator

Extract structured article data:
{
  "urls": [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://archive.org/details/texts",
    "https://www.gov.uk/search/news"
  ],
  "extract_effort": "auto",
  "summary": {
    "type": "object",
    "properties": {
      "headline": { "type": "string" },
      "summary_text": { "type": "string" },
      "category": { "type": "string" },
      "tags": {
        "type": "array",
        "items": { "type": "string" },
        "maxItems": 5
      }
    },
    "required": ["headline"]
  }
}

Research Paper

Extract structured academic data:
{
  "urls": ["https://arxiv.org/paper/example"],
  "response_length": "max",
  "extract_effort": "high",
  "summary": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "abstract": { "type": "string" },
      "methodology": { "type": "string" },
      "key_findings": {
        "type": "array",
        "items": { "type": "string" }
      },
      "limitations": { "type": "string" }
    },
    "required": ["title"]
  }
}

Product Info

Extract product data:
{
  "urls": ["https://company.com/product-A", "https://company.com/product-B"],
  "extract_effort": "auto",
  "summary": {
    "type": "object",
    "properties": {
      "product_name": { "type": "string" },
      "features": {
        "type": "array",
        "items": { "type": "string" }
      },
      "pricing": { "type": "string" },
      "target_audience": { "type": "string" }
    },
    "required": ["product_name"]
  }
}

Response Format

Raw Content (summary: false)

{
  "success": true,
  "error": null,
  "tx_id": "tx_12345678-1234-1234-1234-123456789abc",
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "url": "https://example.com/article?utm_source=valyu",
      "content": "# AI Breakthrough in Natural Language Processing\n\nPage content in markdown...",
      "description": "Latest AI developments",
      "source": "web",
      "price": 0.001,
      "length": 12840,
      "data_type": "unstructured",
      "image_url": {
        "main": "https://example.com/hero-image.jpg"
      }
    }
  ],
  "urls_requested": 1,
  "urls_processed": 1,
  "urls_failed": 0,
  "total_cost_dollars": 0.001,
  "total_characters": 12840
}

Summary (summary: true or string)

{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": "This article discusses a breakthrough in AI...",
      "summary_success": true,
      "price": 0.002,
      "data_type": "unstructured"
    }
  ]
}

Structured (JSON Schema)

{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": {
        "title": "AI Breakthrough in Natural Language Processing",
        "author": "John Doe",
        "category": "technology",
        "key_points": [
          "New AI model achieves 95% accuracy",
          "Reduces computational requirements by 40%"
        ]
      },
      "summary_success": true,
      "price": 0.002,
      "data_type": "structured"
    }
  ]
}

Response Fields

Field           | Description
title           | Extracted page title
url             | Original URL with UTM tracking parameters
content         | Extracted content (markdown or JSON)
description     | Page meta description or excerpt
source          | Source type; always "web" for URL processing
price           | Cost for processing this URL in dollars
length          | Character count of extracted content
data_type       | "unstructured" or "structured"
summary_success | Whether AI processing succeeded (only when the summary parameter is used)
image_url       | Dictionary of extracted image URLs
screenshot_url  | Pre-signed URL to page screenshot (only when screenshot=true was requested)
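
Because content is a markdown string for unstructured results and a parsed object for structured ones, consumers should branch on data_type. A minimal sketch:

```python
# Sketch of consuming a result dict using the fields above: structured
# results carry a JSON object in `content`, unstructured ones markdown text.
def content_preview(result: dict, limit: int = 200):
    if result.get("data_type") == "structured":
        return result["content"]      # already a parsed object
    return result["content"][:limit]  # markdown string, truncated
```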

Best Practices

Choosing Summary Type

  • false: Fastest and cheapest—no AI
  • true: Basic summary for overviews
  • "string": Custom instructions for specific needs
  • {object}: Structured extraction for data processing

JSON Schema Tips

  • Use clear descriptions to guide extraction
  • Use enums for consistent categorisation
  • Keep schemas under 3 levels deep
  • Mark essential fields as required

Batch Processing

  • Group similar content types together
  • Choose appropriate response length
  • Check summary_success for AI status
  • Track total_cost_dollars

Error Handling

# Check for partial failures (HTTP 206)
if response.status_code == 206:
    body = response.json()
    successful_results = [r for r in body["results"] if r.get("content")]
    failed_count = body["urls_failed"]

# Check AI processing success (summary_success is only present
# when the summary parameter was used)
for result in data["results"]:
    if result.get("summary_success") is False:
        print(f"AI processing failed for {result['url']}")

# Handle complete failures (HTTP 422)
if response.status_code == 422:
    error_message = response.json()["error"]

Try the Contents API

Full API reference with interactive examples

Next Steps

API Reference

Complete parameter documentation

Python SDK

Python integration

TypeScript SDK

TypeScript integration

Integrations

LangChain, LlamaIndex, and more