Turn any web page into clean, structured data. The Contents API extracts content from URLs with batch processing, AI-powered summaries, and structured outputs.

What You Can Do

  • Feed your AI - Clean data without noise
  • Aggregate content - Extract structured data from multiple sources
  • Transform content - Convert web pages into usable formats
  • Automate research - Pull key information from articles, papers, and reports

Features

Batch Processing

Submit up to 10 URLs synchronously, or up to 50 URLs with async mode.

AI-Powered Structuring

Use JSON schemas to extract specific data points.

Smart Summarisation

Generate tailored summaries with custom instructions.

Pay-per-Success

Only pay for URLs that are successfully processed.

Getting Started

Basic Extraction

from valyu import Valyu

valyu = Valyu()  # Uses VALYU_API_KEY from env

data = valyu.contents(
    urls=[
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
    ],
    response_length="medium",
    extract_effort="auto",
)
print(data["results"][0]["content"][:500])
Returns clean markdown content for each URL.

Response Length

Length         | Characters      | Use for
short          | 25,000          | Summaries, key points
medium         | 50,000          | Articles, blog posts
large          | 100,000         | Academic papers, long-form content
max            | Unlimited       | Full document extraction
Custom integer | 1,000-1,000,000 | Specific requirements
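
The presets above can be validated locally before making a request. This is a hypothetical helper (not part of the SDK) that maps the named presets to their character budgets and checks custom integers against the documented 1,000-1,000,000 range:

```python
# Hypothetical local helper (not part of the SDK): resolves a
# response_length value to a character budget using the table above.
def resolve_response_length(value):
    presets = {"short": 25_000, "medium": 50_000, "large": 100_000, "max": None}
    if isinstance(value, str):
        if value not in presets:
            raise ValueError(f"Unknown preset: {value!r}")
        return presets[value]  # None means unlimited
    if isinstance(value, int):
        if not 1_000 <= value <= 1_000_000:
            raise ValueError("Custom length must be between 1,000 and 1,000,000")
        return value
    raise TypeError("response_length must be a preset name or an integer")
```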

Extract Effort

Effort | Description
normal | Standard speed and quality (default)
high   | Better quality, slower
auto   | Automatically chooses the right level

Screenshot Capture

Capture visual screenshots of pages alongside content extraction:
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="auto",
    screenshot=True,
)
print(data["results"][0]["screenshot_url"])
Screenshots are captured during page rendering and returned as pre-signed URLs. PDF files do not support screenshots.

Advanced Features

Summary Options

The summary field accepts four types of values:

No AI Processing (false)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="normal",
    summary=False,
)
print(data["results"][0]["content"][:300])

Basic Summary (true)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="auto",
    summary=True,
)
print(data["results"][0]["content"])

Custom Instructions (string)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/research-paper"],
    extract_effort="auto",
    summary="Summarise the methodology, key findings, and practical applications in 2-3 paragraphs",
)
print(data["results"][0]["content"])

Structured Extraction (object)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/product-page"],
    extract_effort="auto",
    summary={
        "type": "object",
        "properties": {
            "product_name": {"type": "string", "description": "Name of the product"},
            "price": {"type": "number", "description": "Product price in USD"},
            "features": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 5,
                "description": "Key product features",
            },
            "availability": {
                "type": "string",
                "enum": ["in_stock", "out_of_stock", "preorder"],
                "description": "Product availability status",
            },
        },
        "required": ["product_name", "price"],
    },
)
print(data["results"][0]["content"])

JSON Schema Reference

For structured extraction, you can use any valid JSON Schema. See the JSON Schema Type Reference for details. Limits:
  • 5,000 characters max
  • 3 levels deep max
  • 20 properties per object max
Common types:
  • string - Text with optional format validation
  • number / integer - Numbers with optional min/max
  • boolean - True/false
  • array - Lists with optional size limits
  • object - Nested structures
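The three limits above can be checked client-side before submitting a schema. This is a sketch of a pre-flight check (not part of the SDK), assuming depth is counted from the root object:

```python
import json

# Hypothetical pre-flight check (not part of the SDK) enforcing the
# documented limits: 5,000 serialised characters, at most 3 nesting
# levels, and at most 20 properties per object.
def check_schema_limits(schema: dict) -> None:
    if len(json.dumps(schema)) > 5_000:
        raise ValueError("Schema exceeds 5,000 characters")

    def depth_of(node, level=1):
        if node.get("type") == "object":
            props = node.get("properties", {})
            if len(props) > 20:
                raise ValueError("More than 20 properties in one object")
            return max((depth_of(p, level + 1) for p in props.values()), default=level)
        if node.get("type") == "array":
            return depth_of(node.get("items", {}), level + 1)
        return level

    if depth_of(schema) > 3:
        raise ValueError("Schema is nested more than 3 levels deep")
```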

Async Processing

For large-scale extraction (11-50 URLs) or non-blocking workflows, use async mode. Submit URLs, get a job_id back immediately, then either poll for results or receive them via webhook.

When to use async

  • More than 10 URLs — required for batches of 11-50 URLs
  • Non-blocking workflows — submit and continue processing while extraction runs in the background
  • Webhook-driven architectures — receive results via webhook instead of polling
  • Extended processing — async mode provides 120s timeout per URL vs 25s for sync
Async mode is required when submitting more than 10 URLs. For 1-10 URLs, you can optionally use async mode for non-blocking workflows.

Quick start

from valyu import Valyu

valyu = Valyu()

# Submit async job — returns immediately with a job_id
job = valyu.contents(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        # ... up to 50 URLs
    ],
    async_mode=True,
    webhook_url="https://your-app.com/webhooks/valyu",  # optional
)

print(f"Job ID: {job['job_id']}")

# Option 1: Block until complete (SDK handles polling)
result = valyu.wait_for_contents_job(
    job["job_id"],
    poll_interval=5,        # seconds between polls
    max_wait_time=3600,     # max seconds to wait
    on_progress=lambda s: print(f"  {s['status']} — batch {s.get('current_batch', '?')}/{s.get('total_batches', '?')}"),
)

# Option 2: pass wait=True when submitting to auto-poll
result = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2"],
    async_mode=True,
    wait=True,  # blocks until job completes
)

for r in result["results"]:
    print(f"{r['title']}: {r['length']} characters")
print(f"Total cost: ${result['actual_cost_dollars']}")
The webhook_secret is only returned in the initial 202 response. Store it immediately — you cannot retrieve it later.

Initial response (HTTP 202)

{
  "success": true,
  "job_id": "cj_a1b2c3d4e5f6g7h8",
  "status": "pending",
  "urls_total": 25,
  "poll_url": "/contents/jobs/cj_a1b2c3d4e5f6g7h8",
  "tx_id": "tx_async-1234-5678-abcd-ef0123456789",
  "webhook_secret": "a1b2c3d4e5f6..."
}
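
Since webhook_secret appears only in this initial response, it is worth persisting as soon as the job is created. A minimal sketch, where save_secret is a stand-in for your own storage layer:

```python
# Sketch of persisting the one-time webhook secret from the 202 response.
# `save_secret` is a hypothetical stand-in for your own secrets store.
def handle_job_created(response_body: dict, save_secret) -> str:
    job_id = response_body["job_id"]
    secret = response_body.get("webhook_secret")
    if secret:
        save_secret(job_id, secret)  # store now; it cannot be retrieved later
    return job_id
```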

Manual polling

If you prefer to poll manually instead of using the SDK convenience methods:
import time

while True:
    status = valyu.get_contents_job(job["job_id"])
    print(f"Status: {status['status']} ({status.get('current_batch', '?')}/{status.get('total_batches', '?')} batches)")

    if status["status"] in ("completed", "partial", "failed"):
        break
    time.sleep(2)

if status["status"] in ("completed", "partial"):
    for result in status["results"]:
        print(f"{result['title']}: {result['length']} characters")

if status["status"] in ("partial", "failed"):
    print(f"Error: {status.get('error', 'Unknown')}")

Job lifecycle

Jobs progress through these statuses:
Status     | Description
pending    | Job created, not yet started
processing | URLs being processed in batches of 5
completed  | All URLs processed successfully
partial    | Finished with some URL failures
failed     | All URLs failed

Job status fields

Field               | Type   | Description                        | Present
job_id              | string | Job identifier (prefixed cj_)      | Always
status              | string | Current job status (see above)     | Always
urls_total          | number | Total URLs submitted               | Always
urls_processed      | number | URLs successfully processed        | Always
urls_failed         | number | URLs that failed processing        | Always
created_at          | number | Job creation time (ms since epoch) | Always
updated_at          | number | Last update time (ms since epoch)  | Always
current_batch       | number | Current batch being processed      | processing only
total_batches       | number | Total number of batches            | processing only
results             | array  | Extraction results                 | completed/partial
actual_cost_dollars | number | Total cost in dollars              | completed/partial
error               | string | Error description                  | partial/failed only

Webhooks

When you provide a webhook_url, Valyu sends a POST request to your endpoint when the job completes (or partially completes/fails). Headers:
Header              | Value
Content-Type        | application/json
User-Agent          | Valyu-Contents/1.0
X-Webhook-Signature | sha256={hex_digest}
X-Webhook-Timestamp | Unix timestamp in seconds
Payload:
{
  "job_id": "cj_a1b2c3d4e5f6g7h8",
  "status": "completed",
  "urls_total": 25,
  "urls_processed": 25,
  "urls_failed": 0,
  "results": [...],
  "actual_cost_dollars": 0.025,
  "error": null
}
Retries: up to 5 attempts with exponential backoff (1s, 2s, 4s, 8s between attempts). 4xx responses are not retried. Timeouts: 5s connect, 15s read.
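
The delivery schedule described above can be reconstructed as a simple backoff function, useful if you want to mirror the same policy when replaying missed webhooks yourself (an illustration, not SDK behaviour):

```python
# Illustrative reconstruction of the retry schedule: up to 5 attempts,
# with exponential delays of 1s, 2s, 4s, 8s between consecutive attempts.
def retry_delays(max_attempts: int = 5, base: float = 1.0) -> list:
    # One delay between each consecutive pair of attempts.
    return [base * (2 ** i) for i in range(max_attempts - 1)]
```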

Verifying webhook signatures

The payload is signed using HMAC-SHA256. The signed payload is the timestamp and JSON body joined by a period: "{timestamp}.{json_payload}".
import hmac
import hashlib

def verify_webhook(
    payload: bytes,
    signature_header: str,
    timestamp_header: str,
    webhook_secret: str
) -> bool:
    signed_payload = f"{timestamp_header}.{payload.decode('utf-8')}"

    expected = hmac.new(
        webhook_secret.encode("utf-8"),
        signed_payload.encode("utf-8"),
        hashlib.sha256
    ).hexdigest()

    received = signature_header.removeprefix("sha256=")
    return hmac.compare_digest(expected, received)
The TypeScript SDK exports a verifyContentsWebhookSignature() helper that handles this for you.
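
To sanity-check the scheme end-to-end, you can sign a payload the way the server does and verify it the way a receiver would. This sketch inlines the same HMAC construction; the secret, job_id, and timestamp are made-up example values:

```python
import hashlib
import hmac
import json

# Round-trip check of the "{timestamp}.{json_payload}" signing scheme.
# All values here are illustrative, not real credentials.
secret = "a1b2c3d4e5f6"
payload = json.dumps({"job_id": "cj_demo", "status": "completed"})
timestamp = "1700000000"

# Sender side: sign timestamp + "." + body with HMAC-SHA256.
signed = f"{timestamp}.{payload}".encode("utf-8")
signature = "sha256=" + hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()

# Receiver side: recompute and compare in constant time.
received = signature.removeprefix("sha256=")
expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
assert hmac.compare_digest(expected, received)
```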

Limits and pricing

Detail                       | Value
Max URLs (sync)              | 10
Max URLs (async)             | 50
Batch size                   | 5 URLs per batch
Timeout per URL (sync)       | 25 seconds
Timeout per URL (async)      | 120 seconds
Base pricing                 | $0.001 per URL
AI features (summary/schema) | +$0.001 per URL
Job expiry (TTL)             | 7 days
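
From the pricing rows above you can estimate a batch's cost up front. A rough sketch, not an official calculator; the response's actual_cost_dollars field remains the source of truth:

```python
# Illustrative cost estimate from the pricing table: $0.001 per URL,
# plus $0.001 per URL when AI features (summary or schema) are enabled.
def estimate_cost_dollars(url_count: int, ai_features: bool = False) -> float:
    per_url = 0.001 + (0.001 if ai_features else 0.0)
    return round(url_count * per_url, 6)
```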

Examples

News Aggregator

Extract structured article data:
{
  "urls": [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://archive.org/details/texts",
    "https://www.gov.uk/search/news"
  ],
  "extract_effort": "auto",
  "summary": {
    "type": "object",
    "properties": {
      "headline": { "type": "string" },
      "summary_text": { "type": "string" },
      "category": { "type": "string" },
      "tags": {
        "type": "array",
        "items": { "type": "string" },
        "maxItems": 5
      }
    },
    "required": ["headline"]
  }
}

Research Paper

Extract structured academic data:
{
  "urls": ["https://arxiv.org/paper/example"],
  "response_length": "max",
  "extract_effort": "high",
  "summary": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "abstract": { "type": "string" },
      "methodology": { "type": "string" },
      "key_findings": {
        "type": "array",
        "items": { "type": "string" }
      },
      "limitations": { "type": "string" }
    },
    "required": ["title"]
  }
}

Product Info

Extract product data:
{
  "urls": ["https://company.com/product-A", "https://company.com/product-B"],
  "extract_effort": "auto",
  "summary": {
    "type": "object",
    "properties": {
      "product_name": { "type": "string" },
      "features": {
        "type": "array",
        "items": { "type": "string" }
      },
      "pricing": { "type": "string" },
      "target_audience": { "type": "string" }
    },
    "required": ["product_name"]
  }
}

Response Format

Raw Content (summary: false)

{
  "success": true,
  "error": null,
  "tx_id": "tx_12345678-1234-1234-1234-123456789abc",
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "url": "https://example.com/article?utm_source=valyu",
      "content": "# AI Breakthrough in Natural Language Processing\n\nPage content in markdown...",
      "description": "Latest AI developments",
      "source": "web",
      "price": 0.001,
      "length": 12840,
      "data_type": "unstructured",
      "image_url": {
        "main": "https://example.com/hero-image.jpg"
      }
    }
  ],
  "urls_requested": 1,
  "urls_processed": 1,
  "urls_failed": 0,
  "total_cost_dollars": 0.001,
  "total_characters": 12840
}

Summary (summary: true or string)

{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": "This article discusses a breakthrough in AI...",
      "summary_success": true,
      "price": 0.002,
      "data_type": "unstructured"
    }
  ]
}

Structured (JSON Schema)

{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": {
        "title": "AI Breakthrough in Natural Language Processing",
        "author": "John Doe",
        "category": "technology",
        "key_points": [
          "New AI model achieves 95% accuracy",
          "Reduces computational requirements by 40%"
        ]
      },
      "summary_success": true,
      "price": 0.002,
      "data_type": "structured"
    }
  ]
}

Response Fields

Field           | Description
title           | Extracted page title
url             | Original URL with UTM tracking parameters
content         | Extracted content (markdown or JSON)
description     | Page meta description or excerpt
source          | Source type; always "web" for URL processing
price           | Cost for processing this URL in dollars
length          | Character count of extracted content
data_type       | "unstructured" or "structured"
summary_success | Whether AI processing succeeded (only when the summary parameter is used)
image_url       | Dictionary of extracted image URLs
screenshot_url  | Pre-signed URL to page screenshot (only when screenshot=true was requested)
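
Because content is a markdown string for unstructured results and a parsed object for structured ones, consumers should branch on data_type. A minimal sketch:

```python
# Sketch of consuming a result dict using the fields above: structured
# results carry a JSON object in `content`, unstructured ones markdown text.
def content_preview(result: dict, limit: int = 200):
    if result.get("data_type") == "structured":
        return result["content"]      # already a parsed object
    return result["content"][:limit]  # markdown string, truncated
```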

Best Practices

Choosing Summary Type

  • false: Fastest and cheapest—no AI
  • true: Basic summary for overviews
  • "string": Custom instructions for specific needs
  • {object}: Structured extraction for data processing

JSON Schema Tips

  • Use clear descriptions to guide extraction
  • Use enums for consistent categorisation
  • Keep schemas under 3 levels deep
  • Mark essential fields as required

Batch Processing

  • Group similar content types together
  • Choose appropriate response length
  • Check summary_success for AI status
  • Track total_cost_dollars

Error Handling

# Check for partial failures (HTTP 206)
if response.status_code == 206:
    body = response.json()
    successful_results = [r for r in body["results"] if r.get("content")]
    failed_count = body["urls_failed"]

# Check AI processing success (summary_success is only present
# when the summary parameter was used)
for result in data["results"]:
    if result.get("summary_success") is False:
        print(f"AI processing failed for {result['url']}")

# Handle complete failures (HTTP 422)
if response.status_code == 422:
    error_message = response.json()["error"]

Try the Contents API

Full API reference with interactive examples

Next Steps

API Reference

Complete parameter documentation

Python SDK

Python integration

TypeScript SDK

TypeScript integration

Integrations

LangChain, LlamaIndex, and more