Turn any web page into clean, structured data. The Contents API extracts content from URLs with batch processing, AI-powered summaries, and structured outputs.
What You Can Do
Feed your AI - Clean data without noise
Aggregate content - Extract structured data from multiple sources
Transform content - Convert web pages into usable formats
Automate research - Pull key information from articles, papers, and reports
Features
Batch Processing - Submit up to 10 URLs synchronously, or up to 50 URLs with async mode.
AI-Powered Structuring - Use JSON schemas to extract specific data points.
Smart Summarisation - Generate tailored summaries with custom instructions.
Pay-per-Success - Only pay for URLs that are successfully processed.
Getting Started
```python
from valyu import Valyu

valyu = Valyu()  # Uses VALYU_API_KEY from env

data = valyu.contents(
    urls=[
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
    ],
    response_length="medium",
    extract_effort="auto",
)
print(data["results"][0]["content"][:500])
```
Returns clean markdown content for each URL.
Response Length
| Length | Characters | Use for |
| --- | --- | --- |
| short | 25,000 | Summaries, key points |
| medium | 50,000 | Articles, blog posts |
| large | 100,000 | Academic papers, long-form content |
| max | Unlimited | Full document extraction |
| Custom integer | 1,000-1,000,000 | Specific requirements |
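It can be convenient to resolve a response_length value client-side before submitting, for example to report an expected character cap. A minimal sketch (resolve_response_length and PRESET_CAPS are illustrative names, not part of the SDK):

```python
# Character caps implied by each response_length preset (mirrors the table
# above); None means unlimited. A plain integer passes through after a
# range check.
PRESET_CAPS = {"short": 25_000, "medium": 50_000, "large": 100_000, "max": None}

def resolve_response_length(value):
    """Return the character cap implied by a response_length value."""
    if isinstance(value, int):
        if not 1_000 <= value <= 1_000_000:
            raise ValueError("custom response_length must be 1,000-1,000,000")
        return value
    if value in PRESET_CAPS:
        return PRESET_CAPS[value]
    raise ValueError(f"unknown response_length: {value!r}")
```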
Extraction Effort
| Effort | Description |
| --- | --- |
| normal | Standard speed and quality (default) |
| high | Better quality, slower |
| auto | Automatically chooses the right level |
Screenshot Capture
Capture visual screenshots of pages alongside content extraction:
```python
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="auto",
    screenshot=True,
)
print(data["results"][0]["screenshot_url"])
```
Screenshots are captured during page rendering and returned as pre-signed URLs. PDF files do not support screenshots.
Advanced Features
Summary Options
The summary field accepts four types of values:
No AI Processing (false)
```python
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="normal",
    summary=False,
)
print(data["results"][0]["content"][:300])
```
Basic Summary (true)
```python
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="auto",
    summary=True,
)
print(data["results"][0]["content"])
```
Custom Instructions (string)
```python
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/research-paper"],
    extract_effort="auto",
    summary="Summarise the methodology, key findings, and practical applications in 2-3 paragraphs",
)
print(data["results"][0]["content"])
```
JSON Schema (object)
```python
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/product-page"],
    extract_effort="auto",
    summary={
        "type": "object",
        "properties": {
            "product_name": {"type": "string", "description": "Name of the product"},
            "price": {"type": "number", "description": "Product price in USD"},
            "features": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 5,
                "description": "Key product features",
            },
            "availability": {
                "type": "string",
                "enum": ["in_stock", "out_of_stock", "preorder"],
                "description": "Product availability status",
            },
        },
        "required": ["product_name", "price"],
    },
)
print(data["results"][0]["content"])
```
JSON Schema Reference
For structured extraction, you can use any valid JSON Schema. See the JSON Schema Type Reference for details.
Limits:
5,000 characters max
3 levels deep max
20 properties per object max
Common types:
string - Text with optional format validation
number / integer - Numbers with optional min/max
boolean - True/false
array - Lists with optional size limits
object - Nested structures
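Since an over-limit schema fails the request, it can help to pre-check these limits client-side before submitting. A minimal sketch, assuming nesting levels are counted through both properties and items (check_schema_limits is illustrative, not part of the SDK):

```python
import json

def check_schema_limits(schema: dict) -> None:
    """Raise ValueError if a schema exceeds the documented limits:
    5,000 characters, 3 levels deep, 20 properties per object."""
    if len(json.dumps(schema)) > 5_000:
        raise ValueError("schema exceeds 5,000 characters")

    def walk(node, depth):
        if depth > 3:
            raise ValueError("schema nests more than 3 levels deep")
        if isinstance(node, dict):
            props = node.get("properties", {})
            if len(props) > 20:
                raise ValueError("object has more than 20 properties")
            for child in props.values():
                walk(child, depth + 1)
            if "items" in node:
                walk(node["items"], depth + 1)

    walk(schema, 1)
```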
Async Processing
For large-scale extraction (11-50 URLs) or non-blocking workflows, use async mode. Submit URLs, get a job_id back immediately, then either poll for results or receive them via webhook.
When to use async
More than 10 URLs — required for batches of 11-50 URLs
Non-blocking workflows — submit and continue processing while extraction runs in the background
Webhook-driven architectures — receive results via webhook instead of polling
Extended processing — async mode provides 120s timeout per URL vs 25s for sync
Async mode is required when submitting more than 10 URLs. For 1-10 URLs, you can optionally use async mode for non-blocking workflows.
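These rules can be encoded in a small dispatch helper that decides the async_mode flag for a batch. A sketch (choose_mode and the constants are illustrative names):

```python
SYNC_MAX_URLS = 10
ASYNC_MAX_URLS = 50

def choose_mode(urls: list[str], prefer_async: bool = False) -> bool:
    """Return True if the request should use async_mode."""
    n = len(urls)
    if n == 0 or n > ASYNC_MAX_URLS:
        raise ValueError("submit between 1 and 50 URLs")
    # Async is mandatory above the sync limit, optional below it.
    return n > SYNC_MAX_URLS or prefer_async
```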
Quick start
```python
from valyu import Valyu

valyu = Valyu()

# Submit async job — returns immediately with a job_id
job = valyu.contents(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        # ... up to 50 URLs
    ],
    async_mode=True,
    webhook_url="https://your-app.com/webhooks/valyu",  # optional
)
print(f"Job ID: {job['job_id']}")

# Option 1: Block until complete (SDK handles polling)
result = valyu.wait_for_contents_job(
    job["job_id"],
    poll_interval=5,     # seconds between polls
    max_wait_time=3600,  # max seconds to wait
    on_progress=lambda s: print(
        f"{s['status']} — batch {s.get('current_batch', '?')}/{s.get('total_batches', '?')}"
    ),
)

# Option 2: Pass wait=True when submitting to auto-poll
result = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2"],
    async_mode=True,
    wait=True,  # blocks until job completes
)

for r in result["results"]:
    print(f"{r['title']}: {r['length']} characters")
print(f"Total cost: ${result['actual_cost_dollars']}")
```
The webhook_secret is only returned in the initial 202 response. Store it immediately — you cannot retrieve it later.
Initial response (HTTP 202)
```json
{
  "success": true,
  "job_id": "cj_a1b2c3d4e5f6g7h8",
  "status": "pending",
  "urls_total": 25,
  "poll_url": "/contents/jobs/cj_a1b2c3d4e5f6g7h8",
  "tx_id": "tx_async-1234-5678-abcd-ef0123456789",
  "webhook_secret": "a1b2c3d4e5f6..."
}
```
Manual polling
If you prefer to poll manually instead of using the SDK convenience methods:
```python
import time

while True:
    status = valyu.get_contents_job(job["job_id"])
    print(
        f"Status: {status['status']} "
        f"({status.get('current_batch', '?')}/{status.get('total_batches', '?')} batches)"
    )
    if status["status"] in ("completed", "partial", "failed"):
        break
    time.sleep(2)

if status["status"] in ("completed", "partial"):
    for result in status["results"]:
        print(f"{result['title']}: {result['length']} characters")

if status["status"] in ("partial", "failed"):
    print(f"Error: {status.get('error', 'Unknown')}")
```
Job lifecycle
Jobs progress through these statuses:
| Status | Description |
| --- | --- |
| pending | Job created, not yet started |
| processing | URLs being processed in batches of 5 |
| completed | All URLs processed successfully |
| partial | Finished with some URL failures |
| failed | All URLs failed |
Job status fields
| Field | Type | Description | Present |
| --- | --- | --- | --- |
| job_id | string | Job identifier (prefixed cj_) | Always |
| status | string | Current job status (see above) | Always |
| urls_total | number | Total URLs submitted | Always |
| urls_processed | number | URLs successfully processed | Always |
| urls_failed | number | URLs that failed processing | Always |
| created_at | number | Job creation time (ms since epoch) | Always |
| updated_at | number | Last update time (ms since epoch) | Always |
| current_batch | number | Current batch being processed | processing only |
| total_batches | number | Total number of batches | processing only |
| results | array | Extraction results | completed/partial |
| actual_cost_dollars | number | Total cost in dollars | completed/partial |
| error | string | Error description | partial/failed only |
Webhooks
When you provide a webhook_url, Valyu sends a POST request to your endpoint when the job completes (or partially completes/fails).
Headers:
| Header | Value |
| --- | --- |
| Content-Type | application/json |
| User-Agent | Valyu-Contents/1.0 |
| X-Webhook-Signature | sha256={hex_digest} |
| X-Webhook-Timestamp | Unix timestamp in seconds |
Payload:
```json
{
  "job_id": "cj_a1b2c3d4e5f6g7h8",
  "status": "completed",
  "urls_total": 25,
  "urls_processed": 25,
  "urls_failed": 0,
  "results": [ ... ],
  "actual_cost_dollars": 0.025,
  "error": null
}
```
Retries: Max 5 attempts with exponential backoff (1s, 2s, 4s, 8s). No retry on 4xx. 5s connect timeout, 15s read timeout.
Verifying webhook signatures
The payload is signed using HMAC-SHA256. The signed payload is the timestamp and JSON body joined by a period: "{timestamp}.{json_payload}".
```python
import hashlib
import hmac

def verify_webhook(
    payload: bytes,
    signature_header: str,
    timestamp_header: str,
    webhook_secret: str,
) -> bool:
    signed_payload = f"{timestamp_header}.{payload.decode('utf-8')}"
    expected = hmac.new(
        webhook_secret.encode("utf-8"),
        signed_payload.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    received = signature_header.removeprefix("sha256=")
    return hmac.compare_digest(expected, received)
```
The TypeScript SDK exports a verifyContentsWebhookSignature() helper that handles this for you.
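You can also exercise the scheme locally before wiring up a real endpoint. A self-contained sketch of both sides of the handshake (the secret, body, and timestamp values are made up):

```python
import hashlib
import hmac
import json
import time

def sign_payload(secret: str, timestamp: str, body: str) -> str:
    """Compute the X-Webhook-Signature value: HMAC-SHA256 over
    "{timestamp}.{json_payload}", hex-encoded with a sha256= prefix."""
    digest = hmac.new(
        secret.encode("utf-8"),
        f"{timestamp}.{body}".encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    return f"sha256={digest}"

# Made-up values standing in for a real webhook delivery
secret = "a1b2c3d4e5f6"
body = json.dumps({"job_id": "cj_a1b2c3d4e5f6g7h8", "status": "completed"})
timestamp = str(int(time.time()))
header = sign_payload(secret, timestamp, body)

# Receiver side: recompute with the stored webhook_secret and compare
expected = sign_payload(secret, timestamp, body)
print(hmac.compare_digest(expected, header))  # True for an untampered payload
```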
Limits and pricing
| Detail | Value |
| --- | --- |
| Max URLs (sync) | 10 |
| Max URLs (async) | 50 |
| Batch size | 5 URLs per batch |
| Timeout per URL (sync) | 25 seconds |
| Timeout per URL (async) | 120 seconds |
| Base pricing | $0.001 per URL |
| AI features (summary/schema) | +$0.001 per URL |
| Job expiry (TTL) | 7 days |
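A worst-case budget follows directly from the pricing rows. A sketch (estimate_cost is an illustrative helper; actual billing counts only URLs that process successfully):

```python
BASE_PRICE = 0.001    # dollars per URL
AI_SURCHARGE = 0.001  # extra per URL when summary or a JSON schema is used

def estimate_cost(n_urls: int, uses_ai: bool) -> float:
    """Upper-bound cost estimate for a batch, in dollars."""
    per_url = BASE_PRICE + (AI_SURCHARGE if uses_ai else 0.0)
    return round(n_urls * per_url, 6)
```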
Examples
News Aggregator
Extract structured article data:
```json
{
  "urls": [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://archive.org/details/texts",
    "https://www.gov.uk/search/news"
  ],
  "extract_effort": "auto",
  "summary": {
    "type": "object",
    "properties": {
      "headline": { "type": "string" },
      "summary_text": { "type": "string" },
      "category": { "type": "string" },
      "tags": {
        "type": "array",
        "items": { "type": "string" },
        "maxItems": 5
      }
    },
    "required": ["headline"]
  }
}
```
Research Paper
Extract structured academic data:
```json
{
  "urls": ["https://arxiv.org/paper/example"],
  "response_length": "max",
  "extract_effort": "high",
  "summary": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "abstract": { "type": "string" },
      "methodology": { "type": "string" },
      "key_findings": {
        "type": "array",
        "items": { "type": "string" }
      },
      "limitations": { "type": "string" }
    },
    "required": ["title"]
  }
}
```
Product Info
Extract product data:
```json
{
  "urls": ["https://company.com/product-A", "https://company.com/product-B"],
  "extract_effort": "auto",
  "summary": {
    "type": "object",
    "properties": {
      "product_name": { "type": "string" },
      "features": {
        "type": "array",
        "items": { "type": "string" }
      },
      "pricing": { "type": "string" },
      "target_audience": { "type": "string" }
    },
    "required": ["product_name"]
  }
}
```
Raw Content (summary: false)
```json
{
  "success": true,
  "error": null,
  "tx_id": "tx_12345678-1234-1234-1234-123456789abc",
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "url": "https://example.com/article?utm_source=valyu",
      "content": "# AI Breakthrough in Natural Language Processing\n\nPage content in markdown...",
      "description": "Latest AI developments",
      "source": "web",
      "price": 0.001,
      "length": 12840,
      "data_type": "unstructured",
      "image_url": {
        "main": "https://example.com/hero-image.jpg"
      }
    }
  ],
  "urls_requested": 1,
  "urls_processed": 1,
  "urls_failed": 0,
  "total_cost_dollars": 0.001,
  "total_characters": 12840
}
```
Summary (summary: true or string)
```json
{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": "This article discusses a breakthrough in AI...",
      "summary_success": true,
      "price": 0.002,
      "data_type": "unstructured"
    }
  ]
}
```
Structured (JSON Schema)
```json
{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": {
        "title": "AI Breakthrough in Natural Language Processing",
        "author": "John Doe",
        "category": "technology",
        "key_points": [
          "New AI model achieves 95% accuracy",
          "Reduces computational requirements by 40%"
        ]
      },
      "summary_success": true,
      "price": 0.002,
      "data_type": "structured"
    }
  ]
}
```
Response Fields
| Field | Description |
| --- | --- |
| title | Extracted page title |
| url | Original URL with UTM tracking parameters |
| content | Extracted content (markdown or JSON) |
| description | Page meta description or excerpt |
| source | Source type - always "web" for URL processing |
| price | Cost for processing this URL in dollars |
| length | Character count of extracted content |
| data_type | "unstructured" or "structured" |
| summary_success | Whether AI processing succeeded (only when the summary parameter is used) |
| image_url | Dictionary of extracted image URLs |
| screenshot_url | Pre-signed URL to the page screenshot (only when screenshot=true was requested) |
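When consuming results, the data_type field tells you whether content arrives as markdown text or as a dict shaped by your JSON schema. A consumption sketch (handle_result is an illustrative helper, not part of the SDK):

```python
def handle_result(result: dict) -> str:
    """Render one result for display, branching on data_type."""
    if result["data_type"] == "structured":
        # content is a dict matching the requested JSON schema
        return ", ".join(f"{k}={v}" for k, v in result["content"].items())
    # content is markdown text; show a short excerpt
    return result["content"][:200]
```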
Best Practices
Choosing Summary Type
false: Fastest and cheapest—no AI
true: Basic summary for overviews
"string": Custom instructions for specific needs
{object}: Structured extraction for data processing
JSON Schema Tips
Use clear descriptions to guide extraction
Use enums for consistent categorisation
Keep schemas under 3 levels deep
Mark essential fields as required
Batch Processing
Group similar content types together
Choose appropriate response length
Check summary_success for AI status
Track total_cost_dollars
Error Handling
```python
# Check for partial failures (HTTP 206)
if response.status_code == 206:
    successful_results = response.json()["results"]
    failed_count = response.json()["urls_failed"]

# Check AI processing success per result
for result in successful_results:
    if not result.get("summary_success", True):
        print(f"AI processing failed for {result['url']}")

# Handle complete failures (HTTP 422)
if response.status_code == 422:
    error_message = response.json()["error"]
```
Try the Contents API - Full API reference with interactive examples
Next Steps
API Reference - Complete parameter documentation
Python SDK - Python integration
TypeScript SDK - TypeScript integration
Integrations - LangChain, LlamaIndex, and more