The Contents API enables you to extract clean, structured content from web pages with optional AI-powered processing, including summarization and structured data extraction.

Basic Usage

from valyu import Valyu

valyu = Valyu()

response = valyu.contents([
    "https://en.wikipedia.org/wiki/Machine_learning"
])

print(f"Processed {response.urls_processed} of {response.urls_requested} URLs")
if response.results:
    for result in response.results:
        print(f"Title: {result.title}")
        print(f"Content length: {result.length} characters")
        print(f"Content preview: {result.content[:200]}...")

Parameters

URLs (Required)

Parameter | Type | Description
urls | List[str] | Array of URLs to process (max 10 sync, max 50 async)
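
Because sync requests cap out at 10 URLs, longer lists must either be split into multiple requests or sent with async mode. A minimal client-side chunking sketch — `chunk_urls` is an illustrative helper, not part of the SDK:

```python
from typing import List

SYNC_URL_LIMIT = 10  # per the table above

def chunk_urls(urls: List[str], limit: int = SYNC_URL_LIMIT) -> List[List[str]]:
    """Split a URL list into batches that fit the sync request limit."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]

# 23 URLs -> 3 sync requests of sizes 10, 10, and 3
batches = chunk_urls([f"https://example.com/page{i}" for i in range(23)])
```

Each batch can then be passed to `valyu.contents(...)` as its own request.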

Options (Optional)

Parameter | Type | Description | Default
summary | bool \| str \| dict | AI processing configuration: False (none), True (auto), string (custom), or JSON schema | False
extract_effort | "normal" \| "high" \| "auto" | Processing effort level for content extraction | "normal"
response_length | str \| int | Content length per URL: "short" (25k), "medium" (50k), "large" (100k), "max", or custom | "short"
screenshot | bool | Request page screenshots. When True, results include a screenshot_url field | False

Response Format

class ContentsResponse:
    success: bool
    error: Optional[str]
    tx_id: str
    urls_requested: int
    urls_processed: int
    urls_failed: int
    results: List[ContentsResult]
    total_cost_dollars: float
    total_characters: int

class ContentsResult:
    url: str
    title: str
    content: Union[str, dict]  # string for raw content, dict for structured
    description: Optional[str]
    length: int
    price: float
    source: str
    summary_success: Optional[bool]
    data_type: Optional[str]
    image_url: Optional[Dict[str, str]]
    screenshot_url: Optional[str]  # Only present when screenshot=True
    citation: Optional[str]
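
Since content is a Union — a string for raw or summarized text, a dict when a JSON schema was supplied via summary — callers should branch on its type before use. A small sketch (`handle_content` is illustrative, not part of the SDK):

```python
from typing import Union

def handle_content(content: Union[str, dict]) -> str:
    """Return a printable view of a result's content field."""
    if isinstance(content, dict):
        # Structured extraction: keys follow the schema passed via `summary`
        return ", ".join(f"{k}={v!r}" for k, v in content.items())
    # Raw or summarized text: show a preview
    return content[:200]

print(handle_content({"company_name": "Acme", "industry": "tech"}))
print(handle_content("Machine learning is a field of study..."))
```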

Parameter Examples

Basic Content Extraction

Extract clean content without AI processing:
response = valyu.contents([
    "https://www.python.org",
    "https://nodejs.org"
])

if response.results:
    for result in response.results:
        print(f"{result.title}: {result.length} characters")

AI Summary (Boolean)

Get automatic AI summaries of the extracted content:
response = valyu.contents([
    "https://en.wikipedia.org/wiki/Artificial_intelligence"
], summary=True, response_length="medium")

if response.results and response.results[0].content:
    print("AI Summary:", response.results[0].content)

Custom Summary Instructions

Provide specific instructions for AI summarization:
response = valyu.contents([
    "https://en.wikipedia.org/wiki/Artificial_intelligence"
], 
summary="Summarize the main AI trends mentioned in exactly 3 bullet points",
response_length="medium",
extract_effort="high")

Structured Data Extraction

Extract specific data points using JSON schema:
response = valyu.contents([
    "https://www.openai.com"
], 
extract_effort="high",
response_length="large",
summary={
    "type": "object",
    "properties": {
        "company_name": { 
            "type": "string",
            "description": "The name of the company"
        },
        "industry": { 
            "type": "string",
            "enum": ["tech", "finance", "healthcare", "retail", "other"],
            "description": "Primary industry sector"
        },
        "key_products": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 5,
            "description": "Main products or services"
        },
        "founded_year": {
            "type": "number",
            "description": "Year the company was founded"
        }
    },
    "required": ["company_name", "industry"]
})

if response.results and response.results[0].content:
    print("Extracted data:", response.results[0].content)

Response Length Control

Control the amount of content extracted per URL:
response = valyu.contents([
    "https://arxiv.org/abs/2301.00001",
    "https://arxiv.org/abs/1706.03762",
    "https://www.science.org/doi/10.1126/science.1234567"
], 
response_length="large",  # More content for academic papers
summary="Extract the main research findings and methodology",
extract_effort="high")

Extract Effort Levels

Control the extraction quality and processing intensity:
# Normal (default) - Fast
normal_response = valyu.contents(urls, extract_effort="normal")

# High - Enhanced quality for complex layouts and JS heavy pages
high_quality_response = valyu.contents(urls, extract_effort="high")

# Auto - Intelligent effort selection
auto_response = valyu.contents(urls, extract_effort="auto")

Response Length Options

Control content length with predefined or custom limits:
# Predefined lengths
short_response = valyu.contents(urls, response_length="short")    # 25k characters
medium_response = valyu.contents(urls, response_length="medium")  # 50k characters  
large_response = valyu.contents(urls, response_length="large")    # 100k characters
full_response = valyu.contents(urls, response_length="max")       # No limit

# Custom length
custom_response = valyu.contents(urls, response_length=15000)     # Custom character limit

Use Case Examples

Research Paper Analysis

Build an AI-powered academic research assistant that extracts and analyzes research papers:
def analyze_research_paper(paper_url: str):
    response = valyu.contents([paper_url], 
    summary={
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "authors": { 
                "type": "array", 
                "items": {"type": "string"} 
            },
            "abstract": {"type": "string"},
            "key_contributions": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 5,
                "description": "Main contributions of the research"
            },
            "methodology": { 
                "type": "string",
                "description": "Research methodology and approach"
            },
            "results_summary": { 
                "type": "string",
                "description": "Summary of key findings and results"
            },
            "implications": {
                "type": "string",
                "description": "Broader implications and significance"
            },
            "citations_count": {"type": "number"},
            "publication_date": {"type": "string"}
        },
        "required": ["title", "abstract", "key_contributions", "methodology"]
    },
    response_length="max",
    extract_effort="high")

    if response.success and response.results and response.results[0].content:
        analysis = response.results[0].content
        
        print("=== Research Paper Analysis ===")
        print(f"Title: {analysis['title']}")
        print(f"Authors: {', '.join(analysis.get('authors', []))}")
        print(f"\nAbstract: {analysis['abstract']}")
        
        print("\nKey Contributions:")
        for i, contrib in enumerate(analysis.get('key_contributions', []), 1):
            print(f"{i}. {contrib}")
        
        print(f"\nMethodology: {analysis['methodology']}")
        print(f"\nResults: {analysis['results_summary']}")
        print(f"\nImplications: {analysis['implications']}")
        
        return analysis
    
    return None

# Usage
paper_analysis = analyze_research_paper(
    "https://arxiv.org/abs/2024.01234"
)

E-commerce Product Intelligence

Create a product research tool that extracts comprehensive product data:
def analyze_products(product_urls: List[str]):
    response = valyu.contents(product_urls, 
    summary={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "product_name": {"type": "string"},
                        "brand": {"type": "string"},
                        "price": {"type": "string"},
                        "original_price": {"type": "string"},
                        "discount_percentage": {"type": "string"},
                        "description": {"type": "string"},
                        "key_features": {
                            "type": "array",
                            "items": {"type": "string"},
                            "maxItems": 8
                        },
                        "specifications": {
                            "type": "object",
                            "description": "Technical specifications"
                        },
                        "customer_rating": {"type": "number"},
                        "review_count": {"type": "number"},
                        "availability": { 
                            "type": "string",
                            "enum": ["in_stock", "out_of_stock", "limited", "pre_order"]
                        },
                        "shipping_info": {"type": "string"},
                        "warranty_info": {"type": "string"}
                    },
                    "required": ["product_name", "price", "description"]
                }
            },
            "comparison_summary": {
                "type": "string",
                "description": "Overall comparison of the products"
            }
        }
    },
    extract_effort="high",
    response_length="large")

    if response.success and response.results and response.results[0].content:
        analysis = response.results[0].content
        
        print("=== Product Analysis ===")
        for i, product in enumerate(analysis.get('products', []), 1):
            print(f"\n{i}. {product['product_name']}")
            print(f"   Brand: {product['brand']}")
            print(f"   Price: {product['price']}")
            print(f"   Rating: {product['customer_rating']}/5 ({product['review_count']} reviews)")
            print(f"   Availability: {product['availability']}")
            
            if product.get('key_features'):
                print("   Key Features:")
                for feature in product['key_features']:
                    print(f"     • {feature}")
        
        print(f"\n=== Comparison Summary ===")
        print(analysis['comparison_summary'])
        
        return analysis
    
    return None

# Usage
product_comparison = analyze_products([
    "https://amazon.com/product1",
    "https://bestbuy.com/product2",
    "https://target.com/product3"
])

Technical Documentation Processor

Build a documentation analysis tool that extracts API information and technical details:
def process_documentation(doc_urls: List[str]):
    response = valyu.contents(doc_urls, 
    summary={
        "type": "object",
        "properties": {
            "documentation_overview": {
                "type": "string",
                "description": "Overview of what the documentation covers"
            },
            "api_endpoints": {
                "type": "array",
                "items": {
                    "type": "object", 
                    "properties": {
                        "method": {"type": "string"},
                        "path": {"type": "string"},
                        "description": {"type": "string"},
                        "parameters": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "name": {"type": "string"},
                                    "type": {"type": "string"},
                                    "required": {"type": "boolean"},
                                    "description": {"type": "string"}
                                }
                            }
                        },
                        "response_format": {"type": "string"}
                    }
                }
            },
            "authentication": {
                "type": "object",
                "properties": {
                    "method": {"type": "string"},
                    "description": {"type": "string"},
                    "example": {"type": "string"}
                }
            },
            "rate_limits": {"type": "string"},
            "code_examples": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "language": {"type": "string"},
                        "example": {"type": "string"},
                        "description": {"type": "string"}
                    }
                }
            },
            "common_errors": {
                "type": "array",
                "items": {"type": "string"}
            }
        },
        "required": ["documentation_overview", "api_endpoints", "authentication"]
    },
    extract_effort="high",
    response_length="large")

    if response.success and response.results and response.results[0].content:
        docs = response.results[0].content
        
        print("=== API Documentation Analysis ===")
        print(f"\nOverview: {docs['documentation_overview']}")
        
        print("\n=== Authentication ===")
        auth = docs.get('authentication', {})
        print(f"Method: {auth.get('method')}")
        print(f"Description: {auth.get('description')}")
        
        print("\n=== API Endpoints ===")
        for i, endpoint in enumerate(docs.get('api_endpoints', []), 1):
            print(f"\n{i}. {endpoint['method']} {endpoint['path']}")
            print(f"   Description: {endpoint['description']}")
            
            if endpoint.get('parameters'):
                print("   Parameters:")
                for param in endpoint['parameters']:
                    required = "(required)" if param['required'] else "(optional)"
                    print(f"     • {param['name']} ({param['type']}) {required}: {param['description']}")
        
        if docs.get('rate_limits'):
            print(f"\n=== Rate Limits ===")
            print(docs['rate_limits'])
        
        return docs
    
    return None

# Usage
api_docs = process_documentation([
    "https://docs.example.com/api-reference",
    "https://developers.service.com/guide"
])

Async Processing

For large-scale extraction (11-50 URLs) or non-blocking workflows, use async mode.
Async mode is required when submitting more than 10 URLs and supports up to 50 URLs per request. URLs are processed in batches of 5 with a 120s timeout per URL (versus 25s in sync mode), and jobs expire after 7 days.
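
Given the figures above, you can estimate a worst-case job duration before submitting. This back-of-envelope sketch assumes batches run sequentially and each batch is bounded by the per-URL timeout — consistent with the numbers quoted, but not a behavior the API guarantees:

```python
import math

BATCH_SIZE = 5
PER_URL_TIMEOUT_S = 120  # async timeout; sync is 25s

def worst_case_seconds(url_count: int) -> int:
    """Upper bound: sequential batches, each bounded by the per-URL timeout."""
    batches = math.ceil(url_count / BATCH_SIZE)
    return batches * PER_URL_TIMEOUT_S

# 50 URLs -> 10 batches -> at most ~20 minutes
print(worst_case_seconds(50))
```

This is useful for picking a sensible max_wait_time when polling.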

Submit and wait

The simplest approach — pass wait=True to block until the job completes:
result = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2", ...],
    async_mode=True,
    wait=True,              # blocks until job completes
    poll_interval=5,        # seconds between polls (default: 5)
    max_wait_time=3600,     # max seconds to wait (default: 3600)
)

for r in result["results"]:
    print(f"{r['title']}: {r['length']} characters")
print(f"Total cost: ${result['actual_cost_dollars']}")

Submit, then wait separately

For more control, submit first, then call wait_for_contents_job() with a progress callback:
# Submit — returns immediately
job = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2", ...],
    async_mode=True,
    webhook_url="https://your-app.com/webhooks/valyu",  # optional
)
print(f"Job ID: {job['job_id']}")

# Store the webhook_secret immediately — it is ONLY returned here
if job.get("webhook_secret"):
    save_webhook_secret(job["job_id"], job["webhook_secret"])

# Wait with progress tracking
result = valyu.wait_for_contents_job(
    job["job_id"],
    poll_interval=5,
    max_wait_time=3600,
    on_progress=lambda s: print(f"  {s['status']} — batch {s.get('current_batch', '?')}/{s.get('total_batches', '?')}"),
)

if result["status"] in ("completed", "partial"):
    for r in result["results"]:
        print(f"{r['title']}: {r['length']} characters")

Manual polling

If you prefer full control over the polling loop:
import time

while True:
    status = valyu.get_contents_job(job["job_id"])
    print(f"Status: {status['status']}")

    if status["status"] in ("completed", "partial", "failed"):
        break
    time.sleep(2)

Async parameters

Parameter | Type | Description | Default
async_mode | bool | Process URLs asynchronously. Required for more than 10 URLs. | False
webhook_url | str | HTTPS URL to receive results via webhook POST. | None
wait | bool | Block until the job completes (SDK handles polling). | False
poll_interval | int | Seconds between polls when wait=True or using wait_for_contents_job. | 5
max_wait_time | int | Max seconds to wait before timing out. | 3600
on_progress | Callable | Callback invoked on each poll with the current status dict. | None

Webhook verification

Webhooks are signed using HMAC-SHA256 with format "{timestamp}.{json_body}". See the Content Extraction guide for full verification examples.
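
Based on the signing format above, verification might look like the following sketch. How the signature and timestamp are delivered (e.g. header names) is not specified here, so the function takes them as plain arguments — consult the Content Extraction guide for the exact transport details:

```python
import hashlib
import hmac

def verify_webhook(secret: str, timestamp: str, body: str, signature: str) -> bool:
    """Recompute HMAC-SHA256 over "{timestamp}.{json_body}" and compare."""
    message = f"{timestamp}.{body}".encode()
    expected = hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, signature)
```

Use the webhook_secret stored at job submission time as the key; compare against the raw request body, not a re-serialized copy.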

Async response types

# Initial response (HTTP 202)
class ContentsAsyncResponse:
    success: bool
    job_id: str
    status: str               # Always "pending"
    urls_total: int
    poll_url: str
    tx_id: str
    webhook_secret: Optional[str]  # ONLY returned here — store immediately

# Job status response (polling / wait result)
class ContentsJobResponse:
    success: bool
    job_id: str
    status: str               # "pending" | "processing" | "completed" | "partial" | "failed"
    urls_total: int
    urls_processed: int
    urls_failed: int
    created_at: int           # Milliseconds since epoch
    updated_at: int
    current_batch: Optional[int]           # Present during "processing"
    total_batches: Optional[int]           # Present during "processing"
    results: Optional[List[ContentsResult]]  # Present when completed/partial
    actual_cost_dollars: Optional[float]     # Present when completed/partial
    error: Optional[str]                     # Present when partial/failed
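
Note that created_at and updated_at are milliseconds since epoch, so divide by 1000 before converting for display. A small illustrative helper:

```python
from datetime import datetime, timezone

def job_timestamp(ms: int) -> datetime:
    """Convert a job's millisecond epoch timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

print(job_timestamp(1700000000000).isoformat())
```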

Error Handling

response = valyu.contents(urls, **options)

if not response.success:
    print("Contents extraction failed:", response.error)
    raise SystemExit(1)  # stop here: no results to process

# Check for partial failures
if response.urls_failed and response.urls_failed > 0:
    print(f"{response.urls_failed} of {response.urls_requested} URLs failed")

# Process successful results
if response.results:
    for index, result in enumerate(response.results):
        print(f"Result {index + 1}:")
        print(f"  Title: {result.title}")
        print(f"  URL: {result.url}")
        print(f"  Length: {result.length} characters")
        
        if result.summary_success:
            print(f"  Summary: {result.content}")