The Contents API enables you to extract clean, structured content from web pages with optional AI-powered processing, including summarization and structured data extraction.

Basic Usage

from valyu import Valyu

valyu = Valyu()

response = valyu.contents([
    "https://en.wikipedia.org/wiki/Machine_learning"
])

print(f"Processed {response.urls_processed} of {response.urls_requested} URLs")
if response.results:
    for result in response.results:
        print(f"Title: {result.title}")
        print(f"Content length: {result.length} characters")
        print(f"Content preview: {result.content[:200]}...")

Parameters

URLs (Required)

Parameter | Type | Description
urls | List[str] | Array of URLs to process (max 10 sync, max 50 async)
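
Because sync requests cap out at 10 URLs, longer lists must either be split into multiple requests or sent with async mode. A minimal client-side chunking sketch — `chunk_urls` is an illustrative helper, not part of the SDK:

```python
from typing import List

SYNC_URL_LIMIT = 10  # per the table above

def chunk_urls(urls: List[str], limit: int = SYNC_URL_LIMIT) -> List[List[str]]:
    """Split a URL list into batches that fit the sync request limit."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]

# 23 URLs -> 3 sync requests of sizes 10, 10, and 3
batches = chunk_urls([f"https://example.com/page{i}" for i in range(23)])
```

Each batch can then be passed to `valyu.contents(...)` as its own request.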

Options (Optional)

Parameter | Type | Description | Default
summary | bool \| str \| dict | AI processing configuration: False (none), True (auto), string (custom), or JSON schema | False
extract_effort | "normal" \| "high" \| "auto" | Processing effort level for content extraction | "normal"
response_length | str \| int | Content length per URL: "short" (25k), "medium" (50k), "large" (100k), "max", or custom | "short"
screenshot | bool | Request page screenshots. When True, results include a screenshot_url field | False

Response Format

class ContentsResponse:
    success: bool
    error: Optional[str]
    tx_id: str
    urls_requested: int
    urls_processed: int
    urls_failed: int
    results: List[ContentsResult]
    total_cost_dollars: float
    total_characters: int

class ContentsResult:
    url: str
    title: str
    content: Union[str, dict]  # string for raw content, dict for structured
    description: Optional[str]
    length: int
    price: float
    source: str
    summary_success: Optional[bool]
    data_type: Optional[str]
    image_url: Optional[Dict[str, str]]
    screenshot_url: Optional[str]  # Only present when screenshot=True
    citation: Optional[str]
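
Since content is a Union — a string for raw or summarized text, a dict when a JSON schema was supplied via summary — callers should branch on its type before use. A small sketch (`handle_content` is illustrative, not part of the SDK):

```python
from typing import Union

def handle_content(content: Union[str, dict]) -> str:
    """Return a printable view of a result's content field."""
    if isinstance(content, dict):
        # Structured extraction: keys follow the schema passed via `summary`
        return ", ".join(f"{k}={v!r}" for k, v in content.items())
    # Raw or summarized text: show a preview
    return content[:200]

print(handle_content({"company_name": "Acme", "industry": "tech"}))
print(handle_content("Machine learning is a field of study..."))
```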

Parameter Examples

Basic Content Extraction

Extract clean content without AI processing:
response = valyu.contents([
    "https://www.python.org",
    "https://nodejs.org"
])

if response.results:
    for result in response.results:
        print(f"{result.title}: {result.length} characters")

AI Summary (Boolean)

Get automatic AI summaries of the extracted content:
response = valyu.contents([
    "https://en.wikipedia.org/wiki/Artificial_intelligence"
], summary=True, response_length="medium")

if response.results and response.results[0].content:
    print("AI Summary:", response.results[0].content)

Custom Summary Instructions

Provide specific instructions for AI summarization:
response = valyu.contents([
    "https://en.wikipedia.org/wiki/Artificial_intelligence"
], 
summary="Summarize the main AI trends mentioned in exactly 3 bullet points",
response_length="medium",
extract_effort="high")

Structured Data Extraction

Extract specific data points using JSON schema:
response = valyu.contents([
    "https://www.openai.com"
], 
extract_effort="high",
response_length="large",
summary={
    "type": "object",
    "properties": {
        "company_name": { 
            "type": "string",
            "description": "The name of the company"
        },
        "industry": { 
            "type": "string",
            "enum": ["tech", "finance", "healthcare", "retail", "other"],
            "description": "Primary industry sector"
        },
        "key_products": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 5,
            "description": "Main products or services"
        },
        "founded_year": {
            "type": "number",
            "description": "Year the company was founded"
        }
    },
    "required": ["company_name", "industry"]
})

if response.results and response.results[0].content:
    print("Extracted data:", response.results[0].content)

Response Length Control

Control the amount of content extracted per URL:
response = valyu.contents([
    "https://arxiv.org/abs/2301.00001",
    "https://arxiv.org/abs/1706.03762",
    "https://www.science.org/doi/10.1126/science.1234567"
], 
response_length="large",  # More content for academic papers
summary="Extract the main research findings and methodology",
extract_effort="high")

Extract Effort Levels

Control the extraction quality and processing intensity:
# Normal (default) - Fast
normal_response = valyu.contents(urls, extract_effort="normal")

# High - Enhanced quality for complex layouts and JS heavy pages
high_quality_response = valyu.contents(urls, extract_effort="high")

# Auto - Intelligent effort selection
auto_response = valyu.contents(urls, extract_effort="auto")

Response Length Options

Control content length with predefined or custom limits:
# Predefined lengths
short_response = valyu.contents(urls, response_length="short")    # 25k characters
medium_response = valyu.contents(urls, response_length="medium")  # 50k characters  
large_response = valyu.contents(urls, response_length="large")    # 100k characters
full_response = valyu.contents(urls, response_length="max")       # No limit

# Custom length
custom_response = valyu.contents(urls, response_length=15000)     # Custom character limit

Use Case Examples

Research Paper Analysis

Build an AI-powered academic research assistant that extracts and analyzes research papers:
def analyze_research_paper(paper_url: str):
    response = valyu.contents([paper_url], 
    summary={
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "authors": { 
                "type": "array", 
                "items": {"type": "string"} 
            },
            "abstract": {"type": "string"},
            "key_contributions": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 5,
                "description": "Main contributions of the research"
            },
            "methodology": { 
                "type": "string",
                "description": "Research methodology and approach"
            },
            "results_summary": { 
                "type": "string",
                "description": "Summary of key findings and results"
            },
            "implications": {
                "type": "string",
                "description": "Broader implications and significance"
            },
            "citations_count": {"type": "number"},
            "publication_date": {"type": "string"}
        },
        "required": ["title", "abstract", "key_contributions", "methodology"]
    },
    response_length="max",
    extract_effort="high")

    if response.success and response.results and response.results[0].content:
        analysis = response.results[0].content
        
        print("=== Research Paper Analysis ===")
        print(f"Title: {analysis['title']}")
        print(f"Authors: {', '.join(analysis.get('authors', []))}")
        print(f"\nAbstract: {analysis['abstract']}")
        
        print("\nKey Contributions:")
        for i, contrib in enumerate(analysis.get('key_contributions', []), 1):
            print(f"{i}. {contrib}")
        
        print(f"\nMethodology: {analysis['methodology']}")
        print(f"\nResults: {analysis['results_summary']}")
        print(f"\nImplications: {analysis['implications']}")
        
        return analysis
    
    return None

# Usage
paper_analysis = analyze_research_paper(
    "https://arxiv.org/abs/2024.01234"
)

E-commerce Product Intelligence

Create a product research tool that extracts comprehensive product data:
def analyze_products(product_urls: List[str]):
    response = valyu.contents(product_urls, 
    summary={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "product_name": {"type": "string"},
                        "brand": {"type": "string"},
                        "price": {"type": "string"},
                        "original_price": {"type": "string"},
                        "discount_percentage": {"type": "string"},
                        "description": {"type": "string"},
                        "key_features": {
                            "type": "array",
                            "items": {"type": "string"},
                            "maxItems": 8
                        },
                        "specifications": {
                            "type": "object",
                            "description": "Technical specifications"
                        },
                        "customer_rating": {"type": "number"},
                        "review_count": {"type": "number"},
                        "availability": { 
                            "type": "string",
                            "enum": ["in_stock", "out_of_stock", "limited", "pre_order"]
                        },
                        "shipping_info": {"type": "string"},
                        "warranty_info": {"type": "string"}
                    },
                    "required": ["product_name", "price", "description"]
                }
            },
            "comparison_summary": {
                "type": "string",
                "description": "Overall comparison of the products"
            }
        }
    },
    extract_effort="high",
    response_length="large")

    if response.success and response.results and response.results[0].content:
        analysis = response.results[0].content
        
        print("=== Product Analysis ===")
        for i, product in enumerate(analysis.get('products', []), 1):
            print(f"\n{i}. {product['product_name']}")
            print(f"   Brand: {product['brand']}")
            print(f"   Price: {product['price']}")
            print(f"   Rating: {product['customer_rating']}/5 ({product['review_count']} reviews)")
            print(f"   Availability: {product['availability']}")
            
            if product.get('key_features'):
                print("   Key Features:")
                for feature in product['key_features']:
                    print(f"     • {feature}")
        
        print(f"\n=== Comparison Summary ===")
        print(analysis['comparison_summary'])
        
        return analysis
    
    return None

# Usage
product_comparison = analyze_products([
    "https://amazon.com/product1",
    "https://bestbuy.com/product2",
    "https://target.com/product3"
])

Technical Documentation Processor

Build a documentation analysis tool that extracts API information and technical details:
def process_documentation(doc_urls: List[str]):
    response = valyu.contents(doc_urls, 
    summary={
        "type": "object",
        "properties": {
            "documentation_overview": {
                "type": "string",
                "description": "Overview of what the documentation covers"
            },
            "api_endpoints": {
                "type": "array",
                "items": {
                    "type": "object", 
                    "properties": {
                        "method": {"type": "string"},
                        "path": {"type": "string"},
                        "description": {"type": "string"},
                        "parameters": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "name": {"type": "string"},
                                    "type": {"type": "string"},
                                    "required": {"type": "boolean"},
                                    "description": {"type": "string"}
                                }
                            }
                        },
                        "response_format": {"type": "string"}
                    }
                }
            },
            "authentication": {
                "type": "object",
                "properties": {
                    "method": {"type": "string"},
                    "description": {"type": "string"},
                    "example": {"type": "string"}
                }
            },
            "rate_limits": {"type": "string"},
            "code_examples": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "language": {"type": "string"},
                        "example": {"type": "string"},
                        "description": {"type": "string"}
                    }
                }
            },
            "common_errors": {
                "type": "array",
                "items": {"type": "string"}
            }
        },
        "required": ["documentation_overview", "api_endpoints", "authentication"]
    },
    extract_effort="high",
    response_length="large")

    if response.success and response.results and response.results[0].content:
        docs = response.results[0].content
        
        print("=== API Documentation Analysis ===")
        print(f"\nOverview: {docs['documentation_overview']}")
        
        print("\n=== Authentication ===")
        auth = docs.get('authentication', {})
        print(f"Method: {auth.get('method')}")
        print(f"Description: {auth.get('description')}")
        
        print("\n=== API Endpoints ===")
        for i, endpoint in enumerate(docs.get('api_endpoints', []), 1):
            print(f"\n{i}. {endpoint['method']} {endpoint['path']}")
            print(f"   Description: {endpoint['description']}")
            
            if endpoint.get('parameters'):
                print("   Parameters:")
                for param in endpoint['parameters']:
                    required = "(required)" if param['required'] else "(optional)"
                    print(f"     • {param['name']} ({param['type']}) {required}: {param['description']}")
        
        if docs.get('rate_limits'):
            print(f"\n=== Rate Limits ===")
            print(docs['rate_limits'])
        
        return docs
    
    return None

# Usage
api_docs = process_documentation([
    "https://docs.example.com/api-reference",
    "https://developers.service.com/guide"
])

Async Processing

For large-scale extraction (11-50 URLs) or non-blocking workflows, use async mode.
Async mode is required when submitting more than 10 URLs and supports up to 50 URLs per request. URLs are processed in batches of 5 with a 120s timeout per URL (versus 25s in sync mode), and jobs expire after 7 days.
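
Given the figures above, you can estimate a worst-case job duration before submitting. This back-of-envelope sketch assumes batches run sequentially and each batch is bounded by the per-URL timeout — consistent with the numbers quoted, but not a behavior the API guarantees:

```python
import math

BATCH_SIZE = 5
PER_URL_TIMEOUT_S = 120  # async timeout; sync is 25s

def worst_case_seconds(url_count: int) -> int:
    """Upper bound: sequential batches, each bounded by the per-URL timeout."""
    batches = math.ceil(url_count / BATCH_SIZE)
    return batches * PER_URL_TIMEOUT_S

# 50 URLs -> 10 batches -> at most ~20 minutes
print(worst_case_seconds(50))
```

This is useful for picking a sensible max_wait_time when polling.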

Submit and wait

The simplest approach — pass wait=True to block until the job completes:
result = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2", ...],
    async_mode=True,
    wait=True,              # blocks until job completes
    poll_interval=5,        # seconds between polls (default: 5)
    max_wait_time=3600,     # max seconds to wait (default: 3600)
)

for r in result["results"]:
    print(f"{r['title']}: {r['length']} characters")
print(f"Total cost: ${result['actual_cost_dollars']}")

Submit, then wait separately

For more control, submit first, then call wait_for_contents_job() with a progress callback:
# Submit — returns immediately
job = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2", ...],
    async_mode=True,
    webhook_url="https://your-app.com/webhooks/valyu",  # optional
)
print(f"Job ID: {job['job_id']}")

# Store the webhook_secret immediately — it is ONLY returned here
if job.get("webhook_secret"):
    save_webhook_secret(job["job_id"], job["webhook_secret"])

# Wait with progress tracking
result = valyu.wait_for_contents_job(
    job["job_id"],
    poll_interval=5,
    max_wait_time=3600,
    on_progress=lambda s: print(f"  {s['status']} — batch {s.get('current_batch', '?')}/{s.get('total_batches', '?')}"),
)

if result["status"] in ("completed", "partial"):
    for r in result["results"]:
        print(f"{r['title']}: {r['length']} characters")

Manual polling

If you prefer full control over the polling loop:
import time

while True:
    status = valyu.get_contents_job(job["job_id"])
    print(f"Status: {status['status']}")

    if status["status"] in ("completed", "partial", "failed"):
        break
    time.sleep(2)

Async parameters

Parameter | Type | Description | Default
async_mode | bool | Process URLs asynchronously. Required for more than 10 URLs. | False
webhook_url | str | HTTPS URL to receive results via webhook POST. | None
wait | bool | Block until the job completes (SDK handles polling). | False
poll_interval | int | Seconds between polls when wait=True or using wait_for_contents_job. | 5
max_wait_time | int | Max seconds to wait before timing out. | 3600
on_progress | Callable | Callback invoked on each poll with the current status dict. | None

Webhook verification

Webhooks are signed using HMAC-SHA256 with format "{timestamp}.{json_body}". See the Content Extraction guide for full verification examples.
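
Based on the signing format above, verification might look like the following sketch. How the signature and timestamp are delivered (e.g. header names) is not specified here, so the function takes them as plain arguments — consult the Content Extraction guide for the exact transport details:

```python
import hashlib
import hmac

def verify_webhook(secret: str, timestamp: str, body: str, signature: str) -> bool:
    """Recompute HMAC-SHA256 over "{timestamp}.{json_body}" and compare."""
    message = f"{timestamp}.{body}".encode()
    expected = hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, signature)
```

Use the webhook_secret stored at job submission time as the key; compare against the raw request body, not a re-serialized copy.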

Async response types

# Initial response (HTTP 202)
class ContentsAsyncResponse:
    success: bool
    job_id: str
    status: str               # Always "pending"
    urls_total: int
    poll_url: str
    tx_id: str
    webhook_secret: Optional[str]  # ONLY returned here — store immediately

# Job status response (polling / wait result)
class ContentsJobResponse:
    success: bool
    job_id: str
    status: str               # "pending" | "processing" | "completed" | "partial" | "failed"
    urls_total: int
    urls_processed: int
    urls_failed: int
    created_at: int           # Milliseconds since epoch
    updated_at: int
    current_batch: Optional[int]           # Present during "processing"
    total_batches: Optional[int]           # Present during "processing"
    results: Optional[List[ContentsResult]]  # Present when completed/partial
    actual_cost_dollars: Optional[float]     # Present when completed/partial
    error: Optional[str]                     # Present when partial/failed
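
Note that created_at and updated_at are milliseconds since epoch, so divide by 1000 before converting for display. A small illustrative helper:

```python
from datetime import datetime, timezone

def job_timestamp(ms: int) -> datetime:
    """Convert a job's millisecond epoch timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

print(job_timestamp(1700000000000).isoformat())
```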

Error Handling

response = valyu.contents(urls, **options)

if not response.success:
    print("Contents extraction failed:", response.error)
    raise SystemExit(1)  # stop here: no results to process

# Check for partial failures
if response.urls_failed and response.urls_failed > 0:
    print(f"{response.urls_failed} of {response.urls_requested} URLs failed")

# Process successful results
if response.results:
    for index, result in enumerate(response.results):
        print(f"Result {index + 1}:")
        print(f"  Title: {result.title}")
        print(f"  URL: {result.url}")
        print(f"  Length: {result.length} characters")
        
        if result.summary_success:
            print(f"  Summary: {result.content}")