Turn any web page into clean, structured data. The Contents API extracts content from URLs with batch processing, AI-powered summaries, and structured outputs.

What You Can Do

  • Feed your AI - Clean data without noise
  • Aggregate content - Extract structured data from multiple sources
  • Transform content - Convert web pages into usable formats
  • Automate research - Pull key information from articles, papers, and reports

Features

Batch Processing

Submit up to 10 URLs per request.
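
For URL lists longer than the per-request limit, a small chunking helper keeps each call within bounds. This is a minimal sketch; `chunked` is a hypothetical helper, not part of the SDK:

```python
def chunked(urls, size=10):
    """Split a URL list into batches no larger than the per-request limit."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

urls = [f"https://example.com/page-{n}" for n in range(23)]
batches = chunked(urls)
# 23 URLs -> batches of 10, 10, and 3
```

Each batch can then be passed to `valyu.contents(urls=batch, ...)` in turn.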

AI-Powered Structuring

Use JSON schemas to extract specific data points.

Smart Summarisation

Generate tailored summaries with custom instructions.

Pay-per-Success

Only pay for URLs that are successfully processed.

Getting Started

Basic Extraction

from valyu import Valyu

valyu = Valyu()  # Uses VALYU_API_KEY from env

data = valyu.contents(
    urls=[
        "https://techcrunch.com/category/artificial-intelligence/",
    ],
    response_length="medium",
    extract_effort="auto",
)
print(data["results"][0]["content"][:500])
Returns clean markdown content for each URL.

Response Length

| Length         | Characters      | Use for                            |
|----------------|-----------------|------------------------------------|
| short          | 25,000          | Summaries, key points              |
| medium         | 50,000          | Articles, blog posts               |
| large          | 100,000         | Academic papers, long-form content |
| max            | Unlimited       | Full document extraction           |
| Custom integer | 1,000-1,000,000 | Specific requirements              |
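
When accepting custom lengths from callers, it can help to validate them against the documented 1,000-1,000,000 range before making a request. A minimal sketch (`validate_response_length` is a hypothetical helper; clamping rather than raising on out-of-range integers is a design choice, not API behaviour):

```python
def validate_response_length(value):
    """Accept a named preset, or clamp a custom integer to the documented range."""
    presets = {"short", "medium", "large", "max"}
    if isinstance(value, str):
        if value not in presets:
            raise ValueError(f"unknown preset: {value!r}")
        return value
    # Custom integers are documented as 1,000-1,000,000 characters
    return max(1_000, min(1_000_000, int(value)))
```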

Extract Effort

| Effort | Description                          |
|--------|--------------------------------------|
| normal | Standard speed and quality (default) |
| high   | Better quality, slower               |
| auto   | Automatically chooses the right level |

Advanced Features

Summary Options

The summary field accepts four types of values:

No AI Processing (false)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="normal",
    summary=False,
)
print(data["results"][0]["content"][:300])

Basic Summary (true)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="auto",
    summary=True,
)
print(data["results"][0]["content"])

Custom Instructions (string)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/research-paper"],
    extract_effort="auto",
    summary="Summarise the methodology, key findings, and practical applications in 2-3 paragraphs",
)
print(data["results"][0]["content"])

Structured Extraction (object)

from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/product-page"],
    extract_effort="auto",
    summary={
        "type": "object",
        "properties": {
            "product_name": {"type": "string", "description": "Name of the product"},
            "price": {"type": "number", "description": "Product price in USD"},
            "features": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 5,
                "description": "Key product features",
            },
            "availability": {
                "type": "string",
                "enum": ["in_stock", "out_of_stock", "preorder"],
                "description": "Product availability status",
            },
        },
        "required": ["product_name", "price"],
    },
)
print(data["results"][0]["content"])
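
Because extraction quality varies by page, it is worth checking that the returned object actually carries the schema's required keys before using it downstream. A minimal sketch over a sample result (the sample dict stands in for `data["results"][0]["content"]`):

```python
required = ["product_name", "price"]

# Stand-in for a structured result's content field
sample_content = {"product_name": "Widget Pro", "price": 49.99, "features": []}

missing = [key for key in required if key not in sample_content]
if missing:
    raise ValueError(f"structured extraction missing required fields: {missing}")
```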

JSON Schema Reference

For structured extraction, you can use any valid JSON Schema. See the JSON Schema Type Reference for details. Limits:
  • 5,000 characters max
  • 3 levels deep max
  • 20 properties per object max
Common types:
  • string - Text with optional format validation
  • number / integer - Numbers with optional min/max
  • boolean - True/false
  • array - Lists with optional size limits
  • object - Nested structures
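
A pre-flight check against these limits can catch oversized schemas before the request is made. This sketch covers character count, nesting depth, and property count for `object` properties only (arrays of objects and other edge cases are ignored; `check_schema_limits` is a hypothetical helper):

```python
import json

def check_schema_limits(schema, max_chars=5_000, max_depth=3, max_props=20):
    """Pre-flight check of a JSON Schema against the documented API limits."""
    if len(json.dumps(schema)) > max_chars:
        return False

    def depth_ok(node, depth=1):
        if depth > max_depth:
            return False
        props = node.get("properties", {})
        if len(props) > max_props:
            return False
        return all(
            depth_ok(child, depth + 1)
            for child in props.values()
            if child.get("type") == "object"
        )

    return depth_ok(schema)

product_schema = {
    "type": "object",
    "properties": {"product_name": {"type": "string"}, "price": {"type": "number"}},
    "required": ["product_name", "price"],
}
ok = check_schema_limits(product_schema)
```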

Examples

News Aggregator

Extract structured article data:
{
  "urls": [
    "https://techcrunch.com/category/artificial-intelligence/",
    "https://venturebeat.com/category/entrepreneur/",
    "https://www.bbc.co.uk/news/technology"
  ],
  "extract_effort": "auto",
  "summary": {
    "type": "object",
    "properties": {
      "headline": { "type": "string" },
      "summary_text": { "type": "string" },
      "category": { "type": "string" },
      "tags": {
        "type": "array",
        "items": { "type": "string" },
        "maxItems": 5
      }
    },
    "required": ["headline"]
  }
}

Research Paper

Extract structured academic data:
{
  "urls": ["https://arxiv.org/paper/example"],
  "response_length": "max",
  "extract_effort": "high",
  "summary": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "abstract": { "type": "string" },
      "methodology": { "type": "string" },
      "key_findings": {
        "type": "array",
        "items": { "type": "string" }
      },
      "limitations": { "type": "string" }
    },
    "required": ["title"]
  }
}

Product Info

Extract product data:
{
  "urls": ["https://company.com/product-A", "https://company.com/product-B"],
  "extract_effort": "auto",
  "summary": {
    "type": "object",
    "properties": {
      "product_name": { "type": "string" },
      "features": {
        "type": "array",
        "items": { "type": "string" }
      },
      "pricing": { "type": "string" },
      "target_audience": { "type": "string" }
    },
    "required": ["product_name"]
  }
}

Response Format

Raw Content (summary: false)

{
  "success": true,
  "error": null,
  "tx_id": "tx_12345678-1234-1234-1234-123456789abc",
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "url": "https://example.com/article?utm_source=valyu",
      "content": "# AI Breakthrough in Natural Language Processing\n\nPage content in markdown...",
      "description": "Latest AI developments",
      "source": "web",
      "price": 0.001,
      "length": 12840,
      "data_type": "unstructured",
      "source_type": "website",
      "publication_date": "2024-01-15",
      "id": "https://example.com/article"
    }
  ],
  "urls_requested": 1,
  "urls_processed": 1,
  "urls_failed": 0,
  "total_cost_dollars": 0.001,
  "total_characters": 12840
}
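
Given a response shaped like the JSON above, aggregating the per-URL fields is straightforward. A small sketch over a sample payload (field names taken from the example response; the second result is illustrative):

```python
response = {
    "success": True,
    "results": [
        {"url": "https://example.com/a", "price": 0.001, "length": 12840},
        {"url": "https://example.com/b", "price": 0.001, "length": 8120},
    ],
    "urls_failed": 0,
}

total_price = sum(r["price"] for r in response["results"])
total_chars = sum(r["length"] for r in response["results"])
```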

Summary (summary: true or string)

{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": "This article discusses a breakthrough in AI...",
      "summary_success": true,
      "price": 0.002,
      "data_type": "unstructured"
    }
  ]
}

Structured (JSON Schema)

{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": {
        "title": "AI Breakthrough in Natural Language Processing",
        "author": "John Doe",
        "category": "technology",
        "key_points": [
          "New AI model achieves 95% accuracy",
          "Reduces computational requirements by 40%"
        ]
      },
      "summary_success": true,
      "price": 0.002,
      "data_type": "structured"
    }
  ]
}

Response Fields

| Field           | Description                          |
|-----------------|--------------------------------------|
| content         | Extracted content (markdown or JSON) |
| data_type       | "unstructured" or "structured"       |
| summary_success | Whether AI processing succeeded      |
| price           | Cost for this URL                    |
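
Since `content` is markdown text for unstructured results and a JSON object for structured ones, downstream code can branch on `data_type`. A minimal sketch (`handle_result` is a hypothetical helper):

```python
def handle_result(result):
    """Route a result based on whether extraction returned markdown or JSON."""
    if result["data_type"] == "structured":
        return result["content"]            # already a dict, ready for processing
    return {"markdown": result["content"]}  # wrap raw markdown for uniform handling

structured = handle_result({"data_type": "structured", "content": {"title": "AI"}})
raw = handle_result({"data_type": "unstructured", "content": "# AI\n\nBody..."})
```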

Best Practices

Choosing Summary Type

  • false: Fastest and cheapest—no AI
  • true: Basic summary for overviews
  • "string": Custom instructions for specific needs
  • {object}: Structured extraction for data processing

JSON Schema Tips

  • Use clear descriptions to guide extraction
  • Use enums for consistent categorisation
  • Keep schemas under 3 levels deep
  • Mark essential fields as required

Batch Processing

  • Group similar content types together
  • Choose appropriate response length
  • Check summary_success for AI status
  • Track total_cost_dollars
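
Tracking `total_cost_dollars` across batches can be wrapped in a small accumulator. A sketch under the assumption that each batch response carries that field as shown above (`CostTracker` is a hypothetical helper, not part of the SDK):

```python
class CostTracker:
    """Accumulate spend across batch responses via total_cost_dollars."""

    def __init__(self, budget_dollars):
        self.budget = budget_dollars
        self.spent = 0.0

    def record(self, response):
        self.spent += response.get("total_cost_dollars", 0.0)

    def within_budget(self):
        return self.spent <= self.budget

tracker = CostTracker(budget_dollars=1.00)
tracker.record({"total_cost_dollars": 0.003})
tracker.record({"total_cost_dollars": 0.002})
```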

Error Handling

# Check for partial failures (HTTP 206)
if response.status_code == 206:
    body = response.json()
    successful_results = body["results"]  # successfully processed URLs only
    failed_count = body["urls_failed"]

# Check AI processing success
for result in results:
    if not result.get("summary_success", True):
        print(f"AI processing failed for {result['url']}")

# Handle complete failures (HTTP 422)
if response.status_code == 422:
    error_message = response.json()["error"]
Try the Contents API

Full API reference with interactive examples

Next Steps