Contents API Reference

Overview

Endpoint: POST https://api.exa.ai/contents Auth: Pass your API key via the x-api-key header. Get one at https://dashboard.exa.ai/api-keys The Contents API extracts clean, LLM-ready content from any URL. It handles JavaScript-rendered pages, PDFs, and complex layouts. Returns full text, highlights, summaries, or any combination.

Installation

pip install exa-py    # Python
npm install exa-js    # JavaScript

Minimal Working Example

curl -X POST "https://api.exa.ai/contents" \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{"urls": ["https://example.com"], "text": true}'

from exa_py import Exa
exa = Exa(api_key="YOUR_API_KEY")
result = exa.get_contents(["https://example.com"], text=True)

import Exa from "exa-js";
const exa = new Exa("YOUR_API_KEY");
const result = await exa.getContents(["https://example.com"], { text: true });

Request Parameters

Parameter	Type	Default	Description
`urls`	string[]	(required)	Array of URLs to extract content from. Also accepts `ids` (document IDs from search results).
`text`	boolean or object	—	Return full page text as markdown. Object form: `{maxCharacters, includeHtmlTags, verbosity, includeSections, excludeSections}`.
`highlights`	boolean or object	—	Return key excerpts relevant to a query. Pass `true` for the highest-quality default. Object form: `{query, maxCharacters}` — use `query` to guide which highlights are returned, `maxCharacters` to cap the budget.
`summary`	boolean or object	—	Return LLM-generated summary. Object form: `{query, schema}`.
`maxAgeHours`	integer	—	Max age of cached content in hours. `0` = always livecrawl. `-1` = never livecrawl. Omit for default (livecrawl as fallback).
`livecrawlTimeout`	integer	`10000`	Timeout for livecrawling in milliseconds. Recommended: 10000-15000.
`subpages`	integer	`0`	Number of subpages to crawl from each URL.
`subpageTarget`	string or string[]	—	Keywords to prioritize when selecting subpages.
`extras.links`	integer	`0`	Number of URLs to extract from each page.
`extras.imageLinks`	integer	`0`	Number of image URLs to extract from each page.
`compliance`	string	—	Enterprise-only compliance mode. Set to `"hipaa"` for HIPAA mode. Uses cache-only retrieval; summaries and livecrawl are not supported.

Text Object Options

Parameter	Type	Default	Description
`maxCharacters`	integer	—	Character limit for returned text.
`includeHtmlTags`	boolean	`false`	Preserve HTML tags in output.
`verbosity`	string	`"compact"`	`compact`, `standard`, or `full`. Should use `maxAgeHours: 0` for fresh content.
`includeSections`	string[]	—	Only include these page sections: `header`, `navigation`, `banner`, `body`, `sidebar`, `footer`, `metadata`. Should use `maxAgeHours: 0` for fresh content.
`excludeSections`	string[]	—	Exclude these page sections. Same options as above. Should use `maxAgeHours: 0` for fresh content.

Highlights Object Options

Prefer highlights: true for the highest-quality default. Only supply this object when you specifically need to guide selection with a custom query or cap output size.

Parameter	Type	Default	Description
`query`	string	—	Custom query that guides which highlights the LLM picks.
`maxCharacters`	integer	—	Cap on total highlight characters per URL. Omit unless you have a specific budget — leaving it unset gives the highest-quality default.

Summary Object Options

Parameter	Type	Default	Description
`query`	string	—	Custom query for the summary.
`schema`	object	—	JSON Schema (Draft 7) for structured summary output.

Content Modes

Text — Full page content as clean markdown. Best for deep analysis.

{"urls": ["https://example.com"], "text": {"maxCharacters": 8000}}

Highlights — Extractive key excerpts from the page. Best for agent workflows (10x fewer tokens). These are pulled directly from the source, not generated.

{"urls": ["https://example.com"], "highlights": {"query": "key findings"}}

Summary — LLM-generated abstract. Supports JSON schema for structured extraction.

{
  "urls": ["https://example.com"],
  "summary": {
    "query": "Extract company information",
    "schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "industry": {"type": "string"}
      },
      "required": ["name", "industry"]
    }
  }
}

You can combine all three in a single request.

Content Freshness

`maxAgeHours` value	Behavior
Omit (default)	Livecrawl only when no cached content exists. Recommended.
Positive (e.g. `24`)	Use cache if less than N hours old, otherwise livecrawl.
`0`	Always livecrawl, never use cache. Increases latency.
`-1`	Never livecrawl, cache only. Maximum speed.

When using maxAgeHours, pair with livecrawlTimeout (10000-15000ms recommended).

Subpage Crawling

Automatically discover and extract content from linked pages within a site.

{
  "urls": ["https://docs.example.com"],
  "subpages": 10,
  "subpageTarget": ["api", "reference", "guide"],
  "text": {"maxCharacters": 5000}
}

subpages: Max subpages to crawl per URL.
subpageTarget: Keywords to prioritize when selecting which subpages to crawl.
Start small (5-10) and increase if needed.

Response Schema

{
  "requestId": "e492118ccdedcba5088bfc4357a8a125",
  "results": [
    {
      "title": "Page Title",
      "url": "https://example.com/page",
      "id": "https://example.com/page",
      "publishedDate": "2024-01-15T00:00:00.000Z",
      "author": "Author Name",
      "image": "https://example.com/image.png",
      "favicon": "https://example.com/favicon.ico",
      "text": "Full page content as markdown...",
      "highlights": ["Key excerpt from the page..."],
      "highlightScores": [0.46],
      "summary": "LLM-generated summary...",
      "subpages": [],
      "extras": {
        "links": ["https://example.com/related"]
      }
    }
  ],
  "statuses": [
    {
      "id": "https://example.com/page",
      "status": "success"
    }
  ],
  "costDollars": {
    "total": 0.003
  }
}

Response Fields

Field	Type	Description
`requestId`	string	Unique request identifier.
`results`	array	List of result objects with extracted content.
`results[].title`	string	Page title.
`results[].url`	string	Page URL.
`results[].id`	string	Document ID (same as URL).
`results[].publishedDate`	string or null	Estimated publication date.
`results[].author`	string or null	Author if available.
`results[].text`	string	Full page text (if `text` requested).
`results[].highlights`	string[]	Key excerpts (if `highlights` requested).
`results[].highlightScores`	float[]	Cosine similarity scores for each highlight.
`results[].summary`	string	LLM summary (if `summary` requested).
`results[].subpages`	array	Nested results from subpage crawling. Same shape as results.
`results[].extras.links`	string[]	Extracted links from the page.
`statuses`	array	Per-URL status information. Always check this for errors.
`statuses[].id`	string	The URL that was requested.
`statuses[].status`	string	`"success"` or `"error"`.
`statuses[].error.tag`	string	Error type (see Error Handling).
`statuses[].error.httpStatusCode`	integer or null	Corresponding HTTP status code.
`costDollars.total`	float	Total dollar cost for the request.

Error Handling

The endpoint returns HTTP 200 even when individual URLs fail. Per-URL errors appear in the statuses array.

Per-URL Error Tags

Tag	HTTP Code	Meaning
`CRAWL_NOT_FOUND`	404	Content not found.
`CRAWL_TIMEOUT`	504	Crawl timed out fetching content.
`CRAWL_LIVECRAWL_TIMEOUT`	504	Livecrawl exceeded `livecrawlTimeout`.
`SOURCE_NOT_AVAILABLE`	403	Access forbidden.
`UNSUPPORTED_URL`	—	URL type not supported.
`CRAWL_UNKNOWN_ERROR`	500+	Other errors.

Request-Level Errors

HTTP Status	Meaning
400	Bad request — invalid parameters.
401	Invalid or missing API key.
422	Validation error.
429	Rate limit exceeded.

Always check statuses to handle per-URL failures:

result = exa.get_contents(["https://example.com", "https://example.com/maybe-broken"])
for status in result.statuses:
    if status.status == "error":
        print(f"Failed: {status.id} - {status.error.tag}")

Common Mistakes

LLMs frequently generate these incorrect parameters. Do NOT use any of the following:

Wrong	Correct
`useAutoprompt: true`	Remove it. `useAutoprompt` does not exist on the `/contents` endpoint.
`numSentences`	Remove it. This highlights parameter is deprecated. Use `highlights: true` instead.
`highlightsPerUrl`	Remove it. This highlights parameter is deprecated. Use `highlights: true` instead.
`livecrawl: "always"`	Use `maxAgeHours: 0` instead. The `livecrawl` parameter is deprecated.
`tokensNum`	Remove it. This parameter does not exist. Use `text.maxCharacters` to limit text length.
`stream: true`	Remove it. The `/contents` endpoint does not support streaming.
`contents: { text: ... }`	On `/contents`, `text`, `highlights`, and `summary` are top-level — do NOT wrap them in a `contents` object. This is different from `/search`.

Remember: On the /contents endpoint, text, highlights, and summary are top-level parameters. Do NOT nest them inside a contents object (that nesting is only for the /search endpoint).

Patterns and Gotchas

Always check statuses. The endpoint returns 200 even when individual URLs fail. Unchecked, you’ll silently miss failed URLs.
Use highlights over text for agent workflows. Highlights are 10x more token-efficient and return the most relevant excerpts.
Set livecrawlTimeout when using maxAgeHours. Default is 10000ms. For slow sites, use 12000-15000ms.
subpageTarget focuses crawling. Without it, subpage selection is best-effort. Use specific terms like ["api", "docs"].
Python SDK uses snake_case. subpageTarget → subpage_target, maxAgeHours → max_age_hours, maxCharacters → max_characters.
urls and ids are interchangeable. Both accept URL strings. ids exists for backward compatibility with document IDs from search results.
Combine modes freely. Request text, highlights, and summary in the same call for different views of the same content.

Complete Examples

Basic text extraction

{
  "urls": ["https://arxiv.org/abs/2301.07041"],
  "text": true
}

Highlights with custom query

{
  "urls": ["https://example.com/research-paper"],
  "highlights": {
    "query": "methodology and results"
  }
}

Documentation crawling

{
  "urls": ["https://platform.openai.com/docs"],
  "subpages": 15,
  "subpageTarget": ["api", "models", "embeddings"],
  "maxAgeHours": 24,
  "livecrawlTimeout": 15000,
  "text": {"maxCharacters": 5000}
}

Structured company extraction

{
  "urls": ["https://stripe.com"],
  "subpages": 8,
  "subpageTarget": ["about", "careers", "press", "blog"],
  "summary": {
    "query": "Company overview, culture, and recent news",
    "schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "industry": {"type": "string"},
        "employee_count": {"type": "string"},
        "recent_news": {"type": "array", "items": {"type": "string"}}
      },
      "required": ["name", "industry"]
    }
  }
}

Documentation Index

​Overview

​Installation

​Minimal Working Example

​Request Parameters

​Text Object Options

​Highlights Object Options

​Summary Object Options

​Content Modes

​Content Freshness

​Subpage Crawling

​Response Schema

​Response Fields

​Error Handling

​Per-URL Error Tags

​Request-Level Errors

​Common Mistakes

​Patterns and Gotchas

​Complete Examples

​Basic text extraction

​Highlights with custom query

​Documentation crawling

​Structured company extraction