Web Scraping
Totalum allows you to scrape web pages and extract structured data from them using the Totalum API or SDK. The web scraping service supports multiple output formats, anti-bot bypass, JavaScript rendering, presets for popular platforms, and AI-powered data extraction.
📚 Setup Required: For installation and usage of the Totalum SDK or API, see the Installation Guide.
Scrape a web page
Use Case:
Scrape any web page and get its content in a specific format (text, markdown, HTML, JSON, etc.). This is useful when you need to read the content of a website programmatically.
```javascript
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com',
  format: 'markdown' // 'raw' | 'json' | 'text' | 'markdown' | 'clean_html'
});

const pageContent = result.data.content;
console.log(pageContent);
```
Scrape with JavaScript rendering
Some websites load their content dynamically using JavaScript. Use `render_js` to render the page before scraping.
```javascript
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com/dynamic-page',
  format: 'text',
  render_js: true, // renders JavaScript before scraping
  wait_for_selector: '.main-content', // optional: wait until this CSS selector appears
  rendering_wait: 3000 // optional: wait 3 seconds after page load
});

const pageContent = result.data.content;
```
Scrape using a preset for popular platforms
Presets automatically configure the optimal scraping settings for popular websites. Available presets: `google`, `amazon`, `instagram`, `linkedin`, `twitter`, `youtube`, `ebay`, `walmart`.
```javascript
// Scrape an Amazon product page with the Amazon preset
const result = await totalumClient.scrapping.scrape({
  url: 'https://www.amazon.com/dp/B0EXAMPLE',
  preset: 'amazon',
  format: 'raw'
});

const pageContent = result.data.content;
```
Scrape with proxy
Use a proxy pool to avoid being blocked by the target website. Available pools: `public_datacenter_pool` (faster, cheaper) and `public_residential_pool` (harder to detect).
```javascript
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com',
  format: 'markdown',
  proxy_pool: 'public_residential_pool'
});
```
Scrape with custom headers and cookies
```javascript
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com',
  format: 'text',
  headers: {
    'Accept-Language': 'es-ES',
    'User-Agent': 'CustomAgent/1.0'
  },
  cookies: {
    'session_id': 'abc123',
    'consent': 'true'
  }
});
```
Scrape and extract data in one call
You can scrape a page and extract structured data from it in a single call by adding an `extraction_prompt` or `extraction_model`.
```javascript
// Using a custom prompt
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com/product/123',
  format: 'text',
  extraction_prompt: 'Extract the product name, price, and description as JSON: { "name": "", "price": 0, "description": "" }'
});

const extractedData = result.data.extracted_data;
console.log(extractedData); // { name: "Product X", price: 29.99, description: "..." }
```

```javascript
// Using a pre-built extraction model
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com/product/123',
  format: 'text',
  extraction_model: 'product' // automatically extracts product data
});

const extractedData = result.data.extracted_data;
```
All scrape options
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The URL of the page to scrape |
| `format` | string | No | Output format: `raw`, `json`, `text`, `markdown`, `clean_html`. Default: `raw` |
| `format_options` | string[] | No | Format modifiers: `no_links`, `no_images`, `only_content` |
| `preset` | string | No | Platform preset: `google`, `amazon`, `instagram`, `linkedin`, `twitter`, `youtube`, `ebay`, `walmart`, `generic` |
| `render_js` | boolean | No | Render JavaScript before scraping |
| `rendering_wait` | number | No | Milliseconds to wait after page load (when `render_js` is true) |
| `wait_for_selector` | string | No | CSS selector to wait for before scraping |
| `auto_scroll` | boolean | No | Automatically scroll the page to load lazy content |
| `js` | string | No | Custom JavaScript to execute on the page |
| `proxy_pool` | string | No | Proxy pool: `public_datacenter_pool` or `public_residential_pool` |
| `asp` | boolean | No | Enable anti-scraping protection bypass |
| `country` | string | No | Country code for geo-targeting (e.g., `US`, `ES`) |
| `headers` | object | No | Custom HTTP headers |
| `cookies` | object | No | Custom cookies |
| `method` | string | No | HTTP method: `GET`, `POST`, `PUT`, `DELETE`. Default: `GET` |
| `body` | string | No | HTTP request body (for POST/PUT requests) |
| `cache` | boolean | No | Enable response caching |
| `cache_ttl` | number | No | Cache time-to-live in seconds |
| `timeout` | number | No | Request timeout in milliseconds |
| `extraction_model` | string | No | Pre-built AI extraction model (see extraction models table below) |
| `extraction_prompt` | string | No | Custom AI extraction prompt |
| `extraction_template` | string | No | Custom extraction template |
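Several of these options accept only a fixed set of values. As a sketch, you could validate `format_options` before handing the request to the SDK; the `buildScrapeRequest` helper below is hypothetical and not part of the Totalum SDK:

```javascript
// Hypothetical helper (not part of the Totalum SDK): validate format_options
// against the documented modifiers before building a scrape request object.
function buildScrapeRequest(url, format, formatOptions = []) {
  const allowed = ['no_links', 'no_images', 'only_content'];
  const invalid = formatOptions.filter((opt) => !allowed.includes(opt));
  if (invalid.length > 0) {
    throw new Error(`Unknown format option(s): ${invalid.join(', ')}`);
  }
  return { url, format, format_options: formatOptions };
}

const request = buildScrapeRequest('https://example.com/blog/post', 'markdown', [
  'no_links',
  'only_content'
]);
// Then pass it to the SDK:
// const result = await totalumClient.scrapping.scrape(request);
```

Catching an invalid modifier locally avoids spending credits on a request the API would reject or silently ignore.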
Scrape response
```javascript
const result = await totalumClient.scrapping.scrape({ url: 'https://example.com', format: 'text' });

// result.data contains:
{
  url: 'https://example.com', // final URL (after redirects)
  status_code: 200, // HTTP status code
  content: '...', // scraped content in the requested format
  content_type: 'text/html', // content type of the page
  headers: { ... }, // response headers
  format: 'text', // format used
  extracted_data: { ... }, // only present if extraction was requested
}
```
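Since `status_code` reflects the target page's HTTP status, it is worth checking before treating `content` as valid. A minimal sketch (the `assertScrapeOk` helper is hypothetical, not part of the SDK):

```javascript
// Hypothetical guard (not part of the Totalum SDK): fail fast on non-2xx
// scrapes instead of parsing an error page as real content.
function assertScrapeOk(responseData) {
  const { status_code: status, url } = responseData;
  if (status < 200 || status >= 300) {
    throw new Error(`Scrape of ${url} returned HTTP ${status}`);
  }
  return responseData.content;
}

// Example with a mocked response matching the shape documented above:
const mock = { url: 'https://example.com', status_code: 200, content: 'Hello', format: 'text' };
const content = assertScrapeOk(mock);
// In real code: const content = assertScrapeOk(result.data);
```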
Extract structured data from a page
Use Case:
Extract structured data from a URL or from HTML/markdown content you already have. This is useful when you need specific data points from a page (e.g., product info, article content, contact details) instead of the full raw content.
The key difference between `scrape` and `extract` is that `extract` uses AI to parse and structure the data, returning clean JSON instead of raw page content.
Extract from a URL with a custom prompt
```javascript
const result = await totalumClient.scrapping.extract({
  url: 'https://example.com/contact',
  extraction_prompt: 'Extract all contact information as JSON: { "emails": [], "phones": [], "address": "" }'
});

const contactInfo = result.data.data;
console.log(contactInfo); // { emails: ["[email protected]"], phones: ["+34 123 456 789"], address: "..." }
```
Extract from a URL with a pre-built model
Instead of writing a custom prompt, you can use a pre-built extraction model that automatically extracts the right data based on the page type.
```javascript
const result = await totalumClient.scrapping.extract({
  url: 'https://example.com/product/123',
  extraction_model: 'product' // automatically extracts product data (name, price, description, images, etc.)
});

const productData = result.data.data;
console.log(productData);
```
Extract from HTML content you already have
If you already have the HTML or markdown content (e.g., from a previous scrape or from your own data), you can extract data from it directly without making a new HTTP request.
```javascript
const htmlContent = '<html><body><h1>John Doe</h1><p>Age: 30</p><p>City: Barcelona</p></body></html>';

const result = await totalumClient.scrapping.extract({
  content: htmlContent,
  content_type: 'text/html',
  extraction_prompt: 'Extract name, age, and city as JSON: { "name": "", "age": 0, "city": "" }'
});

const personData = result.data.data;
console.log(personData); // { name: "John Doe", age: 30, city: "Barcelona" }
```
Extract from a URL with custom scrape settings
You can customize how the page is scraped before extraction by passing `scrape_config`.
```javascript
const result = await totalumClient.scrapping.extract({
  url: 'https://example.com/dynamic-page',
  scrape_config: {
    render_js: true,
    wait_for_selector: '.product-details',
    proxy_pool: 'public_residential_pool'
  },
  extraction_prompt: 'Extract the product name and price as JSON: { "name": "", "price": 0 }'
});

const productData = result.data.data;
```
All extract options
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | No* | The URL to scrape and extract from |
| `content` | string | No* | HTML or markdown content to extract from directly |
| `content_type` | string | No | Content type of the provided content: `text/html` or `text/markdown` |
| `scrape_config` | object | No | Custom scrape settings when extracting from a URL (same options as `scrape` except `url`) |
| `extraction_model` | string | No** | Pre-built AI extraction model (see extraction models table below) |
| `extraction_prompt` | string | No** | Custom AI extraction prompt |
| `extraction_template` | string | No** | Custom extraction template |

\* You must provide either `url` or `content`, but not both.
\*\* You must provide at least one of `extraction_model`, `extraction_prompt`, or `extraction_template`.
Extract response
```javascript
const result = await totalumClient.scrapping.extract({ url: 'https://example.com', extraction_prompt: '...' });

// result.data contains:
{
  content_type: 'application/json', // type of extracted data
  data: { ... }, // the extracted structured data
  data_quality: { // quality assessment of the extraction
    errors: [], // any extraction errors
    fulfilled: true, // whether the extraction was successful
    fulfillment_percent: 100 // percentage of requested data extracted
  },
  url: 'https://example.com', // source URL (if extracted from URL)
}
```
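The `data_quality` object lets you gate downstream processing on extraction completeness. As a sketch (the `acceptExtraction` helper and the 80% threshold are hypothetical choices, not part of the SDK):

```javascript
// Hypothetical quality gate (not part of the Totalum SDK): only accept
// extractions the API marked as fulfilled above a completeness threshold.
function acceptExtraction(responseData, minPercent = 80) {
  const quality = responseData.data_quality;
  if (!quality || !quality.fulfilled || quality.fulfillment_percent < minPercent) {
    return null; // caller can retry with render_js, a proxy pool, or a refined prompt
  }
  return responseData.data;
}

// Mocked responses matching the documented shape:
const good = acceptExtraction({
  data: { name: 'Product X', price: 29.99 },
  data_quality: { errors: [], fulfilled: true, fulfillment_percent: 100 }
});
const bad = acceptExtraction({
  data: { name: 'Product X' },
  data_quality: { errors: ['price missing'], fulfilled: false, fulfillment_percent: 50 }
});
```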
Available extraction models
Both `scrape` and `extract` support these pre-built models via the `extraction_model` parameter. They automatically extract structured data without needing a custom prompt.
| Model | Description |
|---|---|
| `product` | Single product page (name, price, description, images, etc.) |
| `product_listing` | Product listing/category page (list of products) |
| `article` | News article or blog post (title, author, date, content) |
| `review_list` | List of reviews (reviewer, rating, text, date) |
| `real_estate_property` | Single real estate listing |
| `real_estate_property_listing` | List of real estate properties |
| `job_posting` | Single job posting (title, company, location, salary) |
| `job_listing` | List of job postings |
| `hotel` | Single hotel page |
| `hotel_listing` | List of hotels |
| `event` | Event page (name, date, location, description) |
| `food_recipe` | Recipe (ingredients, steps, cooking time) |
| `organization` | Organization/company page |
| `social_media_post` | Social media post |
| `search_engine_results` | Search engine results page |
| `software` | Software/app page |
| `stock` | Stock/financial data |
| `vehicle_ad` | Vehicle advertisement |
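When processing mixed batches of URLs, you may want to pick a model from the URL itself. The heuristic below is a hypothetical sketch (the keyword rules are assumptions, not SDK behavior), falling back to a custom prompt when nothing matches:

```javascript
// Hypothetical heuristic (not part of the Totalum SDK): guess an
// extraction_model from common URL path keywords.
function guessExtractionModel(url) {
  const path = new URL(url).pathname.toLowerCase();
  const rules = [
    [/\/(product|dp|item)\//, 'product'],
    [/\/(article|blog|news)\//, 'article'],
    [/\/(job|career)s?\//, 'job_posting'],
    [/\/recipes?\//, 'food_recipe']
  ];
  for (const [pattern, model] of rules) {
    if (pattern.test(path)) return model;
  }
  return null; // no match: supply extraction_prompt instead
}

const model = guessExtractionModel('https://shop.example.com/product/123');
// if (model) { await totalumClient.scrapping.extract({ url, extraction_model: model }); }
```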
Important notes
- Credits: Each scrape or extract call consumes credits from your account. The cost depends on the complexity of the request (JavaScript rendering and proxy usage cost more).
- Anti-bot protection: For websites with strong anti-bot protection, enable `asp: true` and use `proxy_pool: 'public_residential_pool'` for best results.
- JavaScript rendering: Only enable `render_js: true` when necessary, as it increases cost and response time.
- Format choice: Use `markdown` or `text` for clean content. Use `raw` when you need the original HTML. Use `clean_html` for simplified HTML without scripts and styles.
- Extraction models vs prompts: Pre-built `extraction_model` values are optimized for their specific content type and generally produce better results than custom prompts for supported page types.
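Putting the cost notes together: start from the cheapest request and only escalate the expensive options when a site actually needs them. A hypothetical sketch (the `hardenedScrapeOptions` helper is not part of the SDK):

```javascript
// Hypothetical helper (not part of the Totalum SDK): build options for a
// heavily protected site. Each enabled option adds cost, so only turn on
// JavaScript rendering when the page is dynamic.
function hardenedScrapeOptions(url, { dynamic = false } = {}) {
  return {
    url,
    format: 'markdown',
    asp: true, // anti-scraping protection bypass
    proxy_pool: 'public_residential_pool', // harder to detect than datacenter IPs
    ...(dynamic ? { render_js: true, rendering_wait: 3000 } : {})
  };
}

const opts = hardenedScrapeOptions('https://example.com/protected', { dynamic: true });
// Then: const result = await totalumClient.scrapping.scrape(opts);
```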