Web Scraping

Totalum allows you to scrape web pages and extract structured data from them using the Totalum API or SDK. The web scraping service supports multiple output formats, anti-bot bypass, JavaScript rendering, presets for popular platforms, and AI-powered data extraction.


📚 Setup Required: For installation and usage of the Totalum SDK or API, see the Installation Guide.


Scrape a web page

Use Case:

Scrape any web page and get its content in a specific format (text, markdown, HTML, JSON, etc.). This is useful when you need to read the content of a website programmatically.

const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com',
  format: 'markdown' // 'raw' | 'json' | 'text' | 'markdown' | 'clean_html'
});

const pageContent = result.data.content;
console.log(pageContent);

Scrape with JavaScript rendering

Some websites load their content dynamically using JavaScript. Use render_js to render the page before scraping.

const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com/dynamic-page',
  format: 'text',
  render_js: true, // renders JavaScript before scraping
  wait_for_selector: '.main-content', // optional: wait until this CSS selector appears
  rendering_wait: 3000 // optional: wait 3 seconds after page load
});

const pageContent = result.data.content;

Scrape with a preset

Presets automatically configure the optimal scraping settings for popular websites. Available presets: google, amazon, instagram, linkedin, twitter, youtube, ebay, walmart.

// Scrape an Amazon product page with the Amazon preset
const result = await totalumClient.scrapping.scrape({
  url: 'https://www.amazon.com/dp/B0EXAMPLE',
  preset: 'amazon',
  format: 'raw'
});

const pageContent = result.data.content;

Scrape with proxy

Use a proxy pool to avoid being blocked by the target website. Available pools: public_datacenter_pool (faster, cheaper) and public_residential_pool (harder to detect).

const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com',
  format: 'markdown',
  proxy_pool: 'public_residential_pool'
});

Scrape with custom headers and cookies

const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com',
  format: 'text',
  headers: {
    'Accept-Language': 'es-ES',
    'User-Agent': 'CustomAgent/1.0'
  },
  cookies: {
    'session_id': 'abc123',
    'consent': 'true'
  }
});

Scrape and extract data in one call

You can scrape a page and extract structured data from it in a single call by adding an extraction_prompt or extraction_model.

// Using a custom prompt
const result = await totalumClient.scrapping.scrape({
url: 'https://example.com/product/123',
format: 'text',
extraction_prompt: 'Extract the product name, price, and description as JSON: { "name": "", "price": 0, "description": "" }'
});

const extractedData = result.data.extracted_data;
console.log(extractedData); // { name: "Product X", price: 29.99, description: "..." }
// Using a pre-built extraction model
const result = await totalumClient.scrapping.scrape({
url: 'https://example.com/product/123',
format: 'text',
extraction_model: 'product' // automatically extracts product data
});

const extractedData = result.data.extracted_data;

All scrape options

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The URL of the page to scrape |
| format | string | No | Output format: raw, json, text, markdown, clean_html. Default: raw |
| format_options | string[] | No | Format modifiers: no_links, no_images, only_content |
| preset | string | No | Platform preset: google, amazon, instagram, linkedin, twitter, youtube, ebay, walmart, generic |
| render_js | boolean | No | Render JavaScript before scraping |
| rendering_wait | number | No | Milliseconds to wait after page load (when render_js is true) |
| wait_for_selector | string | No | CSS selector to wait for before scraping |
| auto_scroll | boolean | No | Automatically scroll the page to load lazy content |
| js | string | No | Custom JavaScript to execute on the page |
| proxy_pool | string | No | Proxy pool: public_datacenter_pool or public_residential_pool |
| asp | boolean | No | Enable anti-scraping protection bypass |
| country | string | No | Country code for geo-targeting (e.g., US, ES) |
| headers | object | No | Custom HTTP headers |
| cookies | object | No | Custom cookies |
| method | string | No | HTTP method: GET, POST, PUT, DELETE. Default: GET |
| body | string | No | HTTP request body (for POST/PUT requests) |
| cache | boolean | No | Enable response caching |
| cache_ttl | number | No | Cache time-to-live in seconds |
| timeout | number | No | Request timeout in milliseconds |
| extraction_model | string | No | Pre-built AI extraction model (see extraction models table below) |
| extraction_prompt | string | No | Custom AI extraction prompt |
| extraction_template | string | No | Custom extraction template |
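
Many of these options can be combined in a single call. The sketch below is illustrative only (the URL and body values are placeholders, not from the Totalum docs): it builds an options object for a cached POST request that returns markdown stripped of links and images.

```javascript
// Illustrative combination of scrape options; URL and body are placeholders.
const scrapeOptions = {
  url: 'https://example.com/search',
  method: 'POST',
  body: JSON.stringify({ query: 'laptops' }), // request body for the POST
  format: 'markdown',
  format_options: ['no_links', 'no_images'],  // strip links and images from the output
  auto_scroll: true,                          // load lazy content before capturing
  country: 'US',                              // geo-target the request
  cache: true,
  cache_ttl: 3600,                            // cache the response for one hour
  timeout: 30000                              // fail after 30 seconds
};
```

Pass the object to totalumClient.scrapping.scrape(scrapeOptions) as in the examples above.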

Scrape response

const result = await totalumClient.scrapping.scrape({ url: 'https://example.com', format: 'text' });

// result.data contains:
{
  url: 'https://example.com', // final URL (after redirects)
  status_code: 200, // HTTP status code
  content: '...', // scraped content in the requested format
  content_type: 'text/html', // content type of the page
  headers: { ... }, // response headers
  format: 'text', // format used
  extracted_data: { ... }, // only present if extraction was requested
}
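
Since the response includes the final HTTP status code, a common pattern is to check it before using the content. A minimal sketch (the helper name is ours, not part of the SDK):

```javascript
// Hypothetical helper (not part of the SDK): throw if the scrape did not
// return a successful (2xx) HTTP status, otherwise return the content.
function assertScrapeOk(responseData) {
  if (responseData.status_code < 200 || responseData.status_code >= 300) {
    throw new Error(`Scrape of ${responseData.url} failed with status ${responseData.status_code}`);
  }
  return responseData.content;
}
```

Usage: const pageContent = assertScrapeOk(result.data);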

Extract structured data from a page

Use Case:

Extract structured data from a URL or from HTML/markdown content you already have. This is useful when you need specific data points from a page (e.g., product info, article content, contact details) instead of the full raw content.

The key difference between scrape and extract is that extract uses AI to parse and structure the data, returning clean JSON instead of raw page content.

Extract from a URL with a custom prompt

const result = await totalumClient.scrapping.extract({
  url: 'https://example.com/contact',
  extraction_prompt: 'Extract all contact information as JSON: { "emails": [], "phones": [], "address": "" }'
});

const contactInfo = result.data.data;
console.log(contactInfo); // { emails: ["[email protected]"], phones: ["+34 123 456 789"], address: "..." }

Extract from a URL with a pre-built model

Instead of writing a custom prompt, you can use a pre-built extraction model that automatically extracts the right data based on the page type.

const result = await totalumClient.scrapping.extract({
  url: 'https://example.com/product/123',
  extraction_model: 'product' // automatically extracts product data (name, price, description, images, etc.)
});

const productData = result.data.data;
console.log(productData);

Extract from HTML content you already have

If you already have the HTML or markdown content (e.g., from a previous scrape or from your own data), you can extract data from it directly without making a new HTTP request.

const htmlContent = '<html><body><h1>John Doe</h1><p>Age: 30</p><p>City: Barcelona</p></body></html>';

const result = await totalumClient.scrapping.extract({
  content: htmlContent,
  content_type: 'text/html',
  extraction_prompt: 'Extract name, age, and city as JSON: { "name": "", "age": 0, "city": "" }'
});

const personData = result.data.data;
console.log(personData); // { name: "John Doe", age: 30, city: "Barcelona" }

Extract from a URL with custom scrape settings

You can customize how the page is scraped before extraction by passing scrape_config.

const result = await totalumClient.scrapping.extract({
  url: 'https://example.com/dynamic-page',
  scrape_config: {
    render_js: true,
    wait_for_selector: '.product-details',
    proxy_pool: 'public_residential_pool'
  },
  extraction_prompt: 'Extract the product name and price as JSON: { "name": "", "price": 0 }'
});

const productData = result.data.data;

All extract options

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | No* | The URL to scrape and extract from |
| content | string | No* | HTML or markdown content to extract from directly |
| content_type | string | No | Content type of the provided content: text/html or text/markdown |
| scrape_config | object | No | Custom scrape settings when extracting from a URL (same options as scrape except url) |
| extraction_model | string | No** | Pre-built AI extraction model (see extraction models table below) |
| extraction_prompt | string | No** | Custom AI extraction prompt |
| extraction_template | string | No** | Custom extraction template |

* You must provide either url or content, but not both.

** You must provide at least one of extraction_model, extraction_prompt, or extraction_template.
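
These two rules can also be checked client-side before spending credits on an API call. A sketch of such a validator (our own helper, not part of the SDK):

```javascript
// Hypothetical client-side validator (not part of the SDK). Mirrors the two
// rules above: url XOR content, and at least one extraction_* field.
function validateExtractOptions(opts) {
  const errors = [];
  if (!opts.url && !opts.content) {
    errors.push('Provide either url or content');
  }
  if (opts.url && opts.content) {
    errors.push('Provide url or content, not both');
  }
  if (!opts.extraction_model && !opts.extraction_prompt && !opts.extraction_template) {
    errors.push('Provide at least one of extraction_model, extraction_prompt, extraction_template');
  }
  return errors;
}
```

An empty array means the options satisfy both footnote rules.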

Extract response

const result = await totalumClient.scrapping.extract({ url: 'https://example.com', extraction_prompt: '...' });

// result.data contains:
{
  content_type: 'application/json', // type of extracted data
  data: { ... }, // the extracted structured data
  data_quality: { // quality assessment of the extraction
    errors: [], // any extraction errors
    fulfilled: true, // whether the extraction was successful
    fulfillment_percent: 100 // percentage of requested data extracted
  },
  url: 'https://example.com', // source URL (if extracted from URL)
}
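
Because AI extraction can be partial, it is worth gating on data_quality before trusting result.data.data. A minimal sketch (the helper and the 80% threshold are our own choices, not part of the SDK):

```javascript
// Hypothetical helper (not part of the SDK): accept an extraction only if it
// succeeded, reported no errors, and met a minimum completeness threshold.
function isExtractionReliable(dataQuality, minPercent = 80) {
  return dataQuality.fulfilled &&
    dataQuality.errors.length === 0 &&
    dataQuality.fulfillment_percent >= minPercent;
}
```

Usage: if (isExtractionReliable(result.data.data_quality)) { /* use result.data.data */ }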

Available extraction models

Both scrape and extract support these pre-built models via the extraction_model parameter. They automatically extract structured data without needing a custom prompt.

| Model | Description |
| --- | --- |
| product | Single product page (name, price, description, images, etc.) |
| product_listing | Product listing/category page (list of products) |
| article | News article or blog post (title, author, date, content) |
| review_list | List of reviews (reviewer, rating, text, date) |
| real_estate_property | Single real estate listing |
| real_estate_property_listing | List of real estate properties |
| job_posting | Single job posting (title, company, location, salary) |
| job_listing | List of job postings |
| hotel | Single hotel page |
| hotel_listing | List of hotels |
| event | Event page (name, date, location, description) |
| food_recipe | Recipe (ingredients, steps, cooking time) |
| organization | Organization/company page |
| social_media_post | Social media post |
| search_engine_results | Search engine results page |
| software | Software/app page |
| stock | Stock/financial data |
| vehicle_ad | Vehicle advertisement |

Important notes

  • Credits: Each scrape or extract call consumes credits from your account. The cost depends on the complexity of the request (JavaScript rendering and proxy usage cost more).
  • Anti-bot protection: For websites with strong anti-bot protection, enable asp: true and use proxy_pool: 'public_residential_pool' for best results.
  • JavaScript rendering: Only enable render_js: true when necessary, as it increases cost and response time.
  • Format choice: Use markdown or text for clean content. Use raw when you need the original HTML. Use clean_html for simplified HTML without scripts and styles.
  • Extraction models vs prompts: Pre-built extraction_model values are optimized for their specific content type and generally produce better results than custom prompts for supported page types.
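
The anti-bot and rendering recommendations above can be expressed as a single "hardened" options object. This is an illustrative sketch (the URL is a placeholder, and the rendering_wait value is our own choice); remember each of these flags increases credit cost.

```javascript
// Illustrative hardened options for a heavily protected site, combining the
// recommendations above. URL is a placeholder.
const hardenedOptions = {
  url: 'https://example.com/protected-page',
  format: 'markdown',
  asp: true,                             // anti-scraping protection bypass
  proxy_pool: 'public_residential_pool', // harder to detect than datacenter IPs
  render_js: true,                       // only enable if the page needs it
  rendering_wait: 2000                   // give dynamic content time to load
};
```

Pass the object to totalumClient.scrapping.scrape(hardenedOptions); drop render_js and asp for simpler targets to save credits.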