Web Scraping
Totalum allows you to scrape web pages and extract structured data from them using the Totalum API or SDK. The web scraping service supports multiple output formats, anti-bot bypass, JavaScript rendering, presets for popular platforms, and AI-powered data extraction.
📚 Setup Required: For installation and usage of the Totalum SDK or API, see the Installation Guide.
Scrape a web page
Use Case:
Scrape any web page and get its content in a specific format (text, markdown, HTML, JSON, etc.). This is useful when you need to read the content of a website programmatically.
```javascript
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com',
  format: 'markdown' // 'raw' | 'json' | 'text' | 'markdown' | 'clean_html'
});

const pageContent = result.data.content;
console.log(pageContent);
```
Scrape with JavaScript rendering
Some websites load their content dynamically using JavaScript. Use `render_js` to render the page before scraping.
```javascript
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com/dynamic-page',
  format: 'text',
  render_js: true, // renders JavaScript before scraping
  wait_for_selector: '.main-content', // optional: wait until this CSS selector appears
  rendering_wait: 3000 // optional: wait 3 seconds after page load
});

const pageContent = result.data.content;
```
Scrape using a preset for popular platforms
Presets automatically configure the optimal scraping settings for popular websites. Available presets: `google`, `amazon`, `instagram`, `linkedin`, `twitter`, `youtube`, `ebay`, `walmart`.
```javascript
// Scrape an Amazon product page with the Amazon preset
const result = await totalumClient.scrapping.scrape({
  url: 'https://www.amazon.com/dp/B0EXAMPLE',
  preset: 'amazon',
  format: 'raw'
});

const pageContent = result.data.content;
```
Scrape with proxy
Use a proxy pool to avoid being blocked by the target website. Available pools: `public_datacenter_pool` (faster, cheaper) and `public_residential_pool` (harder to detect).
```javascript
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com',
  format: 'markdown',
  proxy_pool: 'public_residential_pool'
});
```
Scrape with custom headers and cookies
```javascript
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com',
  format: 'text',
  headers: {
    'Accept-Language': 'es-ES',
    'User-Agent': 'CustomAgent/1.0'
  },
  cookies: {
    'session_id': 'abc123',
    'consent': 'true'
  }
});
```
Scrape and extract data in one call
You can scrape a page and extract structured data from it in a single call by adding an `extraction_prompt` or `extraction_model`.
```javascript
// Using a custom prompt
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com/product/123',
  format: 'text',
  extraction_prompt: 'Extract the product name, price, and description as JSON: { "name": "", "price": 0, "description": "" }'
});

const extractedData = result.data.extracted_data;
console.log(extractedData); // { name: "Product X", price: 29.99, description: "..." }
```

```javascript
// Using a pre-built extraction model
const result = await totalumClient.scrapping.scrape({
  url: 'https://example.com/product/123',
  format: 'text',
  extraction_model: 'product' // automatically extracts product data
});

const extractedData = result.data.extracted_data;
```
All scrape options
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The URL of the page to scrape |
| `format` | string | No | Output format: `raw`, `json`, `text`, `markdown`, `clean_html`. Default: `raw` |
| `format_options` | string[] | No | Format modifiers: `no_links`, `no_images`, `only_content` |
| `preset` | string | No | Platform preset: `google`, `amazon`, `instagram`, `linkedin`, `twitter`, `youtube`, `ebay`, `walmart`, `generic` |
| `render_js` | boolean | No | Render JavaScript before scraping |
| `rendering_wait` | number | No | Milliseconds to wait after page load (when `render_js` is true) |
| `wait_for_selector` | string | No | CSS selector to wait for before scraping |
| `auto_scroll` | boolean | No | Automatically scroll the page to load lazy content |
| `js` | string | No | Custom JavaScript to execute on the page |
| `proxy_pool` | string | No | Proxy pool: `public_datacenter_pool` or `public_residential_pool` |
| `asp` | boolean | No | Enable anti-scraping protection bypass |
| `country` | string | No | Country code for geo-targeting (e.g., `US`, `ES`) |
| `headers` | object | No | Custom HTTP headers |
| `cookies` | object | No | Custom cookies |
| `method` | string | No | HTTP method: `GET`, `POST`, `PUT`, `DELETE`. Default: `GET` |
| `body` | string | No | HTTP request body (for POST/PUT requests) |
| `cache` | boolean | No | Enable response caching |
| `cache_ttl` | number | No | Cache time-to-live in seconds |
| `timeout` | number | No | Request timeout in milliseconds |
| `extraction_model` | string | No | Pre-built AI extraction model (see extraction models table below) |
| `extraction_prompt` | string | No | Custom AI extraction prompt |
| `extraction_template` | string | No | Custom extraction template |
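Several of these options accept only a fixed set of values. As a sketch, you could validate `format_options` before handing the request to the SDK; the `buildScrapeRequest` helper below is hypothetical and not part of the Totalum SDK:

```javascript
// Hypothetical helper (not part of the Totalum SDK): validate format_options
// against the documented modifiers before building a scrape request object.
function buildScrapeRequest(url, format, formatOptions = []) {
  const allowed = ['no_links', 'no_images', 'only_content'];
  const invalid = formatOptions.filter((opt) => !allowed.includes(opt));
  if (invalid.length > 0) {
    throw new Error(`Unknown format option(s): ${invalid.join(', ')}`);
  }
  return { url, format, format_options: formatOptions };
}

const request = buildScrapeRequest('https://example.com/blog/post', 'markdown', [
  'no_links',
  'only_content'
]);
// Then pass it to the SDK:
// const result = await totalumClient.scrapping.scrape(request);
```

Catching an invalid modifier locally avoids spending credits on a request the API would reject or silently ignore.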
Scrape response
```javascript
const result = await totalumClient.scrapping.scrape({ url: 'https://example.com', format: 'text' });

// result.data contains:
{
  url: 'https://example.com', // final URL (after redirects)
  status_code: 200, // HTTP status code
  content: '...', // scraped content in the requested format
  content_type: 'text/html', // content type of the page
  headers: { ... }, // response headers
  format: 'text', // format used
  extracted_data: { ... }, // only present if extraction was requested
}
```
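Since `status_code` reflects the target page's HTTP status, it is worth checking before treating `content` as valid. A minimal sketch (the `assertScrapeOk` helper is hypothetical, not part of the SDK):

```javascript
// Hypothetical guard (not part of the Totalum SDK): fail fast on non-2xx
// scrapes instead of parsing an error page as real content.
function assertScrapeOk(responseData) {
  const { status_code: status, url } = responseData;
  if (status < 200 || status >= 300) {
    throw new Error(`Scrape of ${url} returned HTTP ${status}`);
  }
  return responseData.content;
}

// Example with a mocked response matching the shape documented above:
const mock = { url: 'https://example.com', status_code: 200, content: 'Hello', format: 'text' };
const content = assertScrapeOk(mock);
// In real code: const content = assertScrapeOk(result.data);
```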
Extract structured data from a page
Use Case:
Extract structured data from a URL or from HTML/markdown content you already have. This is useful when you need specific data points from a page (e.g., product info, article content, contact details) instead of the full raw content.
The key difference between `scrape` and `extract` is that `extract` uses AI to parse and structure the data, returning clean JSON instead of raw page content.
Extract from a URL with a custom prompt
```javascript
const result = await totalumClient.scrapping.extract({
  url: 'https://example.com/contact',
  extraction_prompt: 'Extract all contact information as JSON: { "emails": [], "phones": [], "address": "" }'
});

const contactInfo = result.data.data;
console.log(contactInfo); // { emails: ["[email protected]"], phones: ["+34 123 456 789"], address: "..." }
```
Extract from a URL with a pre-built model
Instead of writing a custom prompt, you can use a pre-built extraction model that automatically extracts the right data based on the page type.
```javascript
const result = await totalumClient.scrapping.extract({
  url: 'https://example.com/product/123',
  extraction_model: 'product' // automatically extracts product data (name, price, description, images, etc.)
});

const productData = result.data.data;
console.log(productData);
```
Extract from HTML content you already have
If you already have the HTML or markdown content (e.g., from a previous scrape or from your own data), you can extract data from it directly without making a new HTTP request.
```javascript
const htmlContent = '<html><body><h1>John Doe</h1><p>Age: 30</p><p>City: Barcelona</p></body></html>';

const result = await totalumClient.scrapping.extract({
  content: htmlContent,
  content_type: 'text/html',
  extraction_prompt: 'Extract name, age, and city as JSON: { "name": "", "age": 0, "city": "" }'
});

const personData = result.data.data;
console.log(personData); // { name: "John Doe", age: 30, city: "Barcelona" }
```
Extract from a URL with custom scrape settings
You can customize how the page is scraped before extraction by passing `scrape_config`.
```javascript
const result = await totalumClient.scrapping.extract({
  url: 'https://example.com/dynamic-page',
  scrape_config: {
    render_js: true,
    wait_for_selector: '.product-details',
    proxy_pool: 'public_residential_pool'
  },
  extraction_prompt: 'Extract the product name and price as JSON: { "name": "", "price": 0 }'
});

const productData = result.data.data;
```
All extract options
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | No* | The URL to scrape and extract from |
| `content` | string | No* | HTML or markdown content to extract from directly |
| `content_type` | string | No | Content type of the provided content: `text/html` or `text/markdown` |
| `scrape_config` | object | No | Custom scrape settings when extracting from a URL (same options as `scrape` except `url`) |
| `extraction_model` | string | No** | Pre-built AI extraction model (see extraction models table below) |
| `extraction_prompt` | string | No** | Custom AI extraction prompt |
| `extraction_template` | string | No** | Custom extraction template |

\* You must provide either `url` or `content`, but not both.
\*\* You must provide at least one of `extraction_model`, `extraction_prompt`, or `extraction_template`.
Extract response
```javascript
const result = await totalumClient.scrapping.extract({ url: 'https://example.com', extraction_prompt: '...' });

// result.data contains:
{
  content_type: 'application/json', // type of extracted data
  data: { ... }, // the extracted structured data
  data_quality: { // quality assessment of the extraction
    errors: [], // any extraction errors
    fulfilled: true, // whether the extraction was successful
    fulfillment_percent: 100 // percentage of requested data extracted
  },
  url: 'https://example.com', // source URL (if extracted from URL)
}
```
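The `data_quality` object lets you gate downstream processing on extraction completeness. As a sketch (the `acceptExtraction` helper and the 80% threshold are hypothetical choices, not part of the SDK):

```javascript
// Hypothetical quality gate (not part of the Totalum SDK): only accept
// extractions the API marked as fulfilled above a completeness threshold.
function acceptExtraction(responseData, minPercent = 80) {
  const quality = responseData.data_quality;
  if (!quality || !quality.fulfilled || quality.fulfillment_percent < minPercent) {
    return null; // caller can retry with render_js, a proxy pool, or a refined prompt
  }
  return responseData.data;
}

// Mocked responses matching the documented shape:
const good = acceptExtraction({
  data: { name: 'Product X', price: 29.99 },
  data_quality: { errors: [], fulfilled: true, fulfillment_percent: 100 }
});
const bad = acceptExtraction({
  data: { name: 'Product X' },
  data_quality: { errors: ['price missing'], fulfilled: false, fulfillment_percent: 50 }
});
```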
Available extraction models
Both `scrape` and `extract` support these pre-built models via the `extraction_model` parameter. They automatically extract structured data without needing a custom prompt.
| Model | Description |
|---|---|
| `product` | Single product page (name, price, description, images, etc.) |
| `product_listing` | Product listing/category page (list of products) |
| `article` | News article or blog post (title, author, date, content) |
| `review_list` | List of reviews (reviewer, rating, text, date) |
| `real_estate_property` | Single real estate listing |
| `real_estate_property_listing` | List of real estate properties |
| `job_posting` | Single job posting (title, company, location, salary) |
| `job_listing` | List of job postings |
| `hotel` | Single hotel page |
| `hotel_listing` | List of hotels |
| `event` | Event page (name, date, location, description) |
| `food_recipe` | Recipe (ingredients, steps, cooking time) |
| `organization` | Organization/company page |
| `social_media_post` | Social media post |
| `search_engine_results` | Search engine results page |
| `software` | Software/app page |
| `stock` | Stock/financial data |
| `vehicle_ad` | Vehicle advertisement |
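When processing mixed batches of URLs, you may want to pick a model from the URL itself. The heuristic below is a hypothetical sketch (the keyword rules are assumptions, not SDK behavior), falling back to a custom prompt when nothing matches:

```javascript
// Hypothetical heuristic (not part of the Totalum SDK): guess an
// extraction_model from common URL path keywords.
function guessExtractionModel(url) {
  const path = new URL(url).pathname.toLowerCase();
  const rules = [
    [/\/(product|dp|item)\//, 'product'],
    [/\/(article|blog|news)\//, 'article'],
    [/\/(job|career)s?\//, 'job_posting'],
    [/\/recipes?\//, 'food_recipe']
  ];
  for (const [pattern, model] of rules) {
    if (pattern.test(path)) return model;
  }
  return null; // no match: supply extraction_prompt instead
}

const model = guessExtractionModel('https://shop.example.com/product/123');
// if (model) { await totalumClient.scrapping.extract({ url, extraction_model: model }); }
```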
Important notes
- Credits: Each scrape or extract call consumes credits from your account. The cost depends on the complexity of the request (JavaScript rendering and proxy usage cost more).
- Anti-bot protection: For websites with strong anti-bot protection, enable `asp: true` and use `proxy_pool: 'public_residential_pool'` for best results.
- JavaScript rendering: Only enable `render_js: true` when necessary, as it increases cost and response time.
- Format choice: Use `markdown` or `text` for clean content. Use `raw` when you need the original HTML. Use `clean_html` for simplified HTML without scripts and styles.
- Extraction models vs prompts: Pre-built `extraction_model` values are optimized for their specific content type and generally produce better results than custom prompts for supported page types.
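Putting the cost notes together: start from the cheapest request and only escalate the expensive options when a site actually needs them. A hypothetical sketch (the `hardenedScrapeOptions` helper is not part of the SDK):

```javascript
// Hypothetical helper (not part of the Totalum SDK): build options for a
// heavily protected site. Each enabled option adds cost, so only turn on
// JavaScript rendering when the page is dynamic.
function hardenedScrapeOptions(url, { dynamic = false } = {}) {
  return {
    url,
    format: 'markdown',
    asp: true, // anti-scraping protection bypass
    proxy_pool: 'public_residential_pool', // harder to detect than datacenter IPs
    ...(dynamic ? { render_js: true, rendering_wait: 3000 } : {})
  };
}

const opts = hardenedScrapeOptions('https://example.com/protected', { dynamic: true });
// Then: const result = await totalumClient.scrapping.scrape(opts);
```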