Building a Production Web Scraper: Real-World Lessons with domharvest-playwright
Theory is great, but nothing beats learning from a real project. In this case study, I’ll walk you through building a production-grade web scraper using domharvest-playwright’s built-in features—retry logic, rate limiting, batch processing, and structured logging.
The goal: Build a competitive intelligence tool that monitors product listings across multiple e-commerce sites, tracking prices, availability, and product details.
Let’s dive into what it really takes to build production-ready web scraping infrastructure when you have the right tools.
The Requirements
Before writing any code, I defined clear requirements:
Functional Requirements:
- Monitor 1,000+ products across 5 e-commerce sites
- Scrape each product every 4 hours
- Track price, availability, and ratings
- Detect and alert on significant changes
- Store data in PostgreSQL
Non-Functional Requirements:
- Handle transient failures gracefully
- Respect target server rate limits
- Complete scraping cycle within 2 hours
- Comprehensive logging for debugging
- Memory usage under 1GB
Why domharvest-playwright?
I evaluated several options, and domharvest-playwright stood out because it already includes the production features I needed:
Built-in Features:
- ✅ Automatic retry with exponential backoff
- ✅ Built-in rate limiting (global and per-domain)
- ✅ Batch processing with concurrency control
- ✅ Enhanced error handling with custom error classes
- ✅ Structured logging
- ✅ Screenshot capture for debugging
These features eliminated weeks of development time. Instead of building infrastructure, I could focus on extraction logic and business requirements.
Architecture Overview
Here’s the high-level architecture:
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Scheduler   │─────▶│  domharvest  │─────▶│  PostgreSQL  │
│ (node-cron)  │      │  (w/ batch)  │      │  (Database)  │
└──────────────┘      └──────────────┘      └──────────────┘
                             │
                             ▼
                      ┌──────────────┐
                      │ Alert System │
                      │ (Email/SMS)  │
                      └──────────────┘
Components:
- Scheduler: Manages scraping schedule (node-cron)
- Scraper: domharvest-playwright with built-in retry, rate limiting, and batch processing
- Database: PostgreSQL for historical data (schema sketch below)
- Alert System: Detects changes and sends notifications
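The products and product_urls tables referenced throughout this post are plain application tables, not something the library creates. Here's the minimal schema the later queries assume (a sketch; adjust types and indexes to your data):

import { Pool } from 'pg'

const db = new Pool({ connectionString: process.env.DATABASE_URL })

// product_urls drives the scrape list; products stores one row per scrape
await db.query(`
  CREATE TABLE IF NOT EXISTS product_urls (
    url    TEXT PRIMARY KEY,
    active BOOLEAN NOT NULL DEFAULT true
  )
`)

await db.query(`
  CREATE TABLE IF NOT EXISTS products (
    id         SERIAL PRIMARY KEY,
    url        TEXT NOT NULL,
    title      TEXT,
    price      NUMERIC(10, 2),
    in_stock   BOOLEAN,
    rating     NUMERIC(3, 1),
    scraped_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
  )
`)

// Change detection fetches the latest row per URL, so index that lookup
await db.query(`
  CREATE INDEX IF NOT EXISTS idx_products_url_scraped_at
  ON products (url, scraped_at DESC)
`)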
Implementation
Step 1: Configure the Harvester
The first step was configuring domharvest-playwright with production settings:
import { DOMHarvester } from 'domharvest-playwright'
import winston from 'winston'
// Set up Winston logger
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
transports: [
new winston.transports.File({ filename: 'error.log', level: 'error' }),
new winston.transports.File({ filename: 'combined.log' })
]
})
// Configure harvester with production settings
const harvester = new DOMHarvester({
headless: true,
timeout: 30000,
// Automatic retry with exponential backoff
retries: 3,
backoff: 'exponential',
// Rate limiting to avoid bans
rateLimit: {
requestsPerSecond: 2,
perDomain: true // Different limits per domain
},
// Structured logging
logging: {
level: 'info',
logger: {
debug: (msg) => logger.debug(msg),
info: (msg) => logger.info(msg),
warn: (msg) => logger.warn(msg),
error: (msg) => logger.error(msg)
}
},
// Error callback for monitoring
onError: (error) => {
logger.error('Scraping error', {
url: error.url,
type: error.constructor.name,
message: error.message,
timestamp: new Date().toISOString()
})
}
})
await harvester.init()
Key decisions:
- Retry with exponential backoff: Handles transient failures automatically
- Per-domain rate limiting: Different domains get separate rate limits
- Custom logger integration: Winston for production logging
- Error callbacks: Catch all errors for monitoring
Step 2: Define Site-Specific Extractors
Different e-commerce sites have different HTML structures:
// config/extractors.js
export const extractors = {
'shop-a.com': (el) => ({
title: el.querySelector('h1.product-title')?.textContent?.trim(),
price: parseFloat(
el.querySelector('.price .amount')?.textContent?.replace(/[^0-9.]/g, '') || '0'
),
inStock: el.querySelector('.stock-status')?.textContent?.includes('In stock'),
rating: parseFloat(
el.querySelector('.star-rating')?.getAttribute('aria-label')?.match(/[\d.]+/)?.[0] || '0'
)
}),
'shop-b.com': (el) => ({
title: el.querySelector('.product-name')?.textContent?.trim(),
price: parseFloat(
el.querySelector('.product-price')?.textContent?.replace(/[^0-9.]/g, '') || '0'
),
inStock: !el.querySelector('.out-of-stock-badge'),
rating: el.querySelectorAll('.rating-star.filled').length
  }),

  // Conservative fallback for domains without a dedicated extractor
  // (getExtractor() below resolves unknown domains to this entry)
  'default': (el) => ({
    title: el.querySelector('h1')?.textContent?.trim(),
    price: null,
    inStock: null,
    rating: null
  })
}
export function getExtractor(url) {
const domain = new URL(url).hostname.replace('www.', '')
return extractors[domain] || extractors['default']
}
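A quick sanity check of the lookup (hypothetical URLs):

import { getExtractor } from './config/extractors.js'

// 'www.' is stripped, so both URLs resolve to the shop-a.com extractor;
// unknown domains fall back to the generic 'default' entry
getExtractor('https://www.shop-a.com/products/123')
getExtractor('https://shop-a.com/products/123')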
Step 3: Scrape Products Using Batch Processing
Use domharvest-playwright’s harvestBatch() for efficient scraping:
import { Pool } from 'pg'
import { getExtractor } from './config/extractors.js'
const db = new Pool({ connectionString: process.env.DATABASE_URL })
async function scrapeProducts(urls) {
// Prepare batch configurations
  const configs = urls.map(url => ({
    url,
    selector: 'body',
    extractor: getExtractor(url) // resolves the per-domain extractor
  }))
// Execute batch with concurrency control
const results = await harvester.harvestBatch(configs, {
concurrency: 5, // 5 concurrent requests
onProgress: (completed, total) => {
logger.info(`Progress: ${completed}/${total} products scraped`)
}
})
logger.info(`Batch complete: ${results.length} products processed`)
return results
}
Why this approach works:
- Built-in concurrency: No need to write our own queue system
- Automatic rate limiting: domharvest handles throttling
- Retry on failure: Transient errors are retried automatically
- Progress callbacks: Real-time monitoring
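One note on the return value: I'm relying on each batch result carrying the request URL, the extracted fields, and an error property when a URL fails (that's how saveResults() consumes them in the next step). Roughly, the assumed shape is:

// Assumed shape of a single harvestBatch() result (verify against the library docs):
const exampleResult = {
  url: 'https://shop-a.com/products/123',  // request URL, echoed back
  title: 'Example Widget',                 // fields produced by the extractor...
  price: 19.99,
  inStock: true,
  rating: 4.5,
  error: undefined                         // set (instead of thrown) when the URL failed
}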
Step 4: Save Results and Detect Changes
Store results in PostgreSQL and detect price changes:
async function saveProduct(productData) {
const { url, title, price, inStock, rating } = productData
// Get previous price
const previous = await db.query(
'SELECT price, in_stock FROM products WHERE url = $1 ORDER BY scraped_at DESC LIMIT 1',
[url]
)
// Insert new data
await db.query(
`INSERT INTO products (url, title, price, in_stock, rating, scraped_at)
VALUES ($1, $2, $3, $4, $5, NOW())`,
[url, title, price, inStock, rating]
)
// Detect significant changes
if (previous.rows.length > 0) {
const prev = previous.rows[0]
const changes = []
    // Price change > 10% (guard against a zero previous price)
    if (prev.price > 0 && Math.abs(price - prev.price) / prev.price > 0.10) {
changes.push(`Price: $${prev.price} → $${price}`)
}
// Stock status change
if (inStock !== prev.in_stock) {
changes.push(`Stock: ${prev.in_stock ? 'In stock' : 'Out'} → ${inStock ? 'In stock' : 'Out'}`)
}
if (changes.length > 0) {
await sendAlert(productData, changes)
}
}
}
async function saveResults(results) {
for (const result of results) {
if (!result.error) {
await saveProduct(result)
} else {
logger.error('Failed to scrape product', {
url: result.url,
error: result.error
})
}
}
}
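The sendAlert() helper above is plain application code, not part of domharvest-playwright. Here's a minimal email-only sketch using nodemailer (the SMTP settings and recipient are placeholders; an SMS path would follow the same pattern with your provider's SDK):

import nodemailer from 'nodemailer'

// Plain SMTP transport; credentials come from the environment
const mailer = nodemailer.createTransport({
  host: process.env.SMTP_HOST,
  port: 587,
  auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS }
})

async function sendAlert(productData, changes) {
  const { url, title } = productData
  await mailer.sendMail({
    from: 'scraper@example.com',
    to: process.env.ALERT_EMAIL,
    subject: `Product change: ${title}`,
    text: `${url}\n\n${changes.join('\n')}`
  })
  // logger is the Winston instance from Step 1
  logger.info('Alert sent', { url, changes })
}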
Step 5: Schedule with node-cron
Automate the scraping process:
import cron from 'node-cron'
// Run every 4 hours
cron.schedule('0 */4 * * *', async () => {
logger.info('Starting scheduled scrape')
try {
// Get product URLs from database
const result = await db.query('SELECT url FROM product_urls WHERE active = true')
const urls = result.rows.map(row => row.url)
// Scrape using batch processing
const results = await scrapeProducts(urls)
// Save results and detect changes
await saveResults(results)
logger.info('Scheduled scrape complete', {
total: urls.length,
successful: results.filter(r => !r.error).length,
failed: results.filter(r => r.error).length
})
} catch (error) {
logger.error('Scheduled scrape failed', {
error: error.message,
stack: error.stack
})
}
})
logger.info('Scraper scheduler started')
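One thing the scheduler doesn't cover is shutdown. On SIGTERM/SIGINT you'll want to stop scheduling new runs, then release the browser and the connection pool. A rough sketch; the harvester cleanup call is an assumption, so substitute whatever teardown method the library actually exposes:

async function shutdown(signal) {
  logger.info(`Received ${signal}, shutting down`)
  try {
    // If you keep the task returned by cron.schedule(), call task.stop() here
    await harvester.close() // assumption: replace with the library's actual cleanup method
    await db.end()          // drain the pg connection pool
  } catch (error) {
    logger.error('Shutdown error', { error: error.message })
  } finally {
    process.exit(0)
  }
}

process.on('SIGTERM', () => shutdown('SIGTERM'))
process.on('SIGINT', () => shutdown('SIGINT'))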
Production Lessons Learned
1. Built-in Retry Logic is Essential
Reality: Networks fail, pages timeout, sites go down temporarily.
Solution: domharvest-playwright’s automatic retry with exponential backoff handled this perfectly:
const harvester = new DOMHarvester({
retries: 3,
backoff: 'exponential'
})
Results:
- 95% of transient failures recovered automatically
- No manual retry logic needed
- Exponential backoff prevented overwhelming servers
2. Rate Limiting Prevents Bans
Reality: Aggressive scraping gets you IP-banned quickly.
Solution: Built-in per-domain rate limiting:
const harvester = new DOMHarvester({
rateLimit: {
requestsPerSecond: 2,
perDomain: true
}
})
Results:
- Zero IP bans after implementing rate limits
- Respects each site’s capacity
- No manual delay management needed
3. Batch Processing with Concurrency is Faster
Reality: Sequential scraping is too slow for 1,000+ products.
Solution: harvestBatch() with concurrency control:
await harvester.harvestBatch(configs, {
concurrency: 5
})
Results:
- 1,000 products: 15 minutes (vs. 2+ hours sequential)
- Concurrency + rate limiting work together
- Automatic error isolation per URL
4. Structured Logging Enables Debugging
Reality: Production issues are hard to diagnose without good logs.
Solution: Integrated Winston with domharvest-playwright’s logging:
const harvester = new DOMHarvester({
logging: {
level: 'info',
logger: {
      debug: (msg) => logger.debug(msg),
      info: (msg) => logger.info(msg),
      warn: (msg) => logger.warn(msg),
      error: (msg) => logger.error(msg)
}
}
})
Results:
- Every request logged with context
- Retry attempts visible in logs
- Easy debugging of failures
- Integration with existing logging infrastructure
5. Error Callbacks Enable Monitoring
Reality: You need to know when scraping fails.
Solution: Error callbacks for real-time alerting:
const harvester = new DOMHarvester({
onError: (error) => {
// Send to monitoring service
monitoring.track('scraping_error', {
url: error.url,
type: error.constructor.name,
message: error.message
})
}
})
Results:
- Real-time error notifications
- Integration with monitoring tools
- Quick response to issues
6. Custom Error Classes Provide Context
Reality: Generic errors don’t tell you what went wrong.
Solution: domharvest-playwright’s custom error classes:
import { TimeoutError, NavigationError, ExtractionError } from 'domharvest-playwright'
try {
await harvester.harvest(url, selector, extractor)
} catch (error) {
if (error instanceof TimeoutError) {
logger.error('Timeout', { url: error.url, timeout: error.timeout })
} else if (error instanceof NavigationError) {
logger.error('Navigation failed', { url: error.url, cause: error.cause })
} else if (error instanceof ExtractionError) {
logger.error('Extraction failed', { selector: error.selector })
}
}
Results:
- Rich error context for debugging
- Type-safe error handling
- Better alerting specificity
7. Screenshots Help Debug Selector Issues
Reality: Selectors break when sites change.
Solution: Capture screenshots on failures:
await harvester.harvest(url, selector, extractor, {
screenshot: { path: `errors/${Date.now()}.png` }
})
Results:
- Visual confirmation of selector issues
- Faster debugging
- Historical record of site changes
Performance Metrics
After optimization, here’s what I achieved:
Scraping Performance:
- 1,000 products scraped in 15 minutes
- 5 concurrent workers
- ~0.9 seconds per product
- 98.5% success rate (with retries)
System Reliability:
- Automatic recovery from 95% of failures
- Zero IP bans with rate limiting
- No manual intervention needed
Resource Usage:
- Memory: 400-600MB (single browser instance)
- CPU: 20-30% average
- Network: Respects rate limits automatically
Cost Breakdown
Running this in production is affordable:
Infrastructure:
- DigitalOcean Droplet (2GB RAM, 1 vCPU): $12/month
- PostgreSQL Database (managed, 1GB): $15/month
- Total: $27/month
No proxy service needed thanks to respectful rate limiting.
Complete Production Code
Here’s the full production implementation:
import { DOMHarvester, TimeoutError, NavigationError, ExtractionError } from 'domharvest-playwright'
import { Pool } from 'pg'
import winston from 'winston'
import cron from 'node-cron'
import { getExtractor } from './config/extractors.js'
// Logger setup
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
transports: [
new winston.transports.File({ filename: 'error.log', level: 'error' }),
new winston.transports.File({ filename: 'combined.log' })
]
})
// Database setup
const db = new Pool({ connectionString: process.env.DATABASE_URL })
// Harvester with production configuration
const harvester = new DOMHarvester({
headless: true,
timeout: 30000,
retries: 3,
backoff: 'exponential',
rateLimit: {
requestsPerSecond: 2,
perDomain: true
},
logging: {
level: 'info',
logger: {
debug: (msg) => logger.debug(msg),
info: (msg) => logger.info(msg),
warn: (msg) => logger.warn(msg),
error: (msg) => logger.error(msg)
}
},
onError: (error) => {
logger.error('Scraping error', {
url: error.url,
type: error.constructor.name,
message: error.message,
timestamp: new Date().toISOString()
})
}
})
async function scrapeProducts(urls) {
const configs = urls.map(url => ({
url,
selector: 'body',
extractor: getExtractor(url)
}))
const results = await harvester.harvestBatch(configs, {
concurrency: 5,
onProgress: (completed, total) => {
logger.info(`Progress: ${completed}/${total}`)
}
})
return results
}
async function saveProduct(productData) {
const { url, title, price, inStock, rating } = productData
const previous = await db.query(
'SELECT price, in_stock FROM products WHERE url = $1 ORDER BY scraped_at DESC LIMIT 1',
[url]
)
await db.query(
`INSERT INTO products (url, title, price, in_stock, rating, scraped_at)
VALUES ($1, $2, $3, $4, $5, NOW())`,
[url, title, price, inStock, rating]
)
if (previous.rows.length > 0) {
const prev = previous.rows[0]
const changes = []
    // Guard against a zero previous price before computing the percentage change
    if (prev.price > 0 && Math.abs(price - prev.price) / prev.price > 0.10) {
changes.push(`Price: $${prev.price} → $${price}`)
}
if (inStock !== prev.in_stock) {
changes.push(`Stock: ${prev.in_stock ? 'In' : 'Out'} → ${inStock ? 'In' : 'Out'}`)
}
if (changes.length > 0) {
logger.info('Product changed', { url, changes })
// Send alert here
}
}
}
async function saveResults(results) {
for (const result of results) {
if (!result.error) {
await saveProduct(result)
} else {
logger.error('Scrape failed', { url: result.url, error: result.error })
}
}
}
// Initialize
await harvester.init()
// Schedule scraping every 4 hours
cron.schedule('0 */4 * * *', async () => {
logger.info('Starting scheduled scrape')
try {
const result = await db.query('SELECT url FROM product_urls WHERE active = true')
const urls = result.rows.map(row => row.url)
const results = await scrapeProducts(urls)
await saveResults(results)
logger.info('Scrape complete', {
total: urls.length,
successful: results.filter(r => !r.error).length,
failed: results.filter(r => r.error).length
})
} catch (error) {
logger.error('Scheduled scrape failed', { error: error.message })
}
})
logger.info('Scraper running')
Conclusion
Building production web scrapers is significantly easier when you have the right tools. domharvest-playwright eliminated weeks of infrastructure development by providing:
Production features out of the box:
- Automatic retry with exponential backoff
- Built-in rate limiting (global and per-domain)
- Batch processing with concurrency control
- Enhanced error handling with custom error classes
- Structured logging with custom logger support
- Screenshot capture for debugging
Key takeaways:
- Don’t build what you can configure: Use built-in retry and rate limiting
- Batch processing is essential: Process multiple URLs efficiently
- Structured logging is critical: Integrate with existing logging infrastructure
- Error callbacks enable monitoring: Real-time alerts on failures
- Rate limiting prevents bans: Automatic throttling respects server capacity
domharvest-playwright handled the core scraping infrastructure, letting me focus on business logic—extracting data, detecting changes, and sending alerts. That’s exactly what a good library should do.
Building your own production scraper? Share your experiences on GitHub Discussions!