Max B. - Web Scraping & Automation

Building a Production Web Scraper: Real-World Lessons with domharvest-playwright

07 Jan 2025

Theory is great, but nothing beats learning from a real project. In this case study, I’ll walk you through building a production-grade web scraper using domharvest-playwright’s built-in features—retry logic, rate limiting, batch processing, and structured logging.

The goal: Build a competitive intelligence tool that monitors product listings across multiple e-commerce sites, tracking prices, availability, and product details.

Let’s dive into what it really takes to build production-ready web scraping infrastructure when you have the right tools.

The Requirements

Before writing any code, I defined clear requirements:

Functional Requirements:

  • Monitor 1,000+ products across 5 e-commerce sites
  • Scrape each product every 4 hours
  • Track price, availability, and ratings
  • Detect and alert on significant changes
  • Store data in PostgreSQL

Non-Functional Requirements:

  • Handle transient failures gracefully
  • Respect target server rate limits
  • Complete scraping cycle within 2 hours
  • Comprehensive logging for debugging
  • Memory usage under 1GB
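
These numbers translate almost directly into configuration values used throughout the rest of this post. I keep them in one place so the scheduler, harvester, and batch code all agree; the config/constants.js path and the constant names below are my own convention, not part of domharvest-playwright:

// config/constants.js - the numbers the requirements dictate, in one place
export const SCRAPE_SCHEDULE = '0 */4 * * *'  // scrape every product every 4 hours
export const MAX_CYCLE_MINUTES = 120          // a full cycle must finish within 2 hours
export const CONCURRENCY = 5                  // concurrent pages per batch
export const REQUESTS_PER_SECOND = 2          // per-domain rate limit
export const RETRIES = 3                      // retry budget for transient failures
export const MEMORY_BUDGET_MB = 1024          // soft ceiling for the scraper process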

Why domharvest-playwright?

I evaluated several options, and domharvest-playwright stood out because it already includes the production features I needed:

Built-in Features:

  • ✅ Automatic retry with exponential backoff
  • ✅ Built-in rate limiting (global and per-domain)
  • ✅ Batch processing with concurrency control
  • ✅ Enhanced error handling with custom error classes
  • ✅ Structured logging
  • ✅ Screenshot capture for debugging

These features eliminated weeks of development time. Instead of building infrastructure, I could focus on extraction logic and business requirements.

Architecture Overview

Here’s the high-level architecture:

┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Scheduler  │─────▶│  domharvest  │─────▶│  PostgreSQL  │
│  (node-cron) │      │  (w/ batch)  │      │  (Database)  │
└──────────────┘      └──────────────┘      └──────────────┘
                             │
                             ▼
                      ┌──────────────┐
                      │ Alert System │
                      │  (Email/SMS) │
                      └──────────────┘

Components:

  1. Scheduler: Manages scraping schedule (node-cron)
  2. Scraper: domharvest-playwright with built-in retry, rate limiting, and batch processing
  3. Database: PostgreSQL for historical data
  4. Alert System: Detects changes and sends notifications
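
The code in the following sections assumes two PostgreSQL tables: product_urls (the watchlist) and products (one row per scrape). The exact schema is application code, not something domharvest-playwright dictates; here is a minimal sketch that matches the columns the queries below reference:

// db/schema.js - minimal schema matching the queries used later (a sketch, not a full migration)
import { Pool } from 'pg'

const db = new Pool({ connectionString: process.env.DATABASE_URL })

await db.query(`
  CREATE TABLE IF NOT EXISTS product_urls (
    url    TEXT PRIMARY KEY,
    active BOOLEAN NOT NULL DEFAULT true
  );

  CREATE TABLE IF NOT EXISTS products (
    id         BIGSERIAL PRIMARY KEY,
    url        TEXT NOT NULL REFERENCES product_urls(url),
    title      TEXT,
    price      NUMERIC(10, 2),
    in_stock   BOOLEAN,
    rating     NUMERIC(3, 1),
    scraped_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
  );

  -- Supports the "latest scrape for this URL" lookup in saveProduct
  CREATE INDEX IF NOT EXISTS products_url_scraped_at_idx
    ON products (url, scraped_at DESC);
`)

await db.end()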

Implementation

Step 1: Configure the Harvester

The first step was configuring domharvest-playwright with production settings:

import { DOMHarvester } from 'domharvest-playwright'
import winston from 'winston'

// Set up Winston logger
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' })
  ]
})

// Configure harvester with production settings
const harvester = new DOMHarvester({
  headless: true,
  timeout: 30000,

  // Automatic retry with exponential backoff
  retries: 3,
  backoff: 'exponential',

  // Rate limiting to avoid bans
  rateLimit: {
    requestsPerSecond: 2,
    perDomain: true // Different limits per domain
  },

  // Structured logging
  logging: {
    level: 'info',
    logger: {
      debug: (msg) => logger.debug(msg),
      info: (msg) => logger.info(msg),
      warn: (msg) => logger.warn(msg),
      error: (msg) => logger.error(msg)
    }
  },

  // Error callback for monitoring
  onError: (error) => {
    logger.error('Scraping error', {
      url: error.url,
      type: error.constructor.name,
      message: error.message,
      timestamp: new Date().toISOString()
    })
  }
})

await harvester.init()

Key decisions:

  1. Retry with exponential backoff: Handles transient failures automatically
  2. Per-domain rate limiting: Different domains get separate rate limits
  3. Custom logger integration: Winston for production logging
  4. Error callbacks: Catch all errors for monitoring

Step 2: Define Site-Specific Extractors

Different e-commerce sites have different HTML structures:

// config/extractors.js
export const extractors = {
  'shop-a.com': (el) => ({
    title: el.querySelector('h1.product-title')?.textContent?.trim(),
    price: parseFloat(
      el.querySelector('.price .amount')?.textContent?.replace(/[^0-9.]/g, '') || '0'
    ),
    inStock: el.querySelector('.stock-status')?.textContent?.includes('In stock'),
    rating: parseFloat(
      el.querySelector('.star-rating')?.getAttribute('aria-label')?.match(/[\d.]+/)?.[0] || '0'
    )
  }),

  'shop-b.com': (el) => ({
    title: el.querySelector('.product-name')?.textContent?.trim(),
    price: parseFloat(
      el.querySelector('.product-price')?.textContent?.replace(/[^0-9.]/g, '') || '0'
    ),
    inStock: !el.querySelector('.out-of-stock-badge'),
    rating: el.querySelectorAll('.rating-star.filled').length
  }),

  // Generic fallback for sites without a dedicated extractor (placeholder selectors; tune per site)
  default: (el) => ({
    title: el.querySelector('h1')?.textContent?.trim(),
    price: parseFloat(
      el.querySelector('[class*="price"]')?.textContent?.replace(/[^0-9.]/g, '') || '0'
    ),
    inStock: null,
    rating: null
  })
}

export function getExtractor(url) {
  const domain = new URL(url).hostname.replace('www.', '')
  return extractors[domain] || extractors['default']
}
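
A quick sanity check of the lookup (both product URLs here are made up):

// Resolves to the shop-a.com extractor
const extractA = getExtractor('https://www.shop-a.com/products/example-widget')

// Unknown domains fall back to the generic extractor
const extractOther = getExtractor('https://www.some-other-shop.com/item/42')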

Step 3: Scrape Products Using Batch Processing

Use domharvest-playwright’s harvestBatch() for efficient scraping:

import { Pool } from 'pg'
import { getExtractor } from './config/extractors.js'

const db = new Pool({ connectionString: process.env.DATABASE_URL })

async function scrapeProducts(urls) {
  // Prepare batch configurations
  const configs = urls.map(url => ({
    url,
    selector: 'body',
    extractor: getExtractor(url) // resolves the per-domain extractor
  }))

  // Execute batch with concurrency control
  const results = await harvester.harvestBatch(configs, {
    concurrency: 5, // 5 concurrent requests
    onProgress: (completed, total) => {
      logger.info(`Progress: ${completed}/${total} products scraped`)
    }
  })

  logger.info(`Batch complete: ${results.length} products processed`)
  return results
}

Why this approach works:

  1. Built-in concurrency: No need to write our own queue system
  2. Automatic rate limiting: domharvest handles throttling
  3. Retry on failure: Transient errors are retried automatically
  4. Progress callbacks: Real-time monitoring

Step 4: Save Results and Detect Changes

Store results in PostgreSQL and detect price changes:

async function saveProduct(productData) {
  const { url, title, price, inStock, rating } = productData

  // Get previous price
  const previous = await db.query(
    'SELECT price, in_stock FROM products WHERE url = $1 ORDER BY scraped_at DESC LIMIT 1',
    [url]
  )

  // Insert new data
  await db.query(
    `INSERT INTO products (url, title, price, in_stock, rating, scraped_at)
     VALUES ($1, $2, $3, $4, $5, NOW())`,
    [url, title, price, inStock, rating]
  )

  // Detect significant changes
  if (previous.rows.length > 0) {
    const prev = previous.rows[0]
    const changes = []

    // Price change > 10%
    if (Math.abs(price - prev.price) / prev.price > 0.10) {
      changes.push(`Price: $${prev.price} → $${price}`)
    }

    // Stock status change
    if (inStock !== prev.in_stock) {
      changes.push(`Stock: ${prev.in_stock ? 'In stock' : 'Out of stock'} → ${inStock ? 'In stock' : 'Out of stock'}`)
    }

    if (changes.length > 0) {
      await sendAlert(productData, changes)
    }
  }
}

async function saveResults(results) {
  for (const result of results) {
    if (!result.error) {
      await saveProduct(result)
    } else {
      logger.error('Failed to scrape product', {
        url: result.url,
        error: result.error
      })
    }
  }
}
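
The sendAlert helper called above is plain application code, not part of the library. Here is a minimal sketch that posts the change summary to a webhook; the ALERT_WEBHOOK_URL environment variable and the payload shape are assumptions of mine, and the function lives in the same module as saveProduct so it can reuse the Winston logger:

// Minimal webhook-based alerting, defined alongside saveProduct (swap in email/SMS as needed)
async function sendAlert(productData, changes) {
  const payload = {
    text: `Product changed: ${productData.title}\n${productData.url}\n${changes.join('\n')}`
  }

  try {
    // Node 18+ ships a global fetch
    await fetch(process.env.ALERT_WEBHOOK_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload)
    })
  } catch (error) {
    logger.error('Failed to send alert', { url: productData.url, error: error.message })
  }
}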

Step 5: Schedule with node-cron

Automate the scraping process:

import cron from 'node-cron'

// Run every 4 hours
cron.schedule('0 */4 * * *', async () => {
  logger.info('Starting scheduled scrape')

  try {
    // Get product URLs from database
    const result = await db.query('SELECT url FROM product_urls WHERE active = true')
    const urls = result.rows.map(row => row.url)

    // Scrape using batch processing
    const results = await scrapeProducts(urls)

    // Save results and detect changes
    await saveResults(results)

    logger.info('Scheduled scrape complete', {
      total: urls.length,
      successful: results.filter(r => !r.error).length,
      failed: results.filter(r => r.error).length
    })
  } catch (error) {
    logger.error('Scheduled scrape failed', {
      error: error.message,
      stack: error.stack
    })
  }
})

logger.info('Scraper scheduler started')
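
One guard worth adding: the job fires every 4 hours and the requirement is that a cycle finishes within 2, but if a cycle ever hangs or overruns, two batches could overlap and double the load on the target sites. A simple in-process flag is enough; this is a sketch of my own (isScraping and runScrapeCycle are not node-cron or domharvest-playwright features), shown as a variant of the handler above:

// Guard against overlapping cycles
let isScraping = false

async function runScrapeCycle() {
  if (isScraping) {
    logger.warn('Previous scrape cycle still running, skipping this one')
    return
  }

  isScraping = true
  try {
    const result = await db.query('SELECT url FROM product_urls WHERE active = true')
    const urls = result.rows.map(row => row.url)
    await saveResults(await scrapeProducts(urls))
  } catch (error) {
    logger.error('Scheduled scrape failed', { error: error.message, stack: error.stack })
  } finally {
    isScraping = false
  }
}

cron.schedule('0 */4 * * *', runScrapeCycle)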

Production Lessons Learned

1. Built-in Retry Logic is Essential

Reality: Networks fail, pages timeout, sites go down temporarily.

Solution: domharvest-playwright’s automatic retry with exponential backoff handled this perfectly:

const harvester = new DOMHarvester({
  retries: 3,
  backoff: 'exponential'
})

Results:

  • 95% of transient failures recovered automatically
  • No manual retry logic needed
  • Exponential backoff prevented overwhelming servers

2. Rate Limiting Prevents Bans

Reality: Aggressive scraping gets you IP-banned quickly.

Solution: Built-in per-domain rate limiting:

const harvester = new DOMHarvester({
  rateLimit: {
    requestsPerSecond: 2,
    perDomain: true
  }
})

Results:

  • Zero IP bans after implementing rate limits
  • Respects each site’s capacity
  • No manual delay management needed

3. Batch Processing with Concurrency is Faster

Reality: Sequential scraping is too slow for 1,000+ products.

Solution: harvestBatch() with concurrency control:

await harvester.harvestBatch(configs, {
  concurrency: 5
})

Results:

  • 1,000 products: 15 minutes (vs. 2+ hours sequential)
  • Concurrency + rate limiting work together
  • Automatic error isolation per URL

4. Structured Logging Enables Debugging

Reality: Production issues are hard to diagnose without good logs.

Solution: Integrated Winston with domharvest-playwright’s logging:

const harvester = new DOMHarvester({
  logging: {
    level: 'info',
    logger: {
      debug: (msg) => logger.debug(msg),
      info: (msg) => logger.info(msg),
      warn: (msg) => logger.warn(msg),
      error: (msg) => logger.error(msg)
    }
  }
})

Results:

  • Every request logged with context
  • Retry attempts visible in logs
  • Easy debugging of failures
  • Integration with existing logging infrastructure

5. Error Callbacks Enable Monitoring

Reality: You need to know when scraping fails.

Solution: Error callbacks for real-time alerting:

const harvester = new DOMHarvester({
  onError: (error) => {
    // Send to monitoring service
    monitoring.track('scraping_error', {
      url: error.url,
      type: error.constructor.name,
      message: error.message
    })
  }
})

Results:

  • Real-time error notifications
  • Integration with monitoring tools
  • Quick response to issues

6. Custom Error Classes Provide Context

Reality: Generic errors don’t tell you what went wrong.

Solution: domharvest-playwright’s custom error classes:

import { TimeoutError, NavigationError, ExtractionError } from 'domharvest-playwright'

try {
  await harvester.harvest(url, selector, extractor)
} catch (error) {
  if (error instanceof TimeoutError) {
    logger.error('Timeout', { url: error.url, timeout: error.timeout })
  } else if (error instanceof NavigationError) {
    logger.error('Navigation failed', { url: error.url, cause: error.cause })
  } else if (error instanceof ExtractionError) {
    logger.error('Extraction failed', { selector: error.selector })
  }
}

Results:

  • Rich error context for debugging
  • Type-safe error handling
  • Better alerting specificity

7. Screenshots Help Debug Selector Issues

Reality: Selectors break when sites change.

Solution: Capture screenshots on failures:

await harvester.harvest(url, selector, extractor, {
  screenshot: { path: `errors/${Date.now()}.png` }
})

Results:

  • Visual confirmation of selector issues
  • Faster debugging
  • Historical record of site changes

Performance Metrics

After optimization, here’s what I achieved:

Scraping Performance:

  • 1,000 products scraped in 15 minutes
  • 5 concurrent workers
  • ~0.9 seconds per product
  • 98.5% success rate (with retries)

System Reliability:

  • Automatic recovery from 95% of failures
  • Zero IP bans with rate limiting
  • No manual intervention needed

Resource Usage:

  • Memory: 400-600MB (single browser instance)
  • CPU: 20-30% average
  • Network: Respects rate limits automatically

Cost Breakdown

Running this in production is affordable:

Infrastructure:

  • DigitalOcean Droplet (2GB RAM, 1 vCPU): $12/month
  • PostgreSQL Database (managed, 1GB): $15/month
  • Total: $27/month

No proxy service needed thanks to respectful rate limiting.

Complete Production Code

Here’s the full production implementation:

import { DOMHarvester, TimeoutError, NavigationError, ExtractionError } from 'domharvest-playwright'
import { Pool } from 'pg'
import winston from 'winston'
import cron from 'node-cron'
import { getExtractor } from './config/extractors.js'

// Logger setup
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' })
  ]
})

// Database setup
const db = new Pool({ connectionString: process.env.DATABASE_URL })

// Harvester with production configuration
const harvester = new DOMHarvester({
  headless: true,
  timeout: 30000,
  retries: 3,
  backoff: 'exponential',
  rateLimit: {
    requestsPerSecond: 2,
    perDomain: true
  },
  logging: {
    level: 'info',
    logger: {
      debug: (msg) => logger.debug(msg),
      info: (msg) => logger.info(msg),
      warn: (msg) => logger.warn(msg),
      error: (msg) => logger.error(msg)
    }
  },
  onError: (error) => {
    logger.error('Scraping error', {
      url: error.url,
      type: error.constructor.name,
      message: error.message,
      timestamp: new Date().toISOString()
    })
  }
})

async function scrapeProducts(urls) {
  const configs = urls.map(url => ({
    url,
    selector: 'body',
    extractor: getExtractor(url)
  }))

  const results = await harvester.harvestBatch(configs, {
    concurrency: 5,
    onProgress: (completed, total) => {
      logger.info(`Progress: ${completed}/${total}`)
    }
  })

  return results
}

async function saveProduct(productData) {
  const { url, title, price, inStock, rating } = productData

  const previous = await db.query(
    'SELECT price, in_stock FROM products WHERE url = $1 ORDER BY scraped_at DESC LIMIT 1',
    [url]
  )

  await db.query(
    `INSERT INTO products (url, title, price, in_stock, rating, scraped_at)
     VALUES ($1, $2, $3, $4, $5, NOW())`,
    [url, title, price, inStock, rating]
  )

  if (previous.rows.length > 0) {
    const prev = previous.rows[0]
    const changes = []

    if (Math.abs(price - prev.price) / prev.price > 0.10) {
      changes.push(`Price: $${prev.price} → $${price}`)
    }

    if (inStock !== prev.in_stock) {
      changes.push(`Stock: ${prev.in_stock ? 'In stock' : 'Out of stock'} → ${inStock ? 'In stock' : 'Out of stock'}`)
    }

    if (changes.length > 0) {
      logger.info('Product changed', { url, changes })
      // Send alert here
    }
  }
}

async function saveResults(results) {
  for (const result of results) {
    if (!result.error) {
      await saveProduct(result)
    } else {
      logger.error('Scrape failed', { url: result.url, error: result.error })
    }
  }
}

// Initialize
await harvester.init()

// Schedule scraping every 4 hours
cron.schedule('0 */4 * * *', async () => {
  logger.info('Starting scheduled scrape')

  try {
    const result = await db.query('SELECT url FROM product_urls WHERE active = true')
    const urls = result.rows.map(row => row.url)

    const results = await scrapeProducts(urls)
    await saveResults(results)

    logger.info('Scrape complete', {
      total: urls.length,
      successful: results.filter(r => !r.error).length,
      failed: results.filter(r => r.error).length
    })
  } catch (error) {
    logger.error('Scheduled scrape failed', { error: error.message })
  }
})

logger.info('Scraper running')

Conclusion

Building production web scrapers is significantly easier when you have the right tools. domharvest-playwright eliminated weeks of infrastructure development by providing:

Production features out of the box:

  1. Automatic retry with exponential backoff
  2. Built-in rate limiting (global and per-domain)
  3. Batch processing with concurrency control
  4. Enhanced error handling with custom error classes
  5. Structured logging with custom logger support
  6. Screenshot capture for debugging

Key takeaways:

  1. Don’t build what you can configure: Use built-in retry and rate limiting
  2. Batch processing is essential: Process multiple URLs efficiently
  3. Structured logging is critical: Integrate with existing logging infrastructure
  4. Error callbacks enable monitoring: Real-time alerts on failures
  5. Rate limiting prevents bans: Automatic throttling respects server capacity

domharvest-playwright handled the core scraping infrastructure, letting me focus on business logic—extracting data, detecting changes, and sending alerts. That’s exactly what a good library should do.


Building your own production scraper? Share your experiences on GitHub Discussions!