Thursday, September 4, 2025

Processing 2.7M Company Websites: BI Dataset Lessons


The Problem: AI Agents Need Better Business Context

We started this project because existing business databases felt incomplete for the AI-first world we're moving into. You'd get basic contact info, maybe some firmographic data, but nothing about what companies actually do or how they present themselves online.
As developers building B2B tools, we were constantly frustrated by the gap between "this company exists" and "this is what this company is actually about." But more importantly, as AI agents and LLMs become standard tools in almost every company, they need rich contextual data about businesses to be truly useful. An AI assistant trying to help with sales, marketing, or competitive research needs to understand not just that a company exists, but what they do, how they position themselves, what technology they use, and how they communicate.
So we decided to build something better. The goal was straightforward: create a dataset that captures not just company names and addresses, but the actual substance of what businesses do and how they communicate online – formatted in a way that AI systems can actually use.
We started with a list of company names and websites compiled from various public business directories and registries. The raw list had basic information – company name, website, industry classification – but that was about it. Our job was to transform this bare directory into something much richer by actually visiting each company's website and understanding what they do. After several months of development and processing, we ended up with 2.7 million company records carrying AI-extracted insights about business models, technology choices, and market positioning, ready for LLM consumption.

The technical challenges were real though. Crawling millions of websites without getting blocked, ensuring data quality across wildly different site types, extracting meaningful insights from messy HTML, and doing it all reliably without burning through our AWS budget. We learned a lot along the way – mostly about things that don't work.


How We Built It: A Three-Stage Pipeline

We settled on a three-stage approach after trying (and abandoning) several other architectures. The final design processes companies in batches, crawls their websites with managed browser instances, then extracts business insights using AI.
It's not groundbreaking – just a practical solution to some messy problems.

Stage 1: Industry Batching

Early on, we tried processing companies randomly and ran into problems. Fintech sites are built differently than restaurant websites, which are different from enterprise software companies. So we group companies by industry before processing them.

This helps with performance and lets us optimize for industry-specific patterns. SaaS companies tend to have similar page structures, manufacturing companies often use similar CMSs, etc.

industry.go
go
// Industry-based processing allows for specialized handling
func CreateIndustryFiles(dataFilePath string, outputDir string, minFoundedYear int) (*IndustryIndex, error) {
    // streamCompanies stands in for the actual loader (omitted here): it streams
    // records from dataFilePath, skipping companies founded before minFoundedYear.
    companies := streamCompanies(dataFilePath, minFoundedYear)

    industriesMap := make(map[string][]*Company)

    // Group companies by industry for optimized processing
    for company := range companies {
        industryName := helpers.SanitizeFolderName(company.Industry)
        industriesMap[industryName] = append(industriesMap[industryName], company)
    }

    // Process each industry batch independently
    return createIndustryBatches(industriesMap)
}

This approach allows us to:

  • Apply industry-specific validation rules
  • Optimize crawling patterns for similar business models
  • Enable parallel processing across different market sectors
  • Resume processing at the industry level if interruptions occur

Stage 2: Web Crawling (The Hard Part)

This is where things get messy. We need to visit millions of websites, extract content, take screenshots, and do it without getting blocked or crashing.

We use a pool of Chrome browser instances – typically a few hundred running at once. Each browser gets its own isolated environment to prevent one problematic site from affecting others. The browsers are managed by a custom pool system that handles crashes, memory leaks, and the general unreliability of running Chrome at scale.

Browser Pool Management

Chrome instances crash a lot when you're doing this at scale. They also leak memory and sometimes just hang indefinitely. Our pool system tries to deal with this by:

  • Giving each browser its own user data directory (isolation)
  • Health checking browsers before and after use
  • Force-killing browsers that have been checked out too long
  • Automatically replacing crashed browsers

It's not elegant, but it works.

browser.go
go
type Pool struct {
    browsers    chan *rod.Browser
    size        int
    checkedOut  map[*rod.Browser]*BrowserInfo
    userDataMap map[*rod.Browser]string
}

func (p *Pool) launchBrowser(browserID int) {
    // Each browser gets a unique, isolated environment
    userDataDir, _ := os.MkdirTemp("", fmt.Sprintf("chrome-browser-%d-", browserID))

    url := launcher.New().
        Headless(true).
        Leakless(true).
        NoSandbox(true).
        Set("disable-dev-shm-usage").
        UserDataDir(userDataDir).
        MustLaunch()

    // Connect to the launched instance and track its data directory for later cleanup
    browser := rod.New().ControlURL(url).MustConnect()
    p.userDataMap[browser] = userDataDir
    p.browsers <- browser
}

Adaptive Resource Management

The system automatically scales based on available CPU cores:

  • 1-4 CPUs: Conservative settings, perfect for development
  • 5-16 CPUs: Moderate performance for local processing
  • 17-48 CPUs: High performance with resource blocking
  • 49+ CPUs: Maximum throughput for enterprise deployment
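A rough sketch of how that tiering maps to a pool size, assuming a hypothetical pickConcurrency helper; the tier boundaries mirror the list above, but the concrete numbers are illustrative rather than our exact settings:

go
import "runtime"

// pickConcurrency maps available CPU cores to a browser-pool size.
func pickConcurrency() int {
    switch cpus := runtime.NumCPU(); {
    case cpus <= 4:
        return 4 // conservative: development machines
    case cpus <= 16:
        return 32 // moderate: local processing
    case cpus <= 48:
        return 128 // high performance, resource blocking enabled
    default:
        return 256 // maximum throughput for large deployments
    }
}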

Progressive Data Extraction

Instead of trying to extract everything at once (and risking timeouts), we use a three-phase extraction approach:

  1. Essential Phase (30s timeout): Title, meta data, core content
  2. Extended Phase (45s timeout): Links, media, forms, JavaScript analysis
  3. Comprehensive Phase (25s timeout): Screenshots, performance metrics, accessibility data

This ensures we capture critical information even if a site is slow or partially broken.
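In code, the phases are just successive calls under their own timeouts. A minimal sketch, assuming per-phase extractor methods on a page object (PageData, extractEssential, and friends are placeholders, not our actual names):

go
import (
    "context"
    "time"
)

// runPhase gives one extraction phase its own timeout so a slow or broken
// site can't stall the phases that follow it.
func runPhase(parent context.Context, timeout time.Duration, phase func(context.Context) error) error {
    ctx, cancel := context.WithTimeout(parent, timeout)
    defer cancel()
    return phase(ctx)
}

func extractPage(ctx context.Context, page *PageData) {
    // Essential data first: if this fails, there is nothing worth keeping.
    if err := runPhase(ctx, 30*time.Second, page.extractEssential); err != nil {
        return
    }
    // Extended and comprehensive phases are best-effort on top of that.
    _ = runPhase(ctx, 45*time.Second, page.extractExtended)
    _ = runPhase(ctx, 25*time.Second, page.extractComprehensive)
}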

Stage 3: AI Content Analysis

Raw HTML isn't very useful. We need to understand what companies actually do, who they serve, and how they position themselves. So we use AI to extract structured business information from the crawled content.

We built prompts that analyze websites and extract things like:

  • Company descriptions and value propositions
  • Target audiences and customer segments
  • Technology stack and platform availability
  • Design characteristics and communication tone
  • Contact information and support options

The AI does a pretty good job, but it's not perfect. We validate everything and assign confidence scores to help filter out low-quality extractions.

Multi-Model AI Pipeline

We developed a comprehensive prompt system that extracts structured business intelligence from crawled content:

ai_context.go
go
type AIContext struct {
    CompanyInfo    CompanyInfo    `json:"companyInfo"`
    LegalInfo      LegalInfo      `json:"legalInfo"`
    ContactInfo    ContactInfo    `json:"contactInfo"`
    MarketingInfo  MarketingInfo  `json:"marketingInfo"`
    TechInfo       TechInfo       `json:"techInfo"`
    DesignInfo     DesignInfo     `json:"designInfo"`
    ToneInfo       ToneInfo       `json:"toneInfo"`
    ContentQuality ContentQuality `json:"contentQuality"`
    QualityCheck   QualityCheck   `json:"qualityCheck"`
}

Quality-First Processing

Every piece of extracted data goes through multiple validation layers:

  • Content Validation: Filters out parked domains, error pages, and placeholder content
  • Domain Matching: Ensures extracted company information actually matches the domain
  • Confidence Scoring: AI assigns confidence scores (0-100) to each extraction
  • Quality Thresholds: Only records meeting minimum quality standards make it to the final dataset
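The final gate boils down to a handful of boolean checks plus the confidence threshold. A minimal sketch, with illustrative field names on the quality-check result (our actual schema differs in detail):

go
// meetsQualityBar decides whether an enriched record makes it into the dataset.
// Field names and the threshold passed in are illustrative.
func meetsQualityBar(q QualityCheck, minConfidence int) bool {
    if q.IsParkedDomain || q.IsErrorPage {
        return false // content validation failed
    }
    if !q.DomainMatchesCompany {
        return false // extracted company doesn't match the crawled domain
    }
    return q.Confidence >= minConfidence // e.g. 70 on the 0-100 scale
}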

What We Got: The Numbers

After several months of processing, here's what we ended up with:

  • 2,687,241 companies across 361 industries
  • 248 countries and territories represented
  • 95%+ confidence scores for the majority of records
  • Multiple output formats: JSON Lines, CSV, and Parquet
  • Comprehensive coverage: From 1-person startups to Fortune 500 enterprises

Industry Distribution Highlights

  • Computer Software: 148,648 companies
  • Construction: 156,607 companies
  • Marketing & Advertising: 208,906 companies
  • Information Technology: 163,499 companies
  • Management Consulting: 129,138 companies

Technical Innovations That Made It Possible

1. Fault-Tolerant Processing

Every component is designed to handle failures gracefully:

  • Browser crashes don't affect other crawling operations
  • Failed enrichments are automatically retried with exponential backoff (sketched below)
  • Processing can resume from any point without data loss
  • Comprehensive failure logging enables post-mortem analysis
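The backoff itself is nothing exotic; a minimal sketch, where enrich stands in for the actual enrichment call:

go
import (
    "math/rand"
    "time"
)

// retryWithBackoff retries a failing operation with exponential backoff plus jitter.
func retryWithBackoff(attempts int, base time.Duration, enrich func() error) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = enrich(); err == nil {
            return nil
        }
        // base, 2*base, 4*base, ... with jitter so retries don't synchronize.
        delay := (base << uint(i)) + time.Duration(rand.Int63n(int64(base)))
        time.Sleep(delay)
    }
    return err
}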

2. Memory-Efficient Streaming

Instead of loading everything into memory, we use streaming JSON processing:

  • Process companies one at a time to minimize memory footprint
  • Write results immediately to prevent data loss
  • Support for multiple output formats generated simultaneously
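Concretely, this is a json.Decoder/json.Encoder pair over JSON Lines, so only the record in flight is ever in memory. A simplified sketch (Company is the same struct used for batching; EnrichedRecord and the enrich callback are placeholders):

go
import (
    "encoding/json"
    "io"
)

// streamEnrich reads companies one at a time, enriches each, and writes the
// result immediately so a crash never loses more than the record in flight.
func streamEnrich(in io.Reader, out io.Writer, enrich func(*Company) (*EnrichedRecord, error)) error {
    dec := json.NewDecoder(in)
    enc := json.NewEncoder(out)
    for {
        var c Company
        if err := dec.Decode(&c); err == io.EOF {
            return nil
        } else if err != nil {
            return err
        }
        rec, err := enrich(&c)
        if err != nil {
            continue // failed enrichments go to the retry queue, not the output
        }
        if err := enc.Encode(rec); err != nil {
            return err
        }
    }
}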

3. Intelligent Content Filtering

Not all websites are worth enriching. Our multi-stage filtering removes:

  • Parked domains and placeholder sites
  • Social media profiles and personal blogs
  • Error pages and maintenance screens
  • Sites with insufficient business-relevant content
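The first pass is deliberately dumb and cheap, run before any AI call. A simplified example of the heuristics (the phrase list and word-count threshold are illustrative):

go
import "strings"

// looksEnrichable is a cheap pre-filter applied before any AI analysis.
func looksEnrichable(title, bodyText string) bool {
    lower := strings.ToLower(title + " " + bodyText)
    rejectPhrases := []string{
        "this domain is for sale",
        "website coming soon",
        "account suspended",
        "404 not found",
    }
    for _, p := range rejectPhrases {
        if strings.Contains(lower, p) {
            return false
        }
    }
    // Require a minimum amount of real content before spending tokens on it.
    return len(strings.Fields(bodyText)) >= 150
}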

4. Performance Optimization

  • Browser reuse: Browsers are returned to the pool rather than recreated (sketched after this list)
  • Resource blocking: Unnecessary assets (ads, trackers) are blocked during crawling
  • Concurrent processing: Multiple companies processed simultaneously across different industries
  • Adaptive timeouts: Timeout values adjust based on content complexity
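Browser reuse is just a checkout/return cycle against the pool from Stage 2. Roughly as follows, where isHealthy and cleanup stand in for the real liveness probe and teardown, and the checkout bookkeeping is omitted:

go
// Get checks a browser out of the pool, blocking until one is free.
func (p *Pool) Get() *rod.Browser {
    return <-p.browsers
}

// Return hands the browser back for reuse so the next company skips a cold
// Chrome start; a browser that looks unhealthy is replaced instead.
func (p *Pool) Return(browser *rod.Browser, browserID int) {
    if isHealthy(browser) { // stand-in for the real liveness probe
        p.browsers <- browser
        return
    }
    p.cleanup(browser) // stand-in for kill + temp-dir removal
    p.launchBrowser(browserID)
}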

What We Learned (The Hard Way)

Building this taught us a lot about the reality of web crawling at scale. Here are the main lessons:

1. Most Business Websites Are Pretty Bad

The nice, fast websites you normally visit aren't representative. Most business websites are slow, poorly built, or just plain broken. We had to handle everything from modern React apps to sites that literally still required Flash Player.

Some companies' entire web presence is a single HTML file that looks like it was uploaded in 2003. Load times of 30+ seconds aren't uncommon. If you're building web crawling tools, plan for the worst-case scenario.

2. Chrome Is a Pain to Manage at Scale

Running hundreds of Chrome instances simultaneously is harder than it sounds. Browsers crash randomly, leak memory, and sometimes just stop responding. We spent way too much time debugging browser-related issues. What eventually worked:

  • Monitor browser health continuously
  • Force cleanup of stuck browsers (see the sketch below)
  • Maintain isolated user data directories
  • Implement aggressive garbage collection
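The "stuck browser" case is handled by a periodic reaper along these lines (BrowserInfo's CheckedOutAt field and the intervals are assumptions about the bookkeeping, and locking around the pool's maps is omitted):

go
import "time"

// reapStuckBrowsers force-kills browsers checked out far longer than any sane
// page load and launches replacements.
func (p *Pool) reapStuckBrowsers(maxCheckout time.Duration) {
    for range time.Tick(30 * time.Second) {
        for browser, info := range p.checkedOut {
            if time.Since(info.CheckedOutAt) > maxCheckout {
                p.cleanup(browser) // stand-in for kill + temp-dir removal
                delete(p.checkedOut, browser)
                go p.launchBrowser(len(p.userDataMap))
            }
        }
    }
}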

3. AI Makes Mistakes (Confidently)

LLMs will hallucinate data and present it with high confidence. We had to build extensive validation to catch obviously wrong extractions. Even with validation, some errors probably slip through.

4. Everything Will Break Eventually

At this scale, failures are guaranteed. Servers crash, networks hiccup, APIs go down. We learned to build resume capability into everything from day one, which saved us from having to restart processing multiple times.


What We Ended Up With

We built a dataset that goes deeper than traditional business directories. Instead of just contact info, each record includes AI-extracted insights about what companies actually do and how they present themselves online.

Each record includes:

  • Business Intelligence: Company descriptions, target audiences, value propositions
  • Technical Insights: Technology stacks, platform availability, mobile optimization
  • Design Analysis: Color schemes, typography, design styles
  • Communication Patterns: Tone analysis, formality levels, content quality metrics
  • Market Context: Industry classification, company size, founding year, geographic location

What's Next

This was a useful learning exercise and we're happy with the results. We're working on a few improvements:

  • Real-time updates: Continuous monitoring of company changes
  • Deeper AI analysis: More sophisticated business model classification
  • API access: Making this intelligence available via programmatic interfaces
  • Vertical-specific datasets: Industry-focused deep dives with specialized extraction

The infrastructure scales reasonably well and produces useful results. Not perfect, but better than what we had before.


The Dataset

The 2025Q3 dataset is available in multiple formats. It's useful if you need business data that goes beyond basic contact information.

What's included:

  • 2.7M+ companies across 361 industries and 248 countries
  • AI-extracted business insights with confidence scores
  • Multiple formats: JSON Lines, CSV, Parquet
  • Global industry and geographic coverage

Good for:

  • Powering AI agents that need business context
  • Sales and marketing prospect research
  • Market analysis and competitive intelligence
  • Training data for ML models
  • Academic research on business patterns
  • Building B2B tools and applications

It's not perfect – there are definitely errors and gaps – but it's a decent starting point for projects that need this kind of data. Especially useful if you're building AI systems that need to understand what companies actually do beyond just their name and industry.
