No API? No Problem

Context

As AI automations have multiplied, so has the need to scrape websites to complete the picture. The LLM can reason, the API can act, but someone still needs to fetch the data from that website with no API, no export button, and a login form behind a CAPTCHA.

I am not talking about the easy kind of scraping. Extracting product prices from a popular e-commerce site? That's a requests.get() to their public API endpoint. You don't need Playwright for that. You probably don't even need a library.

I am talking about the hard kind. The kind where you need to:

  • Log into a private account through a portal with bot detection
  • Navigate through multi-step flows with dropdowns, filters, and entity selectors
  • Open modals, close popups, switch between tabs
  • Extract structured data from dynamic SPAs that render lazily
  • Download files that only appear after three clicks and a page reload
  • Do all of this without breaking any legal rules or overloading a server

This is the kind of task that makes developers groan. The kind you estimate at "two weeks" and ship in two months. The kind where the site changes a CSS class on a Tuesday and your 4 AM cron job fails silently.

I've built several of these. They run in production. They scrape multiple complex sites daily, on schedule, from the cloud. And after enough suffering, I distilled the patterns into a framework that makes building new scrapers faster.

The Problem

Why is this hard?

Bot detection is sophisticated. Sites use services like Akamai, Shape, ThreatMetrix, and PerimeterX. These aren't just checking your User-Agent string. They're fingerprinting your WebGL renderer, your canvas hash, your navigator properties, your TLS handshake, and your WebRTC configuration. They run JavaScript that phones home with a full browser fingerprint, and if the fingerprint looks off, your session is dead.

Dynamic content is unpredictable. SPAs render content lazily. Elements appear after async calls. Modals stack on top of each other. A company selector triggers a full page re-render. networkidle is never truly idle.

Selectors are fragile. The site's login form uses #userInput today. Next deploy it's #loginField. The month after that it's input[name='user']. Your scraper breaks every time.

Cross-origin iframes are a nightmare. Some sites load their login form inside a cross-origin iframe with CSP headers that block your automation scripts. Standard Playwright approaches just don't work.

Environment differences kill you. Your scraper works perfectly on your Mac with a visible browser window. You deploy it to a Linux container and it fails instantly because Chrome's TLS fingerprint is different, the User-Agent doesn't match the OS, and the headless mode is detectable.

Maintenance is the real cost. Building the scraper is 20% of the work. Keeping it running is the other 80%.

This Blog

This post distills the key insights from building several production-level web scraping applications into a practical guide. What you'll get:

  • A Playwright-based scraper framework with battle-tested anti-detection
  • Cloud deployment with scheduled execution and pay-per-run pricing
  • Dynamic IP protection via residential proxies with smart routing
  • Easy-to-maintain code through abstract base classes and a registry pattern
  • A CLAUDE.md template you can hand to an AI editor to scaffold the entire project

The code itself is straightforward — any decent AI editor can generate Playwright automation code. What it can't generate is the non-obvious knowledge: which Chrome flags get you detected, why blocking security scripts backfires, how to handle cross-origin iframe logins, and why your User-Agent needs to match your container's OS.

That's what this blog focuses on.

If you just want a scaffold to start coding right away, jump to the CLAUDE.md template at the end.


Architecture Overview

The system is split into three repos, each with a single responsibility:

                    ┌──────────────────────┐
                    │     Orchestrator     │
                    │   (Serverless Fns)   │
                    │                      │
                    │  Timer: Site A 4:00  │
                    │  Timer: Site B 4:20  │
                    │  Timer: Site C 4:40  │
                    │  HTTP: manual runs   │
                    └──────────┬───────────┘
                               │ starts container with
                               │ env vars (SITE_TYPE, PROXY_IP...)
                               ▼
                    ┌──────────────────────┐
                    │    Container Job     │
                    │ (Python+Playwright)  │
                    │                      │
                    │  1. Get credentials  │
                    │     from Key Vault   │
                    │  2. Run scraper      │
                    │  3. Persist data     │
                    │  4. Notify on fail   │
                    └──────────┬───────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
 ┌────────────┐        ┌─────────────┐       ┌───────────────┐
 │  Key Vault │        │  Database   │       │   Telegram    │
 │  (secrets) │        │(persistence)│       │(failure alert)│
 └────────────┘        └─────────────┘       └───────────────┘

  1. Scraper (Python 3.12 + Playwright async): Abstract base class with all anti-detection logic. Site-specific scrapers inherit from it and implement login() and scrape_data(). Factory pattern dispatches by site type.
  2. Infrastructure (Terraform): Container App Job (on-demand compute, not always-on), serverless functions, container registry, key vault, log analytics. Everything uses managed identities — zero passwords in config.
  3. Orchestrator (Node.js serverless functions): Timer triggers for each site (staggered 20 min apart to avoid parallel runs). One HTTP trigger for ad-hoc manual runs. Passes the site type as an environment variable when starting the container job.
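A minimal sketch of the base-class-plus-registry shape (class names and the decorator helper are illustrative, not the project's exact code):

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Shared browser setup and anti-detection live here (omitted in this sketch)."""

    @abstractmethod
    async def login(self) -> None: ...

    @abstractmethod
    async def scrape_data(self) -> dict: ...

# Registry maps a site type string (from the SITE_TYPE env var) to a scraper class
SCRAPER_REGISTRY: dict[str, type[BaseScraper]] = {}

def register(site_type: str):
    def wrap(cls: type[BaseScraper]) -> type[BaseScraper]:
        SCRAPER_REGISTRY[site_type] = cls
        return cls
    return wrap

@register("site_a")
class SiteAScraper(BaseScraper):
    async def login(self) -> None: ...
    async def scrape_data(self) -> dict:
        return {"rows": []}

def make_scraper(site_type: str) -> BaseScraper:
    # Factory dispatch: one container image, SITE_TYPE decides the scraper
    return SCRAPER_REGISTRY[site_type]()
```

Adding a new site is then one file plus one decorator line — nothing in the orchestrator or base class changes.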

Each repo has its own CI/CD pipeline deploying on push to main via GitHub OIDC — no stored secrets anywhere.


The Non-Obvious Stuff

This is the section that will save you weeks. These are the things I learned the hard way, that no tutorial covers, and that an AI editor won't suggest on its own.

Use System Chrome, Not Bundled Chromium

When you pip install playwright && playwright install, you get Playwright's bundled Chromium. It works great for testing. It will get you detected instantly on any site with serious bot protection.

Why? TLS fingerprinting. Every browser has a unique TLS Client Hello signature (called a JA3 fingerprint). WAFs like Akamai and Shape maintain databases of known browser fingerprints. Bundled Chromium has a distinct JA3 that screams "automation tool." System Google Chrome has the real fingerprint that WAFs expect from normal users.

In your Dockerfile, install system Chrome:

# As root — installs to /opt/google/chrome/
RUN playwright install chrome

# As non-root — bundled Chromium as fallback
USER scraper
RUN playwright install chromium

In your scraper, use channel="chrome" to launch system Chrome instead of bundled Chromium:

browser = await playwright.chromium.launch(channel="chrome", ...)

Use --headless=new, Not Legacy Headless

Chrome has two headless modes. The legacy --headless mode (what Playwright uses by default when you pass headless=True) runs a different rendering engine than headed Chrome. Bot detection scripts can tell the difference.

Chrome's "new headless" mode (--headless=new) runs the full browser engine without a visible window. Same rendering pipeline, same JS APIs, same fingerprint. Virtually undetectable.

The trick: pass headless=False to Playwright (so it doesn't inject its own legacy --headless flag), then add --headless=new to your launch args manually:

launch_args = ['--headless=new', ...]  # Only in deployed environments
browser = await playwright.chromium.launch(
    headless=False,  # Prevent Playwright from injecting legacy --headless
    args=launch_args,
    channel="chrome",
)

Locally, skip --headless=new entirely and run headed for debugging.

Do NOT Block Security Scripts

This one is counter-intuitive. Your first instinct when you see ThreatMetrix, Akamai sensor, or PerimeterX scripts loading is to block them via page.route(). Don't.

These services run on the site for a reason. The site's backend expects telemetry to arrive from these scripts. If it doesn't arrive, the server flags your session as suspicious and invalidates it — often silently. You'll see a successful login followed by an immediate redirect to the login page, and you'll spend hours debugging.

Instead, let them run. Harden your browser fingerprint (WebGL, plugins, navigator properties, canvas) so that when these scripts phone home, they report a "normal" browser. They see a real Chrome with a real GPU, real plugins, and real screen dimensions. Session stays valid.

# This is intentionally a no-op
async def _block_fingerprint_domains(self):
    """Security scripts must be allowed to run.
    Blocking them causes server-side session invalidation."""
    pass

Do NOT Block Images

Same logic. Blocking images via the --disable-images flag or via route interception seems like an easy performance win. It's actually a detection signal.

  • The --disable-images Chrome flag is directly detectable
  • Route interception via page.route() alters CDP traces that advanced WAFs can detect
  • Sites can verify via performance.getEntries() whether images loaded
  • A session with zero image loads is a strong bot signal

Let images load normally. The bandwidth cost is negligible compared to a failed scrape.

Match Your User-Agent to Your OS

WAFs don't just check your User-Agent string — they cross-reference it against your TCP/IP OS fingerprint and navigator.platform. A Windows User-Agent coming from a Linux container with a Linux TCP stack is an instant bot flag.

Detect your platform at runtime and pick matching User-Agents:

import platform

os_name = platform.system()
if os_name == "Linux":
    USER_AGENTS = [
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ...",
        # ... more Linux Chrome UAs
    ]
elif os_name == "Darwin":
    USER_AGENTS = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...",
        # ... more macOS Chrome UAs
    ]

Your container runs Linux. Your local machine runs macOS. The User-Agents must match.

Strip --enable-automation

Playwright adds --enable-automation to Chrome's launch args by default. This flag exposes Chrome DevTools Protocol (CDP) artifacts that Akamai's bot sensor detects immediately. The telltale sign: the _abck cookie value contains ~-1~ (failed validation) instead of ~0~ (valid).

browser = await playwright.chromium.launch(
    ignore_default_args=['--enable-automation'],
    ...
)

One line. Massive difference.

Cross-Origin Iframe Logins

Some sites load their login form inside a cross-origin iframe. This is the boss fight of web scraping. Here's how to survive it:

1. Strip CSP and X-Frame-Options headers. The iframe's server sends headers that block it from working inside your automation context. Intercept responses and strip them:

async def handle_route(route):
    response = await route.fetch()
    headers = {k: v for k, v in response.headers.items()
               if k.lower() not in (
                   'content-security-policy',
                   'content-security-policy-report-only',
                   'x-frame-options'
               )}
    await route.fulfill(response=response, headers=headers)

await page.route("**/login-domain.com/**", handle_route)

Without this, the iframe never renders. You'll see ERR_BLOCKED_BY_RESPONSE in your logs and a blank frame.

2. Use Frame, not FrameLocator. For cross-origin iframes, resolve the actual Frame object from page.frames and use it directly. FrameLocator can be unreliable across origins:

def find_login_frame(page):
    for frame in page.frames:
        url = frame.url or ""
        if "login-domain.com" in url and "/ping" not in url:
            return frame
    return None

Watch out for telemetry iframes (like /ping endpoints) — the page might have multiple iframes and you need the right one.

3. Poll until the frame has content. The iframe loads asynchronously. Don't just wait for it to appear — wait until it actually has input elements:

import asyncio
from time import monotonic

deadline = monotonic() + 30  # 30-second timeout
while monotonic() < deadline:
    frame = find_login_frame(page)
    if frame:
        input_count = await frame.locator("input").count()
        if input_count > 0:
            return frame  # Form is ready
    await asyncio.sleep(2)

Modal Handling Is the Hardest Part

After login, many sites throw modals at you — welcome messages, security alerts, cookie consents, survey popups. If you don't close them, they steal your clicks. After the third time a modal backdrop ate my click event at 4 AM, I wrote a three-layer fallback:

  1. Escape key — the simplest approach. Focus the page, press Escape, check if the modal count decreased.
  2. jQuery — if the site uses Bootstrap, jQuery(modal).modal('hide') is reliable.
  3. DOM manipulation — nuclear option. Remove in/show classes, set display: none, clean up the backdrop, remove modal-open from body.

The DOM-side layers, combined into one pass:

async def close_all_modals(self):
    await self.page.evaluate("""() => {
        const modals = document.querySelectorAll('.modal');
        modals.forEach(modal => {
            // Try close button
            const btn = modal.querySelector('[data-dismiss="modal"], .close');
            if (btn) btn.click();
            // Try jQuery
            if (window.jQuery) try { jQuery(modal).modal('hide'); } catch(e) {}
            // Force close
            modal.classList.remove('in', 'show');
            modal.style.display = 'none';
        });
        // Clean up
        document.querySelectorAll('.modal-backdrop').forEach(b => b.remove());
        document.body.classList.remove('modal-open');
    }""")

Always try Escape first. Only escalate when it fails.
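The escalation itself can be sketched as one helper (selector names assume Bootstrap-style modals; the sleep durations are arbitrary settle times):

```python
import asyncio

async def _open_modal_count(page) -> int:
    # Count visible modals only — hidden ones no longer steal clicks
    return await page.locator(".modal:visible").count()

async def close_modals_with_fallback(page) -> bool:
    """Escalating modal close: Escape key, then jQuery, then raw DOM surgery."""
    before = await _open_modal_count(page)
    if before == 0:
        return True

    # Layer 1: Escape — least invasive, works for well-behaved modals
    await page.keyboard.press("Escape")
    await asyncio.sleep(0.5)
    if await _open_modal_count(page) < before:
        return True

    # Layer 2: jQuery — reliable on Bootstrap sites, a no-op elsewhere
    await page.evaluate("() => { if (window.jQuery) jQuery('.modal').modal('hide'); }")
    await asyncio.sleep(0.5)
    if await _open_modal_count(page) < before:
        return True

    # Layer 3: nuclear — strip classes, hide, remove backdrops
    await page.evaluate("""() => {
        document.querySelectorAll('.modal').forEach(m => {
            m.classList.remove('in', 'show');
            m.style.display = 'none';
        });
        document.querySelectorAll('.modal-backdrop').forEach(b => b.remove());
        document.body.classList.remove('modal-open');
    }""")
    return await _open_modal_count(page) == 0
```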

Use Flexible Selector Fallback Lists

Sites change their element IDs between deploys. Instead of betting on one selector, try a prioritized list:

username_selectors = [
    "#userInput",
    "#username",
    "input[name='username']",
    "input[name*='user' i]",
    "input[placeholder*='user' i]",
    "input[type='text']",  # last resort
]

for selector in username_selectors:
    element = frame.locator(selector)
    if await element.count() > 0 and await element.first.is_visible():
        logger.info(f"Found username input with selector: {selector}")
        return element.first

raise Exception("Could not find username input with any known selector")

Log which selector matched. When the site changes and your scraper breaks, the log tells you exactly what changed.

Navigate Like a Human

Don't page.goto("https://portal.targetsite.com/login") directly. That creates a suspicious referrer chain — or no referrer at all.

Instead, replicate the organic user flow:

  1. Go to the site's public homepage
  2. Find and click the "Login" or "Portal" link in the navigation
  3. Wait for the navigation to the login page
  4. Fill in credentials

The referrer chain matters. Bot detection scripts check document.referrer. A login page loaded with no referrer or a direct URL is suspicious.
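A sketch of that flow — the home URL and the link selectors are illustrative, not from any particular site:

```python
async def login_via_homepage(page, home_url: str) -> None:
    """Reach the login page the way a human would, keeping a natural referrer chain."""
    await page.goto(home_url)
    await page.wait_for_load_state("domcontentloaded")

    # Find the login link the way a user would, with selector fallbacks
    for selector in ("a[href*='login' i]", "text=Login", "text=Sign in"):
        link = page.locator(selector)
        if await link.count() > 0:
            await link.first.click()
            await page.wait_for_load_state("domcontentloaded")
            return
    raise RuntimeError("No login link found on the homepage")
```

When the click fires, the browser sets document.referrer to the homepage — exactly what a real session looks like.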


Proxy Strategy

You need proxies for two reasons: geo-location and IP reputation.

Many sites restrict access by country. If your cloud container runs in a different region than where the site expects its users, you need a residential proxy in the target country.

But here's the non-obvious part: not all requests should go through the proxy.

Some site CDNs and WAFs actively block known proxy IP ranges. If you route everything through your residential proxy, API calls to the target site's backend might fail with ERR_TUNNEL_CONNECTION_FAILED. The solution is the bypass field in Playwright's proxy config:

proxy_config = {
    "server": f"http://{proxy_host}:{proxy_port}",
    "username": proxy_user,
    "password": proxy_pass,
    "bypass": "*.targetsite.com, *.cdn-targetsite.com",
}

This routes the initial page load through the proxy (passing the geo-check) but lets subsequent API calls go direct from your container's Azure/AWS IP (which the CDN accepts). Works reliably in production.

Use residential proxies, not datacenter. Datacenter IPs are in blocklists. Residential IPs from proxy providers rotate through real ISPs.


Cloud Deployment

Container App Job (Not Always-On)

Don't run a VM or a Container App that's always on. Use a Container App Job with manual trigger. It spins up only when the orchestrator starts it, runs the scraper, and shuts down. You pay for execution time only.

Specs that work well: 2 CPU / 4GB RAM (Chrome is memory-hungry), 30-minute timeout (complex scraping flows take time).

Serverless Orchestrator

Each target site gets its own timer trigger, staggered 20 minutes apart:

Site A: 4:00 AM    →  start container job (SITE_TYPE=site_a)
Site B: 4:20 AM    →  start container job (SITE_TYPE=site_b)
Site C: 4:40 AM    →  start container job (SITE_TYPE=site_c)

The orchestrator passes the site type and proxy config as environment variables via template override when starting the container job. One container image, multiple configurations.

Add an HTTP trigger for ad-hoc runs — invaluable for debugging.

Zero Stored Secrets

  • Runtime: Managed identities for everything. The container's identity pulls images from the registry and reads secrets from Key Vault. No passwords anywhere.
  • CI/CD: GitHub OIDC. Your GitHub Actions workflows authenticate to Azure via federated token exchange — no PATs, no service account keys stored in GitHub Secrets.
  • Credentials: Stored in Key Vault with convention-based naming ({site_type}-username, {site_type}-password). The scraper fetches them at runtime.
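The convention in code — a sketch using the real azure-identity and azure-keyvault-secrets packages, imported lazily so the naming helper stands on its own:

```python
def secret_name(site_type: str, field: str) -> str:
    # Convention: "<site_type>-username", "<site_type>-password"
    return f"{site_type}-{field}"

def fetch_site_credentials(site_type: str, vault_url: str) -> tuple[str, str]:
    """Fetch a site's credentials from Key Vault at runtime.
    Requires azure-identity and azure-keyvault-secrets in the container image."""
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    # DefaultAzureCredential picks up the container's managed identity — no passwords
    client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
    return (
        client.get_secret(secret_name(site_type, "username")).value,
        client.get_secret(secret_name(site_type, "password")).value,
    )
```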

Diagnostics & Failure Handling

Your scraper will fail. The question is how fast you can diagnose and fix it.

Screenshot + HTML on Failure

When anything goes wrong (login failure, data extraction error, timeout), capture a screenshot and the full page HTML, then send both to a Telegram chat:

async def notify_failure(scraper, site, stage, error=None):
    screenshot = await scraper.page.screenshot(full_page=True)
    html = await scraper.page.content()

    await telegram.send_photo(screenshot, caption=f"{site} failed at {stage}")
    await telegram.send_document(html.encode(), "error_page.html")

The screenshot tells you what the user saw. The HTML tells you what the DOM actually contains. Together they're usually enough to diagnose the issue without reproducing it locally.

Event Listener Capture

For the really tricky issues, attach listeners to capture everything:

  • page.on("requestfailed") — catch ERR_TUNNEL_CONNECTION_FAILED (proxy issues) and ERR_BLOCKED_BY_RESPONSE (CSP/X-Frame-Options)
  • page.on("response") — log 3xx redirects and 4xx/5xx errors with their headers
  • page.on("console") — JS errors from the site itself
  • Cookie inspection — check the _abck cookie value to know if Akamai flagged you (~-1~ = detected, ~0~ = passed)

Dump all of this when a scrape fails. It's the difference between "it broke" and "Akamai detected us because the proxy tunnel failed on the CDN domain."
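Wiring these up can look like the sketch below — the log-line formats are arbitrary, and the cookie check follows the _abck semantics described above:

```python
def attach_diagnostics(page, log: list) -> None:
    """Collect failure evidence into `log`; dump it when a scrape fails."""
    page.on(
        "requestfailed",
        lambda req: log.append(f"REQUEST FAILED {req.url}: {req.failure}"),
    )
    page.on(
        "response",
        lambda resp: log.append(f"HTTP {resp.status} {resp.url}")
        if resp.status >= 300
        else None,
    )
    page.on(
        "console",
        lambda msg: log.append(f"CONSOLE [{msg.type}] {msg.text}")
        if msg.type == "error"
        else None,
    )

async def akamai_verdict(context) -> str:
    """Read the _abck cookie: ~0~ means we passed, ~-1~ means detected."""
    for cookie in await context.cookies():
        if cookie["name"] == "_abck":
            return "detected" if "~-1~" in cookie["value"] else "passed"
    return "no _abck cookie (Akamai likely absent)"
```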


Lessons Learned

Twelve tips from production. Each one cost me at least a full day of debugging.

  1. Do NOT block security scripts. Let ThreatMetrix, Akamai sensor, Shape, and PerimeterX run. Blocking them flags your session faster than letting them see your hardened fingerprint.
  2. System Chrome, not Chromium. WAFs check TLS/JA3 fingerprints. Bundled Chromium has a distinct one that automation tools use. System Chrome has the real fingerprint.
  3. Match your User-Agent to your OS. WAFs cross-check UA against TCP/IP OS fingerprint and navigator.platform. A Windows UA from a Linux container = instant detection.
  4. Do NOT block images. Route interception is detectable. Sites verify performance.getEntries() for image loads. Zero images = bot.
  5. CSP headers block iframe automation. If a login form is in an iframe and it never renders, strip Content-Security-Policy and X-Frame-Options via route interception.
  6. Use Frame, not FrameLocator. For cross-origin iframes, resolve the actual Frame via page.frames, poll until it has content (input count > 0), then use it directly.
  7. Selectors break. Use fallback lists. Instead of #username, try ["#username", "#userInput", "input[name='user']", "input[type='text']"] in order. Log which matched.
  8. Modal closing needs three layers. Escape key first. jQuery .modal('hide') second. DOM manipulation (remove classes, set display: none, clean backdrop) as nuclear option.
  9. Navigate like a human. Visit the public site first, click the login link naturally. Direct URL navigation creates suspicious or missing referrer chains.
  10. Wait for SPA re-renders. After clicking a selector, switching entities, or navigating, wait 3-8 seconds. waitForLoadState('networkidle') is not enough for SPAs.
  11. Proxy bypass for target domains. Residential proxy for the initial geo-check, but bypass the proxy for the target site's own CDN domains — they often block proxy IP ranges.
  12. Strip --enable-automation. Playwright adds this by default. It exposes CDP artifacts that Akamai's sensor detects instantly. One line to remove it, massive difference.

Get Started: The CLAUDE.md Template

If you want to skip all the reading and have an AI editor build this for you, here's a CLAUDE.md file you can drop into an empty project directory. It describes the full architecture, patterns, and anti-detection rules. Point your AI editor at it and start building.

Copy everything inside the code block below into a file called CLAUDE.md in your project root:

# Project: Production Web Scraper Framework

## What this project does
A production web scraper that uses Playwright (async Python) to log into private web
portals, navigate complex UIs, and extract structured data. Deployed as a container job
in the cloud, orchestrated by serverless functions on a schedule.

## Tech Stack
- **Scraper**: Python 3.12, Playwright (async), Pydantic, playwright-stealth
- **Infrastructure**: Terraform, Azure (Container App Job, Functions, Key Vault, ACR, Log Analytics)
- **Orchestrator**: Node.js, Azure Functions SDK
- **CI/CD**: GitHub Actions with OIDC (zero stored secrets)
- **Services**: Azure Key Vault (secrets), Google Sheets (data persistence), Telegram (failure alerts)

## Architecture (3 repos)

### 1. scraper/ — Python + Playwright
src/
  main.py                    # Entry point: load config, fetch creds, run workflow
  config.py                  # ENV, OS-aware User-Agent rotation, service config
  models.py                  # Pydantic: SiteType enum, LoginInfo, ScrapedData, TaskConfig
  scraper/
    base_scraper.py          # Abstract base with browser setup + anti-detection + helpers
    proxy_manager.py         # Multi-provider proxy config with bypass support
    registry.py              # Factory dict + run_scraper_workflow()
    sites/
      site_a_scraper.py      # Inherits BaseScraper, implements login() + scrape_data()
      site_b_scraper.py
  services/
    key_vault.py             # Singleton, DefaultAzureCredential, convention-based secrets
    data_client.py           # Google Sheets persistence, lazy connection, auto-create sheet
    notifications.py         # Telegram: send_photo, send_document, send_message
  utils/
    logging_config.py
Dockerfile
requirements.txt

[... full CLAUDE.md content as shown in the blog post above ...]

Wrapping Up

This framework runs in production daily. Adding a new target site takes hours, not weeks — the base class and infrastructure are the investment; each new site is just a login flow and a data extraction flow.

The non-obvious stuff — anti-detection layering, iframe handling, modal nightmares, proxy routing — is where the real time goes. Hopefully this post saves you some of that time.

If you found this useful, share it with someone who's about to build a scraper.