09. HTML and web content: extracting the article from the page

~11 min read. Most of "the web" is navigation chrome; the actual content is a small subset of each page.

[Stub — to be written]

Outline:

The boilerplate problem: nav, footer, ads, related-articles, cookie banners
readability-lxml, the classical solution (a Python port of Arc90's Readability, from which Mozilla's Readability.js also descends)
trafilatura — newer, language-aware, often better recall
jusText, dragnet — alternatives worth knowing
JavaScript-rendered pages — Playwright/Selenium when curl returns nothing useful
Sitemaps and robots.txt — the polite-crawler basics
HTML to markdown vs HTML to plain text
Preserving structure: headings, lists, code blocks, blockquotes
Tables and figures in web pages
Single-page apps and infinite scroll — when web ingestion is genuinely hard
Rate limits, anti-bot, and the ethics-and-legality boundary
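
To make the boilerplate problem from the outline concrete, here is a minimal sketch that uses only Python's standard library. It is not how readability-lxml or trafilatura work: those score blocks by text density and link density, while this sketch assumes a fixed list of "boilerplate" tags and simply drops their subtrees. The names `NaiveArticleExtractor`, `extract_blocks`, `SKIP_TAGS`, and the sample HTML are all hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser

# Assumption for this sketch: anything inside these tags is boilerplate.
# Real extractors score blocks instead of trusting a fixed tag list.
SKIP_TAGS = {"nav", "footer", "aside", "script", "style", "form"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = {"p", "li", "blockquote", "pre"} | set(HEADINGS)


class NaiveArticleExtractor(HTMLParser):
    """Collect text from block-level tags, skipping boilerplate subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.prefix = ""     # markdown-style marker for the current block
        self.buffer = []     # text fragments of the current block
        self.blocks = []     # finished blocks, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.flush()
            self.prefix = HEADINGS.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buffer.append(data)

    def flush(self):
        # Collapse runs of whitespace and store the block if it has text.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.buffer = []
        self.prefix = ""


def extract_blocks(html):
    """Return the page's text blocks with headings and list items marked."""
    parser = NaiveArticleExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


SAMPLE = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article>
<h1>Title</h1>
<p>First <b>paragraph</b>.</p>
<ul><li>One</li><li>Two</li></ul>
</article>
<footer>Copyright</footer>
</body></html>
"""

if __name__ == "__main__":
    print("\n\n".join(extract_blocks(SAMPLE)))
```

On the sample page this keeps the heading, the paragraph, and the two list items, and drops the nav links and the footer. It breaks as soon as a site puts its chrome in plain `<div>` elements, which is the gap the dedicated extraction libraries in the outline are meant to close.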