09. HTML and web content — extracting the article from the page
~11 min read. Most of "the web" is navigation chrome. The content is a small subset.
[Stub — to be written]
Outline:
- The boilerplate problem: nav, footer, ads, related-articles, cookie banners
- readability-lxml (a Python port of Arc90's Readability, the same lineage as Mozilla's Readability.js) — the classical solution
- trafilatura — newer, language-aware, often better recall
- jusText, dragnet — alternatives worth knowing
- JavaScript-rendered pages — Playwright/Selenium when curl returns nothing useful
- Sitemaps and robots.txt — the polite-crawler basics
- HTML to markdown vs HTML to plain text
- Preserving structure: headings, lists, code blocks, blockquotes
- Tables and figures in web pages
- Single-page apps and infinite scroll — when web ingestion is genuinely hard
- Rate limits, anti-bot, and the ethics-and-legality boundary
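
To make the boilerplate problem from the first outline item concrete, here is a minimal sketch using only Python's standard-library `html.parser`. It is not how readability-lxml or trafilatura work (those score blocks by text density, link density, and similar signals). It simply assumes that anything inside `nav`, `footer`, `aside`, `form`, `script`, `style`, or `noscript` is chrome, and that article text lives in headings, paragraphs, and list items. The names `ArticleText`, `extract_article`, `SKIP_TAGS`, `BLOCK_PREFIX`, and `SAMPLE` are hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser

# Assumption of this sketch: these tags wrap boilerplate, so their whole
# subtree is dropped. Real extractors score blocks instead of trusting tags.
SKIP_TAGS = {"nav", "footer", "aside", "form", "script", "style", "noscript"}

# Block-level tags we keep, mapped to a markdown-style line prefix.
BLOCK_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


class ArticleText(HTMLParser):
    """Collects text from kept blocks and ignores skipped subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped subtree
        self.current = None   # prefix of the block being collected, or None
        self.parts = []       # text fragments of the current block
        self.lines = []       # finished output lines

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self._flush()
            self.current = BLOCK_PREFIX[tag]

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self._flush()

    def handle_data(self, data):
        # Text outside a kept block (for example, directly in a <div>) is dropped.
        if self.skip_depth == 0 and self.current is not None:
            self.parts.append(data)

    def _flush(self):
        # Collapse runs of whitespace, then emit the block if it has any text.
        text = " ".join("".join(self.parts).split())
        if self.current is not None and text:
            self.lines.append(self.current + text)
        self.current, self.parts = None, []


def extract_article(html: str) -> str:
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.lines)


SAMPLE = """
<html><body>
<nav><ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul></nav>
<article>
<h1>Extracting articles</h1>
<p>Most of the page is <em>chrome</em>.</p>
<ul><li>nav bars</li><li>cookie banners</li></ul>
</article>
<aside><p>Related articles</p></aside>
<footer><p>Copyright notice</p></footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_article(SAMPLE))
```

On the sample page, the navigation links, the related-articles aside, and the footer are dropped, while the heading, paragraph, and list items survive with markdown-style prefixes. The tag-name heuristic fails on pages that build everything from `div` elements or leave a skipped tag unclosed, which is the gap the dedicated libraries in the outline aim to fill.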