09. HTML and web content — extracting the article from the page

~11 min read. Most of "the web" is navigation chrome. The article content is a small subset of each page.


[Stub — to be written]

Outline:

  • The boilerplate problem: nav, footer, ads, related-articles, cookie banners
  • readability-lxml (a Python port of Arc90's Readability, the algorithm behind Mozilla's Readability.js) — the classical solution
  • trafilatura — newer, language-aware, often better recall
  • jusText, dragnet — alternatives worth knowing
  • JavaScript-rendered pages — Playwright/Selenium when curl returns nothing useful
  • Sitemaps and robots.txt — the polite-crawler basics
  • HTML to markdown vs HTML to plain text
  • Preserving structure: headings, lists, code blocks, blockquotes
  • Tables and figures in web pages
  • Single-page apps and infinite scroll — when web ingestion is genuinely hard
  • Rate limits, anti-bot, and the ethics-and-legality boundary
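
To make the boilerplate problem and the structure-preservation items in the outline concrete, here is a minimal sketch that uses only the Python standard library. It is not how readability-lxml or trafilatura work (those tools score candidate blocks with heuristics such as text and link density). It is a deliberately naive tag-based filter, and the names ArticleText, extract_article, SKIP_TAGS and SAMPLE are hypothetical, chosen for this illustration. The assumptions are that boilerplate lives inside nav, header, footer, aside, form, script or style elements, and that the content worth keeping sits in headings, paragraphs and list items.

```python
from html.parser import HTMLParser

# Assumption for this sketch: everything inside these tags is boilerplate.
SKIP_TAGS = {"nav", "header", "footer", "aside", "form", "script", "style"}
# Markdown-style prefixes so heading levels survive the conversion to text.
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = {"p", "li", *HEADING_PREFIX}


class ArticleText(HTMLParser):
    """Collect heading, paragraph and list-item text outside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a SKIP_TAGS subtree
        self.blocks = []     # finished output lines
        self.prefix = None   # prefix of the block being read, None if no block is open
        self.buffer = []     # text fragments of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self._flush()  # close any block left open by sloppy markup
            self.prefix = "- " if tag == "li" else HEADING_PREFIX.get(tag, "")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Keep text only when inside a content block and outside boilerplate.
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)

    def _flush(self):
        # Join inline fragments, then collapse runs of whitespace.
        text = " ".join("".join(self.buffer).split())
        if self.prefix is not None and text:
            self.blocks.append(self.prefix + text)
        self.prefix, self.buffer = None, []


def extract_article(html: str) -> str:
    """Return the kept blocks as newline-separated, lightly marked-up text."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n".join(parser.blocks)


SAMPLE = """
<html><body>
<nav><ul><li>Home</li><li>About</li></ul></nav>
<article>
  <h1>Parsing HTML</h1>
  <p>Most of the page is <em>chrome</em>.</p>
  <ul><li>First point</li><li>Second point</li></ul>
</article>
<footer><p>Cookie settings</p></footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_article(SAMPLE))
```

On the sample page the navigation links and the footer text are dropped, while the heading keeps its level and the list items keep their bullets. Real pages rarely mark boilerplate this cleanly: cookie banners and related-article blocks are often plain div elements, text outside p, li and heading tags is lost by this filter, and nested blocks such as a paragraph inside a list item are not handled. Those gaps are the reason the dedicated extractors in the outline exist.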