HTML to Markdown Converter: Convert HTML to MD Online
Key Takeaways
- Regex-based HTML strippers destroy structure; quality Markdown output requires a real parser plus a rule engine
- Turndown is the JavaScript standard: 4M+ weekly npm downloads, extensible via plugins, GFM tables supported
- LLM pipelines use HTML-to-Markdown to cut token count by 60–80% before feeding web content to models
- Conversion is lossy: classes, IDs, styles, colspan tables, and JS-driven content cannot round-trip back to original HTML
- For Python web scraping with boilerplate removal, trafilatura extracts the main content before converting, which gives far cleaner output
The Token Cost of Raw HTML
A typical news article page weighs around 120KB of raw HTML. The actual article text accounts for maybe 8KB. The rest is navigation markup, script tags, style blocks, social sharing buttons, comment widgets, cookie consent banners, and footer infrastructure. If you feed raw HTML to a language model, you are paying for — and consuming context window space with — 93% noise.
Converting to Markdown first is not just about readability. For a 120KB news page, the Markdown output is typically 6–12KB, a reduction of 90% or more in size and roughly the same in tokens. Assuming about 4 characters per token, 10,000 such pages come to roughly 300 million input tokens as raw HTML; at GPT-4o rates ($5/1M input tokens) that is about $1,500, against about $150 after a 90% reduction. At scale, HTML-to-Markdown conversion is a cost-optimization primitive, not a formatting nicety.
This is why HTML-to-Markdown has seen a surge of production tooling in 2024–2026, driven by RAG pipelines, web agents, and documentation migration tools. The challenge is that naive implementations — stripping tags with a regex, calling innerHTML.replace(/<[^>]*>/g, '') — destroy structure along with the noise. What you need is a parser-backed rule engine that maps HTML semantics to Markdown syntax one element at a time.
Why Regex HTML Stripping Fails
The temptation to strip HTML with a regular expression is understandable — it looks like a one-liner. The reality is that HTML is not a regular language. Attributes contain quoted strings that can include angle brackets. CDATA sections, HTML entities, self-closing tags, and malformed markup all break simple regex assumptions.
// ❌ Naive approach: destroys all structure
const text = html.replace(/<[^>]*>/g, '')
// Input: <p>Visit <a href="https://example.com">our site</a></p>
// Output: "Visit our site"
// Link is gone. No way to recover the URL.
// ❌ Still wrong: the pattern has no notion of quoted attribute values
const text2 = html.replace(/<[^"]*>/g, '')
// Breaks on: <img alt="A <cat> photo" src="cat.jpg">
// ❌ Misses HTML entities
// &amp; → &, &lt; → <, &gt; → >, &nbsp; → space (or not?)
// "AT&amp;T" → "AT&T" only if you also decode entities
// ✅ What you actually need: DOM parsing + structured traversal
// Every production library does this:
// 1. Parse HTML into a DOM tree (JSDOM, node-html-parser, or browser DOM)
// 2. Walk the tree recursively
// 3. Apply per-element conversion rules
// 4. Concatenate output with correct whitespace handling
The CommonMark specification contains 652 examples that serve as conformance tests for Markdown's own edge cases. A converter has to emit output that stays correct across those cases: nested emphasis, code spans inside link text, raw HTML passthrough, and more. No regex can do this.
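To make the contrast concrete, here is a minimal Python sketch that uses only the standard library. It compares the regex strip with a parser-driven walk that keeps link targets. The class name `LinkPreservingConverter` and the helper functions are illustrative assumptions, not part of any library discussed here, and the sketch handles only `<a>` elements.

```python
import re
from html.parser import HTMLParser


def strip_tags_with_regex(html: str) -> str:
    """The naive approach: delete anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", html)


class LinkPreservingConverter(HTMLParser):
    """Illustrative only: emits Markdown links instead of dropping them."""

    def __init__(self) -> None:
        # convert_charrefs=True makes the parser decode entities in text.
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.href_stack: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href_stack.append(dict(attrs).get("href") or "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            self.parts.append(f"]({self.href_stack.pop()})")

    def handle_data(self, data):
        self.parts.append(data)


def to_markdown(html: str) -> str:
    parser = LinkPreservingConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<p>Visit <a href="https://example.com">our site</a> at AT&amp;T</p>'
print(strip_tags_with_regex(sample))  # Visit our site at AT&amp;T
print(to_markdown(sample))            # Visit [our site](https://example.com) at AT&T
```

The regex version loses the URL and leaves the entity undecoded; the parser version keeps both because it sees attributes and text as separate events.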
The Core Conversion Mapping
Most HTML structure maps cleanly to Markdown when processed by a proper rule engine. Here is what good conversion looks like in practice:
// Input HTML → Output Markdown
// Headings
<h1>Title</h1> → # Title
<h2>Section</h2> → ## Section
// Emphasis
<strong>bold</strong> → **bold**
<em>italic</em> → _italic_
<code>inline</code> → `inline`
// Links (with URL preservation)
<a href="/page">text</a> → [text](/page)
<a href="/x" title="T">y</a> → [y](/x "T")
// Images
<img src="cat.jpg" alt="A cat"> → ![A cat](cat.jpg)
// Code blocks
<pre><code class="language-js">
const x = 1
</code></pre>
→
```js
const x = 1
```
// Lists
<ul><li>First</li><li>Second</li></ul>
→
- First
- Second
// Ordered lists
<ol><li>One</li><li>Two</li></ol>
→
1. One
2. Two
// Blockquotes
<blockquote><p>A quote</p></blockquote>
→
> A quote
The tricky cases are nested structures: an <em> inside an <a> inside an <li> inside a <blockquote>. Each library handles these differently, and the differences become visible only when running against real-world content.
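The nesting problem is easier to see in a toy rule engine. The sketch below is a simplified illustration in Python using the standard-library parser, not how Turndown or any other library is actually implemented; `RULES`, `RuleEngine`, and `convert` are hypothetical names. Each element is converted only after its children, so inner rules compose into outer ones.

```python
from html.parser import HTMLParser

# Each rule receives the already-converted inner Markdown and the tag's attributes.
RULES = {
    "h1": lambda inner, attrs: f"# {inner}\n\n",
    "h2": lambda inner, attrs: f"## {inner}\n\n",
    "p": lambda inner, attrs: f"{inner}\n\n",
    "strong": lambda inner, attrs: f"**{inner}**",
    "em": lambda inner, attrs: f"_{inner}_",
    "code": lambda inner, attrs: f"`{inner}`",
    "a": lambda inner, attrs: f"[{inner}]({attrs.get('href') or ''})",
    "li": lambda inner, attrs: f"- {inner.strip()}\n",
    "blockquote": lambda inner, attrs: "".join(
        f"> {line}\n" for line in inner.strip().splitlines()
    ),
}

# Elements that never get a closing tag, so they must not open a frame.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class RuleEngine(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Stack of (tag, attrs, converted children); the bottom frame is the document.
        self.stack = [("", {}, [])]

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs), []))

    def handle_endtag(self, tag):
        if len(self.stack) == 1 or self.stack[-1][0] != tag:
            return  # sketch only: mismatched close tags are ignored
        _, attrs, children = self.stack.pop()
        inner = "".join(children)
        rule = RULES.get(tag, lambda inner, attrs: inner)  # unknown tags pass through
        self.stack[-1][2].append(rule(inner, attrs))

    def handle_data(self, data):
        self.stack[-1][2].append(data)


def convert(html: str) -> str:
    engine = RuleEngine()
    engine.feed(html)
    engine.close()
    return "".join(engine.stack[0][2]).strip()


nested = '<blockquote><ul><li>See <a href="/docs"><em>the docs</em></a></li></ul></blockquote>'
print(convert(nested))  # > - See [_the docs_](/docs)
```

Real converters add whitespace normalization, escaping, and tolerance for malformed markup on top of this shape, which is where their outputs start to diverge.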
JavaScript: Turndown
Turndown (originally to-markdown) is the JavaScript standard for HTML-to-Markdown conversion. Per npm, it receives over 4 million weekly downloads — making it the dominant solution for browser and Node.js environments alike. It parses HTML into a DOM tree, applies a configurable rule set, and produces CommonMark-compatible Markdown.
import TurndownService from 'turndown'
import { gfm, tables, strikethrough } from 'turndown-plugin-gfm'
const turndown = new TurndownService({
headingStyle: 'atx', // # H1, ## H2 (vs underline style)
codeBlockStyle: 'fenced', // ``` code blocks
bulletListMarker: '-', // - item (vs * or +)
hr: '---',
})
// Enable GitHub Flavored Markdown (tables, strikethrough, task lists)
turndown.use(gfm)
// Custom rule: strip <nav> and <footer> elements entirely
turndown.addRule('removeChrome', {
filter: ['nav', 'footer', 'aside', 'script', 'style'],
replacement: () => '',
})
// Custom rule: keep <figure> captions as italicized text
turndown.addRule('figcaption', {
filter: 'figcaption',
replacement: (content) => `_${content}_\n\n`,
})
const markdown = turndown.turndown(htmlString)
// With a real-world article page:
// Input: ~85KB HTML
// Output: ~9KB Markdown (89% reduction)
// Time: ~12ms for a 50KB DOM in Node.js 20
The custom rule API is Turndown's biggest advantage. You can strip entire element categories (navigation, ads, sidebars) by returning an empty string from the replacement function, or transform elements that have no native Markdown equivalent. This is how production web scrapers and LLM pipelines customize output quality.
Python: html2text vs trafilatura
Python offers two meaningfully different approaches depending on your use case:
# html2text: direct conversion, full control
# Created by Aaron Swartz, maintained as OSS since 2004
import html2text
h = html2text.HTML2Text()
h.ignore_links = False # Keep [text](url) links
h.ignore_images = False # Keep ![alt](src) images
h.body_width = 0 # No line wrapping (important for LLM input)
h.unicode_snob = True # Prefer unicode chars over ASCII equivalents
h.ignore_emphasis = False
markdown = h.handle(html_string)
# trafilatura: content extraction + conversion
# Identifies the main content zone (article body) before converting
# Removes: navigation, headers, footers, sidebars, ads automatically
import trafilatura
# From a URL (handles fetch + extract):
downloaded = trafilatura.fetch_url('https://example.com/article')
markdown = trafilatura.extract(downloaded, output_format='markdown')
# From raw HTML:
markdown = trafilatura.extract(html_string, output_format='markdown',
include_comments=False,
include_tables=True)
# trafilatura accuracy on news/blog content:
# Per the trafilatura paper (Barbaresi 2021, ACL Anthology):
# F1 score of 0.89 for content extraction vs 0.71 for boilerplate removal alone
# Evaluated against 1,226 web pages from the C3 corpus
| Library | Language | Boilerplate Removal | Best For | Weekly Downloads |
|---|---|---|---|---|
| Turndown | JavaScript | Manual (custom rules) | Browser, Node.js, LLM pipelines | ~4M (npm) |
| html2text | Python | None | Known-clean HTML, migrations | ~2.3M (PyPI) |
| trafilatura | Python | Automatic (rule-based heuristics) | Web scraping, news extraction | ~600K (PyPI) |
| markdownify | Python | None | Custom logic via subclassing | ~800K (PyPI) |
| html-to-markdown (Go) | Go | Plugin-based | High-throughput servers | ~30K (pkg.go.dev) |
| Pandoc | Haskell (CLI) | None | Complex tables, academic docs | System install |
The Hard Cases: What No Library Handles Perfectly
Every converter has failure modes. Understanding them lets you decide where to add post-processing logic:
1. Complex Tables (colspan, rowspan)
GFM table syntax has no colspan or rowspan. A table cell spanning three columns in HTML becomes three identical-content cells in Markdown, losing the visual merge. The only correct handling is to keep the <table> as raw HTML inside the Markdown output. Pandoc does this for complex tables; Turndown drops the structure. If tables are critical, use Pandoc or keep raw HTML passthrough.
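One way to act on this is to decide per table whether GFM can represent it at all. The following Python sketch (standard library only) flags tables whose cells use `colspan` or `rowspan`; the names `SpanDetector` and `table_needs_html_passthrough` are hypothetical, and how you route the result is up to your pipeline.

```python
from html.parser import HTMLParser


class SpanDetector(HTMLParser):
    """Flags tables whose cells use colspan or rowspan other than 1."""

    def __init__(self):
        super().__init__()
        self.has_spans = False

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        for name, value in attrs:
            # The parser lowercases attribute names; a bare attribute has value None.
            if name in ("colspan", "rowspan") and (value or "1").strip() != "1":
                self.has_spans = True


def table_needs_html_passthrough(table_html: str) -> bool:
    detector = SpanDetector()
    detector.feed(table_html)
    detector.close()
    return detector.has_spans


simple = "<table><tr><th>A</th><th>B</th></tr><tr><td>1</td><td>2</td></tr></table>"
merged = '<table><tr><th colspan="2">Totals</th></tr><tr><td>1</td><td>2</td></tr></table>'
print(table_needs_html_passthrough(simple))  # False
print(table_needs_html_passthrough(merged))  # True
```

A table that returns `True` can be kept as raw HTML in the output, while the rest go through the normal GFM table rule.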
2. Nested Lists Beyond Three Levels
CommonMark supports theoretically infinite nesting via indentation, but real renderers often choke past four levels. More practically, HTML list nesting can include block elements (<div>, <p>) inside <li> elements, which Markdown handles only with blank-line-separated list items. Turndown gets this right; simpler converters produce invalid Markdown here.
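As a rough illustration of the indentation rule, the Python sketch below renders a nested list with each level indented to line up with its parent's content, which is two spaces for a `- ` marker. The tuple-based input shape and the `render_list` name are assumptions made for this example only.

```python
def render_list(items, depth=0):
    """Render nested items as Markdown bullet lines.

    An item is either a string or a (text, children) tuple.
    """
    lines = []
    for item in items:
        text, children = item if isinstance(item, tuple) else (item, [])
        # Two spaces per level aligns a child with the parent's text after "- ".
        lines.append("  " * depth + f"- {text}")
        lines.extend(render_list(children, depth + 1))
    return lines


outline = ["a", ("b", ["c", ("d", ["e"])])]
print("\n".join(render_list(outline)))
# - a
# - b
#   - c
#   - d
#     - e
```

Block content inside a list item (a second paragraph, a code block) has to follow the same alignment, which is the part simpler converters tend to get wrong.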
3. Definition Lists
<!-- HTML definition list: no standard Markdown equivalent -->
<dl>
<dt>REST</dt>
<dd>Representational State Transfer, a stateless API architecture</dd>
<dt>GraphQL</dt>
<dd>A query language for APIs with client-defined response shapes</dd>
</dl>
<!-- Most converters produce either: -->
**REST**
Representational State Transfer, a stateless API architecture
**GraphQL**
A query language for APIs with client-defined response shapes
<!-- Or (Pandoc extended syntax): -->
REST
: Representational State Transfer, a stateless API architecture
4. Relative URLs
HTML pages use relative links constantly: href="/about", src="../images/photo.jpg". When you convert to Markdown for use outside the original domain, these links break. Turndown does not resolve relative URLs automatically — you need to provide a base URL and resolve them in a custom rule:
import TurndownService from 'turndown'
const BASE = 'https://example.com'
const turndown = new TurndownService()
turndown.addRule('absoluteLinks', {
filter: 'a',
replacement: (content, node) => {
const href = node.getAttribute('href') || ''
const absoluteHref = href.startsWith('http')
? href
: new URL(href, BASE).toString()
const title = node.getAttribute('title')
return title
? `[${content}](${absoluteHref} "${title}")`
: `[${content}](${absoluteHref})`
},
})
Production Pattern: LLM Content Pipeline
Here is a production-grade Node.js pipeline for converting arbitrary web pages to clean Markdown for LLM consumption:
import TurndownService from 'turndown'
import { gfm } from 'turndown-plugin-gfm'
import * as cheerio from 'cheerio'
function htmlToLlmMarkdown(html: string, baseUrl: string): string {
// Step 1: Pre-process HTML with Cheerio to remove noise
const $ = cheerio.load(html)
// Remove elements that are never content
$('script, style, nav, header, footer, aside').remove()
$('[class*="cookie"], [class*="popup"], [class*="modal"]').remove()
$('[class*="social"], [class*="share"], [class*="subscribe"]').remove()
$('[aria-hidden="true"]').remove()
// Extract just the main content if present
const main = $('main, article, [role="main"]').first()
const contentHtml = main.length ? main.html() ?? '' : $('body').html() ?? ''
// Step 2: Convert to Markdown with Turndown
const turndown = new TurndownService({
headingStyle: 'atx',
codeBlockStyle: 'fenced',
bulletListMarker: '-',
})
turndown.use(gfm)
// Resolve relative URLs
turndown.addRule('absoluteLinks', {
filter: 'a',
replacement: (content, node) => {
const href = (node as HTMLAnchorElement).getAttribute('href') || ''
if (!href || href.startsWith('javascript:') || href.startsWith('#')) return content
const abs = href.startsWith('http') ? href : new URL(href, baseUrl).toString()
return `[${content}](${abs})`
},
})
let markdown = turndown.turndown(contentHtml)
// Step 3: Post-process — collapse excessive blank lines
markdown = markdown
.replace(/\n{3,}/g, '\n\n') // max 2 consecutive newlines
.trim()
return markdown
}
// Usage:
const response = await fetch('https://example.com/article')
const html = await response.text()
const md = htmlToLlmMarkdown(html, 'https://example.com')
// Result: typically 85-92% smaller than source HTML
console.log(`Reduced ${html.length} → ${md.length} chars (${Math.round((1 - md.length/html.length) * 100)}% reduction)`)
HTML Entities and Special Characters
HTML entities are a conversion minefield. The HTML specification defines over 2,200 named character references (&amp;, &mdash;, &rsquo;), plus numeric references (&#8212;, &#x2014;). Good converters decode these to their Unicode equivalents in the output (— for &mdash;, ’ for &rsquo;). Bad converters pass the entity reference through literally, leaving &rsquo; in the Markdown output where you expected an apostrophe.
Turndown decodes entities correctly because it operates on the parsed DOM where browsers have already resolved entities to Unicode code points. html2text decodes them explicitly via the html.parser backend. If you are writing a custom converter without a full parser, use the he library in Node.js or Python's html.unescape() as a post-processing step.
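For the Python side, a short sketch of the post-processing step with the standard library's `html.unescape()`. The sample string is made up for illustration. The second half shows why decoding should happen exactly once: a second pass turns escaped markup back into live markup.

```python
import html

raw = "AT&amp;T &mdash; it&rsquo;s &#8212; &#x2014;"
decoded = html.unescape(raw)

# Named, decimal, and hexadecimal references all resolve to the same code points.
assert decoded == "AT&T \u2014 it\u2019s \u2014 \u2014"

# Decode once only. Text that was deliberately double-escaped in the source
# (for example, a tutorial showing the literal string "&lt;") changes meaning
# if it is unescaped a second time.
once = html.unescape("&amp;lt;")
twice = html.unescape(once)
assert once == "&lt;"
assert twice == "<"
```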
Markdown also has its own set of characters that need escaping: * _ [ ] ( ) # + - . !. If plain text in the HTML contains these characters, they must be backslash-escaped in the Markdown output to prevent unintended formatting. This is another reason regex-based converters produce corrupt output — they skip the escape step entirely. Use our Markdown cheat sheet for a reference of all syntax rules and escape sequences.
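A minimal sketch of the escape step in Python, assuming a blanket policy of escaping every listed character wherever it appears. `escape_markdown` is a hypothetical helper; backslash and backtick are added to the list above because they also carry meaning in Markdown.

```python
import re

# The characters listed above, plus backslash and backtick.
_SPECIAL = re.compile(r"([\\`*_\[\]()#+\-.!])")


def escape_markdown(text: str) -> str:
    """Backslash-escape every Markdown-significant character in plain text."""
    return _SPECIAL.sub(r"\\\1", text)


print(escape_markdown("2 * 3 = 6"))           # 2 \* 3 = 6
print(escape_markdown("snake_case [draft]"))  # snake\_case \[draft\]
print(escape_markdown("1. Not a list"))       # 1\. Not a list
```

Escaping everything is safe but noisy: a period in the middle of a sentence never starts a list. A converter aiming for readable output would escape only where the character could actually be parsed as syntax, at the cost of more rules.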
Rust-Based Converters: Performance at Scale
For applications processing thousands of pages per hour, pure-JavaScript or pure-Python converters introduce throughput constraints. The kreuzberg html-to-markdown library ships a Rust core with native bindings for TypeScript/Node.js, Python, Go, Ruby, PHP, and 8 other runtimes, producing identical output across all of them from a single Rust parser.
# Reported throughput comparison (per kreuzberg benchmarks, 2025):
# Processing 1,000 average news article pages (avg 85KB HTML each)
# Turndown (Node.js): ~12ms/page → ~83 pages/sec
# html2text (Python): ~18ms/page → ~56 pages/sec
# kreuzberg (Rust/Node): ~2ms/page → ~500 pages/sec
# For LLM batch processing at 10K pages/run:
# Turndown: ~120 seconds
# kreuzberg: ~20 seconds
# 6x throughput improvement, same API surface
// Node.js usage (same Turndown-compatible API):
import { convert } from '@kreuzberg/html-to-markdown'
const markdown = await convert(html, {
headingStyle: 'atx',
codeBlockStyle: 'fenced',
})
For most use cases, Turndown's ~12ms/page is fast enough. Reach for a Rust-backed library when you are processing bulk batches on a schedule and the cumulative time matters for your pipeline's throughput.
Markdown Variants: CommonMark, GFM, and Pandoc
"Markdown" is not a single specification. Your output format choice affects what constructs the converter will use:
- CommonMark: the 2014 standardization of John Gruber's original Markdown. Defines exact parsing rules for ambiguous cases. Supported by GitHub, GitLab, Discourse, and most modern renderers. No tables, no strikethrough, no task lists.
- GitHub Flavored Markdown (GFM): CommonMark + tables, strikethrough (~~text~~), task lists (- [x]), and autolinks. The de facto standard for developer-facing content.
- Pandoc Markdown: superset with definition lists, footnotes, math (LaTeX), tables with alignment, div/span containers. Best for academic and technical documentation.
- MDX: Markdown + JSX components. Not a conversion target for HTML; you write it directly.
For LLM context, target CommonMark or GFM — they are the most widely trained-on formats and produce the most predictable tokenization. For developer documentation and GitHub-rendered content, GFM is the right choice. You can format and inspect raw HTML before running conversion to catch malformed markup that would break edge cases.
Format HTML Before Converting
Malformed HTML produces garbage Markdown. Before running your conversion pipeline, use BytePane's HTML Formatter to pretty-print and validate your source HTML, then convert. Works in-browser, zero setup.
Frequently Asked Questions
What is the best library to convert HTML to Markdown in JavaScript?
Turndown is the JavaScript standard with over 4 million weekly npm downloads. It handles CommonMark output, supports GFM tables via turndown-plugin-gfm, and exposes a rule-based API for custom element handling. For Node.js high-throughput use, the Rust-based kreuzberg html-to-markdown offers ~6x faster processing with compatible API surface.
Why does HTML to Markdown conversion lose formatting?
Markdown is a subset of what HTML can express. Complex table layouts, colspan/rowspan, CSS styles, definition lists, and non-standard elements have no Markdown equivalent. Converters either drop these or keep them as raw HTML. Output quality depends on how structured the source HTML was and which converter handles the edge cases.
How do I convert HTML to Markdown in Python?
html2text (created by Aaron Swartz) is the most common option. For web scraping with automatic boilerplate removal — navigation, footers, ads — use trafilatura instead. It identifies the main content zone before converting, producing much cleaner Markdown from news articles. Per the trafilatura paper (Barbaresi 2021, ACL Anthology), it achieves F1 of 0.89 for content extraction.
Can I convert HTML to Markdown without losing links?
Yes, for standard anchor tags. Problems arise with javascript: hrefs (stripped), relative URLs (need a base URL to resolve), and link text containing inline HTML. Good converters handle these: Turndown preserves absolute URLs and lets you add a custom rule to resolve relative paths to absolute URLs before output.
Is HTML to Markdown conversion reversible?
Not perfectly — the conversion is lossy by design. CSS classes, IDs, data attributes, inline styles, complex tables, forms, and scripts are discarded. Converting back produces valid HTML but structurally different from the original. Use the round-trip for content portability, not for preserving presentation.
What is HTML to Markdown used for in LLM applications?
Web pages contain 60–80% noise (navigation, scripts, ads). Converting to Markdown before feeding to an LLM reduces token count by the same amount, cutting API costs and improving context quality. Most RAG pipelines and web agents use HTML-to-Markdown or plain-text extraction as a preprocessing step.
How do I handle HTML tables when converting to Markdown?
Simple tables convert to GFM format via turndown-plugin-gfm. Complex tables (colspan, rowspan) cannot be represented in GFM and are best kept as raw HTML inside the Markdown output. Pandoc handles the widest table types but requires system-level installation.