HTML to Markdown Converter: Convert HTML to MD Online

Q: What is the best library to convert HTML to Markdown in JavaScript?

Turndown (formerly to-markdown) is the most widely used JavaScript HTML-to-Markdown library, with over 4 million weekly npm downloads. It handles CommonMark output, supports GFM tables via the turndown-plugin-gfm plugin, and exposes a rule-based API for custom element handling. For Node.js-only use cases with higher throughput needs, the Rust-based html-to-markdown package offers native bindings with near-identical API surface.

Q: Why does HTML to Markdown conversion lose formatting?

Markdown is a subset of what HTML can express. Several HTML constructs have no Markdown equivalent: complex table layouts, colspan/rowspan cells, arbitrary CSS styles, multi-level nested lists beyond three levels, definition lists, and non-standard elements. Converters either drop these elements, pass them through as raw HTML inside the Markdown, or use extended syntax like GFM. The output quality depends heavily on how the source HTML was structured.

Q: How do I convert HTML to Markdown in Python?

The html2text library (created by Aaron Swartz) is the most common Python option. Install with pip install html2text. For web scraping with boilerplate removal — navigation, footers, ads — use trafilatura instead. It identifies the main content zone before converting, producing much cleaner Markdown from news articles and blog posts. For async batch processing, html2md offers non-blocking conversion.

Q: Can I convert HTML to Markdown without losing links?

Yes, but with caveats. Standard anchor tags convert cleanly to [text](url). Problems arise with JavaScript-driven links (href="javascript:void(0)"), relative URLs (which need a base URL to resolve), and link text that contains inline HTML. Turndown preserves absolute URLs and processes inline formatting within link text correctly; dropping javascript: hrefs and resolving relative URLs is best handled in a custom rule.

Q: Is HTML to Markdown conversion reversible?

Not perfectly. The conversion is lossy by design: HTML structural information (classes, IDs, data attributes, inline styles, complex tables, forms, scripts) is discarded during the HTML-to-Markdown pass. Converting back from Markdown to HTML produces valid HTML, but it will be structurally different from the original — no CSS classes, no attributes, plain semantic HTML only. Use the round-trip for content, not for preserving presentation.

Q: What is HTML to Markdown used for in LLM applications?

Web pages fetched for LLM context windows contain massive amounts of noise: navigation HTML, footer markup, script tags, style tags, ads, social buttons. Converting to Markdown reduces token count by 60–80% for typical news/blog pages, which directly reduces API costs and improves context quality. Most RAG pipelines and web agents (including OpenAI's browsing tool) use HTML-to-Markdown or plain-text extraction before feeding content to the model.

Q: How do I handle HTML tables when converting to Markdown?

Standard Markdown has no table syntax; tables require the GitHub Flavored Markdown (GFM) table extension. Turndown supports GFM tables via the turndown-plugin-gfm plugin. Complex tables (nested, colspan, rowspan) cannot be represented in GFM and are typically left as raw HTML inside the Markdown output. Pandoc handles the widest range of table types but requires a system-level installation.

Key Takeaways

▸Regex-based HTML strippers destroy structure — a real parser + rule engine is required for quality Markdown output
▸Turndown is the JavaScript standard: 4M+ weekly npm downloads, extensible via plugins, GFM tables supported
▸LLM pipelines use HTML-to-Markdown to cut token count by 60–80% before feeding web content to models
▸Conversion is lossy: classes, IDs, styles, colspan tables, and JS-driven content cannot round-trip back to original HTML
▸For Python web scraping with boilerplate removal, trafilatura extracts main content before converting — far cleaner output

The Token Cost of Raw HTML

A typical news article page weighs around 120KB of raw HTML. The actual article text accounts for maybe 8KB. The rest is navigation markup, script tags, style blocks, social sharing buttons, comment widgets, cookie consent banners, and footer infrastructure. If you feed raw HTML to a language model, you are paying for — and consuming context window space with — 93% noise.

Converting to Markdown first is not just about readability. For a 120KB news page, the Markdown output is typically 6–12KB, a size reduction of 90% or more. Assuming roughly 4 characters per token, the raw page is about 30,000 tokens and the Markdown about 1,500 to 3,000. At GPT-4o rates ($5/1M input tokens), processing 10,000 such pages drops from roughly $1,500 to somewhere between $75 and $150. At scale, HTML-to-Markdown conversion is a cost-optimization primitive, not a formatting nicety.
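The cost arithmetic is easy to check in a few lines of Python. This is a rough sketch: the 4-characters-per-token ratio is an assumed heuristic (real tokenizers vary, and markup-heavy HTML often tokenizes less efficiently), and estimate_cost_usd is a hypothetical helper, not part of any library.

```python
def estimate_cost_usd(page_bytes: int, pages: int,
                      usd_per_million_tokens: float = 5.0,
                      chars_per_token: float = 4.0) -> float:
    """Rough input cost for a batch of same-sized pages (heuristic only)."""
    tokens_per_page = page_bytes / chars_per_token
    total_tokens = tokens_per_page * pages
    return total_tokens / 1_000_000 * usd_per_million_tokens


raw_html_cost = estimate_cost_usd(120_000, 10_000)  # 120KB pages as raw HTML
markdown_cost = estimate_cost_usd(9_000, 10_000)    # ~9KB pages after conversion
print(f"raw HTML: ${raw_html_cost:,.2f}, Markdown: ${markdown_cost:,.2f}")
# raw HTML: $1,500.00, Markdown: $112.50
```

Swap in your own page sizes and the current price of whichever model you use; the ratio between the two numbers matters more than either absolute figure.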

This is why HTML-to-Markdown has seen a surge of production tooling in 2024–2026, driven by RAG pipelines, web agents, and documentation migration tools. The challenge is that naive implementations — stripping tags with a regex, calling innerHTML.replace(/<[^>]*>/g, '') — destroy structure along with the noise. What you need is a parser-backed rule engine that maps HTML semantics to Markdown syntax one element at a time.

Why Regex HTML Stripping Fails

The temptation to strip HTML with a regular expression is understandable — it looks like a one-liner. The reality is that HTML is not a regular language. Attributes contain quoted strings that can include angle brackets. CDATA sections, HTML entities, self-closing tags, and malformed markup all break simple regex assumptions.

// ❌ Naive approach: destroys all structure
const text = html.replace(/<[^>]*>/g, '')
// Input:  <p>Visit <a href="https://example.com">our site</a></p>
// Output: "Visit our site"
// Link is gone. No way to recover the URL.

// ❌ The same regex breaks when a quoted attribute contains angle brackets
const broken = '<img alt="A <cat> photo" src="cat.jpg">'.replace(/<[^>]*>/g, '')
// Output: ' photo" src="cat.jpg">' (attribute debris leaks into the text)

// ❌ Misses HTML entities
// &amp; → &, &lt; → <, &gt; → >, &nbsp; → space (or not?)
// "AT&amp;T" → "AT&T" only if you also decode entities

// ✅ What you actually need: DOM parsing + structured traversal
// Every production library does this:
// 1. Parse HTML into a DOM tree (JSDOM, node-html-parser, or browser DOM)
// 2. Walk the tree recursively
// 3. Apply per-element conversion rules
// 4. Concatenate output with correct whitespace handling

The CommonMark specification ships with more than 600 numbered conformance examples (652 in recent versions), most of them pinning down edge cases: nested emphasis, code spans inside link text, raw HTML passthrough, and more. A converter that emits correct Markdown has to respect the same rules whenever it nests or escapes output. No regex can do this.
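To make the contrast with regex stripping concrete, here is a minimal sketch using Python's standard-library html.parser. It is event-driven rather than a full DOM walk, handles only a handful of tags, and does no Markdown escaping, so treat it as an illustration of parser-backed conversion, not a replacement for a real library. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class LinkKeepingConverter(HTMLParser):
    """Event-driven sketch: keeps text, links, images, and basic emphasis."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities in text arrive decoded
        self.out = []
        self.hrefs = []  # stack of href values for currently open <a> tags

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a":
            self.hrefs.append(attr.get("href"))
            if attr.get("href"):
                self.out.append("[")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("_")
        elif tag == "img":
            self.out.append(f"![{attr.get('alt') or ''}]({attr.get('src') or ''})")

    def handle_endtag(self, tag):
        if tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if href:
                self.out.append(f"]({href})")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("_")
        elif tag == "p":
            self.out.append("\n\n")

    def handle_data(self, data):
        self.out.append(data)


def to_markdown(html_text: str) -> str:
    converter = LinkKeepingConverter()
    converter.feed(html_text)
    converter.close()
    return "".join(converter.out).strip()


print(to_markdown('<p>Visit <a href="https://example.com">our site</a></p>'))
# Visit [our site](https://example.com)
print(to_markdown('<img alt="A <cat> photo" src="cat.jpg">'))
# ![A <cat> photo](cat.jpg)
```

The link URL survives, and the quoted attribute containing angle brackets is parsed as one value instead of being split mid-tag.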

The Core Conversion Mapping

Most HTML structure maps cleanly to Markdown when processed by a proper rule engine. Here is what good conversion looks like in practice:

// Input HTML → Output Markdown

// Headings
<h1>Title</h1>              → # Title
<h2>Section</h2>            → ## Section

// Emphasis
<strong>bold</strong>        → **bold**
<em>italic</em>              → _italic_
<code>inline</code>          → `inline`

// Links (with URL preservation)
<a href="/page">text</a>    → [text](/page)
<a href="/x" title="T">y</a> → [y](/x "T")

// Images
<img src="cat.jpg" alt="A cat"> → ![A cat](cat.jpg)

// Code blocks
<pre><code class="language-js">
const x = 1
</code></pre>
→
```js
const x = 1
```

// Lists
<ul><li>First</li><li>Second</li></ul>
→
- First
- Second

// Ordered lists
<ol><li>One</li><li>Two</li></ol>
→
1. One
2. Two

// Blockquotes
<blockquote><p>A quote</p></blockquote>
→
> A quote

The tricky cases are nested structures: an <em> inside an <a> inside an <li> inside a <blockquote>. Each library handles these differently, and the differences become visible only when running against real-world content.
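Nested structures are where a tree walk earns its keep: each element converts its children first, then wraps the result. The sketch below builds a tiny element tree with Python's html.parser and renders it recursively. It assumes well-formed input with no void elements, covers only em, a, li, and blockquote, and uses hypothetical names (Node, TreeBuilder, render) throughout.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # mix of Node objects and plain strings


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree; assumes well-formed, non-void tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.stack[-1].children.append(data)


def render(node) -> str:
    """Convert children first, then wrap the result per element type."""
    if isinstance(node, str):
        return node
    inner = "".join(render(child) for child in node.children)
    if node.tag == "em":
        return f"_{inner}_"
    if node.tag == "a":
        return f"[{inner}]({node.attrs.get('href') or ''})"
    if node.tag == "li":
        return f"- {inner}\n"
    if node.tag == "blockquote":
        return "".join(f"> {line}\n" for line in inner.strip().splitlines())
    return inner  # ul, p, root, and unknown tags pass their content through


def convert(html_text: str) -> str:
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return render(builder.root)


nested = ('<blockquote><ul><li>See <a href="/docs"><em>the docs</em></a></li>'
          '<li>Second</li></ul></blockquote>')
print(convert(nested))
# > - See [_the docs_](/docs)
# > - Second
```

Because the blockquote prefix is applied to the already-rendered list, the nesting order in the output mirrors the nesting order in the HTML.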

JavaScript: Turndown

Turndown (originally to-markdown) is the JavaScript standard for HTML-to-Markdown conversion. Per npm, it receives over 4 million weekly downloads — making it the dominant solution for browser and Node.js environments alike. It parses HTML into a DOM tree, applies a configurable rule set, and produces CommonMark-compatible Markdown.

import TurndownService from 'turndown'
import { gfm } from 'turndown-plugin-gfm'

const turndown = new TurndownService({
  headingStyle: 'atx',       // # H1, ## H2 (vs underline style)
  codeBlockStyle: 'fenced',  // ``` code blocks
  bulletListMarker: '-',     // - item (vs * or +)
  hr: '---',
})

// Enable GitHub Flavored Markdown (tables, strikethrough, task lists)
turndown.use(gfm)

// Custom rule: drop page chrome and non-content elements entirely
turndown.addRule('removeChrome', {
  filter: ['nav', 'footer', 'aside', 'script', 'style'],
  replacement: () => '',
})

// Custom rule: keep <figure> captions as italicized text
turndown.addRule('figcaption', {
  filter: 'figcaption',
  replacement: (content) => `_${content}_\n\n`,
})

const markdown = turndown.turndown(htmlString)

// With a real-world article page:
// Input:  ~85KB HTML
// Output: ~9KB Markdown (89% reduction)
// Time:   ~12ms for a 50KB DOM in Node.js 20

The custom rule API is Turndown's biggest advantage. You can strip entire element categories (navigation, ads, sidebars) by returning an empty string from the replacement function, or transform elements that have no native Markdown equivalent. This is how production web scrapers and LLM pipelines customize output quality.

Python: html2text vs trafilatura

Python offers two meaningfully different approaches depending on your use case:

# html2text: direct conversion, full control
# Created by Aaron Swartz, maintained as OSS since 2004
import html2text

h = html2text.HTML2Text()
h.ignore_links = False      # Keep [text](url) links
h.ignore_images = False     # Keep ![alt](src) images
h.body_width = 0            # No line wrapping (important for LLM input)
h.unicode_snob = True       # Prefer unicode chars over ASCII equivalents
h.ignore_emphasis = False

markdown = h.handle(html_string)

# trafilatura: content extraction + conversion
# Identifies the main content zone (article body) before converting
# Removes: navigation, headers, footers, sidebars, ads automatically
import trafilatura

# From a URL (handles fetch + extract):
downloaded = trafilatura.fetch_url('https://example.com/article')
markdown = trafilatura.extract(downloaded, output_format='markdown')

# From raw HTML:
markdown = trafilatura.extract(html_string, output_format='markdown',
                               include_comments=False,
                               include_tables=True)

# trafilatura accuracy on news/blog content:
# Per the trafilatura paper (Barbaresi 2021, ACL Anthology):
# F1 score of 0.89 for content extraction vs 0.71 for boilerplate removal alone
# Evaluated against 1,226 web pages from the C3 corpus

Library	Language	Boilerplate Removal	Best For	Weekly Downloads
Turndown	JavaScript	Manual (custom rules)	Browser, Node.js, LLM pipelines	~4M (npm)
html2text	Python	None	Known-clean HTML, migrations	~2.3M (PyPI)
trafilatura	Python	Automatic (heuristic, rule-based)	Web scraping, news extraction	~600K (PyPI)
markdownify	Python	None	Custom logic via subclassing	~800K (PyPI)
html-to-markdown (Go)	Go	Plugin-based	High-throughput servers	~30K (pkg.go.dev)
Pandoc	Haskell (CLI)	None	Complex tables, academic docs	System install

The Hard Cases: What No Library Handles Perfectly

Every converter has failure modes. Understanding them lets you decide where to add post-processing logic:

1. Complex Tables (colspan, rowspan)

GFM table syntax has no colspan or rowspan. A table cell spanning three columns in HTML has no faithful GFM equivalent: depending on the converter, it ends up duplicated, padded with empty cells, or misaligned, and the visual merge is lost either way. The safest handling is to keep the <table> as raw HTML inside the Markdown output. Pandoc does this for complex tables; Turndown drops the structure. If tables are critical, use Pandoc or keep raw HTML passthrough.
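One way to decide between the two paths is to check a table for spans before converting it. The following is a minimal sketch with Python's standard-library parser; needs_raw_html is a hypothetical helper name, and the rule (any colspan or rowspan other than 1 means passthrough) is an assumption you may want to relax or tighten.

```python
from html.parser import HTMLParser


class SpanDetector(HTMLParser):
    """Flags any <td>/<th> whose colspan or rowspan is something other than 1."""

    def __init__(self):
        super().__init__()
        self.has_spans = False

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        for name, value in attrs:
            if name in ("colspan", "rowspan") and (value or "1").strip() != "1":
                self.has_spans = True


def needs_raw_html(table_html: str) -> bool:
    """True when the table cannot be expressed as a plain GFM pipe table."""
    detector = SpanDetector()
    detector.feed(table_html)
    detector.close()
    return detector.has_spans


simple = "<table><tr><th>A</th><th>B</th></tr><tr><td>1</td><td>2</td></tr></table>"
merged = '<table><tr><td colspan="3">Total</td></tr></table>'
print(needs_raw_html(simple), needs_raw_html(merged))  # False True
```

Tables that return True can be emitted verbatim as HTML; the rest go through the normal GFM table conversion.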

2. Nested Lists Beyond Three Levels

CommonMark supports theoretically infinite nesting via indentation, but real renderers often choke past four levels. More practically, HTML list nesting can include block elements (<div>, <p>) inside <li> elements, which Markdown handles only with blank-line-separated list items. Turndown gets this right; simpler converters produce invalid Markdown here.

3. Definition Lists

<!-- HTML definition list: no standard Markdown equivalent -->
<dl>
  <dt>REST</dt>
  <dd>Representational State Transfer, a stateless API architecture</dd>
  <dt>GraphQL</dt>
  <dd>A query language for APIs with client-defined response shapes</dd>
</dl>

<!-- Most converters produce either: -->
**REST**
Representational State Transfer, a stateless API architecture

**GraphQL**
A query language for APIs with client-defined response shapes

<!-- Or (Pandoc extended syntax): -->
REST
: Representational State Transfer, a stateless API architecture
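If your converter skips definition lists, a small post-processing pass can produce either style. This sketch uses Python's html.parser; the class name, the convert_dl helper, and the style parameter are all illustrative, and nested markup inside a dt or dd is flattened to plain text.

```python
from html.parser import HTMLParser


class DefinitionListConverter(HTMLParser):
    """Collects <dt>/<dd> text and emits bold-term or Pandoc-style output."""

    def __init__(self, style: str = "bold"):
        super().__init__()
        self.style = style    # "bold" or "pandoc"
        self.current = None   # tag being captured: "dt", "dd", or None
        self.buffer = []
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self.current, self.buffer = tag, []

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current:
            return
        text = " ".join("".join(self.buffer).split())  # normalize whitespace
        if tag == "dt":
            self.lines.append(f"**{text}**" if self.style == "bold" else text)
        else:
            self.lines.append(text if self.style == "bold" else f": {text}")
            self.lines.append("")  # blank line after each definition
        self.current = None


def convert_dl(html_text: str, style: str = "bold") -> str:
    converter = DefinitionListConverter(style)
    converter.feed(html_text)
    converter.close()
    return "\n".join(converter.lines).strip()


sample = ("<dl><dt>REST</dt><dd>Representational State Transfer</dd>"
          "<dt>GraphQL</dt><dd>A query language</dd></dl>")
print(convert_dl(sample))
print(convert_dl(sample, style="pandoc"))
```

The bold style renders everywhere; the Pandoc style only makes sense if the downstream renderer supports definition lists.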

4. Relative URLs

HTML pages use relative links constantly: href="/about", src="../images/photo.jpg". When you convert to Markdown for use outside the original domain, these links break. Turndown does not resolve relative URLs automatically — you need to provide a base URL and resolve them in a custom rule:

import TurndownService from 'turndown'

const BASE = 'https://example.com'

const turndown = new TurndownService()
turndown.addRule('absoluteLinks', {
  // Match only anchors that actually carry an href
  filter: (node) => node.nodeName === 'A' && node.getAttribute('href'),
  replacement: (content, node) => {
    const href = node.getAttribute('href')
    // new URL() keeps absolute hrefs and resolves relative ones against BASE
    const absoluteHref = new URL(href, BASE).toString()
    const title = node.getAttribute('title')
    return title
      ? `[${content}](${absoluteHref} "${title}")`
      : `[${content}](${absoluteHref})`
  },
})
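The same resolution step is available in Python's standard library through urllib.parse.urljoin, which is handy when post-processing links from html2text or markdownify output. In this sketch, resolve_href is a hypothetical helper, and dropping javascript: hrefs is a policy choice made here rather than something any particular library does for you.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_href(href: str, base_url: str) -> Optional[str]:
    """Return an absolute URL, or None when the link should be dropped."""
    href = href.strip()
    if not href or href.lower().startswith("javascript:"):
        return None
    # urljoin leaves absolute URLs alone and resolves relative ones
    return urljoin(base_url, href)


base = "https://example.com/blog/post"
print(resolve_href("/about", base))                 # https://example.com/about
print(resolve_href("../images/photo.jpg", base))    # https://example.com/images/photo.jpg
print(resolve_href("https://other.org/x", base))    # https://other.org/x
print(resolve_href("javascript:void(0)", base))     # None
```

Pass the URL of the page itself as the base, not just the site origin, so that paths like ../images/photo.jpg resolve relative to the right directory.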

Production Pattern: LLM Content Pipeline

Here is a production-grade Node.js pipeline for converting arbitrary web pages to clean Markdown for LLM consumption:

import TurndownService from 'turndown'
import { gfm } from 'turndown-plugin-gfm'
import * as cheerio from 'cheerio'

function htmlToLlmMarkdown(html: string, baseUrl: string): string {
  // Step 1: Pre-process HTML with Cheerio to remove noise
  const $ = cheerio.load(html)

  // Remove elements that are never content
  $('script, style, nav, header, footer, aside').remove()
  $('[class*="cookie"], [class*="popup"], [class*="modal"]').remove()
  $('[class*="social"], [class*="share"], [class*="subscribe"]').remove()
  $('[aria-hidden="true"]').remove()

  // Extract just the main content if present
  const main = $('main, article, [role="main"]').first()
  const contentHtml = main.length ? main.html() ?? '' : $('body').html() ?? ''

  // Step 2: Convert to Markdown with Turndown
  const turndown = new TurndownService({
    headingStyle: 'atx',
    codeBlockStyle: 'fenced',
    bulletListMarker: '-',
  })
  turndown.use(gfm)

  // Resolve relative URLs
  turndown.addRule('absoluteLinks', {
    filter: 'a',
    replacement: (content, node) => {
      const href = (node as HTMLAnchorElement).getAttribute('href') || ''
      if (!href || href.startsWith('javascript:') || href.startsWith('#')) return content
      const abs = href.startsWith('http') ? href : new URL(href, baseUrl).toString()
      return `[${content}](${abs})`
    },
  })

  let markdown = turndown.turndown(contentHtml)

  // Step 3: Post-process — collapse excessive blank lines
  markdown = markdown
    .replace(/\n{3,}/g, '\n\n')  // max 2 consecutive newlines
    .trim()

  return markdown
}

// Usage:
const response = await fetch('https://example.com/article')
const html = await response.text()
const md = htmlToLlmMarkdown(html, 'https://example.com')

// Result: typically 85-92% smaller than source HTML
console.log(`Reduced ${html.length} → ${md.length} chars (${Math.round((1 - md.length/html.length) * 100)}% reduction)`)

HTML Entities and Special Characters

HTML entities are a conversion minefield. The HTML specification defines over 2,200 named character references (&amp;amp;, &amp;mdash;, &amp;rsquo;), plus numeric references (&amp;#8212;, &amp;#x2014;). Good converters decode these to their Unicode equivalents in the output (U+2014 for an em dash, U+2019 for a right single quote). Bad converters pass the entity reference through literally, leaving &amp;rsquo; in the Markdown output where you expected an apostrophe.

Turndown decodes entities correctly because it operates on a parsed DOM, where the HTML parser (in the browser or in Node.js) has already resolved entities to Unicode code points. html2text decodes them explicitly via the html.parser backend. If you are writing a custom converter without a full parser, use the he library in Node.js or Python's html.unescape() as a post-processing step.

Markdown also has its own set of characters that need escaping: * _ [ ] ( ) # + - . !. If plain text in the HTML contains these characters, they must be backslash-escaped in the Markdown output to prevent unintended formatting. This is another reason regex-based converters produce corrupt output — they skip the escape step entirely. Use our Markdown cheat sheet for a reference of all syntax rules and escape sequences.
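Both steps, decoding entities and escaping Markdown punctuation, fit in a few lines of Python. This is a deliberately blunt sketch: it escapes every listed character wherever it appears, whereas production converters escape contextually (a period only needs escaping after a leading number, for example). to_markdown_text is a hypothetical helper name.

```python
import html
import re

# Characters listed above, plus backslash and backtick
_MD_SPECIAL = re.compile(r"([\\`*_\[\]()#+\-.!])")


def to_markdown_text(raw: str) -> str:
    """Decode HTML entities, then backslash-escape Markdown punctuation."""
    decoded = html.unescape(raw)
    return _MD_SPECIAL.sub(r"\\\1", decoded)


print(to_markdown_text("AT&amp;T"))           # AT&T
print(to_markdown_text("2 * 3 &lt; 7"))       # 2 \* 3 < 7
print(to_markdown_text("snake_case [sic]"))   # snake\_case \[sic\]
```

Order matters: decode first, escape second. Escaping before decoding would miss characters that only appear once an entity such as &amp;#42; has been resolved.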

Rust-Based Converters: Performance at Scale

For applications processing thousands of pages per hour, pure-JavaScript or pure-Python converters introduce throughput constraints. The kreuzberg html-to-markdown library ships a Rust core with native bindings for TypeScript/Node.js, Python, Go, Ruby, PHP, and 8 other runtimes, producing identical output across all of them from a single Rust parser.

# Reported throughput comparison (per kreuzberg benchmarks, 2025):
# Processing 1,000 average news article pages (avg 85KB HTML each)

# Turndown (Node.js):      ~12ms/page   → ~83 pages/sec
# html2text (Python):      ~18ms/page   → ~56 pages/sec
# kreuzberg (Rust/Node):   ~2ms/page    → ~500 pages/sec

# For LLM batch processing at 10K pages/run:
# Turndown:   ~120 seconds
# kreuzberg:  ~20 seconds
# 6x throughput improvement, same API surface

// Node.js usage (same Turndown-compatible API):
import { convert } from '@kreuzberg/html-to-markdown'

const markdown = await convert(html, {
  headingStyle: 'atx',
  codeBlockStyle: 'fenced',
})

For most use cases, Turndown's ~12ms/page is fast enough. Reach for a Rust-backed library when you are processing bulk batches on a schedule and the cumulative time matters for your pipeline's throughput.
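To judge whether the switch is worth it for your own batch sizes, the arithmetic behind the comparison above is simple enough to script. This sketch takes the reported per-page timings at face value and assumes strictly sequential processing; batch_seconds is a hypothetical helper.

```python
def batch_seconds(pages: int, ms_per_page: float) -> float:
    """Wall-clock time for a sequential batch at a fixed per-page cost."""
    return pages * ms_per_page / 1000


turndown_s = batch_seconds(10_000, 12)  # reported ~12ms/page
rust_s = batch_seconds(10_000, 2)       # reported ~2ms/page
print(turndown_s, rust_s, turndown_s / rust_s)  # 120.0 20.0 6.0
```

If fetching the pages takes far longer than converting them, as it usually does over the network, the conversion step is unlikely to be the bottleneck.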

Markdown Variants: CommonMark, GFM, and Pandoc

"Markdown" is not a single specification. Your output format choice affects what constructs the converter will use:

CommonMark: the 2014 standardization of John Gruber's original Markdown. Defines exact parsing rules for ambiguous cases. Supported by GitHub, GitLab, Discourse, and most modern renderers. No tables, no strikethrough, no task lists.
GitHub Flavored Markdown (GFM): CommonMark + tables, strikethrough (~~text~~), task lists (- [x]), and autolinks. The de facto standard for developer-facing content.
Pandoc Markdown: superset with definition lists, footnotes, math (LaTeX), tables with alignment, div/span containers. Best for academic and technical documentation.
MDX: Markdown + JSX components. Not a conversion target for HTML; you write it directly.

For LLM context, target CommonMark or GFM — they are the most widely trained-on formats and produce the most predictable tokenization. For developer documentation and GitHub-rendered content, GFM is the right choice. You can format and inspect raw HTML before running conversion to catch malformed markup that would break edge cases.

Format HTML Before Converting

Malformed HTML produces garbage Markdown. Before running your conversion pipeline, use BytePane's HTML Formatter to pretty-print and validate your source HTML, then convert. Works in-browser, zero setup.


Frequently Asked Questions

What is the best library to convert HTML to Markdown in JavaScript?

Turndown is the JavaScript standard with over 4 million weekly npm downloads. It handles CommonMark output, supports GFM tables via turndown-plugin-gfm, and exposes a rule-based API for custom element handling. For Node.js high-throughput use, the Rust-based kreuzberg html-to-markdown offers ~6x faster processing with compatible API surface.

Why does HTML to Markdown conversion lose formatting?

Markdown is a subset of what HTML can express. Complex table layouts, colspan/rowspan, CSS styles, definition lists, and non-standard elements have no Markdown equivalent. Converters either drop these or keep them as raw HTML. Output quality depends on how structured the source HTML was and which converter handles the edge cases.

How do I convert HTML to Markdown in Python?

html2text (created by Aaron Swartz) is the most common option. For web scraping with automatic boilerplate removal — navigation, footers, ads — use trafilatura instead. It identifies the main content zone before converting, producing much cleaner Markdown from news articles. Per the trafilatura paper (Barbaresi 2021, ACL Anthology), it achieves F1 of 0.89 for content extraction.

Can I convert HTML to Markdown without losing links?

Yes, for standard anchor tags. Problems arise with javascript: hrefs (which should be dropped), relative URLs (which need a base URL to resolve), and link text containing inline HTML. Turndown preserves absolute URLs and lets you add a custom rule to drop javascript: hrefs and resolve relative paths to absolute URLs before output.

Is HTML to Markdown conversion reversible?

Not perfectly — the conversion is lossy by design. CSS classes, IDs, data attributes, inline styles, complex tables, forms, and scripts are discarded. Converting back produces valid HTML but structurally different from the original. Use the round-trip for content portability, not for preserving presentation.

What is HTML to Markdown used for in LLM applications?

Web pages contain 60–80% noise (navigation, scripts, ads). Converting to Markdown before feeding to an LLM reduces token count by the same amount, cutting API costs and improving context quality. Most RAG pipelines and web agents use HTML-to-Markdown or plain-text extraction as a preprocessing step.

How do I handle HTML tables when converting to Markdown?

Simple tables convert to GFM format via turndown-plugin-gfm. Complex tables (colspan, rowspan) cannot be represented in GFM and are best kept as raw HTML inside the Markdown output. Pandoc handles the widest range of table types but requires a system-level installation.