BytePane

Sitemap Generator: Create XML Sitemaps for Your Website

SEO · 18 min read

Key Takeaways

  • Google ignores <priority> and <changefreq> entirely. Only <loc> and <lastmod> influence crawl behavior.
  • A single sitemap file caps at 50,000 URLs and 50MB. Larger sites need a sitemap index file.
  • Studies cited by SEO practitioners suggest new pages can appear in Google 25-40% faster when submitted via an accurate sitemap.
  • The canonical URL in the sitemap must exactly match the canonical tag on the page — mismatches waste crawl budget.
  • Never include noindex, 4xx, or redirect URLs in your sitemap — it sends Google mixed signals.

Clearing Up the Sitemap Myths

Two persistent myths about XML sitemaps waste developers' time. First: that carefully tuned <priority> and <changefreq> values improve your rankings. Google's John Mueller stated in a developer office hours session: "We generally ignore priority and changefreq values — we figure that out ourselves." Bing confirmed the same. Second: that submitting a sitemap guarantees indexing. Google's documentation is explicit: "Submitting a sitemap doesn't guarantee that all items in the sitemap will be crawled and indexed."

With those cleared: sitemaps are genuinely valuable for one specific problem — helping Google discover URLs it might otherwise miss. The impact is highest for new sites with few inbound links, sites with deep page hierarchies, and programmatic SEO implementations that generate thousands of pages.

Google's index spans hundreds of billions of web pages, per its own "How Search Works" documentation. Sitemaps are one of the signals — alongside internal links, PageRank, and fetch frequency — that determine crawl prioritization.

XML Sitemap Anatomy

A valid XML sitemap follows the Sitemap Protocol 0.9 specification (sitemaps.org), which was jointly established by Google, Yahoo, and Microsoft in 2006 and has not materially changed since. Here is a minimal correct sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-04-15</lastmod>
  </url>

  <url>
    <loc>https://example.com/about/</loc>
    <lastmod>2026-03-01</lastmod>
  </url>

  <url>
    <loc>https://example.com/blog/my-article/</loc>
    <lastmod>2026-04-20</lastmod>
  </url>

</urlset>
| Element | Required | Google Uses It | Notes |
| --- | --- | --- | --- |
| <loc> | Yes | Yes | Absolute URL; must match canonical tag exactly |
| <lastmod> | Recommended | Yes (if accurate) | ISO 8601 format. Lying here devalues the signal for all pages. |
| <changefreq> | Optional | No | Ignored by Google and Bing. Skip it. |
| <priority> | Optional | No | Ignored by Google and Bing. Skip it. |

The <lastmod> date is only useful if it reflects the actual last modification date of the content. Sites that set all <lastmod> values to the current date on every sitemap regeneration train Google to ignore the field entirely for their domain. Use real timestamps from your CMS or file system.
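As a sketch of that advice, a lastmod can be derived from a file's real modification time instead of the regeneration date (Python here; the path is whatever your build has on disk):

```python
from datetime import datetime, timezone
from pathlib import Path

def lastmod_from_file(path: str) -> str:
    """Derive an ISO 8601 <lastmod> date from a file's modification time."""
    mtime = Path(path).stat().st_mtime
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime('%Y-%m-%d')
```

For CMS-backed pages, the equivalent is reading the row's updatedAt column rather than stamping the current date.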

Sitemap Index Files for Large Sites

Once you exceed 50,000 URLs or 50MB, you need a sitemap index — a file that lists other sitemaps. This is also the standard pattern for organizing sitemaps by content type: one for blog posts, one for product pages, one for image assets.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

  <!-- Blog posts -->
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-04-28</lastmod>
  </sitemap>

  <!-- Product pages (e-commerce) -->
  <sitemap>
    <loc>https://example.com/sitemap-products-1.xml</loc>
    <lastmod>2026-04-28</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products-2.xml</loc>
    <lastmod>2026-04-27</lastmod>
  </sitemap>

  <!-- Static pages -->
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-04-01</lastmod>
  </sitemap>

</sitemapindex>

Separating by content type has a practical benefit: Google Search Console reports coverage statistics per sitemap file. If your product pages have indexing issues but your blog posts are fine, separate sitemaps let you diagnose this at a glance rather than hunting through a single 50,000-URL file.
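For build scripts outside a framework, an index file like the one above can be rendered in a few lines — a minimal Python sketch, assuming each entry is a dict with a loc and an optional lastmod:

```python
def build_sitemap_index(sitemaps: list[dict]) -> str:
    """Render a <sitemapindex> pointing at per-section sitemap files."""
    items = []
    for sm in sitemaps:
        lastmod = f"\n    <lastmod>{sm['lastmod']}</lastmod>" if 'lastmod' in sm else ''
        items.append(f"  <sitemap>\n    <loc>{sm['loc']}</loc>{lastmod}\n  </sitemap>")
    return '\n'.join([
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
        *items,
        '</sitemapindex>',
    ])
```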

Generating Sitemaps Programmatically

Next.js App Router (Built-In)

Next.js 13.3+ has native sitemap generation via a sitemap.ts file at the app root. It generates and serves /sitemap.xml automatically at build time (static) or per request (dynamic).

// app/sitemap.ts
import { MetadataRoute } from 'next'

const BASE_URL = 'https://yourdomain.com'

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  // Static pages
  const staticPages: MetadataRoute.Sitemap = [
    { url: `${BASE_URL}/`, lastModified: new Date('2026-04-01') },
    { url: `${BASE_URL}/about/`, lastModified: new Date('2026-03-15') },
    { url: `${BASE_URL}/tools/`, lastModified: new Date('2026-04-20') },
  ]

  // Dynamic blog posts from your CMS/database
  const posts = await fetch('https://your-cms.com/api/posts')
    .then(r => r.json())

  const blogPages: MetadataRoute.Sitemap = posts.map((post: {
    slug: string
    updatedAt: string
  }) => ({
    url: `${BASE_URL}/blog/${post.slug}/`,
    lastModified: new Date(post.updatedAt),
  }))

  return [...staticPages, ...blogPages]
}

// Next.js outputs:
// <url><loc>https://yourdomain.com/</loc><lastmod>2026-04-01T00:00:00.000Z</lastmod></url>
// etc.

Next.js Multiple Sitemaps (Large Sites)

// app/sitemap.ts — multiple sitemaps for 100,000+ URL sites
import { MetadataRoute } from 'next'

// Next.js supports generateSitemaps() for multiple files
export async function generateSitemaps() {
  const totalPosts = await getPostCount()       // e.g., 120,000
  const URLS_PER_SITEMAP = 50_000

  return Array.from(
    { length: Math.ceil(totalPosts / URLS_PER_SITEMAP) },
    (_, i) => ({ id: i })
  )
}

export default async function sitemap({ id }: { id: number }): Promise<MetadataRoute.Sitemap> {
  const PAGE_SIZE = 50_000
  const posts = await getPosts({ offset: id * PAGE_SIZE, limit: PAGE_SIZE })

  return posts.map(post => ({
    url: `https://yourdomain.com/blog/${post.slug}/`,
    lastModified: new Date(post.updatedAt),
  }))
}
// Generates: /sitemap/0.xml, /sitemap/1.xml, /sitemap/2.xml
// Note: Next.js does not emit an index file for these — submit each
// generated sitemap (or hand-write a sitemap index) yourself

Node.js Script (Framework-Agnostic)

// scripts/generate-sitemap.ts
import { writeFileSync } from 'fs'

interface SitemapEntry {
  url: string
  lastmod?: string
}

function buildSitemap(entries: SitemapEntry[]): string {
  const urls = entries.map(({ url, lastmod }) => {
    const lastmodTag = lastmod ? `
    <lastmod>${lastmod}</lastmod>` : ''
    return `  <url>
    <loc>${url}</loc>${lastmodTag}
  </url>`
  }).join('\n')

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    urls,
    '</urlset>',
  ].join('\n')
}

const entries: SitemapEntry[] = [
  { url: 'https://example.com/', lastmod: '2026-04-28' },
  { url: 'https://example.com/blog/', lastmod: '2026-04-28' },
  // ... fetch from DB
]

const xml = buildSitemap(entries)
writeFileSync('public/sitemap.xml', xml, 'utf-8')
console.log(`Generated sitemap with ${entries.length} URLs`)

Python Script

# generate_sitemap.py
from xml.etree.ElementTree import Element, SubElement, tostring
from xml.dom.minidom import parseString
from datetime import date

def build_sitemap(urls: list[dict]) -> str:
    root = Element('urlset', xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')

    for entry in urls:
        url_el = SubElement(root, 'url')
        SubElement(url_el, 'loc').text = entry['url']
        if 'lastmod' in entry:
            SubElement(url_el, 'lastmod').text = entry['lastmod']

    raw = tostring(root, encoding='unicode', xml_declaration=True)
    return parseString(raw).toprettyxml(indent='  ')

urls = [
    {'url': 'https://example.com/', 'lastmod': str(date.today())},
    {'url': 'https://example.com/about/', 'lastmod': '2026-03-01'},
]

with open('sitemap.xml', 'w', encoding='utf-8') as f:
    f.write(build_sitemap(urls))

print(f'Generated sitemap with {len(urls)} URLs')
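To stay under the 50,000-URL cap, the same script can split its URL list into sitemap-sized chunks before writing one file per chunk — a small helper sketch:

```python
MAX_URLS_PER_SITEMAP = 50_000  # hard cap from the Sitemap protocol

def chunk_entries(entries: list[dict], size: int = MAX_URLS_PER_SITEMAP) -> list[list[dict]]:
    """Split a flat URL list into sitemap-sized chunks, one output file per chunk."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]
```

Each chunk becomes sitemap-1.xml, sitemap-2.xml, and so on, with a sitemap index referencing them all.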

Submitting Your Sitemap to Google

There are two submission channels, and you should use both:

1. Google Search Console

  1. Log in to Google Search Console and select your property.
  2. Navigate to Indexing → Sitemaps in the left sidebar.
  3. Enter the path to your sitemap (e.g., sitemap.xml) and click Submit.
  4. GSC will show the discovery status, last read date, and URL count vs. indexed URL count.

The Coverage report in GSC will show how many URLs from your sitemap are indexed, excluded, or have errors — this is the single most useful diagnostic tool for indexing issues. A high "Discovered – currently not indexed" count often signals thin content or low-quality pages rather than a sitemap problem.

2. robots.txt Directive

# robots.txt — add the Sitemap directive
User-agent: *
Disallow: /admin/
Disallow: /api/

# Sitemap declaration (Googlebot discovers this automatically)
Sitemap: https://yourdomain.com/sitemap.xml

# For multiple sitemaps, list each one
Sitemap: https://yourdomain.com/sitemap-blog.xml
Sitemap: https://yourdomain.com/sitemap-products.xml

The Sitemap: directive in robots.txt is supported by Google, Bing, and Yandex. Googlebot re-fetches robots.txt frequently (it caches the file for up to roughly 24 hours), so new sitemap files are typically discovered within a day of being listed here — no manual GSC submission required for routine updates.
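A quick way to sanity-check this wiring is to parse your robots.txt for Sitemap: lines — a minimal sketch (matching is case-insensitive, mirroring how crawlers treat the directive):

```python
def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Extract every Sitemap: directive value from a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" is preserved
        key, _, value = line.partition(':')
        if key.strip().lower() == 'sitemap':
            urls.append(value.strip())
    return urls
```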

Common Sitemap Mistakes (And Fixes)

| Mistake | Why It's Wrong | Fix |
| --- | --- | --- |
| Including noindex pages | Contradicts the noindex directive — you are asking Google to both index and not index the page simultaneously | Exclude all pages with canonical pointing elsewhere, noindex tags, or password protection |
| URL mismatch (www vs non-www) | If your canonical says https://example.com but the sitemap says https://www.example.com, Google sees a signal conflict | Sitemap URLs must exactly match your canonical tags — same protocol, www/non-www, and trailing slash |
| Inaccurate lastmod dates | Setting all pages to today degrades the signal for your entire domain — Google stops trusting your lastmod | Use real database updatedAt timestamps; only update lastmod when content actually changes |
| Including redirecting URLs | 301-redirected URLs waste crawl budget and confuse the canonical target | Only include the final destination URL after all redirects resolve |
| Relative URLs | /blog/post is invalid — the spec requires absolute URLs | Always use full absolute URLs including protocol: https://example.com/blog/post/ |
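The relative-URL mistake in particular is easy to catch mechanically: parse the sitemap and flag any <loc> without a scheme and host. A sketch using only the Python standard library:

```python
from urllib.parse import urlparse
from xml.etree import ElementTree as ET

SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def find_relative_locs(sitemap_xml: str) -> list[str]:
    """Return <loc> values that are not absolute URLs — each one is a spec violation."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text or '' for el in root.findall('.//sm:loc', SITEMAP_NS)]
    return [u for u in locs if not (urlparse(u).scheme and urlparse(u).netloc)]
```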

Sitemap Generator Tools Compared

For most development teams, programmatic generation is the right approach — any tool that crawls a live site and exports a sitemap is building on an unstable foundation (it only captures what it can find). But for small static sites, generator tools are perfectly adequate.

| Tool | Type | Best For | Cost |
| --- | --- | --- | --- |
| next-sitemap | npm package | Next.js (Pages Router) | Free |
| Next.js built-in | Framework native | Next.js App Router | Free |
| Yoast SEO | WordPress plugin | WordPress sites | Free (premium addon) |
| Screaming Frog | Desktop crawler | Auditing existing sites | Free up to 500 URLs |
| xml-sitemaps.com | Online crawler | Simple static sites | Free up to 500 URLs |
| sitemap npm package | npm package | Node.js / Express apps | Free |

Screaming Frog's free tier (500 URL limit) is excellent for auditing — it finds orphaned pages, checks canonical consistency, detects broken internal links, and validates your existing sitemap against what's actually on the site. For large sites, the £209/year license pays for itself in crawl time saved.

Validating Your Sitemap

Before submitting, validate with these five checks:

# 1. Check that sitemap.xml is accessible and well-formed
curl -s https://yourdomain.com/sitemap.xml | head -20

# 2. Validate XML structure (xmllint must be installed)
curl -s https://yourdomain.com/sitemap.xml | xmllint --noout -
# No output = valid XML

# 3. Check URL count
curl -s https://yourdomain.com/sitemap.xml | grep -c '<loc>'
# Should be ≤ 50,000

# 4. Check that robots.txt allows Googlebot to access it
curl -A "Googlebot" https://yourdomain.com/robots.txt | grep -i sitemap

# 5. Verify no noindex pages are in your sitemap
# Fetch each URL and check for a noindex meta tag — catches mismatches

The most important validation is checking for canonical mismatches: every URL in your sitemap should have a <link rel="canonical"> tag pointing to itself (self-referencing canonical). If the canonical points elsewhere, the page should not be in your sitemap. You can build this check with a simple crawl script that fetches each URL in your sitemap and parses the canonical tag — use our JSON Formatter to inspect the data output.
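A minimal sketch of the canonical-extraction half of that script, using only the standard library (fetching each URL is left to your HTTP client of choice):

```python
from html.parser import HTMLParser

class CanonicalParser(HTMLParser):
    """Capture the href of a <link rel="canonical"> tag while parsing a page."""
    def __init__(self) -> None:
        super().__init__()
        self.canonical: str | None = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == 'link' and (a.get('rel') or '').lower() == 'canonical':
            self.canonical = a.get('href')

def has_self_canonical(page_html: str, url: str) -> bool:
    """True when the page's canonical points back at the sitemap URL itself."""
    parser = CanonicalParser()
    parser.feed(page_html)
    return parser.canonical == url
```

Any sitemap URL for which this returns False should either be removed from the sitemap or have its canonical tag corrected.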

Specialized Sitemap Extensions

The base sitemap format can be extended with namespace declarations for media content:

Image Sitemaps

Image sitemaps help Google discover images embedded in JavaScript or CSS that its crawler might miss. Per Google's documentation, image sitemaps are particularly important for e-commerce product images and photography portfolios.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/product/widget/</loc>
    <image:image>
      <image:loc>https://example.com/images/widget-front.jpg</image:loc>
      <image:title>Blue Widget - Front View</image:title>
    </image:image>
    <image:image>
      <image:loc>https://example.com/images/widget-back.jpg</image:loc>
      <image:title>Blue Widget - Back View</image:title>
    </image:image>
  </url>
</urlset>

News Sitemaps

Google News requires a separate news sitemap to surface articles in the Top Stories carousel. News sitemaps should only include articles published in the last 48 hours — older articles are dropped from the news index regardless. You must be approved for Google News before this matters.
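If you generate a news sitemap from a CMS feed, the 48-hour window can be enforced before rendering — a sketch assuming each article carries a timezone-aware published_at timestamp (a hypothetical field name):

```python
from datetime import datetime, timedelta, timezone

def recent_articles(articles: list[dict], hours: int = 48) -> list[dict]:
    """Keep only articles inside the Google News sitemap window (default 48h)."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    return [a for a in articles if a['published_at'] >= cutoff]
```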

Frequently Asked Questions

Do I need a sitemap for SEO?
Google says sitemaps benefit sites with more than a few hundred pages, poor internal linking, or new sites with few external links. Small, well-linked sites often get discovered fine through crawling. But for new sites or programmatic SEO with thousands of pages, a sitemap meaningfully speeds up indexing discovery.
Does Google use priority and changefreq in XML sitemaps?
No. Google ignores both elements. John Mueller confirmed this in multiple developer office hours sessions. The only fields that influence Google crawl behavior are <loc> and <lastmod> (when accurate). Do not waste time crafting priority scores.
What is the maximum size of an XML sitemap?
Per the Sitemap protocol, a single sitemap file cannot exceed 50MB (uncompressed) and cannot contain more than 50,000 URLs. If you exceed either limit, split into multiple sitemaps and create a sitemap index file that references each one.
How do I submit a sitemap to Google?
Via Google Search Console: navigate to Indexing → Sitemaps, paste your sitemap URL, and click Submit. Also add a Sitemap directive to your robots.txt: "Sitemap: https://yourdomain.com/sitemap.xml". Google discovers the sitemap via robots.txt on its next crawl.
Should I include noindex pages in my sitemap?
No. Including a noindex page in your sitemap creates a contradiction — you are simultaneously asking Google to index and not index the page. Google will typically respect the noindex tag, but the conflict wastes crawl budget. Only include pages you want indexed.
How often should I update my sitemap?
Regenerate whenever you publish new pages, significantly update existing ones, or delete/redirect content. Use accurate ISO 8601 lastmod timestamps from your database. Setting all pages to today's date devalues the lastmod signal for your entire domain.

Validate Your XML and URLs

Building or debugging a sitemap means working with XML and URLs constantly. Use BytePane's tools to format, validate, and encode without leaving your browser.
