Sitemap Generator: Create XML Sitemaps for Your Website
Key Takeaways
- ▸Google ignores <priority> and <changefreq> entirely. Only <loc> and <lastmod> influence crawl behavior.
- ▸A single sitemap file caps at 50,000 URLs and 50MB. Larger sites need a sitemap index file.
- ▸Research cited by SEO practitioners suggests new pages appear in Google 25-40% faster when submitted via an accurate sitemap.
- ▸The canonical URL in the sitemap must exactly match the canonical tag on the page — mismatches waste crawl budget.
- ▸Never include noindex, 4xx, or redirect URLs in your sitemap — it sends Google mixed signals.
Clearing Up the Sitemap Myths
Two persistent myths about XML sitemaps waste developers' time. First: that carefully tuned <priority> and <changefreq> values improve your rankings. Google's John Mueller stated in a developer office hours session: "We generally ignore priority and changefreq values — we figure that out ourselves." Bing confirmed the same. Second: that submitting a sitemap guarantees indexing. Google's documentation is explicit: "Submitting a sitemap doesn't guarantee that all items in the sitemap will be crawled and indexed."
With those cleared: sitemaps are genuinely valuable for one specific problem — helping Google discover URLs it might otherwise miss. The impact is highest for new sites with few inbound links, sites with deep page hierarchies, and programmatic SEO implementations that generate thousands of pages.
Google processes over 100 billion web pages per month in its crawl index, per internal Google Research publications. Sitemaps are one of the signals — alongside internal links, PageRank, and fetch frequency — that determine crawl prioritization.
XML Sitemap Anatomy
A valid XML sitemap follows the Sitemap Protocol 0.9 specification (sitemaps.org), which was jointly established by Google, Yahoo, and Microsoft in 2006 and has not materially changed since. Here is a minimal correct sitemap:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2026-04-15</lastmod>
</url>
<url>
<loc>https://example.com/about/</loc>
<lastmod>2026-03-01</lastmod>
</url>
<url>
<loc>https://example.com/blog/my-article/</loc>
<lastmod>2026-04-20</lastmod>
</url>
</urlset>

| Element | Required | Google Uses It | Notes |
|---|---|---|---|
| <loc> | Yes | Yes | Absolute URL, must match canonical tag exactly |
| <lastmod> | Recommended | Yes (if accurate) | ISO 8601 format. Lying here devalues the signal for all pages. |
| <changefreq> | Optional | No | Ignored by Google and Bing. Skip it. |
| <priority> | Optional | No | Ignored by Google and Bing. Skip it. |
The <lastmod> date is only useful if it reflects the actual last modification date of the content. Sites that set all <lastmod> values to the current date on every sitemap regeneration train Google to ignore the field entirely for their domain. Use real timestamps from your CMS or file system.
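If pages are built from files on disk, the file's mtime is one honest source for <lastmod>. A minimal TypeScript sketch, assuming a page-to-file mapping you define yourself (the function name is mine):

```typescript
import { statSync } from 'fs'

// Read the source file's actual modification time and format it
// as an ISO 8601 date suitable for <lastmod>.
function lastmodFromFile(filePath: string): string {
  const mtime = statSync(filePath).mtime
  return mtime.toISOString().split('T')[0] // e.g. "2026-04-15"
}
```

In a CMS-backed site the equivalent is the row's updatedAt column; either way, derive the value from real change events, never from the generation run.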
Sitemap Index Files for Large Sites
Once you exceed 50,000 URLs or 50MB, you need a sitemap index — a file that lists other sitemaps. This is also the standard pattern for organizing sitemaps by content type: one for blog posts, one for product pages, one for image assets.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<!-- Blog posts -->
<sitemap>
<loc>https://example.com/sitemap-blog.xml</loc>
<lastmod>2026-04-28</lastmod>
</sitemap>
<!-- Product pages (e-commerce) -->
<sitemap>
<loc>https://example.com/sitemap-products-1.xml</loc>
<lastmod>2026-04-28</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-products-2.xml</loc>
<lastmod>2026-04-27</lastmod>
</sitemap>
<!-- Static pages -->
<sitemap>
<loc>https://example.com/sitemap-pages.xml</loc>
<lastmod>2026-04-01</lastmod>
</sitemap>
</sitemapindex>

Separating by content type has a practical benefit: Google Search Console reports coverage statistics per sitemap file. If your product pages have indexing issues but your blog posts are fine, separate sitemaps let you diagnose this at a glance rather than hunting through a single 50,000-URL file.
Generating Sitemaps Programmatically
Next.js App Router (Built-In)
Next.js 13.3+ has native sitemap generation via a sitemap.ts file in the app directory. It generates and serves /sitemap.xml automatically at build time (static) or per request (dynamic).
// app/sitemap.ts
import { MetadataRoute } from 'next'
const BASE_URL = 'https://yourdomain.com'
export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
// Static pages
const staticPages: MetadataRoute.Sitemap = [
{ url: `${BASE_URL}/`, lastModified: new Date('2026-04-01') },
{ url: `${BASE_URL}/about/`, lastModified: new Date('2026-03-15') },
{ url: `${BASE_URL}/tools/`, lastModified: new Date('2026-04-20') },
]
// Dynamic blog posts from your CMS/database
const posts = await fetch('https://your-cms.com/api/posts')
.then(r => r.json())
const blogPages: MetadataRoute.Sitemap = posts.map((post: {
slug: string
updatedAt: string
}) => ({
url: `${BASE_URL}/blog/${post.slug}/`,
lastModified: new Date(post.updatedAt),
}))
return [...staticPages, ...blogPages]
}
// Next.js outputs:
// <url><loc>https://yourdomain.com/</loc><lastmod>2026-04-01T00:00:00.000Z</lastmod></url>
// etc.

Next.js Multiple Sitemaps (Large Sites)
// app/sitemap.ts — sitemap index for 100,000+ URL sites
import { MetadataRoute } from 'next'
// Next.js supports generateSitemaps() for multiple files
export async function generateSitemaps() {
const totalPosts = await getPostCount() // e.g., 120,000
const URLS_PER_SITEMAP = 50_000
return Array.from(
{ length: Math.ceil(totalPosts / URLS_PER_SITEMAP) },
(_, i) => ({ id: i })
)
}
export default async function sitemap({ id }: { id: number }): Promise<MetadataRoute.Sitemap> {
const PAGE_SIZE = 50_000
const posts = await getPosts({ offset: id * PAGE_SIZE, limit: PAGE_SIZE })
return posts.map(post => ({
url: `https://yourdomain.com/blog/${post.slug}/`,
lastModified: new Date(post.updatedAt),
}))
}
// Generates: /sitemap/0.xml, /sitemap/1.xml, /sitemap/2.xml
// Note: Next.js does not emit a sitemap index file automatically;
// submit each generated sitemap, or list them in robots.txt

Node.js Script (Framework-Agnostic)
// scripts/generate-sitemap.ts
import { writeFileSync } from 'fs'
interface SitemapEntry {
url: string
lastmod?: string
}
function buildSitemap(entries: SitemapEntry[]): string {
  // NB: escape &, <, > in URLs if they can occur
  const urls = entries.map(({ url, lastmod }) => {
    const lastmodTag = lastmod ? `\n    <lastmod>${lastmod}</lastmod>` : ''
    return `  <url>\n    <loc>${url}</loc>${lastmodTag}\n  </url>`
  }).join('\n')
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    urls,
    '</urlset>',
  ].join('\n')
}
const entries: SitemapEntry[] = [
{ url: 'https://example.com/', lastmod: '2026-04-28' },
{ url: 'https://example.com/blog/', lastmod: '2026-04-28' },
// ... fetch from DB
]
const xml = buildSitemap(entries)
writeFileSync('public/sitemap.xml', xml, 'utf-8')
console.log(`Generated sitemap with ${entries.length} URLs`)

Python Script
# generate_sitemap.py
from xml.etree.ElementTree import Element, SubElement, tostring
from xml.dom.minidom import parseString
from datetime import date
def build_sitemap(urls: list[dict]) -> str:
    root = Element('urlset', xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')
    for entry in urls:
        url_el = SubElement(root, 'url')
        SubElement(url_el, 'loc').text = entry['url']
        if 'lastmod' in entry:
            SubElement(url_el, 'lastmod').text = entry['lastmod']
    raw = tostring(root, encoding='unicode', xml_declaration=True)
    return parseString(raw).toprettyxml(indent='  ')

urls = [
    {'url': 'https://example.com/', 'lastmod': str(date.today())},
    {'url': 'https://example.com/about/', 'lastmod': '2026-03-01'},
]

with open('sitemap.xml', 'w') as f:
    f.write(build_sitemap(urls))
print(f'Generated sitemap with {len(urls)} URLs')

Submitting Your Sitemap to Google
There are two submission channels, and you should use both:
1. Google Search Console
- Log in to Google Search Console and select your property.
- Navigate to Indexing → Sitemaps in the left sidebar.
- Enter the path to your sitemap (e.g., sitemap.xml) and click Submit.
- GSC will show the discovery status, last read date, and URL count vs. indexed URL count.
The Coverage report in GSC will show how many URLs from your sitemap are indexed, excluded, or have errors; this is the single most useful diagnostic tool for indexing issues. A high "discovered but not indexed" count often signals thin content or low-quality pages rather than a sitemap problem.
2. robots.txt Directive
# robots.txt — add the Sitemap directive
User-agent: *
Disallow: /admin/
Disallow: /api/
# Sitemap declaration (Googlebot discovers this automatically)
Sitemap: https://yourdomain.com/sitemap.xml
# For multiple sitemaps, list each one
Sitemap: https://yourdomain.com/sitemap-blog.xml
Sitemap: https://yourdomain.com/sitemap-products.xml

The Sitemap: directive in robots.txt is supported by Google, Bing, and Yandex. Googlebot refetches robots.txt regularly (the cached copy is typically refreshed within about a day), so new sitemap files are discovered shortly after being listed here; no manual GSC submission is required for routine updates.
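Verifying which sitemaps a robots.txt actually declares is easy to script. A sketch (the function name is my own; it handles only well-formed directive lines):

```typescript
// Extract the URLs from Sitemap: lines in a robots.txt body.
// The directive is case-insensitive and may appear anywhere in the file.
function extractSitemaps(robotsTxt: string): string[] {
  return robotsTxt
    .split('\n')
    .map(line => line.trim())
    .filter(line => /^sitemap:/i.test(line))
    .map(line => line.slice(line.indexOf(':') + 1).trim())
}
```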
Common Sitemap Mistakes (And Fixes)
| Mistake | Why It's Wrong | Fix |
|---|---|---|
| Including noindex pages | Contradicts the noindex directive — you are asking Google to both index and not index the page simultaneously | Exclude all pages with canonical pointing elsewhere, noindex tags, or password protection |
| URL mismatch (www vs non-www) | If your canonical says https://example.com but sitemap says https://www.example.com, Google sees a signal conflict | Sitemap URLs must exactly match your canonical tags — same protocol, www/non-www, and trailing slash |
| Lying lastmod dates | Setting all pages to today degrades the signal for your entire domain — Google stops trusting your lastmod | Use real database updatedAt timestamps; only update lastmod when content actually changes |
| Including redirect chains | Including 301 redirected URLs wastes crawl budget and confuses the target | Only include the final destination URL after all redirects resolve |
| Relative URLs | /blog/post is invalid — the spec requires absolute URLs | Always use full absolute URLs including protocol: https://example.com/blog/post/ |
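The exact-match rule above can be checked mechanically before the sitemap ships. A sketch assuming one concrete policy (https, non-www, trailing slash); the function name and the policy itself are illustrative:

```typescript
// Reject sitemap URLs that deviate from the site's canonical URL policy.
// Policy assumed here: https scheme, non-www host, trailing slash on paths.
function matchesCanonicalPolicy(url: string): boolean {
  const u = new URL(url)
  if (u.protocol !== 'https:') return false
  if (u.hostname.startsWith('www.')) return false
  return u.pathname.endsWith('/')
}
```

Run it over every <loc> value at generation time and fail the build on the first mismatch.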
Sitemap Generator Tools Compared
For most development teams, programmatic generation is the right approach — any tool that crawls a live site and exports a sitemap is building on an unstable foundation (it only captures what it can find). But for small static sites, generator tools are perfectly adequate.
| Tool | Type | Best For | Cost |
|---|---|---|---|
| next-sitemap | npm package | Next.js (Pages Router) | Free |
| Next.js built-in | Framework native | Next.js App Router | Free |
| Yoast SEO | WordPress plugin | WordPress sites | Free (premium addon) |
| Screaming Frog | Desktop crawler | Auditing existing sites | Free up to 500 URLs |
| xml-sitemaps.com | Online crawler | Simple static sites | Free up to 500 URLs |
| sitemap npm package | npm package | Node.js / Express apps | Free |
Screaming Frog's free tier (500 URL limit) is excellent for auditing — it finds orphaned pages, checks canonical consistency, detects broken internal links, and validates your existing sitemap against what's actually on the site. For large sites, the £209/year license pays for itself in crawl time saved.
Validating Your Sitemap
Before submitting, run these checks:
# 1. Check that sitemap.xml is accessible and well-formed
curl -s https://yourdomain.com/sitemap.xml | head -20
# 2. Validate XML structure (xmllint must be installed)
curl -s https://yourdomain.com/sitemap.xml | xmllint --noout -
# No output = valid XML
# 3. Check URL count
curl -s https://yourdomain.com/sitemap.xml | grep -c '<loc>'
# Should be ≤ 50,000
# 4. Check that robots.txt allows Googlebot to access it
curl -A "Googlebot" https://yourdomain.com/robots.txt | grep -i sitemap
# 5. Verify no noindex pages are in your sitemap (Node.js)
# Fetch each URL and check for noindex meta tag — catches mismatches

The most important validation is checking for canonical mismatches: every URL in your sitemap should have a <link rel="canonical"> tag pointing to itself (self-referencing canonical). If the canonical points elsewhere, the page should not be in your sitemap. You can build this check with a simple crawl script that fetches each URL in your sitemap and parses the canonical tag — use our JSON Formatter to inspect the data output.
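A sketch of the canonical comparison itself, assuming you have already fetched the page HTML; the regex is deliberately naive (it expects rel before href) and the function name is mine:

```typescript
// Return true only if the page's canonical tag points at the sitemap URL.
// Naive regex parse: fine for a quick audit, not a substitute for an HTML parser.
function canonicalMatches(html: string, sitemapUrl: string): boolean {
  const m = html.match(/<link[^>]+rel=["']canonical["'][^>]+href=["']([^"']+)["']/i)
  if (!m) return false // no canonical tag found at all
  return m[1] === sitemapUrl
}
```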
Specialized Sitemap Extensions
The base sitemap format can be extended with namespace declarations for media content:
Image Sitemaps
Image sitemaps help Google discover images embedded in JavaScript or CSS that its crawler might miss. Per Google's documentation, image sitemaps are particularly important for e-commerce product images and photography portfolios.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>https://example.com/product/widget/</loc>
<image:image>
<image:loc>https://example.com/images/widget-front.jpg</image:loc>
<image:title>Blue Widget - Front View</image:title>
</image:image>
<image:image>
<image:loc>https://example.com/images/widget-back.jpg</image:loc>
<image:title>Blue Widget - Back View</image:title>
</image:image>
</url>
</urlset>

News Sitemaps
Google News requires a separate news sitemap to surface articles in the Top Stories carousel. News sitemaps should only include articles published in the last 48 hours — older articles are dropped from the news index regardless. You must be approved for Google News before this matters.
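A minimal news sitemap under the Google News sitemap namespace might look like this (publication name, URL, and dates are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/news/breaking-story/</loc>
    <news:news>
      <news:publication>
        <news:name>Example Times</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2026-04-28T09:30:00+00:00</news:publication_date>
      <news:title>Breaking Story Headline</news:title>
    </news:news>
  </url>
</urlset>
```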
Frequently Asked Questions
Do I need a sitemap for SEO?
Does Google use priority and changefreq in XML sitemaps?
What is the maximum size of an XML sitemap?
How do I submit a sitemap to Google?
Should I include noindex pages in my sitemap?
How often should I update my sitemap?
Validate Your XML and URLs
Building or debugging a sitemap means working with XML and URLs constantly. Use BytePane's tools to format, validate, and encode without leaving your browser.