How to Identify Pages Missing from Your Sitemap

A sitemap is not just a formality. For new pages, niche content with few internal links, and large sites where Googlebot might not naturally discover every corner through link-following, the sitemap is often the primary route to indexing.

When pages are missing from your sitemap, indexing becomes slower and less reliable. Google will still find most pages eventually through internal links - but “eventually” on a large or frequently updated site might mean weeks rather than days.

The other side of the problem is equally common: sitemaps that contain URLs they should not - 404 pages, redirect URLs, noindex pages, or staging URLs that were included by mistake. Google treats a sitemap it cannot trust as less valuable, which can slow down the processing of the legitimate entries.

A clean, complete sitemap is a reliable signal. An incomplete or contaminated sitemap is a missed opportunity at best and a crawl budget problem at worst.


What a Sitemap Should and Should Not Contain

Before auditing for gaps, it helps to have a clear definition of what belongs.

Should be in the sitemap:

  • Live pages returning HTTP 200
  • Pages you want Google to index
  • Canonical pages (not duplicate versions)
  • The most important pages across your site

Should NOT be in the sitemap:

  • Pages returning 301, 302, 404, or 410
  • Pages with a noindex meta tag or a noindex X-Robots-Tag header
  • Paginated pages (page 2, page 3, etc.) - debated, but generally not worth including unless the paginated content is unique
  • Duplicate pages (use canonical tags for these instead)
  • Admin, login, or internal tool pages
  • Staging or development URLs

Any URL in your sitemap that falls into the “should not” category is an error. Each one erodes a little of Google’s trust in your sitemap.
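Staging or development URLs are often the easiest "should not" entries to spot, because their hostname differs from the production one. The following is a minimal sketch, assuming you already have the sitemap URLs as a plain list with one absolute URL per line; the function name off_host_urls is made up for this example.

```shell
# off_host_urls HOST: read absolute URLs on stdin and print every URL
# whose hostname is not HOST (hypothetical helper, not part of any tool).
off_host_urls() {
  # Splitting "https://host/path" on "/" puts the hostname in field 3.
  awk -F'/' -v host="$1" 'NF >= 3 && tolower($3) != tolower(host) { print }'
}
```

For example, `off_host_urls yoursite.com < sitemap_urls.txt` would print entries such as https://staging.yoursite.com/page. Note that it also flags www/non-www mismatches, which are usually worth a look anyway.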


Step 1: Crawl Your Site for a Complete URL List

Run a full-site crawl starting from your homepage and following all internal links. The output is every page that is currently live and reachable through internal links.

Using redCacti: Add your site and run a crawl. Export the pages report as CSV. Filter for pages returning status code 200.

Using Screaming Frog: Standard crawl from homepage. Export Internal URLs, filter by Status Code 200.

What to look for in your crawl results: Pages that return 200 and are fully live on your site. These are your candidate sitemap entries.
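If you prefer the command line to a spreadsheet filter, the 200-only list can be pulled from the CSV export with awk. This is a rough sketch under stated assumptions: the export is a simple CSV without quoted commas, and it has a header row with columns named url and status_code. Both column names are hypothetical, so adjust them to whatever your crawler actually exports.

```shell
# filter_live_urls FILE: print the unique URLs whose status code is 200.
# Assumes a header row with columns "url" and "status_code" (hypothetical
# names) and no commas inside quoted fields.
filter_live_urls() {
  awk -F',' '
    { sub(/\r$/, "") }                # tolerate Windows line endings
    NR == 1 {                         # locate the two columns by header name
      for (i = 1; i <= NF; i++) {
        if ($i == "url") u = i
        if ($i == "status_code") s = i
      }
      next
    }
    u && s && $s == 200 { print $u }  # keep only live pages
  ' "$1" | sort -u
}
```

Running `filter_live_urls crawl_export.csv > crawl_urls.txt` (file names are placeholders) leaves you with one candidate URL per line.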


Step 2: Export Your Current Sitemap

Fetch all URLs currently in your sitemap for comparison.

Finding your sitemap:

Most sites publish their sitemap at yoursite.com/sitemap.xml. If it is not there, look for a Sitemap: line in yoursite.com/robots.txt. If the file is a sitemap index (it contains references to other sitemaps rather than URLs directly), you need to fetch all child sitemaps and combine their URL lists.

Extracting the URL list:

# Simple extraction from a standard sitemap
# (-P needs GNU grep; -L makes curl follow redirects)
curl -sL https://yoursite.com/sitemap.xml | grep -oP '(?<=<loc>)[^<]+'

# For a sitemap index: the first grep lists the child sitemap URLs,
# then the loop fetches each child and prints the page URLs it contains
curl -sL https://yoursite.com/sitemap.xml | grep -oP '(?<=<loc>)[^<]+' | while read -r url; do
  curl -sL "$url" | grep -oP '(?<=<loc>)[^<]+'
done

Or paste the sitemap URL into a sitemap parser tool to get a clean flat list. Export to a spreadsheet.


Step 3: Compare and Find the Gaps

You now have two lists. The comparison reveals two types of problems:

Gap Type 1 - Pages in crawl but not in sitemap: These are live pages that return 200 but that you have not included in your sitemap. Depending on the page, they may belong in the sitemap (valuable content not yet declared) or be intentionally excluded (noindex pages, admin pages).

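Gap Type 2 - Pages in sitemap but not returning 200: These are sitemap entries pointing to pages that are broken, redirecting, or no longer exist. These should be removed from the sitemap. A sitemap URL that is absent from the crawl is not automatically broken, though: it may be a live orphan page with no internal links pointing to it, so confirm its status code before removing it.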

In a spreadsheet:

URL            In Crawl?   In Sitemap?   Status
/blog/post-a   Yes         Yes           OK
/blog/post-b   Yes         No            Missing from sitemap
/old-page/     No          Yes           Broken - remove from sitemap

Use VLOOKUP or COUNTIF formulas to flag each URL automatically.
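The same comparison can be done without a spreadsheet using sort and comm. This is a sketch that assumes both lists are plain text files with one URL per line and that the URLs are written the same way in both (same protocol, same trailing-slash convention); otherwise identical pages will show up as differences. The function name and output file names are illustrative only.

```shell
# compare_lists CRAWL_FILE SITEMAP_FILE OUT_DIR
# Writes missing_from_sitemap.txt and not_in_crawl.txt into OUT_DIR
# (file names chosen for this example).
compare_lists() {
  crawl_sorted=$(mktemp)
  sitemap_sorted=$(mktemp)
  # comm needs both inputs sorted identically; LC_ALL=C keeps the order stable
  LC_ALL=C sort -u "$1" > "$crawl_sorted"
  LC_ALL=C sort -u "$2" > "$sitemap_sorted"
  # -23 keeps lines unique to the first file: crawled but not in the sitemap
  LC_ALL=C comm -23 "$crawl_sorted" "$sitemap_sorted" > "$3/missing_from_sitemap.txt"
  # -13 keeps lines unique to the second file: in the sitemap but not crawled
  LC_ALL=C comm -13 "$crawl_sorted" "$sitemap_sorted" > "$3/not_in_crawl.txt"
  rm -f "$crawl_sorted" "$sitemap_sorted"
}
```

With the three example rows above, missing_from_sitemap.txt would contain /blog/post-b and not_in_crawl.txt would contain /old-page/.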

In redCacti: The orphan pages and sitemap audit reports surface this comparison automatically, flagging pages that exist but are missing from the sitemap and sitemap entries that return errors.
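To check sitemap entries for errors by hand, you can request each URL with curl and look at the status code. The sketch below is one possible approach, not a description of how any particular tool works; the function names and the action labels are invented for the example. Redirects are deliberately not followed, so a redirecting entry is reported as 301 or 302 rather than as the 200 of its destination.

```shell
# classify_status CODE: map an HTTP status code to a suggested sitemap action.
classify_status() {
  case "$1" in
    200)             echo "keep" ;;
    301|302|307|308) echo "replace with final destination" ;;
    404|410)         echo "remove" ;;
    *)               echo "investigate" ;;   # includes 000 (request failed)
  esac
}

# check_urls: read URLs on stdin, print "code<TAB>action<TAB>url" for each.
check_urls() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    # -o /dev/null discards the body; -w prints only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 15 "$url")
    printf '%s\t%s\t%s\n' "$code" "$(classify_status "$code")" "$url"
  done
}
```

A typical run would be `check_urls < sitemap_urls.txt`, where sitemap_urls.txt is a placeholder for your extracted sitemap list. On large sitemaps, consider adding a short sleep between requests so the check does not hammer your own server.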


Step 4: Decide Which Missing Pages Should Be Added

Not every page that appears in your crawl results but is absent from your sitemap needs to be added. Apply these filters:

Add to sitemap if:

  • The page returns 200 and has no noindex tag
  • The page has unique, indexable content
  • It is a canonical URL (not a duplicate or parameter variant)
  • It is content you want Google to find and index

Do not add to sitemap if:

  • The page has a noindex meta tag (pointless to declare a page you are asking Google to ignore)
  • The page is a redirect (include only the final destination)
  • The page is a parameter or filter variant of another canonical page
  • The page is administrative, a login page, or otherwise not for search indexing
  • The page is intentionally excluded from search (password protected, internal only)

A page that passes all the above criteria but is missing from your sitemap is a genuine gap worth fixing.
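The noindex criterion is the one most easily missed by eye, so it can help to test for it mechanically. The helpers below are a rough sketch: they use pattern matching rather than a real HTML parser, and they assume the robots meta tag sits on a single line with name= before content=. Both function names are hypothetical.

```shell
# has_noindex_meta: read HTML on stdin; exit 0 if a robots/googlebot meta
# tag with "noindex" in its content attribute is found.
has_noindex_meta() {
  grep -qiE "<meta[^>]*name=[\"']?(robots|googlebot)[\"']?[^>]*content=[\"'][^\"'>]*noindex"
}

# has_noindex_header: read raw response headers on stdin (for example the
# output of curl -sI); exit 0 if an X-Robots-Tag header mentions noindex.
has_noindex_header() {
  grep -qiE '^x-robots-tag:.*noindex'
}
```

For a single page, `curl -s https://yoursite.com/some-page | has_noindex_meta && echo "noindex: leave out of sitemap"` would flag it. Because this is only a text match, treat a negative result as "probably fine" rather than proof, especially on pages that inject meta tags with JavaScript.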


Common Reasons Pages Go Missing from Sitemaps

Auto-generated sitemaps that exclude new content: WordPress and Webflow generate sitemaps automatically, but some plugins exclude certain post types, custom taxonomies, or recently published posts by mistake. If you published a post and it is not in the sitemap within 24-48 hours, check your sitemap plugin settings.

Manual sitemaps that were not updated: If your sitemap is maintained by hand or generated by a script that is not run regularly, new pages will be missing. Sitemaps need to be regenerated whenever new content is published.

Page saved as a draft, then published, but sitemap not updated: Some CMS setups only regenerate the sitemap on a scheduled basis rather than immediately on publish. Check whether there is a delay between content going live and appearing in the sitemap.

Pages excluded by noindex that should not be: A bulk noindex applied to a content type or category can accidentally exclude good content from both the index and the sitemap. Check whether any pages you want indexed have accidentally inherited a noindex tag from a category or template setting.

Pages that exist only in JavaScript-rendered content: If your site uses JavaScript rendering and some pages are only reachable through JS-rendered navigation rather than HTML links, some crawlers may miss them. These pages may not appear in a standard HTML crawl and may never be in the sitemap if it was generated by a crawler that did not execute JavaScript.


Step 5: Update the Sitemap

After identifying gaps, update the sitemap to include missing valuable pages and remove incorrect entries.

On WordPress: Most WordPress sites use Yoast SEO or Rank Math to generate the sitemap. Check the plugin’s sitemap settings to ensure all relevant post types and taxonomies are included. If your plugin offers a “Regenerate” or “Rebuild” option, use it to force an immediate sitemap refresh; otherwise, switching the sitemap feature off and on again usually triggers a rebuild.

On Webflow: Webflow auto-generates a sitemap from all published pages. If a page is missing, check whether it was published correctly and whether the “Exclude from sitemap” option is toggled in Page Settings.

On Squarespace: Squarespace manages the sitemap automatically. If pages are missing, confirm they are published (not in draft mode) and not password-protected.

For static sites or custom implementations: If you manage your sitemap manually or with a build script, add the missing URLs and regenerate. Ensure the sitemap generation runs automatically as part of your build/deploy process.

Format reminders for a clean sitemap:

  • Each <url> entry should include at minimum the <loc> tag
  • <lastmod> is optional but useful for signalling recent updates
  • <priority> and <changefreq> are largely ignored by Google - skip them
  • Sitemaps are limited to 50,000 URLs and 50 MB (uncompressed) per file; use a sitemap index for larger sites
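For a static site or a custom build script, the format above is simple enough to generate from a plain URL list. The following is a minimal sketch, assuming the input is one absolute URL per line; it emits only the <loc> tag and escapes the characters that are special in XML. The function name build_sitemap is made up for this example.

```shell
# build_sitemap: read absolute URLs on stdin, write sitemap XML to stdout.
build_sitemap() {
  printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>'
  printf '%s\n' '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  # Drop empty lines, escape & < > (ampersand first), then wrap each URL
  sed -e '/^$/d' \
      -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' \
      -e 's|.*|  <url><loc>&</loc></url>|'
  printf '%s\n' '</urlset>'
}
```

Usage would look like `build_sitemap < urls.txt > sitemap.xml`. For lists longer than 50,000 URLs, one option is to cut the list into chunks first with `split -l 50000 urls.txt part_`, build one file per chunk, and reference the resulting files from a sitemap index.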


Step 6: Submit the Updated Sitemap to Google Search Console (GSC)

After updating, tell Google the sitemap has changed.

  1. Go to Google Search Console -> Sitemaps
  2. If your sitemap is already listed, click “Resubmit”
  3. If it is not listed, enter the sitemap URL and click Submit

GSC will process the updated sitemap and begin scheduling crawls for newly added URLs. Check back in a few days to see whether the new URLs show as “Submitted and indexed” or are pending.


Sitemap Audit Checklist

Finding gaps:

  • Run full-site crawl and export 200-status URLs
  • Export all URLs from sitemap.xml (and child sitemaps)
  • Compare lists - find pages in crawl but not sitemap (missing) and pages in sitemap not returning 200 (errors)

Deciding what to add:

  • Filter missing pages: has noindex? is it a duplicate? is it admin/internal?
  • Flag remaining missing pages as sitemap additions

Fixing the sitemap:

  • Add missing valuable pages to sitemap (via CMS settings or manual update)
  • Remove 301, 404, noindex pages from sitemap
  • Remove staging/development URLs if present
  • Regenerate sitemap

Notifying Google:

  • Submit updated sitemap URL in GSC
  • Confirm new URLs appear in GSC sitemap report within 48-72 hours


A sitemap audit takes less than an hour and tends to surface a handful of easy fixes with real upside: pages that should be indexed getting submitted to Google, and broken sitemap entries getting cleaned up.

For actively publishing sites, it is the kind of quick-win maintenance that compounds over time.





redCacti Team
