1 Easy Way to Audit your XML Sitemap

How many times have you looked at an XML sitemap and said, “Why are these URLs in here?” One time, I was getting so frustrated doing an indexability audit for my client because Google wasn’t paying attention to canonicals for a variety of reasons. One in particular was the XML sitemap not pointing to the canonical, or clean version of the URL, but to the ugly URL with tracking parameters.

It raised the question, “If I tell Google of a page through the XML sitemap, and then immediately command that page to give credit to another version through the canonical, what will Google believe? If the site isn’t consistent, how is Google supposed to be consistent?”

Who uses a sitemap generator and calls it good? I know I did. Who audits that sitemap to make sure the generator picked only the good URLs? Hmm… How do you audit thousands and thousands of URLs from an XML sitemap?

In addition to this problem, earlier in March, SEOmoz did a Whiteboard Friday with Duane Forrester, the new senior product manager for Bing’s Webmaster Tools. About eight minutes in, Forrester talked about how important it is to Bing to have a clean, even “hyper clean” sitemap. An XML sitemap must pass a quality threshold for Bing to recognize, use, and trust it. This means that you don’t want 404 pages, 301 or 302 redirects, or URLs whose canonical points somewhere else.

Again, the question is: How do I easily audit the quality of my XML sitemap?

Well, after finally putting some serious thought into it, here is one of the easiest ways for anyone to audit their XML sitemap:

Parse the XML sitemap

There are 100 ways to skin a cat, so if you can write a quick script to do this, by all means, go ahead. For the non-nerdy type, there is a pretty simple process using Excel to get all of the URLs from your XML file.

  1. Go to the XML sitemap of your domain.
  2. Right click on the page and select “Save As…”
  3. Change the file extension from XML to HTML because Excel can parse and read HTML files only. The file should now be “sitemap.html”.
  4. Open the sitemap.html file with Excel. Note: You may get a message to “Open as Text” and that is OK.

Clean Up the Excel File

Here is the easiest way to strip everything else out of the file so you just have your URLs.

  1. Add a row above all the content.
  2. Apply filters to the top row of the sheet.
  3. We are only interested in the <loc> tag, so we will filter column C for only the <loc> tags.
  4. Copy and paste all the URLs to a new sheet (keeps it clean).
  5. Find and replace all the <loc> and </loc> tags with blanks. You now have a clean list of all the URLs within your sitemap.
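
For anyone who would rather script this step than use Excel, here is a minimal sketch in Python. It assumes a standard sitemap that uses the sitemaps.org namespace and lists pages as <url><loc> entries. The function name extract_urls and the sample URLs are made up for illustration, and a sitemap index file (one that lists other sitemaps) would need a different path expression.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return the text of every <loc> inside a <url> entry, in document order."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Made-up sample; in practice, read the contents of the saved sitemap file.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/page?utm_source=feed</loc></url>
</urlset>"""

for url in extract_urls(SAMPLE):
    print(url)
```

Because the parser works on the XML directly, there are no leftover tags to find and replace afterward.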

Audit the URLs

Now that you have a list of the URLs in your sitemap, it is time to audit them for 404s, 301 and 302 redirects and any canonical issues. I recently discovered this great tool for the job and I am so glad I did. The Screaming Frog SEO Spider tool has got to be one of the “must have” resources for an SEO. It is expensive, but well worth the price if you need to dig into a site (and it works for Mac and PC!). They do offer a free version that will allow you to crawl up to 500 URLs, which will work for this example. So now that we have a list we’re going to upload it to Screaming Frog’s SEO Spider.

  1. Export your list from Excel as a text file because the spider can only upload text files (It would be great if that changed ;-).
  2. Open up the SEO Spider and change from “Spider” mode to “List” mode.
  3. Upload your file and hit “Start.”
  4. PREPARE TO BE WOWED!
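
If your sitemap has more URLs than the free version will crawl, one option is to split the list into several text files and upload them one at a time. The sketch below assumes the 500-URL limit mentioned above applies to each crawl. The file naming pattern and the helper names chunk_urls and write_batches are my own, not anything the tool requires.

```python
from pathlib import Path

FREE_CRAWL_LIMIT = 500  # limit of the free version, as described in this post


def chunk_urls(urls, size=FREE_CRAWL_LIMIT):
    """Split a list of URLs into consecutive batches of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def write_batches(urls, out_dir, size=FREE_CRAWL_LIMIT):
    """Write each batch to its own one-URL-per-line text file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, batch in enumerate(chunk_urls(urls, size), start=1):
        path = out_dir / f"sitemap_urls_{number:03d}.txt"
        path.write_text("\n".join(batch) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Calling write_batches(urls, "batches") on a list of 1,201 URLs would produce three files holding 500, 500, and 201 URLs.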

The Spider will crawl through your list and return loads of valuable information (too much to explain in this post). For this exercise, it will list the HTTP status code for each URL. You can easily filter by status code to see 4XX type errors, 3XX redirects, server errors and more from your XML sitemap.
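
To make the status code check concrete, here is a rough Python sketch of the same idea. It is not how the Spider works internally. It simply sends a HEAD request to each URL without following redirects, so a 301 or 302 shows up as itself, and then groups the codes into buckets. The names classify_status, fetch_status and audit are hypothetical, and some servers answer HEAD requests differently than GET, so treat the results as a first pass.

```python
import http.client
from urllib.parse import urlsplit


def classify_status(code):
    """Bucket an HTTP status code the way a sitemap audit cares about it."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


def fetch_status(url, timeout=10):
    """Send one HEAD request without following redirects; return the status code."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def audit(urls, fetch=fetch_status):
    """Return (url, status, bucket) for every URL that is not a plain 2XX."""
    problems = []
    for url in urls:
        status = fetch(url)
        bucket = classify_status(status)
        if bucket != "ok":
            problems.append((url, status, bucket))
    return problems
```

The fetch parameter is there so the network call can be swapped out, for example with a dictionary lookup when testing the bucketing logic offline.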

The tool will also create a list of canonicals that each URL has. I can then go through these to correct any listed URLs that are different from the canonical I want to report to the search engines. To make it extra fluffy, the tool has the ability to export everything back to Excel where you can slice and dice your reports to your heart’s content.
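
The canonical check can be sketched the same way. The Python below pulls the first <link rel="canonical"> out of a page's HTML and flags any sitemap URL whose canonical points somewhere else. It assumes you already have the HTML for each URL, compares the strings exactly, and does not resolve relative hrefs. The names CanonicalFinder, find_canonical and canonical_mismatches are illustrative only.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def canonical_mismatches(pages):
    """pages maps sitemap URL -> HTML; return (url, canonical) pairs that differ."""
    mismatches = []
    for url, html in pages.items():
        canonical = find_canonical(html)
        if canonical is not None and canonical != url:
            mismatches.append((url, canonical))
    return mismatches
```

A sitemap entry such as a URL with tracking parameters whose page declares the clean URL as canonical would show up in the returned list, which is exactly the inconsistency described at the top of this post.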

This is one of the easiest ways I know to audit an XML sitemap. If you have any other ways to make this easier please feel free to chime in and share.


10 Comments

  1. Chris Kirkham says

    This is a great post. I especially like the random emoticon. I feel like this is reading like a spam comment so I’m going to stop typing now.

  2. Ian Spencer says

    Excellent post.

    We see many XML feeds that are really in desperate need of attention and I agree that you really should be doing that.

    Having an XML Sitemap that really is just a relevant sitemap and not just a list of crawled URL’s is crucial, as when you run automated programs you can end up with a lot of rubbish.

    Auditing / Editing really is important.

  3. dallmeier says

    This is the easiest way to audit your XML sitemap. This article seems very helpful. Thanks for the info, and keep sharing.

  4. Matt says

    I've tried this in Excel 2007 on a sitemap with no schema. Excel created a schema for me and prompted me to import as XML, and then the sitemap appeared already cleaned up and formatted as a table :) Looks like Microsoft have been busy…
    This tutorial was still really helpful though. I had no idea Excel could turn a sitemap into a neat list of URLs.

  5. Matt says

    Does anyone have a solution for importing sitemap.xml documents for international websites? For me, excel will not import an xml sitemap which contains foreign characters. A real pain!
