How many times have you looked at an XML sitemap and said, “Why are these URLs in here?” One time, I was getting so frustrated doing an indexability audit for my client because Google wasn’t paying attention to canonicals for a variety of reasons. One in particular was the XML sitemap not pointing to the canonical, or clean version of the URL, but to the ugly URL with tracking parameters.
It raised the question, “If I tell Google about a page through the XML sitemap, and then immediately tell that page to give credit to another version through the canonical, what will Google believe? If the site isn’t consistent, how is Google supposed to be consistent?”
Who uses a sitemap generator and calls it good? I know I did. Who audits that sitemap to make sure the generator picked only the good URLs? Hmm… How do you audit thousands and thousands of URLs from an XML sitemap?
In addition to this problem, earlier in March, SEOmoz did a Whiteboard Friday with Duane Forrester, the new senior product manager for Bing’s Webmaster Tools (embedded below). About eight minutes in, Forrester talked about how important it is to Bing to have a clean, even “hyper clean” sitemap. An XML sitemap must pass a quality threshold for Bing to recognize, use, and trust it. This means that you don’t want 404 pages, 301 or 302 redirects, or URLs with different canonicals.
Again, the question is: How do I easily audit the quality of my XML sitemap?
Well, after finally putting some serious thought into it, here is one of the easiest ways for anyone to audit their XML sitemap:
Parse the XML sitemap
There are 100 ways to skin a cat, so if you can write a quick script to do this, by all means, go ahead. For the non-nerdy type, there is a pretty simple process using Excel to get all of the URLs from your XML file.
- Go to the XML sitemap of your domain.
- Right click on the page and select “Save As…”
- Change the file extension from XML to HTML so that Excel will open the file and split it into columns. The file should now be “sitemap.html”.
- Open the sitemap.html file with Excel. Note: You may get a message to “Open as Text” and that is OK. It should look like this:
Clean Up the Excel File
Here is the easiest way to strip everything else out of the file so you just have your URLs.
- Add a row above all the content.
- Apply filters to the top row of the sheet.
- We are only interested in the <loc> tag, so we will filter column C for only the <loc> tags.
- Copy and paste all the URLs to a new sheet (keeps it clean).
- Find and replace all the <loc> and </loc> tags with blanks. You now have a clean list of all the URLs within your sitemap.
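For anyone who would rather take the quick-script route, here is a minimal Python sketch that pulls the same list of URLs out of a sitemap. It uses only the standard library. The function name `extract_urls` and the example.com URLs are made up for illustration, and it assumes a plain sitemap whose URLs sit in `<loc>` tags, with or without the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_urls(sitemap_xml: str) -> list:
    """Return the text of every <loc> element, in document order."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for element in root.iter():
        # Match <loc> whether or not the file declares the sitemap namespace.
        if element.tag in ("loc", SITEMAP_NS + "loc") and element.text:
            urls.append(element.text.strip())
    return urls


if __name__ == "__main__":
    # Hypothetical two-URL sitemap, standing in for your saved sitemap file.
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://www.example.com/</loc></url>"
        "<url><loc>https://www.example.com/page?utm_source=feed</loc></url>"
        "</urlset>"
    )
    for url in extract_urls(sample):
        print(url)
```

Printing one URL per line means you can redirect the output straight into a text file, which is the format needed for the audit step.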
Audit the URLs
Now that you have a list of the URLs in your sitemap, it is time to audit them for 404s, 301 and 302 redirects and any canonical issues. I recently discovered this great tool for the job and I am so glad I did. The Screaming Frog SEO Spider tool has got to be one of the “must have” resources for an SEO. It is expensive, but well worth the price if you need to dig into a site (and it works for Mac and PC!). They do offer a free version that will allow you to crawl up to 500 URLs, which will work for this example. So now that we have a list we’re going to upload it to Screaming Frog’s SEO Spider.
- Export your list from Excel as a text file, because the spider can only upload text files (it would be great if that changed).
- Open up the SEO Spider and change from “Spider” mode to “List” mode.
- Upload your file and hit “Start.”
- PREPARE TO BE WOWED!
The Spider will crawl through your list and return loads of valuable information (too much to explain in this post). For this exercise, it will list the HTTP status code for each URL. You can easily filter by status code to see 4XX errors, 3XX redirects, server errors and more from your XML sitemap.
The tool will also create a list of canonicals that each URL has. I can then go through these to correct any listed URLs that are different from the canonical I want to report to the search engines. To make it extra fluffy, the tool has the ability to export everything back to Excel where you can slice and dice your reports to your heart’s content.
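If you don’t have the Spider handy, the same two checks (status code and canonical) can be roughed out in Python. This is only a sketch of the idea, not how Screaming Frog works internally. The names `fetch`, `find_canonical` and `classify` are hypothetical, it uses only the standard library, and it assumes the canonical is declared in a `<link rel="canonical">` tag in the HTML rather than in an HTTP header.

```python
import sys
import urllib.error
import urllib.request
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so 301s and 302s show up as themselves."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3XX code


class CanonicalParser(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        if "canonical" in (attributes.get("rel") or "").lower().split():
            self.canonical = attributes.get("href")


def find_canonical(html: str) -> Optional[str]:
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


def fetch(url: str, timeout: float = 10):
    """Return (status, body); status is None if the server was unreachable."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:  # 3XX, 4XX and 5XX land here
        return error.code, ""
    except urllib.error.URLError:
        return None, ""


def classify(url: str, status: Optional[int], canonical: Optional[str]) -> str:
    """Label a sitemap URL the way the audit cares about."""
    if status is None:
        return "unreachable"
    if 300 <= status < 400:
        return "redirect"
    if status >= 400:
        return "error"
    # Resolve relative canonicals against the page URL before comparing.
    if canonical and urljoin(url, canonical) != url:
        return "canonical mismatch"
    return "ok"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python audit.py urls.txt  (one URL per line)
    with open(sys.argv[1], encoding="utf-8") as url_file:
        for line in url_file:
            url = line.strip()
            if url:
                status, body = fetch(url)
                print(url, status, classify(url, status, find_canonical(body)))
```

Anything that doesn’t come back as “ok” is a URL to fix or pull from the sitemap. The comparison is an exact string match, so differences like a trailing slash will be flagged, which is usually what you want for this kind of cleanup.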
This is one of the easiest ways I know to audit an XML sitemap. If you have any other ways to make this easier please feel free to chime in and share.