PERL Sitemap Generator for WordPress

I was hunting around for a Sitemap generator for use with a WordPress site and didn’t find a simple script whose behavior I could quickly verify and configure, so I decided to write one in PERL.  I wanted something I could easily integrate into a batch process for use in staging deployments.  What follows is what has worked for me so far. It is provided free and as-is, either for your own customization and use or for general review.

Limitations

  • The platform is PERL/DBI, MySQL (DBI and DBD driver for MySQL).
  • I’m running only one site, and I want to generate URLs for its Published Posts and Pages.
  • The latest Post is considered the highest priority.
  • I’m using Permalinks based on the %postname% in WordPress.

Firstly, let’s have a look at the base WordPress MySQL database query, which pulls the Post names, titles, and dates, most recent Post first:

select
 b.option_value siteurl,
 a.post_name post_name,
 a.post_title post_title,
 date_format(a.post_date,'%Y-%m-%d') post_date
from $WPDefines{DB_PREFIX}posts a
join $WPDefines{DB_PREFIX}options b
on b.option_name = 'siteurl'
where post_status = 'publish'
order by post_date desc

The Post date is formatted as YYYY-MM-DD, which is sufficient for the Sitemap URL characteristic, Last Modified (lastmod).  The $WPDefines{DB_PREFIX} value is substituted at runtime to construct the right table names, and the query lazily joins to the WordPress Options table so that every row carries the Site URL.
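To make the prefix substitution concrete, here is a minimal sketch that builds a shortened version of the query for a given prefix and prints it. The build_query helper and the 'wp_' value are assumptions for illustration only; in sitemap.pl the prefix is read from wp-config.php.

```perl
use strict;
use warnings;

# Hypothetical value; in sitemap.pl this comes from wp-config.php.
my %WPDefines = ( DB_PREFIX => 'wp_' );

# Build the query text for a given table prefix (illustrative helper).
sub build_query {
    my ($prefix) = @_;
    return qq/
select
 b.option_value siteurl,
 a.post_name post_name,
 date_format(a.post_date,'%Y-%m-%d') post_date
from ${prefix}posts a
join ${prefix}options b
on b.option_name = 'siteurl'
where post_status = 'publish'
order by post_date desc
/;
}

# With the prefix above, the tables become wp_posts and wp_options.
print build_query( $WPDefines{DB_PREFIX} );
```

Because the join condition only tests option_name, the single siteurl row is paired with every published row, which is why each result carries the Site URL.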


Code

Next, let’s take a look at the code steps (output is stdout):

  1. Process the WordPress Config File.
  2. Connect to the WordPress database.
  3. Select Post names, titles, dates published, and the main Site URL in descending date order (most recent post publish date first).
  4. Generate the Sitemap XML header.
  5. The Sitemap XML body consists of a list of URLs:  for each row returned from the query [3], generate the URL.
  6. Generate the Sitemap XML footer.

Pretty concise. Now, let’s jump to the PERL code:

PERL - sitemap.pl

#!perl -w
#
# PERL Sitemap Generator
# by Richard Alvarez
# LICENSE:
# http://www.triplesunrise.org/wp-content/uploads/free_code_license.txt
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# 
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
use strict;
use DBI;
# WordPress Defines
my %WPDefines = ();
# Sitemap Constants
my $ChangeFreq = 'weekly';
my $WPConfig = shift or die "usage: sitemap.pl PATH_TO_WP_CONFIG\n";
process_config($WPConfig);
# Connect to our WordPress DB instance
my $dbh = DBI->connect(
 join(';',
 "DBI:mysql:",
 "database=$WPDefines{DB_NAME}",
 "user=$WPDefines{DB_USER}",
 "password=$WPDefines{DB_PASSWORD}",
 )
)
 or die "Cannot connect to the WordPress database: $DBI::errstr\n";
# Define our WP query to select published Posts ordered by
# descending date.
my $query = qq/
select
 b.option_value siteurl,
 a.post_name post_name,
 a.post_title post_title,
 date_format(a.post_date,'%Y-%m-%d') post_date
from $WPDefines{DB_PREFIX}posts a
join $WPDefines{DB_PREFIX}options b
on b.option_name = 'siteurl'
where post_status = 'publish'
order by post_date desc
/;
# Issue the query; the result is a reference to an array of
# hash references (1 for each row), keyed by column name
my $result = $dbh->selectall_arrayref($query, { Slice => {} });
# Generate the XML header and the start of the Sitemap document
print <<EOF;
<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
EOF
# Total number of rows plus one, so the computed priority never reaches zero
my $total_articles = @$result + 1;
# (k) row counter used to compute priority
my $k = 0;
# Loop through our results and generate URL elements (<url>) for the Sitemap
# priority = Percentage_From_the_Start_of_List
foreach my $row ( @$result ) {
 my $priority = sprintf("%0.1f", 1-$k/$total_articles);
 $k++;
 generate_url_for_sitemap($priority, $row);
}
# Close the urlset element to complete the file
print "</urlset>\n";
# Generate the whole URL element set for the Sitemap 
sub generate_url_for_sitemap {
 my ($priority, $post_info) = @_;
 print "\t<url>\n";
 my $page_url = 
 $post_info->{siteurl}
 . "/"
 . $post_info->{post_name}
 . "/";
 print_xml_element("loc", $page_url);
 print_xml_element("lastmod", $post_info->{post_date});
 print_xml_element("changefreq", $ChangeFreq);
 print_xml_element("priority", $priority);
 print "\t</url>\n";
}
# Print a single XML element name/value pair
sub print_xml_element {
 my ($element_name, $element_value) = @_;
 my $tab = "\t\t";
 if ($element_value) {
   print "$tab<${element_name}>${element_value}</${element_name}>\n";
 }
 else {
   print "$tab<${element_name} />\n";
 }
}
# Process the WP PHP Config File and pull the defines for 
# the PERL DBI connect
sub process_config {
 my ( $config_file ) = @_;
 open(CONF, '<', $config_file) or die "Cannot open $config_file: $!\n";
 while (<CONF>) {
   foreach my $key ( qw( DB_NAME DB_USER DB_PASSWORD ) ) {
     /define.+$key/ && do {
       my ( $value ) = $_ =~ m/define.+'$key'.+'(.*?)'/g;
       $WPDefines{$key} = $value if $value;
     };
   }
   /table_prefix/ && do {
     # Accept lower-case prefixes too, such as the WordPress default wp_
     my ( $prefix ) = $_ =~ m/\$table_prefix.+'([A-Za-z_0-9]+)'/g;
     $WPDefines{DB_PREFIX} = $prefix if $prefix;
   };
 }
 close(CONF);
}
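The config parsing can be tried out without a database. The following is a minimal sketch, not the code used above: parse_wp_config is a hypothetical helper that works on the text of a config file, and the sample values are made up. It assumes simple single-line define statements and does not handle escaped quotes inside values.

```perl
use strict;
use warnings;

# Parse the text of a wp-config.php file and return a hash of the
# settings the generator needs. Sketch only: handles single-line
# define('KEY', 'value') statements with either quote style.
sub parse_wp_config {
    my ($text) = @_;
    my %defines;
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^\s*define\(\s*(['"])(DB_NAME|DB_USER|DB_PASSWORD)\1\s*,\s*(['"])(.*?)\3\s*\)/ ) {
            $defines{$2} = $4;
        }
        elsif ( $line =~ /^\s*\$table_prefix\s*=\s*(['"])(\w+)\1/ ) {
            $defines{DB_PREFIX} = $2;
        }
    }
    return %defines;
}

# Made-up sample config text for illustration.
my $sample = <<'EOF';
<?php
define( 'DB_NAME', 'wordpress' );
define('DB_USER', "wp_user");
define('DB_PASSWORD', 's3cret');
$table_prefix = 'wp_';
EOF

my %conf = parse_wp_config($sample);
print "$_ = $conf{$_}\n" for sort keys %conf;
```

Returning a hash instead of filling a global makes the parsing step easy to check on its own before pointing the script at a real database.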

Note on Sitemaps

We’re addressing the four elements for a Sitemap URL as follows:

  1. Location URL Element, <loc>:
    This is the URL which is constructed from the WordPress siteurl option variable (it’s set with your WordPress installation already) with the post_name generated when the Post was created (i.e., the URL path unique for the Post which is generally based on the title, for Permalinks).
  2. Last Modification Date Element, <lastmod>:
    The modification date is our converted Post date in the YYYY-MM-DD format.
  3. Change Frequency Element, <changefreq>:
    The Change Frequency is hard-coded to weekly, a hint to crawlers that a page is unlikely to change more often than about once a week.
  4. Priority Element, <priority>:
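    The Priority element defines a relative site priority which, based on our query, is highest for the most recent articles and pages. It is computed from the row position and the total number of articles as 1 - k/(N+1), where k counts rows from zero and N is the number of rows, and is then rounded to one decimal place (a score between 0.0 and 1.0).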

This set of four elements is generated inside a <url> element for each URL (a Post or Page in WordPress), and all of the <url> elements are wrapped in the root <urlset> element that the Sitemap protocol requires for the document.
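As a worked example of the priority calculation, here is a small sketch using the same expression as the script. The priority_for helper and the row count of 4 are assumptions for illustration.

```perl
use strict;
use warnings;

# Priority for the row at zero-based position $k out of $n rows,
# using the same expression as sitemap.pl: 1 - k / (n + 1).
sub priority_for {
    my ( $k, $n ) = @_;
    return sprintf( "%0.1f", 1 - $k / ( $n + 1 ) );
}

my $n = 4;    # hypothetical number of published rows
print join( ' ', map { priority_for( $_, $n ) } 0 .. $n - 1 ), "\n";
# prints "1.0 0.8 0.6 0.4"
```

With only one decimal place, neighboring rows start to share the same value as the list grows, and on a site with several dozen rows the oldest ones round down to 0.0.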


Usage

I batch my articles from a WordPress Sandbox to my Production site (more to follow later), so having this handy PERL script to generate the Sitemap makes the process simple and convenient.  Input is given on the command line (I use BASH) and output goes to the terminal (stdout), which can be redirected to your new Sitemap file.  I would expect to execute this each time an article is Published or a new Site Page is added.

BASH Command Line

$ perl sitemap.pl WP_CONFIG_PHP_PATH > sitemap.xml

Substitute WP_CONFIG_PHP_PATH with the path to your wp-config.php file (or to whatever you’ve renamed it), and then redirect to a temporary Sitemap file for comparison or whatever suits your needs.
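If you do redirect to a temporary file first, the comparison step can be scripted as well. This is a minimal sketch using the core File::Compare and File::Copy modules; the install_if_changed helper and the idea of passing the two file names as arguments are assumptions for illustration, not part of sitemap.pl.

```perl
use strict;
use warnings;
use File::Compare qw(compare);
use File::Copy qw(move);

# Replace $live with $fresh only when their contents differ.
# Returns 1 if the live file was replaced, 0 if it was left alone.
sub install_if_changed {
    my ( $fresh, $live ) = @_;
    if ( -e $live && compare( $fresh, $live ) == 0 ) {
        unlink $fresh;    # identical: discard the temporary copy
        return 0;
    }
    move( $fresh, $live ) or die "Cannot move $fresh to $live: $!\n";
    return 1;
}

# Runs only when two file names are given on the command line,
# e.g. a freshly generated file and the deployed sitemap.xml.
if ( @ARGV == 2 ) {
    my $changed = install_if_changed(@ARGV);
    print $changed ? "Sitemap updated\n" : "Sitemap unchanged\n";
}
```

Leaving the deployed file untouched when nothing has changed keeps its modification time meaningful in a batch process.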


Output

Here are a few lines of sample output for this site, including the closing </urlset> element for clarity:

<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
  <loc>http://www.triplesunrise.org/check/</loc>
  <lastmod>2013-02-22</lastmod>
  <changefreq>weekly</changefreq>
  <priority>1.0</priority>
 </url>
 <url>
  <loc>http://www.triplesunrise.org/god-our-saviour/</loc>
  <lastmod>2013-02-21</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.9</priority>
 </url>
</urlset>
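The URLs in this sample contain nothing that needs escaping, and %postname% slugs normally don't either. The Sitemap protocol does require values to be entity-escaped, though, so if your Site URL or paths could ever contain a character such as &, a small helper like the one below could be applied to each value before it is printed. The xml_escape name and the example URL are assumptions for illustration; this helper is not part of the script above.

```perl
use strict;
use warnings;

# Replace the five characters the Sitemap protocol asks to be
# entity-escaped with their XML entities.
sub xml_escape {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&apos;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Hypothetical URL with a query string.
print xml_escape('http://example.org/?a=1&b=2'), "\n";
# prints "http://example.org/?a=1&amp;b=2"
```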

Robots

Finally, your Robots definition file (robots.txt) can reference the newly generated Sitemap, which is generally deployed to the root of your web site.  The line below is just a sample for reference; reading up on the robots.txt format is recommended.

Sitemap: http://www.yoursitename_here.org/sitemap.xml
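If robots.txt is also maintained by a batch process, adding the line can be scripted. Here is a minimal sketch that works on the file's text and appends the Sitemap line only when it isn't already there. The add_sitemap_line helper and the example.org URL are assumptions for illustration.

```perl
use strict;
use warnings;

# Return the robots.txt text with a Sitemap line for $sitemap_url,
# appending it only if an identical line is not already present.
sub add_sitemap_line {
    my ( $robots_text, $sitemap_url ) = @_;
    return $robots_text
        if $robots_text =~ /^Sitemap:\s*\Q$sitemap_url\E\s*$/mi;
    # Make sure the existing text ends with a newline before appending.
    $robots_text .= "\n"
        if length($robots_text) && $robots_text !~ /\n\z/;
    return $robots_text . "Sitemap: $sitemap_url\n";
}

# Hypothetical existing robots.txt content and Sitemap URL.
my $robots = "User-agent: *\nDisallow:\n";
print add_sitemap_line( $robots, 'http://example.org/sitemap.xml' );
```

Running it a second time on its own output leaves the text unchanged, so it is safe to call on every deployment.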

I hope this was helpful, whether for practical use or as a learning exercise.  Thanks for reading!