Many of the extensions I’m implementing (such as the Google satellite maps) are user-facing, which means they are visible to visitors of WikiStudent. Many other features are admin-facing: they make my life as an administrator easier, but nobody else will be aware of them. One of these is the sitemap.
I’m not talking about an HTML sitemap (the kind that users see) but rather an XML sitemap constructed just for search engines. This is very useful for telling Google which pages to crawl and how often. Every website should have one!
Tonight I will investigate the various MediaWiki sitemap extensions and implement the most viable one. I’ve already given some thought to how I want the search engines to crawl the site. My plan so far:
Crawl frequency
- The Student Jobs pages, which are automatically updated all the time, should be crawled hourly
- The Unisa module pages, which will (hopefully) be edited often, should be crawled daily
- Static pages, such as the Editing help pages, should be crawled monthly
- Pages that almost never change, such as the Privacy policy page, need only be crawled yearly
Priority
This is a value between 0 and 1, indicating the relative importance of pages.
- Main Page (i.e. the home page) - priority 1
- Main categories (e.g. Unisa Modules, Student Jobs) - priority 0.9
- Sub-categories (e.g. Accounting, Geography, Economics) - priority 0.8
- Pages that fall under the sub-categories (e.g. ILW1036, MNX202J, CLA101S) - priority 0.7
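To make the plan concrete, here is a rough Python sketch of how the frequency and priority settings above could end up in a sitemap.xml file. It is not what any MediaWiki extension actually does. The base URL and the page titles (Main_Page, Student_Jobs, Help:Editing, Privacy_policy) are placeholders I made up for illustration, and where my plan doesn't specify a frequency or priority for a page, the tag is simply left out, which the sitemap protocol allows.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
BASE_URL = "https://www.example.org/wiki/"  # placeholder, not the real site address

# (page title, changefreq, priority); None means "leave that tag out".
# All titles are hypothetical examples of each page type in the plan.
PAGES = [
    ("Main_Page", None, "1.0"),          # home page
    ("Student_Jobs", "hourly", "0.9"),   # main category, updated constantly
    ("ILW1036", "daily", "0.7"),         # module page under a sub-category
    ("Help:Editing", "monthly", None),   # static page
    ("Privacy_policy", "yearly", None),  # almost never changes
]


def build_sitemap(pages, base_url=BASE_URL):
    """Return the sitemap as an XML string, one <url> entry per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for title, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url + title
        if changefreq is not None:
            ET.SubElement(url, "changefreq").text = changefreq
        if priority is not None:
            ET.SubElement(url, "priority").text = priority
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap(PAGES))
```

Both changefreq and priority are only hints, so search engines are free to crawl more or less often than the file suggests.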
Pages that I don’t want crawled include discussion pages and redirect pages. All my crawl preferences can be made known with just two files: robots.txt and sitemap.xml.
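For the robots.txt side, a minimal sketch might look like the following. It assumes page URLs of the form /wiki/Title with discussion pages under /wiki/Talk:, and it uses a placeholder domain; both are my assumptions and would need adjusting to the real URL layout. Python's standard urllib.robotparser module is used only to check that the rules behave as intended.

```python
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.org"  # placeholder domain

# Block discussion pages and point crawlers at the sitemap.
# The /wiki/Talk: prefix is an assumed URL layout.
ROBOTS_TXT = f"""\
User-agent: *
Disallow: /wiki/Talk:

Sitemap: {SITE}/sitemap.xml
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT):
    """Return True if a generic crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    for path in ("/wiki/ILW1036", "/wiki/Talk:ILW1036"):
        print(path, is_crawlable(SITE + path))
```

Disallow rules match by URL prefix, so other discussion namespaces (such as User_talk:) would each need their own line. Redirect pages can't be singled out by URL in this way, so in this sketch they would simply be left out of sitemap.xml.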



