Website Checklist
Like many of my long programming posts, this is as much a reminder list for myself as anything else, with vague attempts at explanations tacked on to validate it as a post. Not for general consumption. Do not use orally. Your kilometerage may vary.
Here in 2012, websites are an important part of our everyday life. I've had the joy of writing a few myself, and I've learnt a few things along the way. I've also collected a number of rather more arbitrary dispositions; I have the misfortune of not knowing which is which anymore, so this post will contain a mix.
Sticking to a doctype
The web has a number of different document types available for use, which can be a bit confusing. They come from two basic families - XHTML and HTML. There are a number of different versions from each family, but there are four currently relevant ones - HTML4, XHTML-transitional, XHTML-strict, and HTML5. The XHTML family, although a significant improvement over HTML4, can generally be discounted as its MIME type (see later) is not supported by a number of browsers. HTML5 is a re-working of HTML4 with some of the ideas from XHTML included, along with some new tags designed to help convey the semantics of page layout, such as the <header> and <footer> tags.
Another useful thing that HTML5 has over the others is that it has the shortest DOCTYPE definition of the lot [1]:
<!DOCTYPE html>
MIME Type and other headers
This is something that most web servers will handle for you, unless you are serving information through custom scripts (such as PHP). The MIME type of a file is a string that describes the format and language of that file. For example, a normal text file is text/plain, and a JPEG image is image/jpeg. The correct MIME type for pretty much any file format can be looked up online; the two I want to bring to your attention are the MIME types of HTML and XHTML.
HTML has a simple, logical text/html; XHTML, on the other hand, is application/xhtml+xml, which is the reason that some browsers won't render an XHTML file. Although it is possible to serve an XHTML file with MIME type text/html, this can lead to odd behaviour (such as other browsers trying to handle it as an HTML file).
Similarly, it is always a good idea to serve files with the correct MIME type. This is often a problem with things like CSS-minifying scripts, where the script doesn't set a MIME type, so the web server falls back to the default [2].
There are some other useful headers to set, such as Content-Length, which is required for browsers to show progress bars when doing downloads. Again, for static files, this should be handled by the web server.
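For a custom file-serving script, the two headers above might be filled in along these lines. This is only a sketch: response_headers is a name I've made up, and it leans on Python's standard mimetypes module to do the lookup by file extension.

```python
import mimetypes


def response_headers(filename, body):
    """Build Content-Type and Content-Length headers for a file being served.

    `filename` is only used to guess the type; `body` is the bytes to send.
    """
    # guess_type returns (type, encoding); the type is None for unknown extensions.
    content_type, _ = mimetypes.guess_type(filename)
    if content_type is None:
        # Generic "some bytes" type, so the browser does not try to render it.
        content_type = "application/octet-stream"
    return {
        "Content-Type": content_type,
        # Without this the browser cannot show download progress.
        "Content-Length": str(len(body)),
    }
```

A minifying script would call something like response_headers("style.min.css", minified) and send the result before the body, rather than leaving the server to guess.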
Caching
Something that often isn't handled by the web server is caching of files. However, most major servers now correctly handle modification checking - the client sends either a timestamp or an ETag (some kind of hash of the file) to the server as part of the request, and the server only sends back the full file if it has a new version. Otherwise, it sends back the status code 304 Not Modified, and no content - which reduces the amount transferred.
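The modification check can be sketched roughly as follows. The function names are mine, the choice of SHA-256 for the tag is arbitrary, and the comparison is simplified: a real If-None-Match header may carry several tags or "*", which this ignores.

```python
import hashlib


def make_etag(body):
    # An ETag is an opaque quoted string; here it is a hash of the content.
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def conditional_response(body, if_none_match=None):
    """Return (status, headers, payload), honouring the client's If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None and if_none_match.strip() == etag:
        # The client already holds this version: send the status and no content.
        return "304 Not Modified", headers, b""
    headers["Content-Length"] = str(len(body))
    return "200 OK", headers, body
```

On the first request the client gets the full body plus the ETag; on later requests it sends the tag back, and gets the body again only if the file has changed.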
However, this still requires the client to connect to the server, make the request, and wait for the reply. For items like logos, which change very infrequently, it would be better if the client only checks for a new version after a certain amount of time. This is the basis of caching on the internet, and I try to use it where I can.
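Telling the client how long it may go without checking is done with the Cache-Control header. A minimal sketch, assuming a made-up policy table (the lifetimes below are illustrative, not recommendations):

```python
import os
from datetime import timedelta

# Hypothetical policy: how long each kind of file may be reused without asking.
CACHE_LIFETIMES = {
    ".png": timedelta(days=30),  # logos and layout images rarely change
    ".css": timedelta(days=7),
}


def cache_headers(filename):
    """Pick a Cache-Control header based on the file extension."""
    suffix = os.path.splitext(filename)[1].lower()
    lifetime = CACHE_LIFETIMES.get(suffix)
    if lifetime is None:
        # "no-cache" still allows storing, but forces a check with the server.
        return {"Cache-Control": "no-cache"}
    seconds = int(lifetime.total_seconds())
    return {"Cache-Control": "public, max-age=%d" % seconds}
```

Anything not in the table falls back to being re-checked on every request, which is the safe choice for pages that change often.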
Unfortunately, it's not that widely used - the blogging software I use doesn't set things like the background images for the layout, or the style sheet that is loaded with every page, to cache, meaning more work for both my server and your machines when you click on a link.
Status Codes and Error Pages
Oh. Oh God. Look, it's quite simple. There are a number of different status codes that things can return. For a static website, none of these really matter. But, the moment people start doing any kind of dynamic content, they somehow seem to get this amazingly wrong. If the content is not found, you send the status code 404 Not Found, and the content should be a 404 error page. If you have a permanent redirect, you use the code 301 Moved Permanently, and put the new address in the Location header - you don't just serve up the new content at the old address.
I'm going to go into this in much more detail later - including an in-depth look at 300-series codes, which is where most of the confusion occurs.
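In a dynamic script, that boils down to something like the sketch below. The paths, pages and the respond function are all invented for illustration; the point is only which status goes with which headers and body.

```python
# Hypothetical site data, made up for this example.
PAGES = {"/about": b"<h1>About</h1>"}
MOVED = {"/about-us": "/about"}  # old path -> new path
NOT_FOUND_PAGE = b"<h1>404 Not Found</h1>"


def respond(path):
    """Return (status, headers, body) for a requested path."""
    if path in PAGES:
        return "200 OK", {"Content-Type": "text/html"}, PAGES[path]
    if path in MOVED:
        # Point the client at the new address; do not serve the content here.
        return "301 Moved Permanently", {"Location": MOVED[path]}, b""
    # Missing content gets the 404 status *and* an error page as the body.
    return "404 Not Found", {"Content-Type": "text/html"}, NOT_FOUND_PAGE
```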
Error logging
Error logging is something of a conundrum on larger servers - those which are actually using virtual hosts will be able to route the logs per host. The problem lies more with large servers with one large group of websites managed by many different people - an example is a Student's Union website, which hosts all the websites of the clubs in the society.
Linux, as always, gives us tools to handle this - tail and grep can be used to build a live feed of the errors for your section of the site, and grep and tail can be used to get the list of recent errors.
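If you'd rather do the same filtering from a script, the grep-then-tail half might look like this in Python. It is a sketch only: recent_errors is my own name, and it assumes each log line mentions the path of the request somewhere in it.

```python
from collections import deque


def recent_errors(lines, section, count=10):
    """Keep the last `count` lines that mention `section` (grep, then tail)."""
    matching = (line for line in lines if section in line)
    # A deque with maxlen silently drops the oldest entries as new ones arrive.
    return list(deque(matching, maxlen=count))
```

Passing an open log file as lines works, since files iterate line by line; the path of the log and the section prefix depend entirely on how your server is set up.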
The server logs, however, are not the only error logs that your website may have. If you're using any CGI scripting, or any kind of server-side scripting at all, you're eventually going to run into errors there. Remember: this is your code, so you have no excuse for it not handling errors smoothly [3].
What you do with the errors is up to you - I seem to recall that under PHP/MySQL, writing them to the database was a popular option [4], so long as a few things are remembered. The first one is what to do if you have some issue initialising your error handling, something that is often forgotten. The second is that sending the error data back to the client is a bad idea: in the best case, it confuses them; in the worst case, you've just handed your server over to a very lucky script kiddie.
You also want clear error pages for when something goes wrong, which look in keeping with the design of the site. These want to be as close to static as possible (minimum of scripts, etc), and the only things that should really be dynamic on an error page might be some supplementary information for the error. That said, the only thing worse than sparse information on an error page is incorrect information on an error page - something which once pulled me into an extended service request with ICT here when their service request system decided to stop working.
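Both of those points can be sketched in a few lines. This version logs to a file rather than a database, and every name in it (make_logger, run_handler, the logger name, the error page) is made up for the example.

```python
import logging
import sys

GENERIC_ERROR_PAGE = b"<h1>Something went wrong</h1>"


def make_logger(log_path):
    """Set up error logging, with a fallback if the log file cannot be opened."""
    logger = logging.getLogger("site-errors")
    logger.setLevel(logging.ERROR)
    try:
        handler = logging.FileHandler(log_path)
    except OSError:
        # Error handling failed to initialise: fall back to stderr, not silence.
        handler = logging.StreamHandler(sys.stderr)
    logger.addHandler(handler)
    return logger


def run_handler(handler, logger):
    """Run a page handler; log any failure and keep the details from the client."""
    try:
        return "200 OK", handler()
    except Exception:
        # logger.exception records the message plus the current traceback.
        logger.exception("unhandled error in page handler")
        return "500 Internal Server Error", GENERIC_ERROR_PAGE
```

The client only ever sees the generic page; the traceback goes to the log, where it is of use to you and nobody else.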
Sessions and Cookies
Most of the errors that I personally come across are caused by cookies - the website expects me to look at some advert before viewing the site, and attempts to set a cookie to note that I've seen the advert. My browser drops the cookie [5], and gets redirected to the actual page. The server checks for the cookie, and sends me back to look at the advert.
Most cookies are used in this kind of way - session control. They are used to connect your request history for the site together. Of course, some cookies are used to track you across the entire internet (Google, Facebook, etc. I'm looking at you). However, there are a number of other ways to do this which don't involve setting cookies - most servers and clients assume that if you've requested one page from the server, you're likely to request some more, so the socket is kept open. This socket can be used to uniquely identify the session, as it'll be a separate socket from everyone else.
This method, of course, doesn't work when one side closes the socket between each request - something that may happen in older, or lighter clients [6]. The other option is the inbuilt authentication methods in HTTP; you'll likely have seen it without knowing what it is - when the browser displays a user/password dialog for a web page, HTTP authentication is being used. It has the disadvantage that you can't alter the look, and it does also add data to the headers like cookies [7]. You can also tie it into robust security systems, such as using Kerberos.
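For the curious, the simplest of the HTTP authentication schemes (Basic) works roughly like the sketch below. The user table, realm name and function name are all hypothetical, and a real site would store password hashes, only do this over HTTPS, and treat the scheme name case-insensitively.

```python
import base64
import hmac

# Hypothetical user table; plain-text passwords are for illustration only.
USERS = {"alice": "correct horse"}


def check_basic_auth(authorization):
    """Check an Authorization header value; return (status, headers)."""
    # This challenge is what makes the browser show its user/password dialog.
    challenge = {"WWW-Authenticate": 'Basic realm="Members"'}
    if not authorization or not authorization.startswith("Basic "):
        return "401 Unauthorized", challenge
    try:
        raw = base64.b64decode(authorization[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:
        # Covers both malformed base64 and bytes that are not valid UTF-8.
        return "401 Unauthorized", challenge
    user, _, password = decoded.partition(":")
    expected = USERS.get(user)
    if expected is not None and hmac.compare_digest(
        password.encode("utf-8"), expected.encode("utf-8")
    ):
        return "200 OK", {}
    return "401 Unauthorized", challenge
```

The browser resends the same Authorization header with each request to that realm, which is the "data added to the headers" mentioned above.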
Content
Yeah...you might want to have some of this. It's never been something I've been good at [8]. Working out what content you want where can be something of an issue. Also, the styling, the URI design, and everything else that is visible to the client.
But, I'll save that for another post.
- 1 ↑ This actually annoys me, because the DOCTYPE directive had well-defined semantics, which this definition does not honour. Interpreted the old way, this does not actually specify what document structure should be used, but only that html will be the root tag of the document.
- 2 ↑ A guide to writing file serving scripts is in the pipeline
- 3 ↑ "I couldn't be bothered" and "I'll get around to it" are not valid excuses
- 4 ↑ I really hope that I'm making that up.
- 5 ↑ I dislike cookies as a rule
- 6 ↑ lighter clients, such as command line browsers, are also likely to deal with cookies
- 7 ↑ Although, the data it adds is well defined, as is what pages to send it for
- 8 ↑ with the possible exception of this blog?