The Underused Bits of ASCII
7-bit ASCII and its mapping onto UTF-8[1] is one of the few nice things left in the mess that is character encoding. The 128 possible values of ASCII are mainly used for the Latin character set, ‘Arabic’ numerals, and the common punctuation you see on a standard keyboard.
The layout also makes it quite nice to read in binary: 10xxxxx are the uppercase letters, 11xxxxx the lowercase letters, and the numbers are hidden down in 011xxxx. Punctuation gets scattered around the edges of these blocks, and in the space before the numbers in 010xxxx[2].
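The block layout described above can be checked with a few shifts. This is a minimal sketch in Python; the function name `ascii_block` and the block labels are only illustrative.

```python
def ascii_block(ch: str) -> str:
    """Name the ASCII block a character falls in, using only its top bits."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    top2 = code >> 5  # the top two of the seven bits
    if top2 == 0b00:
        return "control"
    if top2 == 0b10:
        return "uppercase block"
    if top2 == 0b11:
        return "lowercase block"
    # top2 == 0b01: the next bit separates 011xxxx from 010xxxx
    return "digit block" if code >> 4 == 0b011 else "punctuation block"


# Print the 7-bit pattern next to the block name for a few samples.
for ch in ("A", "a", "7", "!", "\x07"):
    print(f"{ord(ch):07b} {ascii_block(ch)}")
```

One side effect of this layout is that an uppercase letter and its lowercase partner differ in a single bit (0x20), so `chr(ord("A") | 0x20)` gives `"a"`.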
However, this covers only three quarters of the available space: none of the characters which start with 00 have been mentioned. These 32 characters are the non-printable, or control, characters, which have sadly fallen by the wayside in popular programming. Their acronyms, and short explanations[3], are shown in the table below.
SOH  Start of Heading
STX  Start of Text
ETX  End of Text
EOT  End of Transmission
DLE  Data Link Escape
DC1  Device Control 1
DC2  Device Control 2
DC3  Device Control 3
DC4  Device Control 4
ETB  End of Transmission Block
EM   End of Medium
A lot of the document description characters have fallen out of use for a variety of reasons, mostly because editors commonly choose not to display them, and because they are awkward to enter on a keyboard when working with text files. In binary files, separators are often replaced by knowledge of how large each field is. Even where field sizes are not fixed, binary data can itself contain the values of these control characters, so a system of escaping must be put in place.
There are, however, a number of places where these characters are still of use. They are still prevalent in a few cases[4] in terminals, such as the bell, backspace, and line-ending characters. However, their use isn't limited to terminal control.
One of my current projects is a remote interface to my media player of choice, Foobar2000. As all the transmissions are tag and command data, and do not contain binary content, the stream is treated as UTF-8 data. This kind of data is well suited to being split into fields, lines, etc. by the separator characters, and I can even avoid assuming that these characters never appear in the tag data by using the escape character to escape any such instances.
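A minimal sketch of that kind of splitting and escaping, in Python. The choice of US between fields and RS after each record is an assumption made for illustration, as are the function names; the actual layout used in the project is not described here.

```python
US = "\x1f"   # Unit Separator: placed between fields (assumed layout)
RS = "\x1e"   # Record Separator: placed after each record (assumed layout)
ESC = "\x1b"  # Escape: marks the next character as literal data

SPECIAL = {US, RS, ESC}


def escape(field: str) -> str:
    """Prefix any separator or escape character in the data with ESC."""
    return "".join(ESC + c if c in SPECIAL else c for c in field)


def encode_record(fields) -> str:
    """Join escaped fields with US and terminate the record with RS."""
    return US.join(escape(f) for f in fields) + RS


def decode_stream(data: str):
    """Split a stream back into records, honouring ESC-escaped characters."""
    records, fields, current = [], [], []
    chars = iter(data)
    for c in chars:
        if c == ESC:
            # Take the following character literally, whatever it is.
            current.append(next(chars, ""))
        elif c == US:
            fields.append("".join(current))
            current = []
        elif c == RS:
            fields.append("".join(current))
            current = []
            records.append(fields)
            fields = []
        else:
            current.append(c)
    return records


# A tag value that happens to contain a separator survives the round trip.
stream = encode_record(["Artist", "Tit\x1fle"]) + encode_record(["Album", "2009"])
print(decode_stream(stream))
```

Because ESC is itself escaped, any character can appear in a field, at the cost of one extra byte per escaped character.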
Some of these characters are also used in the command stream: ACK and NAK are useful as pre-defined constants for indicating all sorts of state changes, as well as plain acknowledgements.
Here is where I begin to be a bit more inventive with my usage of the control characters: for the Foobar2000 project, I started each non-data transmission with an ENQ, skipped the heading and text characters (SOH, STX, ETX), and ended the command with EOT.
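The ENQ ... EOT framing could be sketched along these lines in Python. The command names ("play", "pause", "stop") and every function name are hypothetical, and the real protocol may handle partial input and replies differently.

```python
ENQ = "\x05"  # Enquiry: used here to open a command
EOT = "\x04"  # End of Transmission: used here to close a command
ACK = "\x06"  # Acknowledge
NAK = "\x15"  # Negative Acknowledge


def frame_command(name: str) -> str:
    """Wrap a command name in ENQ ... EOT."""
    if ENQ in name or EOT in name:
        raise ValueError("command text must not contain ENQ or EOT")
    return ENQ + name + EOT


def split_stream(stream: str):
    """Yield ("command", text) and ("data", text) chunks from a mixed stream."""
    pos = 0
    while pos < len(stream):
        if stream[pos] == ENQ:
            end = stream.find(EOT, pos)
            if end == -1:
                break  # incomplete command: stop and wait for more input
            yield ("command", stream[pos + 1:end])
            pos = end + 1
        else:
            nxt = stream.find(ENQ, pos)
            if nxt == -1:
                nxt = len(stream)
            yield ("data", stream[pos:nxt])
            pos = nxt


def reply(command: str, known=("play", "pause", "stop")) -> str:
    """Answer ACK for a recognised command and NAK otherwise."""
    return ACK if command in known else NAK


mixed = frame_command("play") + "Artist" + frame_command("stop")
for kind, text in split_stream(mixed):
    print(kind, repr(text))
```

The receiver only needs to look at single characters to tell commands from data, which is what makes the pre-defined control codes convenient here.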
In a more recent project, I got to use FS and RS characters in a flat-file database. Because the front end rejected these non-printable characters, there was much less chance of the data itself colliding with a separator, so the fields were handled more reliably than in, say, a CSV file.
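A rough sketch of such a flat file in Python. It assumes FS sits between fields and RS ends each record, which is a guess at the layout rather than a description of it; `validate`, `write_record` and `read_records` are illustrative names.

```python
import io

FS = "\x1c"  # File Separator, used here between fields (assumed layout)
RS = "\x1e"  # Record Separator, used here after each record (assumed layout)


def validate(value: str) -> str:
    """Front-end check: refuse any control character in user input."""
    if any(ord(c) < 0x20 or ord(c) == 0x7F for c in value):
        raise ValueError("control characters are not allowed in field values")
    return value


def write_record(out, fields) -> None:
    """Append one record: validated fields joined by FS, closed by RS."""
    out.write(FS.join(validate(f) for f in fields) + RS)


def read_records(text: str):
    """Split the file contents back into a list of field lists."""
    # Every record ends with RS, so the final split element is always empty.
    return [record.split(FS) for record in text.split(RS)[:-1]]


buf = io.StringIO()
write_record(buf, ["Smith, John", 'says "hi"'])  # commas and quotes need no escaping
write_record(buf, ["", "x"])
print(read_records(buf.getvalue()))
```

Since the separators can never appear in validated input, reading is a plain split with no quoting rules, which is where CSV tends to get complicated.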
So, they're there for you when you need them: the ASCII non-printable characters, a good match for the printf/scanf family of functions and a help in looking after your data.
- [1] http://en.wikipedia.org/wiki/UTF-8#Advantages
- [2] http://www.asciitable.com/
- [3] http://www.asciitable.com/
- [4] http://en.wikipedia.org/wiki/Control_character#In_ASCII