Myself, Coding, Ranting, and Madness

The Consciousness Stream Continues…

The Underused Bits of ASCII

5 Dec 2012 8:00 Tags: Programming

7-bit ASCII and its mapping onto UTF-8 [1] is one of the few nice things left in the mess that is character encoding. The 128 possible values of ASCII are mainly used for the Latin alphabet, ‘Arabic’ numerals, and the common punctuation you see on a standard keyboard.
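The mapping onto UTF-8 can be checked in a few lines. This is only a quick sketch in Python (my choice of language for the examples here, not something the projects below were written in): every ASCII value encodes to the same single byte, and anything outside ASCII is built only from bytes with the high bit set.

```python
# Every 7-bit ASCII value encodes to the same single byte in UTF-8.
ascii_bytes = bytes(range(128))
text = ascii_bytes.decode("ascii")
round_trip_ok = text.encode("utf-8") == ascii_bytes

# Characters outside ASCII use only bytes >= 0x80, so they can never
# be mistaken for an ASCII character (control characters included).
encoded = "é€".encode("utf-8")
high_bit_set = all(b >= 0x80 for b in encoded)

print(round_trip_ok, high_bit_set, [hex(b) for b in encoded])
```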

The layout also makes it quite nice to read in binary: 10xxxxx holds the uppercase letters, 11xxxxx the lowercase letters, and the numbers are hidden down in 011xxxx. Punctuation is scattered around the edges of these blocks, and in the space before the numbers in 010xxxx [2].
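As a rough illustration of that layout, the block a character lives in can be read straight off its top bits. The function name and the block labels below are made up for this sketch; note that the letter blocks also hold a little punctuation at their edges.

```python
def ascii_block(ch: str) -> str:
    """Name the block of the 7-bit ASCII table that ch falls in."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    top2 = code >> 5              # the two most significant of 7 bits
    if top2 == 0b00:
        return "control"
    if top2 == 0b10:
        return "uppercase block"
    if top2 == 0b11:
        return "lowercase block"
    # top2 == 0b01: split once more on the next bit down
    return "digits block" if (code >> 4) == 0b011 else "punctuation block"


for ch in ("A", "a", "7", "!", "\x07"):
    print(repr(ch), format(ord(ch), "07b"), ascii_block(ch))

# Two handy consequences of the layout:
assert chr(ord("A") ^ 0b0100000) == "a"   # one bit flips the case
assert ord("7") & 0b0001111 == 7          # low four bits give the digit
```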

However, this is only three quarters of the available space: none of the characters which start with 00 have been mentioned. These 32 characters are the non-printable, or control, characters, which have sadly fallen by the wayside in popular programming. Their abbreviations, and short explanations [3], are shown in the table below.

  Dec  Hex  Abbr  Name
  0    00   NUL   Null
  1    01   SOH   Start of Heading
  2    02   STX   Start of Text
  3    03   ETX   End of Text
  4    04   EOT   End of Transmission
  5    05   ENQ   Enquiry
  6    06   ACK   Acknowledgment
  7    07   BEL   Bell
  8    08   BS    Backspace
  9    09   HT    Horizontal Tab
  10   0A   LF    Line Feed
  11   0B   VT    Vertical Tab
  12   0C   FF    Form Feed
  13   0D   CR    Carriage Return
  14   0E   SO    Shift Out
  15   0F   SI    Shift In
  16   10   DLE   Data Link Escape
  17   11   DC1   Device Control 1
  18   12   DC2   Device Control 2
  19   13   DC3   Device Control 3
  20   14   DC4   Device Control 4
  21   15   NAK   Negative Acknowledgment
  22   16   SYN   Synchronous Idle
  23   17   ETB   End of Transmission Block
  24   18   CAN   Cancel
  25   19   EM    End of Medium
  26   1A   SUB   Substitute
  27   1B   ESC   Escape
  28   1C   FS    File Separator
  29   1D   GS    Group Separator
  30   1E   RS    Record Separator
  31   1F   US    Unit Separator
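Counting from zero, the position of each abbreviation in the list is its code value, so a lookup table is easy to build. The names CONTROL_NAMES and CODES are just placeholders for this sketch.

```python
# The 32 abbreviations in code order; the index is the code value.
CONTROL_NAMES = (
    "NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI "
    "DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US"
).split()

CODES = {name: code for code, name in enumerate(CONTROL_NAMES)}

# All of them sit in the 00xxxxx block and none of them is printable.
all_in_block = all(code >> 5 == 0 for code in CODES.values())
none_printable = all(not chr(code).isprintable() for code in CODES.values())

print(CODES["BEL"], CODES["ESC"], CODES["RS"], CODES["US"])
```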

A lot of the document description characters have fallen out of use for a variety of reasons, mostly the common decision of editors not to display them and the complexity of entering them on a keyboard when working with text files. In binary files, separators are often replaced by the knowledge of how large the fields are. Even where that is not the case, binary data can contain the byte values of these control characters, so a system of escaping must be put in place.
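To make the binary case concrete, here is a minimal sketch of the "knowledge of how large the fields are" approach: each field is preceded by its length, so the payload can hold any byte value, separators included, without escaping. The 4-byte big-endian length prefix and the function names are assumptions for the example, not any particular file format.

```python
import struct


def pack_fields(fields):
    """Concatenate fields, each preceded by a 4-byte big-endian length."""
    out = bytearray()
    for field in fields:
        out += struct.pack(">I", len(field))
        out += field
    return bytes(out)


def unpack_fields(blob):
    """Reverse pack_fields by reading a length, then that many bytes."""
    fields = []
    pos = 0
    while pos < len(blob):
        (size,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        if pos + size > len(blob):
            raise ValueError("truncated field")
        fields.append(blob[pos:pos + size])
        pos += size
    return fields


# The second field contains RS and US bytes, and that is not a problem.
payload = [b"plain", b"has \x1e and \x1f inside", b""]
restored = unpack_fields(pack_fields(payload))
```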

There are, however, a number of places where these characters are still of use. They are still prevalent in a few cases [4] in terminals, such as the bell, backspace, and line-ending characters. However, their use isn't limited to terminal control.

One of my current projects is a remote interface to my media player of choice, Foobar2000. As all the transmissions are tag and command data, and do not contain binary content, the stream is treated as UTF-8 data. This kind of data is well suited to being split into fields, lines, etc. with the separator characters, and I can even avoid assuming that the separators never appear in the tag data by using the escape character to escape any such instances.
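A sketch of how such splitting and escaping might look follows. It is not the code from the Foobar2000 project: the choice of US between fields and RS after each record, the rule that ESC makes the next character literal, and the function names are all assumptions for illustration.

```python
US, RS, ESC = "\x1f", "\x1e", "\x1b"


def encode_record(fields):
    """Join fields with US and terminate with RS, escaping specials."""
    def escape(value):
        out = []
        for ch in value:
            if ch in (US, RS, ESC):
                out.append(ESC)      # mark the next character as literal
            out.append(ch)
        return "".join(out)
    return US.join(escape(f) for f in fields) + RS


def decode_stream(stream):
    """Split a stream of encoded records back into lists of fields."""
    records, fields, current = [], [], []
    chars = iter(stream)
    for ch in chars:
        if ch == ESC:
            current.append(next(chars, ""))   # take next char as data
        elif ch == US:
            fields.append("".join(current))
            current = []
        elif ch == RS:
            fields.append("".join(current))
            records.append(fields)
            fields, current = [], []
        else:
            current.append(ch)
    return records


tags = ["Björk", "Title with \x1f inside", "Esc \x1b here"]
wire = (encode_record(tags) + encode_record(["a", "b"])).encode("utf-8")
decoded = decode_stream(wire.decode("utf-8"))
```

Because the separators are single ASCII bytes, and UTF-8 never reuses those byte values inside a multi-byte character, non-ASCII tag text passes through untouched.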

Some of these characters are also being used in the command stream: ACK and NAK are useful as pre-defined constants for indicating all sorts of state changes, as well as just acknowledgements.

Here is where I begin to be a bit more inventive with my usage of the control characters: for the Foobar2000 project, I began non-data transmissions with an ENQ, skipped the heading and text characters, and ended the command with EOT.
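The framing described above might be sketched like this. The command names ("play", "pause") and the helper functions are hypothetical, and replying with ACK or NAK is just one plausible use of those constants, not a description of the real protocol.

```python
ENQ, EOT, ACK, NAK = "\x05", "\x04", "\x06", "\x15"


def frame_command(command):
    """Wrap a command as ENQ <command> EOT."""
    if ENQ in command or EOT in command:
        raise ValueError("command may not contain ENQ or EOT")
    return ENQ + command + EOT


def extract_commands(stream):
    """Pull every complete ENQ...EOT command out of a mixed stream."""
    commands = []
    pos = 0
    while True:
        start = stream.find(ENQ, pos)
        if start == -1:
            break
        end = stream.find(EOT, start + 1)
        if end == -1:
            break                    # incomplete command, wait for more
        commands.append(stream[start + 1:end])
        pos = end + 1
    return commands


def reply(command, known=("play", "pause")):
    """Answer ACK for a recognised command and NAK otherwise."""
    return ACK if command in known else NAK


stream = "tag data" + frame_command("play") + "more" + frame_command("rewind")
found = extract_commands(stream)
replies = [reply(c) for c in found]
```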

In a more recent project, I got to use FS and RS characters in a flat-file database. This gave a much reduced chance of problems caused by the data itself, as these non-printable characters were rejected by the front end, so the fields were handled more reliably than in, say, a CSV file.
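As a rough sketch of the idea, with assumptions stated plainly: here FS separates the fields within a record and RS terminates each record (the original project may have assigned them differently), and the front-end check is modelled by a validate function that rejects every control character. All the names are made up for the example.

```python
import io

FS, RS = "\x1c", "\x1e"


def validate(value):
    """Front-end check: refuse any value holding a control character."""
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in value):
        raise ValueError("control characters are not allowed in fields")
    return value


def write_rows(handle, rows):
    """Write each row as FS-separated fields followed by RS."""
    for row in rows:
        handle.write(FS.join(validate(v) for v in row) + RS)


def read_rows(handle):
    """Read rows back; every record ends in RS, so drop the final split."""
    records = handle.read().split(RS)[:-1]
    return [record.split(FS) for record in records]


# Commas and quotes need no special treatment, unlike in CSV.
rows = [["Smith, John", 'said "hi"', "x"], ["a", "", "c"]]
buffer = io.StringIO()
write_rows(buffer, rows)
buffer.seek(0)
loaded = read_rows(buffer)
```

The trade-off in this sketch is that values containing newlines or tabs are rejected as well, since those are control characters too.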

So, they're there for you when you need them: the ASCII non-printable characters, a perfect match for the printf/scanf family of functions, and a help in looking after your data.

  [1] http://en.wikipedia.org/wiki/UTF-8#Advantages
  [2] http://www.asciitable.com/
  [3] http://www.asciitable.com/
  [4] http://en.wikipedia.org/wiki/Control_character#In_ASCII