The Underused Bits of ASCII « Myself, Coding, Ranting, and Madness

The Underused Bits of ASCII

5 Dec 2012 8:00 Tags: Programming

7-bit ASCII and its mapping onto UTF-8 is one of the few nice things left in the mess that is character encoding. The 128 possible values of ASCII are mainly used for the Latin character set, ‘Arabic’ numerals, and the common punctuation you see on a standard keyboard.

The layout also makes it quite nice to read in binary: 10xxxxx are the uppercase letters, 11xxxxx the lowercase letters, and the numbers are hidden down in 011xxxx. Punctuation gets scattered around the edges of these blocks, and fills the space before the numbers in 010xxxx.
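To make the layout concrete, here is a small Python sketch that names the block a character falls in by looking only at its top bits. The function name `ascii_block` and the block labels are my own invention for illustration; note that each block also holds some punctuation (and DEL sits at the very end of the lowercase block), so this names the block, not the exact kind of character.

```python
def ascii_block(ch: str) -> str:
    """Name the 7-bit ASCII block a character falls in, using only its top bits."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    top2 = code >> 5  # the two most significant of the seven bits
    if top2 == 0b00:
        return "control"
    if top2 == 0b01:
        # 011xxxx holds the digits, 010xxxx the space and punctuation before them
        return "digits block" if (code >> 4) == 0b011 else "punctuation block"
    return "uppercase block" if top2 == 0b10 else "lowercase block"


# Print the binary form next to the block name for a few sample characters.
for ch in "A", "a", "7", "!", "\x07":
    print(f"{ord(ch):07b} {ch!r:>6} {ascii_block(ch)}")
```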

However, this only covers three quarters of the available space: none of the characters which start with 00 have been mentioned. These 32 characters are the non-printable, or control, characters, which have sadly fallen by the wayside in popular programming. Their acronyms, and short explanations, are shown in the table below.

NUL Null char
SOH Start of Heading
STX Start of Text
ETX End of Text
EOT End of Transmission
ENQ Enquiry
ACK Acknowledgment
BEL Bell
BS Back Space
HT Horizontal Tab
LF Line Feed
VT Vertical Tab
FF Form Feed
CR Carriage Return
SO Shift Out
SI Shift In
DLE Data Link Escape
DC1 Device Control 1
DC2 Device Control 2
DC3 Device Control 3
DC4 Device Control 4
NAK Negative ACK
SYN Synchronous Idle
ETB End of Transmission Block
CAN Cancel
EM End of Medium
SUB Substitute
ESC Escape
FS File Separator
GS Group Separator
RS Record Separator
US Unit Separator

A lot of the document description characters have fallen out of use for a variety of reasons, mostly the common decision of editors not to display them, and the complexity of entering them on a keyboard when working with text files. In binary files, separators are often replaced by knowledge of how large each field is. Even where the sizes are not known in advance, binary data can contain the byte values of these control characters, so a system of escaping must be put in place.
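As a rough illustration of the "knowledge of how large the fields are" approach, here is a minimal Python sketch of length-prefixed fields. The 2-byte big-endian length and the function names are assumptions made for the example, not a description of any particular file format.

```python
import struct


def pack_fields(fields):
    """Write each field as a 2-byte big-endian length followed by its bytes."""
    out = bytearray()
    for field in fields:
        out += struct.pack(">H", len(field)) + field
    return bytes(out)


def unpack_fields(blob):
    """Read the fields back by following the stored lengths; no separators needed."""
    fields, pos = [], 0
    while pos < len(blob):
        (size,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        fields.append(blob[pos:pos + size])
        pos += size
    return fields


# The payload may freely contain byte values that double as control characters.
fields = [b"plain", b"has \x1e and \x1f inside", b""]
print(unpack_fields(pack_fields(fields)) == fields)
```

Because the reader never searches for a delimiter, the data needs no escaping, at the cost of the format no longer being readable as text.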

There are, however, a number of places where these characters are still of use. A few of them are still prevalent in terminals, such as the bell, backspace, and line-ending characters. However, their use isn't limited to terminal control.

One of my current projects is a remote interface to my media player of choice, Foobar2000. As all the transmissions are tag and command data, and do not contain binary content, the stream is treated as UTF-8 data. This kind of data is well suited to being split into fields, lines, etc. with the separator characters, and I don't even have to assume that the separators never appear in the tag data: the escape character can be used to escape any such instances.
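A minimal Python sketch of this kind of splitting and escaping, not the actual project code: it assumes US separates fields, RS ends each record, and an ESC in front of any of the three means "take the next character literally". The function names are made up for illustration.

```python
US, RS, ESC = "\x1f", "\x1e", "\x1b"
SPECIAL = {US, RS, ESC}


def escape(field: str) -> str:
    """Prefix every separator or escape character inside a field with ESC."""
    return "".join(ESC + c if c in SPECIAL else c for c in field)


def encode(records) -> str:
    """Join fields with US and terminate every record with RS."""
    return "".join(US.join(escape(f) for f in rec) + RS for rec in records)


def decode(stream: str):
    """Split a stream back into records, honouring ESC-escaped characters."""
    records, fields, buf = [], [], []
    chars = iter(stream)
    for c in chars:
        if c == ESC:
            buf.append(next(chars, ""))  # the escaped character is plain data
        elif c == US:
            fields.append("".join(buf))
            buf = []
        elif c == RS:
            fields.append("".join(buf))
            records.append(fields)
            fields, buf = [], []
        else:
            buf.append(c)
    return records  # anything after the last RS is an incomplete record and is dropped


tags = [["artist", "Some\x1fBand"], ["title", "Track\x1b1"]]
print(decode(encode(tags)) == tags)
```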

Some of these characters are also used in the command stream: ACK and NAK are useful as pre-defined constants for indicating all sorts of state changes, as well as plain acknowledgements.

Here is where I begin to be a bit more inventive with my usage of the control characters: for the Foobar2000 project, I start each non-data transmission with an ENQ, skip the heading and text characters (SOH, STX, ETX), and end the command with EOT.
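The framing can be sketched in a few lines of Python. This is an illustration under assumptions rather than the project's real protocol: the command name `pause`, the sample tag data, and the function names are all hypothetical. Working on the raw bytes is safe because the bytes 0x04 and 0x05 never occur inside a multi-byte UTF-8 sequence.

```python
ENQ, EOT, ACK, NAK = b"\x05", b"\x04", b"\x06", b"\x15"


def frame_command(command: str) -> bytes:
    """Wrap a command as ENQ <utf-8 text> EOT."""
    payload = command.encode("utf-8")
    if ENQ in payload or EOT in payload:
        raise ValueError("command text must not contain ENQ or EOT")
    return ENQ + payload + EOT


def split_stream(stream: bytes):
    """Yield ("data", text) and ("command", text) chunks in stream order."""
    pos = 0
    while pos < len(stream):
        start = stream.find(ENQ, pos)
        if start == -1:  # no more commands, the rest is data
            yield ("data", stream[pos:].decode("utf-8"))
            return
        if start > pos:
            yield ("data", stream[pos:start].decode("utf-8"))
        end = stream.find(EOT, start + 1)
        if end == -1:
            raise ValueError("command started with ENQ but never ended with EOT")
        yield ("command", stream[start + 1:end].decode("utf-8"))
        pos = end + 1


def reply(ok: bool) -> bytes:
    """A single-byte answer: ACK on success, NAK otherwise."""
    return ACK if ok else NAK


stream = b"artist=Foo" + frame_command("pause") + b"title=Bar"
for kind, text in split_stream(stream):
    print(kind, text)
```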

In a more recent project, I got to use FS and RS characters in a flat-file database. This gave a much reduced chance of problems occurring in the data: these non-printable characters were rejected by the front end, so a field could never contain a separator, and the fields were handled more reliably than in, say, a CSV.
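A minimal sketch of the idea in Python, assuming FS between fields and an RS after every record as described above. Which characters the front end rejected is not spelled out, so the choice here (refuse the four separator characters) and the function names are my assumptions.

```python
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"
SEPARATORS = {FS, GS, RS, US}


def validate(value: str) -> str:
    """Play the part of the front end: refuse values containing a separator."""
    if any(c in SEPARATORS for c in value):
        raise ValueError("separator characters are not allowed in field values")
    return value


def dump_table(rows) -> str:
    """Serialise rows as FS-separated fields, each record terminated by RS."""
    return "".join(FS.join(validate(v) for v in row) + RS for row in rows)


def load_table(text: str):
    """Split on RS then FS; no quoting rules are needed, unlike CSV."""
    return [record.split(FS) for record in text.split(RS)[:-1]]


rows = [["Smith, John", 'said "hi"'], ["multi\nline", ""]]
print(load_table(dump_table(rows)) == rows)
```

Commas, quotes, and newlines pass straight through as ordinary data, which is exactly where a naive CSV reader tends to trip up.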

So, they're there for you when you need them: the ASCII non-printable characters, perfect matches for the printf/scanf family of functions, and helping you look after your data.
