The Perfect Whitelist

A character class can be thought of as a character range in which each character is represented by its index in the ASCII table. We can create unusual ranges that cover the ideal set of characters allowed for input.

Regular Expression Character Classes and the ASCII Table

Character classes are a crucial tool of the regular expression engine. They serve a variety of purposes, but the focus here is their ability to express character ranges. Using hyphen notation we can write a character class that specifies a common range, like [a-z], which means "a through z", or a number range like [0-9].

ASCII plays a pivotal role in computers and communication. It defines a character encoding scheme that maps each character to a numeric code and back, and it remains the foundation that most modern encodings, including UTF-8, build on. The ASCII table gives humans a guide for relating characters to their numeric codes.
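
To make the mapping between characters and codes concrete, here is a small sketch in Python (chosen only for illustration; the article itself is language-agnostic). It uses the built-in ord and chr functions, and a hypothetical helper named chars_matched that lists every ASCII character a given character class accepts.

```python
import re

# ord() gives a character's numeric code; chr() goes the other way.
assert ord("a") == 97 and ord("z") == 122
assert chr(48) == "0" and chr(57) == "9"

def chars_matched(char_class):
    """Return every ASCII character (codes 0-127) matched by a character class.

    Hypothetical helper for illustration; not part of any library.
    """
    pattern = re.compile(char_class)
    return "".join(chr(code) for code in range(128) if pattern.fullmatch(chr(code)))

print(chars_matched("[a-z]"))  # abcdefghijklmnopqrstuvwxyz
print(chars_matched("[0-9]"))  # 0123456789
```

In other words, [a-z] is shorthand for "any character whose code is between 97 and 122".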

Abnormal Character Classes

The ASCII table defines an ordered list of the characters found on an English keyboard, and character classes let us create ranges along that list within regular expressions. So we could, for instance, create a character range like...


 [A-z]
					

At first glance this reads as capital A through lowercase z. However, the range covers more than just alphabetic characters. That isn't necessarily a bad thing, but you need to be aware of what is included in the range.

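The range also contains the six punctuation characters that sit between Z (90) and a (97) in the ASCII table: [ \ ] ^ _ and the backtick. The infamous <angle brackets> are not among them, since they sit at 60 and 62, below A (65), but backslashes and backticks can cause plenty of trouble on their own. So if you're hoping such a whitelist would be just the ticket for your alpha-only action, you're going to have to rethink things. Being aware of this flaw allows us to adapt the whitelist to ensure we're accepting only valid input.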
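
A quick way to see exactly what a range like [A-z] covers is to enumerate it. The following Python sketch (the variable names are my own, purely illustrative) lists every ASCII character the class matches and picks out the ones that are not letters.

```python
import re
import string

wide = re.compile(r"[A-z]")

# Every ASCII character (codes 0-127) accepted by [A-z].
matched = [chr(code) for code in range(128) if wide.fullmatch(chr(code))]

# The non-letters that ride along: codes 91 through 96.
extras = [c for c in matched if c not in string.ascii_letters]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

# '<' is code 60, below 'A' (65), so it falls outside the range.
print(wide.fullmatch("<"))  # None

# A letters-only whitelist needs two separate ranges.
strict = re.compile(r"[A-Za-z]")
print(strict.fullmatch("_"))  # None
```

Writing the class as [A-Za-z] keeps the two alphabetic runs and skips the six characters between them.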

The Great Character Range

The ASCII table places the printable English keyboard characters in the range 32 to 126 (decimal), which begins at the space character and ends at the tilde.


 [ -~]
					

It looks pretty strange, but that's a character range that will match every printable character on the English keyboard. That makes it a really good start for a whitelist: it rejects the MS curly quotes and Asian glyphs that can cause trouble at various levels of your site or application. Note that it also rejects tabs and line breaks, which fall below 32.
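
As a rough illustration, the range can be used to validate a whole string. This Python sketch assumes a hypothetical helper called is_printable_ascii; the name and the choice to treat an empty string as valid are mine, not the article's.

```python
import re

# Zero or more characters, each between space (32) and tilde (126).
PRINTABLE = re.compile(r"[ -~]*")

def is_printable_ascii(text):
    """True if every character in text is in the space-to-tilde range."""
    return PRINTABLE.fullmatch(text) is not None

print(is_printable_ascii("Hello, world! ~"))     # True
print(is_printable_ascii("\u201cquoted\u201d"))  # False: curly quotes
print(is_printable_ascii("line one\nline two"))  # False: newline is code 10
```

If multi-line input is expected, the class would need to be widened, for example to [ -~\r\n\t].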

The Angle Escape

In most web-based input filtering, the most wanted characters are the angle brackets. In some cases it's an attacker attempting to inject malicious code, but more often it's just an oblivious user who has copied from a web resource and pasted into your latest and greatest form. Simply stripping the angle bracket characters would blunt the threat from an attacker, but it leaves behind a mess of attributes and tag names for your everyday user.

With another crafty regex you can strip the entire tag, leaving only the text content behind. In English, the regex reads: "Match an opening angle bracket, then zero or more characters that are not a closing angle bracket, then a closing angle bracket."

					   
 <[^>]*>
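
Here is a minimal Python sketch of the tag-stripping idea, assuming a hypothetical function name strip_tags. It replaces each match of the pattern with an empty string.

```python
import re

# An opening bracket, anything that is not a closing bracket, a closing bracket.
TAG = re.compile(r"<[^>]*>")

def strip_tags(text):
    """Remove anything that looks like a tag, keeping the text between tags."""
    return TAG.sub("", text)

print(strip_tags('<a href="http://example.com">click <b>here</b></a>'))  # click here
```

The pattern knows nothing about HTML itself. Any text between a "<" and the next ">" is removed, and a "<" with no ">" after it is left in place, so this is a cleanup step rather than a full parser.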
					

The Final Filter

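By using both of these regular expressions together we minimize the risk of malicious input as well as corrupt characters. The combined pattern matches either a whole tag or any single character outside the space-to-tilde range; replacing every match with an empty string leaves only the whitelisted text.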

					
 (<[^>]*>)|([^ -~])
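
To sketch how the combined pattern might be applied, the Python below removes every match from the input. The function name clean is hypothetical, and deleting matches (rather than rejecting the input outright) is an assumption about how the filter is meant to be used.

```python
import re

# Either a whole tag, or any single character outside space (32) to tilde (126).
FILTER = re.compile(r"(<[^>]*>)|([^ -~])")

def clean(text):
    """Strip tags and every character outside the printable ASCII range."""
    return FILTER.sub("", text)

print(clean("<p>Caf\u00e9 \u201cspecial\u201d</p>"))  # Caf special
print(clean("a < b"))  # a < b  (a lone bracket is printable and not a tag)
```

Because the tag alternative comes first, whole tags are removed before their individual characters are tested against the whitelist.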
				    



