Put yourself in the shoes of an inexperienced mathematician solving its first th...

zAy0LfpBZLC8mAC · on April 14, 2014

... which is exactly what you should not do. Here, let me post this:

"If you want to create a horizontal line in HTML, you write <hr>"

See that? There is nothing "unclean" about it, hence you should not "clean" it. You just have to encode it if you output it embedded in HTML. That's why calling it "sanitizing" is misleading.
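To make that concrete, here is a minimal sketch in Python using the standard library's `html.escape`. The variable names are only for illustration. The point is that the text is encoded for the HTML context, not altered, and decoding returns exactly what was written.

```python
import html

comment = "If you want to create a horizontal line in HTML, you write <hr>"

# Encode for the HTML text context: only the representation changes.
encoded = html.escape(comment)
page = "<p>" + encoded + "</p>"

# Decoding gives back exactly what the user wrote, so nothing was "cleaned".
decoded = html.unescape(encoded)

print(page)
print(decoded == comment)
```

A browser rendering `page` would display the literal characters `<hr>` instead of drawing a line.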

gizmogwai · on April 14, 2014

Again, wrong.

Encoding, without further context, means "convert into a coded form". Hmm, that's not exactly what we want. So let's add the computing context: now we have, as an example, the ability to encode a WAVE file into an MP3. But wait, we lost information here! Bummer...

Sanitization in the context of computing does not specifically mean that you have to "encode", or better, "transcode". It means that you have to take appropriate measures so that your input DATA cannot be interpreted as CODE by the receiver. Bonus points if the measure you choose is lossless in terms of the information carried by your data.
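One example of such a measure that is not an encoding step at all is a parameterized SQL query. The sketch below uses Python's built-in `sqlite3` module; the `users` table and the `find_user` helper are hypothetical, made up for this illustration. The input travels to the database as data, separate from the SQL text, and it is stored and returned without loss.

```python
import sqlite3

def find_user(conn, name):
    # The ? placeholder binds `name` as a value; it is never parsed as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

hostile = "alice' OR '1'='1"

# The hostile string matches no row: it is compared as a plain value.
rows_before = find_user(conn, hostile)

# It can also be stored and read back unchanged (lossless).
conn.execute("INSERT INTO users VALUES (?)", (hostile,))
rows_after = find_user(conn, hostile)
```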

zAy0LfpBZLC8mAC · on April 14, 2014

Well, yeah, "transcode" might be better, but then again there isn't really any hard difference between "encode" and "transcode". Or possibly "encode" on its own is just a useless term, because encoding can never happen without an associated decoding of the information source?

But no, in a way, you are getting it all backwards, or at least making it a bit confusing.

This is how you should construct a system that processes user input:

First, the input format should be defined such that it can only describe things that make sense within the given context. In particular, it should usually not be possible to represent instructions for programming language interpreters in it.

Second, whenever you have to represent user input in some context, you have to encode (well, transcode) it into the format of that context. This transcoding generally should only change representation and not change the meaning of the converted information.

This automatically implies that you can not "inject code". There isn't really anything magic about "code", and I think that misconception is a large part of the confusion around "sanitizing input". The input can not represent code and the conversion does not change the meaning, so the transcoding obviously can not cause code to appear either, and thus you are safe. And not only are you safe, your system also works as it should, which it potentially does not if you start "removing dangerous characters".

That is why you should not "sanitize", but only validate and encode/transcode/convert. Which you need to do anyway for your system to work properly. Lack of injection vulnerabilities will result automatically.
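A rough sketch of those two steps in Python, using only the standard library. The input format (a display name of 1 to 40 characters without control characters) and all function names are assumptions made up for this example, not anything prescribed above.

```python
import html
import json
import re
from urllib.parse import quote

# Hypothetical input format: 1-40 characters, no ASCII control characters.
NAME_RE = re.compile(r"[^\x00-\x1f\x7f]{1,40}")

def validate_name(raw: str) -> str:
    """Step 1: accept or reject; never modify the input."""
    if not NAME_RE.fullmatch(raw):
        raise ValueError("not a valid display name")
    return raw

# Step 2: one transcoder per output context, each changing representation only.
def to_html(name: str) -> str:
    return "<b>" + html.escape(name) + "</b>"

def to_url(name: str) -> str:
    return "/users?name=" + quote(name, safe="")

def to_json(name: str) -> str:
    return json.dumps({"name": name})

name = validate_name('Tom & "Jerry" <3')
print(to_html(name))
print(to_url(name))
print(to_json(name))
```

Note that `&`, `"` and `<` all pass validation, because they are perfectly sensible in a display name. Each context's encoder takes care of representing them, and decoding any of the three outputs yields the original name again.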