in data.json by

The 32 control characters U+0000 through U+001F are never allowed in raw form in JSON strings.

From (link: https://www.ecma-international.org/publications/standards/Ecma-404.htm text: ECMA-404):

bq. All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F.

From (link: https://tools.ietf.org/html/rfc7159#section-7 text: RFC 7159):

bq. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
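The rule quoted above can be checked with any strict parser. The following is a minimal sketch using Python's standard {{json}} module rather than data.json; the helper name {{is_valid_json}} is hypothetical and exists only for this illustration.

```python
import json


def is_valid_json(text):
    """Return True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)  # strict=True by default: raw control characters are rejected
        return True
    except json.JSONDecodeError:
        return False


# A JSON string containing an actual U+0001 character between the quotes.
raw_control = '"a\x01b"'

# The same string with the control character written as a six-character escape.
escaped_control = '"a\\u0001b"'

print(is_valid_json(raw_control))
print(is_valid_json(escaped_control))
```

The first document is rejected and the second is accepted, and the second decodes back to the original three-character string.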

When {{:escape-unicode true}} (the default), all characters outside the 32-127 range are escaped using {{\uCAFE}} syntax (or, for the special whitespace cases, using named escapes).

However, when {{:escape-unicode false}} is supplied to the {{write}} or {{write-str}} functions, some of the control characters are written in raw form, resulting in invalid JSON. This is improper behavior; the library should never produce JSON that violates the specification(s), no matter what options the user supplies.

This patch escapes the control characters even when {{:escape-unicode false}} is supplied.

There is a bit of special handling to exclude the named escapes in the control character range: the {{write-string}} function always escapes the characters 8, 9, 10, 12, and 13 ({{\b}}, {{\t}}, {{\n}}, {{\f}}, and {{\r}}), which have special escaped names and thus require special treatment.
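The intended behavior can be sketched roughly as follows. This is an illustration in Python, not the patch itself and not data.json's code: the function {{write_json_string}} and its {{escape_unicode}} parameter are hypothetical stand-ins for {{write-string}} and the {{:escape-unicode}} option, and the 127 cutoff follows the range described above.

```python
import json

# Characters that always get a short named escape: the quotation mark, the
# reverse solidus, and the five control characters that have named forms.
NAMED_ESCAPES = {
    '"': '\\"',
    "\\": "\\\\",
    "\b": "\\b",  # U+0008
    "\t": "\\t",  # U+0009
    "\n": "\\n",  # U+000A
    "\f": "\\f",  # U+000C
    "\r": "\\r",  # U+000D
}


def write_json_string(s, escape_unicode=True):
    """Serialize s as a JSON string literal (hypothetical sketch)."""
    out = ['"']
    for ch in s:
        cp = ord(ch)
        if ch in NAMED_ESCAPES:
            out.append(NAMED_ESCAPES[ch])
        elif cp < 0x20:
            # Remaining control characters: escaped regardless of escape_unicode.
            out.append("\\u%04x" % cp)
        elif escape_unicode and cp > 127:
            # Escape as UTF-16 code units so astral characters become surrogate pairs.
            units = ch.encode("utf-16-be", "surrogatepass")
            for i in range(0, len(units), 2):
                out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
        else:
            out.append(ch)
    out.append('"')
    return "".join(out)


print(write_json_string("a\x01b\u00e9", escape_unicode=False))
print(write_json_string("a\x01b\u00e9", escape_unicode=True))
```

With {{escape_unicode=False}} the U+0001 character is still written as an escape while the non-ASCII character is left raw, so a strict parser accepts the output in both modes.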

I did not add any control character validation to the parsing functionality, following Postel's law:

bq. [TCP] implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.


Why use {{:escape-unicode false}} at all if I'm worried about compliance? Well, Unicode is a really good idea, and pairs very nicely with the UTF-8 character encoding, which is also a really good idea. UTF-8 encodes non-ASCII text much more compactly than spelling out literal {{\uXXXX}} escapes. The default ({{:escape-unicode true}}) does not leverage the compression benefits of UTF-8. That is a trade-off, since ASCII is nearly impossible to screw up, compared to UTF-8, if you aren't expecting UTF-8 (but you should be expecting UTF-8).
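To put rough numbers on the size difference, here is a small sketch using Python's standard {{json}} module, whose {{ensure_ascii}} flag plays a role comparable to {{:escape-unicode}}. The helper {{encoded_size}} is hypothetical, and the counts include the two surrounding quotation marks.

```python
import json


def encoded_size(s, ensure_ascii):
    """Number of bytes in the UTF-8 encoding of s serialized as a JSON string."""
    return len(json.dumps(s, ensure_ascii=ensure_ascii).encode("utf-8"))


for text in ["\u00e9", "\u65e5\u672c\u8a9e", "\U0001F600"]:
    escaped = encoded_size(text, ensure_ascii=True)
    raw = encoded_size(text, ensure_ascii=False)
    print(f"{text!r}: {escaped} bytes escaped, {raw} bytes as raw UTF-8")

# Even with ensure_ascii=False, Python's writer still escapes control characters.
print(json.dumps("\x01", ensure_ascii=False))
```

The last line shows the behavior requested here: raw UTF-8 for ordinary text, but escapes for the control characters.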

So, in short, I want to be able to leverage UTF-8 and remain confident that I'll get valid JSON output, without having to sanitize the (unusual) control characters out of all the strings in my data.

1 Answer

Reference: https://clojure.atlassian.net/browse/DJSON-28 (reported by alex+import)
...