The 32 control characters U+0000 through U+001F are never allowed in raw form in JSON strings.
From (link: https://www.ecma-international.org/publications/standards/Ecma-404.htm text: ECMA-404):
bq. All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F.
From (link: https://tools.ietf.org/html/rfc7159#section-7 text: RFC 7159):
bq. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
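As a quick illustration of the rule quoted above, here is a minimal sketch using Python's standard {{json}} module as a stand-in for a strict parser. It is not part of this library or the patch, and the helper name {{is_valid_json}} is made up for the example.

```python
import json

def is_valid_json(text: str) -> bool:
    """Return True if Python's JSON parser (strict by default) accepts text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

raw = '"a\x01b"'         # U+0001 placed directly between the quotation marks
escaped = '"a\\u0001b"'  # the same character written as a \u escape

raw_ok = is_valid_json(raw)          # the raw control character is rejected
escaped_ok = is_valid_json(escaped)  # the escaped form is accepted
```

A strict consumer rejects the raw form outright, which is why output containing raw control characters is a problem in practice and not just on paper.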
When Unicode escaping is enabled (the default), all characters outside the 32-127 range are escaped using {{\uCAFE}} syntax (or, for the special whitespace cases, using named escapes).
However, when Unicode escaping is disabled in the options supplied to the {{write}} or {{write-str}} functions, some of the control characters are written in raw form, resulting in invalid JSON. This is improper behavior; the library should never produce JSON that violates the specification(s), no matter what options the user supplies.
This patch escapes the control characters even when Unicode escaping is disabled.
There is a bit of special handling to exclude the named escapes in the control character range: the characters 8, 9, 10, 12, and 13 have special escaped names ({{\b}}, {{\t}}, {{\n}}, {{\f}}, {{\r}}), and the {{write-string}} function already always escapes them, so they are left out of the new check.
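The intended behavior can be sketched roughly as follows. This is a language-neutral illustration in Python, not the actual patch: the function name mirrors {{write-string}}, but the {{escape_unicode}} parameter and the {{NAMED_ESCAPES}} table are assumptions made for the example.

```python
import json

# Control characters 8, 9, 10, 12, 13 have named escapes and are always escaped.
NAMED_ESCAPES = {8: "\\b", 9: "\\t", 10: "\\n", 12: "\\f", 13: "\\r"}

def write_string(s: str, escape_unicode: bool = True) -> str:
    """Encode s as a JSON string literal (hypothetical sketch)."""
    out = ['"']
    for ch in s:
        cp = ord(ch)
        if ch == '"' or ch == "\\":
            out.append("\\" + ch)              # quotation mark, reverse solidus
        elif cp in NAMED_ESCAPES:
            out.append(NAMED_ESCAPES[cp])      # named escape, regardless of options
        elif cp < 0x20:
            out.append(f"\\u{cp:04x}")         # remaining controls: always \uXXXX
        elif escape_unicode and cp > 127:
            if cp > 0xFFFF:                    # non-BMP: emit a UTF-16 surrogate pair
                cp -= 0x10000
                out.append(f"\\u{0xD800 + (cp >> 10):04x}")
                out.append(f"\\u{0xDC00 + (cp & 0x3FF):04x}")
            else:
                out.append(f"\\u{cp:04x}")
        else:
            out.append(ch)                     # written in raw form
    out.append('"')
    return "".join(out)

# With escaping disabled, "é" stays raw, but U+0001 and the tab are still escaped.
encoded = write_string("a\x01b\té", escape_unicode=False)
decoded = json.loads(encoded)  # a strict parser accepts the result
```

The key point is the ordering of the branches: the control-character checks come before the option is consulted, so no option value can cause a raw control character to be emitted.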
I did not add any control character validation to the parsing functionality, following Postel's law:
bq. [TCP] implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.
Why disable Unicode escaping at all if I'm worried about compliance? Well, Unicode is a really good idea, and pairs very nicely with the UTF-8 character encoding, which is also a really good idea. UTF-8 encodes text much more efficiently than spelling out literal escapes. The default (escaping enabled) does not leverage the space savings of UTF-8, which is a trade-off, since ASCII is nearly impossible to screw up, compared to UTF-8, if you aren't expecting UTF-8 (but you should be expecting UTF-8).
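To make the size difference concrete, here is a small sketch using Python's {{json.dumps}}, whose {{ensure_ascii}} flag is only an analogue of the option discussed here, not this library's API. The sample text is arbitrary.

```python
import json

text = "naïve café 日本語"

escaped = json.dumps(text)                  # ensure_ascii=True is Python's default
raw = json.dumps(text, ensure_ascii=False)  # non-ASCII characters written as-is

# Size of each form once encoded as UTF-8 for storage or transport.
escaped_size = len(escaped.encode("utf-8"))
raw_size = len(raw.encode("utf-8"))
```

Each escaped character costs six ASCII bytes, while the raw UTF-8 form of the same character takes two or three bytes here, so {{raw_size}} comes out smaller and both forms decode to the same string.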
So, in short, I want to be able to leverage UTF-8 and remain confident that I'll get valid JSON output, without having to sanitize the (unusual) control characters out of all the strings in my data.