Literals for Unicode code points (and perhaps also sequences thereof)


asked Jan 12 in Syntax and reader by Peter Monks
edited Jan 15 by Peter Monks

Context
For historical reasons the JVM type system's support for Unicode code points is poor. This is usually invisible to the developer, but it becomes a hassle when String literals containing non-Latin1 code points (in particular those above U+FFFF) are used in code. It is particularly problematic in cross-platform (cljc) code, since other platforms may not share this historical oddity, so solutions that "work" in Clojure on the JVM may break in other dialects.

For example, the transgender flag emoji (a single grapheme cluster that happens to be defined by 5 Unicode code points) cannot easily be constructed at the REPL or in a source file without detailed knowledge of the JVM's history (and associated knowledge of UTF-16, an increasingly obsolete character encoding).

Using the documented code points for this grapheme cluster with the JVM's Unicode escaping mechanism does not give the expected outcome:
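A text sketch of what goes wrong, assuming Clojure on the JVM (the var name `naive` is only for illustration). The `\u` escape in a String literal consumes exactly four hex digits, so the fifth digit of U+1F3F3 is read as an ordinary character:

```clojure
;; "\u1F3F3" is NOT U+1F3F3. It is U+1F3F (a Greek Extended letter)
;; followed by the plain character "3".
(def naive "\u1F3F3\uFE0F\u200D\u26A7\uFE0F")

;; First code point in the string, printed as hex.
(format "%04X" (.codePointAt ^String naive 0)) ;=> "1F3F"

;; The leftover fifth hex digit shows up as its own character.
(.charAt ^String naive 1)                      ;=> \3
```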

The correct but unintuitive solution is to remember that the JVM does not directly support Unicode code points in the supplemental planes, and then to translate the supplemental code point U+1F3F3 into its UTF-16 surrogate pair (two code units):
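For reference, the translation can be sketched in a few lines, assuming Clojure on the JVM. The helper name `surrogate-pair` is made up for this example; the arithmetic is the standard UTF-16 encoding of a code point above U+FFFF:

```clojure
(defn surrogate-pair
  "Returns [high low] UTF-16 code units for a code point above U+FFFF."
  [cp]
  (let [offset (- cp 0x10000)]
    [(+ 0xD800 (bit-shift-right offset 10)) ; top 10 bits of the offset
     (+ 0xDC00 (bit-and offset 0x3FF))]))   ; bottom 10 bits of the offset

(mapv #(format "%04X" %) (surrogate-pair 0x1F3F3)) ;=> ["D83C" "DFF3"]

;; The flag written with the surrogate pair in place of U+1F3F3.
(def flag "\uD83C\uDFF3\uFE0F\u200D\u26A7\uFE0F")

(count flag)                                   ;=> 6 (UTF-16 code units)
(.codePointCount ^String flag 0 (count flag))  ;=> 5 (code points)

;; The JVM can also do the conversion itself at runtime.
(= (String. (Character/toChars 0x1F3F3)) "\uD83C\uDFF3") ;=> true
```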

Note: I had to use screenshots for this, since ask.clojure doesn't appear to support Unicode supplemental code points properly either...

Question/request/proposal
Clojure can sidestep this issue in a purely accretive manner, providing better consistency across the JVM and other runtimes, by adding direct support for Unicode literals.

This would involve adding a new literal syntax that represents a single Unicode code point, and perhaps also a new literal syntax that represents a sequence of Unicode code points (perhaps supporting not only the novel Unicode code point literal, but also the existing Character and String literals). Both of these new literals would produce a standard JVM (or JavaScript, or ...) String object, in whatever native encoding those objects employ on their respective platforms. Once such literals have been read, the result is just the existing String data type, so there is no runtime impact.

An example literal syntax

While I am not proposing a specific syntax for these new literals here (though such a task is a necessary step), for illustrative purposes here is an example of what these literals might approximately look like:

Single Unicode code point literals:

#U+0061: produces a String containing the Latin letter a: "a"
#U+1F921: produces a String containing the clown emoji (which ask.clojure cannot display)

Sequences of Unicode code point literals:

#U+[U+0061 U+0020 U+1F921]: produces the 3-grapheme-cluster String: "a <clown emoji>"
#U+["a " U+1F921]: produces the same String, but demonstrates why it may be useful to support a mix of literals within the sequence (for readability)
#U+[\a \space U+1F921]: ditto
#U+[U+1F3F3 U+FE0F U+200D U+26A7 U+FE0F]: produces a String containing a single grapheme cluster (the transgender flag emoji)

This final example is an ideal test case: the transgender flag emoji is a single Unicode grapheme cluster defined by 5 Unicode code points, but on the JVM (for the historical reasons described above) it is made up of 6 Characters (UTF-16 code units).
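Until something like this exists, the sequence form can be roughly emulated with an ordinary function. This is only a sketch for Clojure on the JVM; `unicode-str` is a hypothetical name, and treating integers as code points while passing Characters and Strings through unchanged is my assumption about how a mixed sequence might behave:

```clojure
(defn unicode-str
  "Builds a String from a mix of integer code points, Characters and Strings."
  [& parts]
  (let [sb (StringBuilder.)]
    (doseq [p parts]
      (if (integer? p)
        (.appendCodePoint sb (int p)) ; integer: treated as a code point
        (.append sb (str p))))        ; Character or String: appended as-is
    (str sb)))

;; Same result whichever mix of parts is used.
(= (unicode-str 0x61 0x20 0x1F921)
   (unicode-str "a " 0x1F921)
   (unicode-str \a \space 0x1F921)) ;=> true

;; The test case from above: 5 code points in, 6 Characters out.
(count (unicode-str 0x1F3F3 0xFE0F 0x200D 0x26A7 0xFE0F)) ;=> 6
```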

Note that the sequence literal may not be necessary, since str could be used with the single code point literal syntax; e.g. (str #U+1F3F3 #U+FE0F #U+200D #U+26A7 #U+FE0F). Whether shifting the cost of string concatenation from read-time to runtime matters or not is another topic worthy of deeper consideration.
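To make the read-time versus runtime distinction concrete, here is a rough approximation using Clojure's existing tagged literal mechanism (`*data-readers*`). The tags `#u/cp` and `#u/cps`, and the helper names, are invented for this sketch and are not the syntax being proposed; it also assumes Clojure on the JVM:

```clojure
(defn cp->str
  "Turns one code point (an integer) into a String."
  [cp]
  (String. (Character/toChars (int cp))))

;; Hypothetical tags, for illustration only.
(def hypothetical-readers
  {'u/cp  cp->str
   'u/cps (fn [cps] (apply str (map cp->str cps)))})

(defn read-with-unicode-tags
  "Reads one form from source with the hypothetical tags installed."
  [source]
  (binding [*data-readers* (merge *data-readers* hypothetical-readers)]
    (read-string source)))

;; The tag is resolved by the reader, so the form that comes back
;; already contains plain Strings. Only the call to str is left for runtime.
(read-with-unicode-tags "(str #u/cp 0x1F3F3 #u/cp 0xFE0F)")

;; With a sequence tag, the whole String is built at read time.
(read-with-unicode-tags "#u/cps [0x1F3F3 0xFE0F 0x200D 0x26A7 0xFE0F]")
```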

Other notes
Orthogonal to this proposal (at least from a Clojure core perspective; from a user perspective they're closely related), it would also be useful if Clojure core (perhaps in the clojure.string namespace) had functions to turn Strings into sequences of code points (as integers) and vice versa. Both the JVM and JavaScript provide native APIs for doing this (and presumably other platforms do too), but providing these as standard functions in Clojure core (similar to what was done with parse-long, parse-double, and parse-boolean in Clojure v1.11) has value and is also purely accretive.
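As a sketch of what such functions might look like on the JVM (Java 8 or later): the names `code-points` and `from-code-points` are placeholders of mine, not a naming proposal, and a ClojureScript version would need the JavaScript equivalents instead.

```clojure
(defn code-points
  "Returns the code points of s as a vector of integers."
  [^String s]
  (vec (.toArray (.codePoints s))))

(defn from-code-points
  "Builds a String from a sequence of integer code points."
  [cps]
  (let [sb (StringBuilder.)]
    (doseq [cp cps]
      (.appendCodePoint sb (int cp)))
    (str sb)))

;; 6 Characters in, 5 code points out.
(code-points "\uD83C\uDFF3\uFE0F\u200D\u26A7\uFE0F")
;=> [127987 65039 8205 9895 65039]

;; Round trip.
(= "a \uD83E\uDD21" (from-code-points (code-points "a \uD83E\uDD21"))) ;=> true
```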

commented Jan 12 by alexmiller

commented Jan 12 by Adrian

commented Jan 12 by Peter Monks
edited Jan 15 by Peter Monks

@Adrian yeah my proposal deliberately doesn't get into grapheme clusters (except to mention them in passing), since (as you say) that changes over time as new Unicode versions are released, and the JVM doesn't do a particularly good job of staying up to date (the `java.text.BreakIterator` class is the specific culprit here). FWIW the only reliable way I've found to identify grapheme clusters as per recent Unicode versions is to add ICU4J to my classpath and use their `BreakIterator` implementation instead of the JVM's.
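For anyone curious, counting grapheme clusters with the JDK's own `BreakIterator` looks roughly like this (`grapheme-count` is just a name I picked). The answer for newer emoji sequences depends on which Unicode version your JDK implements, so I'm deliberately not showing an expected result for the flag:

```clojure
(import 'java.text.BreakIterator)

(defn grapheme-count
  "Counts grapheme clusters in s according to the JDK's BreakIterator."
  [^String s]
  (let [it (doto (BreakIterator/getCharacterInstance)
             (.setText s))]
    (loop [n 0]
      ;; .next returns the next boundary, or DONE at the end of the text.
      (if (= BreakIterator/DONE (.next it))
        n
        (recur (inc n))))))

(grapheme-count "abc") ;=> 3

;; Result varies by JDK version; a fully up-to-date implementation gives 1.
(grapheme-count "\uD83C\uDFF3\uFE0F\u200D\u26A7\uFE0F")
```

As far as I know, ICU4J's `com.ibm.icu.text.BreakIterator` has a similarly shaped API, so swapping it in should mostly be a matter of changing the import once ICU4J is on the classpath.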

But just to reiterate: with this proposal a Clojure developer would still have to manually list out each individual code point that makes up a multi-code-point grapheme cluster (such as the transgender flag, grapheme clusters that make use of combining diacritics, etc.). This proposal doesn't change that. What it does do is eliminate the (often forgotten, and platform-specific / non-portable) step of having to convert supplemental plane code points into a UTF-16 surrogate pair. At this point that has become a pretty JVM-specific thing that few people remember, as not many commonly used systems use UTF-16 any more; pretty much everything else has embraced the "UTF-8 everywhere" mantra.

And yeah, it took me a few edits before I figured out that ask.clojure was corrupting my supplemental plane code points in this post. I guess it too uses UTF-16 (or perhaps UCS-2 <vomiting emoji>) under the hood. <winking emoji>
