
Context
For historical reasons the JVM type system's support for Unicode code points is poor: a JVM char is a 16-bit UTF-16 code unit, not a code point. This is usually invisible to the developer, but it becomes a hassle when String literals containing non-Latin1 code points are used in code. It is particularly problematic in cross-platform (cljc) code, since other platforms may not share this historical oddity, so solutions that "work" in ClojureJVM may break in other dialects.

For example, the transgender flag emoji (a single grapheme cluster that happens to be defined by 5 Unicode code points) cannot easily be constructed at the REPL or in a source file without detailed knowledge of the JVM's history (and associated knowledge of UTF-16, an increasingly obsolete character encoding).

Using the documented code points for this grapheme cluster with the JVM's Unicode escaping mechanism does not give the expected outcome:

The correct but unintuitive solution is to remember that the JVM does not directly support Unicode code points in the supplemental planes, and then to translate the supplemental code point U+1F3F3 into its UTF-16 surrogate pair (two code units):

Note: I had to use screenshots for this, since ask.clojure doesn't appear to support Unicode supplemental code points properly either...
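
In text form, the two attempts look roughly like this. This is a sketch for ClojureJVM rather than a copy of the original screenshots, and the names naive and flag are only for illustration. Clojure's \u string escape takes exactly four hex digits, which is why the first attempt goes wrong:

```clojure
;; Naive attempt: "\u1F3F3" is read as U+1F3F followed by the literal digit 3.
(def ^String naive "\u1F3F3\uFE0F\u200D\u26A7\uFE0F")

;; Working attempt: U+1F3F3 written as its UTF-16 surrogate pair D83C DFF3.
(def ^String flag "\uD83C\uDFF3\uFE0F\u200D\u26A7\uFE0F")

(.codePointAt naive 0)                 ;; => 7999   (0x1F3F, the wrong code point)
(.codePointAt flag 0)                  ;; => 127987 (0x1F3F3)
(count flag)                           ;; => 6 (UTF-16 code units / Characters)
(.codePointCount flag 0 (count flag))  ;; => 5 (Unicode code points)

;; The surrogate pair itself can be looked up instead of computed by hand.
(format "%04X %04X"
        (int (Character/highSurrogate 0x1F3F3))
        (int (Character/lowSurrogate 0x1F3F3)))
;; => "D83C DFF3"
```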

Question/request/proposal
Clojure can sidestep this issue in a purely accretive manner, providing better consistency across the JVM and other runtimes, by adding direct support for Unicode literals.

This would involve adding a new literal syntax that represents a single Unicode code point, and perhaps also a new literal syntax that represents a sequence of Unicode code points (perhaps supporting not only the novel Unicode code point literal, but also the existing Character and String literals). Both of these new literals would produce a standard JVM (or JavaScript, or ...) String object, in whatever native encoding those objects employ on their respective platforms. Once such literals are read, the result is just the extant String data type, so there is no runtime impact.

An example literal syntax

While I am not proposing a specific syntax for these new literals here (though such a task is a necessary step), for illustrative purposes here is an example of what these literals might approximately look like:

Single Unicode code point literals:

  • U+0061: produces a String containing the Latin letter a: "a"
  • U+1F921: produces a String containing the clown emoji (which ask.clojure cannot display)

Sequences of Unicode code point literals:

  • U+[U+0061 U+0020 U+1F921]: produces the 3 grapheme cluster String: "a <clown emoji>"
  • U+["a " U+1F921]: produces the same String, but demonstrates why it may be useful to support a mix of literals within the sequence (for readability)
  • U+[\a \space U+1F921]: ditto
  • U+[U+1F3F3 U+FE0F U+200D U+26A7 U+FE0F]: produces a String containing a single grapheme cluster (the transgender flag emoji)

This final example is an ideal test case, since the transgender flag emoji is a single Unicode grapheme cluster, defined by 5 Unicode code points, but on the JVM (for the historical reason listed originally) is made up of 6 Characters.

Note that the sequence literal may not be necessary, since str could be used with the single code point literal syntax; e.g. (str U+1F3F3 U+FE0F U+200D U+26A7 U+FE0F). Whether shifting the cost of string concatenation from read-time to runtime matters or not is another topic worthy of deeper consideration.
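
To make the intended semantics concrete, here is a minimal runtime approximation for ClojureJVM. The function names cp and cps are hypothetical stand-ins for the two literal forms, not part of Clojure or of this proposal, and unlike real literals they do their work at runtime rather than read-time:

```clojure
(defn cp
  "Hypothetical stand-in for a single code point literal such as U+1F921.
  Returns a String containing the one code point n."
  [n]
  (String. (Character/toChars n)))

(defn cps
  "Hypothetical stand-in for the sequence literal U+[...]. Accepts a mix of
  integer code points, Characters and Strings and returns a single String."
  [& parts]
  (apply str (map #(if (integer? %) (cp %) %) parts)))

(cp 0x0061)                                ;; => "a"
(count (cp 0x1F921))                       ;; => 2 (one code point, two Characters)
(cps "a " 0x1F921)                         ;; "a " followed by the clown emoji
(cps \a \space 0x1F921)                    ;; same String as the line above
(cps 0x1F3F3 0xFE0F 0x200D 0x26A7 0xFE0F)  ;; the transgender flag emoji
```

Note that nobody calling cp or cps has to know about surrogate pairs; Character/toChars takes care of that.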

Other notes
Orthogonal to this proposal (at least from a Clojure core perspective; from a user perspective they're closely related), it would also be useful if Clojure core (perhaps in the clojure.string namespace) had functions to turn Strings into sequences of code points (as integers) and vice versa. Both the JVM and JavaScript provide native APIs for doing this (and presumably other platforms do too), but providing these as standard functions in Clojure core (similar to what was done with parse-long, parse-double, and parse-boolean in Clojure v1.11) has value and is also purely accretive.
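
As a rough illustration of what such functions might look like on the JVM (the names code-points and from-code-points are hypothetical, and a ClojureScript version would presumably wrap JavaScript's codePointAt and String.fromCodePoint instead):

```clojure
(defn code-points
  "Hypothetical: returns the Unicode code points of s as a vector of integers."
  [^String s]
  (vec (.toArray (.codePoints s))))

(defn from-code-points
  "Hypothetical: builds a String from a sequence of integer code points."
  [cps]
  (let [sb (StringBuilder.)]
    (doseq [cp cps]
      (.appendCodePoint sb (int cp)))
    (.toString sb)))

(code-points "\uD83C\uDFF3\uFE0F\u200D\u26A7\uFE0F")
;; => [127987 65039 8205 9895 65039]

(count (from-code-points [0x1F3F3 0xFE0F 0x200D 0x26A7 0xFE0F]))
;; => 6
```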

ago by
Can you explain why that's not the expected outcome in that first example? I just want to make sure I'm understanding.
ago by
After reading it two more times, I think I got it.
ago by
> For example, the transgender flag emoji (a single grapheme cluster that happens to be defined by 5 Unicode code points) cannot easily be constructed at the REPL or in a source file without detailed knowledge of the JVM's history (

I'm not sure how easy they are, but there are a few methods for constructing this string from the REPL without a history lesson:

;; Copy and paste into your editor, assuming your editor is Unicode-aware
> "️‍⚧️"
"️‍⚧️"

;; explicitly create a string from code points
> (String. (int-array [127987 65039 8205 9895 65039]) 0 5)
"️‍⚧️"

;; create a string from code points in hexadecimal
> (String. (int-array [0x1F3F3 0xFE0F 0x200D 0x26A7 0xFE0F]) 0 5)
"️‍⚧️"
ago by
Argh. My comment did use the same transgender emoji flag, but it was changed when I submitted my comment.
ago by
One other note is that the way code points are grouped into grapheme clusters changes over time. New versions of Unicode can and do change how code points are grouped into grapheme clusters.

Also note that the JDK's support for finding grapheme cluster boundaries depends on the JDK version: https://stackoverflow.com/a/76109241
ago by
edited ago by
@Adrian yeah my proposal deliberately doesn't get into grapheme clusters (except to mention them in passing), since (as you say) that changes over time as new Unicode versions are released, and the JVM doesn't do a particularly good job of staying up to date (the `java.text.BreakIterator` class is the specific culprit here).  FWIW the only reliable way I've found to identify grapheme clusters as per recent Unicode versions is to add ICU4J to my classpath and use their `BreakIterator` implementation instead of the JVM's.

But just to reiterate - it would still be required with this proposal for a Clojure developer to manually list out each individual code point that makes up a multi-code-point grapheme cluster (such as the transgender flag, grapheme clusters that make use of combining diacritics, etc.).  This proposal doesn't change that.  What it does do is eliminate the (often forgotten) step of having to convert supplemental plane code points into a UTF-16 surrogate pair.  At this point that has become a pretty JVM specific thing that few people remember, as not many commonly used systems use UTF-16 any more - pretty much everything else has embraced the "UTF-8 everywhere" mantra.

And yeah, it took me a few edits before I figured out that ask.clojure was corrupting my supplemental plane code points in this post.  I guess it too uses UTF-16 (or perhaps UCS-2 <vomiting emoji>) under the hood.  <winking emoji>

1 Answer

+1 vote
ago by

Created a feature request in Jira: https://clojure.atlassian.net/browse/CLJ-2935

...