
+2 votes

Is it not possible to create a tagged literal for a clojure.core.Vec (the result of the vector-of function)? Creating a reader function for such a tag is trivial, and it works well enough with read-string and EDN readers. But the REPL always winds up holding a clojure.lang.PersistentVector.

This issue came up as I thrashed around looking for round-trip support for a hexstring of bytes that also supports idiomatic Clojure (clojure.lang.ISeq and immutability are high priorities).

  1. Despite the promise of (vector-of :byte ...), it's not possible to round trip the byte data cleanly with clojure.core.Vec.
  2. With a tag reader and print-dup writer I can get round-trip ability for JVM native byte arrays, but they don't support clojure.lang.ISeq and they're mutable.
  3. Regular vectors are heterogeneous and I can't safely appropriate print-dup for them to achieve my goals.

None of these options are attractive.
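
For concreteness, option 2 could look roughly like the sketch below. The tag `my.ns/bytes` and the helper names `hex->bytes`, `bytes->hex`, and `round-trip` are made up for illustration and do not come from any library. The result is still a mutable JVM byte array, so the drawbacks listed for option 2 remain.

```clojure
;; Hypothetical helpers; the names and the tag are assumptions for this sketch.
(defn hex->bytes
  "Parse an even-length hex string into a JVM byte array."
  [s]
  (byte-array (map #(unchecked-byte (Integer/parseInt % 16))
                   (re-seq #"[0-9a-fA-F]{2}" s))))

(defn bytes->hex
  "Render a byte array as a lowercase hex string."
  [bs]
  (apply str (map #(format "%02x" (bit-and % 0xff)) bs)))

;; print-dup dispatches on class, so register a method for byte[].
(defmethod print-dup (Class/forName "[B") [bs ^java.io.Writer w]
  (.write w (str "#my.ns/bytes \"" (bytes->hex bs) "\"")))

(defn round-trip
  "Print with *print-dup* bound to true, then read the text back
  through the tag reader."
  [bs]
  (binding [*print-dup* true
            *data-readers* (assoc *data-readers* 'my.ns/bytes hex->bytes)]
    (read-string (pr-str bs))))

;; The bytes survive the trip through text.
(seq (round-trip (hex->bytes "00ff7f80")))
```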

2 Answers

0 votes

What do you mean by "But the REPL always winds up holding a clojure.lang.PersistentVector."? The REPL doesn't "hold" anything.

It appears that anything implementing IPersistentVector is caught during the analyze phase when it is evaluated in the REPL. If you define a data reader that yields a Vec, the REPL evaluates that Vec into a persistent vector containing boxed versions of the formerly primitive Vec contents. This doesn't happen if, as in Andy's example (and my initial hack on this in the ClojureVerse [thread](https://clojureverse.org/t/tagged-literal-for-clojure-core-vec-not-possible-for-clojure/6452/6)), you bind data readers and use read-string. There you retain the Vec that was read, because it is never evaluated as a VectorExpr by the REPL. That doesn't solve the OP's original problem of having a data reader that produces a Vec that won't be eval'd into a vector. The closest I got was defining a custom type to bypass the vector check, so it falls through evaluation. There is a lingering problem with print-dup on that route, though.
Yes, my choice of words "winds up holding" was not very precise. In the REPL, reading a tagged literal directly (*not* with `read-string`) evaluates to a `clojure.lang.PersistentVector`. It's a bit baffling at first because the printed representation of a `clojure.lang.PersistentVector` is the same as that of a `clojure.core.Vec`.
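
The difference between reading and evaluating can be reproduced without a `data_readers.clj` file by passing the value to `eval` by hand, which is roughly what the REPL does with the form it has just read. This is a minimal sketch; the tag `my.ns/byte-vec` and the reader function are hypothetical stand-ins.

```clojure
;; Hypothetical reader fn: turns the vector form after the tag into a Vec.
(defn read-byte-vec [xs]
  (apply vector-of :byte xs))

;; Reading alone keeps the primitive vector.
(def read-only
  (binding [*data-readers* (assoc *data-readers* 'my.ns/byte-vec read-byte-vec)]
    (read-string "#my.ns/byte-vec [1 2 3]")))

(type read-only)          ; clojure.core.Vec

;; Evaluating that value sends it through the compiler's vector handling,
;; which rebuilds it as an ordinary persistent vector.
(def evaluated (eval read-only))

(type evaluated)          ; clojure.lang.PersistentVector
```

Both values print as `[1 2 3]`, which is why the change of type is easy to miss.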
0 votes

Below is a sample REPL session that shows one way to define a data reader that returns a clojure.core.Vec object containing bytes, and uses calls in the REPL to demonstrate that it is this type.

If you could share, in a follow-up comment, a similar REPL session of something you have tried that gives you clojure.lang.PersistentVector results when you hoped for clojure.core.Vec, that might help us determine what is going on.

$ clojure
Clojure 1.10.1
user=> (defn first-non-hex-char [string]
  (re-find #"[^0-9a-fA-F]" string))
user=> (defn hex-string-to-clojure-core-vec-of-byte [hex-string]
  (if-let [bad-hex-digit-string (first-non-hex-char hex-string)]
    (throw (ex-info (format "String that should consist of only hexadecimal digits contained: %s (UTF-16 code point %d)"
                            bad-hex-digit-string
                            (int (first bad-hex-digit-string)))
                    {:input-string hex-string
                     :bad-hex-digit-string bad-hex-digit-string}))
    (if (not (zero? (mod (count hex-string) 2)))
      (throw (ex-info (format "String contains odd number %d of hex digits.  Should be even number of digits."
                              (count hex-string))
                      {:input-string hex-string
                       :length (count hex-string)}))
      ;; There are likely more efficient ways to do this, if
      ;; performance is critical for you.  I have done no performance
      ;; benchmarking on this code.  This code is taking advantage of
      ;; JVM library calls I was already aware of.
      (let [hex-digit-pairs (re-seq #"[0-9a-fA-F]{2}" hex-string)
            byte-list (map (fn [two-hex-digit-str]
                             ;; Parse as a short in the range 0..255, then
                             ;; narrow to a signed byte so "ff" becomes -1.
                             (.byteValue
                              (java.lang.Short/valueOf two-hex-digit-str 16)))
                           hex-digit-pairs)]
        (apply vector-of :byte byte-list)))))
user=> (def bv1
  (binding [*data-readers*
            (assoc *data-readers*
                   'my.ns/byte-vec user/hex-string-to-clojure-core-vec-of-byte)]
    (read-string "#my.ns/byte-vec \"0123456789abcdef007f80ff\"")))
user=> bv1
[1 35 69 103 -119 -85 -51 -17 0 127 -128 -1]
user=> (type bv1)
clojure.core.Vec
user=> (type (bv1 0))
java.lang.Byte
bytevector.core> (deftype blee [x])
bytevector.core> #=(bytevector.core.blee. #=(vector-of :byte 1 0 1))
    Syntax error compiling fn* at (*cider-repl     workspacenew\bytevector:localhost:59588(clj)*:1:8145).
    Can't embed object in code, maybe print-dup not defined:    clojure.core$reify__8311@31c365b
bytevector.core> #=(bytevector.core.blee. 2)
#object[bytevector.core.blee 0x6e95ec6b "bytevector.core.blee@6e95ec6b"]

I think it has to do with how print-dup typically works, which is by relying on read-time eval (the #= syntax).
Yep, Tom and I noted that behavior last week as well.  I imagine that the special form has some subtle impact on the evaluation.

It's a nice work-around in some situations, but not so much for REPL work.
The root cause of the "Can't embed object in code, maybe print-dup not defined" error, where the object named in the message has "reify" and a bunch of hex digits in its printed representation, is the following combination of factors:

(1) Clojure primitive vectors are defined with deftype.

(2) Clojure's Compiler.java source file contains an emitValue Java method that has many cases for deciding how to embed a literal value in JVM byte code, and one of those cases covers all types defined via deftype. You can search that file for the first occurrence of "IType", a Java interface that all deftype-created types implement so that their instances can later be recognized as objects of a class created via deftype. When such an object is a literal inside Clojure code, emitValue attempts to create JVM byte code that can reconstruct the original value when that byte code is later executed. For deftype-created objects, it always tries to iterate through all fields of the object and emit code for each field and its value.

(3) Clojure primitive vectors have a field "am", short for "array manager", that is an object created with Clojure's "reify" macro. This object implements several Java methods on the 'leaves' of the tree used to represent Clojure primitive vectors. There is one such object for each primitive type, since the JVM byte code for dealing with arrays of each primitive type is different. Rich was probably going for run-time efficiency here: instead of detecting the primitive type at run time on every operation, each vector carries an object that already has the code for its primitive type baked in.

(4) emitValue, when called with an object that is the return of a "reify" call, tries to call `RT.printString` on it, which would work if a `print-dup` method were defined to handle such objects, but in general objects returned by "reify" can have arbitrary references to other JVM objects with internal state, or can have internal state themselves, so there is no good general way to create a `print-dup` definition that handles all possible objects created by calling "reify".
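
Factors (1), (3), and (4) can be checked from a REPL. The sketch below assumes that the field really is named `am` and is readable through interop, as in the Clojure versions discussed in this thread; the var names are only for illustration.

```clojure
(def v (vector-of :byte 1 0 1))

;; (1) Vec is a deftype, so its instances implement the IType marker interface.
(instance? clojure.lang.IType v)

;; (3) The "am" field holds the array manager created by reify.
(def am (.-am v))
(.getName (class am))   ; a generated class name containing "reify"

;; (4) print-dup has no method that applies to the array manager, so
;; printing it with *print-dup* bound to true is expected to throw.
;; The exception is captured here instead of being propagated.
(def am-print-dup-result
  (try
    (binding [*print-dup* true] (pr-str am))
    (catch Exception e e)))
```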

What could be done about this?

There are probably many alternatives I haven't thought of, but here are a few potential approaches, most of which would require changing Clojure's implementation in some way.

(approach #1a)
Change Clojure's primitive vector implementation so that all of its field values were immutable values with printable representations, i.e. no objects returned from 'reify', nor any function references.  Since primitive vectors are trees with O(log_32 n) depth, the representation created via emitValue would reflect that tree structure, but it seems like it could be made to work correctly.  This would likely lead to some lower run-time performance of operations on primitive vectors, since there would need to be a "case" or other conditional code to handle the different primitive types in leaf nodes.

(approach #1b)
Create a new implementation of Clojure primitive vectors that uses deftype, but has the changes suggested in #1a above.  No changes to Clojure's implementation would be required, since it would be a 3rd party implementation that can make its own implementation choices.

(approach #2)
Change the emitValue method in Compiler.java so that for deftype-created objects, it somehow checked whether there was a print-dup method for that object's class first, and used it if it was available, falling back to the current approach if there was not.  That would be somewhat tricky in this case, because Clojure primitive vectors implement the clojure.lang.IPersistentCollection interface, which already has a print-dup method that will not work for primitive vectors.  One possibility is not to simply call print-dup and see what happens, but to check whether the print-dup multimethod has an implementation for _exactly_ the class of the object one is trying to do emitValue on, e.g. clojure.core.Vec for primitive vectors.  Such an exact class check for multimethod implementations seems against the philosophy of multimethods in Clojure, and seems a bit hackish.
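
The exact-class check described above can be expressed in Clojure as a lookup in the multimethod's method table. This is only a sketch of the test itself, not of a change to Compiler.java; `exact-print-dup?`, `WithDup`, and `WithoutDup` are hypothetical names.

```clojure
(defn exact-print-dup?
  "True when print-dup has a method registered for exactly (class x),
  ignoring methods that would be found via interfaces or superclasses."
  [x]
  (contains? (methods print-dup) (class x)))

;; Two throwaway types, purely for illustration.
(deftype WithDup [x])
(deftype WithoutDup [x])

(defmethod print-dup WithDup [o ^java.io.Writer w]
  (.write w (str "#=(" (.getName WithDup) ". "))
  (print-dup (.-x o) w)
  (.write w ")"))

(exact-print-dup? (WithDup. 1))      ; true
(exact-print-dup? (WithoutDup. 1))   ; false
```

With a check of this shape, a `print-dup` method registered for `clojure.core.Vec` itself would be detected, while the method Vec would otherwise pick up through `clojure.lang.IPersistentCollection` would be ignored.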

Another cleaner variation on this idea would be to define a new "emittable" interface in Clojure's implementation, and if a deftype-created class implemented it, then emitValue would use the 'emit' method of that interface on objects that implemented it.

(approach #3)
Create a separate Clojure primitive vector implementation that does not use deftype, nor defrecord, and falls into the last "else" case of the long if-then-else daisy chain of Clojure's emitValue.  This seems difficult, or maybe impossible, to me, without changing the emitValue method, because it currently has a case for clojure.lang.IPersistentVector before the last "else", and it would be very strange to try creating a Clojure primitive vector implementation that did not implement that interface.

Of the ones I have thought about, approach #1b, or the last variant of approach #2, seem possibly workable.  #1b requires no changes to Clojure's implementation.  #2 definitely does.  Approach #3 probably isn't really a viable alternative, for reasons stated above.

More details can be found in this repo's README: https://github.com/jafingerhut/vec-data-reader
Andy, your analysis matches my experience, you express the problem well and you propose some reasonable solutions.  Thank you.

I seriously considered #1b (since the others are outside my weak Java skills) and even put together a trial implementation. One frustration I encountered is that _whatever_ bottom type is used to store the data (in my case, a persistent vector of bytes with homogeneity enforced by my implementation of IPersistentVector/assocN), I needed my type to implement ISeq. But as soon as I did, my ability to control printing was gone. I'm sorry I don't remember any more details of that experiment... maybe I can resurrect it.

(Sidebar: perhaps related to your observation in #2, but an interesting tangent: by what means does _print-method_ for clojure.core.Vec get determined? This line (https://github.com/clojure/clojure/blob/master/src/clj/clojure/gvec.clj#L455) seems like the crux, but I don't understand how ::Vec is in play. In fact, when I override (presumably) Vec's print method I don't use the global hierarchy; I just defmethod for the class. My sneaking suspicion is that, like my attempts with my own type, it is never being called.)

My ultimate goal is to support a hexstring literal (REPL-compatible) reader and printer backed by some type that supports idiomatic Clojure operations on vectors.  A deftype backed by clojure.core.Vec seemed so tantalizingly close ...

Thank you again for your deep analysis of this.

[edited after a reading of your repo clarified exactly when and why I lose control of my deftype's printing]
I do not understand why the print-method in gvec.clj has a dispatch value of `::Vec` -- I would have expected it to be the class `clojure.core.Vec`, but I may not have all of the context necessary to understand why it is `::Vec`.

You can use `(methods print-method)` to see all dispatch values that have a `defmethod` defined for them. They are the keys of the returned map, most of which are classes and interfaces, with the two in gvec.clj being keywords. You can use `(get-method print-method <dispatch-value>)` to see which method would be called for a particular value; the dispatch value for a value `x` is normally `(class x)`. If you want to know which registered dispatch value that method corresponds to, you can either find it manually in the output of `(methods print-method)`, or you can write some code to find it for you.
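
A small sketch of that lookup code follows. The helper names `print-dispatch-val` and `registered-dispatch-val` are made up here; the first mirrors the dispatch function that clojure.core gives `print-method`, which uses the `:type` metadata when it is a keyword and the class otherwise.

```clojure
(def v (vector-of :byte 1 2 3))

;; Dispatch values of print-method that are keywords rather than classes.
(filter keyword? (keys (methods print-method)))

(defn print-dispatch-val
  "The dispatch value print-method computes for x."
  [x]
  (let [t (:type (meta x))]
    (if (keyword? t) t (class x))))

(defn registered-dispatch-val
  "The key in (methods multifn) whose method get-method selects for the
  dispatch value dv, or nil if none is found."
  [multifn dv]
  (let [m (get-method multifn dv)]
    (some (fn [[k f]] (when (identical? f m) k))
          (methods multifn))))

;; Which defmethod actually prints a primitive vector?
(registered-dispatch-val print-method (print-dispatch-val v))
```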

I might have some time to look at a trial implementation of #1b, if you still have it around, in case I notice anything amiss, but no promises.