<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Clojure Q&amp;A - Recent questions tagged performance</title>
<link>https://ask.clojure.org/index.php/tag/performance</link>
<description></description>
<item>
<title>Optimize `eduction`</title>
<link>https://ask.clojure.org/index.php/15027/optimize-eduction</link>
<description>&lt;h2&gt;Problem statement&lt;/h2&gt;
&lt;p&gt;Eduction is a useful part of the transducer ecosystem, one aim of which is improved performance. However, the function &lt;code&gt;eduction&lt;/code&gt; only has a single varargs arity, which adds multiple function calls and &lt;code&gt;seq&lt;/code&gt; traversal allocations.&lt;/p&gt;
&lt;p&gt;As seen in &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/clj-kondo/clj-kondo/pull/2801&quot;&gt;this PR&lt;/a&gt; to clj-kondo, manually passing in a pre-composed xform can both improve overall performance and lower allocations.&lt;/p&gt;
&lt;h2&gt;Proposal&lt;/h2&gt;
&lt;p&gt;Adding arities to &lt;code&gt;eduction&lt;/code&gt; can sidestep those allocations and seq traversals, improving baseline performance of all &lt;code&gt;eduction&lt;/code&gt; invocations.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defn eduction2
  {:arglists '([xform* coll])
   :added &quot;1.7&quot;}
  ([coll] (-&amp;gt;Eduction identity coll))
  ([f1 coll] (-&amp;gt;Eduction f1 coll))
  ([f1 f2 coll] (-&amp;gt;Eduction (comp f1 f2) coll))
  ([f1 f2 f3 coll] (-&amp;gt;Eduction (comp f1 f2 f3) coll))
  ([f1 f2 f3 f4 &amp;amp; args]
   (-&amp;gt;Eduction (apply comp f1 f2 f3 f4 (butlast args)) (last args))))
&lt;/code&gt;&lt;/pre&gt;
</description>
<category>Transducers</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/15027/optimize-eduction</guid>
<pubDate>Fri, 03 Apr 2026 14:25:17 +0000</pubDate>
</item>
<item>
<title>Replace core.async’s internal LinkedList queue with ArrayDeque?</title>
<link>https://ask.clojure.org/index.php/14696/replace-core-asyncs-internal-linkedlist-queue-arraydeque</link>
<description>&lt;p&gt;Hi! While reading the channel implementation I noticed the internal queue is a &lt;code&gt;java.util.LinkedList&lt;/code&gt;:&lt;br&gt;
&lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/clojure/core.async/blob/master/src/main/clojure/clojure/core/async/impl/channels.clj&quot;&gt;https://github.com/clojure/core.async/blob/master/src/main/clojure/clojure/core/async/impl/channels.clj&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The common JVM guidance these days is to prefer &lt;code&gt;java.util.ArrayDeque&lt;/code&gt; for FIFO/LIFO queues due to better locality and lower GC overhead. For example, the JDK docs state: “This class is likely to be faster than Stack when used as a stack, and faster than LinkedList when used as a queue.”&lt;br&gt;
&lt;a rel=&quot;nofollow&quot; href=&quot;https://docs.oracle.com/javase/8/docs/api/java/util/ArrayDeque.html&quot;&gt;https://docs.oracle.com/javase/8/docs/api/java/util/ArrayDeque.html&lt;/a&gt;&lt;br&gt;
Related SO discussion: &lt;a rel=&quot;nofollow&quot; href=&quot;https://stackoverflow.com/questions/6163166/why-is-arraydeque-better-than-linkedlist&quot;&gt;https://stackoverflow.com/questions/6163166/why-is-arraydeque-better-than-linkedlist&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A few questions for the maintainers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Was &lt;code&gt;LinkedList&lt;/code&gt; originally chosen for a specific reason (e.g., very old JDK compatibility)?&lt;/li&gt;
&lt;li&gt;Are there behaviors in channels that specifically rely on &lt;code&gt;LinkedList&lt;/code&gt; (e.g., permitting null elements, though I believe channels disallow nil anyway, or particular iterator characteristics), or would &lt;code&gt;ArrayDeque&lt;/code&gt; be a drop-in replacement for the add/remove-from-ends operations used?&lt;/li&gt;
&lt;li&gt;If we provide a small PR and benchmark showing an improvement (lower allocations / better throughput under contention), would such a change be considered?&lt;/li&gt;
&lt;li&gt;Since &lt;code&gt;ArrayDeque&lt;/code&gt; is available on Java 8+, and current Clojure/core.async baselines target that or newer, is there any remaining compatibility concern?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Context: there was an older conversation in Clojurians (#clojure) that suggested ArrayDeque could reduce GC pressure under load and that the original choice may have been influenced by older JDKs:&lt;br&gt;
&lt;a rel=&quot;nofollow&quot; href=&quot;https://clojurians.slack.com/archives/C03S1KBA2/p1526551888000376&quot;&gt;https://clojurians.slack.com/archives/C03S1KBA2/p1526551888000376&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If this sounds reasonable, I’m happy to run the core.async test suite, add JMH-style benchmarks focused on channel put/take hot paths, and submit a PR.&lt;/p&gt;
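&lt;p&gt;To make the claim concrete, here is a rough interop sketch (not the actual channel code): both classes implement &lt;code&gt;java.util.Deque&lt;/code&gt;, so the FIFO operations a channel queue needs are identical on either:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;;; Rough sketch: LinkedList and ArrayDeque share the java.util.Deque
;; interface, so the add/poll calls used for FIFO access are identical.
(let [q (java.util.ArrayDeque.)]
  (.add q :taker-1)  ; enqueue at the tail, as with LinkedList
  (.add q :taker-2)
  (.poll q))         ; dequeue from the head
;; =&amp;gt; :taker-1
&lt;/code&gt;&lt;/pre&gt;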
&lt;p&gt;Thanks!&lt;/p&gt;
</description>
<category>core.async</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/14696/replace-core-asyncs-internal-linkedlist-queue-arraydeque</guid>
<pubDate>Tue, 02 Sep 2025 08:02:18 +0000</pubDate>
</item>
<item>
<title>select-keys on nil map</title>
<link>https://ask.clojure.org/index.php/14654/select-keys-on-nil-map</link>
<description>&lt;p&gt;I know that &lt;a rel=&quot;nofollow&quot; href=&quot;https://ask.clojure.org/index.php/1913/use-transients-with-select-keys-if-possible&quot;&gt;https://ask.clojure.org/index.php/1913/use-transients-with-select-keys-if-possible&lt;/a&gt; exists to optimize &lt;code&gt;select-keys&lt;/code&gt; for many keys, but I noticed that it does a lot of work even if the original map is &lt;code&gt;nil&lt;/code&gt;. Would there be interest in a patch that returns an empty map when given &lt;code&gt;nil&lt;/code&gt;? Something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defn select-keys*
  &quot;Returns a map containing only those entries in map whose key is in keys&quot;
  {:added &quot;1.0&quot;
   :static true}
  [map keyseq]
  (if map
    (loop [ret {} keys (seq keyseq)]
      (if keys
        (let [entry (. clojure.lang.RT (find map (first keys)))]
          (recur
           (if entry
             (conj ret entry)
             ret)
           (next keys)))
        (with-meta ret (meta map))))
    {}))
&lt;/code&gt;&lt;/pre&gt;
</description>
<category>Collections</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/14654/select-keys-on-nil-map</guid>
<pubDate>Fri, 01 Aug 2025 14:42:56 +0000</pubDate>
</item>
<item>
<title>Skip stacktrace creation in ExceptionInfo</title>
<link>https://ask.clojure.org/index.php/14634/skip-stacktrace-creation-in-exceptioninfo</link>
<description>&lt;p&gt;Many libraries and applications use exceptions as control flow or data collection, which allows for handling complex situations with more consistent and legible code. Some libraries use custom throwables (&lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/IGJoshua/farolero&quot;&gt;IGJoshua/farolero&lt;/a&gt;, &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/NoahTheDuke/lazytest&quot;&gt;NoahTheDuke/lazytest&lt;/a&gt;) and some use ExceptionInfos (&lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/scgilardi/slingshot/&quot;&gt;scgilardi/slingshot&lt;/a&gt;, &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/fmnoise/flow&quot;&gt;fmnoise/flow&lt;/a&gt;, &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/pangloss/pure-conditioning&quot;&gt;pangloss/pure-conditioning&lt;/a&gt;, &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/exoscale/ex&quot;&gt;exoscale/ex&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;One of the reasons that libraries reach for custom throwables is because they want to skip creation of a stack trace (which isn't used and would be thrown away immediately). Stack traces in Java are already fairly expensive to create, and then the filtering work done in &lt;code&gt;ExceptionInfo&lt;/code&gt; greatly increases that expense, making them much slower than needed. (There's an Ask about this but I can't find it.) This performance cost pressures developers to use custom throwables if their code will ever be used in a &quot;hot path&quot;, which harms portability and ease of development. (I want to write Clojure, not Java, and I want it to be usable in Clojurescript and Babashka and whatever other dialects might arise.)&lt;/p&gt;
&lt;p&gt;I know that there's a rejected Jira ticket (&lt;a rel=&quot;nofollow&quot; href=&quot;https://clojure.atlassian.net/browse/CLJ-2423&quot;&gt;CLJ-2423&lt;/a&gt;) for supporting the &quot;enableSuppression&quot; flag, but in light of these use-cases, I'd like to bring it up again.&lt;/p&gt;
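&lt;p&gt;For reference, the workaround libraries reach for today can be sketched like this (a rough sketch; it assumes &lt;code&gt;proxy&lt;/code&gt; can reach the protected 4-arg &lt;code&gt;Throwable&lt;/code&gt; constructor, whose last argument is &lt;code&gt;writableStackTrace&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;;; Sketch of the custom-throwable workaround: passing
;; writableStackTrace=false skips fillInStackTrace entirely.
(defn no-trace-ex [msg]
  (proxy [RuntimeException] [msg nil false false]))

;; (ex-info &quot;boom&quot; {})  ; fills in a full stack trace
;; (no-trace-ex &quot;boom&quot;) ; allocates only the exception object
&lt;/code&gt;&lt;/pre&gt;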
</description>
<category>Clojure</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/14634/skip-stacktrace-creation-in-exceptioninfo</guid>
<pubDate>Thu, 17 Jul 2025 14:46:52 +0000</pubDate>
</item>
<item>
<title>Could clojure.walk/walk be less lazy?</title>
<link>https://ask.clojure.org/index.php/14453/could-clojure-walk-walk-be-less-lazy</link>
<description>&lt;p&gt;I noticed that clojure.walk/walk is perhaps lazier internally than it necessarily needs to be. It uses several lazy (map) calls whose output then gets fully consumed.&lt;/p&gt;
&lt;p&gt;Would changing these calls to either (mapv) or the transducer arity of (map) be considered, for these &lt;a rel=&quot;nofollow&quot; href=&quot;https://clojure-goes-fast.com/blog/clojures-deadly-sin/#the-bad-parts-of-laziness&quot;&gt;performance considerations&lt;/a&gt;?&lt;/p&gt;
&lt;p&gt;Here is a quick diff of a rough suggestion of changes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;diff --git a/src/clj/clojure/walk.clj b/src/clj/clojure/walk.clj
index 0f027e7a..96b089f6 100644
--- a/src/clj/clojure/walk.clj
+++ b/src/clj/clojure/walk.clj
@@ -41,13 +41,13 @@ the sorting function.&quot;}
   {:added &quot;1.1&quot;}
   [inner outer form]
   (cond
-   (list? form) (outer (with-meta (apply list (map inner form)) (meta form)))
+   (list? form) (outer (with-meta (apply list (mapv inner form)) (meta form)))
    (instance? clojure.lang.IMapEntry form)
    (outer (clojure.lang.MapEntry/create (inner (key form)) (inner (val form))))
-   (seq? form) (outer (with-meta (doall (map inner form)) (meta form)))
+   (seq? form) (outer (with-meta (seq (mapv inner form)) (meta form)))
    (instance? clojure.lang.IRecord form)
      (outer (reduce (fn [r x] (conj r (inner x))) form form))
-   (coll? form) (outer (into (empty form) (map inner form)))
+   (coll? form) (outer (into (empty form) (map inner) form))
    :else (outer form)))

 (defn postwalk
@@ -97,7 +97,7 @@ the sorting function.&quot;}
   [m]
   (let [f (fn [[k v]] (if (string? k) [(keyword k) v] [k v]))]
     ;; only apply to maps
-    (postwalk (fn [x] (if (map? x) (into {} (map f x)) x)) m)))
+    (postwalk (fn [x] (if (map? x) (into {} (map f) x) x)) m)))

 (defn stringify-keys
   &quot;Recursively transforms all map keys from keywords to strings.&quot;
@@ -105,7 +105,7 @@ the sorting function.&quot;}
   [m]
   (let [f (fn [[k v]] (if (keyword? k) [(name k) v] [k v]))]
     ;; only apply to maps
-    (postwalk (fn [x] (if (map? x) (into {} (map f x)) x)) m)))
+    (postwalk (fn [x] (if (map? x) (into {} (map f) x) x)) m)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Edit: this is somewhat related to some of the changes proposed in &lt;code&gt;0003-CLJ-1239-protocol-dispatch-for-clojure.walk.patch&lt;/code&gt; on &lt;a rel=&quot;nofollow&quot; href=&quot;https://clojure.atlassian.net/browse/CLJ-1239&quot;&gt;CLJ-1239&lt;/a&gt;&lt;/p&gt;
</description>
<category>Clojure</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/14453/could-clojure-walk-walk-be-less-lazy</guid>
<pubDate>Fri, 07 Mar 2025 19:11:58 +0000</pubDate>
</item>
<item>
<title>A function with metadata is slower than it could be</title>
<link>https://ask.clojure.org/index.php/14438/a-function-with-metadata-is-slower-than-it-could-be</link>
<description>&lt;p&gt;I found that &lt;code&gt;with-meta&lt;/code&gt; applied to a function returns an instance of &lt;code&gt;RestFn&lt;/code&gt;.&lt;br&gt;
However, I know that invoking a function through &lt;code&gt;applyTo&lt;/code&gt; is much slower than through &lt;code&gt;invoke&lt;/code&gt;.&lt;br&gt;
The benchmarks below confirm this.&lt;/p&gt;
&lt;p&gt;Is it a problem that a function with metadata is slower than it could be?&lt;/p&gt;
&lt;p&gt;If so, I could propose a solution: delegate all methods like &lt;code&gt;invoke&lt;/code&gt; to the original function, not just &lt;code&gt;applyTo&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/AFunction.java#L31-L33&quot;&gt;https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/AFunction.java#L31-L33&lt;/a&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(require '[criterium.core :as c])

(defn f [x]
  x)

(c/quick-bench
    (f 1))
;; Execution time mean : 1,642094 ns

(defn wrapper [f]
  (fn [y]
    (f y)))

(let [f' (wrapper f)]
  (c/quick-bench (f' 1)))
;; Execution time mean : 5,016630 ns

(let [f'' (with-meta f {:foo :bar})]
  (c/quick-bench (f'' 1)))
;; Execution time mean : 31,332965 ns

(defn wrapper' [f]
  (fn [&amp;amp; args]
    (apply f args)))

(let [f''' (wrapper' f)]
  (c/quick-bench (f''' 1)))
;; Execution time mean : 41,983970 ns
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Additional links:&lt;br&gt;
 - &lt;a rel=&quot;nofollow&quot; href=&quot;https://clojurians.slack.com/archives/C03S1KBA2/p1740750168290999&quot;&gt;https://clojurians.slack.com/archives/C03S1KBA2/p1740750168290999&lt;/a&gt;&lt;br&gt;
 - &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/camsaul/methodical/pull/150&quot;&gt;https://github.com/camsaul/methodical/pull/150&lt;/a&gt;&lt;/p&gt;
</description>
<category>Metadata</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/14438/a-function-with-metadata-is-slower-than-it-could-be</guid>
<pubDate>Mon, 03 Mar 2025 07:20:38 +0000</pubDate>
</item>
<item>
<title>`doseq` and `for` expand their bodies twice</title>
<link>https://ask.clojure.org/index.php/14433/doseq-and-for-expands-body-twice</link>
<description>&lt;p&gt;The bodies of &lt;code&gt;doseq&lt;/code&gt; and &lt;code&gt;for&lt;/code&gt; seem to be duplicated in their expansions.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ clojure
user=&amp;gt; (defmacro a [] (prn :expand))
#'user/a
user=&amp;gt; (doseq [_ nil] (a))
:expand
:expand
nil
user=&amp;gt; (for [_ nil] (a))
:expand
:expand
()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I found the same problem in ClojureScript. I could not find an existing discussion about this so I'm not sure if this is by design, apologies if so. I'm aware that these macros have a large code footprint and that &lt;code&gt;doseq&lt;/code&gt; cannot use closures (the usual way to prevent exponential expansion), but I missed this detail.&lt;/p&gt;
&lt;p&gt;This leads to exponential code blowup if these forms are nested. Artificial demonstration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ clojure
Clojure 1.12.0
user=&amp;gt; (def counter (atom -1))
#'user/counter
user=&amp;gt; (defmacro a [] (prn :expand (swap! counter inc)))
#'user/a
user=&amp;gt; #(doseq [_ 1] (doseq [_ 2] (doseq [_ 3] (doseq [_ 4] (a)))))
:expand 0
:expand 1
:expand 2
:expand 3
:expand 4
:expand 5
:expand 6
:expand 7
:expand 8
:expand 9
:expand 10
:expand 11
:expand 12
:expand 13
:expand 14
:expand 15
#object[user$eval926$fn__927 0x76563d26 &quot;user$eval926$fn__927@76563d26&quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This might be especially relevant to core.async, where placing &lt;code&gt;go&lt;/code&gt; under &lt;code&gt;doseq&lt;/code&gt; is idiomatic, since &lt;code&gt;go&lt;/code&gt; is expensive in both expansion time and code size.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;user=&amp;gt; (time (eval '(doseq [_ nil] (a/go (doseq [_ nil] (a/go (doseq [_ nil] (a/go))))))))
&quot;Elapsed time: 837.230917 msecs&quot;
nil
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It also leads to duplicated reflection warnings from the fully expanded forms, which is how I found this problem.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;user=&amp;gt; (doseq [_ nil] (Thread/sleep (identity 1)))
Reflection warning, NO_SOURCE_PATH:1:16 - call to static method sleep on java.lang.Thread can't be resolved (argument types: unknown).
Reflection warning, NO_SOURCE_PATH:1:16 - call to static method sleep on java.lang.Thread can't be resolved (argument types: unknown).
nil
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I believe the exponential expansion for &lt;code&gt;doseq&lt;/code&gt; was introduced in Clojure 1.1.0 with support for chunked seqs with &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/clojure/clojure/commit/1abb7a56de1678321054af7fce183184f06974dd&quot;&gt;this commit&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ clojure -Sdeps '{:deps {org.clojure/clojure {:mvn/version &quot;1.0.0&quot;}}}' 
Downloading: org/clojure/clojure/1.0.0/clojure-1.0.0.pom from central
Downloading: org/clojure/clojure/1.0.0/clojure-1.0.0.jar from central
Clojure 1.0.0-
user=&amp;gt; (defmacro a [] (prn :expand))
#'user/a
user=&amp;gt; (doseq [_ nil] (a))
:expand
nil
user=&amp;gt; ^D
$ clojure -Sdeps '{:deps {org.clojure/clojure {:mvn/version &quot;1.1.0&quot;}}}'
Downloading: org/clojure/clojure/1.1.0/clojure-1.1.0.pom from central
Downloading: org/clojure/clojure/1.1.0/clojure-1.1.0.jar from central
Clojure 1.1.0
user=&amp;gt; (defmacro a [] (prn :expand))
#'user/a
user=&amp;gt; (doseq [_ nil] (a))
:expand
:expand
nil
user=&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At first I wasn't sure if this was hopeless, but I think &lt;code&gt;doseq&lt;/code&gt; is fixable by fusing the chunked and non-chunked cases like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defmacro doseq
  &quot;Repeatedly executes body (presumably for side-effects) with
  bindings and filtering as provided by \&quot;for\&quot;.  Does not retain
  the head of the sequence. Returns nil.
  
  Unlike clojure.core/doseq, does not cause exponential macro expansion
  of expressions in bindings or body.&quot;
  [seq-exprs &amp;amp; body]
  (#'clojure.core/assert-args
     (vector? seq-exprs) &quot;a vector for its binding&quot;
     (even? (count seq-exprs)) &quot;an even number of forms in binding vector&quot;)
  (let [step (fn step [recform exprs]
               (if-not exprs
                 [true `(do ~@body)]
                 (let [k (first exprs)
                       v (second exprs)]
                   (if (keyword? k)
                     (let [steppair (step recform (nnext exprs))
                           needrec (steppair 0)
                           subform (steppair 1)]
                       (cond
                         (= k :let) [needrec `(let ~v ~subform)]
                         (= k :while) [false `(when ~v
                                                ~subform
                                                ~@(when needrec [recform]))]
                         (= k :when) [false `(if ~v
                                               (do
                                                 ~subform
                                                 ~@(when needrec [recform]))
                                               ~recform)]))
                     (let [seq- (gensym &quot;seq_&quot;)
                           chunk- (with-meta (gensym &quot;chunk_&quot;)
                                             {:tag 'clojure.lang.IChunk})
                           count- (gensym &quot;count_&quot;)
                           i- (gensym &quot;i_&quot;)
                           in-chunk- (gensym &quot;in-chunk_&quot;)
                           recform `(if ~in-chunk-
                                      (recur ~seq- ~chunk- ~count- (unchecked-inc ~i-))
                                      (recur (next ~seq-) nil 0 0))
                           steppair (step recform (nnext exprs))
                           needrec (steppair 0)
                           subform (steppair 1)]
                       [true
                        `(loop [~seq- (seq ~v), ~chunk- nil,
                                ~count- 0, ~i- 0]
                           (let [~in-chunk- (&amp;lt; ~i- ~count-)
                                 ~seq- (if ~in-chunk- ~seq- (seq ~seq-))]
                             (when (if ~in-chunk- true ~seq-)
                               (let [chunked?# (if ~in-chunk- false (chunked-seq? ~seq-))
                                     ~k (if ~in-chunk-
                                          (.nth ~chunk- ~i-)
                                          (if chunked?# nil (first ~seq-)))]
                                 (if (if ~in-chunk- false chunked?#)
                                   (let [c# (chunk-first ~seq-)]
                                     (recur (chunk-rest ~seq-) c#
                                            (int (count c#)) (int 0)))
                                   (do ~subform
                                       ~@(when needrec [recform])))))))])))))]
    (nth (step nil (seq seq-exprs)) 1)))
&lt;/code&gt;&lt;/pre&gt;
</description>
<category>Macros</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/14433/doseq-and-for-expands-body-twice</guid>
<pubDate>Wed, 26 Feb 2025 23:15:06 +0000</pubDate>
</item>
<item>
<title>Optimize `select-keys`</title>
<link>https://ask.clojure.org/index.php/13964/optimize-select-keys</link>
<description>&lt;p&gt;Some preliminary testing shows that &lt;code&gt;select-keys&lt;/code&gt; could be almost twice as fast if it were rewritten to use a transient map rather than the persistent map it builds today.&lt;/p&gt;
&lt;p&gt;In order to do so, &lt;code&gt;select-keys&lt;/code&gt; would need to move to after where &lt;code&gt;transient&lt;/code&gt; and its friends are defined. &lt;/p&gt;
&lt;p&gt;If there is any interest in pursuing this, I'd be happy to provide a problem statement, more data on the performance, a couple of possible solutions, and perhaps ultimately a patch.&lt;/p&gt;
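&lt;p&gt;For illustration, the transient-based rewrite might look roughly like this (a sketch only, mirroring the loop in the current implementation; the name is hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;;; Sketch of select-keys over a transient map; conj! accepts a map
;; entry just as conj does, and persistent! freezes the result once.
(defn select-keys-t
  [map keyseq]
  (loop [ret (transient {}) keys (seq keyseq)]
    (if keys
      (let [entry (find map (first keys))]
        (recur (if entry (conj! ret entry) ret)
               (next keys)))
      (with-meta (persistent! ret) (meta map)))))
&lt;/code&gt;&lt;/pre&gt;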
</description>
<category>Collections</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/13964/optimize-select-keys</guid>
<pubDate>Tue, 11 Jun 2024 07:18:28 +0000</pubDate>
</item>
<item>
<title>Can direct-linking be enabled for my own code and not for libraries?</title>
<link>https://ask.clojure.org/index.php/13757/can-direct-linking-enabled-for-own-code-and-not-for-libraries</link>
<description>&lt;p&gt;Blanket enabling of direct-linking can break libraries, and I can't control which vars in them are affected, but I'd still like to improve performance in at least parts of my own code.&lt;/p&gt;
&lt;p&gt;Would the following work?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(binding [clojure.core/*compiler-options* 
  {:direct-linking true}]
  ,,,
  )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's the docs: &lt;a rel=&quot;nofollow&quot; href=&quot;https://clojuredocs.org/clojure.core/*compiler-options*&quot;&gt;https://clojuredocs.org/clojure.core/*compiler-options*&lt;/a&gt;&lt;/p&gt;
</description>
<category>Compiler</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/13757/can-direct-linking-enabled-for-own-code-and-not-for-libraries</guid>
<pubDate>Tue, 27 Feb 2024 12:47:39 +0000</pubDate>
</item>
<item>
<title>Do you use arity 3 or higher comparisons such as = &lt; &lt;= &gt; &gt;= == in performance sensitive code?</title>
<link>https://ask.clojure.org/index.php/13627/arity-higher-comparisons-such-performance-sensitive-code</link>
<description>&lt;p&gt;For comparisons (&lt;code&gt;= &amp;lt; &amp;lt;= &amp;gt; &amp;gt;= ==&lt;/code&gt;), arity 3 and higher are much slower than doing e.g. &lt;code&gt;(and (&amp;lt; a b) (&amp;lt; b c))&lt;/code&gt;. In my research the 3-arity is rather uncommon but still used in e.g. Manifold's stream.clj [0] or core.rrb-vector's rrbt.clj [1]. Arity 3 is idiomatic in places where you compare a &quot;variable&quot; against a lower and an upper bound. However, I don't think arity 4 and higher is used much in practical code; at least I haven't found any interesting instances of such use.&lt;/p&gt;
&lt;p&gt;An example implementation (based on the one currently in core) could be:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defn &amp;lt;'
  &quot;Returns non-nil if nums are in monotonically increasing order,
  otherwise false.&quot;
  {:inline         (fn [x y] `(. clojure.lang.Numbers (lt ~x ~y)))
   :inline-arities #{2}
   :added          &quot;1.0&quot;}
  ([x] true)
  ([x y] (. clojure.lang.Numbers (lt x y)))
  ([x y z] (and (&amp;lt;' x y) (&amp;lt;' y z)))
  ([x y z &amp;amp; more]
   (if (&amp;lt;' x y z)
     (if (next more)
       (recur y z (first more) (next more))
       (&amp;lt;' z (first more)))
     false)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;[0] &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/clj-commons/manifold/blob/c3fc69066f3abba0b5ab0f4c2b1c4338bcc61d19/src/manifold/stream.clj#L978&quot;&gt;https://github.com/clj-commons/manifold/blob/c3fc69066f3abba0b5ab0f4c2b1c4338bcc61d19/src/manifold/stream.clj#L978&lt;/a&gt;&lt;br&gt;
[1] &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/clojure/core.rrb-vector/blob/master/src/main/clojure/clojure/core/rrb_vector/rrbt.clj&quot;&gt;https://github.com/clojure/core.rrb-vector/blob/master/src/main/clojure/clojure/core/rrb_vector/rrbt.clj&lt;/a&gt;&lt;/p&gt;
</description>
<category>Syntax and reader</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/13627/arity-higher-comparisons-such-performance-sensitive-code</guid>
<pubDate>Tue, 16 Jan 2024 22:14:09 +0000</pubDate>
</item>
<item>
<title>Performance improvements to creation of small vectors with TransientVector</title>
<link>https://ask.clojure.org/index.php/13009/performance-improvements-creation-vectors-transientvector</link>
<description>&lt;p&gt;Sorry for doing it backward and submitting the ticket/patch first. The ticket is here: &lt;a rel=&quot;nofollow&quot; href=&quot;https://clojure.atlassian.net/jira/software/c/projects/CLJ/issues/CLJ-2786&quot;&gt;https://clojure.atlassian.net/jira/software/c/projects/CLJ/issues/CLJ-2786&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The ticket contains all the important details, benchmarking results, and code. Here I would like to hear whether people generally use transients, how they decide if transients will lead to performance improvement rather than degradation, and any other possible doubts. I always had mixed feelings when doing something through transients, probably due to a section in Joy of Clojure claiming that transients are inefficient for small inputs. The benchmark done with the current version of Clojure overall confirms that.&lt;/p&gt;
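&lt;p&gt;For context, the usual transient idiom batches mutation and freezes the result once at the end (minimal sketch; whether it wins for small collections is exactly the question above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;;; Build mutably via conj!, then freeze once with persistent!.
(defn build-vec [n]
  (persistent!
   (reduce conj! (transient []) (range n))))

(build-vec 5) ;; =&amp;gt; [0 1 2 3 4]
&lt;/code&gt;&lt;/pre&gt;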
</description>
<category>Sequences</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/13009/performance-improvements-creation-vectors-transientvector</guid>
<pubDate>Tue, 13 Jun 2023 12:52:56 +0000</pubDate>
</item>
<item>
<title>clojure.walk/keywordize-keys and stringify-keys unnecessarily allocate</title>
<link>https://ask.clojure.org/index.php/12860/clojure-walk-keywordize-stringify-unnecessarily-allocate</link>
<description>&lt;p&gt;&lt;code&gt;clojure.walk/keywordize-keys&lt;/code&gt; and &lt;code&gt;clojure.walk/stringify-keys&lt;/code&gt; allocate &lt;code&gt;[k v]&lt;/code&gt; vectors that are converted to map entries.&lt;/p&gt;
&lt;p&gt;Benchmarks show that it is more efficient to operate on map entries directly.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(require '[clojure.walk :as walk]
         '[criterium.core :as c])

(defn keywordize-keys
  [m]
  (walk/postwalk (fn [kv]
                   (if (and (map-entry? kv)
                            (string? (key kv)))
                     (clojure.lang.MapEntry. (keyword (key kv)) (val kv))
                     kv))
                 m))

(defn stringify-keys
  [m]
  (walk/postwalk (fn [kv]
                   (if (and (map-entry? kv)
                            (keyword? (key kv)))
                     (clojure.lang.MapEntry. (name (key kv)) (val kv))
                     kv))
                 m))

(let [sz 500000
      m (into {} (map (fn [i] [(str (random-uuid)) i]))
              (range sz))]
  ;; intern keys (doesn't seem to impact benchmark)
  (run! #(keyword (key %)) m)
  (doseq [f '[walk/keywordize-keys keywordize-keys]
          :let [f' (resolve f)]]
    (prn f)
    (c/quick-bench (f' m))
    nil))

;; walk/keywordize-keys
;; Evaluation count : 6 in 6 samples of 1 calls.
;;              Execution time mean : 559.425609 ms
;;     Execution time std-deviation : 17.477808 ms
;;    Execution time lower quantile : 535.799373 ms ( 2.5%)
;;    Execution time upper quantile : 572.055045 ms (97.5%)
;;                    Overhead used : 2.097250 ns

;; keywordize-keys
;; Evaluation count : 6 in 6 samples of 1 calls.
;;              Execution time mean : 413.512748 ms
;;     Execution time std-deviation : 9.081118 ms
;;    Execution time lower quantile : 402.917998 ms ( 2.5%)
;;    Execution time upper quantile : 422.893519 ms (97.5%)
;;                    Overhead used : 2.097250 ns

(let [sz 500000
      m (into {} (map (fn [i] [(keyword (str (random-uuid))) i]))
              (range sz))]
  (doseq [f '[walk/stringify-keys stringify-keys]
          :let [f' (resolve f)]]
    (prn f)
    (c/quick-bench (f' m))
    nil))

;; walk/stringify-keys
;; Evaluation count : 6 in 6 samples of 1 calls.
;;              Execution time mean : 473.410415 ms
;;     Execution time std-deviation : 25.763722 ms
;;    Execution time lower quantile : 451.515206 ms ( 2.5%)
;;    Execution time upper quantile : 515.015561 ms (97.5%)
;;                    Overhead used : 2.097250 ns
;; 
;; Found 1 outliers in 6 samples (16.6667 %)
;; 	low-severe	 1 (16.6667 %)
;;  Variance from outliers : 14.2242 % Variance is moderately inflated by outliers

;; stringify-keys
;; Evaluation count : 6 in 6 samples of 1 calls.
;;              Execution time mean : 322.547283 ms
;;     Execution time std-deviation : 17.561204 ms
;;    Execution time lower quantile : 303.155082 ms ( 2.5%)
;;    Execution time upper quantile : 341.169831 ms (97.5%)
;;                    Overhead used : 2.097250 ns
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The relative performance improvement is similar even in very small maps like &lt;code&gt;{&quot;a&quot; {&quot;b&quot; {&quot;c&quot; 1, 9 &quot;d&quot;}, &quot;z&quot; 5}}&lt;/code&gt;.&lt;/p&gt;
</description>
<category>Clojure</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/12860/clojure-walk-keywordize-stringify-unnecessarily-allocate</guid>
<pubDate>Sun, 16 Apr 2023 19:54:16 +0000</pubDate>
</item>
<item>
<title>Use transducers on clojure.core lazy sequence transformations</title>
<link>https://ask.clojure.org/index.php/12847/use-transducers-clojure-core-lazy-sequence-transformations</link>
<description>&lt;p&gt;Hi,&lt;/p&gt;
&lt;p&gt;I heard that if transducers had been part of the language from the start, they would have been used as the building blocks of all lazy sequence operations. Since they were added later, you need to adapt your code a bit to use transducers. I was wondering why using &lt;code&gt;eduction&lt;/code&gt; is not an option. I'm trying to understand the scenarios in which it's not a performance advantage to define the lazy sequence operations like this (of course assuming the same behaviour):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defn map
  ([f] ;; the standard transducer definition
   ,,,)
  ([f coll]
   (eduction (map f) coll)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I'm trying to understand given the assumption (which might be an erroneous assumption) that&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(-&amp;gt;&amp;gt; (range 5000000)
     (eduction (map inc))
     (eduction (filter odd?))
     (eduction (map dec))
     (eduction (filter even?))
     (eduction (map (fn [n] (+ 3 n))))
     (eduction (filter odd?))
     (eduction (map inc))
     (eduction (filter odd?))
     (eduction (map dec))
     (eduction (filter even?))
     (eduction (map (fn [n] (+ 3 n))))
     (eduction (filter odd?))
     (into []))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;is the same as&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(-&amp;gt;&amp;gt; (range 5000000)
     (map inc)
     (filter odd?)
     (map dec)
     (filter even?)
     (map (fn [n] (+ 3 n)))
     (filter odd?)
     (map inc)
     (filter odd?)
     (map dec)
     (filter even?)
     (map (fn [n] (+ 3 n)))
     (filter odd?)
     (into []))
&lt;/code&gt;&lt;/pre&gt;
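&lt;p&gt;(Aside: my understanding is that the stacked eductions above are equivalent to a single fused transduction like the following, which avoids allocating an &lt;code&gt;Eduction&lt;/code&gt; wrapper per step; please correct me if that equivalence doesn't hold:)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(into []
      (comp (map inc) (filter odd?) (map dec) (filter even?)
            (map (fn [n] (+ 3 n))) (filter odd?)
            (map inc) (filter odd?) (map dec) (filter even?)
            (map (fn [n] (+ 3 n))) (filter odd?))
      (range 5000000))
&lt;/code&gt;&lt;/pre&gt;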
</description>
<category>Sequences</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/12847/use-transducers-clojure-core-lazy-sequence-transformations</guid>
<pubDate>Tue, 11 Apr 2023 06:55:00 +0000</pubDate>
</item>
<item>
<title>Compile parser ahead of time</title>
<link>https://ask.clojure.org/index.php/12697/compile-parser-ahead-of-time</link>
<description>&lt;p&gt;&lt;code&gt;tools.cli/parse-opts&lt;/code&gt; accepts the args, the option-spec, and the additional options. It calls &lt;code&gt;compile-option-specs&lt;/code&gt; and &lt;code&gt;required-arguments&lt;/code&gt; to build the &lt;code&gt;specs&lt;/code&gt; and &lt;code&gt;req&lt;/code&gt;. Then it performs the validation with those specs and req on the provided &lt;code&gt;args&lt;/code&gt; (along with the options). For a given application, the first two steps aren't going to change across calls.&lt;/p&gt;
&lt;p&gt;I propose a new &lt;code&gt;make-parse-opts-fn&lt;/code&gt; function that performs the &lt;code&gt;compile-option-specs&lt;/code&gt; and &lt;code&gt;required-arguments&lt;/code&gt; up front and returns a function that relies on the compiled &lt;code&gt;specs&lt;/code&gt; and &lt;code&gt;req&lt;/code&gt;. It could be used like this: &lt;code&gt;(def compiled-parser (make-parse-opts-fn cli-options))&lt;/code&gt;.&lt;/p&gt;
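&lt;p&gt;A rough sketch of what I have in mind (sketch only; it assumes &lt;code&gt;compile-option-specs&lt;/code&gt; and &lt;code&gt;required-arguments&lt;/code&gt; are accessible, possibly via var access if they are private in your tools.cli version):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(require '[clojure.tools.cli :as cli])

(defn make-parse-opts-fn
  [option-specs]
  ;; do the spec compilation once, up front
  (let [specs (#'cli/compile-option-specs option-specs)
        req   (#'cli/required-arguments specs)]
    (fn [args &amp;amp; options]
      ;; reuse the precompiled specs/req on every call;
      ;; the body would mirror the validation half of parse-opts
      ,,,)))
&lt;/code&gt;&lt;/pre&gt;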
&lt;p&gt;Microbenchmarking with criterium shows a more than 2x speedup (&lt;code&gt;tools.cli/parse-opts&lt;/code&gt; first, pre-compiled parser second):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;; user=&amp;gt; (def cli-options
  [[&quot;-h&quot; &quot;--help&quot; &quot;This message&quot;]
   [nil &quot;--extra&quot; &quot;Output in extra format&quot;
    :default false]
   [&quot;-q&quot; &quot;--quiet&quot; &quot;Print no suggestions, only return exit code&quot;
    :default false]])
#'user/cli-options

; user=&amp;gt; (bench (cli/parse-opts [&quot;--quiet&quot; &quot;src&quot;] cli-options :in-order true))
Evaluation count : 10962 in 6 samples of 1827 calls.
             Execution time mean : 65.021329 µs
    Execution time std-deviation : 5.276033 µs
   Execution time lower quantile : 58.658638 µs ( 2.5%)
   Execution time upper quantile : 69.724228 µs (97.5%)
                   Overhead used : 9.607933 ns
nil

; user=&amp;gt; (bench (compiled-parser [&quot;--quiet&quot; &quot;src&quot;] :in-order true))
Evaluation count : 24660 in 6 samples of 4110 calls.
             Execution time mean : 25.090846 µs
    Execution time std-deviation : 253.361821 ns
   Execution time lower quantile : 24.769079 µs ( 2.5%)
   Execution time upper quantile : 25.413286 µs (97.5%)
                   Overhead used : 9.607933 ns
nil
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the additional options are lifted into &lt;code&gt;make-parse-opts-fn&lt;/code&gt; as well (trading flexibility for speed), the difference is even more dramatic, providing roughly a 10x speed increase over the existing &lt;code&gt;tools.cli/parse-opts&lt;/code&gt; function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;; user=&amp;gt; (bench (compiled-parser-2 [&quot;--quiet&quot; &quot;src&quot;]))
Evaluation count : 73116 in 6 samples of 12186 calls.
             Execution time mean : 8.349923 µs
    Execution time std-deviation : 51.419864 ns
   Execution time lower quantile : 8.282266 µs ( 2.5%)
   Execution time upper quantile : 8.407998 µs (97.5%)
                   Overhead used : 9.607933 ns

Found 2 outliers in 6 samples (33.3333 %)
	low-severe	 1 (16.6667 %)
	low-mild	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can provide a patch for this if there's interest.&lt;/p&gt;
</description>
<category>tools.cli</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/12697/compile-parser-ahead-of-time</guid>
<pubDate>Mon, 27 Feb 2023 19:57:36 +0000</pubDate>
</item>
<item>
<title>defrecord should efficiently implement IReduceInit</title>
<link>https://ask.clojure.org/index.php/12576/defrecord-should-efficiently-implement-ireduceinit</link>
<description>&lt;p&gt;In one my the TMD pathways (that came from mastodon) we expand the dataset size using defrecords.  Surprisingly considering everything else that performance case is doing, implementing IReduceInit spend up the timings considerably.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
tech.v3.dataset.reductions-test&amp;gt; (require '[criterium.core :as crit])
nil
tech.v3.dataset.reductions-test&amp;gt; (defrecord YMC [year-month ^long count]
  ;; clojure.lang.IReduceInit
  ;; (reduce [this rfn init]
  ;;   (let [init (reduced-&amp;gt; rfn init
  ;;                  (clojure.lang.MapEntry/create :year-month year-month)
  ;;                  (clojure.lang.MapEntry/create :count count))]
  ;;     (if (and __extmap (not (reduced? init)))
  ;;       (reduce rfn init __extmap)
  ;;       init)))
  )
tech.v3.dataset.reductions_test.YMC
tech.v3.dataset.reductions-test&amp;gt; (let [yc (YMC. :a 1)]
                                   (crit/quick-bench (reduce (fn [acc v] v) nil yc)))
Evaluation count : 6729522 in 6 samples of 1121587 calls.
             Execution time mean : 87.375170 ns
    Execution time std-deviation : 0.173728 ns
   Execution time lower quantile : 87.104982 ns ( 2.5%)
   Execution time upper quantile : 87.550708 ns (97.5%)
                   Overhead used : 2.017589 ns
nil
tech.v3.dataset.reductions-test&amp;gt; (defrecord YMC [year-month ^long count]
   clojure.lang.IReduceInit
   (reduce [this rfn init]
     (let [init (reduced-&amp;gt; rfn init
                    (clojure.lang.MapEntry/create :year-month year-month)
                    (clojure.lang.MapEntry/create :count count))]
       (if (and __extmap (not (reduced? init)))
         (reduce rfn init __extmap)
         init)))
  )
tech.v3.dataset.reductions_test.YMC
tech.v3.dataset.reductions-test&amp;gt; (let [yc (YMC. :a 1)]
                                   (crit/quick-bench (reduce (fn [acc v] v) nil yc)))
Evaluation count : 43415358 in 6 samples of 7235893 calls.
             Execution time mean : 11.775423 ns
    Execution time std-deviation : 0.197683 ns
   Execution time lower quantile : 11.594695 ns ( 2.5%)
   Execution time upper quantile : 12.079668 ns (97.5%)
                   Overhead used : 2.017589 ns
nil
tech.v3.dataset.reductions-test&amp;gt; (defmacro reduced-&amp;gt;
  [rfn acc &amp;amp; data]
  (reduce (fn [expr next-val]
            `(let [val# ~expr]
               (if (reduced? val#)
                 val#
                 (~rfn val# ~next-val))))
          acc
          data))

#'tech.v3.dataset.reductions-test/reduced-&amp;gt;
tech.v3.dataset.reductions-test&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Talking this over with other members, it appears the value lookup pathway could also be optimized in the case where &lt;code&gt;__extmap&lt;/code&gt; is not nil (it uses &lt;code&gt;clojure.core/get&lt;/code&gt; as opposed to a direct &lt;code&gt;getOrDefault&lt;/code&gt; call).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/techascent/tech.ml.dataset/blob/master/test/tech/v3/dataset/reductions_test.clj#L197&quot;&gt;https://github.com/techascent/tech.ml.dataset/blob/master/test/tech/v3/dataset/reductions_test.clj#L197&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
<category>Records and Types</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/12576/defrecord-should-efficiently-implement-ireduceinit</guid>
<pubDate>Sun, 22 Jan 2023 16:37:25 +0000</pubDate>
</item>
<item>
<title>Odd performance penalty on numeric benchmark alleviated by going to doubles...</title>
<link>https://ask.clojure.org/index.php/12536/performance-penalty-numeric-benchmark-alleviated-doubles</link>
<description>&lt;p&gt;I happened across the performance benchmark &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/niklas-heer/speed-comparison/blob/master/src/leibniz.clj&quot;&gt;here&lt;/a&gt; and I was curious why clojure was getting beaten by java.&lt;/p&gt;
&lt;p&gt;So I tossed it into the profiler (after modifying their version to use unchecked math - which didn't help) and nothing shows up. hmm. decompile and find that&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Decompiling class: leibniz$calc_pi_leibniz
import clojure.lang.*;

public final class leibniz$calc_pi_leibniz extends AFunction implements LD
{
    public static double invokeStatic(final long rounds) {
        final long end = 2L + rounds;
        long i = 2L;
        double x = 1.0;
        double pi = 1.0;
        while (i != end) {
            final double x2 = -x;
            final long n = i + 1L;
            final double n2 = x2;
            pi += Numbers.divide(x2, 2L * i - 1L);
            x = n2;
            i = n;
        }
        return Numbers.unchecked_multiply(4L, pi);
    }

    @Override
    public Object invoke(final Object o) {
        return invokeStatic(RT.uncheckedLongCast(o));
    }

    @Override
    public final double invokePrim(final long rounds) {
        return invokeStatic(rounds);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So it looks like the double/long boundary is costing us at least a method lookup, maybe in &lt;code&gt;Numbers.divide&lt;/code&gt;?&lt;br&gt;
So I just coerced everything to double (even our index variable):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(def rounds 100000000)

(defn calc-pi-leibniz2
  &quot;Eliminate mixing of long/double to avoid clojure.numbers invocations.&quot;
  ^double
  [^long rounds]
  (let [end (+ 2.0 rounds)]
    (loop [i 2.0 x 1.0 pi 1.0]
      (if (= i end)
        (* 4.0 pi)
        (let [x (- x)]
          (recur (inc i) x (+ pi (/ x (dec (* 2 i))))))))))
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;leibniz=&amp;gt; (c/quick-bench (calc-pi-leibniz rounds))
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 575.352216 ms
    Execution time std-deviation : 10.070268 ms
   Execution time lower quantile : 566.210399 ms ( 2.5%)
   Execution time upper quantile : 588.772187 ms (97.5%)
                   Overhead used : 1.884700 ns
nil
leibniz=&amp;gt; (c/quick-bench (calc-pi-leibniz2 rounds))
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 158.509049 ms
    Execution time std-deviation : 759.113165 µs
   Execution time lower quantile : 157.234899 ms ( 2.5%)
   Execution time upper quantile : 159.205374 ms (97.5%)
                   Overhead used : 1.884700 ns
nil
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any ideas why the &lt;a rel=&quot;nofollow&quot; href=&quot;https://github.com/niklas-heer/speed-comparison/blob/master/src/leibniz.java&quot;&gt;Java implementation&lt;/a&gt; is not paying the same penalty for division? [Both versions are compiled with unchecked-math set to :warn-on-boxed.]&lt;/p&gt;
&lt;p&gt;I also tried a variant with fastmath's primitive math operators and it actually got slower.  So far nothing has beaten coercing the loop index &lt;code&gt;i&lt;/code&gt; to a double (which I would normally never do).&lt;/p&gt;
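&lt;p&gt;A variant I have not benchmarked yet would keep the loop index a long and coerce only at the division site, which I assume should also keep the boxed &lt;code&gt;Numbers.divide&lt;/code&gt; path out of the hot loop:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defn calc-pi-leibniz3
  &quot;Keep i as a long; coerce the denominator so the division is double/double.&quot;
  ^double
  [^long rounds]
  (let [end (+ 2 rounds)]
    (loop [i 2 x 1.0 pi 1.0]
      (if (== i end)
        (* 4.0 pi)
        (let [x (- x)]
          (recur (inc i) x (+ pi (/ x (double (dec (* 2 i)))))))))))
&lt;/code&gt;&lt;/pre&gt;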
</description>
<category>Java Interop</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/12536/performance-penalty-numeric-benchmark-alleviated-doubles</guid>
<pubDate>Tue, 10 Jan 2023 17:51:55 +0000</pubDate>
</item>
<item>
<title>How/can I profile within a transaction?</title>
<link>https://ask.clojure.org/index.php/12432/how-can-i-profile-within-a-transaction</link>
<description>&lt;p&gt;I've had good luck using VisualVM to find code bottlenecks.  But now it shows that all the time is spent within a LockingTransaction.  Are there any tricks for getting performance results within transactions?&lt;/p&gt;
&lt;p&gt;&lt;a rel=&quot;nofollow&quot; href=&quot;https://gist.github.com/rbelew/a8bcacf0eaf43feff349632744e94eb4&quot;&gt;profileTransaction.png gist &lt;/a&gt;&lt;/p&gt;
</description>
<category>Clojure</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/12432/how-can-i-profile-within-a-transaction</guid>
<pubDate>Tue, 06 Dec 2022 22:16:21 +0000</pubDate>
</item>
<item>
<title>Can I make this routine for scoring a graph bisection more efficient?</title>
<link>https://ask.clojure.org/index.php/12233/can-make-this-routine-scoring-graph-bisection-more-efficient</link>
<description>&lt;p&gt;My code is spending most of its time scoring bisections: determining how many edges of a graph cross from one set of nodes to the other.  &lt;/p&gt;
&lt;p&gt;Assume &lt;code&gt;bisect&lt;/code&gt; is a set of half of a graph's nodes (ints), and &lt;code&gt;edges&lt;/code&gt; is a list of (directed) edges &lt;code&gt;[ [n1 n2] ...]&lt;/code&gt; where &lt;code&gt;n1,n2&lt;/code&gt; are also nodes.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defn tstBisectScore
  &quot;number of edges crossing bisect&quot;
  ([bisect edges]
   (tstBisectScore bisect 0 edges))
  ([bisect nx edge2check]
   (if (empty? edge2check)
     nx
     (let [[n1 n2] (first edge2check)
           inb1 (contains? bisect n1)
           inb2 (contains? bisect n2)]
       (if (or (and inb1 inb2)
               (and (not inb1) (not inb2)))
         (recur bisect nx (rest edge2check))
         (recur bisect (inc nx) (rest edge2check)))))))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The only clues I have via sampling the execution of this code (using VisualVM) shows most of the time spent in &lt;code&gt;clojure.core$empty_QMARK_&lt;/code&gt;, and most of the rest in &lt;code&gt;clojure.core$contains_QMARK_&lt;/code&gt;. (&lt;code&gt;first&lt;/code&gt; and &lt;code&gt;rest&lt;/code&gt; take only a small fraction of the time.)&lt;/p&gt;
&lt;p&gt;Any suggestions as to how I could tighten the code?&lt;/p&gt;
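&lt;p&gt;For reference, here is a reduce-based variant I have been considering but have not benchmarked yet; my assumption is that it avoids the repeated &lt;code&gt;empty?&lt;/code&gt;/&lt;code&gt;rest&lt;/code&gt; seq overhead:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defn bisect-score
  &quot;number of edges crossing bisect, via reduce&quot;
  [bisect edges]
  (reduce (fn [nx [n1 n2]]
            ;; count the edge when exactly one endpoint is inside the bisection
            (if (not= (contains? bisect n1) (contains? bisect n2))
              (inc nx)
              nx))
          0
          edges))
&lt;/code&gt;&lt;/pre&gt;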
</description>
<category>Clojure</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/12233/can-make-this-routine-scoring-graph-bisection-more-efficient</guid>
<pubDate>Sun, 25 Sep 2022 20:30:23 +0000</pubDate>
</item>
<item>
<title>Would it make sense to have a flag to cache protocol implementation lookups?</title>
<link>https://ask.clojure.org/index.php/11801/would-make-sense-have-cache-protocol-implementation-lookups</link>
<description>&lt;p&gt;In recent conversation I learned that some developers avoid and discourage the use of &lt;code&gt;satisfies?&lt;/code&gt; because of performance concerns. I was told to &quot;just look at the implementation, you will sweat bullets&quot;.&lt;/p&gt;
&lt;p&gt;The implementation in this case is in &lt;code&gt;find-protocol-impl&lt;/code&gt;, which has to check metadata and traverse the inheritance chain as well as the implemented interfaces.&lt;/p&gt;
&lt;p&gt;The part of finding a protocol implementation for a specific class seems eminently cacheable, except for the fact that one can extend the protocol later on.&lt;/p&gt;
&lt;p&gt;Would it make sense to cache this, and have &lt;code&gt;extend-*&lt;/code&gt; invalidate the cache? Or alternatively to have a flag (e.g. &lt;code&gt;-J-Dclojure.cache-protocols=true&lt;/code&gt;) for use in production where you know all protocol implementations will be loaded before lookups happen?&lt;/p&gt;
</description>
<category>Protocols</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/11801/would-make-sense-have-cache-protocol-implementation-lookups</guid>
<pubDate>Mon, 25 Apr 2022 13:02:50 +0000</pubDate>
</item>
<item>
<title>Interface to safely construct persistent datastructures</title>
<link>https://ask.clojure.org/index.php/11750/interface-to-safely-construct-persistent-datastructures</link>
<description>&lt;p&gt;When looking for execution speed in an application, it's tempting to reach for a &lt;code&gt;java.util.ArrayList&lt;/code&gt; both when constructing maps and vectors. However there are some pitfalls in doing so. It is not obvious that for eg vectors, one should use &lt;code&gt;LazilyPersistentVector/createOwning&lt;/code&gt; over eg &lt;code&gt;PersistentVector/adopt&lt;/code&gt;, the former working correctly, whereas the only works for vectors smaller than 32.&lt;/p&gt;
&lt;p&gt;Likewise for maps, where one could choose to create a &lt;code&gt;PersistentArrayMap&lt;/code&gt; if there are less than 9 entries in the map, but should use a &lt;code&gt;PersistentHashMap&lt;/code&gt; if there are 9 or more entries.&lt;/p&gt;
&lt;p&gt;I don't want to prescribe solutions here, but a couple of fns in &lt;code&gt;clojure.core&lt;/code&gt; that take arrays as params and return the appropriate data structure would help:&lt;br&gt;
&lt;code&gt;(array-&amp;gt;vector arr) ;; does what LazilyPersistentVector/createOwning does 
(array-&amp;gt;map arr) ;; returns a PAM or PHM depending on size&lt;/code&gt;&lt;/p&gt;
</description>
<category>Clojure</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/11750/interface-to-safely-construct-persistent-datastructures</guid>
<pubDate>Mon, 11 Apr 2022 06:19:39 +0000</pubDate>
</item>
<item>
<title>clojure.data.csv/write-csv predicate creates a set with check for each cell</title>
<link>https://ask.clojure.org/index.php/11719/clojure-data-csv-write-predicate-creates-with-check-each-cell</link>
<description>&lt;p&gt;I discovered this by accident:&lt;br&gt;
Since &lt;code&gt;quote?&lt;/code&gt; is invoked for each cell and the set used as a predicate&lt;br&gt;
 &lt;code&gt;#{separator quote \return \newline}&lt;/code&gt; &lt;br&gt;
isn't a constant expression at compile time, for each cell &lt;code&gt;clojure.lang.RT.set&lt;/code&gt; will end up getting invoked.&lt;/p&gt;
&lt;p&gt;For boring input data such as&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  (def xs (vec (for [_ (range 1000)]
                 (mapv identity (range 10)))))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;it ends up consuming about 50% of the CPU time.&lt;/p&gt;
&lt;p&gt;This can easily be avoided by binding the set to a local before closing over it inside &lt;code&gt;write-csv&lt;/code&gt;.&lt;/p&gt;
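&lt;p&gt;The fix I have in mind looks roughly like this (a sketch, not the actual &lt;code&gt;write-csv&lt;/code&gt; source):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;;; hoist the set out of the per-cell predicate so it is built once
(let [quotable? #{separator quote \return \newline}]
  (fn quote? [s]
    (some quotable? s)))
&lt;/code&gt;&lt;/pre&gt;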
&lt;p&gt;It would be nice if the compiler could detect this.&lt;/p&gt;
&lt;p&gt;Thanks&lt;br&gt;
Ben&lt;/p&gt;
</description>
<category>Clojure</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/11719/clojure-data-csv-write-predicate-creates-with-check-each-cell</guid>
<pubDate>Sun, 03 Apr 2022 11:30:17 +0000</pubDate>
</item>
<item>
<title>Persistent collections can implement equiv() more efficiently</title>
<link>https://ask.clojure.org/index.php/11124/persistent-collections-implement-equiv-more-efficiently</link>
<description>&lt;p&gt;I found that structural equality between persistent collections makes very few assumptions which lead to inefficient implementations, especially for vectors and maps.&lt;/p&gt;
&lt;p&gt;The thrust of the implementation is dispatching via methods which directly iterate over the underlying arrays.&lt;/p&gt;
&lt;p&gt;These implementations aren't the prettiest or most idiomatic but they're efficient. If this gets implemented it would look different in Java anyway.&lt;/p&gt;
&lt;p&gt;I tried these alternative implementations and found dramatic speed ups:&lt;/p&gt;
&lt;h3&gt;Vector&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;(let [die (clojure.lang.Reduced. false)]
  (defn vec-eq
    [^PersistentVector v ^Iterable y]
    (let [iy (.iterator y)]
      (.reduce v (fn [_ x] (if (= x (.next iy)) true die)) true))))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This works well when comparing two vectors, and for vector vs. list.&lt;br&gt;
The current implementation loops from 0 to count and calls nth for every element; nth calls arrayFor() every time, while both reduce and an iterator get the backing array once per array.&lt;/p&gt;
&lt;h3&gt;Map&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;(let [o (Object.)
      die (clojure.lang.Reduced. false)
      eq (fn [m2] (fn [b k v]
                   (let [v' (.valAt ^IPersistentMap m2 k o)]
                     (if (.equals o v')
                       die
                       (if (= v v') true die)))))]
  (defn map-eq
    [m1 m2]
    (.kvreduce ^IKVReduce m1 (eq m2) true)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, too, the implementation iterates directly over the underlying array structure.&lt;br&gt;
The current implementation casts the map to a seq, then iterates over it while getting entries from the other map via the &lt;code&gt;Map&lt;/code&gt; interface.&lt;br&gt;
This implementation avoids casting the map to a sequence and does not allocate entries.&lt;/p&gt;
&lt;h3&gt;Sequences&lt;/h3&gt;
&lt;p&gt;When the receiver is a list, both the object compared against it and the receiver are cast to seqs.&lt;/p&gt;
&lt;p&gt;It could be more efficient to compare it with other collections via an iterator:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defn iter-eq
  [^Iterable x ^Iterable y]
  (let [ix (.iterator x)
        iy (.iterator y)]
    (loop []
      (if (.hasNext ix)
        (if (= (.next ix) (.next iy))
          (recur)
          false)
        true))))
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benchmarking&lt;/h3&gt;
&lt;p&gt;With criterium, vec-eq wins in both cases. There are diminishing returns as size increases, but even at n=64 vec-eq is twice as fast as &lt;code&gt;=&lt;/code&gt;.&lt;br&gt;
map-eq is also 2-3x faster for bigger maps and up to 10x faster for smaller maps.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(doseq [n [1 2 4 8 16 32 64]
        :let [v1 (vec (range n))
              v2 (vec (range n))]]
  (println 'iter-eq n (iter-eq v1 v2))
  (cc/quick-bench (iter-eq v1 v2))
  (println 'vec-eq n (vec-eq v1 v2))
  (cc/quick-bench (vec-eq v1 v2))
  (println '= n (= v1 v2))
  (cc/quick-bench (= v1 v2)))


(doseq [n [1 2 4 8 16 32 64]
        :let [v1 (vec (range n))
              v2 (list* (range n))]]
  (println 'iter-eq n (iter-eq v1 v2))
  (cc/quick-bench (iter-eq v1 v2))
  (println 'vec-eq n (vec-eq v1 v2))
  (cc/quick-bench (vec-eq v1 v2))
  (println '= n (= v1 v2))
  (cc/quick-bench (= v1 v2)))
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;(doseq [n [1 2 4 8 16 32 64]
        :let [m1 (zipmap (range n) (range n))
              m2 (zipmap (range n) (range n))]]
  (cc/quick-bench (map-eq m1 m2))
  (cc/quick-bench (= m1 m2)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Addendum:&lt;br&gt;
Also checked the following cases:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(doseq [n [10000 100000]
        :let [v1 (vec (range n))
              v2 (assoc v1 (dec (count v1)) 7)]]
  (cc/quick-bench (vec-eq v1 v2))
  (cc/quick-bench (iter-eq v1 v2))
  (cc/quick-bench (= v1 v2)))

(doseq [n [100000]
        :let [m1 (zipmap (range n) (range n))
              m2 (assoc m1 (key (last m1)) 7)]]
  (cc/quick-bench (map-eq m1 m2))
  (cc/quick-bench (= m1 m2)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The optimized implementations still win by huge margins.&lt;/p&gt;
</description>
<category>Collections</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/11124/persistent-collections-implement-equiv-more-efficiently</guid>
<pubDate>Thu, 30 Sep 2021 16:57:25 +0000</pubDate>
</item>
<item>
<title>The clojure `rand-int` method seems to be 4 times slower than its java counterpart `java.util.Random.nextInt()`</title>
<link>https://ask.clojure.org/index.php/10669/clojure-method-seems-times-slower-counterpart-random-nextint</link>
<description>&lt;p&gt;Check the below &lt;a rel=&quot;nofollow&quot; href=&quot;https://gist.github.com/a30660391c68255e65ff60970e706c22&quot;&gt;gist&lt;/a&gt; to see detailed results.&lt;/p&gt;
&lt;p&gt;Note that using interop is OK for me;&lt;br&gt;
I am in an operations research use case where the cost of generating random values matters.&lt;/p&gt;
&lt;p&gt;But I wonder why Clojure should not offer a faster version of &lt;code&gt;rand-int&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I suspect (but will check if needed) that the &lt;code&gt;nextInt&lt;/code&gt; version may also be more uniform (statistically closer to the theory).&lt;/p&gt;
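&lt;p&gt;For clarity, the kind of interop replacement I mean (a sketch; it reuses one shared &lt;code&gt;java.util.Random&lt;/code&gt; instance, the name is hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(def ^java.util.Random rng (java.util.Random.))

(defn fast-rand-int
  &quot;Like rand-int, but reuses one Random instance via direct interop.&quot;
  [n]
  (.nextInt rng (int n)))
&lt;/code&gt;&lt;/pre&gt;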
&lt;p&gt;PS: Duplicate from a question in slack, channel clojure.&lt;/p&gt;
</description>
<category>Java Interop</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/10669/clojure-method-seems-times-slower-counterpart-random-nextint</guid>
<pubDate>Fri, 04 Jun 2021 09:12:42 +0000</pubDate>
</item>
<item>
<title>Can core.match LiteralPattern emit more specialized code?</title>
<link>https://ask.clojure.org/index.php/9909/can-core-match-literalpattern-emit-more-specialized-code</link>
<description>&lt;p&gt;Hello,&lt;br&gt;
Looking at the code emitted by core.match, I see that literals are always compared using &lt;code&gt;=&lt;/code&gt;. &lt;br&gt;
While it works, wouldn't it be faster to specialize equality for different literals, such as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; (number? l) `(and (number? ~ocr) (== ~l ~ocr))
 (keyword? l) `(identical? ~l ~ocr)
 (nil? l) `(nil? ~ocr)
 (true? l) `(true? ~ocr)
 (false? l) `(false? ~ocr)
 (string? l) `(.equals ~l ~ocr)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Snippet added to the &lt;code&gt;LiteralPattern&lt;/code&gt; cond.)&lt;/p&gt;
&lt;p&gt;Crude benchmarks show it's faster than &lt;code&gt;=&lt;/code&gt; and all the tests pass, too.&lt;/p&gt;
&lt;p&gt;Think this warrants a patch?&lt;/p&gt;
</description>
<category>core.match</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/9909/can-core-match-literalpattern-emit-more-specialized-code</guid>
<pubDate>Tue, 01 Dec 2020 07:29:48 +0000</pubDate>
</item>
<item>
<title>clojure.walk/walk can use transducers and protocols</title>
<link>https://ask.clojure.org/index.php/9801/clojure-walk-walk-can-use-transducers-and-protocols</link>
<description>&lt;p&gt;Looking at &lt;code&gt;clojure.walk/walk&lt;/code&gt;'s implementation it looks like there's a good opportunity to improve its performance for vectors and maps by using a transducer for the &lt;code&gt;coll?&lt;/code&gt; case:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   (coll? form) (outer (into (empty form) (map inner form))) ; old
   (coll? form) (outer (into (empty form) (map inner) form)) ; new
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Another opportunity is replacing the &lt;code&gt;cond&lt;/code&gt; dispatch with a protocol.&lt;/p&gt;
&lt;p&gt;Also see this Jira: &lt;a rel=&quot;nofollow&quot; href=&quot;https://clojure.atlassian.net/browse/CLJ-1239&quot;&gt;faster, more flexible dispatch for clojure.walk&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;See full implementation and benchmarks below:&lt;/p&gt;
&lt;p&gt;Walk with transducer implementation&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defn walk*
  [inner outer form]
  (cond
    (list? form) (outer (apply list (map inner form)))
    (instance? clojure.lang.IMapEntry form)
    (outer (clojure.lang.MapEntry/create (inner (key form)) (inner (val form))))
    (seq? form) (outer (doall (map inner form)))
    (instance? clojure.lang.IRecord form)
    (outer (reduce (fn [r x] (conj r (inner x))) form form))
    (coll? form) (outer (into (empty form) (map inner) form))
    :else (outer form)))

(defn postwalk*
  [f form]
  (walk* (fn [form'] (postwalk* f form')) f form ))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Protocol implementation (with transducer): &lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defprotocol IWalk
  (-walk [form inner outer]))

(extend-protocol IWalk

  clojure.lang.PersistentList
  (-walk [form inner outer]
    (outer (apply list (map inner form))))

  clojure.lang.PersistentQueue
  (-walk [form inner outer]
    (outer (apply list (map inner form))))

  clojure.lang.MapEntry
  (-walk [form inner outer]
    (outer (clojure.lang.MapEntry/create (inner (key form)) (inner (val form)))))

  clojure.lang.LazySeq
  (-walk [form inner outer]
    (outer (doall (map inner form))))

  clojure.lang.PersistentVector
  (-walk [form inner outer]
    (outer (into (empty form) (map inner) form)))

  clojure.lang.PersistentArrayMap
  (-walk [form inner outer]
    (outer (into (empty form) (map inner) form)))

  clojure.lang.PersistentHashMap
  (-walk [form inner outer]
    (outer (into (empty form) (map inner) form)))

  clojure.lang.PersistentHashSet
  (-walk [form inner outer]
    (outer (into (empty form) (map inner) form)))

  Object
  (-walk [form inner outer]
    (if (instance? clojure.lang.IRecord form)
      (outer (reduce (fn [r x] (conj r (inner x))) form form))
      (outer form)))

  nil
  (-walk [form inner outer]
    (outer form)))

(defn postwalk
  [f form]
  (-walk form (fn [form'] (postwalk f form')) f))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Benchmark results (with criterium)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(require '[criterium.core :as cc])

(def form
  '(1 2 [3 4 5] {:a 6 7 8} [9 [10]] #{:b 7}))


(do
  (cc/bench (postwalk identity form))
  (cc/bench (postwalk* identity form))
  (cc/bench (walk/postwalk identity form)))

;;; Evaluation count : 8413020 in 60 samples of 140217 calls.
;;;              Execution time mean : 7.287719 µs
;;;     Execution time std-deviation : 128.290658 ns
;;;    Execution time lower quantile : 7.119399 µs ( 2.5%)
;;;    Execution time upper quantile : 7.509465 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;;
;;; Found 1 outliers in 60 samples (1.6667 %)
;;; 	low-severe	 1 (1.6667 %)
;;;  Variance from outliers : 6.2932 % Variance is slightly inflated by outliers
;;; Evaluation count : 7252680 in 60 samples of 120878 calls.
;;;              Execution time mean : 8.393008 µs
;;;     Execution time std-deviation : 140.292941 ns
;;;    Execution time lower quantile : 8.222419 µs ( 2.5%)
;;;    Execution time upper quantile : 8.724502 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;;
;;; Found 3 outliers in 60 samples (5.0000 %)
;;; 	low-severe	 2 (3.3333 %)
;;; 	low-mild	 1 (1.6667 %)
;;;  Variance from outliers : 6.2524 % Variance is slightly inflated by outliers
;;; Evaluation count : 5888880 in 60 samples of 98148 calls.
;;;              Execution time mean : 10.259563 µs
;;;     Execution time std-deviation : 344.368716 ns
;;;    Execution time lower quantile : 10.017438 µs ( 2.5%)
;;;    Execution time upper quantile : 10.594850 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;;
;;; Found 2 outliers in 60 samples (3.3333 %)
;;; 	low-severe	 1 (1.6667 %)
;;; 	low-mild	 1 (1.6667 %)
;;;  Variance from outliers : 20.5816 % Variance is moderately inflated by outliers

(def form
  '(defn walk*
     [inner outer form]
     (cond
       (list? form) (outer (apply list (map inner form)))
       (instance? clojure.lang.IMapEntry form)
       (outer (clojure.lang.MapEntry/create (inner (key form)) (inner (val form))))
       (seq? form) (outer (doall (map inner form)))
       (instance? clojure.lang.IRecord form)
       (outer (reduce (fn [r x] (conj r (inner x))) form form))
       (coll? form) (outer (into (empty form) (map inner) form))
       :else (outer form))))


(do
  (cc/bench (postwalk identity form))
  (cc/bench (postwalk* identity form))
  (cc/bench (walk/postwalk identity form)))

;;; Evaluation count : 1812840 in 60 samples of 30214 calls.
;;;              Execution time mean : 33.196956 µs
;;;     Execution time std-deviation : 961.919396 ns
;;;    Execution time lower quantile : 32.063979 µs ( 2.5%)
;;;    Execution time upper quantile : 34.546564 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;;
;;; Found 6 outliers in 60 samples (10.0000 %)
;;; 	low-severe	 2 (3.3333 %)
;;; 	low-mild	 1 (1.6667 %)
;;; 	high-mild	 3 (5.0000 %)
;;;  Variance from outliers : 15.8051 % Variance is moderately inflated by outliers
;;; Evaluation count : 1653840 in 60 samples of 27564 calls.
;;;              Execution time mean : 36.626230 µs
;;;     Execution time std-deviation : 441.227719 ns
;;;    Execution time lower quantile : 35.798588 µs ( 2.5%)
;;;    Execution time upper quantile : 37.373995 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;; Evaluation count : 1728600 in 60 samples of 28810 calls.
;;;              Execution time mean : 35.173883 µs
;;;     Execution time std-deviation : 400.776590 ns
;;;    Execution time lower quantile : 34.697017 µs ( 2.5%)
;;;    Execution time upper quantile : 35.825413 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;;
;;; Found 1 outliers in 60 samples (1.6667 %)
;;; 	low-severe	 1 (1.6667 %)
;;;  Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
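;;; Summary of the three runs above (criterium means, in the order of the
;;; `do` block: postwalk, postwalk*, walk/postwalk):
;;;   postwalk        33.20 µs
;;;   postwalk*       36.63 µs
;;;   walk/postwalk   35.17 µs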

(def form
  {1 {2 {3 {4 {5 {6 {7 {8 {9 {10 11}}}}}}}}}})


(do
  (cc/bench (postwalk identity form))
  (cc/bench (postwalk* identity form))
  (cc/bench (walk/postwalk identity form)))

;;; Evaluation count : 6947100 in 60 samples of 115785 calls.
;;;              Execution time mean : 8.809319 µs
;;;     Execution time std-deviation : 163.576702 ns
;;;    Execution time lower quantile : 8.627843 µs ( 2.5%)
;;;    Execution time upper quantile : 9.126265 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;;
;;; Found 2 outliers in 60 samples (3.3333 %)
;;; 	low-severe	 2 (3.3333 %)
;;;  Variance from outliers : 7.8088 % Variance is slightly inflated by outliers
;;; Evaluation count : 6457380 in 60 samples of 107623 calls.
;;;              Execution time mean : 9.628546 µs
;;;     Execution time std-deviation : 171.963701 ns
;;;    Execution time lower quantile : 9.316393 µs ( 2.5%)
;;;    Execution time upper quantile : 9.976758 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;;
;;; Found 5 outliers in 60 samples (8.3333 %)
;;; 	low-severe	 2 (3.3333 %)
;;; 	low-mild	 2 (3.3333 %)
;;; 	high-mild	 1 (1.6667 %)
;;;  Variance from outliers : 7.7664 % Variance is slightly inflated by outliers
;;; Evaluation count : 5483100 in 60 samples of 91385 calls.
;;;              Execution time mean : 11.064318 µs
;;;     Execution time std-deviation : 167.430489 ns
;;;    Execution time lower quantile : 10.854539 µs ( 2.5%)
;;;    Execution time upper quantile : 11.447064 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;;
;;; Found 2 outliers in 60 samples (3.3333 %)
;;; 	low-severe	 1 (1.6667 %)
;;; 	low-mild	 1 (1.6667 %)
;;;  Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
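;;; Summary of the three runs above (criterium means, in the order of the
;;; `do` block: postwalk, postwalk*, walk/postwalk):
;;;   postwalk         8.81 µs
;;;   postwalk*        9.63 µs
;;;   walk/postwalk   11.06 µs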

(def form
  [1 [2 [3 [4 [5 [6 [7 [8 [9 [10 11]]]]]]]]]])

(do
  (cc/bench (postwalk identity form))
  (cc/bench (postwalk* identity form))
  (cc/bench (walk/postwalk identity form)))

;;; Evaluation count : 7627620 in 60 samples of 127127 calls.
;;;              Execution time mean : 7.770194 µs
;;;     Execution time std-deviation : 81.222440 ns
;;;    Execution time lower quantile : 7.610275 µs ( 2.5%)
;;;    Execution time upper quantile : 7.913045 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;; Evaluation count : 6941880 in 60 samples of 115698 calls.
;;;              Execution time mean : 8.726047 µs
;;;     Execution time std-deviation : 133.165422 ns
;;;    Execution time lower quantile : 8.557593 µs ( 2.5%)
;;;    Execution time upper quantile : 8.961663 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;;
;;; Found 1 outliers in 60 samples (1.6667 %)
;;; 	low-severe	 1 (1.6667 %)
;;;  Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
;;; Evaluation count : 5045520 in 60 samples of 84092 calls.
;;;              Execution time mean : 12.051122 µs
;;;     Execution time std-deviation : 223.757365 ns
;;;    Execution time lower quantile : 11.799274 µs ( 2.5%)
;;;    Execution time upper quantile : 12.768594 µs (97.5%)
;;;                    Overhead used : 9.033571 ns
;;;
;;; Found 3 outliers in 60 samples (5.0000 %)
;;; 	low-severe	 1 (1.6667 %)
;;; 	low-mild	 2 (3.3333 %)
;;;  Variance from outliers : 7.8088 % Variance is slightly inflated by outliers
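;;; Summary of the three runs above (criterium means, in the order of the
;;; `do` block: postwalk, postwalk*, walk/postwalk):
;;;   postwalk         7.77 µs
;;;   postwalk*        8.73 µs
;;;   walk/postwalk   12.05 µs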
&lt;/code&gt;&lt;/pre&gt;
</description>
<category>Transducers</category>
<guid isPermaLink="true">https://ask.clojure.org/index.php/9801/clojure-walk-walk-can-use-transducers-and-protocols</guid>
<pubDate>Fri, 13 Nov 2020 19:00:00 +0000</pubDate>
</item>
</channel>
</rss>