
+9 votes
in Clojure by

Whilst doing performance work recently, I discovered that unrolling multi-key assoc calls into a chain of single-key assoc calls was significantly faster than passing multiple keys at once (~10% for my particular application). Zachary Tellman then pointed out that clojure.core doesn't unroll assoc at all, even for relatively low numbers of keys.

We already unroll other performance-critical functions that call things via apply, e.g. update (https://github.com/clojure/clojure/blob/master/src/clj/clojure/core.clj#L5914); assoc (which is, I think, on the critical path for quite a few applications and libraries) would likely benefit from the same treatment.
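For context, a minimal sketch of what "unrolling" means here (this is illustrative only, not the actual patch, and `unrolled-assoc` is a hypothetical name): extra fixed arities handle small key counts directly, and only longer calls fall back to the seq-walking loop that core's assoc uses today.

```clojure
;; Minimal sketch of the unrolling idea; not the actual CLJ-1656 patch.
(defn unrolled-assoc
  ([m k v] (clojure.lang.RT/assoc m k v))
  ([m k v k2 v2]
   (-> m (unrolled-assoc k v) (unrolled-assoc k2 v2)))
  ([m k v k2 v2 k3 v3]
   (-> m (unrolled-assoc k v) (unrolled-assoc k2 v2) (unrolled-assoc k3 v3)))
  ([m k v k2 v2 k3 v3 & kvs]
   ;; longer calls fall back to a seq-walking loop, as core's assoc does today
   (loop [ret (unrolled-assoc m k v k2 v2 k3 v3), kvs kvs]
     (if kvs
       (if (next kvs)
         (recur (unrolled-assoc ret (first kvs) (second kvs)) (nnext kvs))
         (throw (IllegalArgumentException.
                  "assoc expects even number of arguments after map/vector, found odd number")))
       ret))))
```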

I have not yet developed patches for this, but I did some standalone benchmarking work:

https://github.com/yeller/unrolling-assoc-benchmarks

benchmark results:

code: https://github.com/yeller/unrolling-assoc-benchmarks/blob/master/src/bench_assoc_unrolling.clj

| target collection | 1 key | 2 keys | 3 keys | 4 keys |
| :-- | :-- | :-- | :-- | :-- |
| empty array map (not unrolled) | 23ns | 93ns | 156ns | 224ns |
| empty array map (unrolled assoc) | N/A | 51ns | 80ns | 110ns |
| | | | | |
| 20 element persistent hashmap (not unrolled) | 190ns | 313ns | 551ns | 651ns |
| 20 element persistent hashmap (unrolled assoc) | N/A | 250ns | 433ns | 524ns |
| | | | | |
| record (not unrolled) | 12ns | 72ns | 105ns | 182ns |
| record (unrolled assoc) | N/A | 21ns | 28ns | 41ns |

Each measurement was made in a separate JVM, to avoid JIT path dependence.

Benchmarks were run on a commodity server (8 CPUs, 32GB RAM) with Ubuntu 12.04 and a recent release of Java 8. Attached are cpuinfo, uname, and java -version output.

Relatively standard JVM production flags were enabled, and care was taken to disable Leiningen's startup-time optimizations (which disable many of the JIT optimizations).

Benchmarks can be run by cloning the repository and running script/bench.

There's one outstanding question for this patch: How far should we unroll these calls? update (which is unrolled in the 1.7 alphas) is unrolled to 3 arguments. Adding more unrolling isn't difficult, but it does impact the readability of assoc.
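For comparison, the unrolled shape that update already has in the 1.7 alphas looks roughly like this (paraphrased; see the core.clj link above for the authoritative source):

```clojure
;; Roughly the shape of clojure.core/update's unrolling in the 1.7 alphas
;; (see the linked core.clj for the exact definition).
(defn update
  ([m k f] (assoc m k (f (get m k))))
  ([m k f x] (assoc m k (f (get m k) x)))
  ([m k f x y] (assoc m k (f (get m k) x y)))
  ([m k f x y z] (assoc m k (f (get m k) x y z)))
  ([m k f x y z & more] (assoc m k (apply f (get m k) x y z more))))
```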

Patch: CLJ-1656-v5.patch

25 Answers

0 votes
by

Comment made by: tcrayford

OK, attached assoc.diff, which unrolls this one level further than the current code (so it supports two key/value pairs without recursion). The code will get pretty complex for calls with more than the unrolled number of keys if we continue this way, so I'm unsure whether this is a good approach, but the performance benefits seem very compelling.

0 votes
by

Comment made by: michaelblume

Since the unroll comes out kind of hairy, why not have a macro write it for us?
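Something along these lines, purely for the sake of discussion (the macro below is illustrative and not from any attached patch): a macro can at least generate the nested single-pair calls at expansion time, and generating the full set of arities is a small step further.

```clojure
;; Illustrative only; name and shape are hypothetical, not from the patch.
(defmacro unroll-assoc-pairs
  "Expands to nested single-pair assoc calls for the given key/value forms."
  [m & kvs]
  (assert (even? (count kvs)) "expected an even number of key/value forms")
  (reduce (fn [acc [k v]] `(assoc ~acc ~k ~v))
          m
          (partition 2 kvs)))

;; (macroexpand-1 '(unroll-assoc-pairs m :a 1 :b 2))
;; => (clojure.core/assoc (clojure.core/assoc m :a 1) :b 2)
```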

0 votes
by

Comment made by: michaelblume

Patch v2 includes assoc!

0 votes
by

Comment made by: tcrayford

I benchmarked conj with similar unrolling, across a relatively wide range of datatypes from core (lists, sets, vectors, each one empty and then again with 20 elements):

| target collection | 1 element | 2 elements | 3 elements | 4 elements |
| :-- | :-- | :-- | :-- | :-- |
| empty vector (not unrolled) | 19ns | 57ns | 114ns | 126ns |
| empty vector (unrolled conj) | N/A | 44ns | 67ns | 91ns |
| | | | | |
| 20 element vector (not unrolled) | 27.35ns | 69ns | 111ns | 107ns |
| 20 element vector (unrolled conj) | N/A | 54ns | 79ns | 104ns |
| | | | | |
| empty list (not unrolled) | 7ns | 28ns | 53ns | 51ns |
| empty list (unrolled conj) | N/A | 15ns | 20ns | 26ns |
| | | | | |
| twenty element list (not unrolled) | 8.9ns | 26ns | 49ns | 49ns |
| twenty element list (unrolled) | N/A | 15ns | 19ns | 30ns |
| | | | | |
| empty set (not unrolled) | 64ns | 170ns | 286ns | 290ns |
| empty set (unrolled) | N/A | 154ns | 249ns | 350ns |
| | | | | |
| twenty element set (not unrolled) | 33ns | 81ns | 132ns | 130ns |
| twenty element set (unrolled) | N/A | 69ns | 108ns | 139ns |

Benchmarks were run on the same machine as before. The benefit is less clear here, except for lists, where the overhead of creating seqs and recursing clearly dominates the cost of actually doing the conj (which makes sense: conj onto a list should be a very cheap operation regardless of the list's size). Raw benchmark output is here: https://gist.github.com/tcrayford/51a3cd24b8b0a8b7fd74

0 votes
by

Comment made by: tcrayford

Michael Blume: I like those patches! They read far more nicely to me than my original patch. Did you check whether any of those macro-generated methods blew HotSpot's hot-code inlining limit? (It's 235 bytecodes.) That'd be my only worry with using macros here: it's easy to generate code that defeats the inliner.

0 votes
by

Comment made by: michaelblume

Thanks! This is new for me, so I might be doing the wrong thing, but I just ran nodisassemble over both definitions. The "instruction numbers" next to each line go up to 219 for the varargs arity of assoc and up to 251 for assoc!, so, assuming I'm looking at the right thing, maybe that one needs to have an arity taken off? If I remove the highest arity, I get 232 for the varargs case, which is just under the limit.

I guess alternatively we could call assoc! instead of assoc!* in the varargs arity, which removes a lot of code -- in that case it's 176 for varargs and 149 for six pairs.
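For anyone else repeating this check, the rough workflow looks like this (assuming the lein-nodisassemble plugin/agent is on the classpath; the namespace and function names here are from memory and should be double-checked against its README):

```clojure
;; Assumes the no.disassemble library is loaded via its Leiningen plugin/agent;
;; API details are an assumption, verify against the library's docs.
(require '[no.disassemble :refer [disassemble]])

;; Print the bytecode listing for the fn behind #'assoc and eyeball the
;; instruction count of each arity's invoke method against HotSpot's
;; hot-method inlining budget.
(println (disassemble assoc))
```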

0 votes
by

Comment made by: michaelblume

Gah, I forgot to include coll in the varargs call to assoc!

Which reminds me that this patch needs tests.

0 votes
by

Comment made by: michaelblume

OK, this has some fixes I made after examining the disassembled output. There's a change to the assoc!* macro to be sure it type-hints correctly -- I'm honestly not sure why it didn't type-hint properly before, but it does now. Also, I changed the call to assoc! rolling up the first six entries at the top of the varargs version from a macro call to a function call so it'd fit within the 251 inlineable bytecodes. (This, again, is assuming I'm reading the output correctly).
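On the type-hinting point, the general gotcha with hints in macro-generated code is that the :tag metadata has to end up on the form the compiler actually sees at expansion time. A purely illustrative sketch (this is not the patch's assoc!* definition):

```clojure
;; Purely illustrative; not the patch's assoc!*. Without the :tag metadata
;; on the emitted form, the interop call below falls back to reflection.
(defmacro assoc!-hinted [coll k v]
  `(.assoc ~(with-meta coll {:tag 'clojure.lang.ITransientAssociative}) ~k ~v))
```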

0 votes
by

Comment made by: tcrayford

Michael: Wanna push a branch with these patches to clojars or something? Then I can rerun the benchmarks with the exact code in the patches.

0 votes
by

Comment made by: michaelblume

Hmm, not sure I know how to do that -- here's a branch on GitHub, though: https://github.com/MichaelBlume/clojure/tree/unroll-assoc

0 votes
by

Comment made by: michaelblume

v5 marks the helper macros private.

0 votes
by

Comment made by: tcrayford

Michael: was that branch just based off clojure/clojure master? I tried running benchmarks off it, but ran into undefined var errors when building this code (which doesn't happen with alpha5):

(Retrieving com/yellerapp/clojure-unrolled-assoc/1.7.0-unrollassoc-SNAPSHOT/clojure-unrolled-assoc-1.7.0-unrollassoc-20150213.092242-1.pom from clojars)
(Retrieving com/yellerapp/clojure-unrolled-assoc/1.7.0-unrollassoc-SNAPSHOT/clojure-unrolled-assoc-1.7.0-unrollassoc-20150213.092242-1.jar from clojars)
(Retrieving org/clojure/clojure/1.3.0/clojure-1.3.0.jar from central)
Exception in thread "main" java.lang.RuntimeException: Unable to resolve symbol: bench in this context, compiling:(bench_assoc_unrolling.clj:5)

at clojure.lang.Compiler.analyze(Compiler.java:6235)
at clojure.lang.Compiler.analyze(Compiler.java:6177)
at clojure.lang.Compiler$InvokeExpr.parse(Compiler.java:3452)
at clojure.lang.Compiler.analyzeSeq(Compiler.java:6411)
at clojure.lang.Compiler.analyze(Compiler.java:6216)
at clojure.lang.Compiler.analyze(Compiler.java:6177)
at clojure.lang.Compiler$BodyExpr$Parser.parse(Compiler.java:5572)
at clojure.lang.Compiler$FnMethod.parse(Compiler.java:5008)
0 votes
by

Comment made by: michaelblume

OK, how are you building? And why the strange group ID on the Clojure dependency?

0 votes
by

Comment made by: michaelblume

The existing version of assoc does a runtime check that an even number of varargs is passed in, but assoc! does not. Do we want to preserve this behavior, or add the check to both?
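For reference, a sketch of the behavior being discussed, mirrored onto assoc! (the function name here is hypothetical, not part of the patch):

```clojure
;; Sketch only: a varargs assoc! wrapper with assoc's even-argument check.
(defn assoc!-checked [coll & kvs]
  (when (odd? (count kvs))
    (throw (IllegalArgumentException.
             "assoc! expects an even number of arguments after coll, found odd number")))
  (reduce (fn [c [k v]] (assoc! c k v)) coll (partition 2 kvs)))
```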

0 votes
by

Comment made by: michaelblume

Also, I'm curious how relevant inlining is here -- does HotSpot inlining actually work with Var invocation when there's a getRootBinding step in the way?

...