+7 votes
in Clojure by

Whilst doing performance work recently, I discovered that unrolling into single assoc calls was significantly faster than passing multiple keys to one call (~10% for my particular application). Zachary Tellman then pointed out that clojure.core doesn't unroll assoc at all, even for relatively low numbers of keys.
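
For concreteness, a minimal sketch of the two call shapes being compared (the map and keys here are just placeholders):

```clojure
(def m {})

;; multi-key call: goes through assoc's varargs arity
(assoc m :a 1 :b 2 :c 3)

;; "unrolled" equivalent: a chain of single key/value assoc calls
(-> m
    (assoc :a 1)
    (assoc :b 2)
    (assoc :c 3))
```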

We already unroll other performance-critical functions that would otherwise go through apply, e.g. update https://github.com/clojure/clojure/blob/master/src/clj/clojure/core.clj#L5914, and assoc (which is, I think, on the critical path for quite a few applications and libraries) would likely benefit from the same treatment.

I have not yet developed patches for this, but I did some standalone benchmarking work:

https://github.com/yeller/unrolling-assoc-benchmarks

benchmark results:

code: https://github.com/yeller/unrolling-assoc-benchmarks/blob/master/src/bench_assoc_unrolling.clj

| Target collection | 1 kv pair | 2 kv pairs | 3 kv pairs | 4 kv pairs |
| :-- | :-- | :-- | :-- | :-- |
| empty array map (not unrolled) | 23ns | 93ns | 156ns | 224ns |
| empty array map (unrolled assoc) | N/A | 51ns | 80ns | 110ns |
| | | | | |
| 20 element persistent hashmap (not unrolled) | 190ns | 313ns | 551ns | 651ns |
| 20 element persistent hashmap (unrolled assoc) | N/A | 250ns | 433ns | 524ns |
| | | | | |
| record (not unrolled) | 12ns | 72ns | 105ns | 182ns |
| record (unrolled assoc) | N/A | 21ns | 28ns | 41ns |

Each measurement was made in a separate JVM, to avoid JIT path dependence.

Benchmarks were run on a commodity server (8 CPUs, 32 GB RAM) with Ubuntu 12.04 and a recent release of Java 8. Attached are the cpuinfo, uname and java -version output.

Relatively standard JVM production flags were enabled, and care was taken to disable Leiningen's startup-time optimizations (which disable many of the JIT optimizations).
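
As a point of reference (not necessarily the exact configuration used in the benchmark repo): Leiningen's default :jvm-opts include -XX:+TieredStopAtLevel=1 to speed up startup, which suppresses the later JIT tiers. A project.clj along these lines (hypothetical coordinates) restores full JIT behaviour for benchmarking:

```clojure
;; project.clj sketch (hypothetical project name); ^:replace drops
;; Leiningen's default startup-oriented flags such as -XX:+TieredStopAtLevel=1
(defproject bench-assoc "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.7.0-alpha5"]]
  :jvm-opts ^:replace ["-server" "-Xmx1g"])
```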

Benchmarks can be run by cloning the repository and running script/bench.

There's one outstanding question for this patch: How far should we unroll these calls? update (which is unrolled in the 1.7 alphas) is unrolled to 3 arguments. Adding more unrolling isn't difficult, but it does impact the readability of assoc.
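
For comparison, the unrolled update in the 1.7 alphas looks roughly like this (paraphrased from clojure/core.clj; the exact source is linked above):

```clojure
(defn update
  "'Updates' a value in an associative structure, where k is a key and
   f is a function that takes the old value plus any supplied args."
  ([m k f]       (assoc m k (f (get m k))))
  ([m k f x]     (assoc m k (f (get m k) x)))
  ([m k f x y]   (assoc m k (f (get m k) x y)))
  ([m k f x y z] (assoc m k (f (get m k) x y z)))
  ([m k f x y z & more]
   ;; only fall back to apply once the unrolled arities run out
   (assoc m k (apply f (get m k) x y z more))))
```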

Patch: CLJ-1656-v5.patch

25 Answers

0 votes
by

Comment made by: tcrayford

OK, I've attached assoc.diff, which unrolls this one level further than the current code (so two key/value pairs are supported without recursion). The code is going to get pretty complex beyond the unrolled number of keys if we continue down this path, so I'm unsure whether this is a good approach, but the performance benefits seem very compelling.
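
In sketch form (not the actual diff, which works directly against the Java implementation), adding one explicitly unrolled arity looks something like this, with `assoc2` a hypothetical standalone name:

```clojure
(defn assoc2
  "Hypothetical sketch of assoc with an explicit two-pair arity."
  ([map k v]
   (clojure.core/assoc map k v))
  ([map k1 v1 k2 v2]
   ;; two pairs handled without touching the varargs path
   (clojure.core/assoc (clojure.core/assoc map k1 v1) k2 v2))
  ([map k1 v1 k2 v2 & kvs]
   ;; anything longer falls back to the existing varargs machinery
   (apply clojure.core/assoc (clojure.core/assoc map k1 v1 k2 v2) kvs)))
```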

0 votes
by

Comment made by: michaelblume

Since the unroll comes out kind of hairy, why not have a macro write it for us?
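
Something along these lines, for instance (a hypothetical compile-time helper, not the macro from the patch):

```clojure
(defmacro inline-assoc
  "Hypothetical sketch: expand a literal sequence of key/value forms
   into nested single-pair assoc calls at compile time."
  [m & kvs]
  (assert (even? (count kvs)) "inline-assoc expects an even number of key/value forms")
  (reduce (fn [acc [k v]] `(assoc ~acc ~k ~v))
          m
          (partition 2 kvs)))

;; (inline-assoc {} :a 1 :b 2) expands to (assoc (assoc {} :a 1) :b 2)
```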

0 votes
by

Comment made by: michaelblume

Patch v2 includes assoc!

0 votes
by

Comment made by: tcrayford

I benchmarked conj with similar unrolling, across a relatively wide range of datatypes from core (lists, sets, vectors, each one empty and then again with 20 elements):

| Target collection | 1 element | 2 elements | 3 elements | 4 elements |
| :-- | :-- | :-- | :-- | :-- |
| empty vector (not unrolled) | 19ns | 57ns | 114ns | 126ns |
| empty vector (unrolled conj) | N/A | 44ns | 67ns | 91ns |
| | | | | |
| 20 element vector (not unrolled) | 27.35ns | 69ns | 111ns | 107ns |
| 20 element vector (unrolled conj) | N/A | 54ns | 79ns | 104ns |
| | | | | |
| empty list (not unrolled) | 7ns | 28ns | 53ns | 51ns |
| empty list (unrolled conj) | N/A | 15ns | 20ns | 26ns |
| | | | | |
| twenty element list (not unrolled) | 8.9ns | 26ns | 49ns | 49ns |
| twenty element list (unrolled) | N/A | 15ns | 19ns | 30ns |
| | | | | |
| empty set (not unrolled) | 64ns | 170ns | 286ns | 290ns |
| empty set (unrolled) | N/A | 154ns | 249ns | 350ns |
| | | | | |
| twenty element set (not unrolled) | 33ns | 81ns | 132ns | 130ns |
| twenty element set (unrolled) | N/A | 69ns | 108ns | 139ns |

Benchmarks were run on the same machine as before. The benefit is less clear here, except for lists, where the overhead of creating seqs and recursing clearly dominates the cost of actually doing the conj (which makes sense: conj onto a list of any size should be a very cheap operation). Raw benchmark output is here: https://gist.github.com/tcrayford/51a3cd24b8b0a8b7fd74
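
For reference, the non-unrolled varargs arity in clojure.core is roughly the following, which is where the per-element seq traversal and extra recursion come from:

```clojure
;; paraphrased from clojure/core.clj: each extra element costs a
;; first/next walk of the rest-args seq plus another pass through
;; the varargs arity
(defn conj
  ([coll x] (clojure.lang.RT/conj coll x))
  ([coll x & xs]
   (if xs
     (recur (clojure.lang.RT/conj coll x) (first xs) (next xs))
     (clojure.lang.RT/conj coll x))))
```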

0 votes
by

Comment made by: tcrayford

Michael Blume: I like those patches! They read far nicer to me than my original patch. Did you check whether any of those macro-generated methods blew HotSpot's hot-code inlining limit (it's 235 bytecodes)? That'd be my only worry with using macros here: it's easy to generate code that defeats the inliner.
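
One way to check, rather than counting bytecodes by hand, is to run the benchmark with HotSpot's diagnostic inlining output enabled and grep for the method of interest; a sketch assuming a Leiningen project with hypothetical coordinates:

```clojure
;; project.clj sketch: print JIT inlining decisions, including
;; "hot method too big" failures, to stdout
(defproject inline-check "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.7.0-alpha5"]]
  :jvm-opts ^:replace ["-XX:+UnlockDiagnosticVMOptions"
                       "-XX:+PrintInlining"])
```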

0 votes
by

Comment made by: michaelblume

Thanks! This is new for me, so I might be doing the wrong thing, but I just ran nodisassemble over both definitions. The "instruction numbers" next to each line go up to 219 for the varargs arity of assoc and up to 251 for assoc!, so, assuming I'm looking at the right thing, maybe that one needs an arity taken off? If I remove the highest arity, I get 232 for varargs, which is just under the limit.

I guess alternatively we could call assoc! instead of assoc!* in the varargs arity, which removes a lot of code -- in that case it's 176 for varargs and 149 for six pairs.
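
For anyone reproducing this measurement, a minimal usage sketch of the nodisassemble tooling (assuming the lein-nodisassemble plugin is in the project's plugin list) looks like:

```clojure
;; REPL sketch: print the disassembled bytecode of a function and check
;; the highest instruction offset in each method body
(require '[no.disassemble :refer [disassemble]])

(println (disassemble (fn [m k v] (assoc m k v))))
```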

0 votes
by

Comment made by: michaelblume

Gah, I forgot to include coll in the varargs call to assoc!

Which reminds me that this patch needs tests.
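
A first cut at such a test might look like the following (a hypothetical sketch using clojure.test, not part of the patch):

```clojure
(ns clojure.test-clojure.assoc-unroll-test ; hypothetical test namespace
  (:require [clojure.test :refer [deftest is]]))

(deftest multi-pair-assoc-and-assoc!
  ;; unrolled and varargs arities should agree for persistent maps...
  (is (= {:a 1 :b 2 :c 3}
         (assoc {} :a 1 :b 2 :c 3)))
  ;; ...and for transients, exercising the varargs call that was
  ;; previously missing its coll argument
  (is (= {:a 1 :b 2 :c 3}
         (persistent! (assoc! (transient {}) :a 1 :b 2 :c 3)))))
```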

0 votes
by

Comment made by: michaelblume

OK, this has some fixes I made after examining the disassembled output. There's a change to the assoc!* macro to make sure it type-hints correctly -- I'm honestly not sure why it didn't type-hint properly before, but it does now. Also, I changed the call that rolls up the first six entries at the top of the varargs version from a macro call to a function call to assoc!, so it fits within the 251 inlineable bytecodes. (This, again, assumes I'm reading the output correctly.)
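
As an illustration of the kind of hinting involved (a hypothetical sketch, not the macro from the patch; it assumes the collection argument is a local symbol so metadata can be attached to it):

```clojure
(defmacro typed-assoc!
  "Hypothetical sketch: tag the collection symbol so the emitted call
   is a direct interface call rather than a reflective one."
  [coll k v]
  `(.assoc ~(vary-meta coll assoc :tag 'clojure.lang.ITransientAssociative) ~k ~v))

;; usage sketch
(let [t (transient {})]
  (persistent! (typed-assoc! t :a 1)))
```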

0 votes
by

Comment made by: tcrayford

Michael: Wanna push a branch with these patches to clojars or something? Then I can rerun the benchmarks with the exact code in the patches.

0 votes
by

Comment made by: michaelblume

Hmm, not sure I know how to do that -- here's a branch on GitHub, though: https://github.com/MichaelBlume/clojure/tree/unroll-assoc

0 votes
by

Comment made by: michaelblume

v5 marks the helper macros private.

0 votes
by

Comment made by: tcrayford

Michael: was that branch just based off clojure/clojure master? I tried running benchmarks off it, but ran into undefined var errors when building this code (which doesn't happen with alpha5):

```
(Retrieving com/yellerapp/clojure-unrolled-assoc/1.7.0-unrollassoc-SNAPSHOT/clojure-unrolled-assoc-1.7.0-unrollassoc-20150213.092242-1.pom from clojars)
(Retrieving com/yellerapp/clojure-unrolled-assoc/1.7.0-unrollassoc-SNAPSHOT/clojure-unrolled-assoc-1.7.0-unrollassoc-20150213.092242-1.jar from clojars)
(Retrieving org/clojure/clojure/1.3.0/clojure-1.3.0.jar from central)
Exception in thread "main" java.lang.RuntimeException: Unable to resolve symbol: bench in this context, compiling:(bench_assoc_unrolling.clj:5)
	at clojure.lang.Compiler.analyze(Compiler.java:6235)
	at clojure.lang.Compiler.analyze(Compiler.java:6177)
	at clojure.lang.Compiler$InvokeExpr.parse(Compiler.java:3452)
	at clojure.lang.Compiler.analyzeSeq(Compiler.java:6411)
	at clojure.lang.Compiler.analyze(Compiler.java:6216)
	at clojure.lang.Compiler.analyze(Compiler.java:6177)
	at clojure.lang.Compiler$BodyExpr$Parser.parse(Compiler.java:5572)
	at clojure.lang.Compiler$FnMethod.parse(Compiler.java:5008)
```

0 votes
by

Comment made by: michaelblume

Ok, how are you building? Why the strange clojure group?

0 votes
by

Comment made by: michaelblume

The existing version of assoc does a runtime check that an even number of varargs is passed in, but assoc! does not. Do we want to preserve this behavior, or do the check in both?
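
For context, the check in question lives in assoc's varargs arity; roughly (paraphrased from clojure/core.clj, under a hypothetical standalone name):

```clojure
(defn assoc-checked
  "Sketch of how assoc's varargs arity validates its arguments."
  ([map key val] (clojure.core/assoc map key val))
  ([map key val & kvs]
   (let [ret (clojure.core/assoc map key val)]
     (if kvs
       (if (next kvs)
         ;; still at least one full key/value pair left: keep going
         (recur ret (first kvs) (second kvs) (nnext kvs))
         ;; exactly one leftover argument: the caller passed an odd count
         (throw (IllegalArgumentException.
                 "assoc expects even number of arguments after map/vector, found odd number")))
       ret))))
```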

0 votes
by

Comment made by: michaelblume

Also, I'm curious how relevant inlining is here -- does HotSpot inlining actually work with Var invocation when there's a getRootBinding step in the way?

...