
-1 vote
in Collections

As discussed on the mailing list (link: 1), this patch adds unrolled variants of vectors and maps, with a special inner class for each cardinality. Currently both grow to six elements before spilling over into the general versions of the data structures; that threshold is based on rough testing and can easily be changed. At Rich's request, I haven't included any integration with the rest of the code, and there are top-level static create() methods for each.

The sole reason for this patch is performance, both in terms of creating data structures and performing operations on them. This can be seen as a more verbose version of the trick currently played with PersistentArrayMap spilling over into PersistentHashMap. Based on the benchmarks, which can be run by cloning cambrian-collections (link: 2) and running 'lein test :benchmark', this should supplant PersistentArrayMap. Performance is at least on par with PAM, and often much faster. Especially noteworthy is the creation time, which is 5x faster for maps of all sizes (lein test :only cambrian-collections.map-test/benchmark-construction), and on par for 3-vectors, but 20x faster for 5-vectors. There are similar benefits for hash and equality calculations, as well as calls to reduce().
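For readers unfamiliar with the existing trick, the spill-over is easy to see at a REPL: small maps are PersistentArrayMaps and promote to PersistentHashMaps once they grow past eight entries. The unrolled collections apply the same idea, just with dedicated classes per size below the threshold.

    ;; the existing spill-over this patch generalizes: a small map starts
    ;; as a PersistentArrayMap and becomes a PersistentHashMap once it
    ;; grows past eight entries
    (type {:a 1 :b 2})
    ;=> clojure.lang.PersistentArrayMap
    (type (into {} (zipmap (range 9) (range 9))))
    ;=> clojure.lang.PersistentHashMap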

This is a big patch (over 5k lines), and will be kind of a pain to review. My assumption of correctness is based on the use of collection-check, and the fact that the underlying approach is very simple. I'm happy to provide a high-level description of the approach taken, though, if that will help the review process.
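For reference, here is roughly how collection-check gets used, assuming its assert-vector-like entry point; `empty-unrolled-vector` below is a placeholder for an instance produced by one of the patch's static create() methods, not a name from the patch.

    ;; collection-check generates random sequences of operations (conj,
    ;; pop, assoc, transient/persistent!, ...) and checks each result
    ;; against a plain PersistentVector used as the reference
    (require '[collection-check :refer [assert-vector-like]]
             '[clojure.test.check.generators :as gen])

    ;; `empty-unrolled-vector` is a placeholder for a collection created
    ;; via one of the patch's static create() methods
    (assert-vector-like empty-unrolled-vector gen/int)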

I'm hoping to get this into 1.7, so please let me know if there's anything I can do to help accomplish that.

(link: 1) https://groups.google.com/forum/#!topic/clojure-dev/pDhYoELjrcs
(link: 2) https://github.com/ztellman/cambrian-collections

Patch: unrolled-vector-2.patch

Screener Notes: The approach is clear and understandable. Given the volume of generated code, I believe the best way to improve confidence in this code is to get people using it asap, and to add collection-check (link: 3) to the Clojure test suite. I would also like to get the generator (link: 4) included in the Clojure repo. We don't necessarily need to automate running it, but it would be nice to have it nearby if we want to tweak something later.

(link: 3) https://github.com/ztellman/collection-check/blob/master/src/collection_check.clj
(link: 4) https://github.com/ztellman/cambrian-collections/blob/master/generate/cambrian_collections/vector.clj

43 Answers

0 votes

Comment made by: ztellman

What is our reference for "relevant" performance? I (or anyone else) can provide microbenchmarks for calculating hashes or whatever else, but that doesn't prove that it's an important improvement. I previously provided benchmarks for JSON decoding in Cheshire, but that's just one of many possible real-world benchmarks. It might be useful to have an agreed-upon list of benchmarks that we can use when debating what is and isn't useful.

0 votes

Comment made by: mikera

I was interested in this implementation, so I created a branch that integrates Zach's unrolled vectors on top of Clojure 1.8.0-alpha2. I also simplified some of the code (I don't think the metadata handling or unrolled seqs are worthwhile, for example).

Github branch: https://github.com/mikera/clojure/tree/clj-1517

I then ran a set of micro-benchmarks created by Peter Taoussanis:

Results: https://gist.github.com/mikera/72a739c84dd52fa3b6d6

My findings from this testing:
- Performance is comparable (within +/- 20%) on the majority of tests
- Zach's approach is noticeably faster (by 70-150%) for 4 operations (reduce, mapv, into, equality)

My view is that these additional optimisations are worthwhile. In particular, I think that reduce and into are very important operations. I also care about mapv quite a lot for core.matrix (it's fundamental to many numerical operations on arrays implemented using Clojure vectors).

Happy to create a patch along these lines if it would be acceptable.

0 votes

Comment made by: ztellman

The reduce improvements are likely due to the unrolled reduce and kvreduce impls, but the others are probably because of the unrolled transient implementation. The extra code required to add these would be pretty modest.
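For illustration (the patch itself is generated Java), a Clojure deftype sketch of what a fully unrolled reduce looks like for a 3-element tuple:

    ;; straight-line reduce: no loop counter, no array indexing, with the
    ;; usual reduced? short-circuit checks at each step
    (deftype Tuple3 [a b c]
      clojure.lang.IReduceInit
      (reduce [_ f init]
        (let [acc (f init a)]
          (if (reduced? acc)
            @acc
            (let [acc (f acc b)]
              (if (reduced? acc)
                @acc
                (let [acc (f acc c)]
                  (if (reduced? acc) @acc acc))))))))

    ;; (reduce + 0 (->Tuple3 1 2 3)) ;=> 6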

0 votes

Comment made by: mikera

I actually condensed the code down to a single implementation for Transient and TupleSeq; I don't think these really need to be fully unrolled for each Tuple type. That helps by making the code even smaller (and probably also helps performance, given JVM inline caching, etc.).

0 votes

Comment made by: ptaoussanis

Hey folks,

Silly question: is there actually a particular set of reference benchmarks that everyone's currently using to test the work on tuples? It took me a while to notice how bad the variance was with my own set of micro benchmarks.

Bumping all the run counts up till the noise starts ~dying down, I'm actually seeing numbers now that don't seem to agree with others here (?).

Google Docs link: https://docs.google.com/spreadsheets/d/1QHY3lehVF-aKrlOwDQfyDO5SLkGeb_uaj85NZ7tnuL0/edit?usp=sharing
gist with the benchmarks: https://gist.github.com/ptaoussanis/0a294809bc9075b6b02d

Thanks, cheers! :-)

0 votes

Comment made by: ztellman

Hey Peter, I can't reproduce your results, and some of them are so far off what I'd expect that I have to think there was some data-gathering error. For instance, the assoc operation being slower is kind of inexplicable, considering the unrolled version doesn't do any copying, etc. Also, all of your numbers are significantly slower than the ones on my four-year-old laptop, which is also a bit strange.

Just to make sure that we're comparing apples to apples, I've adapted your benchmarks into something that pretty-prints the mean runtime and variance for 1.7, 1.8-alpha2, and Mike's 1517 fork. It can be found at https://github.com/ztellman/tuple-benchmark, and the results of a run at https://gist.github.com/ztellman/3701d965228fb9eda084.
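For anyone reproducing this, the harness is a thin wrapper over criterium, along these lines (the exact operations benchmarked in tuple-benchmark may differ):

    (require '[criterium.core :as cc])

    ;; run one op under criterium and print its mean runtime and variance,
    ;; as tuple-benchmark does for each vector operation
    (defn report [label result]
      (println label
               "mean (ns):" (* 1e9 (first (:mean result)))
               "variance:"  (first (:variance result))))

    (report "conj"  (cc/quick-benchmark (conj [1 2] 3) {}))
    (report "assoc" (cc/quick-benchmark (assoc [1 2 3] 0 :x) {}))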

0 votes

Comment made by: mikera

Hey Zach, I just looked at your benchmarks and they are definitely more consistent with what I am seeing. The overall nanosecond timings look about right based on my experience with similar code (e.g. working with small vectors in vectorz-clj).

0 votes

Comment made by: ptaoussanis

Hi Zach, thanks for that!

I've updated the results:
Gist: https://gist.github.com/ptaoussanis/0a294809bc9075b6b02d
Google docs: https://goo.gl/khgT83

Note that I've added an extra sheet/tab to the Google doc for your own numbers at https://gist.github.com/ztellman/3701d965228fb9eda084.

I'm still struggling to produce results that show any consistent, significant post-JIT benefit from either of the tuple implementations, whether in the micro-benchmarks or in one larger, small-vec-heavy system I had handy.

It looks to me like the JIT may already be optimising away most of the non-tuple inefficiencies in practice.

Of course it's very possible that my results are off, or my expectations wrong. The numbers have been difficult to pin down.

It definitely helped to have a standardised reference micro-benchmark to work against (https://github.com/ztellman/tuple-benchmark). Could I perhaps suggest a similar reference macro-benchmark (maybe something from core.matrix, Mike)?

It might also be a good idea to define a worthwhile target performance delta for ops like these that run in the nanosecond range (or for the larger reference macro-benchmark).

Just some thoughts from someone passing through in case they're at all useful; know many of you have been deeply involved in this for some time so please feel free to ignore any input that's not helpful :-)

Appreciate all the efforts, cheers!

0 votes

Comment made by: richhickey

I think everyone should back off on their enthusiasm for this approach. After much analysis, I am seeing definite negative impacts from tuples, especially with the multiple-class approach proposed by Zach. What happens in real code is that the many tuple classes cause call sites that see different-sized vectors to become megamorphic, and nothing gets adequately optimized. In particular, call sites that will see both tuple-sized and large vectors (i.e. a lot of library code) will optimize differently depending upon which they see often first. So, if you run your first tight loop on vector code that sees tuples, that code could later do much worse (50-100% worse) on large vectors than before the tuples patch was in place. Being much slower on large collections is a poor tradeoff for being slightly faster on small ones.
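To make the failure mode concrete, here is an illustrative repro shape (not Rich's actual test): warm a shared call site on several tuple sizes, then time it on a large vector.

    ;; a shared call site: under the multi-class patch, `v` may be Tuple2,
    ;; Tuple3, ..., or PersistentVector, so the interface call inside
    ;; reduce can go megamorphic once several classes have been seen here
    (defn sum [v]
      (reduce + 0 v))

    ;; warm the site on tuple-sized vectors first...
    (doseq [v [[1 2] [1 2 3] [1 2 3 4] [1 2 3 4 5]]]
      (dotimes [_ 1000000] (sum v)))

    ;; ...then time the same site on a large vector; per the analysis
    ;; above, this can come out 50-100% slower than if the site had only
    ;; ever seen PersistentVector
    (let [big (vec (range 1000000))]
      (time (dotimes [_ 100] (sum big))))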

Certainly, different tuple classes for arities 0-6 are a dead idea. You get as good or better optimization (at some cost in size) from a single class, e.g. one with four fields, covering sizes 0-4. I have a working version of this in a local branch. It is better in that sites that also see PersistentVectors are only bi-morphic, but I am still somewhat skittish given what I've seen.
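A Clojure deftype sketch of the single-class shape described above (the working version is a Java class in a local branch; the names and details here are illustrative, not Rich's code):

    ;; one class covers sizes 0-4: a count plus four fields, unused fields
    ;; left nil; every tuple-sized vector shares this single class, so a
    ;; call site that mixes it with PersistentVector stays bi-morphic
    (deftype Tuple [^long cnt a b c d]
      clojure.lang.Counted
      (count [_] cnt)
      clojure.lang.Indexed
      (nth [_ i]
        (if (< -1 i cnt)
          (case (int i) 0 a 1 b 2 c 3 d)
          (throw (IndexOutOfBoundsException. (str i)))))
      (nth [_ i not-found]
        (if (< -1 i cnt)
          (case (int i) 0 a 1 b 2 c 3 d)
          not-found)))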

The other takeaway is that micro-benchmarks are nearly worthless for detecting these issues.

0 votes

Comment made by: ztellman

I'm pretty sure that all of my real-world applications of the tuples (via clj-tuple) have been fixed cardinality, and wouldn't have surfaced any such issue. Thanks for putting the idea through its paces.

0 votes

Comment made by: mikera

Rich, these are good insights. Do you have a benchmark that you are using as representative of real-world code?

I agree that it is great if we can avoid call sites becoming megamorphic, though I also believe the ship has sailed on that one already when you consider the multiple types of IPersistentVector that already exist (MapEntry, PersistentVector, SubVector plus any library-defined IPersistentVector instances such as clojure.core.rrb-vector). As a consequence, the JVM is usually not going to be able to prove that a specific IPersistentVector interface invocation is monomorphic, which is when the really big optimisations happen.
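This point is easy to confirm on a stock REPL, without any patch applied:

    ;; several IPersistentVector implementations already flow through
    ;; vector-handling call sites in ordinary code
    (type [1 2 3])            ;=> clojure.lang.PersistentVector
    (type (subvec [1 2 3] 1)) ;=> clojure.lang.APersistentVector$SubVector
    (type (first {:a 1}))     ;=> clojure.lang.MapEntry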

In most of the real world code that I've been working with, the same size/type of vector gets used repeatedly (Examples: iterating over map entries, working with a sequence of size-N vectors), so in such cases we should be able to rely on the polymorphic inline cache doing the right thing.

The idea of a single Tuple class for sizes 0-4 is interesting, though I can't help thinking that a lot of the performance gain from this may stem from the fact that a lot of code does stuff like (reduce conj [] ...) or the transient equivalent, which is a particularly bad use case for Tuples, at least from the call-site-caching perspective. There may be a better way to optimise such cases than simply trying to make Tuples faster, e.g. calling asTransient() on a Tuple0 could perhaps switch straight into the PersistentVector implementation.
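To see why that pattern is pathological for per-size tuple classes, trace the accumulator's type as it grows:

    ;; under the multi-class patch the accumulator passes through every
    ;; size class in turn, so the conj call site would see Tuple0, Tuple1,
    ;; ..., and then PersistentVector; on stock Clojure every intermediate
    ;; value is a PersistentVector
    (map type (reductions conj [] (range 8)))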

0 votes

Comment made by: colinfleming

Here's a relevant issue from Google's Guava project, where they also found serious performance degradation with fixed-size collections: https://github.com/google/guava/issues/1268. It has a lot of interesting detail about how the performance breaks down.

0 votes
Reference: https://clojure.atlassian.net/browse/CLJ-1517 (reported by ztellman)
...