Welcome! Please see the About page for a little more info on how this works.

+4 votes
in Sequences by

After looking at some performance improvements around clojure.core/distinct, I discovered that Nikita (@tonsky) had made suggestions about this back at the end of 2016, along with a code submission at CLJ-2090.

At the time, Alex made a comment about waiting until clojure.core/contains? supported transient sets, and the issue was pushed back. I note that this has now been addressed. However, in trying a few different approaches, I actually get better performance not using a transient at all, but instead checking if a conj has modified the object. This is done using a non-atomic deref/vreset, which begs the questions… is accessing a transducer supposed to be considered thread-safe? There are approaches to mitigate this, but there isn't a need if that's not a guarantee.

I discuss some of the approaches and provide some benchmarking in this project on Github. The fastest approach actually uses the ITransientSet interface directly, which I got the impression Alex would like to avoid.

by
Might I suggest adding a benchmark for the existing `distinct` so it's easy to compare how much improvement each implementation provides.
by
Every test is a comparison to clojure.core/distinct, so I’m not sure what you mean, sorry.

The benchmarks get the time for an operation on the original distinct, and then then compares it to the variation. The % comparison is the percentage reduction from clojure.core/distinct.

Or are you asking for something that I’m misunderstanding?
by
I understand that they show the difference by comparison, but I think including the base numbers would be helpful.
by
They’re on the left of every line. Did you want to see them printed elsewhere as well?
by
Oh I'm sorry, I completely missed that those were the baseline numbers.
by
For anyone who may be following this, I expanded the benchmarking, separating out the laziness change from the transducer variations. I also added a new variation for completeness.

To be honest, the best performing functions surprised me. I did not expect the persistent set to perform as well as it did, and I also did not expect the transient variations to perform worse for large sequences. However, these both made sense when I considered the shape of the data (large sequences with a small percentage of unique values).

I was also surprised that the transient set approach that skips the call to `contains?` performed as well as it did. While it avoids some of the code path of an earlier function, I am surprised that it didn't slow down similarly to the `distinct-transient` function.

Please log in or register to answer this question.

...