

I have read that Clojure sequences "cache" their values as they are realized (here and here for example). This often seems to come up when discussing the relative merits of using transducers for transforming values rather than chaining sequence functions, as the latter approach creates "intermediate cached sequences".

However, I have also come across the advice "don’t hold on to your head" when dealing with sequences, because holding a reference to the beginning of the sequence prevents GC from freeing up memory used to hold previously seen values in the sequence.

At my current level of understanding these ideas appear to conflict. If sequences cache their results, then why does it matter whether you hold the head? Does caching not imply that all the realised elements are held onto anyway? Or looked at the other way, why would sequences cache their realised values if the GC is free to clean them up?

I know I must be missing something obvious here, but I would really appreciate an explanation.

1 Answer


A realized sequence is a chain of objects in memory, with each link in the chain pointing to a value and to the next link, until you eventually reach a function object representing the remaining unrealized portion of the sequence. Holding a reference to any link in the realized chain causes the rest of the chain from that point forward to be strongly held, so it cannot be gc'ed.
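A minimal sketch of such a chain, using a hand-rolled lazy generator (the name `nums-from` is mine, not from the answer):

```clojure
;; Each lazy-seq cell, once realized, holds its value and a pointer to the
;; next cell; the unrealized tail is just a thunk (function object).
(defn nums-from [n]
  (lazy-seq (cons n (nums-from (inc n)))))

(def s (nums-from 0))

;; Forcing the 3rd element realizes links 0, 1, and 2; because s points at
;; the head, all three realized links stay strongly reachable.
(first (drop 2 s))  ;; => 2
```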

If you are only pointing at the unrealized "end" of the chain, then all of the links "behind" you can be collected. When you "hold the head", that pointer is to the beginning of the chain, which forces all N links from the beginning up to the unrealized point to be held in memory.

For example:

(def r (repeat 100000000 "abcdef"))   ;; r holds a strong reference to the head
(count r)                             ;; realizes the whole seq while r pins every link

versus:

(count (repeat 100000000 "abcdef"))   ;; no retained reference to the head

which can walk the seq, allowing the links behind the counter to be gc'ed. (Note that range is the typical thing people use for these kinds of examples, but it has an optimized count for the common cases.)

Thanks Alex, that all makes sense. The bit I still don't quite understand is why seqs are said to be "cached". If there is no ref to the previously realised values (as in your second example), then in what scenario are the previously realised values reused (i.e. a cache hit)? Or is the term cache here not intended to imply reuse of the values, just that they are still in memory until a GC pass?
I believe the answer is that the caching here matters when you _do_ hold onto the head: each element must be calculated on the first traversal through the sequence, but if you then traverse it a second, third, etc. time while holding onto the head, no recalculation is done on the later traversals. The values are cached by consuming memory to avoid the need to recalculate them.
"Cache" here just means that the links in the chain retain the value computed at that point in the chain.
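A quick REPL sketch of that caching (the counter atom is my addition, just to make recomputation observable):

```clojure
;; Count how many times the mapping fn actually runs.
(def calls (atom 0))
(def s (map (fn [x] (swap! calls inc) x) [1 2 3]))

(doall s)   ;; first traversal: each element is computed
@calls      ;; => 3
(doall s)   ;; second traversal: the realized links are reused
@calls      ;; => 3, nothing recomputed
```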
Thank you both. So is it then fair to say that the potential performance concern with chained seq funcs is that the intermediate sequences can cause more frequent GC passes to clear the realised values (in the typical case where previous values are not referenced explicitly)?
Yes, nested sequence functions will create N layers of nodes (vs 1 layer for transducer reduction), which means N times as much garbage, and thus N times as much collection.
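For concreteness, a small sketch of the two styles (my example, not from the thread):

```clojure
;; Chained seq fns: (map inc ...) and (filter even? ...) each build an
;; intermediate lazy sequence that becomes garbage after the reduce.
(reduce + (filter even? (map inc (range 10))))           ;; => 30

;; Transducer version: one pass over the input, no intermediate
;; sequences allocated.
(transduce (comp (map inc) (filter even?)) + (range 10)) ;; => 30
```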