
0 votes
in Collections

I know that lazy sequences can be processed in a chunked manner. That is, if my program needs n elements from a lazy sequence, more than n may be realized.

Can I assume that if I run the same code a second time in the same version of Clojure, exactly the same elements will get realized? I would think so, but I don't want to assume that without knowing.
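For context, chunked realization is easy to see at the REPL. This is a sketch: the chunk size of 32 is an implementation detail of chunked seqs like those produced by `range`, not a documented guarantee.

```clojure
;; Chunking demo: `range` produces a chunked seq, so mapping a
;; side-effecting function over it realizes a whole chunk at a time.
(def realized (atom 0))

(def xs (map (fn [i] (swap! realized inc) i) (range 100)))

(first xs)  ;; ask for one element...
@realized   ;; ...and typically 32 have been realized (one chunk)
```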

(I'm including additional info here in case anyone is interested in why this matters, but I don't think it's necessary to know my use case in order to answer the question. So feel free to skip this paragraph. I'm running simulations that make use of a pseudorandom number generator. These involve random walks that are truncated if the walk encounters a "target". It's convenient to generate a walk in advance as a lazy sequence, and then stop realizing its elements when a target is found. Normally, I throw away intermediate results such as the original random walk. They exist in a data structure at one point, but I don't use them in the end. However, sometimes I want to be able to go back and rerun a simulation in order to examine the intermediate results. It's easy enough to store the seed that I used for a simulation run, and use it to set the random number generator to start in the same state. However, chunking can cause the random number generator to be called more times than are strictly necessary. That's OK, but if chunking can vary depending on something other than my code and the Clojure version, then rerunning a simulation with the same seed would not be guaranteed to produce the same results. This is not at present a big deal. I can make the intermediate data non-lazy without much of a performance hit. However, I'm still wondering whether that's necessary.)

2 Answers

+1 vote
Best answer

You should not assume anything about when lazy sequences are realized, so no, that's not guaranteed.

If you want control over generation, use `loop`/`recur` or `reduce` with `reduced`, or other such techniques to get that control.
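For instance, early termination with `reduce` and `reduced` realizes exactly the steps you consume, with no chunking. This is a sketch; `step` and `target?` are hypothetical stand-ins for the walk's transition function and stopping test:

```clojure
;; Eagerly generate walk states until a target is hit: each
;; iteration calls `step` exactly once, with no chunked overdraw.
(defn run-walk [step target? init max-steps]
  (reduce (fn [acc x]
            (if (target? x)
              (reduced (conj acc x))  ;; stop as soon as a target is found
              (conj acc x)))
          []
          (take max-steps (iterate step init))))

(run-walk inc #(>= % 5) 0 100)  ;; => [0 1 2 3 4 5]
```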

OK!  Will do.  Thanks.
+1 vote

This doesn't make sense to me:

> that could mean that rerunning a simulation with the same seed would not be guaranteed to produce the same results

The number of times the function is evaluated seems irrelevant (except, e.g., for performance [say, a costly function computation] or other cases like entangled side effects). Since you are drawing results from the sequence and halting at some criterion, the process that generates these results should be reproducible with the same inputs. If that iterative process relies on its own PRNG initialized with a specific seed value, and the state transition function depends on an immutable context and the PRNG's current state (i.e. "functionally pure" except for the benign side effects of a stateful PRNG), then I don't see how you could create a different sequence of states.

If the chunk generation decides to run 22 more times than necessary, it does not affect the preceding 10. Those 10 values (say the 10th of which is sufficient for the halting criterion when consuming the seq) are cached and completely independent of any successive values, so the immutability of the sequence is preserved. So long as you are isolating a PRNG for each generated sequence, I believe you should have complete independence and reproducibility (when controlling for factors like PRNG implementation, runtime, Clojure version, etc.).
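The caching described here can be seen directly at the REPL. A sketch (the exact overdraw count is an implementation detail):

```clojure
;; Realized lazy-seq values are cached: re-consuming the same seq
;; does not re-invoke the generating function.
(def calls (atom 0))

(def walk (map (fn [i] (swap! calls inc) i) (range 1000)))

(doall (take 10 walk))    ;; consume 10 elements; a whole chunk is realized
(let [after-first @calls]
  (doall (take 10 walk))  ;; same 10 elements again...
  (= after-first @calls)) ;; => true: no new realizations, values were cached
```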

I have done discrete event simulation work (primarily deterministic, although with the occasional stochastic initial conditions) for over a decade. Reproducibility, particularly for comparative verification and, as you mentioned, fine-grained inspection of the evolution of the system (e.g. intermediate states), is very important. If you control the seed, the PRNG instance, and the other factors mentioned earlier, this leads to a reproducible history.

As mentioned, if you are still worried, you can always retain familiarity, performance, and control by using iterate and transduce/reduce/into to define an eager iterative process that won't chunk (or generate intermediate seqs). I typically go this route these days for performance reasons more than anything else, although my simulations tend to be smaller in cardinality of the "frames" generated and more substantial in the state transition function complexity.
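A sketch of that eager style, again assuming hypothetical `step` and `target?` functions:

```clojure
;; Eager, chunk-free generation: `into` with a transducer consumes
;; the `iterate` source one step at a time (it reduces the Iterate
;; object directly, so no intermediate seqs are created).
(defn run-sim [step target? init]
  (into []
        (take-while (complement target?))
        (iterate step init)))
```

Note that `take-while` keeps states up to, but not including, the first target; use the `halt-when` transducer with a completing function, or adjust the predicate, if the target state itself should be retained.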

Thanks @Tom.  I think what I didn't make clear is that I want to run a series of separate simulations (with different parameters) without reinitializing the PRNG.  (Not reinitializing it is a good way to keep the new numbers independent of the preceding numbers, i.e. to the extent that any PRNG can be trusted to do that, which is a different topic.)

Suppose I initialize a PRNG with a seed.  Then I do a simulation run that generates a lazy sequence steps1, and do something equivalent to 'take 10000', let's say.  There is one PRNG call per step, so I've now advanced the PRNG through 10000+k states, where k is the extra processing due to chunking.  Then, using the same PRNG without reseeding it, I do a different simulation run from that point, which creates a lazy sequence steps2, and run 'take 10000' on it.  I then throw away steps2, but later I want to examine it more closely, so I rerun the first simulation (or maybe just step the PRNG 10000+k times, if I knew what k was), and then run the second simulation again to recreate steps2.

But according to @alexmiller, I should not assume that k will always be the same.  If k could be different, then even running the first simulation before the second would not guarantee that I was actually recreating steps2 as it was before.

If I don't use laziness, then I would know exactly how many times the PRNG was called to create steps1--10000 in this example--so to recreate steps2, I only need to give the PRNG the known seed and then throw away the first 10000 numbers, which doesn't take much time. Then I can perform the second simulation to create steps2.  (That was probably more detail than needed.)
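The fast-forward step is cheap. A sketch, using `java.util.Random` as a stand-in for Well19937c (the names and counts are illustrative):

```clojure
(import 'java.util.Random)

;; Recreate steps2 without storing it: reseed, burn the known number
;; of draws used by the first (eager) run, then rerun the second run.
(defn skip-draws! [^Random rng n]
  (dotimes [_ n] (.nextDouble rng))
  rng)

(def rng (skip-draws! (Random. 42) 10000))
;; `rng` is now in the state it had at the end of the first run.
```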

An alternative would be to read out the state of the PRNG and store it, and then use that to initialize it if I need to recreate a run.  I'm using a PRNG (Well19937c) that has a larger internal state than the long seeds I use to initialize it.

(I took out the laziness last night.  I think I might be seeing a small performance benefit as a result, but most of the processing time is in routines that run on each step regardless, so the difference is small.)
Yeah, I see now.  You are creating an implicit dependency in the PRNG state by drawing from it as a stream, with the expectation that your PRNG will be deterministically affected by chunking overdraw.  I don't see the particular advantage in this (excepting some more useful distribution of pseudorandom values) vs. computing a new seed and initializing between runs.  I guess you only have one seed to remember for an arbitrary number of derived sequences.  Under this methodology, though, in order to discover what happens after the nth sequence, say steps_n, you now have to run through all of the predecessors to arrive at the starting point of the sequence you're interested in.  I would probably just compute or derive a seed at the terminus of each sequence to use as the initializer for the next, which saves you from having to capture the internal state.  You could introduce some implicit dependency between seq generations, but you are building an implicit distribution of seeds over the PRNG space and sampling from it as a function of random walks... I don't think the quality of the sampling will be all that biased, but ymmv.
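A sketch of the seed-chaining idea, with `java.util.Random` standing in for the real generator:

```clojure
(import 'java.util.Random)

;; At the terminus of each run, draw one long from that run's PRNG
;; and use it as the next run's seed. Replaying run n then requires
;; only seed n, not re-running all of its predecessors.
(defn next-seed [^Random rng] (.nextLong rng))

(let [seed1 42
      rng1  (Random. seed1)]
  ;; ... run simulation 1 with rng1 ...
  (let [seed2 (next-seed rng1)
        rng2  (Random. seed2)]
    ;; ... run simulation 2 with rng2; store seed2 to replay it later ...
    seed2))
```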

I think another option (to preserve and maximize laziness) is to unchunk per Stuart's answer https://stackoverflow.com/a/3409568 .  Just turn off chunking :)
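The technique in the linked answer rebuilds the seq one cons cell at a time. A common formulation looks like this (a sketch of the idea, not a quote of that answer):

```clojure
;; Wrap a (possibly chunked) seq so that consumers realize exactly
;; one element per step: each cell is produced by its own lazy-seq.
(defn unchunk [s]
  (lazy-seq
    (when (seq s)
      (cons (first s)
            (unchunk (rest s))))))
```

The wrapper must go between the chunked source and the expensive computation, e.g. `(map f (unchunk (range n)))`, so that `f` is only called per element consumed.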
I am being more careful than necessary to avoid possible correlations between runs, but I'd rather err on that side.  I agree that generating a new seed each time, for example by using the same PRNG, is reasonable.  (I would rather not generate multiple seeds some other way, e.g. from system time and the like, for fear that if seeds were generated too close together, I could accidentally end up with the same seed.)

Notice that generating a next seed from the PRNG as a long reduces a roughly 19937-bit state (624 ints, for a WELL19937c) to a 64-bit seed between runs.  It shouldn't make a difference in practice, but still, continuing from the previous full state would be better if it did.

I'm influenced somewhat by L'Ecuyer et al.'s "Random numbers for parallel computers".  
I'm not running simulations in parallel, but some of the principles involved are relevant.  But again, I'm probably overdoing it.

I also throw away a couple thousand numbers after I seed the PRNG, because if you're really unlucky, a WELL generator can start in a state that is less random, for a while.  But I don't think the WELL19937c should have this problem.  (https://www.iro.umontreal.ca/~lecuyer/myftp/papers/wellrng.pdf )

Unchunking: Wow, that's interesting.  Thanks.
> Notice that generating a next seed from the PRNG as a long reduces a roughly 19937-bit state (624 ints, for a WELL19937c) to a 64-bit seed between runs.

Interesting.  It looks like the Apache Commons Math API implementation allows you to fetch the int array and seed with it if you want to persist state forward, so storing int arrays as your seed instead of downgrading into long space could be viable to prevent loss of precision.  Thanks for the notes.
Thanks.  Yeah, I started wondering if I could do that after I generated a long sequence of experiments with one seed.  I think I would have to cycle the PRNG a few hundred thousand times to rerun one of the later sets of experiments.

I don't yet see a method for reading the state.  What class were you looking at?  I might be missing a relevant interface or something in the class hierarchy.  Did you mean the `v` "bytes pool" field in `AbstractWell`?  I'll have to check the source, but that sounds like it might be what I'd need.  Or are you looking at Apache Commons Math 4?  That seemed still under construction, so I've been using Math 3.6.1, which has a very different organization.  (Please feel free not to answer--it's not your job to look up stuff in Apache Commons for me.)
getStateInternal() is in the development versions of the Apache Commons PRNGs.