Welcome! Please see the About page for a little more info on how this works.

0 votes
in core.async by

This is my first patch in a core Clojure project. I will try to be as clear and verbose in the problem statement,
the solution the patch brings and the value proposition.

h3. Problem:

The {{timeout}} function returns a channel that maybe unsafe to call {{close!}} on. It's because in an application that creates
several timeout channels very closely in time, the newly created timeout channels will in fact be the previously created channels
if the call to `timeout` happens less than {{clojure.core.async.impl.timers/TIMEOUT_RESOLUTION_MS}} after the previous ones (currently 10 ms).

It means that manually closing a channel causes code that creates new ones shortly after will in fact retrieves a closed channel.
Taking from it, using {{alt!}} etc hence exhibits unwanted behavior.

It happened to us when we tried to create a safer take in the context of a high traffic website (10+ millions page views/day), which uses timeout channels
against normal channels for calling external services and ensures that no go block is indefinitely waiting for a channel to clause.

Here's the code, to give more context to the discussion:

(defmacro <?
  "Safer <! (take), that times out to avoid waiting infinitely"
  `(let [timeout-ch# (async/timeout 120000) ;; 2 minutes
         val# (as/alt!
                timeout-ch# :timeout
                ~port ([v#] v#))]
       (instance? Throwable val#) (throw val#)
       (= val# :timeout) (throw (ex-info "A timeout occur in a channel async" {}))
       :else val#)))

Using timeout channels (with long timeouts of 2 minutes) without closing them solved our core problem (too many TCP sockets in CLOSE_WAIT state)
but immediately caused the application to use approximately 10%+ CPU and way more memory than before.

After analyzing the problem, we concluded that leaving the timeout channels as is resulted in them continuously executing the code
that checks if their time had passed, even if the "safe take" go block had returned. And most of the time, it's not the external
services that time out or fail, they just work and respond fast.

So for us, closing (with {{close!}}) the timeout channels immediately after the {{alt!}} failed because of the {{TIMEOUT_RESOLUTION_MS}} sharing.
After the first {{<?}}, subsequent calls to {{<?}} that happens shortly after would fails since previous timeout channels were reused and already closed.
In the context of a web application it's very common to do several requests to external services (and coordinates them with core.async)
before being able to send the response. I believe many other kind of applications would behave the same way.

h3. Solution:

Our solution consists in creating another timeout function that:
- doesn't try to reuse channels created in the {{TIMEOUT_RESOLUTION_MS}} window
- always returns a fresh channel
- doesn't add the newly created channel to the {{^ConcurrentSkipListMap timeouts-map}} to avoid collaboration with other normal timeout channels.

Current name is {{closable-timeout}}, to be validated. It could be {{safe-timeout}} or something else.

Please note that we already implemented and deployed the solution to production. We measured precisely the impact and are
confident that this is potentially a good patch that the wider Clojure community would benefit from. What we measured:

- When using the normal timeout channels (without closing them manually), memory usage (committed heap) reached our maximum level very quickly (9GB)
on all our 4 machines and stabilized at this level. Our understand is that it's because each of our requests would create
on average 4 timeout channels that would be garbage collected at least 2 minutes after.
During this 2 minutes window we would get lots of requests, so lot of garbage would be be created before being collected.
The garbage collector (G1) would use on average 3-4% CPU time working (young generation).

- When using our custom timeout channels, memory usage (committed heap) immediately returned to normal (2-5GB) depending on the machine
and stabilized at this level. We believe that it's because garbage channels are collected much faster so they don't pollute the heap
and create much less data to collect for the GC. CPU time spent in collection decreased back to our normal levels of operation (less than 0.5%).
Our understanding of G1 is that one of its specialty is doing much more frequent garbage collections of small amount of data,
and doing so in a much more efficient way than previous GC. Being able to close timeout channels manually allowed our application
to align with G1 behaviour closely.

The patch includes a test that demonstrates the problem, as well as another test that just use instead the new {{closable-timeout}} channel
that doesn't exhibit the problem.

h3. Value proposition:

Why submitting a patch while it can be done in user-space? Because to implement this functionality we had to rely on the
{{clojure.core.async.impl.timers}} namespace which should be an implementation detail, and also had to add the channel
to the {{^DelayQueue timeouts-queue}} from the same namespace by using the {{@#'}} trick (like this {{(def ^DelayQueue async-timeouts-queue @#'clojure.core.async.impl.timers/timeouts-queue)}}).
Hence any refactoring of core.async could break our code or anyone doing the same thing.

h3. Things to consider:

- We haven't tested with other garbage collectors, nor on the GraalVM.
- We didn't change at all the current timeout implementation, understand that this {{TIMEOUT_RESOLUTION_MS}} trick is a performance optimization.
- We added a note in the docstring of the normal timeout about the fact that normal timeout channels are unsafe to close manually.
- We wonder if it would be better, for normal timeout channels to throw an Exception when calling {{close!}} on them since it's unsafe.
- I haven't given lots of thoughts about CLJS core.async, since this repo contains only Clojure code but both CLJ and CLJS tests (no idea how the thing works...)

PS: I have no idea if I should have run some script from the {{script}} folder.
PS2: Not sure if this is an enhancement or a defect, but as it bit us hard I chose defect.
PS3: I'm open to all comments, no string attached.

1 Answer

0 votes
Reference: https://clojure.atlassian.net/browse/ASYNC-225 (reported by alex+import)