There is a PermGen memory leak that we have tracked down to protocol methods and multimethods called inside an {{eval}}, because of the caches these methods use. The problem only arises when the value being cached is an instance of a class (such as a function or reify) that was defined inside the {{eval}}. Thus extending {{IFn}} or dispatching a multimethod on an {{IFn}} are likely triggers.

*Reproducing:* The easiest way that I have found to test this is to set "{{-XX:MaxPermSize}}" to a small value (the project below uses 32M) so you don't have to wait too long for the PermGen space to fill up, and to use "{{-XX:+TraceClassLoading}}" and "{{-XX:+TraceClassUnloading}}" to see the classes being loaded and unloaded.

{code:title=leiningen project.clj}
(defproject permgen-scratch "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.0-RC1"]]
  :jvm-opts ["-XX:MaxPermSize=32M"
             "-XX:+TraceClassLoading"
             "-XX:+TraceClassUnloading"])
{code}

You can start the project with {{lein swank 45678}} and connect to it with SLIME in Emacs via {{M-x slime-connect}}.

To monitor the PermGen usage, you can find the Java process to watch with "{{jps -lmvV}}" and then run "{{jstat -gcold +_<PROCESS_ID>_+ 1s}}". According to [the jstat docs|http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstat.html#gcold_option], the first column (PC) is the "Current permanent space capacity (KB)" and the second column (PU) is the "Permanent space utilization (KB)". VisualVM is also a nice tool for monitoring this.
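
PermGen usage can also be read from inside the running REPL. The following is a minimal sketch that is not part of the original report: it uses the standard {{java.lang.management}} API, and the helper name {{class-metadata-pools}} is made up for illustration. Pool names differ between JVMs and collectors (for example "PS Perm Gen" on some Java 7 configurations, "Metaspace" on Java 8 and later), so the name match below is an assumption that may need adjusting.

```clojure
(import '(java.lang.management ManagementFactory MemoryPoolMXBean))

(defn class-metadata-pools
  "Returns a seq of {:name :used-kb :max-kb} maps for the memory pools whose
  name mentions Perm Gen or Metaspace. The name pattern is an assumption;
  adjust it for your JVM."
  []
  (for [^MemoryPoolMXBean pool (ManagementFactory/getMemoryPoolMXBeans)
        :let [pool-name (.getName pool)
              usage     (.getUsage pool)]
        ;; getUsage returns null for a pool that is no longer valid
        :when (and usage (re-find #"(?i)perm gen|metaspace" pool-name))]
    {:name    pool-name
     :used-kb (quot (.getUsed usage) 1024)
     ;; getMax returns -1 when no maximum is defined
     :max-kb  (let [m (.getMax usage)]
                (when (pos? m) (quot m 1024)))}))
```

Calling {{(class-metadata-pools)}} periodically while the loops below are running should show the same trend as the PU column in jstat.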

h2. Multimethod leak

Evaluating the following code will run a loop that repeatedly evals {{(take* (fn foo []))}}.

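{code:title=multimethod leak}
;; Dispatch on the type of the argument; each eval'd fn has its own class.
(defmulti take* (fn [a] (type a)))

(defmethod take* clojure.lang.Fn
  [a]
  '())

(def stop (atom false))            ; set to true to end the loop
(def sleep-duration (atom 1000))   ; milliseconds between evals

(defn run-loop []
  (when-not @stop
    ;; Each eval compiles a fresh class for foo and dispatches on it.
    (eval '(take* (fn foo [])))
    (Thread/sleep @sleep-duration)
    (recur)))

(future (run-loop))

;; Remove the delay so the PermGen space fills up quickly.
(reset! sleep-duration 0)
{code}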

In the {{lein swank}} session, you will see many lines like the ones below, listing the classes being created and loaded.

{code}
[Loaded user$eval15802$foo__15803 from __JVM_DefineClass__]
[Loaded user$eval15802 from __JVM_DefineClass__]
{code}

These lines will stop once the PermGen space fills up.

In the jstat monitoring, you'll see the amount of used PermGen space (PU) increase to the max and stay there.

{code}
     PC       PU        OC          OU       YGC    FGC    FGCT     GCT
 31616.0  31552.7    365952.0         0.0      4     0    0.000    0.129
 32000.0  31914.0    365952.0         0.0      4     0    0.000    0.129
 32768.0  32635.5    365952.0         0.0      4     0    0.000    0.129
 32768.0  32767.6    365952.0      1872.0      5     1    0.000    0.177
 32768.0  32108.2    291008.0     23681.8      6     2    0.827    1.006
 32768.0  32470.4    291008.0     23681.8      6     2    0.827    1.006
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258
{code}

*Workaround:* A workaround is to run {{prefer-method}} before the PermGen space is all used up, e.g.

{code}
(prefer-method take* clojure.lang.Fn java.lang.Object)
{code}


Then, when the used PermGen space is close to the max, you will see in the {{lein swank}} session that the classes created by the evals are being unloaded.

{code}
[Unloading class user$eval5950$foo__5951]
[Unloading class user$eval3814]
[Unloading class user$eval2902$foo__2903]
[Unloading class user$eval13414]
{code}


In the jstat monitoring, there will be a long pause while the used PermGen space stays close to the max. It then drops and starts increasing again as more evals occur.

{code}
     PC       PU        OC          OU       YGC    FGC    FGCT     GCT
 32768.0  32767.9    159680.0     24573.4      6     2    0.167    0.391
 32768.0  32767.9    159680.0     24573.4      6     2    0.167    0.391
 32768.0  17891.3    283776.0     17243.9      6     2   50.589   50.813
 32768.0  18254.2    283776.0     17243.9      6     2   50.589   50.813
{code}


The {{defmulti}} defines a cache that uses the dispatch values as keys. Each eval call in the loop defines a new foo class which is then added to the cache when {{take*}} is called, preventing the class from ever being GCed.

The prefer-method workaround works because it calls {{clojure.lang.MultiFn.preferMethod}}, which calls the private {{MultiFn.resetCache}} method, which completely empties the cache.
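
The cache growth and reset described above can be observed directly. This sketch is an illustration rather than part of the original report. It assumes that {{clojure.lang.MultiFn}} keeps its cache in a non-public field named {{methodCache}}, which is an implementation detail that may differ between Clojure versions, and {{method-cache-size}} is a hypothetical helper.

```clojure
(defmulti take* (fn [a] (type a)))

(defmethod take* clojure.lang.Fn
  [a]
  '())

(defn method-cache-size
  "Counts the entries in a multimethod's dispatch cache via reflection.
  Assumes a MultiFn field named methodCache (an implementation detail)."
  [^clojure.lang.MultiFn mf]
  (let [field (doto (.getDeclaredField clojure.lang.MultiFn "methodCache")
                (.setAccessible true))]
    (count (.get field mf))))

(def before (method-cache-size take*))

;; Each eval defines a new foo class, which becomes a new dispatch value.
(dotimes [_ 5]
  (eval '(take* (fn foo []))))

(def after-evals (method-cache-size take*))

;; prefer-method resets the cache as a side effect.
(prefer-method take* clojure.lang.Fn java.lang.Object)

(def after-reset (method-cache-size take*))
```

On an affected version, {{after-evals}} should exceed {{before}} by one entry per eval, and {{after-reset}} should be back at the {{before}} value.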

h2. Protocol leak

The leak with protocol methods similarly involves a cache. You see essentially the same behavior as the multimethod leak if you run the following code using protocols.

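{code:title=protocol leak}
(defprotocol ITake (take* [a]))

;; Every fn class, including those created by eval, implements clojure.lang.Fn.
(extend-type clojure.lang.Fn
  ITake
  (take* [this] '()))

(def stop (atom false))            ; set to true to end the loop
(def sleep-duration (atom 1000))   ; milliseconds between evals

(defn run-loop []
  (when-not @stop
    ;; Each eval compiles a fresh class for foo and calls take* on it.
    (eval '(take* (fn foo [])))
    (Thread/sleep @sleep-duration)
    (recur)))

(future (run-loop))

;; Remove the delay so the PermGen space fills up quickly.
(reset! sleep-duration 0)
{code}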

Again, the cache is in the {{take*}} method itself, using each new {{foo}} class as a key.

*Workaround:* A workaround is to run {{-reset-methods}} on the protocol before the PermGen space is all used up, e.g.

{code}
(-reset-methods ITake)
{code}


This works because {{-reset-methods}} replaces the cache with an empty MethodImplCache.

*Patch:* protocol_multifn_weak_ref_cache.diff

*Screened by:*

23 Answers

Comment made by: chouser@n01se.net

I think the most obvious solution would be to constrain the size of the cache. Adding an item to the cache is already not the fastest path, so a bit more work could be done to prevent the cache from growing indefinitely large.

That does raise the question of what criteria to use. Keep the first n entries? Keep the n most recently used (which would require bookkeeping in the fast cache-hit path)? Keep the n most recently added?
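
As an illustration of the last option (keep the n most recently added), here is a minimal sketch built from a persistent map and a PersistentQueue. It is not taken from any patch on this ticket; the names bounded-cache, cache-lookup and cache-add are hypothetical. Lookups do not touch the queue, so the cache-hit path carries no extra bookkeeping.

```clojure
(defn bounded-cache
  "Returns an empty cache value that holds at most `limit` entries."
  [limit]
  {:limit   limit
   :entries {}
   :order   clojure.lang.PersistentQueue/EMPTY})

(defn cache-lookup
  "Returns the cached value for k, or nil. Does not update any ordering."
  [cache k]
  (get (:entries cache) k))

(defn cache-add
  "Adds k -> v. If this pushes the cache over its limit, evicts the key
  that was added earliest."
  [{:keys [limit entries order] :as cache} k v]
  (if (contains? entries k)
    ;; Existing key: replace the value, keep its position in the queue.
    (assoc-in cache [:entries k] v)
    (let [entries (assoc entries k v)
          order   (conj order k)]
      (if (> (count entries) limit)
        (assoc cache
               :entries (dissoc entries (peek order))
               :order   (pop order))
        (assoc cache :entries entries :order order)))))

;; Usage: with a limit of 2, adding a third key evicts the first.
(def example-cache
  (reduce (fn [c k] (cache-add c k (name k)))
          (bounded-cache 2)
          [:a :b :c]))

(cache-lookup example-cache :a) ; => nil
(cache-lookup example-cache :c) ; => "c"
```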

Comment made by: jsmorph

At a minimum, perhaps a switch to disable the caches -- with obvious performance impact caveats.

Seems like expensive LRU logic is probably the way to go, but maybe don't have it kick in fully until some threshold is crossed.

Comment made by: alexmiller

A report of this being seen in production, from the mailing list:
https://groups.google.com/forum/#!topic/clojure/_n3HipchjCc

Comment made by: adrianm

So this is why we've been running into PermGen space exceptions! This is a fairly critical bug for us - I'm making extensive use of multimethods in our codebase and this exception will creep in at runtime randomly.

Comment made by: hiredman

It might be better to split this into two issues. At a very abstract level the two issues are the "same", but concretely they are distinct (protocols don't really share code paths with multimethods), and keeping them together in one issue seems like a recipe for a large, hard-to-read patch.

Comment made by: hiredman

naive-lru-method-cache-for-multimethods.diff replaces the methodCache in multimethods with a very naive lru cache built on PersistentHashMap and PersistentQueue

Comment made by: hiredman

naive-lru-for-multimethods-and-protocols.diff creates a new class clojure.lang.LRUCache that provides an lru cache built using PHashMap and PQueue behind an IPMap interface.

changes MultiFn to use an LRUCache for its method cache.

changes expand-method-impl-cache to use an LRUCache for MethodImplCache's map case

Comment made by: hiredman

I suspect my patch naive-lru-for-multimethods-and-protocols.diff is just wrong: unless MethodImplCache really is being used as a cache, we can't just toss out entries when it gets full.

Looking at the deftype code again, it does look like MethodImplCache is being used as a cache, so maybe the patch is fine.

if I am sure of anything it is that I am unsure so hopefully someone who is sure can chime in

Comment made by: bronsa

I haven't looked at your patch, but I can confirm that the MethodImplCache in the protocol function is just being used as a cache

Comment made by: killme2008

I developed a new patch that converts the methodCache in MultiFn to use a WeakReference for the dispatch value, and clears the cache if necessary.

I've tested it with the code in the ticket, and it looks fine. The classes are unloaded when the PermGen space is almost all used up.
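
To illustrate the general idea of a cache that does not keep its keys alive, here is a minimal sketch. It is not the code from the patch: it swaps in java.util.WeakHashMap behind Collections/synchronizedMap instead of WeakReference-wrapped keys in MultiFn, and the names weak-key-cache and lookup-or-add! are hypothetical. It only helps when the cached value does not itself hold a strong reference to the key.

```clojure
(import '(java.util Collections Map WeakHashMap))

(defn weak-key-cache
  "Returns a synchronized map that holds its keys only weakly, so an entry
  can be dropped once nothing else references the key."
  []
  (Collections/synchronizedMap (WeakHashMap.)))

(defn lookup-or-add!
  "Returns the cached value for k, computing and storing it with
  (compute k) on a miss."
  [^Map cache k compute]
  ;; The synchronized wrapper locks on itself, so locking on it here makes
  ;; the check-then-put sequence atomic.
  (locking cache
    (if (.containsKey cache k)
      (.get cache k)
      (let [v (compute k)]
        (.put cache k v)
        v))))

;; Usage: the entry stays available while the key is strongly reachable.
(def example-cache (weak-key-cache))
(def example-key (Object.))

(lookup-or-add! example-cache example-key (fn [_] :impl))
```

When an entry actually disappears depends on when the garbage collector runs, which is the coupling to GC behavior that comes with this approach.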

Comment made by: alexmiller

I don't know which to evaluate here. Does multifn_weak_method_cache.diff supersede naive-lru-for-multimethods-and-protocols.diff or are these alternate approaches both under consideration?

Comment made by: hiredman

The most straightforward thing, I think, is to consider them as alternatives. I am not a huge fan of weak refs, but of course without weak refs we have to pick some bounding size for the cache, and the cache holds a strong reference that could prevent a GC, so there are trade-offs. My reason for staying away from weak refs in general is that using them ties the behavior of whatever you are building to the behavior of the GC pretty strongly. That may be considered a matter of personal taste.

Comment made by: jafingerhut

All patches dated Aug 8 2014 and earlier no longer applied cleanly to latest master after some commits were made to Clojure on Aug 29, 2014. They did apply cleanly before that day.

I have not checked how easy or difficult it might be to update the patches.

Comment made by: hiredman

I've updated naive-lru-for-multimethods-and-protocols.diff to apply to the current master

Comment made by: jafingerhut

Thanks, Kevin. While JIRA allows multiple attachments to a ticket with the same filename but different contents, that can be confusing for people looking for a particular patch, and for a program I have that evaluates patches for things like whether they apply and build cleanly. Would you mind removing the older one, or in some other way making all the names unique?

...