
0 votes
in Protocols by
There is a PermGen memory leak that we have tracked down to protocol methods and multimethods called inside an {{eval}}, because of the caches these methods use. The problem only arises when the value being cached is an instance of a class (such as a function or reify) that was defined inside the {{eval}}. Thus extending {{IFn}} or dispatching a multimethod on an {{IFn}} are likely triggers.

*Reproducing:* The easiest way I have found to test this is to set {{-XX:MaxPermSize}} to a reasonably small value so you don't have to wait too long for the PermGen space to fill up, and to use {{-XX:+TraceClassLoading}} and {{-XX:+TraceClassUnloading}} to see the classes being loaded and unloaded.

{code:title=leiningen project.clj}
(defproject permgen-scratch "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.0-RC1"]]
  :jvm-opts ["-XX:MaxPermSize=32M"
             "-XX:+TraceClassLoading"
             "-XX:+TraceClassUnloading"])


You can run {{lein swank 45678}} and connect with SLIME in Emacs via {{M-x slime-connect}}.

To monitor the PermGen usage, you can find the Java process to watch with {{jps -lmvV}} and then run {{jstat -gcold <PROCESS_ID> 1s}}. According to [the jstat docs|http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstat.html#gcold_option], the first column (PC) is the "Current permanent space capacity (KB)" and the second column (PU) is the "Permanent space utilization (KB)". VisualVM is also a nice tool for monitoring this.

h2. Multimethod leak

Evaluating the following code will run a loop that eval's {{(take* (fn foo []))}}.

{code:title=multimethod leak}
(defmulti take* (fn [a] (type a)))

(defmethod take* clojure.lang.Fn
  [a]
  '())

(def stop (atom false))
(def sleep-duration (atom 1000))

(defn run-loop []
  (when-not @stop
    (eval '(take* (fn foo [])))
    (Thread/sleep @sleep-duration)
    (recur)))

(future (run-loop))

(reset! sleep-duration 0)
{code}


In the {{lein swank}} session, you will see many lines like the following, listing the classes being defined and loaded.


{code}
[Loaded user$eval15802$foo__15803 from __JVM_DefineClass__]
[Loaded user$eval15802 from __JVM_DefineClass__]
{code}


These lines will stop once the PermGen space fills up.

In the jstat monitoring, you'll see the amount of used PermGen space (PU) increase to the max and stay there.


{code}
    PC       PU        OC          OU       YGC    FGC    FGCT     GCT
 31616.0  31552.7    365952.0         0.0      4     0    0.000    0.129
 32000.0  31914.0    365952.0         0.0      4     0    0.000    0.129
 32768.0  32635.5    365952.0         0.0      4     0    0.000    0.129
 32768.0  32767.6    365952.0      1872.0      5     1    0.000    0.177
 32768.0  32108.2    291008.0     23681.8      6     2    0.827    1.006
 32768.0  32470.4    291008.0     23681.8      6     2    0.827    1.006
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258
{code}


A workaround is to run {{prefer-method}} before the PermGen space is all used up, e.g.

{code}
(prefer-method take* clojure.lang.Fn java.lang.Object)
{code}


Then, when the used PermGen space is close to the max, you will see the classes created by the eval calls being unloaded in the {{lein swank}} session.

{code}
[Unloading class user$eval5950$foo__5951]
[Unloading class user$eval3814]
[Unloading class user$eval2902$foo__2903]
[Unloading class user$eval13414]
{code}


In the jstat monitoring, there will be a long pause while the used PermGen space stays close to the max; then it will drop down and start increasing again as more eval'ing occurs.


{code}
    PC       PU        OC          OU       YGC    FGC    FGCT     GCT
 32768.0  32767.9    159680.0     24573.4      6     2    0.167    0.391
 32768.0  32767.9    159680.0     24573.4      6     2    0.167    0.391
 32768.0  17891.3    283776.0     17243.9      6     2   50.589   50.813
 32768.0  18254.2    283776.0     17243.9      6     2   50.589   50.813
{code}


The {{defmulti}} defines a cache that uses the dispatch values as keys. Each eval call in the loop defines a new {{foo}} class, which becomes the dispatch value and is added to the cache when {{take*}} is called, preventing the class from ever being GCed.
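
For illustration only, here is a sketch of the retention pattern (this is not MultiFn's actual code; the names {{cache}} and {{cached-method}} are made up): each eval compiles a distinct {{foo}} class, and a cache keyed by dispatch values then holds a strong reference to every such class it sees.

{code:title=sketch of the retention pattern (illustrative, not MultiFn's code)}
;; Each eval of '(fn foo []) compiles a brand-new class, so every loop
;; iteration produces a distinct dispatch value:
(class (eval '(fn foo [])))   ; e.g. user$eval15802$foo__15803
(class (eval '(fn foo [])))   ; a different user$eval...$foo__... class

;; Conceptually, the multimethod's dispatch cache then behaves like this,
;; keyed by those classes:
(def cache (atom {}))

(defn cached-method [x]
  (let [dv (type x)]                 ; dispatch value = the new foo class
    (or (get @cache dv)
        (let [m (get-method take* dv)]
          (swap! cache assoc dv m)   ; strong reference to dv retained forever
          m))))
{code}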

The prefer-method workaround works because it calls {{clojure.lang.MultiFn.preferMethod}}, which calls the private {{MultiFn.resetCache}} method, which completely empties the cache.
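
A sketch of automating that workaround (assuming the reproduction code above, including the {{stop}} atom, is loaded; the one-minute interval is arbitrary): re-asserting an existing preference on a timer keeps flushing the cache so the eval-generated classes can be unloaded.

{code:title=sketch: periodically flushing the multimethod cache}
;; prefer-method goes through MultiFn.preferMethod, which resets the
;; dispatch cache, so re-asserting the same preference bounds the leak.
(future
  (while (not @stop)
    (prefer-method take* clojure.lang.Fn java.lang.Object)
    (Thread/sleep 60000)))
{code}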

h2. Protocol leak

The leak with protocol methods similarly involves a cache. You see essentially the same behavior as the multimethod leak if you run the following code using protocols.

{code:title=protocol leak}
(defprotocol ITake (take* [a]))

(extend-type clojure.lang.Fn
  ITake
  (take* [this] '()))

(def stop (atom false))
(def sleep-duration (atom 1000))

(defn run-loop []
  (when-not @stop
    (eval '(take* (fn foo [])))
    (Thread/sleep @sleep-duration)
    (recur)))

(future (run-loop))

(reset! sleep-duration 0)
{code}


Again, the cache is in the {{take*}} method itself, using each new {{foo}} class as a key.

*Workaround:* Run {{-reset-methods}} on the protocol before the PermGen space is all used up, e.g.

{code}
(-reset-methods ITake)
{code}


This works because {{-reset-methods}} replaces the cache with an empty MethodImplCache.
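
The same kind of sketch applies to the protocol case (again assuming the reproduction code and its {{stop}} atom are loaded; the interval is arbitrary): calling {{-reset-methods}} on a timer keeps swapping in an empty MethodImplCache so the eval-generated classes can be unloaded.

{code:title=sketch: periodically resetting the protocol cache}
(future
  (while (not @stop)
    (-reset-methods ITake)
    (Thread/sleep 60000)))
{code}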

*Patch:* protocol_multifn_weak_ref_cache.diff

*Screened by:*

23 Answers

0 votes
by

Comment made by: hiredman

I deleted all of my attachments except for my latest and greatest

0 votes
by

Comment made by: killme2008

I updated the multifn_weak_method_cache2.diff patch too.

I think using a weak reference cache is better, because we have to keep one cache per multifn. When you have many multimethods, there will be many LRU caches in memory, and they will consume too much memory and CPU for evictions. You can't choose a proper threshold for an LRU cache in every environment.
But I don't have any benchmark data to support my opinion.

0 votes
by

Comment made by: alexmiller

I'm going to set the LRU cache patch aside. I don't think it's possible to find a "correct" size for it, and it seems weird to me to extend APersistentMap to build such a thing anyway.

I think it makes more sense to follow the same strategy used for other caches (such as the Keyword cache): a ConcurrentHashMap combined with WeakReferences and a ReferenceQueue for clean-up. I don't see any compelling reason not to take the same path as other internal caches.

0 votes
by

Comment made by: alexmiller

Stepping back a little to think about the problem.... our requirements are:
1) cache map of dispatch value (could be any Object) to multimethod function (IFn)
2) do we want keys to be compared based on equality or identity? identity-based opens up more reference-based caching options and is fine for most common dispatch types (Class, Keyword), but reduces (often eliminates?) cache hits for all other types where values are likely to be equiv but not identical (vector of strings for example)
3) concurrent access to cache
4) cache cannot grow without bound
5) cache cannot retain strong references to dispatch values (the cache keys) because the keys might be instances of classes that were loaded in another classloader which will prevent GC in permgen

multifn_weak_method_cache.diff uses a ConcurrentHashMap (#3) that maps a RefWrapper around each key to an IFn (#1). The patch uses Util.equals() (#2) for (Java) equality-based comparisons. The RefWrapper holds the keys in WeakReferences to satisfy #5. Cache clearing based on the ReferenceQueue is used to satisfy #4.

A few things definitely need to be fixed:
- Util.equals() should be Util.equiv()
- methodCache and rq should be final
- Why does RefWrapper have obj and expect rq to possibly be null?
- RefWrapper fields should all be final
- Whitespace errors in patch

Another idea entirely: instead of caching on the dispatch value directly, cache based on the hasheq of the dispatch value and then do an equality check on the value. Could then use WeakHashMap and no RefWrapper.
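
A deliberately simplified sketch of that alternative (the names below are illustrative and not from any patch): java.util.WeakHashMap already hashes the key, falls back to an equality check, holds keys in WeakReferences, and expunges stale entries internally, though it compares with Java equals/hashCode rather than equiv/hasheq and needs external synchronization to meet the concurrency requirement (#3).

{code:title=sketch: WeakHashMap-backed method cache (illustrative only)}
(import '(java.util Collections Map WeakHashMap))

;; WeakHashMap: equality-compared keys held via WeakReferences (#5),
;; stale entries expunged automatically (#4). It is not thread-safe,
;; so wrap it for concurrent access (#3).
(def ^Map method-cache
  (Collections/synchronizedMap (WeakHashMap.)))

(defn cache-method! [dispatch-val f]
  (.put method-cache dispatch-val f))

(defn cached-method [dispatch-val]
  (.get method-cache dispatch-val))
{code}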

This patch does not cover the protocol cache. Is that just waiting for the multimethod case to look good?

0 votes
by

Comment made by: killme2008

Hi Alex, thanks for your review. But the latest patch is multifn_weak_method_cache2.diff. I will update the patch soon based on your review, but I have a few questions.

1) I will use Util.equiv() instead of Util.equals(). But what's the difference between them?
2) When the RefWrapper is retained as a key in the ConcurrentHashMap, it wraps the obj in a WeakReference. But when trying to find it in the ConcurrentHashMap, it uses the obj directly as a strong reference and creates the wrapper with a null ReferenceQueue. Please look at multifn_weak_method_cache2.diff, line 112. In short, the patch stores the dispatch value as a weak reference in the cache, but uses a strong reference for cache lookups.

3) If we cache the dispatch value based on hasheq, can we avoid hasheq conflicts? If two different dispatch values have the same hasheq (or why can't that happen?), we would be in trouble.

Sorry, the patch doesn't cover the protocol cache; I will add it ASAP.

0 votes
by

Comment made by: killme2008

The new patch 'protocol_multifn_weak_ref_cache.diff' is uploaded.

1) Using Util.equiv() instead of Util.equals()
2) Moved the RefWrapper and its associated methods to Util.java, and refactored the code based on Alex's review.
3) Fixed whitespace errors.
4) Fixed PermGen leak in protocol fns.

0 votes
by

Comment made by: alexmiller

I screened this ticket again with Brenton Ashworth and had the following comments:

1) We need to have a performance test to verify that we have not negatively impacted performance of multimethods or protocol invocation.
2) Because there are special cases around null keys in the multimethod cache, please verify that there are tests using null dispatch values in the existing test coverage.
3) In Util$RefWrapper.getObj() - why does this return this.ref at the end? It was not clear to me that the comment was correct or that this was useful in any way.
4) In Util$RefWrapper.clearRefWrapCache() - can k == null in that if check? If not, can we omit that? Also, if you explicitly create the Iterator from the entry set, you can call .remove() on it more efficiently than calling .remove() on the cache itself.
5) In core_deftype / MethodImplCache, it appears that you are modifying a now-mutable field rather than the prior version that was going to great lengths to stay immutable. It's not clear to me what the implications of this change are and that concerns me. Can it use a different collection or code to stay immutable?
6) Please update the description of this ticket to include an approach section that describes the changes we are making.

Thanks!

0 votes
by
Reference: https://clojure.atlassian.net/browse/CLJ-1152 (reported by chouser@n01se.net)
...