
+1 vote
ago in Clojure by

We're seeing what looks like a violation of the keyword-interning invariant in production. The bug is deterministic (repeats thousands of times in the same order on the same Aleph/Netty thread) and is cleared by recompiling the affected namespaces via nREPL.

(when (and (map? result) (nil? (:payload result)))   ; this WHEN fires
  (let [payload-key (->> (keys result)
                         (filter #(.contains (pr-str %) "payload"))
                         first)
        payload-via-key (when payload-key (get result payload-key))]
    (log/warn {:result-type                   (str (type result))
               :result-keys                   (pr-str (keys result))
               :payload-key-equals-literal?   (= payload-key :payload)
               :payload-key-identical?        (identical? payload-key :payload)
               :payload-via-found-key-nil?    (nil? payload-via-key)}
              "diagnostic")))

Logged values when the bug fires:

{:result-type                   clojure.lang.PersistentArrayMap
 :result-keys                   (:payload :aws-xray)
 :payload-key-equals-literal?   true
 :payload-key-identical?        true
 :payload-via-found-key-nil?    false}

So:

  • The (:payload result) in the when returned nil.
  • A few lines later, the diagnostic body proves that the first key in (keys result) IS the body-site :payload literal by identity (and therefore by =, since Keyword inherits Object.equals).
  • (get result that-key) returns the actual non-nil payload value.

PersistentArrayMap.indexOf uses == for Keyword keys, so the only way (:payload result) returns nil while (keys result) yields a key that is identical? to :payload at a nearby site is if the :payload literal at the WHEN site and the :payload literal at the body site are two different Keyword instances, even though they're written identically in one source-level function.
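For reference, the interning invariant this reasoning relies on can be checked at a REPL. This is a minimal sketch of what a healthy runtime should show; the map `m` is a made-up stand-in for `result`, not our production data.

```clojure
;; Keywords with the same namespace and name are interned to a single instance,
;; so every route to :payload should yield the same object.
(def m {:payload "data" :aws-xray "trace"})   ; hypothetical stand-in for `result`

(def k-from-literal :payload)
(def k-from-fn      (keyword "payload"))
(def k-from-map     (first (keys m)))

(identical? k-from-literal k-from-fn)    ; true when interning holds
(identical? k-from-literal k-from-map)   ; true when interning holds
(:payload m)                             ; the lookup that returns nil for us in production
```

In the failing instances the second and third checks effectively disagree: the key found via `keys` is identical to a `:payload` literal, yet the keyword call does not find it.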

We compared the :payload literal with the payload-key and found that both have:

  • identical content hashcode (-383036092)
  • identical name bytes [0x70 0x61 0x79 0x6C 0x6F 0x61 0x64]
  • identical codepoints
  • same classloader

so it really looks like two distinct interned Keyword instances with the same name.

Environment

  • Clojure 1.12.4
  • Eclipse Temurin JDK 25, Shenandoah GC, virtual threads enabled
  • ARM64 (AWS Graviton, ECS Fargate)
  • Aleph + Netty, transit-clj/transit-java for decoding

Questions

  1. Could this be a bug in JVM / Clojure runtime?
  2. Has similar behavior been seen in other cases?
  3. What more evidence could we capture?
ago by
It cannot be two distinct instances of Keyword, since `(identical? payload-key :payload)` returns `true`. You say that the bug is cleared when the namespace is "recompiled" (I assume you mean reloaded) via nREPL. So could it be that there are multiple files with the same namespace on your classpath, and you load the properly working one via the REPL? Or maybe you have a cached `.class` file that's somehow newer than the corresponding `.clj` file but contains wrong bytecode.
ago by
Thanks for your response.

On stale / duplicate `.class` files:

Our setup is source-only. We build the uberjar with uberdeps (no AOT — it just packages source on classpath into a jar), and the container launches via `clojure.main -m ...`, so every namespace is read from .clj at JVM start. Our deps.edn has no AOT step, no gen-class, no :aot key, no compile-target on the classpath.

---

On the identical? returning true:

That check is at the body site only. It tells us `payload-key == :payload` at the body-site. The `when` guard's `(:payload result)` returning nil tells us `:payload != array[0]` at the when-site. Since `(keys result)` puts payload-key at array[0], and `payload-key == :payload`, we get `when-site :payload != body-site :payload` — two distinct Keyword instances, where both happen to be visible within one source-level fn.

If not two distinct instances of Keyword, what else could cause `(identical? payload-key :payload)` to return `true` but `(:payload result)` to return `nil` in the same function?
ago by
The only other idea that I have is that some `:payload` literals, specifically the one at the top-level `when`, could have invisible characters in them or characters that look identical to the characters in the ASCII range but have different Unicode code points.
ago by
Thanks for the quick reply.
Yes, I had also suspected that. But I logged the content hashcode, name bytes and codepoints for the `:payload` literal and `payload-key`, and everything is identical.

The hashcode was `-383036092`.
The name bytes were `[0x70 0x61 0x79 0x6C 0x6F 0x61 0x64]`.
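
A check of this kind can be sketched roughly as follows. The helper name `keyword-name-info` is made up for illustration and is not the exact code we ran; it only shows one way to dump the UTF-8 bytes and codepoints of a keyword's name so that look-alike or invisible characters would stand out.

```clojure
(defn keyword-name-info
  "Returns the codepoints and UTF-8 bytes (as hex strings) of the name of keyword k."
  [k]
  (let [^String s (name k)]
    {:codepoints (vec (.toArray (.codePoints s)))
     :bytes      (mapv #(format "0x%02X" %) (.getBytes s "UTF-8"))}))

;; Compare the literal against a key pulled out of a map.
(= (keyword-name-info :payload)
   (keyword-name-info (first (keys {:payload 1}))))
```

For a plain ASCII `:payload`, the `:bytes` entry should match the seven values listed above.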

1 Answer

+1 vote
ago by

It might also be interesting to know if you are getting a null due to not-found key vs null value in the map. You could supply a not-found arg on (:payload result :NOT-FOUND) to distinguish.
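
A minimal sketch of that check, assuming a sentinel keyword that can never be a real payload value (the function name `payload-lookup-status` is made up for illustration):

```clojure
(defn payload-lookup-status
  "Classifies the lookup of :payload in m as :missing-key, :nil-value or :present."
  [m]
  (let [v (:payload m ::not-found)]
    (cond
      (= v ::not-found) :missing-key   ; the key was not found at all
      (nil? v)          :nil-value     ; the key is there but maps to nil
      :else             :present)))

(payload-lookup-status {:aws-xray "trace"})   ; missing key
(payload-lookup-status {:payload nil})        ; nil value
(payload-lookup-status {:payload {:id 1}})    ; present
```

One caveat, as far as I can tell: the two-argument form may not go through the same compiled call site as the one-argument `(:payload result)`, so it is probably worth logging this next to the original call rather than replacing it.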

Have you recently updated Clojure?

Is this problem reproducible such that you could try it again with different code, deps, etc?

Keyword invocation uses special call sites and the compiler code for that changed in Clojure 1.12.3, so I would be interested in whether the behavior changes if you use Clojure 1.12.2.

Or alternately, does the behavior change if you modify the code from (:payload result) to (get result :payload)?
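
Something like the following could capture both in one log entry. This is only a sketch, and `lookup-variants` and `variants-agree?` are hypothetical names; it runs the same lookup through a few different code paths so that a disagreement between them shows up directly.

```clojure
(defn lookup-variants
  "Looks up :payload in map m through several code paths and returns the results."
  [m]
  {:via-keyword-call (:payload m)                       ; keyword invocation
   :via-get          (get m :payload)                   ; clojure.core/get
   :via-find         (some-> (find m :payload) val)})   ; entry lookup, then value

(defn variants-agree?
  "True when every lookup path returned the same value."
  [m]
  (apply = (vals (lookup-variants m))))

(variants-agree? {:payload {:id 1} :aws-xray "trace"})
```

If the keyword call site is the culprit, I would expect `:via-keyword-call` to be nil while the other two entries hold the real payload.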

ago by
Thank you for the response, Alex — really appreciate it.

Clojure upgrade:
We jumped from Clojure 1.11.1 to 1.12.4 on 2025-12-22. The same commit also moved us from JDK 17 to JDK 25 with virtual threads. The bug investigation started around March 2026, so there's roughly a 2-month gap between the upgrade and our first reports, though it's possible the bug existed earlier and went unnoticed.

Reproducibility:
Not manually, but we see this happening about 1 to 3 times a week. We have 10 instances running and handling roughly the same workload; the bug suddenly starts on one of the instances and then doesn't stop. We haven't been able to identify any pattern. It seems quite random.

Experiments we will try:
1. (:payload result :NOT-FOUND)
2. (get result :payload) vs (:payload result)

Since the bug isn't manually reproducible and an nREPL recompile fixes it for the affected instance, each experiment needs a new release and a probabilistic waiting window of ~1 week before we can claim a result with any confidence. We will run both experiments above in a single release.

Depending on the results from the first two experiments, we will also consider downgrading Clojure to 1.12.2 and seeing if that fixes the issue.

Question:
Given that your hypothesis points at the keyword call-site compilation rather than at duplicate interned instances, would logging `System/identityHashCode` of the :payload literal and payload-key still be a useful diagnostic to add? Or does the call-site explanation make it largely irrelevant?
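
If it is still worth adding, this is roughly what we would log. It is a sketch only: `keyword-identity-info` is a made-up helper name, and it assumes the public `sym` field on `clojure.lang.Keyword` is the right thing to inspect alongside the keyword itself.

```clojure
(defn keyword-identity-info
  "Returns identity details for keyword k, suitable for inclusion in a log map."
  [^clojure.lang.Keyword k]
  {:printed           (pr-str k)
   :content-hash      (hash k)
   :identity-hash     (System/identityHashCode k)
   :sym-identity-hash (System/identityHashCode (.-sym k))})

;; In a healthy runtime both entries should carry the same :identity-hash.
{:literal  (keyword-identity-info :payload)
 :from-map (keyword-identity-info (first (keys {:payload 1})))}
```

Differing `:identity-hash` values for the literal and the map key would point at duplicate instances; equal values alongside a nil `(:payload result)` would point back at the call site.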

I'll report back as soon as we have signal — likely within a week. Thank you again.
...