Odd performance penalty on numeric benchmark alleviated by going to doubles...

Question

Odd performance penalty on numeric benchmark alleviated by going to doubles...

asked Jan 10, 2023 in Java Interop by Tom

I happened across the performance benchmark here and I was curious why clojure was getting beaten by java.

So I tossed it into the profiler (after modifying their version to use unchecked math - which didn't help) and nothing shows up. hmm. decompile and find that

// Decompiling class: leibniz$calc_pi_leibniz
import clojure.lang.*;

public final class leibniz$calc_pi_leibniz extends AFunction implements LD
{
    public static double invokeStatic(final long rounds) {
        final long end = 2L + rounds;
        long i = 2L;
        double x = 1.0;
        double pi = 1.0;
        while (i != end) {
            final double x2 = -x;
            final long n = i + 1L;
            final double n2 = x2;
            pi += Numbers.divide(x2, 2L * i - 1L);
            x = n2;
            i = n;
        }
        return Numbers.unchecked_multiply(4L, pi);
    }

    @Override
    public Object invoke(final Object o) {
        return invokeStatic(RT.uncheckedLongCast(o));
    }

    @Override
    public final double invokePrim(final long rounds) {
        return invokeStatic(rounds);
    }
}

So looks like the double/long boundary is costing us at least a method lookup maybe in Numbers.divide?
So I just coerce everything to double (even our index variable):

(def rounds 100000000)

(defn calc-pi-leibniz2
  "Eliminate mixing of long/double to avoid clojure.numbers invocations."
  ^double
  [^long rounds]
  (let [end (+ 2.0 rounds)]
    (loop [i 2.0 x 1.0 pi 1.0]
      (if (= i end)
        (* 4.0 pi)
        (let [x (- x)]
          (recur (inc i) x (+ pi (/ x (dec (* 2 i))))))))))

leibniz=> (c/quick-bench (calc-pi-leibniz rounds))
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 575.352216 ms
    Execution time std-deviation : 10.070268 ms
   Execution time lower quantile : 566.210399 ms ( 2.5%)
   Execution time upper quantile : 588.772187 ms (97.5%)
                   Overhead used : 1.884700 ns
nil
leibniz=> (c/quick-bench (calc-pi-leibniz2 rounds))
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 158.509049 ms
    Execution time std-deviation : 759.113165 ╡s
   Execution time lower quantile : 157.234899 ms ( 2.5%)
   Execution time upper quantile : 159.205374 ms (97.5%)
                   Overhead used : 1.884700 ns
nil

Any ideas why the java implementation not paying the same penalty for division? [both versions are implemented with unchecked-math at :warn-on-boxed].

I also tried a variant with fastmath's primitive math operators and actually got slower. So far nothing has beaten coercing the loop index i into a double (which I would normally never do).

commented Jan 10, 2023 by Ben Sless

In my benchmarks this gives the same performance as your double solution without coercing the index to be double:

(defn calc-pi-leibniz3
  "Eliminate mixing of long/double to avoid clojure.numbers invocations."
  ^double
  [^long rounds]
  (let [end (+ 2 rounds)]
    (loop [i 2 x 1.0 pi 1.0]
      (if (= i end)
        (* 4.0 pi)
        (let [x (- x)]
          (recur (inc i) x (+ pi (/ x (double (unchecked-dec-int (unchecked-multiply-int (unchecked-int 2) (unchecked-int i))))))))))))

It's about realizing where cast instructions are inserted by each compiler

This is the bytecode for the java solution:

  public static double go(int);
    descriptor: (I)D
    flags: (0x0009) ACC_PUBLIC, ACC_STATIC
    Code:
      stack=6, locals=6, args_size=1
         0: dconst_1
         1: dstore_1
         2: dconst_1
         3: dstore_3
         4: iconst_2
         5: istore        5
         7: iload         5
         9: iload_0
        10: iconst_2
        11: iadd
        12: if_icmpge     39
        15: dload_3
        16: ldc2_w        #24                 // double -1.0d
        19: dmul
        20: dstore_3
        21: dload_1
        22: dload_3
        23: iconst_2
        24: iload         5
        26: imul
        27: iconst_1
        28: isub
        29: i2d
        30: ddiv
        31: dadd
        32: dstore_1
        33: iinc          5, 1
        36: goto          7
        39: dload_1
        40: ldc2_w        #26                 // double 4.0d
        43: dmul
        44: dup2
        45: dstore_1
        46: dreturn

Here is for your solution:

    public static double invokeStatic(long rounds);
        Flags: PUBLIC, STATIC
        Code:
               0: ldc2_w          2.0
               3: lload_0         /* rounds */
               4: invokestatic    clojure/lang/Numbers.add:(DJ)D
               7: dstore_2        /* end */
               8: ldc2_w          2.0
              11: dstore          i
              13: dconst_1
              14: dstore          x
              16: dconst_1
              17: dstore          pi
              19: dload           i
              21: dload_2         /* end */
              22: dcmpl
              23: ifne            36
              26: ldc2_w          4.0
              29: dload           pi
              31: dmul
              32: goto            72
              35: athrow
              36: dload           x
              38: dneg
              39: dstore          x
              41: dload           i
              43: dconst_1
              44: dadd
              45: dload           x
              47: dload           pi
              49: dload           x
              51: ldc2_w          2
              54: dload           i
              56: invokestatic    clojure/lang/Numbers.multiply:(JD)D
              59: dconst_1
              60: dsub
              61: ddiv
              62: dadd
              63: dstore          pi
              65: dstore          x
              67: dstore          i
              69: goto            19
              72: dreturn

And my solution:

    public static double invokeStatic(long rounds);
        Flags: PUBLIC, STATIC
        Code:
               0: ldc2_w          2
               3: lload_0         /* rounds */
               4: ladd
               5: lstore_2        /* end */
               6: ldc2_w          2
               9: lstore          i
              11: dconst_1
              12: dstore          x
              14: dconst_1
              15: dstore          pi
              17: lload           i
              19: lload_2         /* end */
              20: lcmp
              21: ifne            34
              24: ldc2_w          4.0
              27: dload           pi
              29: dmul
              30: goto            71
              33: athrow
              34: dload           x
              36: dneg
              37: dstore          x
              39: lload           i
              41: lconst_1
              42: ladd
              43: dload           x
              45: dload           pi
              47: dload           x
              49: ldc2_w          2
              53: lload           i
              55: l2i
              56: imul
              57: iconst_1
              58: isub
              59: i2d
              60: ddiv
              61: dadd
              62: dstore          pi
              64: dstore          x
              66: lstore          i
              68: goto            17
              71: dreturn

It also avoids all method calls and works directly with primitives

Still, with all this handwaving it doesn't perform better than your solution

Both solutions achieve the same performance as the Java one according to my benchmarks

commented Jan 10, 2023 by Tom

commented Jan 12, 2023 by Chris Nuernberger

1 Answer

alexmiller · Answer 1 · 2023-01-10T18:46:08+0000

I won't have time to take a look at this for a while, but it is pretty easy to fall into boxing at the loop/recur boundary which will cause substantial slowdowns, but it's easiest to tell for sure by looking at the bytecode.

Odd performance penalty on numeric benchmark alleviated by going to doubles...

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Categories

Odd performance penalty on numeric benchmark alleviated by going to doubles...

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Related questions

Categories