
0 votes
in Compiler by

How to check regexp equality in nested structures?

(= #"." #".") ;; #=> false
(= [#"."] [#"."]) ;; #=> false
(= ["."] ["."]) ;; #=> true

I believe that a regexp is a value object,
so two regexps should be equal if their source text is literally equal.

Is it possible to update the = function to support regexp equality?

Maybe like this:

(import 'java.util.regex.Pattern)

(defprotocol IEquals
  (equals [a b]))

(extend-protocol IEquals
  Object
  (equals [a b] (.equals a b))

  Pattern
  (equals [a b]
    (and
      (instance? Pattern b)
      (= (str a) (str b)))))

clojure.lang.Util.equiv would then use
IEquals#equals instead of Object#equals.
But I don't know how to call a protocol method from clojure.lang.Util.

1 Answer

+2 votes
by

Regex patterns have identity equality semantics. This has been discussed at length and we have no plans to change this in Clojure. For both performance and philosophical reasons (value-based equality), there is no protocol-based way to extend the equality abstraction, and I think it's unlikely we will ever add that.

If you want to compare patterns for equality, one option is to keep your regexes as strings for comparison, then use re-pattern to convert them to regexes at the point of use.
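
A minimal sketch of that approach, assuming a hypothetical `rules` vector and `categorise` function (neither name comes from the question). The data holds plain strings, so ordinary `=` works on nested structures, and `re-pattern` is only called when a match is needed:

```clojure
;; Hypothetical rule data: regex sources are kept as plain strings.
(def rules
  [{:source "^foo" :category :foo}
   {:source "bar$" :category :bar}])

;; Strings are values, so nested structures compare as expected.
(def rules-equal?
  (= rules
     [{:source "^foo" :category :foo}
      {:source "bar$" :category :bar}]))

;; Compile at the point of use. Note that this recompiles on every call.
(defn categorise [s]
  (some (fn [{:keys [source category]}]
          (when (re-find (re-pattern source) s)
            category))
        rules))
```

Here `rules-equal?` is true, and `(categorise "foobaz")` returns `:foo`.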

by
Another option would be to call (.pattern #"some|regex"), which returns the underlying string, and compare those strings.
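
A small sketch of such a comparison, using a hypothetical helper named `pattern=`. Since a `java.util.regex.Pattern` also carries compile-time flags, this version compares `.flags` as well as `.pattern`; that extra check is an assumption about what "equal" should mean here, not part of the suggestion above:

```clojure
(import 'java.util.regex.Pattern)

;; Hypothetical helper: two patterns count as equal when both the
;; source string and the compile-time flags match.
(defn pattern= [^Pattern a ^Pattern b]
  (and (= (.pattern a) (.pattern b))
       (= (.flags a) (.flags b))))
```

With this, `(pattern= #"some|regex" #"some|regex")` is true even though `(= #"some|regex" #"some|regex")` is false.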
by
Converting "at the point of use" is also a performance hit, since `re-pattern` does regex compilation, which should only happen once, especially when reusing a regex.

My current use case is to categorise a stream of strings via a curated list of ~18000 regex-to-category mappings, which doesn't change frequently.

It's very much desirable to keep those 18000 regexes compiled, otherwise doing a `re-find` on them against ~6000 strings takes 21s instead of 9.5s. If I `(memoize re-pattern)`, it takes 26s.

Not having regex equality, at least based on the string they were created from, just makes writing tests quite painful too.

On the one hand, regexes feel like first-class citizens of Clojure, since they even have their own built-in literal syntax and can be used as hash-map keys, but then they break down when it comes to equality.

If two `java.util.regex.Pattern` objects are compiled from the same string, they will behave the same way, so it's safe to consider them equal.

I fail to see how it is relevant that different regex strings might result in the same matching behaviour. We are talking about the `=` operator, not a `does-it-behave-the-same?` operator... :/

What would be the use-case for detecting whether a regex object instance is the same as another one?
by
A Clojure regex literal is a read time construct that compiles to a runtime Pattern object (I presume some kind of state machine). That compiled object’s behavior is derived from both the string AND the flags used at pattern compile time. Pattern objects have identity semantics in Java. Clojure follows the lead of the host here.

We don’t feel it makes sense to pretend to be able to compare regexes as values. This has been long ago considered and decided.

It’s relatively easy to create your own deftype that wraps a regex pattern and implements whatever equality semantics you want.
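
A rough sketch of such a wrapper, under the assumption that "equal" means same source string and same flags. The type name `ValueRegex` and its field `re` are hypothetical; the wrapper keeps the compiled Pattern, so nothing is recompiled at the point of use:

```clojure
(import 'java.util.regex.Pattern)

;; Hypothetical wrapper type: holds a compiled Pattern but compares
;; by source string and flags instead of by identity.
(deftype ValueRegex [^Pattern re]
  Object
  (equals [_ other]
    (boolean
      (and (instance? ValueRegex other)
           (let [^Pattern o (.-re ^ValueRegex other)]
             (and (= (.pattern re) (.pattern o))
                  (= (.flags re) (.flags o)))))))
  ;; Keep hashCode consistent with equals so instances work as map keys.
  (hashCode [_]
    (hash [(.pattern re) (.flags re)]))
  (toString [_]
    (.pattern re)))
```

With this, `(= [(->ValueRegex #".")] [(->ValueRegex #".")])` is true, and the compiled pattern is still available for matching, e.g. `(re-find (.-re (->ValueRegex #"b")) "abc")`.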
...