The JVM method java.util.regex.Pattern/compile has an overload that takes a second parameter for flags, which is a bitwise OR of the various allowed values. Has a similar arity been considered for the clojure.core/re-pattern function?

For instance:

(def rflags {\i java.util.regex.Pattern/CASE_INSENSITIVE
             \m java.util.regex.Pattern/MULTILINE
             \s java.util.regex.Pattern/DOTALL
             \u java.util.regex.Pattern/UNICODE_CASE
             \d java.util.regex.Pattern/UNIX_LINES
             \x java.util.regex.Pattern/LITERAL
             \c java.util.regex.Pattern/CANON_EQ})
(defn re-flags [s] 
  (reduce bit-or 0 (map #(rflags % 0) s)))

(defn re-pattern
  "Returns an instance of java.util.regex.Pattern, for use, e.g. in
  re-matcher."
  {:tag java.util.regex.Pattern
   :added "1.0"
   :static true}
  ([s] (re-pattern s 0))
  ([s f] (if (instance? java.util.regex.Pattern s)
           s
           (. java.util.regex.Pattern (compile s f)))))
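
For comparison, here is a minimal sketch of the interop call that the proposed two-argument arity would wrap. The names `flags` and `p` are just illustrative; the only API assumed is the existing two-argument Pattern/compile.

```clojure
(import 'java.util.regex.Pattern)

;; Each flag is an int bit mask, so several flags combine with bit-or.
(def flags (bit-or Pattern/CASE_INSENSITIVE Pattern/MULTILINE))

;; The two-argument compile that the proposed arity would delegate to.
(def p (Pattern/compile "^foo$" flags))

;; MULTILINE lets ^ and $ match at line boundaries;
;; CASE_INSENSITIVE lets "FOO" match as well.
(re-seq p "FOO\nbar\nfoo")
;; => ("FOO" "foo")
```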

Some notes on this:
- Most of these flags can already be added to a pattern today using an embedded flag expression. For instance, a pattern can be made case insensitive by adding (?i) to the start of the string. However, accepting a flags string would mirror JavaScript (and could be implemented in ClojureScript).
- There are currently no options to define LITERAL or CANON_EQ without using java.util.regex.Pattern directly.
- There is currently no way to implement any flags in ClojureScript without using interop.
- While not all of these flags are compatible with JavaScript, the more common ones are. Similarly, JavaScript allows for flags that are not compatible with Java, so there is already a small disconnect.
- Passing 0 as the default flags is indeed what java.util.regex.Pattern/compile does when called with only a String.
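
To illustrate the first two notes, here is a small sketch contrasting a flag that has an embedded form with one that does not. The name `literal-dot` is only for illustration, and CANON_EQ is left out to keep the example short.

```clojure
(import 'java.util.regex.Pattern)

;; CASE_INSENSITIVE has an embedded form, so a plain regex literal is enough.
(re-find #"(?i)hello" "Say HELLO")
;; => "HELLO"

;; LITERAL has no embedded form, so today it needs interop.
(def literal-dot (Pattern/compile "a.c" Pattern/LITERAL))

(re-find literal-dot "abc")   ; => nil, the dot is not a wildcard here
(re-find literal-dot "a.c")   ; => "a.c"
```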

2 Answers


I don't know what was considered (or what functionality even was available) back when the re stuff was implemented.

Grepping around I see only a few usages of those flags in Clojure, so it does not seem to be a big gap.

Probably the need for ClojureScript to avoid interop is the bigger one?

I have rarely needed these flags in the past, but now that I'm doing more data processing they are starting to come up, particularly the case insensitivity flag. Was Clojure used a lot for data analysis in the past? I know there has been growing interest in it lately (which is one of the reasons I'm trying to use it, rather than just focusing on Pandas in Python).

I switch between Python and Clojure, and all of the Python `re` functions take a `flags` argument. This is also a common extension in JavaScript and `sed`, so when I needed it in Clojure I was surprised to discover that it wasn't there. I needed to either use interop or go to the Java documentation to learn about the `(?i)` embedded flag.
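
In the meantime, one possible workaround on the JVM is a small wrapper that turns a flag string into an embedded flag expression. This is only a sketch: `flagged-pattern` is a hypothetical helper, not part of clojure.core, and it only covers flag letters that Java accepts in embedded form (so not LITERAL or CANON_EQ).

```clojure
(defn flagged-pattern
  "Hypothetical helper: compiles s with the given flag letters by
  prepending them as an embedded flag expression, e.g. (?i)."
  [flag-letters s]
  (re-pattern (str "(?" flag-letters ")" s)))

(re-find (flagged-pattern "i" "clojure") "I like Clojure")
;; => "Clojure"
```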

I am not doing any of this work with ClojureScript, but you have a point: whenever I have to do something JVM specific I am always concerned about how this plays in ClojureScript.