Caret character differences with re-seq in CLJS vs. CLJ

Question

asked Aug 5, 2019 in ClojureScript by eneroth
edited Nov 14, 2019 by alexmiller

Hello,

I'm seeing re-seq behave differently in ClojureScript than in Clojure:

;; CLJ
(re-seq #"^[a-f]" "aabcded") ;; => ("a")

;; CLJS
(re-seq #"^[a-f]" "aabcded") ;; => ("a" "a" "b" "c" "d" "e" "d")

Is this a bug?

ClojureScript version 1.10.520

(Logged as https://clojure.atlassian.net/browse/CLJS-3187)

4 Answers

Dominic Monroe · Answer 1 · 2019-08-05T15:21:02+0000

This is a bug in the re-seq implementation in ClojureScript:

(defn re-seq
  "Returns a lazy sequence of successive matches of re in s."
  [re s]
  (let [match-data (re-find re s)
        match-idx (.search s re)
        match-str (if (coll? match-data) (first match-data) match-data)
        post-idx (+ match-idx (max 1 (count match-str)))
        post-match (subs s post-idx)]
    (when match-data
      (lazy-seq
        (cons match-data
              (when (<= post-idx (count s))
                (re-seq re post-match)))))))

The issue arises where it recurses into re-seq with the remainder of the string. Each recursive call sees a new, shorter string, so ^[a-f] matches again at the start of it.
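To illustrate the point, here is a small sketch that matches the anchored pattern against each successive tail of the input, which is roughly what the recursion does. The names s and anchored-on-tails are made up for this example.

```clojure
(def s "aabcded")

;; Match against successive tails of s, as the recursive calls do.
;; Every tail is a new string, so ^ anchors at the start of each one.
(def anchored-on-tails
  (map #(re-find #"^[a-f]" (subs s %)) (range (count s))))

anchored-on-tails
;; => ("a" "a" "b" "c" "d" "e" "d")
```

This reproduces the CLJS output from the question, even though only the first "a" sits at the start of the original string.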

One solution is to make your regex sticky:

(js/RegExp. #"^." "y")

This makes subsequent uses of your regex aware of previous matches. Do note that you need to be careful about where you create it: it has to be created fresh for each use and can't be a shared, global value! If it were shared, you would run into weird state issues like this one:

(let [re (js/RegExp. #"^." "y")]
  [(re-seq re "cccc")
   (re-seq re "abbb")])
;; => [("c" "c") nil]

(which I cannot explain at all!)
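A possible explanation, offered as a sketch rather than a confirmed diagnosis: a sticky regex keeps mutable state in its lastIndex property, and ^ can only match at index 0. The trace below uses a hypothetical var named sticky and calls .exec directly to show how one successful match makes the next attempt fail.

```clojure
(def sticky (js/RegExp. "^." "y"))

(def trace
  [(.-lastIndex sticky)            ; starts at 0
   (some? (.exec sticky "cccc"))   ; matches "c" at index 0
   (.-lastIndex sticky)            ; now 1
   (some? (.exec sticky "abbb"))   ; tries index 1, where ^ cannot match
   (.-lastIndex sticky)            ; a failed match resets it to 0
   (some? (.exec sticky "abbb"))]) ; matches "a" again

trace
;; => [0 true 1 false 0 true]
```

Assuming the re-seq quoted above, this would account for the result: the first call matches "c" eagerly and leaves lastIndex at 1, so the second call fails immediately and returns nil, resetting lastIndex to 0. When the first lazy sequence is realized afterwards, it gets one more "c" before failing in the same way.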

An alternative implementation of re-seq might make this initial clone for you:

(defn re-seq2
  "Returns a lazy sequence of successive matches of re in s."
  [re s]
  (let [re-seq* (fn re-seq* [re s]
                  (let [match-data (re-find re s)
                        match-idx (.search s re)
                        match-str (if (coll? match-data) (first match-data) match-data)
                        post-idx (+ match-idx (max 1 (count match-str)))
                        post-match (subs s post-idx)]
                    (when match-data
                      (lazy-seq
                        (cons match-data
                              (when (<= post-idx (count s))
                                (re-seq* re post-match)))))))]
    (re-seq* (js/RegExp. re "y") s)))

(let [re #"^."]
  [(re-seq2 re "cccc")
   (re-seq2 re "abbb")])
;; => [("c") ("a")]

eneroth · Answer 2 · 2019-08-22T10:36:51+0000

FWIW, I ended up solving my current problem by re-implementing re-seq in the following manner:

(defn re-seq [re s]
  (let [re* (js/RegExp. re "g")
        xf (comp (take-while some?)
                 (map first))]
    (sequence xf (repeatedly #(.exec re* s)))))

Once lazy-seq is removed from the equation (and "global" is switched on for the regex), it works as expected for my test cases.
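For reference, here is the same function under a hypothetical name, re-seq-g, so it does not shadow cljs.core/re-seq, applied to the input from the question.

```clojure
(defn re-seq-g [re s]
  (let [re* (js/RegExp. re "g")   ; copy of re with the flags replaced by "g"
        xf  (comp (take-while some?)
                  (map first))]
    (sequence xf (repeatedly #(.exec re* s)))))

(re-seq-g #"^[a-f]" "aabcded") ;; => ("a")
(re-seq-g #"[a-f]" "aabcded")  ;; => ("a" "a" "b" "c" "d" "e" "d")
(re-seq-g #"(a)(b)" "abab")    ;; => ("ab" "ab")
```

Two differences from the core function are worth keeping in mind. Because of (map first), only the whole match is returned, so capture groups are dropped. A pattern that can match the empty string, such as #"x*", would never terminate here, because .exec does not advance lastIndex after an empty match.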

Lauri Oherd · Answer 3 · 2020-05-16T19:21:59+0000

I created a patch for the Jira ticket.
The solution was to add the global flag to the regular expression if it wasn't already set, and to call RegExp.prototype.exec() repeatedly until there are no more matches.
Please let me know if you find any issues.
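A minimal sketch of that approach might look like the following. This is not the patch attached to the ticket; the name re-seq-sketch and the handling of empty matches are assumptions made for illustration.

```clojure
(defn re-seq-sketch
  "Hypothetical name. Returns a lazy sequence of successive matches of re in s."
  [re s]
  (let [flags (.-flags re)
        ;; Copy the regex so lastIndex state is private to this call,
        ;; adding "g" only when it is not already present.
        re*   (js/RegExp. (.-source re)
                          (if (.-global re) flags (str flags "g")))
        step  (fn step []
                (lazy-seq
                  (when-some [m (.exec re* s)]
                    ;; An empty match leaves lastIndex unchanged, so advance
                    ;; it by hand to avoid looping forever.
                    (when (zero? (count (aget m 0)))
                      (set! (.-lastIndex re*) (inc (.-lastIndex re*))))
                    ;; Same result shape as re-find: a string without groups,
                    ;; a vector when the pattern has groups.
                    (cons (if (== 1 (.-length m)) (aget m 0) (vec m))
                          (step)))))]
    (step)))

(re-seq-sketch #"^[a-f]" "aabcded") ;; => ("a")
(re-seq-sketch #"(\w)(\d)" "a1b2")  ;; => (["a1" "a" "1"] ["b2" "b" "2"])
```

Because each call works on its own copy of the regex, reusing the same pattern for several calls does not leak lastIndex state between them.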
