+1 vote

Hello,
I'm building a web crawler in Clojure that records its results in a map of maps. The data structure basically looks like this:

{"https://example.org" {:base-url "https://example.org"
                        :crawled {"/blog" {:relevant true :status 200}
                                  "/contact" {:relevant false :status 200}}}
 "https://foo.com" {:base-url "https://foo.com"
                    :crawled {"/home" {:relevant true :status 200}}}}

The web crawler is fed a list of websites. For the sake of efficiency, however, I want to avoid crawling websites that were already crawled. Therefore, I was looking for a way to persist the above data structure so that it can be consulted in subsequent runs to check whether a given website was already crawled. (Of course, I have other reasons to store the results too, e.g. to run various analyses on the captured data.)

First, I thought EDN would be a perfect solution. To my surprise, clojure.edn does not have write functions. A lot of advice online says to use pr-str. However, using pr-str has caveats, as pointed out in https://nitor.com/en/articles/pitfalls-and-bumps-clojures-extensible-data-notation-edn (e.g. possible truncation when *print-length* is set). And it's entirely possible to generate invalid EDN, as pointed out in this comment by @alexmiller: https://ask.clojure.org/index.php/10278/should-pr-str-on-a-function-produce-valid-edn-syntax?show=10282#c10282.
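To make the two caveats concrete, here is a minimal sketch of how I understand them. The names data and truncated are just illustrative, and printing the + function stands in for any non-data value that might sneak into a result map.

```clojure
(require '[clojure.edn :as edn])

;; Illustrative data; any collection longer than *print-length* behaves the same.
(def data {:crawled (vec (range 10))})

;; Caveat 1: when *print-length* is bound, pr-str silently truncates collections.
(def truncated
  (binding [*print-length* 3]
    (pr-str data)))
;; truncated is the string "{:crawled [0 1 2 ...]}"

;; Caveat 2: a function prints as a tagged #object literal, which clojure.edn
;; has no reader for, so reading it back throws.
(def fn-read-result
  (try
    (edn/read-string (pr-str +))
    (catch RuntimeException _ :unreadable)))
;; fn-read-result is :unreadable
```

In both cases pr-str returns a string without complaint, so the problem only shows up when the data is read back.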

Despite https://clojuredocs.org/clojure.edn claiming that it is "designed to be used in a similar way to JSON or XML", I have the impression that it is more geared towards writing EDN files by hand (e.g. configuration files).

Looking for alternatives, I stumbled upon Transit. However, it's mainly intended "as a wire protocol for transferring data between applications". The README further says:

"If storing Transit data durably, readers and writers are expected to use the same version of Transit and you are responsible for migrating/transforming/re-storing that data when and if the transit format changes."

Because in Clojure data is code and code is data, and because EDN exists, I really thought persisting simple data structures like the one above would be a breeze.

Instead, with my current knowledge, I see only these sub-optimal options:

  • Hope that pr-str produces valid EDN without loss in my particular case
  • Build a custom XML serialization
  • Use Transit
  • Use JSON?! (what about keywords etc.?!)

Is there a more idiomatic way to store and read back simple Clojure data structures?

2 Answers

0 votes

I'd just use Transit.

0 votes

Probably the most widely used option is nippy, https://github.com/taoensso/nippy. A newer alternative is deed, https://github.com/igrishaev/deed. Another alternative is https://github.com/clojure/data.fressian.

For most Clojure data, you can still use EDN, but I agree that writing readable EDN is trickier than it should be. Here is the function that I use:

(defn write-edn [w obj]
  (binding [*print-length* nil
            *print-level* nil
            *print-dup* false
            *print-meta* false
            *print-readably* true

            ;; namespaced maps not part of edn spec
            *print-namespace-maps* false

            *out* w]
    (pr obj)))

;; usage
(require '[clojure.java.io :as io])
(with-open [w (io/writer "my-file.edn")]
  (write-edn w {:foo :bar}))
...
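For the reading side, a minimal sketch along the same lines might look like this. The read-edn function, the crawl-results sample and the temp file are names I made up for illustration; clojure.edn/read itself needs a java.io.PushbackReader.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])
(import '(java.io File PushbackReader))

(defn read-edn
  "Reads a single EDN value from source (a path, File, or anything io/reader accepts)."
  [source]
  (with-open [r (PushbackReader. (io/reader source))]
    (edn/read r)))

;; round trip with sample data shaped like the crawler results
(def crawl-results
  {"https://foo.com" {:base-url "https://foo.com"
                      :crawled {"/home" {:relevant true :status 200}}}})

(def tmp-file (File/createTempFile "crawl-results" ".edn"))

;; written with plain pr here to keep the sketch self-contained
(with-open [w (io/writer tmp-file)]
  (binding [*print-length* nil
            *print-level* nil
            *out* w]
    (pr crawl-results)))

(def restored (read-edn tmp-file))
```

Because clojure.edn/read never evaluates code, this should be safe to use on files you did not write yourself, unlike clojure.core/read.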