
+1 vote

Hello,
I'm building a web crawler in Clojure that records its results in a map of maps. The data structure basically looks like this:

{"https://example.org" {:base-url "https://example.org"
                        :crawled {"/blog" {:relevant true :status 200}
                                  "/contact" {:relevant false :status 200}}}
 "https://foo.com" {:base-url "https://foo.com"
                    :crawled {"/home" {:relevant true :status 200}}}}

The web crawler is fed a list of websites. However, for the sake of efficiency, I want to avoid crawling websites that have already been crawled. Therefore, I was looking for a way to persist the above data structure so that it can be consulted in subsequent runs to check whether a given website was already crawled. (Of course, I have other reasons to store the results too, e.g. to run various analyses on the captured data.)

First, I thought EDN would be a perfect solution. To my surprise, clojure.edn does not have write functions. A lot of advice online says to use pr-str. However, using pr-str has caveats, as pointed out in https://nitor.com/en/articles/pitfalls-and-bumps-clojures-extensible-data-notation-edn (e.g. possible truncation when *print-length* is set). And it's entirely possible to generate invalid EDN, as pointed out in this comment by @alexmiller: https://ask.clojure.org/index.php/10278/should-pr-str-on-a-function-produce-valid-edn-syntax?show=10282#c10282.
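To make the truncation caveat concrete, here is a minimal sketch. The sample vector is made up for illustration, and the `*print-length*` binding stands in for a REPL or tooling setup that sets it globally:

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical sample data, for illustration only.
(def data (vec (range 10)))

;; With *print-length* bound, pr-str elides everything past the limit.
(def truncated
  (binding [*print-length* 3]
    (pr-str data)))
;; truncated is "[0 1 2 ...]"

;; Reading the truncated string back does not yield the original value.
(def round-tripped (edn/read-string truncated))
(= data round-tripped) ; false, the elements after the limit are gone
```

The point of the sketch is that nothing fails at write time, so the loss only shows up when the data is read back and compared.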

Despite https://clojuredocs.org/clojure.edn claiming that it is "designed to be used in a similar way to JSON or XML", I have the impression it is geared more towards writing EDN files by hand (i.e. configuration files).

Looking for alternatives, I stumbled upon Transit. However, it's mainly intended "as a wire protocol for transferring data between applications". And the README further says:

If storing Transit data durably, readers and writers are expected to use the same version of Transit and you are responsible for migrating/transforming/re-storing that data when and if the transit format changes.

Because in Clojure, data is code and code is data and because of the existence of EDN, I really thought persisting simple data structures like the above would be a breeze in Clojure.

Instead I see these sub-optimal options with my current knowledge:

  • Hope that pr-str produces valid EDN without loss in my particular case
  • Build a custom XML serialization
  • Use Transit
  • Use JSON?! (what about keywords etc.?!)

Is there a more idiomatic way to store and read back in simple Clojure data structures?

by
Note that while that caveat is on the site, there have been no changes to the Transit format since its release 10 years ago, and there are no plans to change it. If we did change it, that would, of course, be done with great caution and care with an eye towards avoiding breakage.
by
Thanks for putting the caveat into perspective.
by
Just put the `pr` inside a `binding` that adjusts the earmuff variables to your liking, e.g., binding `*print-length*` to nil.

2 Answers

+3 votes
by

I'd just use Transit.

+3 votes
by

Probably the most widely used option is nippy, https://github.com/taoensso/nippy. A newer alternative is deed, https://github.com/igrishaev/deed. Another alternative is https://github.com/clojure/data.fressian.

For most Clojure data, you can still use edn, but I agree that writing readable edn is trickier than it should be. Here is the function that I use:

(defn write-edn [w obj]
  (binding [*print-length* nil
            *print-level* nil
            *print-dup* false
            *print-meta* false
            *print-readably* true

            ;; namespaced maps not part of edn spec
            *print-namespace-maps* false

            *out* w]
    (pr obj)))

;; usage
(require '[clojure.java.io :as io])
(with-open [w (io/writer "my-file.edn")]
  (write-edn w {:foo :bar}))
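
For reading the file back, `clojure.edn/read` expects a `java.io.PushbackReader`. The following is a minimal sketch of the full round trip, not a definitive implementation: `read-edn`, the temporary file, and the sample map are illustrative assumptions, and `write-edn` is repeated only so the snippet runs on its own.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])
(import '(java.io File PushbackReader))

;; Same function as above, repeated so this snippet is self-contained.
(defn write-edn [w obj]
  (binding [*print-length* nil
            *print-level* nil
            *print-dup* false
            *print-meta* false
            *print-readably* true
            *print-namespace-maps* false
            *out* w]
    (pr obj)))

;; Hypothetical helper: reads a single EDN value from a file.
(defn read-edn [file]
  (with-open [r (PushbackReader. (io/reader file))]
    (edn/read r)))

;; Sample data shaped like the crawl results in the question.
(def results
  {"https://example.org" {:base-url "https://example.org"
                          :crawled {"/blog" {:relevant true :status 200}}}})

;; Write to a temporary file, then read it back.
(def tmp (File/createTempFile "crawl-results" ".edn"))
(with-open [w (io/writer tmp)]
  (write-edn w results))

(def restored (read-edn tmp))

(= results restored)                        ; the map survived the round trip
(contains? restored "https://example.org")  ; this site was already crawled
```

Using `clojure.edn/read` rather than `clojure.core/read` keeps the reader restricted to data, which matters if the file could ever come from an untrusted source.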
by
Thanks for listing the libs and confirming my impression that writing edn should be easier. Still, I wonder why `clojure.edn` does not support writing. Maybe because it's a complex task, as people would pass basically anything to the write function in the hope that valid edn falls out.
In the comment I linked in the question, @alexmiller said: "Whether to support edn printing or print/read roundtrip in a stronger way is an item under consideration for work in 1.11.".
It didn't happen, but there's hope.
by
still high on our list of things to work on. I think we would be approaching this from the perspective of improving the clojure print system (which is global) to allow more localized print control and supporting an edn-safe opt-in printer.
by
Great to hear, thanks!