Hello,
I'm building a web crawler in Clojure that records its results in a map of maps. The data structure basically looks like this:
{"https://example.org" {:base-url "https://example.org"
                        :crawled {"/blog" {:relevant true :status 200}
                                  "/contact" {:relevant false :status 200}}}
 "https://foo.com" {:base-url "https://foo.com"
                    :crawled {"/home" {:relevant true :status 200}}}}
The web crawler is fed a list of websites. However, for the sake of efficiency, I want to avoid crawling websites that have already been crawled. Therefore, I was looking for a way to persist the above data structure so that it can be consulted in subsequent runs to check whether a given website was already crawled. (Of course, I have other reasons to store the results too, e.g. to run various analyses on the captured data.)
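For illustration, the "already crawled" check against such a map could be sketched as follows. The function names (crawled-site?, crawled-path?, pending-sites) and the sample data are hypothetical and only mirror the structure shown above.

```clojure
;; Hypothetical sample mirroring the results map described above.
(def results
  {"https://example.org" {:base-url "https://example.org"
                          :crawled {"/blog"    {:relevant true  :status 200}
                                    "/contact" {:relevant false :status 200}}}
   "https://foo.com"     {:base-url "https://foo.com"
                          :crawled {"/home" {:relevant true :status 200}}}})

(defn crawled-site?
  "True when base-url already has an entry in the results map."
  [results base-url]
  (contains? results base-url))

(defn crawled-path?
  "True when path was already crawled for base-url.
  Returns false for unknown sites, because (contains? nil path) is false."
  [results base-url path]
  (contains? (get-in results [base-url :crawled]) path))

(defn pending-sites
  "Keeps only the sites that have no entry in the results map yet."
  [results sites]
  (remove #(crawled-site? results %) sites))
```

Calling (pending-sites results ["https://foo.com" "https://bar.net"]) would then leave only the site that has no entry yet.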
First, I thought EDN would be a perfect solution. To my surprise, clojure.edn does not have write functions. A lot of advice online says to use pr-str. However, using pr-str has caveats, as pointed out in https://nitor.com/en/articles/pitfalls-and-bumps-clojures-extensible-data-notation-edn (e.g. possible truncation when *print-length* is set). It is also entirely possible to generate invalid EDN, as pointed out in this comment by @alexmiller: https://ask.clojure.org/index.php/10278/should-pr-str-on-a-function-produce-valid-edn-syntax?show=10282#c10282.
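The truncation caveat can be reproduced in a few lines. The sketch below also shows one possible mitigation, which is to pin the printer's dynamic vars around pr-str. The helper name edn-str and the sample data are hypothetical, and the set of vars bound here is an assumption about what is relevant, not an exhaustive guarantee of valid EDN.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical sample data.
(def data {:paths ["/a" "/b" "/c" "/d"]})

;; With *print-length* bound to 2, pr-str silently cuts the vector off
;; after two elements and prints "..." in place of the rest.
(def truncated
  (binding [*print-length* 2]
    (pr-str data)))

(defn edn-str
  "Prints x with the printer vars pinned, so that settings made elsewhere
  (e.g. by a REPL or editor integration) cannot truncate or decorate the output."
  [x]
  (binding [*print-length*   nil    ; no element limit
            *print-level*    nil    ; no nesting limit
            *print-meta*     false  ; no ^{...} metadata prefixes
            *print-dup*      false
            *print-readably* true]
    (pr-str x)))

;; The pinned version reads back as an equal value.
(def round-tripped (edn/read-string (edn-str data)))
```

This does not address values that have no EDN representation at all (functions, arbitrary Java objects), which still print as unreadable forms.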
Despite https://clojuredocs.org/clojure.edn claiming that it is "designed to be used in a similar way to JSON or XML", I have the impression that it is geared more towards EDN files written by hand (i.e. configuration files).
Looking for alternatives, I stumbled upon Transit. However, it is mainly intended "as a wire protocol for transferring data between applications". And the README further says:
If storing Transit data durably, readers and writers are expected to use the same version of Transit and you are responsible for migrating/transforming/re-storing that data when and if the transit format changes.
Because in Clojure, data is code and code is data and because of the existence of EDN, I really thought persisting simple data structures like the above would be a breeze in Clojure.
Instead, with my current knowledge, I see only these suboptimal options:
- Hope that pr-str produces valid EDN without loss in my particular case
- Build a custom XML serialization
- Use Transit
- Use JSON?! (what about keywords etc.?!)
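For reference, the first option in the list above could be sketched roughly like this: print with the printer vars pinned, verify that the string reads back as an equal value, and only then write the file. The names save-results! and load-results are hypothetical, and the round-trip check is an assumed safeguard rather than an established idiom.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

(defn save-results!
  "Writes results to file as EDN. Fails before touching the file if the
  printed form cannot be read back or does not read back as an equal value."
  [file results]
  (let [s (binding [*print-length*   nil
                    *print-level*    nil
                    *print-meta*     false
                    *print-dup*      false
                    *print-readably* true]
            (pr-str results))]
    ;; edn/read-string itself throws on unreadable output such as #object[...]
    (when-not (= results (edn/read-string s))
      (throw (ex-info "Results do not round-trip through EDN"
                      {:file (str file)})))
    (spit file s)))

(defn load-results
  "Reads the results map back, or returns {} when the file does not exist yet."
  [file]
  (let [f (io/file file)]
    (if (.exists f)
      (edn/read-string (slurp f))
      {})))
```

A crawler run would then start with something like (load-results "results.edn") and finish with (save-results! "results.edn" results), where the file name is only an example.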
Is there a more idiomatic way to store and read back simple Clojure data structures?