
0 votes
in data.csv by

One problem with the clojure.data.csv library is that it's built upon lazy sequences, which can lead to inefficiencies when processing large amounts of data. For example, even before any transformation is done, the baseline parsing of 1 GB of CSV takes about 50 seconds on my machine, whereas other parsers available on the JVM can parse the same quantity of data in less than 4 seconds.

I'd like to discuss how we might port clojure.data.csv to a reducer/transducer model for improved performance and resource handling. Broadly speaking, I think there are a few options:

  1. Implement this as a secondary alternative API in c.d.csv leaving the existing API and implementation as is for legacy users.
  2. Replace the API entirely with no attempt at retaining backwards compatibility.
  3. Retain the same public API contracts whilst reimplementing them underneath in terms of reducers/transducers: use transducers internally, with sequence preserving the current parse-csv lazy-seq contract, whilst also exposing a new pure transducer/reducer based API for users who don't require a lazy-seq based implementation (a rough sketch follows below).
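
A rough sketch of what option 3 could look like. The names parse-line, csv-xf and the naive comma splitting below are hypothetical stand-ins rather than the real clojure.data.csv internals; the point is only that one transducer can back both an eager and a lazy-seq entry point:

    ;; Sketch only: parse-line is a naive stand-in for a real CSV parser
    ;; (no quoting rules); csv-xf and parse-csv here are hypothetical names.
    (ns example.csv-xf
      (:require [clojure.string :as str]
                [clojure.java.io :as io]))

    (defn- parse-line
      "Splits a single CSV line into fields (ignores quoting for brevity)."
      [^String line]
      (str/split line #","))

    (def csv-xf
      "Transducer from lines of CSV text to vectors of fields."
      (map parse-line))

    (defn parse-csv
      "Legacy-style entry point: still returns a lazy seq of rows, but is
      built on the transducer so both call styles share one implementation."
      [lines]
      (sequence csv-xf lines))

    (comment
      ;; Eager, constant-memory row count, streaming straight off a reader:
      (with-open [rdr (io/reader "data.csv")]
        (transduce csv-xf (completing (fn [n _row] (inc n))) 0 (line-seq rdr)))

      ;; Existing lazy-seq consumers keep working:
      (take 2 (parse-csv ["a,b,c" "1,2,3"])))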

Options 1 and 3 are essentially the same idea, except that with 3 existing users also get the benefit of a faster underlying implementation. There may also be other options.

I think 3, if possible, would be the best option.

Options 1 and 2 mean either making no attempt at backwards compatibility, or making no attempt to improve the experience for legacy users.

Before delving into the details of how the reducer/transducer implementation might work, I'm curious what the core team think of exploring this further.

3 Answers

0 votes
by

Comment made by: jonase

Can you share this benchmark? I did some comparisons when I initially wrote the lib and I didn't see such big differences.

I think that the lazy approach is an important feature in many cases where you don't want all those gigabytes in memory.

If we add some non-lazy parsing for performance reasons, I would argue it should be an addition to the public API.

0 votes
by

Comment made by: rickmoynihan

I agree not loading data into memory is a huge benefit, but we shouldn't necessarily conflate that streaming property with laziness/eagerness.

By using reducers/transducers you can still stream through a CSV file row by row and consume a constant amount of memory; e.g. reducing down to a count of rows wouldn't require the rows to be held in memory, even though it is eager. Likewise, if we paired a transducer with a CollReduce-able CSVFile object, you could request a lazy seq of results with sequence, where the parsing itself pays no laziness tax; alternatively you could request that the results are loaded into memory eagerly by transducing into a vector.
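
To make this concrete, here's a rough sketch of the kind of reducible CSV source I mean. csv-file and csv-line->row are made-up names and the comma splitting ignores quoting, so this isn't the current clojure.data.csv code, just an illustration of streaming without laziness:

    (ns example.reducible-csv
      (:require [clojure.string :as str]
                [clojure.java.io :as io]))

    (defn- csv-line->row
      "Naive stand-in for a real CSV parser (ignores quoting)."
      [^String line]
      (str/split line #","))

    (defn csv-file
      "Returns a reducible view of a CSV file: each reduce/transduce opens
      the file, streams parsed rows through the reducing fn, then closes it."
      [path]
      (reify clojure.core.protocols/CollReduce
        (coll-reduce [this f]
          (clojure.core.protocols/coll-reduce this f (f)))
        (coll-reduce [_ f init]
          (with-open [rdr (io/reader path)]
            (reduce f init (map csv-line->row (line-seq rdr)))))))

    (comment
      ;; Eager but constant-memory: count rows without retaining them.
      (reduce (fn [n _row] (inc n)) 0 (csv-file "data.csv"))

      ;; Transduce into a vector when you do want everything in memory
      ;; (here just the first column of every row).
      (into [] (map first) (csv-file "data.csv")))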

Apologies for not providing any benchmark results with this ticket; it was actually Alex Miller who suggested I write this ticket after discussing things briefly with him on Slack, and he'd suggested that I needn't provide the timings because the costs of laziness are well known. Regardless, I'll tidy up the code I used to take the timings and put it into a gist or something, maybe later on today.

0 votes
by
Reference: https://clojure.atlassian.net/browse/DCSV-15 (reported by rickmoynihan)
...