One problem with the clojure.data.csv library is that it is built on lazy sequences, which can lead to inefficiencies when processing large amounts of data. For example, even before any transformation is applied, baseline parsing of 1 GB of CSV takes about 50 seconds on my machine, whereas other parsers available on the JVM can parse the same quantity of data in under 4 seconds.
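For reference, the lazy path I'm measuring is exercised by something like the following; the file name is a placeholder and the cost only shows up once the lazy sequence returned by read-csv is actually consumed:

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; Realise every row of the lazy sequence returned by read-csv.
;; "big.csv" is a placeholder; nothing is parsed until the
;; sequence is consumed, e.g. by count.
(with-open [r (io/reader "big.csv")]
  (count (csv/read-csv r)))
```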
I'd like to discuss how we might port clojure.data.csv to a reducer/transducer model for improved performance and resource handling. Broadly speaking, I think there are a few options:
- Implement this as a secondary, alternative API in c.d.csv, leaving the existing API and implementation as-is for legacy users.
- Replace the API entirely with no attempt at retaining backwards compatibility.
- Retain the same public API contracts while reimplementing them underneath in terms of reducers/transducers: use transducers internally, with clojure.core/sequence preserving the current parse-csv lazy-seq contract, while also exposing a new pure transducer/reducer based API for users who don't need a lazy-seq based implementation (a rough sketch follows this list).
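To make the third option concrete, here is a minimal sketch of the shape I have in mind. The names (parse-line-xf, parse-csv-lazy, row-count) are my own placeholders, not clojure.data.csv's API, and the line-oriented splitting is deliberately naive: a real port would have to carry quoting state across lines, since quoted fields can contain commas and newlines.

```clojure
(ns csv-xf-sketch
  (:require [clojure.string :as str]
            [clojure.java.io :as io]))

;; Hypothetical transducer from lines to vectors of fields.
;; Naive on purpose: splits on commas only, no quoting support.
(def parse-line-xf
  (map #(str/split % #"," -1)))

;; Legacy-style entry point: clojure.core/sequence wraps the transducer
;; so callers still receive a lazy sequence, as option 3 proposes.
(defn parse-csv-lazy [reader]
  (sequence parse-line-xf (line-seq reader)))

;; New-style use: reduce eagerly over the same transducer, avoiding the
;; per-element overhead of building lazy sequences.
(defn row-count [reader]
  (transduce parse-line-xf (completing (fn [n _] (inc n))) 0 (line-seq reader)))

(comment
  (with-open [r (io/reader "data.csv")]
    (doall (parse-csv-lazy r)))
  (with-open [r (io/reader "data.csv")]
    (row-count r)))
```

The point is that callers who want the old behaviour still get a lazy seq, while new callers can compose the parsing transducer with their own and reduce without materialising intermediate sequences.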
1 and 3 are essentially the same idea, except that with 3 users get the benefit of a faster underlying implementation. There may also be other options.
I think 3, if possible, would be the best option.
Options 1 and 2 raise the question of making no attempt at backwards compatibility or at improving the experience for legacy users.
Before delving into the details of the reducer/transducer implementation, I'm curious what the core team think of exploring this further.