
0 votes
in data.csv by

One problem with the clojure.data.csv library is that it's built upon lazy sequences, which can lead to inefficiencies when processing large amounts of data. For example, even before any transformation is done, the baseline parsing of 1 GB of CSV takes about 50 seconds on my machine, whereas other parsers available on the JVM can parse the same quantity of data in less than 4 seconds.

I'd like to discuss how we might port clojure.data.csv to a reducer/transducer model for improved performance and resource handling. Broadly speaking, I think there are a few options:

  1. Implement this as a secondary alternative API in c.d.csv leaving the existing API and implementation as is for legacy users.
  2. Replace the API entirely with no attempt at retaining backwards compatibility.
  3. Retain the same public API contracts whilst reimplementing them underneath in terms of reducers/transducers: use transducers internally, with sequence preserving the current parse-csv lazy-seq contract, whilst also exposing a new pure transducer/reducer based API for users who don't require a lazy-seq based implementation (a rough sketch follows below).
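
A rough sketch of what option 3 could look like. The names parse-line, csv-xf and the naive comma splitting below are hypothetical stand-ins rather than the real clojure.data.csv internals; the point is only that one transducer can back both an eager and a lazy-seq entry point:

    ;; Sketch only: parse-line is a naive stand-in for a real CSV parser
    ;; (no quoting rules); csv-xf and parse-csv here are hypothetical names.
    (ns example.csv-xf
      (:require [clojure.string :as str]
                [clojure.java.io :as io]))

    (defn- parse-line
      "Splits a single CSV line into fields (ignores quoting for brevity)."
      [^String line]
      (str/split line #","))

    (def csv-xf
      "Transducer from lines of CSV text to vectors of fields."
      (map parse-line))

    (defn parse-csv
      "Legacy-style entry point: still returns a lazy seq of rows, but is
      built on the transducer so both call styles share one implementation."
      [lines]
      (sequence csv-xf lines))

    (comment
      ;; Eager, constant-memory row count, streaming straight off a reader:
      (with-open [rdr (io/reader "data.csv")]
        (transduce csv-xf (completing (fn [n _row] (inc n))) 0 (line-seq rdr)))

      ;; Existing lazy-seq consumers keep working:
      (take 2 (parse-csv ["a,b,c" "1,2,3"])))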

Options 1 and 3 are essentially the same idea, except that with 3 existing users also get the benefit of a faster underlying implementation. There may also be other options.

I think 3, if possible, would be the best option.

Options 1 and 2 mean either making no attempt at backwards compatibility, or making no attempt to improve the experience for legacy users.

Before delving into the details of how the reducer/transducer implementation might work, I'm curious what the core team think of exploring this further.

3 Answers

0 votes
by

Comment made by: jonase

Can you share this benchmark? I did some comparisons when I initially wrote the lib and I didn't see such big differences.

I think that the lazy approach is an important feature in many cases where you don't want all those gigabytes in memory.

If we add some non-lazy parsing for performance reasons, I would argue it should be an addition to the public API.

0 votes
by

Comment made by: rickmoynihan

I agree not loading data into memory is a huge benefit, but we shouldn't necessarily conflate that streaming property with laziness/eagerness.

By using reducers/transducers you can still stream through a CSV file row by row and consume a constant amount of memory; e.g. reducing down to a count of rows wouldn't require the rows to be held in memory, even though it is eager. Likewise, if we paired a transducer with a CollReduce-able CSVFile object, you could request a lazy seq of results with sequence, where the parsing itself pays no laziness tax; alternatively you could request that the results are loaded into memory eagerly by transducing into a vector.
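
To make this concrete, here's a rough sketch of the kind of reducible CSV source I mean. csv-file and csv-line->row are made-up names and the comma splitting ignores quoting, so this isn't the current clojure.data.csv code, just an illustration of streaming without laziness:

    (ns example.reducible-csv
      (:require [clojure.string :as str]
                [clojure.java.io :as io]))

    (defn- csv-line->row
      "Naive stand-in for a real CSV parser (ignores quoting)."
      [^String line]
      (str/split line #","))

    (defn csv-file
      "Returns a reducible view of a CSV file: each reduce/transduce opens
      the file, streams parsed rows through the reducing fn, then closes it."
      [path]
      (reify clojure.core.protocols/CollReduce
        (coll-reduce [this f]
          (clojure.core.protocols/coll-reduce this f (f)))
        (coll-reduce [_ f init]
          (with-open [rdr (io/reader path)]
            (reduce f init (map csv-line->row (line-seq rdr)))))))

    (comment
      ;; Eager but constant-memory: count rows without retaining them.
      (reduce (fn [n _row] (inc n)) 0 (csv-file "data.csv"))

      ;; Transduce into a vector when you do want everything in memory
      ;; (here just the first column of every row).
      (into [] (map first) (csv-file "data.csv")))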

Apologies for not providing any benchmark results with this ticket; it was actually Alex Miller who suggested I write this ticket after discussing things briefly with him on Slack, and he'd suggested that I needn't provide the timings because the costs of laziness are well known. Regardless, I'll tidy up the code I used to take the timings and put it into a gist or something, maybe later on today.

0 votes
by
Reference: https://clojure.atlassian.net/browse/DCSV-15 (reported by rickmoynihan)
...