
0 votes
in Clojure by
edited by

Hello,

I am trying to parse a 50MB CSV file: ~2500 rows, ~5500 columns, one column of strings (dates as yyyy-mm-dd) and the rest floats with lots of empty points. I need to be able to access all the data, so I would like to realize the full file, which should be possible at that size.

I've tried a few options from:

;; requires [clojure.java.io :as io] and [clojure.data.csv :as csv]
(with-open [rdr (io/reader path)] (doall (csv/read-csv rdr)))

to slightly more manual approaches using line-seq and parsing the strings into numbers myself.
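Roughly, the manual approach looks like this (a sketch; it assumes the date column comes first and `path` is bound as above):

;; sketch of the manual line-seq approach; assumes the date column is first
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn parse-line [line]
  (let [[date & nums] (str/split line #",")]
    (into [date]
          (map #(when-not (str/blank? %) (Double/parseDouble %)))
          nums)))

(with-open [rdr (io/reader path)]
  (doall (map parse-line (line-seq rdr))))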

My JVM memory usage after a single slurp goes up by ~100MB, 2x the file size. On parsing the data it goes up 1-2GB depending on how it's done. If I open and parse the file several times into the same variable, memory usage keeps going up and I eventually get a memory error and the program fails. (I understand that looking at the task manager isn't the best way to find memory leaks, but the fact is the program fails, so there is a leak somewhere.)

What is the right way of opening the file? My final use case is I'll be getting a new file every day and I want a server application to open the file and crunch data every day without running out of memory and needing to restart the server.

Edit: for comparison, reading this file with Python pandas consumes about 100MB of memory, and subsequent re-reading of the file doesn't keep increasing memory usage.

Thanks a lot!

2 Answers

+1 vote
by
selected by
 
Best answer

You need to custom-parse your fields into smaller, more compact types. If you're storing a lot of repeated strings, use string canonicalization so that repeated values share a single reference; the typical way to do this is to run each field through a string pool as you parse.
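For illustration, a minimal string pool could look like this (a sketch, not the implementation used in the libraries mentioned below):

;; minimal sketch: intern parsed fields so repeated values share one reference
(defn ->string-pool []
  (let [pool (java.util.HashMap.)]
    (fn [^String s]
      (or (.get pool s)
          (do (.put pool s s) s)))))

;; run every parsed field through the pool while reading
(comment
  (let [intern! (->string-pool)]
    (with-open [rdr (io/reader path)]
      (->> (csv/read-csv rdr)
           (mapv #(mapv intern! %))))))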

I posted a short demonstration project here that goes about this two ways:
- using spork.util.table (my old table-munging library), which is based on persistent structures and uses the techniques I mentioned
- using tech.ml.dataset, a recent effort that leverages the TableSaw table implementation for fast, memory-efficient structures (mutable, but with a COW implementation for persistent semantics).

Both solutions readily handle a 300MB TSV with the default heap (about 2GB), although tech.ml.dataset is significantly more memory-efficient in practice.

I also put a version in test that replicates your input (to a degree; I guessed at how many empties, at 1/3). It shows ways of using spork, tech.ml (currently fails when trying to parse dates, though), and lastly doing it manually with plain Clojure functions. If you want minimal memory footprint (and performance), then parsing to primitives is a necessity. Unfortunately, Java collections are typically boxed (aside from arrays), so you need something like fastutil or another primitive-backed collection. Clojure's primitive-backed vectors help somewhat, but they still incur reference overhead due to the trie of arrays (primitives live only at the leaves). Alternatively, if you're just building a dense array of homogeneous type from the file, you can construct a primitive array and fill it. That would be space-efficient and mechanically sympathetic, particularly if you're using those floats for number crunching and can leverage the dense format.
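As a sketch of the "fill a primitive array" route (assuming the numeric columns land in a dense double array, with empty fields stored as NaN):

;; sketch: parse one row's numeric fields straight into a primitive double array
(defn row->doubles ^doubles [fields]
  (let [n   (count fields)
        out (double-array n)]
    (dotimes [i n]
      (let [^String s (nth fields i)]
        (aset out i (if (.isEmpty s)
                      Double/NaN
                      (Double/parseDouble s)))))
    out))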

The bottom line is that naive parsing, particularly when retaining references to strings or boxed objects, will require an exorbitant heap, since Java objects are memory-hungry. As Andy mentioned, newer JVMs can alleviate this a bit with compact strings (which also shrink string memory, though at the JVM level rather than through shared references; really nice).

Note: this only applies if you have to retain the data. If you can reduce over it or otherwise stream-compute, you can likely get away with naive parsing for arbitrary datasets (e.g. larger than memory) without busting the heap, since the JVM will garbage-collect old references right after you process them.
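For example, a streaming pass that never retains the rows might look like this (sketch; the per-row computation is just a placeholder):

;; sketch: reduce over the rows without holding onto them,
;; so each row can be garbage-collected after it's processed
(with-open [rdr (io/reader path)]
  (reduce (fn [total row]
            ;; placeholder computation: count non-empty fields
            (+ total (count (remove #(.isEmpty ^String %) row))))
          0
          (csv/read-csv rdr)))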

There are some other useful libs for this kind of task: iota,
and panthera, the recent pandas wrapper built on libpython-clj.

A note on memory performance: the JVM by default won't release memory back to the system, so it will appear to "grow" up to the limit set by -Xmx (the default is 2GB, I think) and then appear to stay there (even after a GC cycle). If you want to know what's actually being used vs. reserved, you need to attach a profiler (like jvisualvm) and look at the heap usage stats.
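If you just want a quick number from the REPL (a rough check; a profiler is still the better tool):

;; reserved vs. actually-used heap, in MB
(let [rt (Runtime/getRuntime)
      mb #(quot % (* 1024 1024))]
  {:max-mb      (mb (.maxMemory rt))
   :reserved-mb (mb (.totalMemory rt))
   :used-mb     (mb (- (.totalMemory rt) (.freeMemory rt)))})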

by
Thank you, both the explanation and the minimal example are very helpful! I'll look into the libraries too. I'm trying to move from Python to Clojure for much of my data processing needs in production (as opposed to exploration); this makes me appreciate all the work behind e.g. pandas!
by
Yea, there's an entire initiative right now with [libpython-clj](https://github.com/cnuernber/libpython-clj) that is aimed exactly at people like you.  It's functional now, but the devs are actively working on getting the user experience and compatibility improved, specifically to support "just wrapping" libraries like pandas, numpy, etc.  Same author as tech.ml.dataset; you may find it helpful.  There's a dev channel on the [clojurians zulip ](https://clojurians.zulipchat.com/#narrow/stream/215609-libpython-clj-dev) if you're interested.

IMO, this has been a weak spot for clojure in the past, but it's getting rectified.  A lot of folks take a "sequence of maps" approach - which is great from an API standpoint since you can just leverage all the seq/transducer libraries in a very natural workflow.  However, the data representation is severely bloated and just isn't practical for even moderate sized files.  A lot of my work with spork.util.table resulted after running into these problems, as well as seeing libraries like R's datatable just plow through similar size inputs.  I think tech.ml.dataset is on the right track though (definitely on the space and speed frontier), since it leverages some clever work in TableSaw (which is trying to be like pandas for the JVM).
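For context, the "sequence of maps" shape is roughly this (sketch), which is why it blows up: a 2500 x 5500 file turns into ~13.75 million map entries on top of the field strings themselves.

;; sketch of the seq-of-maps representation
(defn rows->maps [[header & rows]]
  (map #(zipmap header %) rows))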
by
Thank you, I'm aware of libpython-clj and it's quite exciting. And you're right, my very first approach was the sequence of maps: with 5500 columns that was blowing up immediately from a memory standpoint. Had a look at spork, very interesting stuff!
by
Final comment, in case you find it useful:  I did some testing with clojure.data.csv, along with the string pooling I showed and tablesaw's implementation (it uses a bytemapdictionary that promotes to shortmap with some interesting properties).

In general, naive string pooling and tablesaw's implementation appear to perform about the same.  If you pool every entry while parsing the CSV, you get about 4x compression.  On my test data set, a 2,850,779-line, 268MB TSV file, on an i7 with a 4GB heap, I busted the heap and got a GC error when trying to realize the naive data.csv seq into a vector.  Using string pooling (still somewhat naive, but trying to cache repeated references), I'm able to build the vector in 45s and use about 800MB of memory.  Playing with pool size and bounds can provide minor gains, but the defaults seem pretty decent.

https://gist.github.com/joinr/050a536b7ac01b50ae3dfa00cb7e5a74

Streaming and building a more efficient cached structure is probably better, but if you can afford the ~0.8GB heap usage and about a minute of read time, maybe it's tenable for your use case.  Obviously, compression will vary with the dataset (e.g. if you have a very sparse set of categorical data, you'll do better).
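A bounded pool (hypothetical sketch, not the spork implementation) just caps how many distinct strings get cached, e.g.:

;; hypothetical bounded pool: clear the cache once it exceeds `bound`
;; so high-cardinality columns can't grow it without limit
(defn ->bounded-pool [bound]
  (let [pool (java.util.HashMap.)]
    (fn [^String s]
      (when (> (.size pool) bound)
        (.clear pool))
      (or (.get pool s)
          (do (.put pool s s) s)))))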
by
thank you so much Tom!
+2 votes
by

Every field in a CSV file becomes a separate Clojure/Java string in memory after parsing. Every Java string in JDK 8 requires 24 bytes for the String object plus 16 bytes for the backing array object, plus 2 bytes per char (strings are stored in memory as UTF-16, 2 bytes per char, even if the text is ASCII). That fixed overhead of 40 bytes per field can easily dwarf the character data itself when fields are short, and it is multiplied by however many fields your CSV file has. If you use JDK 9 or later, compact strings enable a memory optimization of 1 byte per char when a field contains only ASCII chars, but that does not reduce the 40 bytes of per-string/per-field overhead.
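As a back-of-the-envelope check against the file in the question (an estimate; actual field lengths and counts will shift the numbers):

;; ~2500 rows x ~5500 columns, ~40 bytes of String + array overhead per field,
;; before counting the characters themselves
(let [rows 2500 cols 5500 overhead 40]
  (/ (* rows cols overhead) 1024.0 1024.0))
;; => roughly 525 MB of per-field overhead alone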

by
thanks, helps me understand the memory usage!
...