You need to custom-parse your fields into smaller, more compact types. If you're storing a lot of repeated strings, use string canonicalization so that identical values share a single reference. The typical way to do this is to run each string through a string pool as you parse it.
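For illustration, here's a minimal sketch of that idea (the names `make-pool` and `canonicalize` are hypothetical, and it assumes single-threaded parsing):

```clojure
;; Minimal string-pool sketch: the first occurrence of a string becomes
;; the canonical instance; later duplicates return that shared reference.
(defn make-pool []
  (atom {}))

(defn canonicalize
  "Return the pooled (shared) instance of s, adding it on first sight."
  [pool ^String s]
  (or (get @pool s)
      (do (swap! pool assoc s s)
          s)))

;; Usage: run every parsed field through the pool, so repeated values
;; (e.g. category columns) all point at one String instance.
(comment
  (let [pool (make-pool)]
    (mapv #(canonicalize pool %) ["NY" "CA" "NY" "NY"])))
```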
I posted a short demonstration project here that goes about this in two ways:
- using spork.util.table (my old table munging library), which is based on persistent structures and uses the techniques I mentioned
- using tech.ml.dataset, a recent effort that leverages the TableSaw table implementation for fast, memory-efficient structures (mutable, but with a copy-on-write implementation to provide persistent semantics); a quick usage sketch follows below
Both solutions readily handle a 300MB TSV with a default heap (about 2GB), although tech.ml.dataset is significantly more memory-efficient in practice.
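As a hedged sketch of the tech.ml.dataset side (the path is a placeholder, and the exact option key for the delimiter may vary across versions, so check the library's docs):

```clojure
(require '[tech.ml.dataset :as ds])

;; "the-file.tsv" is a placeholder path; :separator is assumed to be the
;; option for a tab-delimited file in the version you're on.
(def data
  (ds/->dataset "the-file.tsv" {:separator \tab}))

;; Inspect what was parsed.
(ds/column-names data)
```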
I also put a version in test that replicates your test input (to a degree; I guessed on how many empties, at 1/3). It shows ways of using spork, tech.ml (currently fails when trying to parse dates, though), and lastly doing it manually with plain Clojure functions.

If you want a minimal memory footprint (and performance), then parsing to primitives is a necessity. Unfortunately, Java collections are typically boxed (aside from arrays), so you need something like fastutil or another primitive-backed collection. Clojure's primitive-backed vectors help somewhat, but they still incur reference overhead due to the trie of arrays (arrays are only at the leaves). That, or if you're just building a dense array of homogeneous type from the file, you can construct a primitive array and fill it. This would be space efficient and mechanically sympathetic, particularly if you're using those floats for number crunching and can leverage the dense format.
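A minimal sketch of the fill-a-primitive-array approach, assuming one numeric value per line and a row count you know up front (`parse-doubles` is just an illustrative helper name):

```clojure
(require '[clojure.java.io :as io])

(defn parse-doubles
  "Read a column of numbers (one per line) into a primitive double array.
   n is the number of rows, assumed known ahead of time."
  [path n]
  (with-open [rdr (io/reader path)]
    (let [^doubles arr (double-array n)]
      (loop [i 0, lines (line-seq rdr)]
        (if (and (< i n) (seq lines))
          (do (aset arr i (Double/parseDouble (first lines)))
              (recur (inc i) (rest lines)))
          arr)))))
```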
The bottom line is that naive parsing, particularly when you retain references to strings or boxed objects, will require an exorbitant heap, since Java objects are memory hungry. As Andy mentioned, newer JVMs can alleviate this a bit with compressed strings (i.e., string deduplication, which effectively does the same thing as string canonicalization, but at the JVM level; really nice).
Note: this only applies if you have to "retain" the data. If you can reduce over it or otherwise compute in a streaming fashion, you can likely get away with naive parsing for arbitrarily large datasets (e.g. larger than memory) without busting the heap, since the JVM will garbage-collect old references right after you process them.
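For example, a streaming sum over one column (the column index and the header-skip are assumptions about your file's shape):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Sum a numeric column line-by-line; no row is retained, so the GC can
;; reclaim each parsed line as soon as it has been folded in.
(defn sum-column [path col-idx]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (drop 1)                                   ;; skip the header row
         (map #(nth (str/split % #"\t") col-idx))
         (map #(Double/parseDouble %))
         (reduce + 0.0))))
```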
There are some other useful libs for this kind of task: iota,
and panthera, the recent pandas wrapper built on libpython-clj.
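A hedged sketch of iota (it memory-maps the file and exposes it as a foldable collection of lines that pairs well with clojure.core.reducers; blank lines come back as nil, hence the filter — see iota's README for the exact API):

```clojure
(require '[iota :as iota]
         '[clojure.core.reducers :as r]
         '[clojure.string :as str])

;; Count the non-blank fields across the whole file without realizing
;; it in memory.
(defn count-nonempty-fields [path]
  (->> (iota/vec path)          ;; memory-mapped vector of lines
       (r/filter identity)      ;; drop nils from blank lines
       (r/map #(count (remove str/blank? (str/split % #"\t"))))
       (r/fold +)))
```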
A note on memory performance: the JVM by default won't release memory back to the system, so it will appear to "grow" up to the limit set by -Xmx (the default is about 2GB, I think), and it will appear to stay there (even after a GC cycle). If you want to know what's actually being used vs. reserved, you need to attach a profiler (like jvisualvm) and look at the heap-usage stats.
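You can also get a rough used-vs-reserved picture straight from the REPL with the standard java.lang.Runtime API (jvisualvm shows the same numbers with a nicer timeline); the example output is illustrative only:

```clojure
;; Report used, reserved (committed), and max heap in MB.
(defn heap-stats []
  (let [rt (Runtime/getRuntime)
        mb #(quot % (* 1024 1024))]
    {:used-mb     (mb (- (.totalMemory rt) (.freeMemory rt)))
     :reserved-mb (mb (.totalMemory rt))
     :max-mb      (mb (.maxMemory rt))}))

(comment
  (heap-stats)) ;; e.g. {:used-mb 312, :reserved-mb 2048, :max-mb 2048}
```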