
0 votes

Imagine we have the following CSV file:

this is,"a badly" quoted, file

When trying to parse this file with clojure.data.csv/read-csv, I get the following exception:

{:type java.lang.Exception
 :message "CSV error (unexpected character:  )"
 :at [clojure.data.csv$read_quoted_cell invokeStatic "csv.clj" 37]}
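
For reference, a minimal way to reproduce this at a REPL might look like the sketch below. It assumes clojure.data.csv is on the classpath; `parse-error-message` is a hypothetical helper name used only for illustration.

```clojure
(require '[clojure.data.csv :as csv])

(defn parse-error-message
  "Hypothetical helper: returns the exception message if parsing fails, else nil."
  [line]
  (try
    ;; read-csv returns a lazy sequence, so force it inside the try
    (doall (csv/read-csv line))
    nil
    (catch Exception e
      (.getMessage e))))

;; The malformed line from above; expected to yield the CSV error message.
(parse-error-message "this is,\"a badly\" quoted, file")
```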

This file is clearly malformed, but I've seen files like this in the wild. It would be nice if read-csv handled extra content after the quoted portion by appending it to the same cell, parsing this line to:

["this is" "a badly quoted" " file"]

Potential problem with this proposal:

If there's a separator inside the quotes, this becomes harder to interpret. For example,

this is,"a, badly" quoted, file

could be parsed to either

["this is" "a, badly quoted" " file"]

or

["this is" "\"a" " badly\" quoted" " file"]

While this second interpretation seems improbable to me, I'm not sure what the "best effort" interpretation strategy is in this case.
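
To make the first interpretation concrete, here is a rough sketch of a lenient single-line splitter. It is not part of clojure.data.csv; `lenient-split` is a hypothetical name, and the sketch assumes a comma separator, a double-quote quote character, and no line breaks inside cells.

```clojure
(defn lenient-split
  "Hypothetical helper, not part of clojure.data.csv: splits one CSV line,
  keeping any text that follows a closing quote as part of the same cell."
  [^String line]
  (loop [chars      (seq line)
         cell       (StringBuilder.)
         cells      []
         in-quotes? false]
    (if-let [[ch & more] chars]
      (cond
        ;; a doubled quote inside a quoted run is an escaped literal quote
        (and in-quotes? (= ch \") (= (first more) \"))
        (recur (next more) (.append cell \") cells true)

        ;; any other quote opens or closes a quoted run and is dropped
        (= ch \")
        (recur more cell cells (not in-quotes?))

        ;; a separator outside quotes ends the current cell
        (and (= ch \,) (not in-quotes?))
        (recur more (StringBuilder.) (conj cells (str cell)) false)

        ;; everything else is ordinary cell content
        :else
        (recur more (.append cell (char ch)) cells in-quotes?))
      ;; end of input: flush the last cell
      (conj cells (str cell)))))

(lenient-split "this is,\"a badly\" quoted, file")
;; => ["this is" "a badly quoted" " file"]
```

Under these assumptions a separator inside the quotes stays in its cell, so the sketch always picks the first interpretation and never the second.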

2 Answers

+1 vote
Best answer

We're not planning to support poorly formatted CSV data. It's hard enough to support well-formatted CSV data. There are probably other (Java) libraries out there that can handle or clean up input like this, but it is out of scope for data.csv.

For what it's worth: I ran into this issue working with CSV files exported from a vendor web application. At first I was a bit frustrated that the CSV could not be consumed by Clojure yet seemed to load fine in Excel and Python. But then I noticed that these more permissive consumers seem to elide the first unescaped quote, which would have caused me problems down the line. So erroring early arguably saved me time in the long run. I have reported the malformed CSV problem back to the vendor of the web application. The only thing I think Clojure could have done differently is give me more information about where in the character stream the malformed pattern was detected. And sorry, maybe I missed this in my reading of the stack trace.

I'm going to +1 the request to report, at a minimum, where in the parsed file the error was encountered. Looking at the Java API hierarchy for Reader, FilterReader, and PushbackReader, I see this is no easy task. Nevertheless, that really should be in the purview of data.csv: to tell the reader where the malformation is, at least the file name, row number, and column. Unexpected character WHERE?

That would have saved me a lot of time looking for a tool to tell me just that. Once I found one, I saw the error was unescaped quotes inside a quoted string, and I then had to find a separate tool to fix it. That problem is so common in the dataset I'm currently working with that it would qualify as the one thing I wish data.csv did scrub, or at least diagnose.
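
As a stopgap, the kind of diagnosis described here can be sketched outside data.csv with a small scanner over the raw text. `find-bad-quote` is a hypothetical name, not a library function, and the sketch assumes a comma separator, a double-quote quote character, and LF or CRLF line endings.

```clojure
(defn find-bad-quote
  "Hypothetical diagnostic, not part of clojure.data.csv. Returns the 1-based
  :line and :column of the first character that follows a closing quote but is
  neither a separator, a quote, nor a line ending; returns nil if none is found."
  [^String text]
  (loop [chars (seq text)
         state :cell-start
         line  1
         col   1]
    (when-let [[ch & more] chars]
      (let [newline?  (= ch \newline)
            next-line (if newline? (inc line) line)
            next-col  (if newline? 1 (inc col))]
        (case state
          :cell-start
          (recur more
                 (cond (= ch \")                :quoted
                       (or (= ch \,) newline?) :cell-start
                       :else                    :unquoted)
                 next-line next-col)

          :unquoted
          (recur more
                 (if (or (= ch \,) newline?) :cell-start :unquoted)
                 next-line next-col)

          :quoted
          (recur more (if (= ch \") :after-quote :quoted) next-line next-col)

          :after-quote
          (cond
            ;; a second quote means the pair was an escaped quote
            (= ch \")                (recur more :quoted next-line next-col)
            ;; separator or line ending: the quoted cell closed cleanly
            (or (= ch \,) newline?)  (recur more :cell-start next-line next-col)
            ;; carriage return of a CRLF line ending
            (= ch \return)           (recur more :after-quote next-line next-col)
            ;; anything else is the unexpected character
            :else                    {:line line :column col :char ch}))))))

(find-bad-quote "this is,\"a badly\" quoted, file")
;; => {:line 1, :column 18, :char \space}
```

Line numbers here count physical lines, including line breaks inside quoted cells, which is usually what is needed to find the spot in an editor.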

0 votes

You can parse a file whose fields are never quoted by specifying a different :quote character to read-csv. Then any actual quote marks become part of the string that read-csv gathers from the field. I suppose this technique is more often useful with tab-separated files than comma-separated files. Some producers of bizarre comma-separated files can produce tab-separated files that are, in a formal sense, equally bizarre but providentially lacking in raw values containing literal tabs, so they work out better for all sorts of downstream processing.
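
A minimal sketch of this technique, assuming clojure.data.csv is on the classpath and that the pipe character never occurs in the data (the choice of \| is only an illustration):

```clojure
(require '[clojure.data.csv :as csv])

;; With \| as the quote character, double quotes are ordinary characters,
;; so the line from the question parses without error.
(csv/read-csv "this is,\"a badly\" quoted, file" :quote \|)
;; => (["this is" "\"a badly\" quoted" " file"])

;; The same idea applied to tab-separated input.
(csv/read-csv "id\t\"name\"\n1\t\"Ann\"" :separator \tab :quote \|)
;; => (["id" "\"name\""] ["1" "\"Ann\""])
```

The trade-off is that separators inside quoted values are no longer protected, so a cell such as "a, badly" would be split at the comma.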