0 votes

Imagine we have the following CSV file:

this is,"a badly" quoted, file

When I try to parse this file with clojure.data.csv/read-csv, I get the following exception:

{:type java.lang.Exception
 :message "CSV error (unexpected character:  )"
 :at [clojure.data.csv$read_quoted_cell invokeStatic "csv.clj" 37]}
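
For anyone who wants to reproduce this at the REPL, here is a minimal sketch. It assumes org.clojure/data.csv is on the classpath; the names bad-line and result are only illustrative.

```clojure
(require '[clojure.data.csv :as csv])

;; The malformed line from above, as a Clojure string.
(def bad-line "this is,\"a badly\" quoted, file")

;; read-csv returns a lazy sequence, so doall forces the parse
;; while we are still inside the try block.
(def result
  (try
    (doall (csv/read-csv bad-line))
    (catch Exception e
      (.getMessage e))))
```

With the default :quote character, result should end up holding the exception message rather than any parsed rows.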

This file is clearly malformed, but I've seen files like this in the wild, so it would be nice if read-csv handled extra content after the quoted portion by parsing the line to:

["this is" "a badly quoted" " file"]

Potential problem with this proposal:

If there's a separator inside the quotes, this becomes harder to interpret. For example,

this is,"a, badly" quoted, file

could be parsed to either

["this is" "a, badly quoted" " file"]

or

["this is" "\"a" " badly\" quoted" " file"]

While the second interpretation seems improbable to me, I'm not sure what the "best effort" interpretation strategy should be in this case.

2 Answers

0 votes
Best answer

We're not planning to support poorly formatted CSV data. It's hard enough to support well-formatted CSV data. There are probably other (Java) libraries out there that can handle or clean up input like this, but it is out of scope for data.csv.

0 votes

You can parse a file whose fields are never quoted by passing read-csv a different :quote character, one that does not occur in the data. Any actual quote marks then become part of the string that read-csv gathers from the field. I suppose this technique is more often useful with tab-separated files than with comma-separated files. Some producers of bizarre comma-separated files can also produce tab-separated files that are, in a formal sense, equally bizarre, but whose raw values providentially never contain literal tabs, so they work out better for all sorts of downstream processing.
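
As a rough sketch of what I mean, using the line from the question: the choice of \| as the quote character is my own assumption (any character that never appears in the data would do), and line and rows are just illustrative names.

```clojure
(require '[clojure.data.csv :as csv])

;; The malformed line from the question.
(def line "this is,\"a badly\" quoted, file")

;; Assumption: the pipe character never occurs in the data, so naming it
;; as :quote effectively turns quote handling off for this input.
(def rows (doall (csv/read-csv line :quote \|)))

rows
;; => (["this is" "\"a badly\" quoted" " file"])
```

Note that the double quotes stay in the field, so you would have to strip them yourself afterwards. With quoting effectively off, a separator inside the quotes also splits the field, which gives the second interpretation from the question rather than the first. For a tab-separated file you would pass :separator \tab as well.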