in data.xml

It would be great to be able to parse UTF-8 encoded files that begin with a byte order mark (BOM), as it would give better native support for XML in the wild.

Currently, a few of these XML files throw a "Content is not allowed in prolog" exception:
http://stackoverflow.com/questions/4569123/content-is-not-allowed-in-prolog-saxparserexception

4 Answers

Comment made by: bendlas

Following your Stack Overflow link, this seems to be related to a couple of Java bugs that are marked as wontfix because of the expectations of existing tools. The recommendation in the tickets is for applications to deal with the BOM themselves.

Since data.xml promises to process XML from raw bytes (because it accepts InputStreams), there is a choice: either discontinue the InputStream interface and require users to pass Readers that correctly handle their input (e.g. https://commons.apache.org/proper/commons-io/javadocs/api-2.2/org/apache/commons/io/input/XmlStreamReader.html), or use a Reader implementation that can do so when creating an input source from a stream.

For ease of maintenance, it's tempting to go with removing the byte-based interface, but I'm open to arguments for why data.xml should deal with this.
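
To make the second option concrete, here is a minimal sketch of handling the BOM at the byte level before a parser ever sees the stream. It is not what data.xml does; the class and method names (`Utf8BomStream`, `skipUtf8Bom`) are hypothetical, and it assumes the input is UTF-8, so BOMs of other encodings are left untouched.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, for illustration only.
final class Utf8BomStream {
    private Utf8BomStream() {}

    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; a single read() call may return fewer.
        int n = 0;
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break; // end of stream
            }
            n += r;
        }

        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: push back whatever was read so the caller sees the original bytes.
        if (!isBom && n > 0) {
            pin.unread(head, 0, n);
        }
        return pin;
    }
}
```

The returned stream can then be handed to whatever creates the input source. A library reader such as the XmlStreamReader linked above covers more encodings than this sketch does.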

Comment made by: featheredtoast

This was more of a suggestion. After reading up on input and input streams, I can understand why this may be out of scope.

I was naive in thinking that handling input via a clojure.java.io/reader would be able to parse an XML file properly, as I was unaware of the BOM issues until I hit the exception. Even though the related JVM fix for BOMs would break backwards compatibility and was thus rejected, it would still be helpful if another underlying parsing library handled the input and BOMs.

At least consider adding a recommended list of readers for those unfamiliar with XML parsing in Java. It is difficult to anticipate these kinds of gotchas for developers unfamiliar with BOMs, readers, and XML (such as myself), especially when the same files pass validation in other languages.
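
For the Reader-based case described in this comment, the BOM has already been decoded into the character U+FEFF by the time the parser reads it. A minimal sketch of dropping that character before parsing is below; the names (`BomSkippingReaderDemo`, `skipBom`, `rootName`) are hypothetical, and the JDK's built-in DOM parser is used only to show that the cleaned Reader parses.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class BomSkippingReaderDemo {
    private BomSkippingReaderDemo() {}

    /** Returns a Reader positioned after a leading U+FEFF character, if one is present. */
    static Reader skipBom(Reader in) throws IOException {
        PushbackReader pr = new PushbackReader(in, 1);
        int c = pr.read();
        // Anything other than a BOM (or end of input) is pushed back unchanged.
        if (c != -1 && c != 0xFEFF) {
            pr.unread(c);
        }
        return pr;
    }

    /** Parses the XML from the Reader and returns the root element's tag name. */
    static String rootName(Reader xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(xml));
        return doc.getDocumentElement().getTagName();
    }
}
```

Wrapping the Reader with `skipBom` before handing it to the parser, for example `rootName(skipBom(reader))`, keeps the parser from seeing the stray character ahead of the prolog.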

Comment made by: bendlas

I'm just leaving this here; it might be a good reference to mention when documenting or changing this: https://github.com/jimpil/clj-bom

Reference: https://clojure.atlassian.net/browse/DXML-45 (reported by alex+import)
...