Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow StaxEventItemReader to auto-detect the input file encoding #4101

Closed
jdomigon opened this issue Apr 21, 2022 · 7 comments
Closed

Allow StaxEventItemReader to auto-detect the input file encoding #4101

jdomigon opened this issue Apr 21, 2022 · 7 comments
Labels
for: backport-to-4.3.x Issues that will be back-ported to the 4.3.x line in: infrastructure type: feature
Milestone

Comments

@jdomigon
Copy link

jdomigon commented Apr 21, 2022

Bug description
StaxEventItemReader uses platform default encoding to read XML files.

Environment
Spring-batch version 4.3.0. Related to #807

Steps to reproduce
Just try to parse an XML file with a declared encoding different to the platform encoding.

Expected behavior
Encoding field in StaxEventItemReader should default to null, so that the parser has the opportunity to autodetect the file encoding.

@jdomigon
Copy link
Author

Hi @benas. I've checked that just assigning null on the encoding field in StaxEventItemReader allows the eventReader to autodetect the XML file encoding.

@fmbenhassine
Copy link
Contributor

Thank you for checking. However, while reviewing this, I don't see anything wrong with using the platform's default encoding as a default encoding. I believe that is a better default than null.

The title of this issue is implying that the default encoding is used unconditionally, but that is not accurate. It is just a default value, made configurable in #807. If the default encoding is not suitable, it can be changed with the setter added in #807.

Closing this as a non issue.

@fmbenhassine fmbenhassine closed this as not planned Won't fix, can't repro, duplicate, stale Apr 25, 2023
@fmbenhassine fmbenhassine added status: declined Features that we don't intend to implement or Bug reports that are invalid or missing enough details and removed status: waiting-for-triage Issues that we did not analyse yet in: infrastructure type: bug labels Apr 25, 2023
@jdomigon
Copy link
Author

Hi, @fmbenhassine: XML files are a bit different to flat files, as they include a header that declares the encoding of the data that follows.

In my experience it's quite common to pass null as the encoding to the XML parser so that it can decide about it reading the header. Sometimes, it's useful to force the parser to use an encoding when the file to parse doesn't contain the encoding header or when this header is not correct, but this situation is uncommon.

Nevertheless, the configurable encoding introduced in #807, would be enough if the setter would't reject the null value, as it does currently:

	/**
	 * Set encoding to be used for the input file. Defaults to {@link #DEFAULT_ENCODING}.
	 *
	 * @param encoding the encoding to be used
	 */
	public void setEncoding(String encoding) {
		Assert.notNull(encoding, "The encoding must not be null");
		this.encoding = encoding;
	}

@fmbenhassine
Copy link
Contributor

That's a different story. Accepting null and using null as default are different design choices. I am re-opening this issue as we can make the reader accept null, but we need to make sure to address this correctly when configuring the event reader.

In my experience it's quite common to pass null as the encoding to the XML parser so that it can decide about it reading the header

Some StAX implementations might not accept nulls. Spring Batch uses Woodstox, which accepts null encoding, but since the StaxEventItemReader uses the implementation-agnostic XMLInputFactory.createXMLEventReader which does not document how to handle null encodings, I prefer not passing null.

The goal is to cover the option of letting the reader detect the encoding without passing null. I see there are two overloaded XMLInputFactory.createXMLEventReader methods to create the reader:

  /**
   * Create a new XMLEventReader from a java.io.InputStream
   * @param stream the InputStream to read from
   * @return an instance of the {@code XMLEventReader}
   * @throws XMLStreamException if an error occurs
   */
  public abstract XMLEventReader createXMLEventReader(java.io.InputStream stream)
    throws XMLStreamException;

  /**
   * Create a new XMLEventReader from a java.io.InputStream
   * @param stream the InputStream to read from
   * @param encoding the character encoding of the stream
   * @return an instance of the {@code XMLEventReader}
   * @throws XMLStreamException if an error occurs
   */
  public abstract XMLEventReader createXMLEventReader(java.io.InputStream stream, String encoding)
    throws XMLStreamException;

If we make the setter accept null, do you think that xmlInputFactory.createXMLEventReader(inputStream, null) and xmlInputFactory.createXMLEventReader(inputStream) have the same effect WRT detecting the encoding? I will give it a try, but if you can test that and share your feedback, I would be grateful. Using xmlInputFactory.createXMLEventReader(inputStream) is safer and could be added to Spring Batch. The idea is to update the StaxEventItemReader with something like this:

eventReader = this.encoding != null ?  xmlInputFactory.createXMLEventReader(inputStream, this.encoding) :
		xmlInputFactory.createXMLEventReader(inputStream);

@fmbenhassine fmbenhassine reopened this Apr 27, 2023
@fmbenhassine fmbenhassine added status: waiting-for-reporter Issues for which we are waiting for feedback from the reporter in: infrastructure type: feature and removed status: declined Features that we don't intend to implement or Bug reports that are invalid or missing enough details labels Apr 27, 2023
@fmbenhassine fmbenhassine changed the title Platform default encoding used unconditionally in StaxEventItemReader Allow StaxEventItemReader to auto-detect the input file encoding Apr 27, 2023
@jdomigon
Copy link
Author

That's a different story. Accepting null and using null as default are different design choices. I am re-opening this issue as we can make the reader accept null, but we need to make sure to address this correctly when configuring the event reader.

That's a point!

If we make the setter accept null, do you think that xmlInputFactory.createXMLEventReader(inputStream, null) and xmlInputFactory.createXMLEventReader(inputStream) have the same effect WRT detecting the encoding?

Good question. My intuition suggests that both methods should have the same effect, but intuition and XML APIs do not always match.

I will give it a try, but if you can test that and share your feedback, I would be grateful. Using xmlInputFactory.createXMLEventReader(inputStream) is safer and could be added to Spring Batch.

Tried both methods in two different 1.8 JREs (IBM and SUN). All of them work the same way no matter you use xmlInputFactory.createXMLEventReader(inputStream, null) or xmlInputFactory.createXMLEventReader(inputStream).

I've drilled down a bit inside both code flows and all of them converge into xmlInputFactory.getXMLStreamReaderImpl(InputSource) receiving an implementation of org.xml.sax.InputSource with a null encoding. Seems that InputSource has a documented behaviour regarding encoding which matches with intuition:

 * <p>The SAX parser will use the InputSource object to determine how
 * to read XML input.  If there is a character stream available, the
 * parser will read that stream directly, disregarding any text
 * encoding declaration found in that stream.
 * If there is no character stream, but there is
 * a byte stream, the parser will use that byte stream, using the
 * encoding specified in the InputSource or else (if no encoding is
 * specified) autodetecting the character encoding using an algorithm
 * such as the one in the XML specification.  If neither a character
 * stream nor a
 * byte stream is available, the parser will attempt to open a URI
 * connection to the resource identified by the system
 * identifier.</p> 

@fmbenhassine
Copy link
Contributor

Thank you for your feedback, Appreciate it! I think using xmlInputFactory.createXMLEventReader(inputStream) is safe when the encoding is not specified, and since Spring Batch uses the implementation agnostic API, we can always defer to the implementation for specific details (ie correctly or incorrectly handling encodings).

I will plan this enhancement for the upcoming 5.0.2, by accepting null in the setter and calling xmlInputFactory.createXMLEventReader(inputStream) when the encoding is not specified.

@fmbenhassine fmbenhassine removed the status: waiting-for-reporter Issues for which we are waiting for feedback from the reporter label Apr 27, 2023
@fmbenhassine fmbenhassine added this to the 5.0.2 milestone Apr 27, 2023
@fmbenhassine fmbenhassine added the for: backport-to-4.3.x Issues that will be back-ported to the 4.3.x line label Apr 27, 2023
@jdomigon
Copy link
Author

Backport-to-4.3.x, also appreciated, thanks!

fmbenhassine added a commit that referenced this issue Apr 27, 2023
Before this commit, it was not possible to pass a null encoding
to the StaxEventItemReader, which prevents the XML event reader
to auto-detect the file encoding.

This commits makes the encoding setter more lenient by accepting
a null value.

Resolves #4101
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
for: backport-to-4.3.x Issues that will be back-ported to the 4.3.x line in: infrastructure type: feature
Projects
None yet
Development

No branches or pull requests

2 participants