Improve performance of `EncodingDetectingInputStream` #2535

knutwannheden · 2022-12-14T23:01:28Z

Slightly improves the performance of `EncodingDetectingInputStream` by refactoring the `read()` method and avoiding the additional `byte[]` copy in `readFully()`.
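For readers unfamiliar with the copy in question, here is a minimal sketch of one common way to avoid it. It assumes the extra copy comes from `ByteArrayOutputStream#toByteArray()`; the class and method names (`ReadFullySketch`, `DirectDecodeBuffer`) are hypothetical and are not taken from the PR's diff.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadFullySketch {

    /** Hypothetical helper: decodes directly from the internal buffer. */
    public static final class DirectDecodeBuffer extends ByteArrayOutputStream {
        public String decode(Charset charset) {
            // buf and count are protected fields of ByteArrayOutputStream, so the
            // intermediate array that toByteArray() would allocate is not needed.
            return new String(buf, 0, count, charset);
        }
    }

    /** Drains the stream in chunks and decodes the collected bytes. */
    public static String readFully(InputStream in, Charset charset) {
        try (InputStream is = in) {
            DirectDecodeBuffer bos = new DirectDecodeBuffer();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = is.read(chunk)) != -1) {
                bos.write(chunk, 0, n);
            }
            return bos.decode(charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] source = "class A {}".getBytes(StandardCharsets.UTF_8);
        System.out.println(readFully(new ByteArrayInputStream(source), StandardCharsets.UTF_8));
    }
}
```

The saving is one array allocation and copy per file, so it should matter most when many files are read in sequence.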

Also make sure that `PlainTextParser#parseInputs()` reads the source text before the encoding, since the latter won't be correctly set otherwise.
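To illustrate why the order matters, here is a small sketch that assumes the charset is detected as a side effect of consuming bytes. `DetectingStream` is a hypothetical stand-in written for this example, not OpenRewrite's class, and its detection rule (optimistic UTF-8 with a fallback to ISO-8859-1) is a simplification.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadBeforeCharset {

    /** Hypothetical detecting stream: the charset is only reliable after reading. */
    public static final class DetectingStream extends InputStream {
        private final InputStream in;
        private Charset charset = StandardCharsets.UTF_8; // optimistic default
        private int pending = 0; // UTF-8 continuation bytes still expected

        public DetectingStream(InputStream in) {
            this.in = in;
        }

        public Charset getCharset() {
            return charset;
        }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b == -1) {
                // A multi-byte sequence cut off at end of input is not valid UTF-8.
                if (pending > 0) charset = StandardCharsets.ISO_8859_1;
                return -1;
            }
            if (charset == StandardCharsets.UTF_8) inspect(b);
            return b;
        }

        private void inspect(int b) {
            if (pending > 0) {
                if ((b & 0xC0) == 0x80) pending--;
                else charset = StandardCharsets.ISO_8859_1;
            } else if (b >= 0x80) {
                if ((b & 0xE0) == 0xC0) pending = 1;
                else if ((b & 0xF0) == 0xE0) pending = 2;
                else if ((b & 0xF8) == 0xF0) pending = 3;
                else charset = StandardCharsets.ISO_8859_1;
            }
        }

        /** Reads everything first, then decodes with the charset detected on the way. */
        public String readFully() {
            try {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                int b;
                while ((b = read()) != -1) bos.write(b);
                return bos.toString(charset.name());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9}; // "café" in ISO-8859-1
        DetectingStream is = new DetectingStream(new ByteArrayInputStream(latin1));
        System.out.println("before reading: " + is.getCharset());
        String text = is.readFully();
        System.out.println("after reading:  " + is.getCharset() + " (" + text.length() + " chars)");
    }
}
```

Querying the charset before the text has been read returns only the initial default, which is the kind of mistake the reordering in `parseInputs()` guards against.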

knutwannheden · 2022-12-14T23:03:30Z

Have you evaluated supporting byte buffers at the API level and working with memory-mapped files?

knutwannheden · 2022-12-16T02:07:08Z

I will try to supply some performance improvement figures for the parsing benchmark. I think the improvement was around 10%. But for this change that parsing benchmark should be trimmed down a bit. The current two tests basically compare the results of using buffering vs. no buffering when reading very small files (much smaller than the buffer size, even). I can provide a benchmark specifically for the refactored class.

jkschneider · 2022-12-17T18:44:00Z

> Have you evaluated supporting byte buffers at the API level and working with memory-mapped files?

Not yet. Do you mean as an alternative implementation for our `JavaFileObject`?

knutwannheden · 2022-12-18T19:26:20Z

> Not yet. Do you mean as an alternative implementation for our `JavaFileObject`?

We would still have to conform to the FileObject API. I have not done any high-level profiling, so I don't know how crucial this part actually is (I am assuming it will often be rather irrelevant compared to the other phases). Depending on that it could still be worthwhile to check this out.
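For context, a rough sketch of what reading a source file through a memory-mapped buffer could look like with the standard `java.nio` API. `MappedSourceSketch` and `readMapped` are hypothetical names, and nothing here has been profiled. The result is a `CharSequence`, which is the type `javax.tools.FileObject#getCharContent` returns, so a sketch like this could in principle sit behind that API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedSourceSketch {

    /** Maps the file read-only and decodes it straight from the mapped buffer. */
    public static CharSequence readMapped(Path path, Charset charset) {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // Charset#decode returns a CharBuffer, which implements CharSequence.
            return charset.decode(mapped);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mapped-source", ".java");
        // The mapping may outlive the channel, so deletion is deferred to JVM exit.
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "class A {}".getBytes(StandardCharsets.UTF_8));
        System.out.println(readMapped(tmp, StandardCharsets.UTF_8));
    }
}
```

Note that this still requires the charset up front; detection would have to scan the mapped bytes first, so whether mapping pays off for typically small source files remains an open question.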

Regarding file encoding in general, I quite like IntelliJ's approach to this: https://www.jetbrains.com/help/idea/encoding.html#single-file. In contrast, for ASCII and UTF-8 sources OpenRewrite tries to detect the character set by scanning the entire contents, which is a bit wasteful.

At least when using the Maven plugin (not with the Gradle plugin AFAICT), the `org.openrewrite.tree.ParsingExecutionContextView#setCharset()` method is invoked if the `project.build.sourceEncoding` property is set, which then skips the entire character set detection logic.
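The effect of an explicitly configured charset can be sketched in plain Java as follows. This is only an illustration of the idea: `ExplicitCharsetSketch#decode` is a hypothetical helper, the strict-UTF-8-then-ISO-8859-1 fallback is an assumption made for the example, and none of it is OpenRewrite's actual detection code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetSketch {

    /** Hypothetical helper: uses the configured charset, or detects when none is given. */
    public static String decode(byte[] bytes, Charset configured) {
        if (configured != null) {
            // Explicit configuration: no detection pass over the contents is needed.
            return new String(bytes, configured);
        }
        try {
            // Strict UTF-8 decoding reports malformed input instead of replacing it.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Assumed fallback for this sketch only.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9}; // "café" in ISO-8859-1
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1).length());
        System.out.println(decode(latin1, null).length());
    }
}
```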

jkschneider merged commit d29ae3a into openrewrite:main Dec 17, 2022

jkschneider added this to the 7.34.3 milestone Dec 17, 2022

knutwannheden deleted the reading-performance branch December 17, 2022 19:31
