This is a short report on the performance metrics obtained when processing large files with small Rust and Python scripts.
"Large" here means files that do not fit comfortably in memory (e.g. > 100 GB) and therefore require streaming / buffered I/O.
The experiment is as follows:
- A 2 GB text file containing textual information about objects
- Every 3 consecutive lines describe one object (i.e. one line = one attribute)
- Objects are separated by a single blank line
The objective is to read the file, iterate through the lines, and write the results to CSV/TSV.
Input format:

OBJECT 1 ATTR 1: CONTENT
OBJECT 1 ATTR 2: CONTENT
OBJECT 1 ATTR 3: CONTENT

OBJECT 2 ATTR 1: CONTENT
OBJECT 2 ATTR 2: CONTENT
OBJECT 2 ATTR 3: CONTENT

... etc

Expected CSV output:

OBJECT 1 ATTR 1, OBJECT 1 ATTR 2, OBJECT 1 ATTR 3
OBJECT 2 ATTR 1, OBJECT 2 ATTR 2, OBJECT 2 ATTR 3
... etc
Currently there are clear optimisations required for the Rust code, as it performs several string operations on every line.
Ideally the files would be processed in Rust as raw bytes (u8) rather than String, which should accelerate the processing; BufReader does in fact allow this, since BufRead::read_until reads raw bytes up to a delimiter without UTF-8 validation or per-line String allocation.
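A minimal sketch of that byte-level reading loop (the file name is an illustrative assumption):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let mut reader = BufReader::new(File::open("objects.txt")?);
    let mut line: Vec<u8> = Vec::with_capacity(256);
    loop {
        line.clear();
        // read_until fills `line` with raw bytes up to and including b'\n',
        // so the trailing delimiter must be stripped before parsing.
        let n = reader.read_until(b'\n', &mut line)?;
        if n == 0 {
            break; // EOF
        }
        // ... process `line` as &[u8] here, no UTF-8 validation or String allocation ...
    }
    Ok(())
}
```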
The results are provided below.
Simple Python implementation without any explicit buffering, using the native Python file I/O readline / write.
real 2m16.087s
user 1m4.397s
sys 0m4.352s
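For reference, a minimal sketch of what this baseline might look like; the file names and the "ATTR: CONTENT" parsing are assumptions, not the actual script:

```python
# Unbuffered baseline: read three attribute lines per object, skip the blank
# separator line, and emit one CSV row per object (no CSV quoting here).
with open("objects.txt") as src, open("objects.csv", "w") as dst:
    while True:
        attrs = [src.readline() for _ in range(3)]
        if not attrs[0]:
            break  # EOF
        src.readline()  # consume the blank separator line
        # Keep only the content after the first ":".
        row = ",".join(a.rstrip("\n").split(":", 1)[-1].strip() for a in attrs)
        dst.write(row + "\n")
```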
Rust implementation using BufReader and BufWriter, converting each line to a String, appending it to the current row, and writing the row out as bytes.
real 7m28.602s
user 7m19.379s
sys 0m5.094s
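A minimal sketch of this variant, with illustrative paths and parsing (the actual script may differ):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

fn main() -> std::io::Result<()> {
    let reader = BufReader::new(File::open("objects.txt")?);
    let mut writer = BufWriter::new(File::create("objects.csv")?);
    let mut row = String::new();

    for line in reader.lines() {
        let line = line?; // every line is UTF-8 validated and allocated as a String
        if line.is_empty() {
            // Blank separator: write the accumulated row out as bytes.
            row.push('\n');
            writer.write_all(row.as_bytes())?;
            row.clear();
        } else {
            // Append the content after "ATTR n:" to the current row.
            if !row.is_empty() {
                row.push(',');
            }
            row.push_str(line.splitn(2, ':').nth(1).unwrap_or("").trim());
        }
    }
    // Flush a trailing object if the file does not end with a blank line.
    if !row.is_empty() {
        row.push('\n');
        writer.write_all(row.as_bytes())?;
    }
    Ok(())
}
```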
Rust implementation using BufReader and BufWriter, buffering each object's lines in a vector to attempt a single string concatenation per row.
real 8m35.463s
user 8m24.227s
sys 0m5.918s
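One possible reading of that description, again with illustrative paths and parsing:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

fn main() -> std::io::Result<()> {
    let reader = BufReader::new(File::open("objects.txt")?);
    let mut writer = BufWriter::new(File::create("objects.csv")?);
    let mut fields: Vec<String> = Vec::with_capacity(3);

    for line in reader.lines() {
        let line = line?;
        if line.is_empty() {
            // Join the buffered fields into a single String per object,
            // then write it with one call.
            let row = fields.join(",") + "\n";
            writer.write_all(row.as_bytes())?;
            fields.clear();
        } else {
            fields.push(line.splitn(2, ':').nth(1).unwrap_or("").trim().to_string());
        }
    }
    // Flush a trailing object if the file does not end with a blank line.
    if !fields.is_empty() {
        writer.write_all((fields.join(",") + "\n").as_bytes())?;
    }
    Ok(())
}
```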
Rust implementation using plain (unbuffered) File reads in bytes, copying the data to another location without performing any processing.
real 41m12.918s
user 22m55.845s
sys 18m15.949s
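A sketch of the shape such a copy loop may have had, assuming small unbuffered reads (the buffer size and paths are illustrative; the original is not stated). Each unbuffered read and write is a separate syscall, which would be consistent with the large sys time above:

```rust
use std::fs::File;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    let mut src = File::open("objects.txt")?;
    let mut dst = File::create("objects_copy.txt")?;
    let mut buf = [0u8; 4096];
    loop {
        // No BufReader/BufWriter: every read and write_all hits the OS directly.
        let n = src.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        dst.write_all(&buf[..n])?;
    }
    Ok(())
}
```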