-
Is the schema the same amongst the different files?
-
Just to add a bit of info on how the arrow files were generated: I made an Elasticsearch query and stored the results as a pandas dataframe. This was repeated for different ES query time periods. It was done and completed first using a separate script, before I tried to open the arrow files to read them into a vaex dataframe.
-
Strange. Could you make a reproducible issue: generate some data, export it, and see how long that takes for you, so we can try the same?
-
Thanks for the suggestion in another thread to try exporting the arrow files to hdf5. I tried that, and I can now open a file in less than 300ms, and the memory usage seems to be minimal too. I'll convert all my arrow files to hdf5 then.
-
Hi,
I have multiple `.arrow` files, each about 1GB (the total file size is larger than my RAM). I tried to open all of them using `vaex.open_many()` to read them into a single dataframe, and saw that memory usage kept increasing and that it was taking longer than I expected. So I tried opening just a single file with `vaex.open()`.
What I noticed was that it takes about 4-5 seconds to open the file, and the free memory (as reported in the `free` column of `free -h`) kept decreasing until it was ~1GB lower. I thought that when opening arrow files, vaex would use memory mapping and thus wouldn't actually use up so much memory, and that it would also be faster. Is my understanding correct, or am I doing something wrong?
ETA: Based on the documentation, I thought the file would open instantly. If I time the cell using `%time`, it does return in microseconds, but the cell continues to run for a few seconds, as shown by `%%time`.
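For reference, the memory-mapping expectation itself can be checked independently of vaex with nothing but the standard library: mapping a file is near-instant, and pages only hit memory as they are touched. The file name and size below are arbitrary:

```python
import mmap
import os
import tempfile

# Create a 64 MB file without writing any data
# (sparse on most filesystems, so it reads back as zeros).
path = os.path.join(tempfile.mkdtemp(), "big.bin")
with open(path, "wb") as f:
    f.truncate(64 * 1024 * 1024)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The map itself is effectively instant; disk reads and resident
    # memory only accrue for the pages actually accessed:
    first = mm[0]   # touches the first page
    last = mm[-1]   # touches the last page
    mm.close()
```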