Hi, I'm making Elasticsearch queries and storing the results in dataframes so I can run some statistics on them (min/max, histogram, etc.). There are over 1 billion rows of results, and they will likely be larger than my RAM (several hundred GB). Currently, I'm querying in chunks (splitting by hour), storing each hour's results in a pandas dataframe, converting it into a vaex dataframe (using from_pandas()), and concatenating it with the previous dataframes. I'm wondering if this is the most efficient way to do it, in terms of time taken to crunch the data and memory usage. Should I write the ES result dataframes into an HDF5 file and then use vaex to read it, or is that no different from converting the pandas dataframe into a vaex dataframe? Thank you.
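Roughly what I'm doing now, as a sketch (`hours` and `query_es_hour()` stand in for my actual time ranges and ES call):

```python
import pandas as pd
import vaex

df = None
for hour in hours:                               # hours: my list of hourly ranges
    pdf = pd.DataFrame(query_es_hour(hour))      # query_es_hour: my ES query (omitted)
    chunk = vaex.from_pandas(pdf)                # wraps the same arrays, still in RAM
    df = chunk if df is None else df.concat(chunk)
```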
Hello, a pandas dataframe lives in RAM (the same goes for a numpy array, etc.); a file does not. Your current approach (splitting into chunks but keeping everything in RAM as pandas dataframes, numpy arrays, etc.) does not change memory consumption compared to querying everything into a single pandas dataframe.
A vaex dataframe obtained by converting a pandas one also takes RAM (the "same" C array, or an equivalent, is still in RAM).
A vaex dataframe obtained by reading an hdf5 or arrow file does not: when processing the data, vaex works through it in small chunks, so it uses hardly any RAM.
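A minimal sketch of that file-based pattern (the `hourly_pandas_frames` iterable and the `latency` column are placeholders for your ES results):

```python
import vaex

# write each hourly pandas dataframe to its own hdf5 file, then drop it from RAM
for i, pdf in enumerate(hourly_pandas_frames):   # placeholder: your hourly ES results
    vaex.from_pandas(pdf).export_hdf5(f"es_part_{i:04d}.hdf5")

# open all parts as one lazy dataframe; vaex memory-maps the files and
# concatenates everything matching the glob
df = vaex.open("es_part_*.hdf5")

# statistics stream through the data in chunks instead of loading it all
print(df.min(df.latency), df.max(df.latency))    # placeholder column name
counts = df.count(binby=df.latency, shape=64)    # 64-bin histogram
```

This way only one hour of results is ever held in RAM at a time, while the statistics run out-of-core over the full billion rows.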
Best