Hi, I'm making Elasticsearch queries and storing the results in dataframes so I can run some statistics on them (min/max, histogram, etc.). There are over 1 billion rows of results, and they will likely be larger than my RAM (several hundred GB). Currently, I'm querying in chunks (splitting by hour), storing each hour's results in a pandas dataframe, converting it into a vaex dataframe (using from_pandas()), and concatenating it with the previous dataframes. I'm wondering if this is the most efficient way to do it, in terms of time taken to crunch the data and memory usage. Should I write the ES result dataframes into an HDF5 file and then use vaex to read it, or is that no different from converting the pandas dataframe into a vaex dataframe? Thank you.
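Roughly what I'm doing now, as a sketch (`hours` and `query_es_hour()` stand in for my actual time ranges and ES call):

```python
import pandas as pd
import vaex

df = None
for hour in hours:                               # hours: my list of hourly ranges
    pdf = pd.DataFrame(query_es_hour(hour))      # query_es_hour: my ES query (omitted)
    chunk = vaex.from_pandas(pdf)                # wraps the same arrays, still in RAM
    df = chunk if df is None else df.concat(chunk)
```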
Hello, a pandas dataframe lives in RAM (the same goes for a numpy array, etc.); a file does not. Your current approach (splitting into chunks but keeping everything in RAM as pandas dataframes, numpy arrays, etc.) does not change memory consumption compared to querying everything into a single pandas dataframe.
A vaex dataframe obtained by converting a pandas one also takes RAM (the "same" C array, or an equivalent, is still in RAM).
A vaex dataframe obtained by reading an hdf5 or arrow file does not: when processing the data, vaex works through it in small chunks, so it uses hardly any RAM.
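A minimal sketch of that file-based pattern (the `hourly_pandas_frames` iterable and the `latency` column are placeholders for your ES results):

```python
import vaex

# write each hourly pandas dataframe to its own hdf5 file, then drop it from RAM
for i, pdf in enumerate(hourly_pandas_frames):   # placeholder: your hourly ES results
    vaex.from_pandas(pdf).export_hdf5(f"es_part_{i:04d}.hdf5")

# open all parts as one lazy dataframe; vaex memory-maps the files and
# concatenates everything matching the glob
df = vaex.open("es_part_*.hdf5")

# statistics stream through the data in chunks instead of loading it all
print(df.min(df.latency), df.max(df.latency))    # placeholder column name
counts = df.count(binby=df.latency, shape=64)    # 64-bin histogram
```

This way only one hour of results is ever held in RAM at a time, while the statistics run out-of-core over the full billion rows.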
Best