How to deal with latency with large number of columns? #80

Moelf · 2021-07-09T15:48:57Z

Currently I have a column type that is lazy. It represents gigabytes of data that need to be read and decompressed on the fly and cached (by chunk). It turns out I can construct a Table nicely and the laziness works.

However, sometimes we have 1000+ columns, and in this case the compiler struggles a lot.

Is it possible to have a less strictly typed Table with the same interface?

andyferris · 2021-07-10T23:00:10Z

Hi @Moelf,

This is a known issue with TypedTables. TypedTables excels in situations with relatively few columns (< 20-30) and will otherwise be a burden on the compiler for wider tables, since the compiler needs to generate specialized code for each column.
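To make the point about specialization concrete, here is a minimal sketch using only Base Julia. It assumes, as an illustration, that a typed table behaves like a NamedTuple of column vectors, so every column name and element type is part of the type the compiler has to specialize on. The `Dict` variant is just a stand-in for a "dynamic" representation; the variable names are made up for this example.

```julia
# Typed representation: column names and element types live in the type itself,
# so the type (and the code compiled for it) grows with the number of columns.
typed_cols = (id = [1, 2, 3], x = [0.5, 1.5, 2.5])

# Dynamic representation: the container type is the same no matter how many
# columns it holds; column lookups are resolved at run time instead.
dynamic_cols = Dict{Symbol, AbstractVector}(
    :id => [1, 2, 3],
    :x  => [0.5, 1.5, 2.5],
)

println(typeof(typed_cols))    # mentions every column name and element type
println(typeof(dynamic_cols))  # independent of the columns stored
```

With a handful of columns the typed form costs little and gives fast, type-stable loops; with 1000+ columns the type itself becomes very large, which is where the compile-time latency comes from.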

For tables with 1000+ columns we need to use a more "dynamic" representation of the data. I had some work-in-progress on this at #66, but it is not complete. As Dictionaries.jl is maturing, it should be possible to push forward with this work (and to replace FlexTable entirely; FlexTable itself unfortunately won't solve your problem), but I personally have only a little time to dedicate to this work (contributions are welcome, of course!).

In the meantime I think your best bets are to use DataFrames.jl or else keep your data as a Dictionary of AbstractVectors and manipulate it from that structure. For the latter case, the tools in Dictionaries.jl, SplitApplyCombine.jl, Indexing.jl, etc. could potentially be helpful in working with "nested" data. E.g. you can get a subset of columns with getindices(columns, names), or get an entire row with getindex.(columns, i), or lazily with mapview(col -> col[i], columns), which is probably faster.
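As a rough sketch of the "container of AbstractVectors" approach, the following uses a Base `Dict` as a stand-in so that it runs without any packages. It is not the Dictionaries.jl API: `select_columns` and `get_row` are hypothetical helpers written for this example, playing the roles that getindices and the row access described above would play on a Dictionary.

```julia
# Columns held in an untyped container; each value can be any AbstractVector,
# including a lazy, chunk-cached column type.
columns = Dict{Symbol, AbstractVector}(
    :id    => [1, 2, 3],
    :x     => [0.5, 1.5, 2.5],
    :label => ["a", "b", "c"],
)

# Hypothetical helper: pick a subset of columns by name (no data is copied,
# the new Dict refers to the same column vectors).
select_columns(cols, names) = Dict(name => cols[name] for name in names)

# Hypothetical helper: materialize row `i` as a Dict of name => value.
get_row(cols, i) = Dict(name => col[i] for (name, col) in cols)

subset = select_columns(columns, [:id, :x])
row = get_row(columns, 2)

println(sort(collect(keys(subset))))
println(row[:label])
```

Because lookups go through the container at run time, nothing here is specialized per column, so the number of columns does not affect compile time. The trade-off is that tight loops over rows are not type-stable and will be slower than with a typed table.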

Moelf · 2021-07-10T23:13:59Z

For now TypedTables is fairly good: it takes my lazy column without complaint, and "real work" should only require <50 columns.

Looping over a typed table is blazing fast. I really appreciate the work done here.

Moelf · 2022-05-06T04:51:43Z

The compiler got much faster in 1.8; I don't think this is much of a real concern anymore.

oschulz · 2022-08-31T07:10:30Z

We've also started to use TypedTables with on-disk HDF5 columns (wrapped as arrays), and it seems to work well so far.

andyferris added the performance label Jan 28, 2022

andyferris changed the title ~~How to deal with latency?~~ How to deal with latency with large number of columns? Jan 28, 2022

Moelf closed this as completed May 6, 2022
