How to deal with latency with large number of columns? #80

Closed
Moelf opened this issue Jul 9, 2021 · 4 comments
@Moelf

Moelf commented Jul 9, 2021

Currently I have a column type that is lazy. It represents ~GB of stuff that needs to be read and decompressed on the fly and cached (by chunk). Turns out I can construct Table nicely and the laziness works.

However, sometimes we have 1000+ columns, and in that case the compiler struggles a lot.

Is it possible to have a less strictly typed Table with the same interface?

@andyferris
Member

Hi @Moelf,

This is a known issue with TypedTables. TypedTables excels in situations with relatively few columns (< 20-30) and will otherwise be a burden on the compiler for wider tables, since the compiler needs to generate specialized code for each column.
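To make the scaling concrete, here is a minimal sketch in plain Julia (no TypedTables required). It assumes only that a Table carries its columns as a NamedTuple, so that every column name and column type ends up in the table's type. The helper `make_columns` and the names `col1`, `col2`, ... are hypothetical and exist only for this illustration.

```julia
# Hypothetical helper: build a NamedTuple holding n columns named col1..coln.
make_columns(n) = NamedTuple{Tuple(Symbol("col", i) for i in 1:n)}(
    Tuple(rand(3) for _ in 1:n),
)

narrow = make_columns(3)
wide = make_columns(200)

# Every column contributes a name and an element type to the NamedTuple type,
# so the type the compiler has to specialize on grows with the column count.
n_narrow = fieldcount(typeof(narrow))
n_wide = fieldcount(typeof(wide))

# Adding one more column produces an entirely different type.
types_differ = typeof(wide) != typeof(make_columns(201))
```

Any function called on `wide` is compiled for that specific 200-field type, which is where the latency comes from.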

For tables with 1000+ columns we need to use a more "dynamic" representation of the data. I had some work-in-progress on this at #66, but it is not complete. As Dictionaries.jl is maturing, it should be possible to push forward with this work (and replace FlexTable entirely, which won't solve your problem unfortunately), but I personally have only a little time to dedicate to this work (contributions are welcome, of course!).

In the meantime I think your best bets are to use DataFrames.jl or else keep your data as a Dictionary of AbstractVectors and manipulate it from that structure. For the latter case, the tools in Dictionaries.jl, SplitApplyCombine.jl, Indexing.jl, etc. could potentially be helpful in working with "nested" data. E.g. you can get a subset of columns with getindices(columns, names), or get an entire row with getindex.(columns, i), or lazily with mapview(col -> col[i], columns), which is probably faster.
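As a rough sketch of the "dictionary of columns" idea, the snippet below uses only Base Julia's `Dict` rather than Dictionaries.jl, so it does not demonstrate `getindices` or `mapview` themselves. The column names, the data, and the helpers `select_columns` and `get_row` are all hypothetical stand-ins for this example.

```julia
# Columns live in an untyped container: the Dict's type does not change
# with the number of columns, so nothing is specialized per column.
columns = Dict{Symbol, AbstractVector}(
    :id => [1, 2, 3],
    :energy => [10.5, 20.1, 7.3],
    :label => ["a", "b", "c"],
)

# Hypothetical helper: keep only the requested columns (no data is copied).
select_columns(cols, names) = Dict(name => cols[name] for name in names)

# Hypothetical helper: materialize row i as a name => value dictionary.
get_row(cols, i) = Dict(name => col[i] for (name, col) in cols)

subset = select_columns(columns, [:id, :energy])
row = get_row(columns, 2)
```

The trade-off is that element types are not known to the compiler when looping over rows, so hot loops are usually better written per column, or over a small typed Table built from the selected columns.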

@Moelf
Author

Moelf commented Jul 10, 2021

For now TypedTables is fairly good: it takes my lazy column without complaint, and "real work" should only require <50 columns.

Looping over a typed table is blazing fast; I really appreciate the work done here.

@andyferris andyferris changed the title How to deal with latency? How to deal with latency with large number of columns? Jan 28, 2022
@Moelf
Author

Moelf commented May 6, 2022

The compiler got much faster in Julia 1.8; I don't think this is much of a real concern anymore.

@Moelf Moelf closed this as completed May 6, 2022
@oschulz
Contributor

oschulz commented Aug 31, 2022

We've also started to use TypedTables on HDF5-on-disk columns (wrapped as arrays); it seems to work well so far.


3 participants