introduce columnar chunk format #5723
If I don't understand it wrong, you expect to define the schema at write time. I don't expect to see Loki go in the direction of defining schemas at write time. In my opinion, Loki has achieved great success through its design of defining the schema at read time: "just give me all the logs, I want to grep". I think the general direction of Loki's future development should continue to be defining the schema at read time, so it can succeed as the lowest-cost logging system: more caching, more partition definitions, more query goroutine concurrency, more query replicas, better compression ratios, etc.
Hi @liguozhong, I believe you don't understand my idea clearly. I'm writing code for a so-called "timeseries database" now, and I've observed that many systems want to avoid the complexity of merge/compaction, but they change their implementation in the end; having no merge causes many problems.
Many years ago map-reduce appeared, and now nobody uses it any more, because it is hard to write and its performance is very poor; instead, developers try hard to write more efficient query optimizations and storage formats. I believe Loki will do more and become better, I'm just discussing some rough ideas with the community :)
For reference, here is a related issue: #91
Is your feature request related to a problem? Please describe.
Loki's index design is quite simple: it requires every series to have at least one chunk in the ingester's memory, which is then flushed to S3/NoSQL DB directly as one key-value pair. Although this design avoids the complexity of compaction found in LSM-like storage systems, it also gives Loki's active series a lower upper limit, and query latency depends almost entirely on disk and network bandwidth. I believe many Loki users face the problem that Loki queries are very slow in many situations.
I think any way to decrease the usage of network/disk bandwidth during queries could significantly improve the query experience.
Describe the solution you'd like
The current chunk format is NSM-like (row-oriented): to apply a filter, Loki must read each whole log line to decide whether it should be returned to the user. But in most situations the user's query filter is clear and the keyword is quite short: "I want to check a user's logs, so I filter isUID=xxx", "I want to check whether remote_request=Kafka has something wrong", and so on. Developers know these words and labels, but Loki can't index them, because uid/tid etc. are so-called high-cardinality labels.
I recommend the chunk format implement a DSM-like columnar layout, where uid, tid, remote-system, etc. could each be an independent column.
When a user runs a query, the label matcher may look like {label1=value1, ..., uid=xxx}. When handling this request, Loki could read only the chunk's uid column, then decide for which rows the whole log line should be read.
In this situation, disk bandwidth usage should be much lower than before, and of course the query should return faster.
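To make the idea concrete, here is a minimal, self-contained Go sketch of such a DSM-like chunk; the type and field names (`columnarChunk`, `uid`, `line`) are illustrative assumptions, not Loki's actual chunk code. The point is that `filterByUID` touches only the small uid column before materializing any full log line, while the row-oriented `naiveFilter` must read every line:

```go
package main

import (
	"fmt"
	"strings"
)

// columnarChunk is a hypothetical DSM-like chunk layout: instead of storing
// complete log lines back to back (NSM), a selected high-cardinality field
// (uid here) lives in its own column, parallel to the full-line column.
type columnarChunk struct {
	uid  []string // one entry per log line
	line []string // the full log line, same index as uid
}

// filterByUID scans only the small uid column to find matching rows, then
// materializes the full line only for those rows. Non-matching full lines
// are never read, which is where the bandwidth saving comes from.
func (c *columnarChunk) filterByUID(uid string) []string {
	var out []string
	for i, u := range c.uid {
		if u == uid {
			out = append(out, c.line[i])
		}
	}
	return out
}

// naiveFilter shows today's row-oriented behavior for contrast: every full
// log line must be read and substring-matched.
func (c *columnarChunk) naiveFilter(keyword string) []string {
	var out []string
	for _, l := range c.line {
		if strings.Contains(l, keyword) {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	chunk := &columnarChunk{
		uid:  []string{"u1", "u2", "u1"},
		line: []string{"uid=u1 msg=login", "uid=u2 msg=logout", "uid=u1 msg=purchase"},
	}
	fmt.Println(chunk.filterByUID("u1"))      // reads uid column + 2 full lines
	fmt.Println(chunk.naiveFilter("uid=u1 ")) // reads all 3 full lines
}
```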
When storing chunks in S3, Loki would still have to read the whole chunk data to local, so this approach alone may not have the expected advantage there. Instead, I recommend saving chunks in Parquet format, so the query could be executed inside S3 and Loki needn't fetch the whole data locally.
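One existing mechanism for this kind of server-side filtering is S3 Select, which can run a SQL projection/filter directly over a Parquet object. Below is a rough sketch using the AWS Go SDK (v1); the bucket, key, and the uid/line column names are made-up placeholders, and this only illustrates the pushdown idea, not a proposed Loki implementation:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	svc := s3.New(session.Must(session.NewSession()))

	// Ask S3 to evaluate the filter against the Parquet object server-side;
	// only matching rows cross the network. Bucket, key, and column names
	// are hypothetical.
	out, err := svc.SelectObjectContent(&s3.SelectObjectContentInput{
		Bucket:         aws.String("loki-chunks"),
		Key:            aws.String("fake/chunk-000.parquet"),
		ExpressionType: aws.String(s3.ExpressionTypeSql),
		Expression:     aws.String(`SELECT s.line FROM S3Object s WHERE s.uid = 'xxx'`),
		InputSerialization: &s3.InputSerialization{
			Parquet: &s3.ParquetInput{},
		},
		OutputSerialization: &s3.OutputSerialization{
			JSON: &s3.JSONOutput{},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer out.EventStream.Close()

	// The response arrives as an event stream; RecordsEvent payloads carry
	// the selected rows (serialized as JSON here).
	for ev := range out.EventStream.Events() {
		if rec, ok := ev.(*s3.RecordsEvent); ok {
			fmt.Print(string(rec.Payload))
		}
	}
	if err := out.EventStream.Err(); err != nil {
		log.Fatal(err)
	}
}
```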
Generally, a columnar format means Loki's logs would need a defined schema, and the schema's fields would look like non-indexed labels.
Because log-system users are almost all developers, defining a suitable schema in this situation is not hard work.
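As a toy illustration of what such a write-time schema could look like, here is a hypothetical Go struct; every name in it is invented for this example, mirroring the fields mentioned above (uid, tid, remote system). Each field would become its own column in the chunk (or Parquet file) but would stay out of Loki's label index, so it behaves like a non-indexed label:

```go
package schema

// AppLogSchema is a hypothetical write-time schema for one application's
// logs. None of these names come from Loki itself.
type AppLogSchema struct {
	Timestamp int64  // entry timestamp in nanoseconds
	UID       string // high-cardinality: gets its own column, never indexed
	TID       string // trace/transaction id, also high-cardinality
	Remote    string // remote system, e.g. "kafka"
	Line      string // the raw log line itself
}
```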