-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Removal of storing historical value of feature in serving storage #53
Comments
@pradithya , I think you should also state why we have this functionality in the first place, and also why the impact doesn't affect that requirement. That is more important than the technical implementation details. In this case there was a need for time-series "intermediate" features to be stored in Feast, and these features are then turned into final features by the client. However, it seems like this is only a requirement in exceptional cases and that most models only require the latest value, which is why the impact of removing it is low. |
I support this. This would also mean we don't need granularities, as discussed in #17. I agree it seems that the need for time series features was unusual. And thus is probably shouldn't be supported in a way that requires so much extra effort. Especially when it has impacted the pleasantness of the serving api. If we remove them from the serving api, it will ensure that serving features matches a format that is more consumable by models. Instead of providing time-series for model creators who want to do extra feature transformations after feast, we should provide better tooling upstream. To make creating features super easy. I think because we said feature creation is out of scope for Feast, we ended up shoehorning in a workaround and now we realise we probably don't need it anyway. For developers that truly need to access to raw time series, they can still do so with the warehouse. This is a typical approach when coming up with new features. They can then create features from those and ingest them in. |
True, but I think removing granularity will require bigger changes, so let's consider it out of scope for now. |
Agreed, #17 will be dependent on this one. |
Agreed, it adds a lot of complexity.
The original requirement wasn't a workaround, it was built for a use case which seems to be an exception, where you are essentially doing "feature transformations" from within your model. Your point is still valid though, in that we should reevaluate whether this is worth all the extra complexity. It does not seem to be. Also, feature creation is meant to be upstream from Feast, but we should also evaluate what we are classifying as feature creation. Dataset construction would be happening from queries to Feast to both APIs, and those queries are not fully matured yet. It's possible that this area might require more advanced functionalities, which could be akin to doing feature transformations. In fact, this is what is currently available in the Uber Feature Store in the form of a DSL. Whether or not we actually need that or should implement that is another matter, and probably not something we should pick up soon.
Agreed. |
For this issue I am planning to simplify the serving API's request and response structure as shown below.
Old
New
Old
New
What do you guys think? @tims @zhilingc @woop |
@tims if we only update latest value per key, is there a way to prevent late data from overwriting newer data? |
We could. I don't think it's practical to lookup what is already in the stores on startup, but we could add a window in ingestion. And the select the max timestamp in the window, emitting on every feature row with a newer event timestamp. Different feature rows can contain different features though so it would add some complexity. What do you do if you get a newer timestamp and a feature is now missing? This would obviously require resources for ingestion, so we'd want it to be configurable on windows size, and possible to disable. We can do this for streaming and batch, but for batch it's only meaningful for what is ingested during that job. |
wouldn't it be just a missing value for that feature?
What do you mean by resources?
The problem is currently for batch data with different timestamp the resulting last value is random. |
I assume you wouldn't want to delete it? And you'd definitely want to include it, so you need to coalesce the features and detect updates. If you get an older row last, you need to write the parts that were missing in the newer row that you wrote first.
CPU and memory
For batch could choose to combine globally just fine and that would solve your issue. I'd actually prefer to do this just for batch. Thinking more, for streaming we'd actually probably want to use a session based window. But regardless, with a large backlog, you could potentially have multiple windows competing with each other. Basically it's hard. |
documenting our discussion in slack: For streaming, the way to actually make it consistent is to use a global window for the entity. This will utilises the beam model as it's intended. The tradeoff is using more resources during ingestion, dependent on the size of the entity keyspace. A global window with a combiner per key, the same approach will apply to batch we just need triggering in the global window. I think this should have a setting to disable it per import spec, so users can just make a best effort when they want to, or they are confident there are unique entity keys. |
With the merge of #88, only the latest values are being stored in the serving stores now. Historical values are not retrievable, so we can address this issue now. The only task left is changing the serving api? Feedback on the api, I don't think we need timestamps grouped per value now. Since many use cases may not care about the timestamp, we could make the responses (especially in json) prettier by separating them. So rather than the suggested:
How about this:
Or maybe even this?
|
Yes, the implementation based on first proposal is already in #87. Your last suggestion is not possible in protobuf. I like the second one though. |
Tried to render sample JSON Original proposal:
2nd one
Which one do you think it's more preferable? @tims @woop @zhilingc |
I prefer the former, but imo it shouldnt matter that much since the SDK does the parsing of these responses, right? |
yes, doesn't matter that much since SDK will wrap it. |
I'm happy to defer to you and go with the former. |
closed via #87 |
Add feast CLI command to identify resources not covered by permissions
Is your feature request related to a problem? Please describe.
Storing historical value of feature in serving store consume a lot of storage space. It becomes a real problem for limited storage such as Redis.
Describe the solution you'd like
Remove the storing feature as time series data functionality and only store the latest value of feature (which currently we already have)
Impact
Removal of code which writes feature as time series data in Redis and Bigtable.
Serving currently has 1 API which can request LAST feature or LIST of feature.
If the time series data is removed then LIST request will always return empty feature.
Currently, LAST request ignore time range filter, it has to be changed to respect it to allow filtering stale feature.
Client will have to always use LAST request. (currently it uses LIST when time range filtering is not requested)
To be discussed further
Thoughts?
@tims @zhilingc @woop @baskaranz @mansi @budi
The text was updated successfully, but these errors were encountered: