Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feast integration #410

Merged
merged 67 commits into from
Sep 20, 2021
Merged

Feast integration #410

merged 67 commits into from
Sep 20, 2021

Conversation

eapolinario
Copy link
Collaborator

@eapolinario eapolinario commented Sep 15, 2021

This PR adds an integration with Feast, the Feature Store.

TLDR

We define a workflow that showcases the usual steps of a training and serving an ML model that uses a feature store.

Detailed description

The feast_workflow is a translation of the workflow proposed in #322, but without the imperative-style of defining workflow. In other words, we define a workflow that ingests data into an offline store, then it retrieves it and uses the retrieved data for training, finally, that workflow materializes the data into an online store and uses it to run an inference.

In terms of the integration with the Feast SDK, we defined a custom provider named FlyteCustomProvider modeled after the Local provider. FlyteCustomProvider's sole responsibility is to translate references to s3 files into local files.

Note that this example relies on s3 being available. We might need to explore how to generalize this.

Assuming that the sandbox is running, in order to run this locally the env vars used by feast to access s3 need to be defined, in the root of flytesnacks run:

 ❯ FLYTE_AWS_ENDPOINT='http://localhost:30084/' FLYTE_AWS_ACCESS_KEY_ID=minio FLYTE_AWS_SECRET_ACCESS_KEY=miniostorage python cookbook/case_studies/feature_engineering/feast_integration/feast_workflow.py

TODO

  • Fix error in training. More details in the first comment below.
  • Define TypeTransformer for Feast's SDK objects, including FeatureStore, Feature, and FeatureView.
  • Decide if we should move each call to the Feast SDK to its own Task.
  • Finalize translation of online steps (materializing and retrieving)
  • Ensure bucket used by workflow, i.e. s3://feast-integration, exists before the workflow runs
  • Fill in README.rst

@eapolinario
Copy link
Collaborator Author

eapolinario commented Sep 15, 2021

The error in the train_model task as seen in the sandbox:

...
Traceback (most recent call last):

      File "/opt/venv/lib/python3.8/site-packages/flytekit/common/exceptions/scopes.py", line 203, in user_entry_point
        return wrapped(*args, **kwargs)
      File "/root/feast_integration/feast_workflow.py", line 145, in train_model
        model.fit(X_train, y_train)
      File "/opt/venv/lib/python3.8/site-packages/sklearn/naive_bayes.py", line 209, in fit
        return self._partial_fit(X, y, np.unique(y), _refit=True,
      File "/opt/venv/lib/python3.8/site-packages/sklearn/naive_bayes.py", line 374, in _partial_fit
        if _check_partial_fit_first_call(self, classes):
      File "/opt/venv/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 339, in _check_partial_fit_first_call
        clf.classes_ = unique_labels(classes)
      File "/opt/venv/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 98, in unique_labels
        raise ValueError("Unknown label type: %s" % repr(ys))

Message:

    Unknown label type: (array([2.       , 2.2804878, 3.       , 4.       ]),)

User error.

@eapolinario
Copy link
Collaborator Author

I just realized that we also have to ensure that the feast-integration s3 bucket has to exist before we run the workflow. I added another TODO to ensure that we create the bucket before running the workflow.

@samhita-alla
Copy link
Contributor

The error in the train_model task as seen in the sandbox:

...
Traceback (most recent call last):

      File "/opt/venv/lib/python3.8/site-packages/flytekit/common/exceptions/scopes.py", line 203, in user_entry_point
        return wrapped(*args, **kwargs)
      File "/root/feast_integration/feast_workflow.py", line 145, in train_model
        model.fit(X_train, y_train)
      File "/opt/venv/lib/python3.8/site-packages/sklearn/naive_bayes.py", line 209, in fit
        return self._partial_fit(X, y, np.unique(y), _refit=True,
      File "/opt/venv/lib/python3.8/site-packages/sklearn/naive_bayes.py", line 374, in _partial_fit
        if _check_partial_fit_first_call(self, classes):
      File "/opt/venv/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 339, in _check_partial_fit_first_call
        clf.classes_ = unique_labels(classes)
      File "/opt/venv/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 98, in unique_labels
        raise ValueError("Unknown label type: %s" % repr(ys))

Message:

    Unknown label type: (array([2.       , 2.2804878, 3.       , 4.       ]),)

User error.

Fixed the issue. It's working now.

project="flytesnacks",
domain="development",
name="feast_integration.feature_eng_tasks.mean_median_imputer",
version="{{ registration.version }}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kumare3 The workflow is getting registered only when registering it twice. Does registration.version get populated only after the reference tasks get registered?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't get this to work, so reverted this change for the time being.

Also, do we really need this for this demo?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I think so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did it not work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it didn't. Now that the bulk of the work is done I can investigate this.

Samitha, can you expand on why you think having these tasks imported this way is important for this demo? I feel like it's already complicated enough and that we should focus on the interaction with the feast SDK.

Copy link
Contributor

@samhita-alla samhita-alla Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than importing directly, I thought reference tasks would make more sense. Also, I couldn't serialize the tasks with the relative import (I may have to use . in front of the file name).
We need not worry about the relative imports with reference tasks, so I included it.

You can remove it if that's not really required.

# Notice the use of a custom provider.
provider="custom_provider.provider.FlyteCustomProvider",
offline_store=FileOfflineStoreConfig(),
online_store=SqliteOnlineStoreConfig(path=self.config.online_store_path),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have "redis" online store?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean in addition to the sqlite store? I don't see why not. We could expose that as a configuration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could have it in addition or replace the SQLite one. Both should work!

Comment on lines +230 to +231
store_offline_node >> load_historical_features_node
load_historical_features_node >> store_online_node
store_online_node >> retrieve_online_node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
@eapolinario eapolinario force-pushed the feast-integration--pipe-registry branch from 40e7a69 to e4e0b98 Compare September 16, 2021 18:25
@eapolinario eapolinario changed the title [wip] Feast integration Feast integration Sep 16, 2021
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
from tqdm import tqdm


class FlyteCustomProvider(LocalProvider):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samhita-alla can we add some docs here?

full_feature_names,
)

def _localize_feature_view(self, feature_view: FeatureView):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is awesome!


@dataclass_json
@dataclass
class FeatureStore:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think this is the best way. we will have to replicate all Feast FeatureStore methods in Flyte?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i am missing some point, why not a transformer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the same lines, overall, the code looks neat! ✨ However, I presume there's boilerplate code—it looks like we are calling the Feast methods twice, although we don't actually do that. Is there a way we can make the user-facing code even simpler? Do we need to have a "Flyte task" per every Feast operation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two very good questions, so let's separate those out. Just to be clear:

  1. Why not cut the middleman (in this case these dataclasses) and use a type transformer for the Feast SDK FeatureStore object directly?
    • This way we won't need to maintain these dataclasses at all. This is a great idea.
  2. A Flyte Task per Feast operation.
    • The workflow we're using to showcase the integration, feast_workflow, goes through 4 distinct phases: (1) ingesting data into the offline store, (2) loading data from the offline store for training, (3) load data into the online store, and finally (4) retrieve data from the online store for model inference.
    • Having these in separate Flyte Tasks can be used to demonstrate Flyte's strenghts (reproducibility, caching, etc).
    • @samhita-alla , how are you envisioning this simplification? From the user's perspective they only interact with Feast objects once, right? We're doing this translation between dataclasses and the actual Feast FeatureStore object, but this will be gone once we have the type transformer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, after putting some thought into this, I don't think we can have a transformer for the FeatureStore object because we have custom logic to update online store file. Notice how we upload or download the online store file in each of the methods of the current FeatureStore dataclass: [1], [2], and [3].

That said, all this code could be replaced with a type transformer if we could have a remote datasource for SqliteOnlineStoreConfig.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feast's registry already does this for us (Feast has S3RegistryStore), notice how we don't have to worry about uploading local changes to the registry to the remote location.

We could use a similar idea for configuring the sqlite online store.

That combined with the support for remote data sources coming in the next release will remove a lot of the custom used to plug in with Flyte (namely (1) the custom provider, (2) and the dataclass methods used to upload the online store file to s3).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eapolinario How do you think a type transformer implementation could be different from dataclasses? Would that be beneficial in any way?

Concerning the user-facing code, I think having one task per Feast operation helps not think much about Feast-related code as such. The user needs to give the appropriate values, and the rest has to be taken care of by the plugin. This is a bit more complex to implement, but I think it's a deeper integration.

If we aren't inclined towards making the user-facing code much simpler, this isn't necessary. But if we want to hide some implementation details, this technique could help in doing so. Here's something that I've worked on before to write tasks for the Feast plugin: flyteorg/flytekit#577.

Copy link
Collaborator Author

@eapolinario eapolinario Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eapolinario How do you think a type transformer implementation could be different from dataclasses? Would that be beneficial in any way?

@samhita-alla , the idea is to have a type transformer for Feast's FeatureStore, this way we wouldn't have to worry about exposing the functionality of the main object in our dataclasses. In other words, we'd build the feature store object once and that would be the input of each task that had to interact with Feast.

As for making the user-facing code simpler, I see the appeal of hiding some of this behind a structure like what you proposed in flyteorg/flytekit#577, but unless we can bring more functionality to the plugin, I feel like in its current incarnation it doesn't add a lot of value. I feel like once we can have a type transfomer we can simply expose that object in each task and let the client code deal with the Feast sdk directly.

)

# Ingest the data into feast
feature_store.apply([horse_colic_entity, horse_colic_feature_view])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this not update the feature store manifest? or are you saying that we are still mutating the same file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it does mutate the registry (the Feast manifest) and we write the remote path of the FlyteSchema.

Copy link
Collaborator Author

@eapolinario eapolinario Sep 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the act of localizing specific Feast's FeatureView objects prior to important operations (like loading historical features or materializing data into the online store) that makes this work.

@kumare3
Copy link
Contributor

kumare3 commented Sep 17, 2021

This is awesome stuff. Thank you @eapolinario.

…n custom provider

Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
Signed-off-by: Eduardo Apolinario <eapolinario@users.noreply.github.com>
@kumare3 kumare3 self-requested a review September 20, 2021 22:34
@kumare3 kumare3 merged commit 40dc71f into master Sep 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants