Custom TransformerMixin #725

Closed
ChandraLingam opened this issue Apr 22, 2019 · 13 comments

@ChandraLingam

amazon-sagemaker-examples/sagemaker-python-sdk/scikit_learn_inference_pipeline/

In the abalone example, sklearn built-in transformers/encoders are used. How do we integrate a custom transformer into the SageMaker pipeline?

I want to add new features that are computed from other features. When I include the class below as part of the pipeline, the transform job fails with an error:
AttributeError: module '__main__' has no attribute 'AddNewFeatures'

What is the recommended approach for this?

from sklearn.base import TransformerMixin
class AddNewFeatures(TransformerMixin):
    def __init__(self, *featurizers):
        self.featurizers = featurizers

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Do transformations
        # print(type(X))
        ...
        return X

@tonybaby16

I am facing the same issue.
Details:
I am following the example at https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.ipynb

The AddNewFeatures class (an equivalent class in my case) is defined inside script_path = 'sklearn_abalone_featurizer.py'.
The scikit-learn estimator is created successfully.
The "batch transform our training data" step fails with the error: sagemaker_containers._errors.ClientError: module '__main__' has no attribute 'AddNewFeatures'

My guess is that the error is thrown from within model_fn (in sklearn_abalone_featurizer.py)
at the step preprocessor = joblib.load(os.path.join(model_dir, "model.joblib"))
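
That guess is consistent with how pickle-based loaders such as joblib behave: the saved file records only the module and class name, and the class itself has to be importable when the file is loaded. Below is a minimal sketch of the failure using only the standard library. The `training_main` module is a hypothetical stand-in for the entry-point script, and plain pickle stands in for joblib; neither is taken from the example notebook.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the entry-point script as it runs during training.
training_main = types.ModuleType("training_main")
exec(
    "class AddNewFeatures:\n"
    "    def fit(self, X, y=None):\n"
    "        return self\n"
    "    def transform(self, X):\n"
    "        return X\n",
    training_main.__dict__,
)
sys.modules["training_main"] = training_main

# Training: the pickle records only *where* the class lives, not its code.
payload = pickle.dumps(training_main.AddNewFeatures())
records_module_name = b"training_main" in payload

# Serving: the module name still resolves, but the class is not defined in it.
del training_main.AddNewFeatures
try:
    pickle.loads(payload)
    load_error = None
except AttributeError as exc:
    load_error = exc

print(records_module_name, type(load_error).__name__)
```

If the class is defined in the script that is executed for training, the recorded module name is `__main__`, and whatever process later loads the model has to find the class under that same name.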

@tonybaby16

Update: [image]

This seems to work.

@ChandraLingam
Author

ChandraLingam commented May 1, 2019

After several hours of trying (including source_dir), the option that finally worked for me was the dependencies parameter in SKLearn:

script_path = 'myscript.py'

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    train_instance_type="ml.c4.xlarge",
    sagemaker_session=sagemaker_session,
    dependencies=['AddNewFeatures.py'])
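
As far as I understand it, this helps because the file listed in dependencies is shipped alongside the entry point, so the class lives in an importable module rather than in the script being executed. Here is a standard-library-only sketch of that effect. The temporary directory stands in for the container's code directory, the duck-typed class leaves out TransformerMixin to avoid a scikit-learn dependency, and pickle stands in for joblib; all of these are assumptions made for illustration.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical contents of AddNewFeatures.py (duck-typed; no scikit-learn needed here).
MODULE_SOURCE = '''
class AddNewFeatures:
    def __init__(self, *featurizers):
        self.featurizers = featurizers

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Append one computed column per featurizer to every row.
        return [list(row) + [f(row) for f in self.featurizers] for row in X]
'''

# Stand-in for the directory where the entry point and its dependencies land.
code_dir = Path(tempfile.mkdtemp())
(code_dir / "AddNewFeatures.py").write_text(MODULE_SOURCE)
sys.path.insert(0, str(code_dir))
importlib.invalidate_caches()

module = importlib.import_module("AddNewFeatures")
fitted = module.AddNewFeatures(sum, max).fit([[1, 2]])
payload = pickle.dumps(fitted)

# Simulate a fresh serving process: forget the module, then unpickle.
del sys.modules["AddNewFeatures"]
restored = pickle.loads(payload)
result = restored.transform([[1, 2], [3, 4]])
print(result)
```

The unpickling step re-imports `AddNewFeatures` by name, which only succeeds because the module file is on the import path. The featurizers here are built-ins (`sum`, `max`) because lambdas cannot be pickled.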

@wiltonwu
Contributor

Hi,

I apologize for the delay in response. You are exactly correct: the suggested approach is to either bring in the file through the dependencies parameter, or put the file into a new directory and add that directory using the source_dir parameter. I'm going to close this issue as it has been resolved. Please reopen and comment if necessary!

@DanyalAndriano

I know this issue is closed, but I have the same problem. @ChandraLingam, may I ask what you included in your AddNewFeatures.py? I'm new to SageMaker, so I'm still trying to figure this all out.

@DanyalAndriano

@wiltonwu, how exactly would I add the script to a new directory (and which directory), and then bring it in with source_dir?

@pranidhii

Update: image

This seems to work.

The link doesn't have any content!

@tonybaby16

Update: image
This seems to work.

The link doesn't have any content!

https://stackoverflow.com/questions/54314876/aws-sagemaker-sklearn-entry-point-allow-multiple-script

@tthpham

tthpham commented Nov 23, 2020

Hi @ChandraLingam and @wiltonwu,
I tried your suggestion but always get the error AttributeError: module '__main__' has no attribute 'DataTransformer' when publishing an endpoint.

Here are my settings:

estimator = SKLearn(
    entry_point="script.py",
    role=role_name,
    train_instance_count=1,  # training instance count
    train_instance_type=instance_type,  # training instance type
    output_path=f's3://{bucket}/{prefix}/output',  # S3 location for output data
    sagemaker_session=sess,
    framework_version='0.23-1',
    base_job_name=base_job_name,
    hyperparameters={'data_path': dataset_to_train},
    dependencies=['DataTransformer.py'],
    source_dir='s3://mybucket/pyscripts/source.tar.gz')

The files script.py and DataTransformer.py are archived and uploaded to S3, and source_dir points to the .tar.gz file.
How should I modify my script to make it work?

@karthikph007

@tonybaby16, I am working on the same "sklearn_abalone_featurizer.py" and end up with sagemaker_containers._errors.ClientError: module '__main__' has no attribute 'AddNewFeatures'. Could you share how you resolved it?

I have followed the solution mentioned in this link https://stackoverflow.com/questions/54314876/aws-sagemaker-sklearn-entry-point-allow-multiple-script but with no result. I am still stuck with the same error.

@tonybaby16

@karthikph007
What worked for me was adding the source_dir parameter, as shown below. script is a folder containing any helper classes that abc.py needs to import. Hope this helps.

sklearn_preprocessor = SKLearn(
    entry_point='abc.py',
    source_dir='script',
    role=role,
    train_instance_type="ml.c4.xlarge",
    sagemaker_session=sagemaker_session)
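
For later readers, here is a rough end-to-end sketch of why this layout works, using only the standard library. The file names featurizer.py and add_new_features.py are hypothetical (the entry point is not called abc.py here because that name would shadow Python's standard abc module in this local simulation), pickle stands in for joblib, and two subprocesses stand in for the training and serving containers. The assumption is that the entry point is executed as a script for training and imported as a module for serving; in both cases the helper class resolves to the same importable module.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical helper module placed inside the source_dir folder.
HELPER = '''
class AddNewFeatures:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [list(row) + [sum(row)] for row in X]
'''

# Hypothetical entry point: trains when run as a script, serves via model_fn.
ENTRY_POINT = '''
import os
import pickle
import sys

from add_new_features import AddNewFeatures  # lives in a module, not in __main__


def model_fn(model_dir):
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    with open(os.path.join(sys.argv[1], "model.pkl"), "wb") as f:
        pickle.dump(AddNewFeatures().fit([[0]]), f)
'''

SERVE = "import sys, featurizer; print(featurizer.model_fn(sys.argv[1]).transform([[1, 2]]))"

work = Path(tempfile.mkdtemp())
source_dir = work / "script"
model_dir = work / "model"
source_dir.mkdir()
model_dir.mkdir()
(source_dir / "add_new_features.py").write_text(HELPER)
(source_dir / "featurizer.py").write_text(ENTRY_POINT)

# "Training": run the entry point as a script, which saves the fitted object.
subprocess.run(
    [sys.executable, "featurizer.py", str(model_dir)], cwd=source_dir, check=True
)

# "Serving": import the entry point as a module and call model_fn.
served = subprocess.run(
    [sys.executable, "-c", SERVE, str(model_dir)],
    cwd=source_dir, check=True, capture_output=True, text=True,
)
serving_output = served.stdout.strip()
print(serving_output)
```

Moving the class definition into the entry point itself would make the training run pickle it under `__main__`, which is the situation the error message in this thread describes.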

@karthikph007

karthikph007 commented Jun 25, 2021

@tonybaby16
Must the script folder contain the helper classes in an AddNewFeatures.py file, or a requirements.txt file?
I tried creating an AddNewFeatures.py file and pointing to it with the source_dir parameter, but I still get the same error.
AddNewFeatures.py:
from sklearn.pipeline import Pipeline

class DataframeFunctionTransformer():
    def __init__(self, func):
        self.func = func

    def transform(self, input_df, **transform_params):
        return self.func(input_df)

    def fit(self, X, y=None, **fit_params):
        return self

def process_dataframe(input_df):
    # Upper-case every value in the "text" column.
    input_df["text"] = input_df["text"].map(lambda t: t.upper())
    return input_df

@karthikph007

I resolved the issue by creating DataframeFunctionTransformer.py containing the class and importing it as a module in both training and testing.

from package.DataframeFunctionTransformer import DataframeFunctionTransformer, process_dataframe

Reference taken from:
https://stackoverflow.com/questions/56260720/import-custom-modules-in-amazon-sagemaker-jupyter-notebook
