-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AutoML: Add a TablesClient for automl-tables specific behavior. #8720
Conversation
Additional test, docs & proposed cleanup needs to happen on top of this.
#16) * update create_model to allow user to specify included or excluded columns * made minor changes stylistically and with added ValueError outputs
* added two new func: set time, get table address * changed indentation
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info. |
@jonathan1920 you need to sign the CLA, or resign the commits with the correct @google.com address |
@merla18 Please take a look. At a high level, it looks like this custom layer does the following:
Project ID on clientIIRC there have been proposals over the years for storing project on the client. As this manual layer implements this feature, I'm wondering: is this a feature that has any active efforts for GAX/core? Calling RPC methods with shorthand for resource names/identifiersAre there any other Python Google Cloud Client Libraries that implement something like this? I'm a little worried about consistency. When learning GCP APIs, part of the learning curve is resources (their "names" which are fully qualified identifiers -vs- display names which many resources have -vs- identifiers which are often server-side generated but sometimes client provided). This layer simplifies working with resources a lot, it looks very elegant. But learning this interface might confuse developers when they use this interface and then try using any of the other GCP API Python Cloud Client Libraries. Looks elegant! I'd like to find out more, though. Especially because this layer simplifies working with GCP API resources in general, which I love. /cc @sirtorry @andrewferlitsch @dizcology @lukesneeringer @crwilcox @JustinBeckwith |
Rebecca
What's most notable to me on the AutoML APIs is that they are designed in a
procedural style instead of an OOP style. They lack abstraction,
encapsulation and polymorphism (like Keras vs pre-TF 2.0 --hence why TF 2.0
adopted Keras for the model API). Looking at the above bullets, first
thoughts are:
- things like project ID, etc are attributes of the client object.
(encapsulation)
- shorthand would be attribute aliases. (abstraction)
- CRUD ops sound like overloaded methods (polymorphism).
…On Fri, Jul 19, 2019 at 2:00 PM Rebecca Taylor ***@***.***> wrote:
@merla18 <https://github.com/merla18> Please take a look.
At a high level, it looks like this custom layer does the following:
- Instantiate client with Project ID so the client knows the Project ID
- Treat the resources' Display Names as a shorthand identifier
- CRUD operations can accept any of the following to identify a
resource:
- Full name (projects/X/locations/Y/resource_type/*ID*)
- The resource's ID only (project ID stored on client)
- The resource's Display Name
- Python pb object representing the resource, itself
- This is branded for Tables but most of the functionality would work
across the board for AutoML
- The identifier shorthand functionality would also work for any
resourceful GCP API
Project ID on client
IIRC there have been proposals over the years for storing project on the
client.
As this manual layer implements this feature, I'm wondering: is this a
feature that has any active efforts for GAX/core?
Calling RPC methods with shorthand for resource names/identifiers
Are there any other Python Google Cloud Client Libraries that implement
something like this?
I'm a little worried about consistency. When learning GCP APIs, part of
the learning curve is resources (their "names" which are fully qualified
identifiers -vs- display names which many resources have -vs- identifiers
which are often server-side generated but sometimes client provided).
This layer simplifies working with resources a lot, it looks very elegant.
But learning this interface might confuse developers when they use this
interface and then try using any of the other GCP API Python Cloud Client
Libraries.
Looks elegant! I'd like to find out more, though. Especially because this
layer simplifies working with GCP API resources *in general*, which I
love.
------------------------------
/cc @sirtorry <https://github.com/sirtorry> @andrewferlitsch
<https://github.com/andrewferlitsch> @dizcology
<https://github.com/dizcology> @lukesneeringer
<https://github.com/lukesneeringer> @crwilcox
<https://github.com/crwilcox> @JustinBeckwith
<https://github.com/JustinBeckwith>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8720?email_source=notifications&email_token=AFOVC2GVPDOBZ7JNVMKST53QAITP3A5CNFSM4IFLYT2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2MYBVQ#issuecomment-513376470>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFOVC2AWRXQIYPKDSFPRGPDQAITP3ANCNFSM4IFLYT2A>
.
|
@beccasaurus I recommend we do a review and see the best "wrapper" approach that can be applied uniformly across all the APIs. For example, it maybe better to provide an attribute to a generic Client type that would indicate the type of dataset/model (Tables, etc) vs. subclassing (i.e., TablesClient is a subclass of Client). |
That's a great point:
The real difficultly in using the original client is that a lot of these identifiers are hidden behind multiple API calls & string transformations. For example, to update a column (a necessary and required step), you need to know first the fully-qualified name of the dataset, use that to lookup the fully-qualified name of the primary table, and then search the primary table for the column in question. At that point, you need to extract the API-assigned identifier from the column's fully qualified name (something of the form This API works great when you're a browser & always have the entire state of all your resources in memory. However, when you're just getting started in a jupyter notebook and want to quickly run some experiments using autoML, you're out of luck. Would the "correct" solution be to rewrite all the APIs with both client users & browsers in mind? Maybe. But that doesn't cover the future work we want to implement, namely integration with pandas, integration with matplotlib, and integration with IPython. On top of that, we have complaints from both internal users who are experts in GCP, as well as users who are brand new to the platform (and these complaints don't even address the lack of the above mentioned enhancements). Overall, there is just too much complexity in the client to perform very simple operations. In the end, the tables team wants a polished, first-class, data scientist-friendly client, as this is what our users are asking for. Consequently, we're ready to put in the effort to maintain this client. |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
Updated |
✅ Green light from my end. We followed up about Project ID / Resource Objects offline. I'm +1 to merge this when it's considered ready. @busunkim96 Could you add @sirtorry as a reviewer? |
I opened a PR on sloth to add @sirtorry to yoshi-python. You should be able to add comments for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests lgtm
automl/tests/system/gapic/v1beta1/test_system_tables_client_v1.py
Outdated
Show resolved
Hide resolved
I've made the list system tests more stringent. Anything blocking us from merging? |
Friendly ping |
@sirtorry Please merge it when you're ready. You should have write permissions to the repo. |
Thanks all! 🙂 |
…leapis#8720) * Checking in staged client helper code Additional test, docs & proposed cleanup needs to happen on top of this. * update create_model to allow user to specify included or excluded col… (#16) * update create_model to allow user to specify included or excluded columns * made minor changes stylistically and with added ValueError outputs * Update doc gen & module structure. Add unit & system tests * added two new func: set time, get table address (#23) * added two new func: set time, get table address * changed indentation * Add system tests * Address linter & python2.7 import errors * Passes **kwargs through to client & implements missing methods * Support BQ as input/output in batch_predict * Address first round of feedback * Switch to pytest.raises, fix .rst formatting exception * Make list system tests more stringent
* Checking in staged client helper code Additional test, docs & proposed cleanup needs to happen on top of this. * update create_model to allow user to specify included or excluded col… (#16) * update create_model to allow user to specify included or excluded columns * made minor changes stylistically and with added ValueError outputs * Update doc gen & module structure. Add unit & system tests * added two new func: set time, get table address (#23) * added two new func: set time, get table address * changed indentation * Add system tests * Address linter & python2.7 import errors * Passes **kwargs through to client & implements missing methods * Support BQ as input/output in batch_predict * Address first round of feedback * Switch to pytest.raises, fix .rst formatting exception * Make list system tests more stringent
AutoML tables is heavily used by datascientists in jupyter notebooks. Based on feedback we've gotten, the existing client is hard to use, obtuse, not notebook-friendly. This wrapper client aims to fix that.
Using the client (see integration tests for more detailed snippets):
Testing
Most testing is done in unit tests, however we have some integration tests covering the common user flows as well (create dataset -> import data -> train -> predict). To speed up the testing, we create resources the first time tests are run, and run tests off of those in subsequent runs. This mitigates the minimum 1h+ model creation & deployment time.
Future work
In no particular order, we plan to add:
Sorry for the large change, most of it is doc strings and unit tests. We were recommended to check in the full basic functionality first.