
Use drop_duplicates() instead of groupby (about 1.5~2x faster) #1617

Merged
merged 2 commits into from
Jun 4, 2021

Contributor

@rightx2 rightx2 commented Jun 3, 2021

What this PR does / why we need it:
df.drop_duplicates() is about 1.5~2x faster than groupby() + reset_index() for keeping the last row per key.

You can test it with the code below (change the randint range to vary the number of unique values in each column):

import numpy as np
import pandas as pd

# 1000 rows; values in each column are drawn from [0, 100)
df = pd.DataFrame({
    "a": np.random.randint(0, 100, size=1000),
    "b": np.random.randint(0, 100, size=1000),
    "c": np.random.randint(0, 100, size=1000),
})

%%timeit
df.groupby(['a', 'b']).last().reset_index()

%%timeit
# No inplace=True here: it would shrink df on the first loop and skew the later loops
df.drop_duplicates(['a', 'b'], keep="last", ignore_index=True)
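
Outside a notebook, roughly the same comparison can be run with the standard-library timeit module. This is only a sketch: the helper names `with_groupby` and `with_drop_duplicates` are made up for illustration, the data is synthetic, and the measured ratio will vary with pandas version, row count, and key cardinality.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data, same shape as the snippet above (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(0, 100, size=1000),
    "b": rng.integers(0, 100, size=1000),
    "c": rng.integers(0, 100, size=1000),
})


def with_groupby():
    # Hypothetical helper: last row per (a, b) via groupby.
    return df.groupby(["a", "b"]).last().reset_index()


def with_drop_duplicates():
    # Hypothetical helper: last row per (a, b) via drop_duplicates.
    # Returns a new frame, so df is untouched between timing loops.
    return df.drop_duplicates(["a", "b"], keep="last", ignore_index=True)


# Take the best of 3 repeats, each running the function 20 times.
t_groupby = min(timeit.repeat(with_groupby, number=20, repeat=3))
t_dedup = min(timeit.repeat(with_drop_duplicates, number=20, repeat=3))

print(f"groupby:         {t_groupby:.4f} s")
print(f"drop_duplicates: {t_dedup:.4f} s")
```

Both helpers return a new DataFrame, so every timing loop sees the same 1000-row input.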

Which issue(s) this PR fixes:
No related issues. This is a small performance improvement.

Does this PR introduce a user-facing change?:

NONE
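
As a sanity check on the answer above, the sketch below compares the two approaches on synthetic data. It assumes the key columns contain no missing values. Two behavioural differences are worth keeping in mind: groupby() returns rows sorted by key while drop_duplicates() keeps the original row order, and GroupBy.last() takes the last non-null value per column while drop_duplicates(keep="last") keeps the last row as-is. Whether either matters for the code touched by this PR is not shown here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(0, 100, size=1000),
    "b": rng.integers(0, 100, size=1000),
    "c": rng.integers(0, 100, size=1000),
})

via_groupby = df.groupby(["a", "b"]).last().reset_index()
via_dedup = df.drop_duplicates(["a", "b"], keep="last", ignore_index=True)

# groupby sorts by key, so sort the drop_duplicates result before comparing.
dedup_sorted = via_dedup.sort_values(["a", "b"]).reset_index(drop=True)
same_rows = np.array_equal(dedup_sorted.to_numpy(), via_groupby.to_numpy())
print("same rows after sorting:", same_rows)

# With a missing value in the last row of a group, the results differ:
nan_df = pd.DataFrame({"a": [1, 1], "b": [2, 2], "c": [5.0, np.nan]})
g_nan = nan_df.groupby(["a", "b"]).last().reset_index()
d_nan = nan_df.drop_duplicates(["a", "b"], keep="last", ignore_index=True)
print(g_nan)  # c holds the last non-null value in the group
print(d_nan)  # c holds whatever the last row had, here NaN
```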

@feast-ci-bot
Collaborator

Hi @rightx2. Thanks for your PR.

I'm waiting for a feast-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@achals
Member

achals commented Jun 3, 2021

/ok-to-test

@codecov-commenter

codecov-commenter commented Jun 3, 2021

Codecov Report

Merging #1617 (ed34cdf) into master (99ee2ce) will decrease coverage by 0.00%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #1617      +/-   ##
==========================================
- Coverage   83.64%   83.64%   -0.01%     
==========================================
  Files          67       67              
  Lines        5816     5814       -2     
==========================================
- Hits         4865     4863       -2     
  Misses        951      951              
| Flag | Coverage Δ |
|---|---|
| integrationtests | 83.55% <100.00%> (-0.01%) ⬇️ |
| unittests | 77.84% <100.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ |
|---|---|
| sdk/python/feast/infra/offline_stores/file.py | 96.66% <100.00%> (-0.08%) ⬇️ |


Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99ee2ce...ed34cdf. Read the comment docs.

@achals
Member

achals commented Jun 3, 2021

/retest

rightx2 added 2 commits June 4, 2021 06:44
Signed-off-by: rightx2 <rightx2@gmail.com>
Signed-off-by: rightx2 <rightx2@gmail.com>
@rightx2 rightx2 force-pushed the feature/replace_groupby branch from ec4c15d to ed34cdf Compare June 3, 2021 21:44
@achals
Member

achals commented Jun 4, 2021

/lgtm

@achals
Member

achals commented Jun 4, 2021

/assign @woop

@achals
Member

achals commented Jun 4, 2021

Thanks for the PR @rightx2! I don't have land permissions yet, but @woop should be able to get this merged in.

@achals
Member

achals commented Jun 4, 2021

/lgtm

Member

@achals achals left a comment


Probably gotta approve again after being added to the OWNERS file

@feast-ci-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: achals, rightx2

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit 024737c into feast-dev:master Jun 4, 2021
woop pushed a commit that referenced this pull request Jun 7, 2021
* Use drop_duplicates() instead of groupby (about 1.5~2x faster)

Signed-off-by: rightx2 <rightx2@gmail.com>

* Lint

Signed-off-by: rightx2 <rightx2@gmail.com>