Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AgentSet.groupby #2220

Merged
merged 30 commits into from
Aug 22, 2024
Merged

Add AgentSet.groupby #2220

merged 30 commits into from
Aug 22, 2024

Conversation

quaquel
Copy link
Member

@quaquel quaquel commented Aug 17, 2024

This PR adds pandas style groupby to AgentSet (a first draft for this was #2092, but this PR has stalled). It also adds, as in pandas, a GroupBy helper class to enable method chaining. The resulting API is illustrated below.

Key Changes

  • adds a group_by method to AgentSet; this method enables pandas style group_by operations on the agents in the set.
  • adds a GroupBy helper class; this helper class is returned by the new group_by method and, as in pandas, has a few methods to enable method chaining in a split, apply, combine style syntax.

usage examples

# eppstein civil violence
# will give you a dict with the number of agents for each condition
# the result_type optional kwarg allows finer control over the group data structure
# which enhances performance if you don't need the fancy stuff in agentset
# map returns a dict with group_name and the return of whatever function is applied to each group
some_agentset.groupby("condition", result_type="list").map(len)

# random activation by type
groups = some_agentset.groupby(type)
for group, agents in groups: # we can iterate over the groups
    agents.shuffle(inplace=True).do("step")

# we can easily get the number of groups
len(groups)

#  we can get a dict with the group_name and the groups
groups.groups

# one liner random activation by type
# do returns the underlying agent set
# note how do and appply can take either a callable or a string analogous to AgentSet.do
some_agentset.groupby(type).do("shuffle", inplace=True).do("step")

@quaquel quaquel added the enhancement Release notes label label Aug 17, 2024
Copy link

Performance benchmarks:

Model Size Init time [95% CI] Run time [95% CI]
Schelling small 🔴 +6.1% [+5.8%, +6.5%] 🔵 -0.3% [-0.5%, -0.1%]
Schelling large 🔴 +5.8% [+5.1%, +6.5%] 🔵 +0.3% [-0.3%, +0.9%]
WolfSheep small 🔵 +3.5% [+2.3%, +4.7%] 🔵 +2.4% [+2.1%, +2.8%]
WolfSheep large 🔴 +3.7% [+3.3%, +4.0%] 🔵 +1.2% [+0.8%, +1.6%]
BoidFlockers small 🔵 +2.9% [+2.2%, +3.6%] 🔵 +1.9% [+1.3%, +2.7%]
BoidFlockers large 🔵 +3.2% [+2.8%, +3.7%] 🔵 +2.0% [+1.3%, +2.7%]

@EwoutH EwoutH added feature Release notes label and removed enhancement Release notes label labels Aug 17, 2024
@EwoutH
Copy link
Member

EwoutH commented Aug 19, 2024

specifically enable user choice between agentsets and lists. Lists are faster but should be used with care

It seems that this is a recurring issue.

I have no idea if something like the following is possible in Python, but ideally you would have a fast construct that some function can accept and some functions can output aside from the agentset. As soon as you match or chain two of such functions, they recognize they can take the "fast path" and avoid compiling an recompiling an agentset.

But that's on a high level, no idea if it's possible/feasible to implement in our case in Python.

@EwoutH
Copy link
Member

EwoutH commented Aug 19, 2024

Apparently this is a concept called "Deferred Execution". Basically, you can do something like this:

  • Instead if directly performing each operation, you chain all operations that need to be performed in an iterator
  • Once you have gathered all operations, you check for each out if they support the fast data construct as both input and output
  • If there are any matches, flip a boolean for those operations
  • They you execute them as the original chain

You need some wrapper and/or decorator magic to make this work, and it adds a lot of code complexity. But then it could prevent unnecessary conversions.

It's kind of related to the lazy execution concept, but now you're able to look forward instead of only backwards.

@quaquel
Copy link
Member Author

quaquel commented Aug 19, 2024

It seems that this is a recurring issue.

Yes it is

Apparently this is a concept called "Deferred Execution". Basically, you can do something like this:

I had a look but this seems a lot of trouble to get working. You need to use reflection to figure out all calls and then complex logic to figure out what do to. The short term and easier solution is to allow users to control this explicitly. With the inplace argument on for example shuffle we are already doing so. It is just a matter of documenting it carefully and making clear that it can be beneficial to profile this carefully.

@quaquel quaquel marked this pull request as ready for review August 19, 2024 20:25
@quaquel
Copy link
Member Author

quaquel commented Aug 19, 2024

This is ready for review. All tests are passing and the docstring is in place. The list/agentset stuff has been raised by Ewout and I believe giving users control is the least worst solution at the moment.

What remains is the split apply combine stuff. Split/groupby is there. Apply is available on the GroupBy helper class. I am less sure on the need and desirability of an explicit control for combine. In pandas this is needed because everything needs to be transformed back to a DataFrame/Series. We don't have that need in MESA.

@quaquel quaquel requested review from EwoutH and rht August 19, 2024 20:28
@EwoutH
Copy link
Member

EwoutH commented Aug 19, 2024

Thanks a lot for this. Will take a close look tomorrow. If you have the chance, could you maybe add a few more user API examples in the PR start?

Copy link

Performance benchmarks:

Model Size Init time [95% CI] Run time [95% CI]
Schelling small 🔵 -2.5% [-2.8%, -2.3%] 🔵 +0.6% [+0.4%, +0.8%]
Schelling large 🔵 -2.6% [-2.9%, -2.3%] 🔵 +2.6% [+1.2%, +4.2%]
WolfSheep small 🔵 -3.4% [-4.2%, -2.6%] 🔵 +0.5% [+0.2%, +0.8%]
WolfSheep large 🟢 -3.6% [-3.8%, -3.4%] 🔵 -0.2% [-0.5%, +0.1%]
BoidFlockers small 🟢 -6.9% [-7.5%, -6.2%] 🔵 -1.4% [-2.0%, -0.7%]
BoidFlockers large 🟢 -6.4% [-6.9%, -5.9%] 🔵 -0.9% [-1.4%, -0.5%]

mesa/agent.py Outdated
"""Apply the specified callable to each group

Args:
callable (Callable): The callable to apply to each group, it will be called with the group as first argument
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be useful to also pass the group key here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Apply in pandas goes over all groups and if you want to do it on a single group you can just do

groupes = some_agentset.groupby("condition")
callable(groupes.get_group["quiet"])

@EwoutH
Copy link
Member

EwoutH commented Aug 19, 2024

On initial look this is a clean and solid start, especially API wise. I don't expect much work to be needed, we can extend what we find missing in later PRs.

Tomorrow I will play with some models to get a real feel for this function.

@quaquel
Copy link
Member Author

quaquel commented Aug 20, 2024

On initial look this is a clean and solid start, especially API wise. I don't expect much work to be needed, we can extend what we find missing in later PRs.

I came up with a couple of simple additions to GroupBy.apply. First, it now only takes a callable, this should be changed to a callable or string so you can call methods on the underlying agentset if desired. Next, it might be nice to control the return type to be either the return of the callable/str or the GroupBy instance itself (as we do in AgentSet in places). This allows for chaining like

# random activation by type
some_agentset.groupby(type).apply("shuffle", returntype="groupby", inplace=true).apply("do")

@EwoutH
Copy link
Member

EwoutH commented Aug 20, 2024

Thanks. I think the best to keep group_by properly scoped is to have a few very clear use cases. Data collection (with aggregation) is clearly one, RandomActivationByType could be another one.

What I find weird about that scheduler by the way, is that is shuffles the order of the agent groups and the agents within the group both by default:

def step(self, shuffle_types: bool = True, shuffle_agents: bool = True) -> None:

Which means the same group can go twice in a row, etc. Not a weird option to have, but strange by default.

Luckily the AgentSet is way more flexible. A shuffle_groups() method might be useful in GroupBy if you want to replicate this behavior.

But all based on clear use cases.

What would be interesting is to see if it allows us to make our current examples better or easier. For example, can something in Game of Life be improved with group by? Or in Boltzmann wealth?

mesa/agent.py Show resolved Hide resolved
@Corvince
Copy link
Contributor

I think this PR adds a nice feature, but I would advice to carefully reconsider the return types. Currently the flow is like the following, if we consider all possible return types (including the ones proposed for apply)

agentset.group_by("foo").apply(bar)
AgentSet -> GroupBy[str, AgentSet | list] -> Dict[str, Any] | GroupBy[str, AgentSet | list]

It took me quite some time to actually figure this out, because you really have to think through what can be returned by each function. Compare this to Pandas

dataframe.groupby("foo").apply(bar)
DataFrame -> DataFrameGroupBy[str, DataFrame] -> DataFrame

I would recommend following a similar path and staying as close to our AgentSet as possible. I think defining return types is also a bit unpythonic. Instead of agentset.groupby(foo, response_type="list") it is more pythonic to just use list(agentset.groupby(foo).

Also please note the different spelling of group_by (this PR) and groupby (pandas)

@quaquel
Copy link
Member Author

quaquel commented Aug 20, 2024

I think this PR adds a nice feature, but I would advice to carefully reconsider the return types. Currently the flow is like the following, if we consider all possible return types (including the ones proposed for apply)

I have been somewhat concerned about this as well, but you drive home the importance of sorting it out. Its even more confusing then I thought.

I think defining return types is also a bit unpythonic. Instead of agentset.groupby(foo, response_type="list") it is more pythonic to just use list(agentset.groupby(foo).

Here I disagree somewhat. Pandas has control for returntype in various places E.g. DataFrame.apply has a way of specifying the result_type. We also do some of it in AgentSet already (i.e., AgentSet.do). The main problem is that in pandas, groupby will allways go back to a dataframe. I don't think there is such a obvious default in our case.

One possible solution is to only add group_by (renamed as @Corvince points out to groupby) and don't add the GroupBy class with the apply function.

@rht
Copy link
Contributor

rht commented Aug 21, 2024

One possible solution is to only add group_by (renamed as @Corvince points out to groupby) and don't add the GroupBy class with the apply function.

I thought this was because Polars had decided to use group_by instead of groupby. In dplyr, it is group_by. I'm not sure of Polars' reasoning, but it seems to be for consistency with dplyr and that Polars is a cross-language library. I'm fine with either way.

mesa/agent.py Outdated Show resolved Hide resolved
@quaquel quaquel force-pushed the agentset_group_by branch from 332645b to eaf0a5b Compare August 21, 2024 13:46
@quaquel
Copy link
Member Author

quaquel commented Aug 21, 2024

I have added __len__, seperated apply and do, and removed get_groups. Unittests have also been updated. This is ready for a last review.

@quaquel quaquel changed the title Add AgentSet.group_by Add AgentSet.groupby Aug 21, 2024
@quaquel
Copy link
Member Author

quaquel commented Aug 22, 2024

Based on conversation in #2237, changed GroupBy.apply to GroupBy.map.

@quaquel quaquel merged commit a48dd80 into projectmesa:main Aug 22, 2024
11 of 12 checks passed
@EwoutH
Copy link
Member

EwoutH commented Aug 22, 2024

Thanks for the work, both this PR and the do split ended up great!

EwoutH pushed a commit to EwoutH/mesa that referenced this pull request Sep 24, 2024
Adds Agentset.groupby and Groupby Helper class
@quaquel quaquel deleted the agentset_group_by branch November 4, 2024 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Release notes label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants