Multi-Agent data collection #348

Closed
Corvince opened this issue Jan 5, 2017 · 59 comments

@Corvince
Copy link
Contributor

Corvince commented Jan 5, 2017

Dear all,

This is actually how I came across this issue: I wanted to activate different agent types sequentially (but both randomly), so I used two different schedulers, but this broke the data collection. Currently, agent reporters get their agents hard-coded from model.scheduler.agents, assuming it exists and failing if your scheduler is named differently. One way to fix this would be to (optionally?) supply the scheduler to the DataCollector.

The downside is that if you have agents that are not part of any schedule you still can't collect data for them. That's already a problem right now, so it wouldn't worsen the situation, but maybe someone has a better long-term solution to this?

Also, if you use the same scheduler, there seems to be no way to collect data from different agents. If you for example want to count the wealth of some of your agents, but not all agents have a wealth attribute, it fails.

@dmasad
Copy link
Member

dmasad commented Jan 8, 2017

That's an interesting question.

My inclination is to say that since the scheduler keeps track of time in a model, models should have just one scheduler. Check out the custom scheduler in the Wolf-Sheep example (if you haven't already) to see how you can activate different types of agents in different orders.
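
As a rough illustration of the single-scheduler idea (this is not the Wolf-Sheep example's actual code), the sketch below keeps one step counter and activates agents one type at a time, shuffling within each type. `ByTypeScheduler` and the toy agent classes are hypothetical names and use no Mesa API.

```python
import random
from collections import defaultdict


class ByTypeScheduler:
    """Hypothetical scheduler: one clock, agents grouped by class."""

    def __init__(self, seed=None):
        self.steps = 0
        self.random = random.Random(seed)
        self.agents_by_type = defaultdict(list)

    def add(self, agent):
        self.agents_by_type[type(agent)].append(agent)

    def step(self, order):
        # Activate each type in the given order, randomly within a type.
        for agent_type in order:
            group = list(self.agents_by_type[agent_type])
            self.random.shuffle(group)
            for agent in group:
                agent.step()
        self.steps += 1


class Sheep:
    def __init__(self, log):
        self.log = log

    def step(self):
        self.log.append("sheep")


class Wolf:
    def __init__(self, log):
        self.log = log

    def step(self):
        self.log.append("wolf")


log = []
scheduler = ByTypeScheduler(seed=1)
for _ in range(2):
    scheduler.add(Sheep(log))
    scheduler.add(Wolf(log))
scheduler.step(order=[Sheep, Wolf])
```

Because every agent lives in the same scheduler, a data collector that reads agents from that one scheduler would still see all of them.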

That doesn't solve the problem of how to handle data collection for heterogeneous attributes, though. The easy but ugly way is to give all agents the attribute, and just set some of them to 0, or None, or some similar 'N/A' value.

I think the better way would be to give the DataCollector a default 'collect attribute' behavior, which would also let us get away from some of the ugliness with passing lambdas, etc. As part of that, the DataCollector would ensure that agents had the appropriate attribute, and only collect it if they did.

@Corvince
Copy link
Contributor Author

Corvince commented Jan 9, 2017

Aha! I did not look into the example, because I thought the order is not important for the wolf-sheep example. And I did not think about implementing a custom scheduler. But after reconsidering I strongly agree that there should be exactly one scheduler per model.

Regarding the DataCollector, a "collect_attribute" would indeed be nice and more intuitive than the lambda functions. But I would still lean towards additionally defining the agent-type. Staying with the Wolf-Sheep example, one might be interested only in the position of the wolves, but querying the pos-attribute would still query all agents.

@Corvince
Copy link
Contributor Author

Corvince commented Feb 8, 2017

To advance on this, in the datacollection module I modified the _new_agent_reporter function and added the _collect_attribute function as follows:

def _new_agent_reporter(self, reporter_name, reporter_function=None):
    """ Add a new agent-level reporter to collect.

    Args:
        reporter_name: Name of the agent-level variable to collect.
        reporter_function: Function object that returns the variable when
                           given an agent object.

    """
    if isinstance(reporter_function, str):
        reporter_function = self._collect_attribute(reporter_function)
    self.agent_reporters[reporter_name] = reporter_function
    self.agent_vars[reporter_name] = []

def _collect_attribute(self, attribute):
    """ return a reporter function that gets an attribute with the name of
    the reporter, if an agent has that attribute
    """
    def reporter_function(agent):
        if hasattr(agent, attribute):
            return getattr(agent, attribute)
    return reporter_function

So instead of calling the agent-reporter with something like {"position": lambda a: a.pos}, we can use {"position": "pos"}. Maybe a bit less ugly and with the added benefit that it only collects the attribute if it is available.

What do you think?
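
For readers who want to try the idea outside of Mesa, here is a minimal standalone sketch. The helper name `make_attribute_reporter` and the toy classes are assumptions for illustration; `getattr(agent, attribute, None)` behaves the same as the `hasattr`/`getattr` pair above.

```python
def make_attribute_reporter(attribute):
    """Return a reporter that yields the attribute, or None if it is missing."""
    def reporter(agent):
        return getattr(agent, attribute, None)
    return reporter


class Wolf:
    def __init__(self):
        self.pos = (1, 2)
        self.energy = 5


class Grass:
    def __init__(self):
        self.pos = (0, 0)


# Strings are turned into reporters; callables would be kept as they are.
requested = {"position": "pos", "energy": "energy"}
reporters = {
    name: make_attribute_reporter(r) if isinstance(r, str) else r
    for name, r in requested.items()
}

agents = [Wolf(), Grass()]
rows = [[reporters[name](a) for name in reporters] for a in agents]
```

Here the `Grass` agent has no `energy`, so its row carries `None` in that position instead of raising an error.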

@Corvince
Copy link
Contributor Author

I just stumbled upon this issue again in my current model and I still think my last comment offers a nice solution. If you think this is a good way I will create a test for this and submit a PR.

@ihopethiswillfi
Copy link
Contributor

ihopethiswillfi commented Mar 22, 2018

I ran into this as well and used another approach.

Example:

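class King(Agent):
    def __init__(self, uid, model):
        super().__init__(uid, model)
        self.uid = uid      # e.g. 'k0', 'k1', ...
        self.wealth = 0     # placeholder value

class Bird(Agent):
    def __init__(self, uid, model):
        super().__init__(uid, model)
        self.uid = uid      # e.g. 'b0', 'b1', ...
        self.color = None   # placeholder value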

What I did was simply alter the uids. E.g. all kings would have uids ['k0', 'k1', ...] and birds would have ['b0', 'b1', ...]. You get the idea.

The agent reporter would then look like:
"wealth": lambda x: x.wealth if x.uid[:1] == 'k' else None

I didn't really explore the above solution from Corvince. But I just wanted to add mine here, which seems to work well for me and is stupidly simple.

An alternative to changing uids would be to simply add a property to the class. Something like Agent.type = 'king', and then verifying this when you collect data.
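
A small sketch of that alternative, written without Mesa so it runs on its own. The `agent_type` attribute and the `report_for` helper are hypothetical names, not part of any library.

```python
class King:
    agent_type = "king"

    def __init__(self, uid, wealth):
        self.uid = uid
        self.wealth = wealth


class Bird:
    agent_type = "bird"

    def __init__(self, uid, color):
        self.uid = uid
        self.color = color


def report_for(agent_type, attribute):
    """Hypothetical helper: report `attribute` only for the given type."""
    def reporter(agent):
        if getattr(agent, "agent_type", None) == agent_type:
            return getattr(agent, attribute)
        return None
    return reporter


wealth_reporter = report_for("king", "wealth")
agents = [King("k0", 10), Bird("b0", "red")]
wealth = [wealth_reporter(a) for a in agents]
```

The reporter returns the wealth for kings and `None` for every other agent type, so no uid naming convention is needed.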

@Corvince
Copy link
Contributor Author

Corvince commented May 6, 2019

Nowadays the data-collector simply returns "None" if an attribute doesn't exist

@Corvince Corvince closed this as completed May 6, 2019
@philip928lin
Copy link

@Corvince
I encountered an AttributeError because some of the agents do not have certain attributes:
"AttributeError: 'Aquifer' object has no attribute 'satisfaction'"
According to your previous response, I expected the model to return None instead of raising an error.
Could you help me look into this issue?

@rht
Copy link
Contributor

rht commented Sep 13, 2023

There is https://github.com/projectmesa/mesa/pull/1702/files, which adds an optional argument exclude_none_values.

exclude_none_values: Boolean of whether to drop records which values
            are None, in the final result.

This is the only documentation of the new feature; no proper guide has been written for it yet.

If you

  1. enable that option
  2. replace your function with lambda a: getattr(a, "satisfaction", None)

It should automatically ignore agents that don't have satisfaction as an attribute. A pull request to document this feature in https://github.com/projectmesa/mesa/blob/main/docs/useful-snippets/snippets.rst would be appreciated.

@philip928lin
Copy link

philip928lin commented Sep 13, 2023

Thank you for your instructions.
They let the code run without an error message, but the agent dataframe output is not correct.
In the figure below, agent IDs starting with "w" are one agent type, and those starting with "agt" are another.
None should appear in the "Sa" column, but it is instead filled with "w1" and "w2", which belong in the "ID" column.

Would you mind guiding me again?

[update]
It turns out I should disable "exclude_none_values" and only use the lambda function you suggested.
But this also means that Mesa does not automatically return None for an attribute that does not exist. We still need to do this manually with the lambda function.

image

@rht
Copy link
Contributor

rht commented Sep 14, 2023

That sounds like a serious bug. Do you have a minimal code reproducer, so I can tinker with it and fix it?

@philip928lin
Copy link

Hi @rht,
Definitely~
Please see the following.
However, I hope that returning None could become Mesa's default behavior, without needing a lambda.

import mesa

class MyAgent1(mesa.Agent):
    def __init__(self, name, model):
        super().__init__(name, model)
        self.name = name
        self.agt_type = "agt_type1"
        self.satisfication = 1

    def step(self, run=True):
        pass

class MyAgent2(mesa.Agent):
    def __init__(self, name, model):
        super().__init__(name, model)
        self.name = name
        self.agt_type = 'agt_type2'

    def step(self, run=True):
        pass

class MyModel(mesa.Model):
    def __init__(self, n_agents):
        super().__init__()
        self.schedule = mesa.time.BaseScheduler(self)
        for i in range(n_agents):
            self.schedule.add(MyAgent1(f"A{i}", self))
            self.schedule.add(MyAgent2(f"B{i}", self))
        
        self.datacollector = mesa.DataCollector(
            model_reporters={},
            agent_reporters={"satisfication": lambda a: getattr(a, "satisfication", None),
                             "unique_id": lambda a: getattr(a, "unique_id", None)},                 
            exclude_none_values=True
        )
    def step(self):
        self.schedule.step()
        self.datacollector.collect(self)

m = MyModel(3)
m.step()
m.step()
m.step()

df_agts = m.datacollector.get_agent_vars_dataframe()

@Corvince
Copy link
Contributor Author

> @Corvince
> I encountered an AttributeError because some of the agents do not have certain attributes:
> "AttributeError: 'Aquifer' object has no attribute 'satisfaction'"
> According to your previous response, I expected the model to return None instead of raising an error.
> Could you help me look into this issue?

Mesa only automatically returns None if your data collector only consists of string collectors. I.e. for your model

agent_reporters={"satisfication": "satisfication",
                 "unique_id": "unique_id"}

At some point this was planned to be the main way to collect attributes (because it is also the fastest), but custom functions are still around, so I closed this issue too early.

@Corvince
Copy link
Contributor Author

> Thank you for your instructions.
> They let the code run without an error message, but the agent dataframe output is not correct.
> In the figure below, agent IDs starting with "w" are one agent type, and those starting with "agt" are another.
> None should appear in the "Sa" column, but it is instead filled with "w1" and "w2", which belong in the "ID" column.
>
> Would you mind guiding me again?
>
> [update]
> It turns out I should disable "exclude_none_values" and only use the lambda function you suggested.
> But this also means that Mesa does not automatically return None for an attribute that does not exist. We still need to do this manually with the lambda function.
>
> image

Yes this is indeed a bug. Looking at the code of #1702 now we can see that it completely removes None values. But of course that leaves blank spaces in the Data frame, so the values move to the left (because there are too few values to fill the DF and no indication where values should go). Not sure how that option was supposed to work. @rht?
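
The shifting effect can be reproduced in a few lines of plain Python. This is a simplified reconstruction of the symptom, not Mesa's actual code; the record layout follows the reproducer above, and the column names are chosen for illustration.

```python
columns = ["Step", "AgentID", "satisfication", "unique_id"]

# One tuple per agent, as produced by reporters that return None
# for a missing attribute.
records = [
    (1, "A0", 1, "A0"),
    (1, "B0", None, "B0"),
]

# Dropping None shortens the tuple, so later values slide left.
stripped = [tuple(v for v in rec if v is not None) for rec in records]

# Pair values with column names by position, as a tabular view would.
rows = [dict(zip(columns, rec)) for rec in stripped]
```

For agent B0 the value "B0" now sits under "satisfication" and the "unique_id" column has nothing left to fill it, which matches the misplaced IDs in the reported dataframe.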

You can also see in the discussion of #1702 that the feature alone wasn't meant to remove the need for handling Attribute errors on the user side. Sorry for that.

@rht
Copy link
Contributor

rht commented Sep 16, 2023

It looks like we have to make a choice:

  • using getattr(..., None) without excluding None gives the correct DF, but the agent records have lots of Nones
  • with excluding None enabled, the DF is wrong, because the order and size of each record don't match the reporters, but at least the agent records are small

We could use dict instead of tuple for individual agent records, so the order and size are not important, but a dict consumes more RAM than a tuple.

I think one solution would be to transpose the data collection

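agent_records = {}
for k, func in self.agent_reporters.items():
    # Parentheses are required: without them r would hold the boolean
    # result of the `is not None` comparison, not the reported value.
    record = tuple(
        (agent.unique_id, r)
        for agent in model.schedule.agents
        if (r := func(agent)) is not None
    )
    agent_records[k] = record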

The records can be merged into 1 DF via the unique_id as an index.
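
A dependency-free sketch of that merge step, assuming the per-reporter layout described above with None entries already dropped. The sample data is made up, and a pandas version would presumably join on the same key.

```python
agent_records = {
    "satisfication": (("A0", 1), ("A1", 1)),
    "unique_id": (("A0", "A0"), ("B0", "B0"), ("A1", "A1"), ("B1", "B1")),
}

# Re-join the per-reporter records on unique_id; an agent missing from a
# reporter simply has no entry for that column.
merged = {}
for column, record in agent_records.items():
    for unique_id, value in record:
        merged.setdefault(unique_id, {})[column] = value

# Fill the gaps explicitly when a rectangular table is needed.
table = {
    uid: {col: values.get(col) for col in agent_records}
    for uid, values in merged.items()
}
```

Because each value travels with its unique_id, dropping None entries no longer changes which agent or column a value belongs to.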

@rht rht reopened this Sep 16, 2023
@Corvince
Copy link
Contributor Author

Maybe I am being dumb right now, but what was the purpose of removing None values in the first place? It doesn't facilitate multi-agent data collection. So I think the choice should be obvious: do not exclude None values, and receive the correct DF. What would be the advantage of receiving a wrong dataframe, just to save oneself the slight inconvenience (?) of having lots of None values? Am I missing something here?

@rht
Copy link
Contributor

rht commented Sep 16, 2023

It wasn't raised as a GH issue, but @tpike3 encountered OOM when running Sugarscape G1MT on Colab (see #1702 (comment)). I suppose the storage problem here is due to Python's dict of list growing with list size even though the constituents are None's. Another option: maybe a DF for self._agent_records consumes less RAM, where a row is added to the DF for every data collection step.

@Corvince
Copy link
Contributor Author

> It wasn't raised as a GH issue, but @tpike3 encountered OOM when running Sugarscape G1MT on Colab (see #1702 (comment)). I suppose the storage problem here is due to Python's dict of list growing with list size even though the constituents are None's. Another option: maybe a DF for self._agent_records consumes less RAM, where a row is added to the DF for every data collection step.

But then we really need to clear things up. Because until now I thought the remove None value function is somehow related to multi agent data collection, because it is discussed here and in #1702, which "fixed" #1419, which was also related to multi agent data collection.

If this is just about resolving that memory issue then this needs to be further investigated. Because it sounds very strange that removing some None values solves this. None values themselves take up nearly no memory. And I don't know which "dict of list" you are referring to, but yes something like that must be going on. But it still sounds fishy, since colab has 13GB of ram, more than most consumer hardware. So I wonder why this hasn't been encountered previously.

But right now we should focus on resolving the bug found by @philip928lin because that might really mess up some people's research.

@tpike3
Copy link
Member

tpike3 commented Sep 17, 2023

@Corvince I had a very long explanation, but as I am digging in I am finding inconsistencies in my understanding, so I will need to dig into this some more. Regardless, when updating the Sugarscape model with traders, the memory issue was that the code was collecting ~2500 None values each step for the sugar and spice, which started to break the Colab memory. The Sugarscape examples are here: https://github.com/SFIComplexityExplorer/Mesa-ABM-Tutorial/tree/main. I still need to update them for Mesa 2.0, but I think I will need to work through this issue first.

Short version: while I appreciate that None values take up very little memory, when you have agents at each grid cell (like plant life) and collect against them, it still becomes problematic.

@Corvince
Copy link
Contributor Author

It's still hard to imagine; it would be great if you could look into this. For reference (and for the fun of it), a simple list of 2500 None values consumes only 20kB; even if you collect that 1000 times, it's still only 20MB. A more appropriate comparison would be a list that also holds unique_id and step, which can be approximated by

x = [[i, 1, None] for i in range(2500)]

Using

from pympler import asizeof
asizeof.asizeof(x)

we find that this list of lists consumes 300kB. So after 1000 steps we are at 300MB. That's still quite far away from Colab's 13GB of RAM.
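
For anyone without pympler installed, a rough cross-check is possible with the standard library alone. The `deep_size` helper below is a hypothetical, simplified stand-in for `asizeof` that only follows lists and tuples; exact numbers depend on the Python build, so none are claimed here.

```python
import sys


def deep_size(obj, seen=None):
    """Rough recursive size in bytes for nested lists/tuples (CPython)."""
    seen = set() if seen is None else seen
    if id(obj) in seen:  # count shared objects (e.g. None) only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, (list, tuple)):
        size += sum(deep_size(item, seen) for item in obj)
    return size


flat = [None] * 2500
nested = [[i, 1, None] for i in range(2500)]

flat_kb = deep_size(flat) / 1000
nested_kb = deep_size(nested) / 1000
print(f"flat: {flat_kb:.0f} kB, nested: {nested_kb:.0f} kB")
```

Multiplying the nested figure by the number of collection steps gives the same kind of back-of-the-envelope estimate as above.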

@rht
Copy link
Contributor

rht commented Sep 18, 2023

This is the original agent_records

((1, 'A0', 1, 'A0'), (1, 'B0', None, 'B0'), (1, 'A1', 1, 'A1'), (1, 'B1', None, 'B1'), (1, 'A2', 1, 'A2'), (1, 'B2', None, 'B2'))

exclude_none_values only works if the agent_records is organized this way instead

{'satisfication': (('A0', 1), ('B0', None), ('A1', 1), ('B1', None), ('A2', 1), ('B2', None)), 'unique_id': (('A0', 'A0'), ('B0', 'B0'), ('A1', 'A1'), ('B1', 'B1'), ('A2', 'A2'), ('B2', 'B2'))}

where the tuple element can be dropped while safely retaining which agents have which values.

@rht
Copy link
Contributor

rht commented Sep 18, 2023

@Corvince
Copy link
Contributor Author

Corvince commented Sep 18, 2023

> You should measure/debug on the actual agent records object at https://colab.research.google.com/github/SFIComplexityExplorer/Mesa-ABM-Tutorial/blob/main/Session_19_Data_Collector_Agent.ipynb.

Thank you for the link, I couldn't find the right version. In your link I only had to change for _, x, y in self.grid.coord_iter(): to for _, (x, y) in self.grid.coord_iter(): to make it work.

Analyzing the actual agent_records object gave me 310MB of memory usage, and 9MB for the None-removed version. So it was very nice to see my approximation of 300MB hold up.

But this also shows that while removing None can save lots of space compared to the full dataset, keeping the None values doesn't prevent the model from being run on Colab. I could easily store 10 model runs in Colab's memory. @tpike3 I realized that when you have multiple Colab tabs open, the sessions share the same memory. So maybe you were simply doing too much Colab work at the same time?
We should also keep in mind that we are collecting more than 4 million data points here. I think 300MB isn't that bad for that, given that most models collect far fewer data points.

> This is the original agent_records
>
> ((1, 'A0', 1, 'A0'), (1, 'B0', None, 'B0'), (1, 'A1', 1, 'A1'), (1, 'B1', None, 'B1'), (1, 'A2', 1, 'A2'), (1, 'B2', None, 'B2'))
>
> exclude_none_values only works if the agent_records is organized this way instead
>
> {'satisfication': (('A0', 1), ('B0', None), ('A1', 1), ('B1', None), ('A2', 1), ('B2', None)), 'unique_id': (('A0', 'A0'), ('B0', 'B0'), ('A1', 'A1'), ('B1', 'B1'), ('A2', 'A2'), ('B2', 'B2'))}
>
> where the tuple element can be dropped while safely retaining which agents have which values.

At first this looks nice and I like the semantics here of retaining what value is being collected. But I am afraid this won't scale very well. For this small example your version has a larger memory footprint (with None being removed of course), due to the dictionary overhead. That probably goes away with larger size, but it doesn't scale with collecting more attributes, because you always have to store the unique_id with each data value. For example:

('A0', 'a', 'b', 'c', 'd') 

would become

{'A': ('A0', 'a'), 'B': ('A0', 'b'), 'C': ('A0', 'c'), 'D': ('A0', 'd')}

Which can easily take up more memory. So it will really depend on how many None values you have.
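
To make the trade-off concrete, here is a back-of-the-envelope sketch that counts stored slots rather than bytes. It ignores container overhead and the step column, and the function names are made up for illustration.

```python
def row_slots(n_agents, n_attrs):
    """Row layout: one unique_id plus every attribute value per agent."""
    return n_agents * (1 + n_attrs)


def transposed_slots(n_agents, n_attrs, none_fraction):
    """Per-attribute layout: a (unique_id, value) pair per non-None value."""
    kept_values = n_agents * n_attrs * (1 - none_fraction)
    return 2 * kept_values


def break_even_none_fraction(n_attrs):
    """None fraction at which both layouts store the same number of slots."""
    return (n_attrs - 1) / (2 * n_attrs)


# The single-agent example above: 5 slots as a row, 8 slots when transposed.
row_example = row_slots(1, 4)
transposed_example = transposed_slots(1, 4, none_fraction=0.0)
```

Under these assumptions, with four attributes the per-attribute layout only stores fewer slots once more than 37.5% of the values are None, and that threshold moves toward 50% as more attributes are collected.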

Also, I am worried that we need additional code to put the dataframe back together, which will further complicate the code. And the reason to favor #1702 over #1701 was to have simpler code. That advantage goes away for something that could also be done after the fact by simply calling df.dropna(). So I think this really depends on whether we run out of memory or not. But we would need a reproducer for that.

@rht
Copy link
Contributor

rht commented Sep 18, 2023

That makes a lot of sense now. Maybe it was a coincidence that @tpike3's memory usage was relieved by freeing ~300 MB in each session (as such, it could add up to gigabytes)?

Regarding multi-agent data collection, @philip928lin already got the correct DF by using getattr(agent, "attr", None) without exclude_none_values, and without any additional feature needed in the library. I vote to remove exclude_none_values since it is not usable, but at the same time I am not inclined to merge #1701, because it's optional at this point.

@tpike3
Copy link
Member

tpike3 commented Sep 18, 2023

I just went back through and actually found another change in Mesa 2.0 that broke the tutorial, which I need to go back in and fix. So I will try to get to that this weekend.

However, if you run session 20 (batch_run) and comment out line 204 in the Model cell

#agent_trades = [agent for agent in agent_trades if agent[2] is not None]

This results in GBs of memory usage with one colab open.

You also need to change the instantiation of the sugar and spice landscape (lines 92 to 105) to ...

for _, pos in self.grid.coord_iter():
    max_sugar = sugar_distribution[pos[0], pos[1]]
    if max_sugar > 0:
        sugar = Sugar(agent_id, self, pos, max_sugar)
        self.schedule.add(sugar)
        self.grid.place_agent(sugar, pos)
        agent_id += 1

    max_spice = spice_distribution[pos[0], pos[1]]
    if max_spice > 0:
        spice = Spice(agent_id, self, pos, max_spice)
        self.schedule.add(spice)
        self.grid.place_agent(spice, pos)
        agent_id += 1

@Corvince, @rht, @philip928lin let me know what you think: either something I am messing up, or the best way to move forward.

@Corvince
Copy link
Contributor Author

@tpike3 I can confirm that batch run leads to excessive memory usage, although it doesn't actually start that many model runs. I need to investigate this further but my first impression is that something is off with batch_run.

@tpike3
Copy link
Member

tpike3 commented Sep 18, 2023

> @tpike3 I can confirm that batch run leads to excessive memory usage, although it doesn't actually start that many model runs. I need to investigate this further but my first impression is that something is off with batch_run.

Thanks @Corvince, I am wondering that too; maybe it wasn't the datacollector but batch_run. I'm always behind, but I will dabble with it as well.

@EwoutH
Copy link
Member

EwoutH commented Nov 10, 2023

I'm a big fan of Corvince's latest proposal. I think it's both elegant and adds a huge amount of capability and flexibility!

@jackiekazil @tpike3 @rht I'm really curious what you think!

@EwoutH
Copy link
Member

EwoutH commented Nov 18, 2023

I would like to give implementing @Corvince’s proposal a go. But before doing that, I need to know if there is broader support or if we need to go a different direction, like my previous proposal or otherwise.

I would also really hope we can move this forward. “No”, “I disagree”, “I don’t have time”, “this shouldn’t be a priority” are all legit answers. But please just communicate something, so that everyone knows how the deck is stacked.

EwoutH added a commit to EwoutH/mesa that referenced this issue Dec 17, 2023
Tracks agents in the model with a defaultdict.

This PR adds a new `agents` dictionary to the Mesa `Model` class, enabling native support for handling multiple agent types within models. This way all modules can know which agents and agents type are in the model at any given time, by calling `model.agents`.

NetLogo has had agent types, called [`breeds`](https://ccl.northwestern.edu/netlogo/docs/dict/breed.html), built-in from the start. It works perfectly in all NetLogo components, because it's a first class citizen and all components need to be designed to consider different breeds.

In Mesa, agent types are an afterthought at best. Almost nothing is currently designed with multiple agent types in mind. That has caused several issues and limitations over the years, including:

- projectmesa#348
- projectmesa#1142
- projectmesa#1162

Especially in scheduling, space and datacollection, lack of a native, consistent construct for agent types severely limits the possibilities. With the discussion about patches and "empty" this discussion done again. You might want empty to refer to all agents or only a subset of types or single type. That's currently cumbersome to implement.

Basically, by always having dictionary available of which agents of which types are in the model, you can always trust on a consistent construct to iterate over agents and agent types.

- The `Model` class now uses a `defaultdict` to store agents, ensuring a set is automatically created for each new agent type.
- The `Agent` class has been updated to leverage this feature, simplifying the registration process when an agent is created.
- The `remove` method in the `Agent` class now uses `discard`, providing a safer way to remove agents from the model.
tpike3 pushed a commit that referenced this issue Dec 18, 2023
@EwoutH
Copy link
Member

EwoutH commented Dec 18, 2023

Now that #1894 is merged, we can take a further look at datacollection. #1911 might also help, maybe we can use a data collection as input for the datacollector (or use a similar API).

@EwoutH
Copy link
Member

EwoutH commented Mar 28, 2024

Note that this discussion is largely continued here:

@EwoutH
Copy link
Member

EwoutH commented Aug 15, 2024

I'm posting back here instead of #1944, because it directly follows a proposal here:

I'm inclined to say that @Corvince was closest with his API:

dc = DataCollector(
    items={
        "wolf_vars": collect(
            target=model.get_agents_of_type(Wolf),
            attributes={
                "energy": "energy",
                "healthy": lambda a: a.energy > 5,
            },
            aggregates={
                # Pass the functions themselves; they are called on the collected values
                "mean_energy": ("energy", np.mean),
                "number_healthy": ("healthy", sum),
            },
        ),
    }
)

This would return the following dictionary:

{
    "wolf_vars": {
        "attributes": {
            "agent_id": [1, 2, 3],       # List of agent IDs
            "energy": [3, 7, 10],        # Energy levels of each wolf
            "healthy": [False, True, True],  # Whether each wolf is healthy (energy > 5)
        },
        "aggregates": {
            "mean_energy": 6.67,         # Mean energy of all wolves
            "number_healthy": 2          # Number of healthy wolves
        }
    }
}

Implementation wise, this could roughly look like:

class DataCollector:
    def __init__(self, items):
        self.items = items
        self.data = {
            key: {
                "attributes": {},
                "aggregates": {}
            }
            for key in items
        }

    def collect(self, model):
        for item_name, item_details in self.items.items():
            # Assumes each collect(...) item exposes its arguments as a mapping
            agents = item_details['target']
            attributes = item_details['attributes']
            aggregates = item_details['aggregates']
            collected = self.data[item_name]["attributes"]

            # Collect agent IDs
            collected["agent_id"] = agents.get("unique_id")

            # Collect attributes for each agent
            for attr_name, attr_func in attributes.items():
                if isinstance(attr_func, str):
                    # Use AgentSet.get()
                    collected[attr_name] = agents.get(attr_func)
                else:
                    # Use AgentSet.apply(); a comprehension stands in for it here
                    collected[attr_name] = [attr_func(agent) for agent in agents]

            # Collect aggregates
            for agg_name, (attr_name, agg_func) in aggregates.items():
                values = collected[attr_name]
                self.data[item_name]["aggregates"][agg_name] = agg_func(values)

I think this gives a huge amount of flexibility, while offering a logical code path: first collect the raw agent data, then aggregate if needed.

A nice benefit is that AgentSet.get() or AgentSet.apply() only needs to be applied once per variable.
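
To check that the collect-then-aggregate flow behaves as described, here is a self-contained sketch that needs neither Mesa nor NumPy. The `collect` function and `Wolf` class are simplified, hypothetical stand-ins for the proposal, and `statistics.mean` replaces `np.mean`.

```python
from statistics import mean


class Wolf:
    def __init__(self, unique_id, energy):
        self.unique_id = unique_id
        self.energy = energy


def collect(target, attributes, aggregates):
    """Hypothetical stand-in for the proposed collect(): one item's data."""
    collected = {"agent_id": [a.unique_id for a in target]}
    for name, reporter in attributes.items():
        if isinstance(reporter, str):
            collected[name] = [getattr(a, reporter) for a in target]
        else:
            collected[name] = [reporter(a) for a in target]
    # Aggregates reuse the values collected above instead of re-reading agents.
    summary = {
        name: func(collected[attr]) for name, (attr, func) in aggregates.items()
    }
    return {"attributes": collected, "aggregates": summary}


wolves = [Wolf(1, 3), Wolf(2, 7), Wolf(3, 10)]
wolf_vars = collect(
    target=wolves,
    attributes={"energy": "energy", "healthy": lambda a: a.energy > 5},
    aggregates={
        "mean_energy": ("energy", mean),
        "number_healthy": ("healthy", sum),
    },
)
```

With the three wolves from the example output, this yields the same attribute lists, a mean energy of about 6.67 and two healthy wolves.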

@EwoutH
Copy link
Member

EwoutH commented Aug 15, 2024

One thing that could be considered is not running aggregates per collect() function, but once per DataCollector object. This way you could in theory combine aggregates from different collect() targets and the model.

@rht
Copy link
Contributor

rht commented Aug 15, 2024

The new ideas that I can incorporate into #2199:

  • instead of a model-level reporter, it is called an aggregate. I agree this is semantically clearer
  • perf: for each group, there should be only a single loop that computes all the aggregates/attributes

I still find the API too hectic and too verbose for casual users to remember intuitively, unless there is a key feature that the simple API in #2199 can't cover. That is why I implemented it the way I did in #2199 and ditched the fancy measure classes. A reminder that there is not much time left on the drawing board: only ~2 weeks.

@Corvince
Copy link
Contributor Author

One challenge I keep encountering is with the terminology we're using. I believe we're conflating data collection and data analysis too often, which muddies the distinction between the two and distracts us from what we want to achieve.

In my current job, I’ve had to re-evaluate various libraries, focusing on what makes some more user-friendly than others. I’ve found that the deciding factor in terms of ease of use is having sensible defaults combined with the ability to fully customize under the hood. An intuitive API for data collection, in my view, would look something like this:

data = run_model(model)

However, this kind of simplicity is missing from the current API. The reason I advocate for this approach is that I typically prefer to collect as much data as possible during the model run and perform the analysis afterward, either through custom functions or with built-in Mesa functions. By default, I believe we should automatically collect all agent and model attributes (and possibly every property) at every step. Aggregates, by their nature, can be calculated post-run. Expressions like "healthy": lambda a: a.energy > 5 either belong in the analysis phase—meaning they don't need to be calculated during runtime—or they are intrinsic to the model and should therefore be treated as their own attribute or property.

I anticipate concerns about the potential performance impact of this approach. However, I don't think this will be a significant issue for most models. Data collection should be implemented at a low level, with more "expensive" convenience functions like .todf() being applied afterward. Of course, users should still have the ability to fully customize data collection by providing:

data = run_model(model, data_collector=DataCollector(...))

This way, the API can focus on being flexible rather than overly concise. It doesn’t need to be memorized for every model but can be something that users opt into when they have specific requirements.
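
As a rough sketch of what such a default could look like: the `run_model` below is a toy stand-in written for this illustration (including its `steps` parameter), not an existing Mesa function. It records every public instance attribute of every agent at each step and leaves derived values to the analysis phase.

```python
def snapshot(obj):
    """Copy every public instance attribute of obj (hypothetical default)."""
    return {k: v for k, v in vars(obj).items() if not k.startswith("_")}


def run_model(model, steps):
    """Hypothetical runner: step the model and record all agent attributes."""
    data = []
    for step in range(1, steps + 1):
        model.step()
        for agent in model.agents:
            data.append({"step": step, **snapshot(agent)})
    return data


class Wolf:
    def __init__(self, unique_id, energy):
        self.unique_id = unique_id
        self.energy = energy


class ToyModel:
    def __init__(self):
        self.agents = [Wolf(1, 3), Wolf(2, 7)]

    def step(self):
        for agent in self.agents:
            agent.energy += 1


data = run_model(ToyModel(), steps=2)

# Derived values such as "healthy" are computed after the run.
healthy_at_end = [row["energy"] > 5 for row in data if row["step"] == 2]
```

Nothing about "healthy" had to be declared before the run; it is derived afterwards from the raw energy values.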

@quaquel
Copy link
Member

quaquel commented Aug 16, 2024

I broadly agree with the vision of @Corvince. However, for me, there is still a fundamental difference between the various attributes within a model/agent and what data one wants to collect about the model (and whether to collect this data over time). So, I am not in favor of just collecting everything by default. Rather, I want users to declare explicitly within the model/agent that a given attribute is collectable/observable. Next, outside the model, the user can specify how to collect it (over time or at the end of the runtime). This is still in line with having a simple and concise API that is incredibly flexible and so small that it is easy to remember. So you would get something like this

class MyModel(Model):
    gini = Observable()

    # rest of model goes here

class MyAgent(Agent):
    wealth = Observable()

    # rest of agent goes here

model = MyModel()

collectors = [AgentCollector(MyAgent.wealth),
              ModelCollector(MyModel.gini)]

model.run(ticks=100)
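One way the proposal above could be realised is with a Python descriptor. This is only a sketch under assumptions: `Observable`, `AgentCollector`, and `ModelCollector` are not existing Mesa classes, the model and agent classes are plain stand-ins, and `total_wealth` replaces `gini` to keep the example short.

```python
class Observable:
    """Hypothetical descriptor that marks an attribute as collectable."""

    def __set_name__(self, owner, name):
        self.name = name
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # class access returns the marker itself
        return getattr(instance, self.private_name)

    def __set__(self, instance, value):
        setattr(instance, self.private_name, value)


class MyAgent:
    wealth = Observable()

    def __init__(self, wealth):
        self.wealth = wealth


class MyModel:
    total_wealth = Observable()

    def __init__(self):
        self.agents = [MyAgent(1), MyAgent(3)]
        self.total_wealth = 4

    def step(self):
        self.agents[0].wealth += 1
        self.total_wealth = sum(a.wealth for a in self.agents)


class ModelCollector:
    """Records one model observable each time collect() is called."""

    def __init__(self, observable):
        self.observable = observable
        self.data = []

    def collect(self, model):
        self.data.append(getattr(model, self.observable.name))


class AgentCollector:
    """Records one observable for every agent each time collect() is called."""

    def __init__(self, observable):
        self.observable = observable
        self.data = []

    def collect(self, model):
        self.data.append([getattr(a, self.observable.name) for a in model.agents])


model = MyModel()
collectors = [AgentCollector(MyAgent.wealth), ModelCollector(MyModel.total_wealth)]

# Collection happens outside the model, after each step.
for _ in range(2):
    model.step()
    for collector in collectors:
        collector.collect(model)
```

Because `MyAgent.wealth` evaluates to the descriptor itself, a collector can be handed the attribute directly instead of a string name.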

@Corvince
Contributor Author

I still don't fully understand this sentiment. If I declare an attribute, it must play a decisive role for my model - otherwise I wouldn't need it. And so of course it seems advisable to keep its data. There is nothing lost in keeping more data than I might actually analyze. The other way around is much more annoying. If I want to analyze something additional that I haven't thought about before, I now have to rerun my model - just to collect additional data. The same logic applies to aggregating - I would never do this at runtime. It's not possible to disaggregate later, while aggregating can be done as late as needed.
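To illustrate the point about aggregating after the run: a minimal sketch, assuming the raw output is a list of per-agent, per-step records. The record layout and the `aggregate_by_step` helper are hypothetical, not part of Mesa.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw output: one record per agent per step.
records = [
    {"step": 1, "agent_id": 0, "wealth": 2},
    {"step": 1, "agent_id": 1, "wealth": 4},
    {"step": 2, "agent_id": 0, "wealth": 1},
    {"step": 2, "agent_id": 1, "wealth": 5},
]


def aggregate_by_step(records, key, func):
    """Group the values of `key` by step and reduce each group with `func`."""
    groups = defaultdict(list)
    for record in records:
        groups[record["step"]].append(record[key])
    return {step: func(values) for step, values in groups.items()}


# Any aggregate can be chosen after the run, without rerunning the model.
mean_wealth = aggregate_by_step(records, "wealth", mean)
max_wealth = aggregate_by_step(records, "wealth", max)
```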

But for other reasons I am much more in favor of exploring observables/signals as main building blocks for mesa models now. And then I agree that their declaration could be a nice entry point also for declaring how/if they should be collected. But I wouldn't tie this too much into the data collection discussion. But I also agree that for my envisioned API, an easy way to declare included/excluded attributes would be desirable. And doing this directly at the attribute level would be viable.

@quaquel
Member

quaquel commented Aug 16, 2024

I still don't fully understand this sentiment. If I declare an attribute, it must play a decisive role for my model - otherwise, I wouldn't need it.

Yes, but from this, it does not follow that all attributes are part of the outcomes that you, as an analyst, are interested in. I see models as objects on which one performs experiments. It is good practice to carefully design the experimental setup. This includes specifying the variables to explicitly control and the data to gather for this experiment. Collecting simply everything, which I see a lot of people do, in my experience, often devolves into data dredging and bad science. I hope this clarifies where I am coming from.

But for other reasons I am much more in favor of exploring observables/signals as main building blocks for mesa models now.

I started exploring psygnal, and it seems like a nice library on which to build. For example, it solves the problem of observing changes to collections. I hope to find some time in the coming weeks to rerig my datacollection branch on top of this.

@Corvince
Contributor Author

Philosophically I understand your point much better now, but practically speaking I think it applies much earlier.
Say I want to study the population dynamics of the wolf sheep model. I could give every animal some random name. Surely I don't need to collect the name, because it's unrelated to the population dynamics. But then why should I include it in the first place? A good model should only include variables that are related to the research question. But then I also want to measure them. I can't think of a good example where I would include some variables in the model, but where I don't want to evaluate their impact on the result.

The only distinction I immediately see is between private and public variables. In some examples there are attributes that hold some private state (e.g. countdown for the grass patches in the wolf sheep example). Those should probably be declared as private variables and need not be collected. But as soon as they are used for some form of interaction (i.e. public), they should be measured.
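The private/public distinction could be expressed with Python's leading-underscore convention. The following is a sketch only: `GrassPatch` is a simplified stand-in for the example's grass patch, and `public_attributes` is a hypothetical helper.

```python
class GrassPatch:
    """Hypothetical agent with one public and one private attribute."""

    def __init__(self):
        self.fully_grown = False  # read by other agents, so worth collecting
        self._countdown = 3  # internal bookkeeping, never read by other agents


def public_attributes(obj):
    """Return instance attributes whose names do not start with an underscore."""
    return {k: v for k, v in vars(obj).items() if not k.startswith("_")}


collected = public_attributes(GrassPatch())
```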

@quaquel
Member

quaquel commented Aug 16, 2024

The only distinction I immediately see is between private and public variables. In some examples there are attributes that hold some private state (e.g. countdown for the grass patches in the wolf sheep example). Those should probably be declared as private variables and need not be collected. But as soon as they are used for some form of interaction (i.e. public), they should be measured.

Yes, this is what I was getting at. However, the cutoff is not between private and public per se. In the wolf-sheep example, the position is critical to the model's functioning, but for analysis (note: not visualization) purposes, this is not so relevant. In contrast, in Epstein, the agent's internal state (wanting to protest but not daring to do so) can be quite relevant to track for analysis purposes.

But I also agree that for my envisioned API, an easy way to declare included/excluded attributes would be desirable. And doing this directly at the attribute level would be viable.

It might be quite easy to automatically track all attributes declared as "Observable" by default, as you envisioned in your API. This would still involve some additional collector classes that are part of the basic model class or part of some run_experiment function (as in your sketched API). (In my view, data collection is independent of the model and, conceptually, does not belong inside a model class.) So you would get something like

class MyModel(Model):
    gini = Observable()

    # rest of model goes here

class MyAgent(Agent):
    wealth = Observable()

    # rest of agent goes here

model = MyModel()

data = run_experiment(model, ticks=100)

Here, data would be some new Data class from which you can grab, say, the agent-level data. It could have some convenience methods for converting to, e.g., data frames, but it would leave it up to the user to decide how to make the results persistent.
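A rough sketch of how such a `run_experiment` function and `Data` class might fit together, assuming every attribute declared as `Observable` is collected after each tick. All names here (`Observable`, `observed_names`, `Data`, `agent_series`, `run_experiment`) are hypothetical, the model and agent are plain stand-ins, and `richest` replaces `gini` for brevity.

```python
from dataclasses import dataclass, field


class Observable:
    """Hypothetical marker: attributes declared with it are collected."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value


def observed_names(obj):
    """Names of all Observable attributes declared on obj's class hierarchy."""
    return [
        name
        for klass in type(obj).__mro__
        for name, value in vars(klass).items()
        if isinstance(value, Observable)
    ]


@dataclass
class Data:
    """Hypothetical result container returned by run_experiment."""

    model_rows: list = field(default_factory=list)
    agent_rows: list = field(default_factory=list)

    def agent_series(self, name):
        """Values of one agent attribute, in collection order."""
        return [row[name] for row in self.agent_rows]


def run_experiment(model, ticks):
    """Step the model and collect every declared Observable after each tick."""
    data = Data()
    for tick in range(1, ticks + 1):
        model.step()
        model_row = {n: getattr(model, n) for n in observed_names(model)}
        data.model_rows.append({"tick": tick, **model_row})
        for agent in model.agents:
            agent_row = {n: getattr(agent, n) for n in observed_names(agent)}
            data.agent_rows.append({"tick": tick, **agent_row})
    return data


class MyAgent:
    wealth = Observable()

    def __init__(self, wealth):
        self.wealth = wealth
        self.position = (0, 0)  # not declared Observable, so not collected


class MyModel:
    richest = Observable()

    def __init__(self):
        self.agents = [MyAgent(1), MyAgent(2)]
        self.richest = 2

    def step(self):
        for agent in self.agents:
            agent.wealth *= 2
        self.richest = max(a.wealth for a in self.agents)


data = run_experiment(MyModel(), ticks=2)
```

The model classes contain no collection logic; persistence and conversion to data frames would be left to the caller, as described above.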

@rht
Contributor

rht commented Aug 16, 2024

class MyModel(Model):
    gini = Observable()

    # rest of model goes here

class MyAgent(Agent):
    wealth = Observable()

    # rest of agent goes here

Unless it is a requirement/constraint of the psygnal syntax, marking both as Observable can already be done in one place with the old data collector API ({"Gini": "gini"}). The advantage of the latter is that you can express your model completely separately from how it is going to be observed. Any observation code is localized to the datacollector object initialization, and data analysis/visualization lives in another isolated area. But one could also say that the latter is similar to Numba type annotation (isolated in a decorator @njit(int32(int32, int32))) instead of Mypy/PEP 484/Cython inline annotation.

With the existing data collector API, one could already move it outside of the model specification. It just needs the wrapper run_experiment that does data collection after each model step. This can be implemented at any time without a breaking change. So the remaining issue is how to specify which attributes are to be collected, the groups, and whether the API can accommodate psygnal under the hood.

@EwoutH
Member

EwoutH commented Aug 18, 2024

Thanks for this extensive and insightful discussion. I will try to wrap my mind further around the Observable() ideas.

One thing that's interesting is that with #2219 being merged, the AgentSet can now do practically anything we originally wanted to do in the DataCollector:

  • Select a subset of agents (target/filter): AgentSet.select()
  • Apply a function: AgentSet.do()
  • Retrieve one or multiple attributes: AgentSet.get()

Some simple aggregates are also already possible, with len(). A good standard way to aggregate agent properties could still be beneficial though.
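As an illustration of what an aggregate over agent properties could look like, here is a plain-Python sketch of the Gini coefficient applied to a list of attribute values, such as values retrieved with AgentSet.get(). The `gini` function is a hypothetical helper, not an existing Mesa function.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula on the sorted values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


equal = gini([1, 1, 1, 1])  # everyone holds the same wealth
unequal = gini([0, 0, 0, 4])  # one agent holds all the wealth
```

For n agents the maximum value is (n - 1) / n, so the second case yields 0.75 rather than 1.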

@EwoutH
Member

EwoutH commented Sep 23, 2024

@Corvince with the agenttype_reporters being added in #2300, and recognizing that we still want to make a future iteration on the DataCollector (#1944), can we close this issue as completed?

class MyModel(Model):
    def __init__(self):
        super().__init__()
        self.datacollector = DataCollector(
            agent_reporters={"life_span": "life_span"},
            # The new agenttype_reporters argument
            agenttype_reporters={
                Wolf: {"sheep_eaten": "sheep_eaten"},
                Sheep: {"wool": "wool_amount"},
                Animal: {"energy": "energy"}  # Collects from all animals
            }
        )

@Corvince
Contributor Author

Yes! Finally 🎉
