
Questions Evaluation #176

Closed
st3lzer opened this issue Jul 4, 2023 · 11 comments

@st3lzer

st3lzer commented Jul 4, 2023

Hello,
first of all, I want to thank you for your detailed code. I had the same problem as #175, but now I have some questions regarding the ensemble evaluation:

  1. How does the ensemble evaluation work conceptually, and why does it improve the performance (scores and duration of the evaluation)?
  2. How long does an evaluation on your system typically take, and what CPU do you use for training and evaluation?
  3. In #175 (question about validation dataset and evaluation using pretrained models) you stated that you evaluate the 6 models on the NEAT routes. How long does this evaluation typically take?

Thank you!

@Kait0
Collaborator

Kait0 commented Jul 4, 2023

  1. We have 3 models; we feed the sensor data to each of them and average the 3 predictions at the end (a minimal sketch of the averaging is below this list). Table 5 shows when ensembling helps. As for the why, ensembling is a general machine learning technique; there is good theory on it that you can look up online in a survey or a deep learning lecture.
  2. This highly depends on how many GPUs are available (we use 2080 Ti machines). We usually use as many as we can get on our compute cluster. For this repo, I think that if we get all 108 GPUs, a longest6 evaluation takes around 6 hours; with 32, maybe 12 hours or so.
    We have some more information on how to evaluate in our latest work.
  3. Again, a couple of hours if you parallelize all models and routes (which we did). We stopped doing the NEAT evaluations since we observed that they don't make much of a difference and cost a lot of compute (using epoch 31 works just fine).
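
For illustration, here is a minimal sketch of that kind of prediction averaging in PyTorch. The tiny placeholder network below is an assumption for brevity; the repository's actual model is LidarCenterNet, and the agent loads its three checkpoints in submission_agent.py.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the real driving model (an assumption for
# brevity; the repo's actual class is LidarCenterNet in team_code_transfuser/model.py).
class TinyDrivingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(16, 4)  # e.g. 4 predicted waypoint coordinates

    def forward(self, x):
        return self.head(x)

# Three ensemble members; in practice these would be three checkpoints
# trained with different seeds and loaded via load_state_dict.
models = [TinyDrivingNet().eval() for _ in range(3)]

@torch.no_grad()
def ensemble_predict(sensor_features):
    # Feed the same input to every member and average the predictions.
    preds = [m(sensor_features) for m in models]
    return torch.stack(preds, dim=0).mean(dim=0)

# Example: averaged prediction for one input sample.
out = ensemble_predict(torch.randn(1, 16))
```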

@st3lzer
Author

st3lzer commented Jul 4, 2023

Thanks for the answer! Unfortunately, another problem has occurred: in the JSON file, many routes get the result "Failed - Agent couldn't be set up", while others are "Completed". I never got this error when I evaluated only a single model at a time. I am running CARLA 0.9.10.1 in a Docker container with the additional maps installed, on an A100 GPU with 1 TB RAM and a 128-core CPU.

I started the Docker container with the following command:
docker run -d --name CARLA10_Server --privileged --gpus all --net=host -v /tmp/.X11-unix:/tmp/.X11-unix:rw carlasim/carla:0.9.10.1 /bin/bash -c 'SDL_VIDEODRIVER=offscreen ./CarlaUE4.sh --world-port=5442 -carla-rpc-port=5540 -opengl -RenderOffScreen -nosound'

Do you have any explanation for this?

@Kait0
Collaborator

Kait0 commented Jul 4, 2023

Well, you have to look at the Python error logs to see exactly why the code failed during agent setup.

Having some failures on larger clusters is normal, I think.
For (non-reproducible) cases where this happens rarely, we use a script to detect crashed routes and rerun them (a sketch of the idea follows).
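
Purely as an illustration (this is not the actual script from the repository), detecting crashed routes can amount to scanning the result JSON for records whose status reports a failure; the field names below are assumed from the leaderboard 1.0 output format, and the file path is a placeholder.

```python
import json

def find_crashed_routes(result_file):
    """Return the route ids whose status reports a failure, assuming a
    leaderboard-style JSON layout of _checkpoint -> records -> status."""
    with open(result_file) as f:
        data = json.load(f)

    crashed = []
    for record in data.get("_checkpoint", {}).get("records", []):
        status = str(record.get("status", ""))
        if status.startswith("Failed"):
            crashed.append(record.get("route_id"))
    return crashed

# Example usage (placeholder file name):
# print(find_crashed_routes("simulation_results.json"))
```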

@st3lzer
Author

st3lzer commented Jul 5, 2023

The cases are all non-reproducible, but it might be that the timeout of 60 seconds is too low for a busy cluster. I think I might try 240 seconds for --timeout. Here are some of the error logs showing why the agent setup failed:

Traceback (most recent call last):
File "/beegfs/work/stelzer/SyncTransfuser/leaderboard/leaderboard/leaderboard_evaluator_local.py", line 289, in _load_and_run_scenario
self.agent_instance = getattr(self.module_agent, agent_class_name)(args.agent_config)
File "/beegfs/work/stelzer/SyncTransfuser/leaderboard/leaderboard/autoagents/autonomous_agent.py", line 45, in __init__
self.setup(path_to_conf_file)
File "/beegfs/work/stelzer/SyncTransfuser/team_code_transfuser/submission_agent.py", line 95, in setup
state_dict = torch.load(os.path.join(path_to_conf_file, file), map_location='cuda:0')
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/torch/serialization.py", line 712, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/torch/serialization.py", line 1049, in _load
result = unpickler.load()
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/torch/serialization.py", line 1019, in persistent_load
load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/torch/serialization.py", line 997, in load_tensor
storage = zip_file.get_storage_from_record(name, numel, torch._UntypedStorage).storage()._untyped()
File "/beegfs/work/stelzer/SyncTransfuser/leaderboard/leaderboard/leaderboard_evaluator_local.py", line 131, in _signal_handler
raise RuntimeError("Timeout: Agent took too long to setup")
RuntimeError: Timeout: Agent took too long to setup

Traceback (most recent call last):
File "/beegfs/work/stelzer/SyncTransfuser/leaderboard/leaderboard/leaderboard_evaluator_local.py", line 289, in _load_and_run_scenario
self.agent_instance = getattr(self.module_agent, agent_class_name)(args.agent_config)
File "/beegfs/work/stelzer/SyncTransfuser/leaderboard/leaderboard/autoagents/autonomous_agent.py", line 45, in __init__
self.setup(path_to_conf_file)
File "/beegfs/work/stelzer/SyncTransfuser/team_code_transfuser/submission_agent.py", line 97, in setup
net.load_state_dict(state_dict, strict=False)
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1605, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for LidarCenterNet:
While copying the parameter named "_model.image_encoder.features.s3.b9.conv1.conv.weight", whose dimensions in the model are torch.Size([576, 576, 1, 1]) and whose dimensions in the checkpoint are torch.Size([576, 576, 1, 1]), an exception occurred : ('Timeout: Agent took too long to setup',).

Traceback (most recent call last):
File "/beegfs/work/stelzer/SyncTransfuser/leaderboard/leaderboard/leaderboard_evaluator_local.py", line 289, in _load_and_run_scenario
self.agent_instance = getattr(self.module_agent, agent_class_name)(args.agent_config)
File "/beegfs/work/stelzer/SyncTransfuser/leaderboard/leaderboard/autoagents/autonomous_agent.py", line 45, in __init__
self.setup(path_to_conf_file)
File "/beegfs/work/stelzer/SyncTransfuser/team_code_transfuser/submission_agent.py", line 92, in setup
net = LidarCenterNet(self.config, 'cuda', self.backbone, image_architecture, lidar_architecture, use_velocity)
File "/beegfs/work/stelzer/SyncTransfuser/team_code_transfuser/model.py", line 564, in __init__
self._model = TransfuserBackbone(config, image_architecture, lidar_architecture, use_velocity=use_velocity).to(self.device)
File "/beegfs/work/stelzer/SyncTransfuser/team_code_transfuser/transfuser.py", line 32, in __init__
self.lidar_encoder = LidarEncoder(architecture=lidar_architecture, in_channels=in_channels)
File "/beegfs/work/stelzer/SyncTransfuser/team_code_transfuser/transfuser.py", line 440, in __init__
self._model = timm.create_model(architecture, pretrained=False)
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/timm/models/factory.py", line 74, in create_model
model = create_fn(pretrained=pretrained, **kwargs)
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/timm/models/regnet.py", line 458, in regnety_032
return _create_regnet('regnety_032', pretrained, **kwargs)
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/timm/models/regnet.py", line 350, in _create_regnet
**kwargs)
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/timm/models/helpers.py", line 453, in build_model_with_cfg
model = model_cls(**kwargs) if model_cfg is None else model_cls(cfg=model_cfg, **kwargs)
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/timm/models/regnet.py", line 260, in __init__
self.add_module(stage_name, RegStage(prev_width, **stage_args, se_ratio=se_ratio))
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/timm/models/regnet.py", line 224, in __init__
downsample=proj_block, drop_block=drop_block, drop_path=drop_path, **block_kwargs)
File "/home/stelzer/work/anaconda3/envs/tfuse/lib/python3.7/site-packages/timm/models/regnet.py", line 153, in __init__
self.conv3 = ConvBnAct(bottleneck_chs, out_chs, kernel_size=1, **cargs)

@Kait0
Collaborator

Kait0 commented Jul 5, 2023

We typically use --timeout 600 to avoid these problems.

Kait0 closed this as completed Jul 7, 2023
@st3lzer
Author

st3lzer commented Jul 8, 2023

I have one final question regarding this topic: your advice worked (thank you!), and the evaluation of your provided models produced the following results:
"values": [
"46.932",
"77.395",
"0.626",
"0.015",
"0.603",
"0.010",
"0.029",
"0.143",
"0.000",
"0.000",
"0.014",
"0.358"
]
I wonder whether the rather low RC compared to your published results is normal, or if there might be something wrong with my CARLA server. The values for IS and "Agent blocked" are also quite high compared to yours, which is obviously connected to the low RC. Interestingly, the expert also shows a high value for "Agent blocked" when I evaluate it:
"values": [
"77.278",
"88.344",
"0.889",
"0.015",
"0.043",
"0.000",
"0.000",
"0.082",
"0.000",
"0.000",
"0.018",
"0.313"
]
Do you think this is normal?

@Kait0
Collaborator

Kait0 commented Jul 8, 2023

Hm, two things come to mind that might be the problem.

Your numbers look a bit like they were not parsed with the result_parser.
The auxiliary metrics (everything except DS, RC, and IS) from the leaderboard are known to be incorrect (this was only fixed for leaderboard 2.0). If you use the result parser, it will recompute the numbers with the correct math (a sketch of the idea is below).
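
As an illustration only (not the repository's actual result_parser code), the recomputation boils down to normalizing infraction counts by the total distance actually driven, rather than averaging per-route rates; the function and argument names here are made up for the example.

```python
def infractions_per_km(infractions_per_route, km_driven_per_route):
    # Aggregate totals first, then normalize once; averaging per-route rates
    # would weight short routes as heavily as long ones.
    total_infractions = sum(infractions_per_route)
    total_km = sum(km_driven_per_route)
    return total_infractions / total_km if total_km > 0 else 0.0

# Example: vehicle-collision counts and driven distance (km) for three routes.
print(infractions_per_km([2, 0, 1], [1.5, 0.8, 2.2]))  # -> 0.666...
```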

As for the blocked metric: do you use the leaderboard client from this repository for evaluation, or a different one?
The blocked metric was changed at the end of 2020; we use the newer version (I wrote a bit on that here).

For the expert, the DS, RC, and IS look the same. If the TransFuser results are from a retrained model, then I think it's possible that you happened to end up with a model that is more passive and gets blocked instead of pushing other cars out of the way (which gets more RC and more collisions, so a lower IS).

@st3lzer
Author

st3lzer commented Jul 8, 2023

The result parser gives the following results:
Avg. driving score,46.932
Avg. route completion,77.395
Avg. infraction penalty,0.626
Collisions with pedestrians,0.04797903002993056
Collisions with vehicles,1.7992136261223959
Collisions with layout,0.02398951501496528
Red lights infractions,0.09595806005986111
Stop sign infractions,0.43181127026937505
Off-road infractions,0.0
Route deviations,0.0
Route timeouts,0.04797903002993056
Agent blocked,0.3598427252244792

I use this repository, except that I run the evaluation in Docker with a 0.9.10.1 CARLA image to which I added the additional maps. I use your "leaderboard_evaluator_local.py".

> For the expert, the DS, RC, and IS look the same. If the TransFuser results are from a retrained model, then I think it's possible that you happened to end up with a model that is more passive and gets blocked instead of pushing other cars out of the way (which gets more RC and more collisions, so a lower IS).

This makes sense, but the results are from the three models that you provided. Shouldn't the evaluation result then be more similar to yours?

@Kait0
Collaborator

Kait0 commented Jul 8, 2023

The numbers look reasonable now with the result parser, I think.
CARLA leaderboard 1.0 evaluations are quite random because the traffic and physics are random.
In your evaluation you got fewer vehicle collisions than in ours, but a higher agent-blocked value (which usually means being blocked by a vehicle).
Depending on whether TF happens to push a car out of the way (vehicle infraction) or gets blocked, you can end up with different aux metrics but a similar driving score.
I think in Coaching a Teachable Student they also reran it and got 46 DS, 84 RC, and 0.57 IS, which is something in between the results you got and the results we got.

You can check that your simulator runs with the -opengl option, but I think there is likely no problem with your setup anymore.

@st3lzer
Author

st3lzer commented Jul 8, 2023

I started CARLA in the Docker container with the -opengl option (see the third comment in this issue). Your explanation makes sense; thank you very much for the quick responses!

@dd-xx-dot

Have your evaluation results improved? To achieve a relatively good score, is it necessary to use the model from the 31st epoch and to ensemble three models together for the evaluation?
