Replies: 3 comments 4 replies
-
That's an excellent suggestion! I believe we were planning on setting a seed for torch and numpy, but I'll tag @alexey-gruzdev @psfoley @msheller for clarification.
-
Dear Organizers,

You may consider adding these lines to the internal code. I think it would also be good for comparing team results at the end of the challenge, but I am not sure they are enough :)

torch.manual_seed(torch_manual_seed)
torch.cuda.manual_seed_all(torch_manual_seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

(Maybe random_state can also be fixed for train_test_split.)

Best,
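As an aside on the parenthetical suggestion above, here is a minimal sketch of what fixing random_state for train_test_split could look like. It uses made-up dummy data and an arbitrary seed of 42; it is not the challenge's actual split code.

```python
# Illustrative sketch only: fixing random_state so the train/validation
# split is identical across runs. The data is synthetic and the seed
# value 42 is an arbitrary placeholder.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)              # dummy feature matrix
y = np.random.randint(0, 2, size=100)   # dummy binary labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```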
-
Dear organizers,
While comparing different methods for the challenge, we realized that the same conditions result in different scores. For example, even in the first round, the training aggregation metrics of the first collaborator differ significantly between runs.
Do you plan to use seeds to simulate the same experimental conditions for different submissions (for the train-validation data split, weight initialization, etc.)? Or do you plan to run the experiments many times and take the average? Do you have any suggestions for us for a fair comparison?
By the way, we already fixed the np.random and random seeds, but it seems they are not enough. Do you also fix the seeds for the PyTorch and CUDA code internally?
In summary, what is the best way to compare different methods objectively? If it can be handled just by fixing seeds, that would be great given the runtimes.
Thank you,
Ece
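One possible way to combine the seeds discussed in this thread into a single helper is sketched below. This is only an assumption about how the pieces fit together, not the organizers' internal code; the helper name seed_everything and the seed value 42 are placeholders, and GPU runs may still show some nondeterminism beyond these settings.

```python
# Sketch of a combined seeding helper based on the suggestions in this thread.
# Not the challenge's internal code; seed_everything and the value 42 are
# placeholders chosen for illustration.
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    random.seed(seed)                          # Python's built-in RNG
    np.random.seed(seed)                       # numpy RNG
    torch.manual_seed(seed)                    # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)           # all visible CUDA devices
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable autotuning, which can vary across runs


seed_everything(42)
```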