Fix scoring #292

ujh · 2016-05-13T17:48:54Z

Run the playouts for 10-30s for use in final_score and final_status_list. It's still WIP as some tests don't pass yet, and I also don't quite know if this is better than before. I'll have to run the 13x13 benchmark on master and this branch to see if this branch is better at predicting the score (i.e. closer to GnuGo).

Fixes #280, #291

iopq · 2016-05-13T17:54:37Z

That's not how other engines do it. They let you have an analyze command like reg_genmove or donplayouts 1000 and THEN you score.

ujh · 2016-05-13T18:35:25Z

I came up with that solution for the case where the bot is winning a game which means it's doing very few playouts for the last move. My assumption is that this is the reason for the bad scoring. But all this is only based on a feeling. I'm currently running 50 games on 13x13 for both the master branch and this one. Then we will know the following:

In how many cases do GnuGo and Iomrascálaí disagree on master, i.e. is there even a big problem in the first place?
Is this branch better than it was before?

coveralls · 2016-05-17T08:02:32Z

Coverage decreased (-1.6%) to 91.702% when pulling 16e5479 on fix-scoring into fd63ad4 on master.

coveralls · 2016-05-17T11:38:56Z

Coverage decreased (-1.2%) to 92.015% when pulling 0778a5b on fix-scoring into fd63ad4 on master.

ujh · 2016-05-21T15:26:20Z

I ran a few games on 13x13 with both master and this branch. For master it looks like this:

14.4% same score as GnuGo (35 of 243, ± 4.39 at 95%, ± 4.48 at 99%)

For this branch it's quite a bit better:

66.34% same score as GnuGo (136 of 205, ± 6.44 at 95%, ± 6.57 at 99%)

It's of course not perfect but I think the improvement is big enough that it makes sense to merge this.
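For reference, the percentages above follow directly from the counts. Below is a minimal sketch in Python, assuming a plain normal-approximation margin of error. The thread does not say how the ± values were computed, so the margins this prints are only in the same ballpark as the quoted ones and are not expected to match them exactly. The function names are made up for illustration.

```python
import math


def agreement_rate(same: int, total: int) -> float:
    """Percentage of games where both programs report the same score."""
    return 100.0 * same / total


def normal_margin(same: int, total: int, z: float = 1.96) -> float:
    """Margin of error in percentage points (normal approximation).

    Assumption: this is NOT necessarily the formula used for the
    figures quoted in the comment above.
    """
    p = same / total
    return 100.0 * z * math.sqrt(p * (1.0 - p) / total)


if __name__ == "__main__":
    for label, same, total in [("master", 35, 243), ("fix-scoring", 136, 205)]:
        rate = agreement_rate(same, total)
        margin = normal_margin(same, total)
        print(f"{label}: {rate:.2f}% same score ({same} of {total}, +/- {margin:.2f})")
```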

iopq · 2016-05-21T23:09:21Z

But I like to set my own time limits for search. How do I get the previous functionality back?

Also, are the games where there are score disagreements played to the end, or resigned by Iomrascalai?

ujh · 2016-05-22T08:28:54Z

I unfortunately can't check as I'm on holiday without a computer. But can't you also use search and pass in a custom stop closure?

ujh · 2016-05-22T08:30:05Z

Oh, and these numbers are for games that weren't resigned.

iopq · 2016-05-22T15:01:28Z

I'm using the executable, right now I just do a search for x seconds first (by changing the timer) and then ask for a score.

ujh · 2016-05-22T17:15:40Z

I see. We could introduce a custom GTP command that allows you to set the number of seconds it should search for final_score.

iopq · 2016-05-22T17:21:17Z

I've seen a command like donplayouts 10000 to do ten thousand playouts for this purpose. That way final_score just uses whatever playouts are already done instead of doing additional playouts. Of course this also applies to a command like uct_gfx, which shows the statistics for all the tree moves. I'm working on uct_gfx already and I don't feel like I should have to make a special version of the command to populate the tree. Instead I could just run donplayouts 1000 and then uct_gfx OR final_score depending on my needs.
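The difference between a fixed playout budget and a time budget can be sketched as two stop conditions for the same search loop. This is a hypothetical Python model, not the engine's actual API: run_search, stop_after_playouts and stop_after_seconds are invented names, and do_playout stands in for whatever a single playout does.

```python
import time
from typing import Callable

# A stop condition receives (playouts_done, seconds_elapsed).
StopCondition = Callable[[int, float], bool]


def run_search(do_playout: Callable[[], None], should_stop: StopCondition) -> int:
    """Run playouts until should_stop says to quit; return how many ran."""
    start = time.monotonic()
    done = 0
    while not should_stop(done, time.monotonic() - start):
        do_playout()
        done += 1
    return done


def stop_after_playouts(limit: int) -> StopCondition:
    """Fixed budget, like a donplayouts-style command."""
    return lambda done, elapsed: done >= limit


def stop_after_seconds(limit: float) -> StopCondition:
    """Time budget, like searching for a fixed number of seconds."""
    return lambda done, elapsed: elapsed >= limit


if __name__ == "__main__":
    playouts = []
    ran = run_search(lambda: playouts.append(1), stop_after_playouts(1000))
    print(ran)
```

With a playout-count condition the amount of data in the tree is the same on every machine, whereas a time-based condition depends on how fast the hardware is.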

ujh · 2016-05-22T17:33:12Z

The issue with that is that not enough playouts are done when the win rate goes above 80%. So we end up with almost no data at the end of the game. That's why I introduced this code to do some "extra" playouts to get a better prediction of which stones are dead.

iopq · 2016-05-22T17:34:16Z

So just run the donplayouts 10000 command first? Or, since we don't have one, running time_settings 60 0 0, then reg_genmove, then final_score should do the trick.

ujh · 2016-05-23T11:37:48Z

Let's discuss this once I'm back from my holiday. I think we just have totally different use cases.

ujh · 2016-05-31T13:19:19Z

Alright, so I'm back from holiday. My current use case is as follows: If Iomrascálaí is winning a game, it stops the search early during the last moves, so it doesn't run many playouts. If we then use these few playouts for scoring after the game is over, we get the score wrong quite often because dead stones aren't marked as dead. That's why I do this somewhat hackish thing of removing the "end of game marker" and running the search for 10-30 seconds. Then final_score and final_status_list produce better results. It may even be useful to not include pass moves in this search.

What exactly is your use case? To me it seems to be similar to the imrscl-ownership command which depends on the currently existing search tree. Is this what you're doing?

iopq · 2016-05-31T13:25:15Z

Yes, I do the exact same thing. My solution is to first run time_settings 0 0 10 and then genmove b (or reg_genmove b) to get the exact amount of search time I want. I don't see why search time has to be built into the command itself, since setting it separately lets you control the time more finely. Then I do imrscl-ownership and count the game and ALSO show the owned points on screen.

My ideal solution is to implement donplayouts 10000 and just do 10k playouts (or however many you need)
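The workflow described in this comment can be written down as a small GTP script. The sketch below only builds the text to feed to the engine's standard input; the command strings are the ones quoted above, while the helper names (scoring_script, to_gtp_input) are made up for illustration.

```python
from typing import List


def scoring_script(time_settings: str = "time_settings 0 0 10",
                   color: str = "b",
                   use_reg_genmove: bool = True) -> List[str]:
    """Commands to search first and then ask for ownership.

    reg_genmove searches without actually playing the move, so the
    position being scored stays unchanged.
    """
    search = "reg_genmove" if use_reg_genmove else "genmove"
    return [time_settings, f"{search} {color}", "imrscl-ownership"]


def to_gtp_input(commands: List[str]) -> str:
    """GTP commands are newline-terminated lines of text."""
    return "".join(command + "\n" for command in commands)


if __name__ == "__main__":
    print(to_gtp_input(scoring_script()), end="")
```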

ujh · 2016-05-31T13:40:36Z

OK, does imrscl-ownership trigger the search I've implemented for final_score? If so then that's of course a bug. Otherwise I don't quite see the connection between your command and my changes here as you don't seem to use final_score. If this of course does cause the search to kick in that would be bad.

BTW, I'm not against implementing donplayouts but I wonder what the connection to this PR here is.

iopq · 2016-05-31T13:42:27Z

it doesn't, but if someone wants to use final_score they should be able to set their own number of playouts

it makes no sense to merge this and then undo it after donplayouts is implemented

ujh · 2016-05-31T13:54:19Z

I see. For me the main reason final_score and final_status_list exist is to make the end of a game go smoothly. That requires us (the developers) to tune the parameters to achieve the best results. Personally I'd rather implement another similar command (and donplayouts) that can be used as an analyze command. To me they have rather different uses. One is tuned to ideally produce a correct score after the game is over. And the other is used to give insights into the current state of the search tree.

ujh · 2016-05-31T14:05:23Z

So would it be enough to separately implement donplayouts? We could of course also add a flag that circumvents the extra search if no play or genmove happened between donplayouts and final_score or final_status_list.

iopq · 2016-05-31T14:48:45Z

If donplayouts is implemented, why bother with mixing search and counting? Seems like those are separate concerns. If you really want a separate command it could just call out to donplayouts and final_score functions instead of adding all of this extra code to do the same thing with separate boolean flags.

ujh · 2016-05-31T18:31:18Z

Alright, I'm convinced. I've added the imrscl-donplayouts GTP command and I now run 10,000 playouts if necessary. I will have to check tomorrow if there are still bugs in there (esp. with when to automatically run the playouts and when not) but I thought that maybe you wanted to have a look already.

I'll have to see if 10,000 is enough or if I have to increase the number of course.

ujh · 2016-06-01T06:47:23Z

So, it all seems to work as expected. If you run imrscl-donplayouts then final_score and final_status_list will return immediately. After a genmove or kgs-genmove_cleanup it will then reset and do 10,000 playouts. I will have to measure the improvement again and maybe change the value to something higher than 10,000 of course, but first I wanted to check if this is what you had in mind.
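The behaviour described here can be modelled as a single flag. The following Python sketch is a toy model under stated assumptions and is not the code in this PR: the class and attribute names are invented, a "playout" is just a counter increment, and whether the flag is set again after the automatic playouts is an assumption.

```python
class ScoringState:
    """Toy model of when final_score needs to run extra playouts."""

    DEFAULT_PLAYOUTS = 10_000  # value mentioned in this thread; may be tuned

    def __init__(self) -> None:
        self.playouts_ready = False  # True once the tree holds scoring data
        self.total_playouts = 0

    def donplayouts(self, count: int) -> None:
        """Models imrscl-donplayouts: the user populates the tree explicitly."""
        self.total_playouts += count
        self.playouts_ready = True

    def genmove(self) -> None:
        """Models genmove / kgs-genmove_cleanup: the position changed."""
        self.playouts_ready = False

    def final_score(self) -> int:
        """Return how many extra playouts were run before scoring."""
        if self.playouts_ready:
            return 0  # returns immediately, using the existing data
        self.total_playouts += self.DEFAULT_PLAYOUTS
        self.playouts_ready = True  # assumption: a second call is also immediate
        return self.DEFAULT_PLAYOUTS


if __name__ == "__main__":
    state = ScoringState()
    state.donplayouts(500)
    print(state.final_score())  # no extra playouts needed
    state.genmove()
    print(state.final_score())  # the default number of playouts is run
```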


iopq · 2016-06-01T07:09:48Z

Yeah, this is what I had in mind. We can tweak the implementations later if we need to adjust how it works.

ujh · 2016-06-01T07:19:00Z

Cool. And thanks for the discussion! I think we ended up with a better solution. 👍

ujh added bug wip 3 - Review labels May 13, 2016

ujh added this to the 0.3.2 milestone May 13, 2016

ujh force-pushed the fix-scoring branch from b513bbb to cebf517 Compare May 20, 2016 13:08

ujh added 2 commits June 1, 2016 08:48

Remove ununsed import

9e46446

Test cases for when the program crashes after loadsgf

f106bd4

ujh added 16 commits June 1, 2016 08:48

Calculate ownership for scoring

5c410b7

Add tests for crashes when scoring after loading an SGF file

3324e76

More tests that crash the bot

74be7a4

Reset game over status to allow proper scoring

a69acee

Don't run tests in parallel

0d91aad

Some of the tests run on all cores. So don't run the test runner in parallel as well.

Output percentage of scores GnuGo and Iomrscálaí are agreeing on

cc0126f

Fix tests

76bba7a

Remove debug output

39abd10

Play more games on 13x13 and 19x19

117c7bf

Test for a bug in this scoring rewrite

b5e0c0f

A bug in the board/GTP setup that builds an empty tree and no search is happening and therefore a pass is generated as the first move.

Remove SGF file for removed test

f7a56fc

Setup tree before checking if it's empty

87300fb

If we do it the other way round we always pass as the tree appears to contain no children.

Test that scoring works on games without legal moves

a930009

Don't run tests in parallel on AppVeyor

6407d50

Add imrscl-donplayouts command

0174beb

ujh force-pushed the fix-scoring branch from 4a13a84 to 0174beb Compare June 1, 2016 06:49

Lower time limit for 9x9 benchmarks

360688f

ujh removed the wip label Jun 1, 2016

ujh merged commit 1493764 into master Jun 2, 2016

ujh deleted the fix-scoring branch June 2, 2016 14:49
