Fix bugs in global kmeans #13809

Merged: 5 commits, Feb 1, 2025

Conversation

MBkkt
Collaborator

@MBkkt MBkkt commented Jan 24, 2025

Changelog category

  • Not for changelog (changelog entry is not required)

Additional information

  • Fix a bug in the combination of the sample and reshuffle steps of global k-means (covered by kqp tests and new asserts)
  • Add a pragma option to control the recall/speed trade-off of a request (the same approach as pgvector and AlloyDB)
  • Improve sharding of the temporary vector index build tables (avoid empty shards and avoid too few shards)
  • Report more detailed vector index build progress
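
The recall/speed pragma in the list above is compared to pgvector, where the `ivfflat.probes` setting chooses how many clusters a query scans: more probes usually means higher recall and slower queries. Below is a minimal C++ sketch of that idea, assuming the pragma behaves similarly (the PR text does not spell this out). `Vec`, `SquaredL2`, and `NearestClusters` are hypothetical names for illustration, not YDB functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Vec = std::vector<float>;

// Squared Euclidean distance between two vectors of equal dimension.
float SquaredL2(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Returns the indices of the `probes` centroids closest to `query`, nearest first.
// A larger `probes` scans more clusters (better recall, more work per query).
std::vector<std::size_t> NearestClusters(const std::vector<Vec>& centroids,
                                         const Vec& query,
                                         std::size_t probes) {
    std::vector<std::size_t> order(centroids.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    probes = std::min(probes, order.size());
    const auto middle = order.begin() + static_cast<std::ptrdiff_t>(probes);
    std::partial_sort(order.begin(), middle, order.end(),
                      [&](std::size_t l, std::size_t r) {
                          return SquaredL2(centroids[l], query) <
                                 SquaredL2(centroids[r], query);
                      });
    order.resize(probes);
    return order;
}
```

With `probes = 1` only the single nearest cluster is searched; raising it trades speed for a lower chance of missing a true neighbor that landed in an adjacent cluster.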

@MBkkt MBkkt self-assigned this Jan 24, 2025

@MBkkt MBkkt force-pushed the mbkkt/global-kmeans-bugfix branch from 67b923b to 8a31f17 Compare January 27, 2025 08:22

github-actions bot commented Jan 29, 2025

2025-01-29 11:15:54 UTC Pre-commit check linux-x86_64-release-asan for 82b1fcf has started.
2025-01-29 11:16:22 UTC Artifacts will be uploaded here
2025-01-29 11:20:13 UTC ya make is running...
🟡 2025-01-29 12:23:17 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
11718 11654 0 19 12 33

2025-01-29 12:24:46 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-01-29 12:40:51 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
97 (only retried tests) 61 0 2 6 28

2025-01-29 12:41:09 UTC ya make is running... (failed tests rerun, try 3)
🟡 2025-01-29 12:53:30 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Test history | Ya make output | Test bloat | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
60 (only retried tests) 29 0 2 1 28

🟢 2025-01-29 12:53:39 UTC Build successful.
🟢 2025-01-29 12:54:13 UTC ydbd size 3.6 GiB changed* by +15.5 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 3d42b93 merge: 82b1fcf diff diff %
ydbd size 3 866 543 448 Bytes 3 866 559 296 Bytes +15.5 KiB +0.000%
ydbd stripped size 1 352 096 912 Bytes 1 352 101 936 Bytes +4.9 KiB +0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Jan 29, 2025

2025-01-29 11:16:50 UTC Pre-commit check linux-x86_64-relwithdebinfo for 82b1fcf has started.
2025-01-29 11:17:42 UTC Artifacts will be uploaded here
2025-01-29 11:21:01 UTC ya make is running...
🟡 2025-01-29 12:22:26 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
26047 23496 0 3 2419 129

2025-01-29 12:25:39 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-01-29 12:40:44 UTC Tests successful.

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
199 (only retried tests) 82 0 0 0 117

🟢 2025-01-29 12:40:53 UTC Build successful.
🟢 2025-01-29 12:41:17 UTC ydbd size 2.1 GiB changed* by +51.8 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 3d42b93 merge: 82b1fcf diff diff %
ydbd size 2 225 162 488 Bytes 2 225 215 512 Bytes +51.8 KiB +0.002%
ydbd stripped size 470 416 752 Bytes 470 433 200 Bytes +16.1 KiB +0.003%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@MBkkt MBkkt marked this pull request as ready for review January 31, 2025 09:22

github-actions bot commented Jan 31, 2025

2025-01-31 09:34:25 UTC Pre-commit check linux-x86_64-release-asan for 0649de0 has started.
2025-01-31 09:34:37 UTC Artifacts will be uploaded here
2025-01-31 09:37:36 UTC ya make is running...
🟡 2025-01-31 10:45:50 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
11737 11675 0 21 6 35

2025-01-31 10:46:58 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-01-31 10:59:53 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
98 (only retried tests) 63 0 3 3 29

2025-01-31 11:00:03 UTC ya make is running... (failed tests rerun, try 3)
🟡 2025-01-31 11:14:00 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Test history | Ya make output | Test bloat | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
62 (only retried tests) 28 0 1 2 31

🟢 2025-01-31 11:14:08 UTC Build successful.
🟢 2025-01-31 11:14:40 UTC ydbd size 3.6 GiB changed* by +22.5 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: dcb89a5 merge: 0649de0 diff diff %
ydbd size 3 870 408 176 Bytes 3 870 431 256 Bytes +22.5 KiB +0.001%
ydbd stripped size 1 353 716 976 Bytes 1 353 729 808 Bytes +12.5 KiB +0.001%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Jan 31, 2025

2025-01-31 09:38:58 UTC Pre-commit check linux-x86_64-relwithdebinfo for 0649de0 has started.
2025-01-31 09:39:09 UTC Artifacts will be uploaded here
2025-01-31 09:42:11 UTC ya make is running...
🟡 2025-01-31 10:44:06 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
26072 23522 0 3 2419 128

2025-01-31 10:46:32 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-01-31 10:59:45 UTC Tests successful.

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
199 (only retried tests) 77 0 0 0 122

🟢 2025-01-31 10:59:52 UTC Build successful.
🟢 2025-01-31 11:00:12 UTC ydbd size 2.1 GiB changed* by +51.3 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 59315fe merge: 0649de0 diff diff %
ydbd size 2 227 464 424 Bytes 2 227 516 944 Bytes +51.3 KiB +0.002%
ydbd stripped size 471 399 312 Bytes 471 415 568 Bytes +15.9 KiB +0.003%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

Comment on lines 571 to 572
const auto cluster = std::string_view{row}.substr(sizeof(ui16) + sizeof(ui32));
Y_DEBUG_ABORT_UNLESS(cluster == TSerializedCellVec{row}.GetCells().at(0).AsBuf());
Member

What are these hellish conversions?

Collaborator Author

Well, this is the correct way (but slower, since the string has to be parsed): in the kmeans sample stage the row arrives as a TSerializedCellVec with a single column inside.

It arrives in this form because sample is written generically, so that it can be used for something other than a vector index if needed.

The remaining stages already expect their input in the right format (just a string with the embedding).
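
The two lines under review can be read as a fast path (skip a fixed-size header) checked against a slow path (parse the row). Here is a minimal C++ sketch of that relationship, assuming a one-cell row is laid out as a 16-bit cell count, a 32-bit cell size, then the payload. That layout is inferred only from the `sizeof(ui16) + sizeof(ui32)` offset in the snippet; the helper names are hypothetical and this is not the real TSerializedCellVec code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

// Assumed header of a one-cell row: [u16 cell count][u32 cell size].
constexpr std::size_t kHeaderSize = sizeof(std::uint16_t) + sizeof(std::uint32_t);

// Hypothetical serializer: wraps one payload into the assumed layout.
std::string SerializeSingleCell(std::string_view payload) {
    const std::uint16_t count = 1;
    const auto size = static_cast<std::uint32_t>(payload.size());
    std::string row(kHeaderSize + payload.size(), '\0');
    std::memcpy(row.data(), &count, sizeof(count));
    std::memcpy(row.data() + sizeof(count), &size, sizeof(size));
    if (!payload.empty()) {
        std::memcpy(row.data() + kHeaderSize, payload.data(), payload.size());
    }
    return row;
}

// Slow path: read the header, validate it, and return the payload.
std::string_view ParseSingleCell(std::string_view row) {
    assert(row.size() >= kHeaderSize);
    std::uint16_t count = 0;
    std::uint32_t size = 0;
    std::memcpy(&count, row.data(), sizeof(count));
    std::memcpy(&size, row.data() + sizeof(count), sizeof(size));
    assert(count == 1);
    assert(row.size() == kHeaderSize + size);
    return row.substr(kHeaderSize, size);
}

// Fast path: trust the layout and skip the header without reading it.
std::string_view SkipHeader(std::string_view row) {
    assert(row.size() >= kHeaderSize);
    return row.substr(kHeaderSize);
}
```

In a debug build the fast path can be compared against the slow one, which is the role the `Y_DEBUG_ABORT_UNLESS` line plays in the snippet above. Both returned views point into `row`, so the row must outlive them.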

Member

Could this be moved into functions like Parse/Serialize, so that they sit next to each other and their symmetry/relationship is emphasized?

Collaborator Author

Well, Serialize is generic; roughly speaking, it depends on SampleRequest, whereas the deserialize here is specific to the vector index, which knows what message it sends.

Collaborator Author

I guess I just don't understand what your suggestion should look like.

@MBkkt MBkkt requested a review from a team as a code owner January 31, 2025 13:12

github-actions bot commented Jan 31, 2025

2025-01-31 13:13:38 UTC Pre-commit check linux-x86_64-release-asan for 219547d has started.
2025-01-31 13:13:50 UTC Artifacts will be uploaded here
2025-01-31 13:17:07 UTC ya make is running...
🟡 2025-01-31 14:47:28 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
13075 13002 0 25 15 33

2025-01-31 14:49:25 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-01-31 15:03:55 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
113 (only retried tests) 74 0 1 9 29

2025-01-31 15:04:09 UTC ya make is running... (failed tests rerun, try 3)
🟢 2025-01-31 15:16:04 UTC Tests successful.

Test history | Ya make output | Test bloat | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
62 (only retried tests) 30 0 0 5 27

🟢 2025-01-31 15:16:13 UTC Build successful.
🟢 2025-01-31 15:16:44 UTC ydbd size 3.6 GiB changed* by +14.8 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 5f5960b merge: 219547d diff diff %
ydbd size 3 870 957 544 Bytes 3 870 972 728 Bytes +14.8 KiB +0.000%
ydbd stripped size 1 353 825 264 Bytes 1 353 829 968 Bytes +4.6 KiB +0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Jan 31, 2025

2025-01-31 13:13:39 UTC Pre-commit check linux-x86_64-relwithdebinfo for 219547d has started.
2025-01-31 13:13:52 UTC Artifacts will be uploaded here
2025-01-31 13:17:12 UTC ya make is running...
🟡 2025-01-31 15:47:13 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
27315 24763 0 10 2416 126

2025-01-31 15:52:22 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-01-31 16:04:42 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
206 (only retried tests) 75 0 8 0 123

2025-01-31 16:04:57 UTC ya make is running... (failed tests rerun, try 3)
🔴 2025-01-31 16:14:29 UTC Some tests failed, follow the links below.

Test history | Ya make output | Test bloat | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
195 (only retried tests) 68 0 8 0 119

🟢 2025-01-31 16:14:38 UTC Build successful.
🟢 2025-01-31 16:15:03 UTC ydbd size 2.1 GiB changed* by +49.6 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 5f5960b merge: 219547d diff diff %
ydbd size 2 227 744 920 Bytes 2 227 795 744 Bytes +49.6 KiB +0.002%
ydbd stripped size 471 436 112 Bytes 471 451 920 Bytes +15.4 KiB +0.003%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

CyberROFL previously approved these changes Jan 31, 2025

github-actions bot commented Jan 31, 2025

2025-01-31 20:59:06 UTC Pre-commit check linux-x86_64-release-asan for 1735246 has started.
2025-01-31 20:59:18 UTC Artifacts will be uploaded here
2025-01-31 21:02:35 UTC ya make is running...
🟡 2025-01-31 22:33:37 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
13075 13010 0 23 8 34

2025-01-31 22:34:43 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-01-31 22:46:52 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
102 (only retried tests) 72 0 1 3 26

2025-01-31 22:47:00 UTC ya make is running... (failed tests rerun, try 3)
🟢 2025-01-31 22:58:52 UTC Tests successful.

Test history | Ya make output | Test bloat | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
52 (only retried tests) 24 0 0 4 24

🟢 2025-01-31 22:59:01 UTC Build successful.
🟢 2025-01-31 22:59:28 UTC ydbd size 3.6 GiB changed* by +22.5 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 4b004f1 merge: 1735246 diff diff %
ydbd size 3 871 027 584 Bytes 3 871 050 576 Bytes +22.5 KiB +0.001%
ydbd stripped size 1 353 839 984 Bytes 1 353 851 856 Bytes +11.6 KiB +0.001%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Jan 31, 2025

2025-01-31 20:59:10 UTC Pre-commit check linux-x86_64-relwithdebinfo for 1735246 has started.
2025-01-31 20:59:21 UTC Artifacts will be uploaded here
2025-01-31 21:02:29 UTC ya make is running...
🟡 2025-01-31 22:23:02 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
27316 24733 0 4 2446 133

2025-01-31 22:25:21 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-01-31 22:37:37 UTC Tests successful.

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
230 (only retried tests) 105 0 0 0 125

🟢 2025-01-31 22:37:44 UTC Build successful.
🟢 2025-01-31 22:38:03 UTC ydbd size 2.1 GiB changed* by +49.7 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 4b004f1 merge: 1735246 diff diff %
ydbd size 2 227 778 992 Bytes 2 227 829 896 Bytes +49.7 KiB +0.002%
ydbd stripped size 471 441 072 Bytes 471 456 944 Bytes +15.5 KiB +0.003%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@MBkkt MBkkt merged commit 5d71ae4 into ydb-platform:main Feb 1, 2025
12 checks passed
azevaykin pushed a commit to azevaykin/ydb that referenced this pull request Feb 3, 2025
lberserq pushed a commit to lberserq/ydb that referenced this pull request Feb 14, 2025