Fix YQL knn udf docs #8425

Merged
merged 10 commits into from
Sep 7, 2024
@@ -1,14 +1,14 @@
* [DateTime](../../datetime.md)
* [Digest](../../digest.md)
* [Histogram](../../histogram.md)
* [Hyperscan](../../hyperscan.md)
* [Ip](../../ip.md)
* [Knn](../../knn.md)
* [Math](../../math.md)
* [Pcre](../../pcre.md)
* [Pire](../../pire.md)
* [Re2](../../re2.md)
* [String](../../string.md)
* [Unicode](../../unicode.md)
* [DateTime](../../datetime.md)
* [Url](../../url.md)
* [Ip](../../ip.md)
* [Knn](../../knn.md)
* [Yson](../../yson.md)
* [Digest](../../digest.md)
* [Math](../../math.md)
* [Histogram](../../histogram.md)
* [Yson](../../yson.md)
79 changes: 77 additions & 2 deletions ydb/docs/en/core/yql/reference/yql-core/udf/list/knn.md
@@ -31,7 +31,7 @@ Approximate methods do not perform a complete search of the source data. Due to
This document provides an [example of approximate search](#approximate-search-examples) using scalar quantization. This example does not require the creation of a secondary vector index.

**Scalar quantization** is a method to compress vectors by mapping coordinates to a smaller space.
{{ ydb-short-name }} support exact search for `Float`, `Int8`, `Uint8`, `Bit` vectors.
This module supports exact search for `Float`, `Int8`, `Uint8`, `Bit` vectors.
So, it's possible to apply scalar quantization from `Float` to one of these other types.

Scalar quantization decreases read/write times by reducing vector size in bytes. For example, after quantization from `Float` to `Bit`, each vector becomes 32 times smaller.
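
The 32x figure follows from the coordinate widths: a `Float` coordinate takes 32 bits and a `Bit` coordinate takes one. A minimal Python sketch of that arithmetic, assuming a hypothetical layout (little-endian 32-bit floats, and one sign bit per coordinate). The helper names are illustrative, and this is not the actual `Knn` serialization format, which may carry extra metadata:

```python
import struct


def float_payload(vector):
    """Pack coordinates as little-endian 32-bit floats (4 bytes each)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def bit_payload(vector):
    """Keep one bit per coordinate: 1 if the coordinate is positive, else 0."""
    out = bytearray((len(vector) + 7) // 8)
    for i, x in enumerate(vector):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


vector = [0.5, -1.25, 3.0, -0.1] * 256  # 1024 coordinates
float_size = len(float_payload(vector))  # 1024 * 4 bytes
bit_size = len(bit_payload(vector))      # 1024 / 8 bytes
ratio = float_size // bit_size
print(float_size, bit_size, ratio)
```

The ratio only describes the coordinate payload. Whether the smaller vectors keep enough accuracy is a separate question that has to be measured on real data.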
@@ -45,7 +45,7 @@ It is recommended to measure if such quantization provides sufficient accuracy/r
## Data types

In mathematics, a vector of real or integer numbers is used to store points.
In {{ ydb-short-name }}, vectors are stored in the `String` data type, which is a binary serialized representation of a vector.
In this module, vectors are stored in the `String` data type, which is a binary serialized representation of a vector.

## Functions

@@ -57,7 +57,9 @@ Conversion functions are needed to serialize vectors into an internal binary rep

All serialization functions wrap returned `String` data into [Tagged](../../types/special.md) types.

{% if backend_name == "YDB" %}
The binary representation of a vector can be stored in a {{ ydb-short-name }} table column. Currently, {{ ydb-short-name }} does not support storing `Tagged` types, so before storing the binary representation of a vector you must call [Untag](../../builtins/basic#as-tagged).
{% endif %}

#### Function signatures

@@ -123,6 +125,7 @@ Error: Failed to find UDF function: Knn.CosineDistance, reason: Error: Module: K

## Еxact search examples

{% if backend_name == "YDB" %}
### Creating a table

```sql
@@ -142,9 +145,25 @@ $vector = [1.f, 2.f, 3.f, 4.f];
UPSERT INTO Facts (id, user, fact, embedding)
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"));
```
{% else %}
### Data declaration

```sql
$vector = [1.f, 2.f, 3.f, 4.f];
$facts = AsList(
AsStruct(
123 AS id, -- Id of fact
"Williams" AS user, -- User name
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
),
);
```
{% endif %}

### Exact search of K nearest vectors

{% if backend_name == "YDB" %}
```sql
$K = 10;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
@@ -154,23 +173,45 @@ WHERE user="Williams"
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
LIMIT $K;
```
{% else %}
```sql
$K = 10;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM AS_TABLE($facts)
WHERE user="Williams"
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
LIMIT $K;
```
{% endif %}
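
For intuition, the query above amounts to a brute-force scan: compute the distance from every stored vector to the target, sort, and keep the first K. A rough Python sketch, assuming cosine distance is defined as one minus cosine similarity and working on plain lists of floats rather than the serialized `String` form. `cosine_distance` and `exact_knn` are hypothetical helper names, not part of the `Knn` module:

```python
import math


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_knn(rows, target, k):
    """Scan every row, order by distance to the target, keep the k closest."""
    ranked = sorted(rows, key=lambda row: cosine_distance(row["embedding"], target))
    return ranked[:k]


# Illustrative data only; the ids and vectors are made up for this sketch.
rows = [
    {"id": 1, "embedding": [1.0, 0.0]},
    {"id": 2, "embedding": [0.0, 1.0]},
    {"id": 3, "embedding": [1.0, 1.0]},
]
nearest_ids = [row["id"] for row in exact_knn(rows, [1.0, 0.1], k=2)]
print(nearest_ids)
```

Every row is visited, so the cost grows linearly with the amount of data, which is what motivates the approximate approach.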

### Exact search of vectors in radius R

{% if backend_name == "YDB" %}
```sql
$R = 0.1f;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM Facts
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
```
{% else %}
```sql
$R = 0.1f;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM AS_TABLE($facts)
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
```
{% endif %}

## Approximate search examples

This example differs from the [exact search example](#еxact-search-examples) by using bit quantization.

This makes it possible to first run an approximate preliminary search on the `embedding_bit` column, and then refine the results using the original vector column `embedding`.

{% if backend_name == "YDB" %}
### Creating a table

```sql
@@ -191,6 +232,22 @@ $vector = [1.f, 2.f, 3.f, 4.f];
UPSERT INTO Facts (id, user, fact, embedding, embedding_bit)
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"), Untag(Knn::ToBinaryStringBit($vector), "BitVector"));
```
{% else %}
### Data declaration

```sql
$vector = [1.f, 2.f, 3.f, 4.f];
$facts = AsList(
AsStruct(
123 AS id, -- Id of fact
"Williams" AS user, -- User name
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
Knn::ToBinaryStringBit($vector) AS embedding_bit, -- Bit-quantized binary representation of embedding vector
),
);
```
{% endif %}

### Scalar quantization

@@ -219,6 +276,7 @@ Approximate search algorithm:
* an approximate list of vectors is obtained;
* we search this list without using quantization.

{% if backend_name == "YDB" %}
```sql
$K = 10;
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
@@ -234,3 +292,20 @@ WHERE id IN $Ids
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
LIMIT $K;
```
{% else %}
```sql
$K = 10;
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
$TargetEmbeddingBit = Knn::ToBinaryStringBit($Target);
$TargetEmbeddingFloat = Knn::ToBinaryStringFloat($Target);

$Ids = SELECT id FROM AS_TABLE($facts)
ORDER BY Knn::CosineDistance(embedding_bit, $TargetEmbeddingBit)
LIMIT $K * 10;

SELECT * FROM AS_TABLE($facts)
WHERE id IN $Ids
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
LIMIT $K;
```
{% endif %}
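
The two-stage query can be mimicked outside the database to see why oversampling by a factor of 10 helps. A hedged Python sketch under stated assumptions: bit quantization keeps only the sign of each coordinate, the coarse stage ranks candidates by Hamming distance between bit vectors (a stand-in for whatever distance the module computes on `Bit` vectors), and the refinement uses cosine distance defined as one minus cosine similarity. All function names are hypothetical:

```python
import math


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def to_bits(vector):
    """Assumed bit quantization: 1 for a positive coordinate, else 0."""
    return [1 if x > 0 else 0 for x in vector]


def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_knn(rows, target, k, oversample=10):
    target_bits = to_bits(target)
    # Stage 1: cheap ranking on quantized vectors, keeping k * oversample rows.
    coarse = sorted(rows, key=lambda r: hamming(r["embedding_bit"], target_bits))
    candidates = coarse[: k * oversample]
    # Stage 2: exact ranking of the shortlisted rows on the original vectors.
    refined = sorted(candidates, key=lambda r: cosine_distance(r["embedding"], target))
    return refined[:k]


# Illustrative data only.
vectors = {
    1: [1.0, 2.0, -3.0, 4.0],
    2: [1.2, 2.3, 3.4, 4.5],
    3: [-1.0, -2.0, -3.0, -4.0],
    4: [1.0, 2.0, 3.0, 4.0],
}
rows = [{"id": i, "embedding": v, "embedding_bit": to_bits(v)} for i, v in vectors.items()]
result = [r["id"] for r in approximate_knn(rows, [1.2, 2.3, 3.4, 4.5], k=2)]
print(result)
```

The coarse stage can misrank near neighbours, so fetching more candidates than K gives the exact stage room to correct the order. The right oversampling factor depends on the data and should be measured.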
12 changes: 6 additions & 6 deletions ydb/docs/en/core/yql/reference/yql-core/udf/list/toc_base.yaml
@@ -1,17 +1,17 @@
items:
- name: Overview
href: index.md
- { name: DateTime, href: datetime.md }
- { name: Digest, href: digest.md }
- { name: Histogram, href: histogram.md }
- { name: Hyperscan, href: hyperscan.md }
- { name: Ip, href: ip.md }
- { name: Knn, href: knn.md }
- { name: Math, href: math.md }
- { name: Pcre, href: pcre.md }
- { name: Pire, href: pire.md }
- { name: Re2, href: re2.md }
- { name: String, href: string.md }
- { name: Unicode, href: unicode.md }
- { name: DateTime, href: datetime.md }
- { name: Url, href: url.md }
- { name: Ip, href: ip.md }
- { name: Knn, href: knn.md }
- { name: Yson, href: yson.md }
- { name: Digest, href: digest.md }
- { name: Math, href: math.md }
- { name: Histogram, href: histogram.md }
79 changes: 77 additions & 2 deletions ydb/docs/ru/core/yql/reference/yql-core/udf/list/knn.md
@@ -31,7 +31,7 @@ LIMIT 10;
This document provides an [example of approximate search](#примеры-приближенного-поиска) using scalar quantization, which does not require building a secondary vector index.

**Scalar quantization** is a vector compression method in which the set of coordinates is mapped to a smaller space.
{{ ydb-short-name }} supports exact search over `Float`, `Int8`, `Uint8`, and `Bit` vectors.
This module supports exact search over `Float`, `Int8`, `Uint8`, and `Bit` vectors.
Accordingly, scalar quantization from `Float` to one of these types is possible.

Scalar quantization reduces the time required for reads and writes, because the number of bytes shrinks severalfold.
@@ -46,7 +46,7 @@ LIMIT 10;
## Data types

In mathematics, a vector of real or integer numbers is used to store points.
In {{ ydb-short-name }}, vectors are stored in the `String` data type, which is a binary serialized representation of a vector.
In this module, vectors are represented by the `String` data type, which is a binary serialized representation of a vector.

## Functions

@@ -58,8 +58,10 @@ LIMIT 10;

All serialization functions wrap the returned `String` data in a [Tagged](../../types/special.md) type.

{% if backend_name == "YDB" %}
The binary representation of a vector can be stored in a {{ ydb-short-name }} column.
Currently, {{ ydb-short-name }} does not support storing `Tagged` types, so before storing the binary representation of a vector you must extract the `String` using the [Untag](../../builtins/basic#as-tagged) function.
{% endif %}

#### Function signatures

@@ -125,6 +127,7 @@ Error: Failed to find UDF function: Knn.CosineDistance, reason: Error: Module: K

## Exact search examples

{% if backend_name == "YDB" %}
### Creating a table

```sql
@@ -144,9 +147,25 @@ $vector = [1.f, 2.f, 3.f, 4.f];
UPSERT INTO Facts (id, user, fact, embedding)
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"));
```
{% else %}
### Data declaration

```sql
$vector = [1.f, 2.f, 3.f, 4.f];
$facts = AsList(
AsStruct(
123 AS id, -- Id of fact
"Williams" AS user, -- User name
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
),
);
```
{% endif %}

### Exact search of K nearest vectors

{% if backend_name == "YDB" %}
```sql
$K = 10;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
@@ -156,22 +175,44 @@ WHERE user="Williams"
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
LIMIT $K;
```
{% else %}
```sql
$K = 10;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM AS_TABLE($facts)
WHERE user="Williams"
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
LIMIT $K;
```
{% endif %}

### Exact search of vectors in radius R

{% if backend_name == "YDB" %}
```sql
$R = 0.1f;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM Facts
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
```
{% else %}
```sql
$R = 0.1f;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM AS_TABLE($facts)
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
```
{% endif %}

## Approximate search examples

This example differs from the [exact search example](#примеры-точного-поиска) by using bit quantization.
This makes it possible to first run a coarse preliminary search on the `embedding_bit` column, and then refine the results using the main vector column `embedding`.

{% if backend_name == "YDB" %}
### Creating a table

```sql
@@ -192,6 +233,22 @@ $vector = [1.f, 2.f, 3.f, 4.f];
UPSERT INTO Facts (id, user, fact, embedding, embedding_bit)
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"), Untag(Knn::ToBinaryStringBit($vector), "BitVector"));
```
{% else %}
### Data declaration

```sql
$vector = [1.f, 2.f, 3.f, 4.f];
$facts = AsList(
AsStruct(
123 AS id, -- Id of fact
"Williams" AS user, -- User name
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
Knn::ToBinaryStringBit($vector) AS embedding_bit, -- Bit-quantized binary representation of embedding vector
),
);
```
{% endif %}

### Scalar quantization

@@ -220,6 +277,7 @@ SELECT ListMap($FloatList, $MapInt8);
* an approximate list of vectors is obtained;
* this list is then searched without quantization.

{% if backend_name == "YDB" %}
```sql
$K = 10;
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
@@ -235,3 +293,20 @@ WHERE id IN $Ids
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
LIMIT $K;
```
{% else %}
```sql
$K = 10;
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
$TargetEmbeddingBit = Knn::ToBinaryStringBit($Target);
$TargetEmbeddingFloat = Knn::ToBinaryStringFloat($Target);

$Ids = SELECT id FROM AS_TABLE($facts)
ORDER BY Knn::CosineDistance(embedding_bit, $TargetEmbeddingBit)
LIMIT $K * 10;

SELECT * FROM AS_TABLE($facts)
WHERE id IN $Ids
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
LIMIT $K;
```
{% endif %}