
[Data] Add read_clickhouse API to read ClickHouse Dataset #49060

Merged: 11 commits, Dec 12, 2024

Conversation

@jecsand838 (Contributor) commented Dec 4, 2024

Why are these changes needed?

Greetings from ElastiFlow!

This PR introduces a new ClickHouseDatasource connector for Ray, which provides a convenient way to read data from ClickHouse into Ray Datasets. The ClickHouseDatasource is particularly useful for users who are working with large datasets stored in ClickHouse and want to leverage Ray's distributed computing capabilities for AI and ML use-cases. We found this functionality useful while evaluating ML technologies and wanted to contribute this back.

Key Features and Benefits:

  1. Seamless Integration: The ClickHouseDatasource allows for seamless integration of ClickHouse data into Ray workflows, enabling users to easily access their data and apply Ray's powerful parallel computation.
  2. Custom Query Support: Users can specify custom columns and orderings, allowing for flexible query generation directly from the Ray interface, which helps read only the necessary data, thereby improving performance.
  3. User-Friendly API: The connector abstracts the complexity of setting up and querying ClickHouse, providing a simple API that allows users to focus on data analysis rather than data extraction.

Tested locally with a ClickHouse table containing ~12m records.
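To make the "custom query" idea above concrete, here is a minimal sketch of how a datasource might assemble its SELECT statement from user-supplied columns and orderings. The function and parameter names are illustrative, not the PR's actual API:

```python
from typing import List, Optional


def build_select_query(
    table: str,
    columns: Optional[List[str]] = None,
    order_by: Optional[List[str]] = None,
) -> str:
    """Assemble a ClickHouse SELECT from optional column and ordering lists."""
    cols = ", ".join(columns) if columns else "*"
    query = f"SELECT {cols} FROM {table}"
    if order_by:
        query += " ORDER BY " + ", ".join(order_by)
    return query


print(build_select_query("default.flows", columns=["ts", "bytes"], order_by=["ts"]))
# SELECT ts, bytes FROM default.flows ORDER BY ts
```

Pushing column projection and ordering into the generated SQL is what lets the datasource read only the necessary data from ClickHouse.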

[Screenshot of the local test run (2024-11-20) omitted]

PLEASE NOTE: This PR is a continuation of #48817, which was closed without merging.

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Connor Sanders <connor@elastiflow.com>
@jecsand838 (Author) commented:

@alexeykudinkin This is the fresh PR for the ClickHouse datasource.

The current state of the code here addresses your latest feedback:

Let's make all of these kwargs explicit and typed (adding them to the func signature)

Let's annotate it as @PublicAPI(stability="alpha") to make it clear this isn't a stable API yet

Can we please also add a test generating the full query (so that we're certain the e2e flow works as expected)

Also here's the stacked follow up PR that adds support for filtering: jecsand838#2

@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Dec 5, 2024
```python
offset += num_rows
return read_tasks

def _get_estimate(self, query_type: str) -> Optional[int]:
```
Reviewer:
nit: I'd suggest restructuring it like the following:

  1. Create 2 methods: _get_estimate_size, _get_estimate_count
  2. Make both of them use a common _execute_query method (accepting the target query)
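A minimal sketch of the suggested structure. The class, method, and template-constant names here are assumptions for illustration, not the PR's actual code, and the size-estimate SQL is illustrative only:

```python
from typing import Optional

# Hypothetical query-template constants; the real queries may differ.
_COUNT_ESTIMATE_QUERY = "SELECT count(*) FROM ({query})"
_SIZE_ESTIMATE_QUERY = "SELECT sum(byteSize(*)) FROM ({query})"  # illustrative


class EstimateHelper:
    def __init__(self, query: str):
        self._query = query

    def _execute_query(self, template: str) -> Optional[int]:
        # Single shared execution path: format the template, then run it.
        sql = template.format(query=self._query)
        return self._run_scalar(sql)

    def _run_scalar(self, sql: str):
        # Stand-in for the real ClickHouse client call.
        raise NotImplementedError

    def _get_estimate_count(self) -> Optional[int]:
        return self._execute_query(_COUNT_ESTIMATE_QUERY)

    def _get_estimate_size(self) -> Optional[int]:
        return self._execute_query(_SIZE_ESTIMATE_QUERY)
```

Both public estimate methods reduce to "format a template, execute it", so all client handling lives in one place.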

Author:

Made this change in my latest commit.

```python
import pyarrow as pa

client = self._init_client()
query = f"SELECT * FROM ({self._query}) LIMIT {num_rows} OFFSET {offset}"
```
Reviewer:

@jecsand838 does ClickHouse guarantee to always traverse the rows in the same order?

If that's not the case, we'd unfortunately have to revert to reading the whole dataset in one task (we were recently bitten by this in the SQL datasource).

Author (@jecsand838, Dec 5, 2024):

@alexeykudinkin You have to explicitly define an ORDER BY clause to guarantee order. Let me know if that's acceptable (I can update the docs and make order_by a required parameter) or if we need to revert still.

Author:

Actually, there may be a way to get the behavior we want in ClickHouse. I'll look into it and get back.

Reviewer:

So, what we can do is the following:

  1. If users want parallelism, we will require an order-by to be specified
  2. If they don't care / can't provide an order-by, the dataset will be read as one task

Does that make sense?
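That two-case rule can be sketched as a small helper (the function name and exact fallback behavior are assumptions, not the PR's actual implementation):

```python
from typing import List, Optional


def resolve_parallelism(
    requested_parallelism: int, order_by: Optional[List[str]]
) -> int:
    """Parallel reads paginate with OFFSET, which is only deterministic when an
    explicit ORDER BY fixes the row order; otherwise fall back to one task."""
    if order_by:
        return max(1, requested_parallelism)
    return 1


print(resolve_parallelism(8, ["ts"]))  # 8
print(resolve_parallelism(8, None))   # 1
```

This keeps the default safe (a single deterministic read) while letting users opt into parallelism by supplying an ordering.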

Author:

@alexeykudinkin That makes sense. I went ahead and attempted those changes. Let me know what you think.

Comment on lines 160 to 162:

```python
with mock.patch(
    "ray.data.block.BlockAccessor.for_block", return_value=mock_block_accessor
):
```
Reviewer:

The problem with this approach is that it mocks the static method for all invocations, not just the one you intend.

Instead, let's just introduce a method producing the estimates you need (size, number of rows) and mock that here.
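The scoping difference can be illustrated with `unittest.mock`: `mock.patch.object` on one instance affects only that object, whereas patching a shared static method like `BlockAccessor.for_block` would affect every call site. The class and method names below are hypothetical stand-ins:

```python
from unittest import mock


class FakeDatasource:
    def _get_estimate_count(self):
        # In real code this would query the database.
        raise RuntimeError("would hit the database")

    def num_rows(self):
        return self._get_estimate_count()


ds = FakeDatasource()
# Scope the mock to this one instance rather than a globally shared method.
with mock.patch.object(ds, "_get_estimate_count", return_value=1000):
    assert ds.num_rows() == 1000
```

Once the context manager exits, `ds._get_estimate_count` is restored, so other tests (and other instances) are unaffected.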

Author:

Made this change in my latest commit.

…ble ClickHouse datasource parallelism, improved clickhouse_test mock

Signed-off-by: Connor Sanders <connor@elastiflow.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
@jcotant1 jcotant1 added the data Ray Data-related issues label Dec 9, 2024
@alexeykudinkin (Reviewer) left a comment:

Thank you very much for your contribution @jecsand838!

```python
    return self._execute_query(self._estimates["count"])

def _get_estimate_size(self) -> Optional[int]:
    return self._execute_query(self._estimates["size"])
```
Reviewer:

nit: Let's just use constants to avoid indirection

Suggested change:

```diff
- return self._execute_query(self._estimates["size"])
+ return self._execute_query(_SIZE_ESTIMATE_QUERY)
```

Author:

@alexeykudinkin I actually went down somewhat of a rabbit hole with this one. I went ahead and implemented a query template system using constants, which should be more aligned with what you're wanting.

I also went ahead and pivoted from using LIMIT / OFFSET clauses to OFFSET / FETCH clauses in the queries. The impact on ClickHouse's CPU overhead was substantial.

To showcase this, here are two comparisons I ran locally using Docker (red is LIMIT / OFFSET, yellow is OFFSET / FETCH):

  • [Evaluation 1 graph omitted]
  • [Evaluation 2 graph omitted]

Overall, Dataset execution was also roughly 2% faster using OFFSET / FETCH.
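For reference, the two pagination forms being compared look roughly like this. The template strings are illustrative, not the PR's exact constants (ClickHouse supports the SQL-standard OFFSET ... FETCH clause):

```python
# Original pagination form:
LIMIT_OFFSET = "SELECT * FROM ({query}) LIMIT {limit} OFFSET {offset}"

# SQL-standard form the PR switched to:
OFFSET_FETCH = (
    "SELECT * FROM ({query}) OFFSET {offset} ROWS FETCH NEXT {limit} ROWS ONLY"
)

sql = OFFSET_FETCH.format(query="SELECT * FROM t ORDER BY ts", offset=100, limit=50)
print(sql)
# SELECT * FROM (SELECT * FROM t ORDER BY ts) OFFSET 100 ROWS FETCH NEXT 50 ROWS ONLY
```

Both forms return the same window of rows; the reported difference is in server-side CPU overhead, not in the result set.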

Here's a screenshot of one of the OFFSET / FETCH executions as evidence that the approach is working:

[Screenshot (2024-12-10) omitted]

Reviewer:

Wow, great analysis @jecsand838!

OFFSET is known to be an operation with non-trivial overhead on the DB, and this tweak LGTM.

```python
)
parallelism = 1
num_rows_per_block = num_rows_total // parallelism
num_blocks_with_extra_row = num_rows_total % parallelism
```
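The arithmetic above spreads `num_rows_total` across `parallelism` blocks, handing the remainder out as one extra row per leading block. A self-contained sketch of the idea (the helper name is hypothetical):

```python
def split_rows(num_rows_total: int, parallelism: int):
    """Yield (offset, num_rows) pairs covering the table in near-equal blocks."""
    num_rows_per_block = num_rows_total // parallelism
    num_blocks_with_extra_row = num_rows_total % parallelism
    offset = 0
    for i in range(parallelism):
        # The first `num_blocks_with_extra_row` blocks each take one extra row.
        num_rows = num_rows_per_block + (1 if i < num_blocks_with_extra_row else 0)
        yield offset, num_rows
        offset += num_rows


print(list(split_rows(10, 3)))  # [(0, 4), (4, 3), (7, 3)]
```

Each (offset, num_rows) pair then becomes one paginated read task, and the pairs tile the table exactly with no gaps or overlaps.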
Reviewer:

Please add a comment elaborating what you're doing here

Author:

I added more comments to the code, both there and in other places.

Three additional review threads on python/ray/data/tests/test_clickhouse.py were marked resolved.
jecsand838 and others added 6 commits December 10, 2024 17:23
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
…H clauses in ClickHouse datasource, and improved comments

Signed-off-by: Connor Sanders <connor@elastiflow.com>
@alexeykudinkin alexeykudinkin self-assigned this Dec 11, 2024
@bveeramani bveeramani merged commit bcc067f into ray-project:master Dec 12, 2024
5 checks passed
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Dec 12, 2024
(commit message duplicates the PR description above)
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Dec 17, 2024
(commit message duplicates the PR description above)

Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Labels
data Ray Data-related issues go add ONLY when ready to merge, run all tests
4 participants