
[Data] Add read_clickhouse API to read ClickHouse Dataset #49060

Merged: 11 commits, Dec 12, 2024

Conversation

@jecsand838 (Contributor) commented Dec 4, 2024

Why are these changes needed?

Greetings from ElastiFlow!

This PR introduces a new ClickHouseDatasource connector for Ray, which provides a convenient way to read data from ClickHouse into Ray Datasets. The ClickHouseDatasource is particularly useful for users who are working with large datasets stored in ClickHouse and want to leverage Ray's distributed computing capabilities for AI and ML use-cases. We found this functionality useful while evaluating ML technologies and wanted to contribute this back.

Key Features and Benefits:

  1. Seamless Integration: The ClickHouseDatasource allows for seamless integration of ClickHouse data into Ray workflows, enabling users to easily access their data and apply Ray's powerful parallel computation.
  2. Custom Query Support: Users can specify custom columns and orderings, allowing for flexible query generation directly from the Ray interface, which helps read only the necessary data, thereby improving performance.
  3. User-Friendly API: The connector abstracts the complexity of setting up and querying ClickHouse, providing a simple API that allows users to focus on data analysis rather than data extraction.

Tested locally with a ClickHouse table containing ~12m records.
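To make the "custom query" idea above concrete, here is a minimal sketch of how a datasource might assemble its SELECT statement from user-supplied columns and orderings. The function and parameter names are illustrative, not the PR's actual API:

```python
from typing import List, Optional


def build_select_query(
    table: str,
    columns: Optional[List[str]] = None,
    order_by: Optional[List[str]] = None,
) -> str:
    """Assemble a ClickHouse SELECT from optional column and ordering lists."""
    cols = ", ".join(columns) if columns else "*"
    query = f"SELECT {cols} FROM {table}"
    if order_by:
        query += " ORDER BY " + ", ".join(order_by)
    return query


print(build_select_query("default.flows", columns=["ts", "bytes"], order_by=["ts"]))
# SELECT ts, bytes FROM default.flows ORDER BY ts
```

Pushing column projection and ordering into the generated SQL is what lets the datasource read only the necessary data from ClickHouse.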

[Screenshot of the local test run (2024-11-20) omitted]

PLEASE NOTE: This PR is a continuation of #48817, which was closed without merging.

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Connor Sanders <connor@elastiflow.com>
@jecsand838 (Author) commented:

@alexeykudinkin This is the fresh PR for the ClickHouse datasource.

The current state of the code here addresses your latest feedback:

Let's make all of these kwargs explicit and typed (adding them to the func signature)

Let's annotate it as @PublicAPI(stability="alpha") to make it clear this isn't a stable API yet

Can we please also add a test generating the full query (so that we're certain the e2e flow works as expected)

Also here's the stacked follow up PR that adds support for filtering: jecsand838#2

@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Dec 5, 2024
```python
offset += num_rows
return read_tasks

def _get_estimate(self, query_type: str) -> Optional[int]:
```
Reviewer:
nit: I'd suggest restructuring it like the following:

  1. Create 2 methods: _get_estimate_size, _get_estimate_count
  2. Make both of them use a common _execute_query method (accepting the target query)
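A minimal sketch of the suggested structure. The class, method, and template-constant names here are assumptions for illustration, not the PR's actual code, and the size-estimate SQL is illustrative only:

```python
from typing import Optional

# Hypothetical query-template constants; the real queries may differ.
_COUNT_ESTIMATE_QUERY = "SELECT count(*) FROM ({query})"
_SIZE_ESTIMATE_QUERY = "SELECT sum(byteSize(*)) FROM ({query})"  # illustrative


class EstimateHelper:
    def __init__(self, query: str):
        self._query = query

    def _execute_query(self, template: str) -> Optional[int]:
        # Single shared execution path: format the template, then run it.
        sql = template.format(query=self._query)
        return self._run_scalar(sql)

    def _run_scalar(self, sql: str):
        # Stand-in for the real ClickHouse client call.
        raise NotImplementedError

    def _get_estimate_count(self) -> Optional[int]:
        return self._execute_query(_COUNT_ESTIMATE_QUERY)

    def _get_estimate_size(self) -> Optional[int]:
        return self._execute_query(_SIZE_ESTIMATE_QUERY)
```

Both public estimate methods reduce to "format a template, execute it", so all client handling lives in one place.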

Author:

Made this change in my latest commit.

```python
import pyarrow as pa

client = self._init_client()
query = f"SELECT * FROM ({self._query}) LIMIT {num_rows} OFFSET {offset}"
```
Reviewer:

@jecsand838 does ClickHouse guarantee to always traverse the rows in the same order?

If that's not the case, we'd unfortunately have to revert to reading the whole dataset in one task (we were recently bitten by this in the SQL datasource).

Author (@jecsand838, Dec 5, 2024):

@alexeykudinkin You have to explicitly define an ORDER BY clause to guarantee order. Let me know if that's acceptable (I can update the docs and make order_by a required parameter) or if we need to revert still.

Author:

Actually, there may be a way to get the behavior we want in ClickHouse. I'll look into it and get back.

Reviewer:

So, what we can do is the following:

  1. If users want parallelism, we will require an order-by to be specified
  2. If they don't care / can't provide an order-by, the dataset will be read as one task

Does that make sense?
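That two-case rule can be sketched as a small helper (the function name and exact fallback behavior are assumptions, not the PR's actual implementation):

```python
from typing import List, Optional


def resolve_parallelism(
    requested_parallelism: int, order_by: Optional[List[str]]
) -> int:
    """Parallel reads paginate with OFFSET, which is only deterministic when an
    explicit ORDER BY fixes the row order; otherwise fall back to one task."""
    if order_by:
        return max(1, requested_parallelism)
    return 1


print(resolve_parallelism(8, ["ts"]))  # 8
print(resolve_parallelism(8, None))   # 1
```

This keeps the default safe (a single deterministic read) while letting users opt into parallelism by supplying an ordering.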

Author:

@alexeykudinkin That makes sense. I went ahead and attempted those changes. Let me know what you think.

Comment on lines 160 to 162:

```python
with mock.patch(
    "ray.data.block.BlockAccessor.for_block", return_value=mock_block_accessor
):
```
Reviewer:

The problem with this approach is that it mocks the static method for all invocations, not just the one you intend.

Instead, let's just introduce a method producing the estimates you need (size, number of rows) and mock that here.
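The scoping difference can be illustrated with `unittest.mock`: `mock.patch.object` on one instance affects only that object, whereas patching a shared static method like `BlockAccessor.for_block` would affect every call site. The class and method names below are hypothetical stand-ins:

```python
from unittest import mock


class FakeDatasource:
    def _get_estimate_count(self):
        # In real code this would query the database.
        raise RuntimeError("would hit the database")

    def num_rows(self):
        return self._get_estimate_count()


ds = FakeDatasource()
# Scope the mock to this one instance rather than a globally shared method.
with mock.patch.object(ds, "_get_estimate_count", return_value=1000):
    assert ds.num_rows() == 1000
```

Once the context manager exits, `ds._get_estimate_count` is restored, so other tests (and other instances) are unaffected.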

Author:

Made this change in my latest commit.

…ble ClickHouse datasource parallelism, improved clickhouse_test mock

Signed-off-by: Connor Sanders <connor@elastiflow.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
@jcotant1 jcotant1 added the data Ray Data-related issues label Dec 9, 2024
@alexeykudinkin (Reviewer) left a comment:

Thank you very much for your contribution @jecsand838!

```python
    return self._execute_query(self._estimates["count"])

def _get_estimate_size(self) -> Optional[int]:
    return self._execute_query(self._estimates["size"])
```
Reviewer:

nit: Let's just use constants to avoid indirection

Suggested change:

```diff
- return self._execute_query(self._estimates["size"])
+ return self._execute_query(_SIZE_ESTIMATE_QUERY)
```

Author:

@alexeykudinkin I actually went down somewhat of a rabbit hole with this one. I went ahead and implemented a query template system using constants, which should be more aligned with what you're wanting.

I also went ahead and pivoted from using LIMIT / OFFSET clauses to OFFSET / FETCH clauses in the queries. The impact on ClickHouse's CPU overhead was substantial.

To showcase this, here are two comparisons I ran locally using Docker (red is LIMIT / OFFSET, yellow is OFFSET / FETCH):

  • [Evaluation 1 graph omitted]
  • [Evaluation 2 graph omitted]

Overall, Dataset execution was also roughly 2% faster using OFFSET / FETCH.
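For reference, the two pagination forms being compared look roughly like this. The template strings are illustrative, not the PR's exact constants (ClickHouse supports the SQL-standard OFFSET ... FETCH clause):

```python
# Original pagination form:
LIMIT_OFFSET = "SELECT * FROM ({query}) LIMIT {limit} OFFSET {offset}"

# SQL-standard form the PR switched to:
OFFSET_FETCH = (
    "SELECT * FROM ({query}) OFFSET {offset} ROWS FETCH NEXT {limit} ROWS ONLY"
)

sql = OFFSET_FETCH.format(query="SELECT * FROM t ORDER BY ts", offset=100, limit=50)
print(sql)
# SELECT * FROM (SELECT * FROM t ORDER BY ts) OFFSET 100 ROWS FETCH NEXT 50 ROWS ONLY
```

Both forms return the same window of rows; the reported difference is in server-side CPU overhead, not in the result set.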

Here's a screenshot of one of the OFFSET / FETCH executions as evidence that the approach is working:

[Screenshot (2024-12-10) omitted]

Reviewer:

Wow, great analysis @jecsand838!

OFFSET is known to be an operation with non-trivial overhead on the DB, and this tweak LGTM.

```python
)
parallelism = 1
num_rows_per_block = num_rows_total // parallelism
num_blocks_with_extra_row = num_rows_total % parallelism
```
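The arithmetic above spreads `num_rows_total` across `parallelism` blocks, handing the remainder out as one extra row per leading block. A self-contained sketch of the idea (the helper name is hypothetical):

```python
def split_rows(num_rows_total: int, parallelism: int):
    """Yield (offset, num_rows) pairs covering the table in near-equal blocks."""
    num_rows_per_block = num_rows_total // parallelism
    num_blocks_with_extra_row = num_rows_total % parallelism
    offset = 0
    for i in range(parallelism):
        # The first `num_blocks_with_extra_row` blocks each take one extra row.
        num_rows = num_rows_per_block + (1 if i < num_blocks_with_extra_row else 0)
        yield offset, num_rows
        offset += num_rows


print(list(split_rows(10, 3)))  # [(0, 4), (4, 3), (7, 3)]
```

Each (offset, num_rows) pair then becomes one paginated read task, and the pairs tile the table exactly with no gaps or overlaps.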
Reviewer:

Please add a comment elaborating what you're doing here

Author:

I added more comments to the code, both there and in other places.

Three additional review threads on python/ray/data/tests/test_clickhouse.py were marked resolved.
jecsand838 and others added 6 commits December 10, 2024 17:23
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
…H clauses in ClickHouse datasource, and improved comments

Signed-off-by: Connor Sanders <connor@elastiflow.com>
@alexeykudinkin alexeykudinkin self-assigned this Dec 11, 2024
@bveeramani bveeramani merged commit bcc067f into ray-project:master Dec 12, 2024
5 checks passed
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Dec 12, 2024
(commit message duplicates the PR description above)
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Dec 17, 2024
(commit message duplicates the PR description above)

Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Labels
data Ray Data-related issues go add ONLY when ready to merge, run all tests
4 participants