
ingestr: CLI tool to copy data between any databases #86

Open
michabbb opened this issue Feb 28, 2024 · 12 comments
Labels
poke Something that has been poked into, but either failed or stalled

Comments

@michabbb

Problem Statement

Switching from any database to CrateDB is not always easy; tools such as https://github.com/bruin-data/ingestr could help a lot.
Also, syncing data to CrateDB is not that easy either. It would be nice to see a tool that can sync data from MySQL to CrateDB, because at the moment I need tons of bash scripts to make this work (mysqldump -> transform to JSON -> crash.....)

Possible Solutions

support this project: https://github.com/bruin-data/ingestr 😏

Considered Alternatives

No response

@matriv matriv transferred this issue from crate/crate Feb 28, 2024
@matriv

matriv commented Feb 28, 2024

@michabbb Thanks for opening this issue; it belongs better in this repo, as it concerns an integration tool.
Please also take a look at https://github.com/crate/cratedb-flink-jobs, where you can use the power of Flink to migrate data between many different data sources and CrateDB.

@amotl
Member

amotl commented Feb 28, 2024

Hi Michael,

thank you for writing in. ingestr indeed looks like an excellent tool, thanks for sharing. Can you elaborate on what it would take to make it support CrateDB? We see that ingestr might be based on SQLAlchemy, so if all plays well, it might already work out of the box, at least from a connectivity perspective?

With kind regards,
Andreas.

@michabbb
Author

@amotl I don't know the tool, I just saw it for the first time today, and my very first thought was: where is CrateDB? 😏 So I just wanted to let you know, that's all.

@amotl
Member

amotl commented Feb 28, 2024

All right, thanks. We will look into it!

@amotl amotl added the enhancement New feature or request label Feb 28, 2024
@amotl amotl changed the title from "support this project to make transition to crate easier: bruin-data/ingestr" to "ingestr" Feb 28, 2024
@michabbb
Author

@matriv, thanks for the note, but this would force someone to learn Flink and Kafka first, and to build the whole infrastructure for it; that's a monster for someone who sees it for the first time, sorry 🙈 So having a tool that can sync with just one command line would be a blessing. That's why I wanted to mention ingestr here 😏

@matriv

matriv commented Feb 29, 2024

You don't need Kafka if you migrate from, let's say, MySQL: just two JDBC connections, one for MySQL as the source and one for CrateDB as the sink. Of course, a tool like the one you suggest would ease things; I just wanted to point to a current alternative. Thanks again for bringing this up.

@amotl
Member

amotl commented Feb 29, 2024

One variant of our dream tool for conducting effortless data I/O would have a convenient and fluent Python API and command line interface like how ingestr is doing it, with the power of Apache Flink under the hood.

In this spirit, we are occasionally exploring that topic over at those projects, both being descendants or spin-offs of Kotori in one way or another.

As we see @bruin-data is doing an excellent job over there (thanks, @karakanb!), we will surely adjust our efforts, and will certainly start building upon and eventually contributing to ingestr on corresponding "universal I/O" matters, instead of rolling our own adapter/dialect broker.

@karakanb

Thanks for tagging me folks, and thanks a lot for your kind words on ingestr! ❤️

I didn't know about CrateDB before, but based on what I understand from the repository description, if it is PostgreSQL-compatible, a regular Postgres ingestion through ingestr should work out of the box with CrateDB as well. I'd encourage you to try it out.

If that doesn't work: ingestr utilizes dlt destinations for ease of use, and we can always develop custom destinations. The better way forward would be, if you'd be willing, to contribute directly to dlt, which means both ingestr and other users of dlt can benefit from it directly. If, for any reason, contributing to dlt is not an option, I'd happily accept a PR on the ingestr side.

@amotl
Member

amotl commented Feb 29, 2024

Dear Burak,

thanks for your quick reply, and also for telling us about dlt. We had it on the radar, but now that we learn from you that this is the home of corresponding pipeline elements you are using in ingestr, we will add a focus to improve compatibility with CrateDB, either by contributing, or by adding corresponding documentation in one way or another.

With kind regards,
Andreas.

@amotl amotl changed the title from "ingestr" to "ingestr: CLI tool to copy data between any databases" May 7, 2024
@amotl amotl added pitch A pitch for adding a new integration item, after evaluating it, and writing a tutorial and removed enhancement New feature or request labels May 28, 2024
@amotl
Member

amotl commented Nov 10, 2024

Hi again,

in order to approach the integration with dlt/ingestr, we just tried to run the canonical dlt example [1] with the PostgreSQL details from [2] on CrateDB.

pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination=dlt.destinations.postgres("postgresql://crate:@localhost/"),
    dataset_name='doc',
)

Unfortunately, it trips right away.

psycopg2.errors.InternalError_: line 26:19: extraneous input 'UNIQUE' expecting {',', ')'}

That probably means we should contribute an adapter for CrateDB to dlt?

With kind regards,
Andreas.

Footnotes

  1. https://github.com/dlt-hub/dlt#quick-start

  2. https://dlthub.com/docs/dlt-ecosystem/destinations/postgres
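As an editorial aside: the failure above stems from CrateDB not supporting UNIQUE column constraints, which dlt's generated PostgreSQL DDL contains. A minimal sketch of a hypothetical workaround (the `strip_unique` helper is purely illustrative, not part of dlt or ingestr) that removes the keyword before a statement is sent to CrateDB:

```python
import re

# Excerpt of the DDL that dlt generates for PostgreSQL; the UNIQUE
# column constraint is what CrateDB rejects.
DDL = '''CREATE TABLE IF NOT EXISTS "doc"."_dlt_pipeline_state" (
  "version" bigint NOT NULL,
  "_dlt_id" varchar UNIQUE NOT NULL);'''

def strip_unique(ddl: str) -> str:
    """Remove UNIQUE column constraints from a CREATE TABLE statement."""
    return re.sub(r"\s+UNIQUE\b", "", ddl)

print(strip_unique(DDL))
```

Note that dropping the constraint changes semantics: uniqueness of `_dlt_id` would no longer be enforced by the database.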

@amotl
Member

amotl commented Nov 10, 2024

When trying to use the SQLAlchemy destination like:

destination=dlt.destinations.sqlalchemy("crate://crate:@localhost/")

the example bails out with:

dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage sync with exception:

<class 'dlt.destinations.exceptions.DatabaseTransientException'>
(crate.client.exceptions.ProgrammingError) RelationUnknown[Relation 'doc._dlt_pipeline_state' unknown]

It looks like the process does not attempt to create the _dlt_pipeline_state table.

@amotl
Member

amotl commented Nov 10, 2024

When running the example on PostgreSQL, we have been able to discover a few DDL statements [1]. Looking at the particular one for creating _dlt_pipeline_state:

CREATE TABLE IF NOT EXISTS "doc"."_dlt_pipeline_state" (
  "version" bigint  NOT NULL,
  "engine_version" bigint  NOT NULL,
  "pipeline_name" varchar  NOT NULL,
  "state" varchar  NOT NULL,
  "created_at" timestamp with time zone  NOT NULL,
  "version_hash" varchar  ,
  "_dlt_load_id" varchar  NOT NULL,
  "_dlt_id" varchar UNIQUE NOT NULL);

After removing the UNIQUE constraint and running it on CrateDB, we observe a classic problem: CrateDB does not support column names starting with an underscore.

InvalidColumnNameException["_dlt_load_id" conflicts with system column pattern]

Footnotes

  1. We don't know why the DDL statements are not automatically issued when using the SQLAlchemy destination with the CrateDB dialect. Most probably, we are doing something wrong.
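The underscore conflict can be illustrated in isolation. Below is a minimal sketch of a hypothetical renaming shim, not an existing dlt or ingestr feature (dlt's naming-convention hooks may offer a cleaner place for such a mapping), that rewrites dlt's bookkeeping columns to names CrateDB accepts:

```python
def sanitize_column(name: str) -> str:
    """Strip leading underscores, which CrateDB reserves for system columns."""
    return name.lstrip("_")

# dlt bookkeeping columns as they appear in the generated DDL above.
columns = ["version", "_dlt_load_id", "_dlt_id"]
print([sanitize_column(c) for c in columns])
# -> ['version', 'dlt_load_id', 'dlt_id']
```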

@amotl amotl added poke Something that has been poked into, but either failed or stalled and removed pitch A pitch for adding a new integration item, after evaluating it, and writing a tutorial labels Dec 5, 2024