
SQL Server implementation #5

Merged
merged 11 commits into from
Apr 26, 2024
Conversation

RobThree
Contributor

@RobThree RobThree commented Apr 22, 2024

This PR adds a Microsoft SQL Server implementation.

I have 'ported' the Postgres SQL statements to T-SQL as closely as I could. One thing I ran into is that the rowversion mechanism differs quite a bit between the two. As far as I can see, the Outbox doesn't seem to use the rowversion anywhere(?), and I tried to keep the ProcessManager as close to the Postgres version as possible. One of the issues I had is that a rowversion (or timestamp) in SQL Server is an 8-byte (64-bit) value, whereas VersionedState requires an int for the version, and since there's no (safe) way to convert a 64-bit value to a 32-bit value I opted to 'DIY' the rowversion; the MERGE statement should be atomic.
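As a sketch of what I mean by 'DIY' (the column and parameter names here are illustrative, not necessarily the ones in this PR): a single MERGE can check an expected version and bump an int version column atomically:

```sql
-- Hypothetical sketch of the 'DIY rowversion' idea; names are illustrative.
-- A single MERGE statement is atomic, so the version check, the state update,
-- and the version bump happen together. HOLDLOCK guards against the classic
-- MERGE upsert race under concurrent inserts.
MERGE INTO [processes] WITH (HOLDLOCK) AS target
USING (SELECT @process_type AS process_type, @process_id AS process_id) AS source
    ON target.process_type = source.process_type
   AND target.process_id = source.process_id
WHEN MATCHED AND target.[version] = @expected_version THEN
    UPDATE SET process_state    = @process_state,
               [version]        = target.[version] + 1,
               date_updated_utc = SYSUTCDATETIME()
WHEN NOT MATCHED THEN
    INSERT (process_type, process_id, process_state, [version], date_added_utc, date_updated_utc)
    VALUES (@process_type, @process_id, @process_state, 1, SYSUTCDATETIME(), SYSUTCDATETIME());
```

Checking @@ROWCOUNT afterwards tells the caller whether the expected version matched (optimistic concurrency), instead of relying on an 8-byte rowversion that doesn't fit the int-typed VersionedState.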

It's also not very clear to me what use the date_added_utc and date_updated_utc fields have, but maybe I'm missing some implementation detail of the Postgres version that I'm not aware of. I did keep (and populate) them in the SQL Server variant where applicable.

Also, there weren't (and aren't) any integration tests for the Outbox as far as I can see. Again, I tried to keep this as close as possible to the Postgres variant, but I haven't tested it (yet).

In the integration tests project I renamed the KafkaFlowFixture to a more specific PostgreKafkaFlowFixture, then copied it to a SqlServerKafkaFlowFixture and changed the copy to work with SQL Server. The same goes for the LifecycleProcessManagerTests, which I renamed to PostgresLifecycleProcessManagerTests and copied to SqlServerLifecycleProcessManagerTests for SQL Server. This may need some work to de-duplicate (DRY) the integration tests.

So, things that need attention:

  • Rowversion
  • Outbox isn't tested
  • Code duplication in the integration tests

Since SQL Server (or rather: the SqlClient) handles connection pooling internally, we don't need to set up a pool like in the Postgres variant. I did, however, have to create a small shared, common library called Contrib.KafkaFlow.SqlServer, which is used by both KafkaFlow.Outbox.SqlServer and KafkaFlow.ProcessManagers.SqlServer. This library only contains a few extension methods to configure SQL Server usage and a SqlServerOptions record, so we don't have to pass a "bare" connection string to the SqlServerOutboxBackend and SqlServerProcessManagersStore classes.

Other than that, I think this should at least be a good start for a SQL Server implementation.

⚠️ I couldn't get the initdb.sh script to trigger under the mssql container. I have a few solutions/workarounds but am still deciding which one to use. Your opinion / help would be very much appreciated. For now, you'll need to run the /docker-entrypoint-initdb.d/initdb.sh script manually before running the integration tests. Fixed in 7385102

Oh, one more thing: It isn't very clear to me how you'd use this library in an existing project. I assume the developer is expected to create the outbox and processes tables "manually"? I would expect the table(s) creation to be handled either on demand by invoking some method or automatically the first time the database is approached or something?

Finally: I have some other ideas / plans:

  • I'd like to get rid of a lot of compiler warnings (4 currently) and messages (27 currently) by 'modernizing' the code a little (using primary constructors, collection initializers, etc.)
    EDIT: Most of this is done (see this branch). I'll create a new PR for this once this one has been merged.
  • I'd like to try lowering the .NET requirement to, say, .NET 7, 6, or even better .NET Standard 2.0/2.1, without too much effort or breaking things, for broader platform support.
    EDIT: Done; I managed to lower the requirement to .NET Standard 2.0 (see this branch or this commit). This means that even .NET Framework developers can use this library. I'll create a new PR for this once this one and the 'modernize' one (above) have been merged.
  • I referred to code duplication in the integration tests earlier, but the actual Postgres and SQL Server implementations are also very close implementation-wise; I'd like to see if that could be relieved or solved by introducing some interfaces and making the code a little more generic.
    EDIT: Done. See this branch
  • Use a schema as per @kchenery remarks: Done, see this branch
  • This project should have a LICENSE. Not sure what you prefer, but KafkaFlow is MIT and I usually also use MIT; but that's of course 100% up to you.

@RobThree
Contributor Author

Sorry, just a tiny nudge to let you know I have two more PRs ready (see the checkboxes above) and have updated the PR description above. Take your time. Thanks!

@AlexeyRaga
Owner

Thanks @RobThree, awesome work! Really appreciate it!
I'll see if I can get someone with much better SQL Server knowledge than my own to look at the PR and maybe to make some suggestions around your questions regarding VersionedState.

I know that in PG xmin is 32 bits and I see the complication here...
One thing that we could do is to make VersionedState 64-bit, and unsafe-cast it to 32 bits for the PG implementation... Under the premise that in practice PG will never exceed 32 bits...

But perhaps your "custom" solution would be sufficient, too! :)

Thanks again, and I hope to get back to you soon!

@AlexeyRaga
Owner

As for "how to provision schemas": we typically do it from CI/CD pipelines. DbUp is one way of doing it, but, really, it depends.
What I did not want to do is to auto-provision these schemas from code.

Because giving DDL permissions to the service feels just wrong, and the migration story gets too complicated too quickly...

Of course, any suggestion to how to make it better is appreciated!

@RobThree
Contributor Author

RobThree commented Apr 24, 2024

What I did not want to do is to auto-provision these schemas from code.

Understand completely.

Of course, any suggestion to how to make it better is appreciated!

I'm fine with leaving it as-is, no problem. However, a compromise could be to have developers opt-in and call some InitDb() method for example which would create (or migrate) the required tables. But either way, I'm fine with whatever you prefer. Another suggestion would be to at least document the required tables and provide (links to) scripts in the documentation.

One thing that we could do is to make VersionedState 64-bit, and unsafe-cast it to 32 bits for the PG implementation... Under the premise that in practice PG will never exceed 32 bits...

I haven't thought this through (yet), nor looked into the feasibility, but maybe we can make it an abstract base class and derive a "PgVersionedState" and "SqlServerVersionedState" from them...
Nope, that won't work. And, like you, I'm not very comfortable unsafe-casting the version. I'll think about it some more.

Contributor

@kchenery kchenery left a comment

@RobThree I think it would be good to tidy up the schema files. We shouldn't create the DB here - that should be up to the test fixture (I recommend Testcontainers to simplify it too).

@AlexeyRaga I have a question about ordering for the outbox which might require a bit of a think/change for the rowversion column.

@@ -0,0 +1,5 @@
IF NOT EXISTS(SELECT * FROM sys.databases WHERE name = 'kafkaflowtest')
BEGIN
CREATE DATABASE [kafkaflowtest]
Contributor

@kchenery kchenery Apr 24, 2024

I think you should be creating the schema here, not the database. This might work for tests... but won't work for production environments. We can't assume the name of the DB there.

For tests - I'd be looking at using MSSQL Testcontainers. @AlexeyRaga would you consider that + DbUp, possibly embedded in the code startup (or similar)?

Back to here - I think this should be:

IF NOT EXISTS(SELECT * FROM sys.schemas WHERE name = 'outbox')
BEGIN
	CREATE SCHEMA [outbox]
END

There'll be similar changes below too.
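One wrinkle to watch for: in T-SQL, CREATE SCHEMA must be the first statement in its batch, so inside an IF it typically needs to be wrapped in dynamic SQL (a sketch):

```sql
-- CREATE SCHEMA must be the first statement in a batch, so wrap it
-- in EXEC to keep the existence check in the same script.
IF NOT EXISTS (SELECT * FROM sys.schemas WHERE name = 'outbox')
BEGIN
    EXEC (N'CREATE SCHEMA [outbox]');
END
```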

Owner

I am happy to take the test containers route.
I have never done it myself though, so can't speak from experience :)

/* Table */
IF OBJECT_ID(N'[kafkaflowtest].[dbo].[outbox]', N'U') IS NULL
BEGIN
CREATE TABLE [kafkaflowtest].[dbo].[outbox](
Contributor

We shouldn't specify the DB here. And if we change to outbox as the schema (to keep in line with the PG implementation) that would change to [outbox].[outbox]

e.g:

IF OBJECT_ID(N'[outbox].[outbox]', N'U') IS NULL
BEGIN
    CREATE TABLE [outbox].[outbox](

/* Table */
IF OBJECT_ID(N'[kafkaflowtest].[dbo].[processes]', N'U') IS NULL
BEGIN
CREATE TABLE [kafkaflowtest].[dbo].[processes](
Contributor

Same comment as above with the outbox table regarding the database name and schema name.

@@ -0,0 +1,5 @@
IF NOT EXISTS(SELECT * FROM sys.databases WHERE name = 'kafkaflowtest')
BEGIN
CREATE DATABASE [kafkaflowtest]
Contributor

I think this should be setting the schema and not the DB name. As per my comment for the outbox table

@RobThree
Contributor Author

RobThree commented Apr 24, 2024

@kchenery As per your database/schema remarks: I agree, will fix that. I am currently working on separating the actual database implementation from the rest of the logic and I'm close to done with that.

As for the rowversion: while refactoring I found another issue with it: it is returned by Npgsql as uint but can't be set as uint? I may be mistaken though; looking into it as we speak.

I'll update here when I'm done with the code de-duplication / separating the RDBMS-specific stuff.

@RobThree
Contributor Author

RobThree commented Apr 24, 2024

Would anyone have an objection to changing the sequence_id from SERIAL to BIGSERIAL? As an alternative I can use int for SQL Server instead of bigint. But I think using a bigint / BIGSERIAL is better?
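For reference, the two options compared side by side (a sketch; the actual table definitions in this PR differ):

```sql
-- Postgres: BIGSERIAL is shorthand for a BIGINT column backed by a sequence.
CREATE TABLE outbox.outbox (
    sequence_id BIGSERIAL PRIMARY KEY
    -- ... remaining columns ...
);

-- SQL Server equivalent (sketch): BIGINT with an IDENTITY property.
CREATE TABLE [outbox].[outbox] (
    [sequence_id] BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY
    -- ... remaining columns ...
);
```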

@RobThree
Contributor Author

RobThree commented Apr 24, 2024

I am currently working on separating the actual database implementation from the rest of the logic and I'm close to done with that.

Done.

@kchenery As per your database/schema remarks: I agree, will fix that.

Schema is fixed here, see 45abefa

So, summarizing: I updated the PR and have 4 branches ready to PR after this one (I could make it one huge PR, but I don't think anyone would enjoy that 😜 ). In order, they are: modernize (no functional changes), lower-framework-support (changes the required .NET version to .NET Standard 2.0), code-deduplication (refactors RDBMS-specific stuff into "repositories"; better name suggestions welcome) and mssql-use-schema (uses schemas in the integration tests for MSSQL).

What's left is:

  • Can we switch to BIGSERIAL (which will make things easier and level the playing field)? Alternatively I can switch to using integers for the SQL Server implementation
  • @kchenery's question about rowversion. I'm happy to work on that once we make a decision
  • See PR description: look into adding a LICENSE?

@AlexeyRaga
Owner

But I think using a bigint / BIGSERIAL is better?

I think it'd be a good change. AFAIK SERIAL is not reset when data is deleted from the table, so reaching the ~2 billion INT limit and then getting an overflow error seems possible with time.

I think we should do it (and write a migration script).

What do you think @kchenery ?

@kchenery
Contributor

kchenery commented Apr 25, 2024

DB change is trivial... it's basically just this migration script:

ALTER TABLE outbox.outbox ALTER COLUMN sequence_id TYPE BIGINT

I'm unsure how long that'll take on a large table... but people's outboxes should be mostly empty.

I'll submit a PR for that. I haven't looked at what code you'd have to change, and you'll want to do that first.
Update: See #7

@RobThree
Contributor Author

RobThree commented Apr 25, 2024

I have already changed the existing scripts and code. I propose we (work on) merging the branches so far and then see where we stand? (Because I'm kind of losing oversight otherwise 😓 ). It should be alright - as long as you agree with the other changes I made - and then we can add a migration and make final changes (if any) before a new release. Or am I getting ahead of myself?

@AlexeyRaga
Owner

I'm unsure how long that'll take on a large table... but people's outboxes should be mostly empty.

It should mostly be empty... But to my knowledge sequences don't get reset even if the rows are deleted from the database... Am I wrong?

Also, as @RobThree mentioned, shouldn't it be BIGSERIAL and not BIGINT?

@kchenery
Contributor

kchenery commented Apr 25, 2024

Deletes do not reset serial values. Truncates don't either (unlike SQL Server, which does reset the identity values).

And no, it should be BIGINT. I know because I had to test it 😄 When you do the change, GUIs will still show it as BIGSERIAL though. Guessing a bit: you're changing the underlying data type and not so much its behaviour.
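A quick way to confirm the underlying type after the change (a sketch, assuming the outbox.outbox table from this thread):

```sql
-- information_schema reports the underlying data type, not the SERIAL shorthand.
SELECT data_type
FROM information_schema.columns
WHERE table_schema = 'outbox'
  AND table_name   = 'outbox'
  AND column_name  = 'sequence_id';
-- expected: bigint (after the migration)
```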

@AlexeyRaga
Owner

OK, we've merged the BIGINT/BIGSERIAL thing in.

@RobThree I guess that we can merge this one too once Kent's suggestions about database/schema are addressed.

@RobThree
Contributor Author

Should be good. Though this still doesn't use a schema (that's in another, later branch); have you had a chance to look at the other changes I made?

@AlexeyRaga
Owner

Not really, I've only looked at what is in this PR...

Can we have the database/schema changes as part of this PR? Looks like you already have them, so it shouldn't be too much trouble?

@RobThree
Contributor Author

No, shouldn't be a problem, I just need to rebase some stuff. Let me see what I can do.

@RobThree
Contributor Author

Done. Should be good to go.

@AlexeyRaga AlexeyRaga merged commit f87a685 into AlexeyRaga:main Apr 26, 2024
@AlexeyRaga
Owner

Thanks @RobThree, AWESOME contribution! Really appreciated!

@AlexeyRaga
Owner

Unfortunately, integration tests are failing...

@AlexeyRaga
Owner

Fixed! The new set of packages is released!

@RobThree RobThree deleted the mssql branch April 26, 2024 08:03