Retry backend execute on concurrent append #303

Draft · wants to merge 24 commits into main
Conversation

@JCZuurmond (Member) commented Sep 30, 2024

Retry backend execute on concurrent append

@JCZuurmond (Member, Author) left a review: See comments

@@ -182,6 +189,7 @@ def __init__( # pylint: disable=too-many-arguments,too-many-positional-argument
ColumnInfoTypeName.TIMESTAMP: self._parse_timestamp,
}

@retried(is_retryable=_is_retryable_delta_concurrent_append, timeout=timedelta(seconds=10))
@JCZuurmond (Member, Author):

@nfx: Is this what you were thinking of?

I have to think about the implications of always retrying this in lsql; maybe we should only retry within UCX instead.
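
A minimal sketch of what the predicate behind this decorator could look like. The SDK's retried helper (databricks.sdk.retries) calls is_retryable with the raised exception and keeps retrying while it returns a reason string; the exact Delta error marker matched below is an assumption for illustration, not the code in this PR.

from databricks.sdk.errors import DatabricksError

def _is_retryable_delta_concurrent_append(e: BaseException) -> str | None:
    # Retry only the specific Delta write conflict; any other error propagates.
    if isinstance(e, DatabricksError) and "DELTA_CONCURRENT_APPEND" in str(e):
        return "concurrent append to Delta table"
    return None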

@asnare (Contributor):
I think there needs to be a flag to control this, and it should default to off: it's not necessarily safe to blindly retry arbitrary SQL.

In general any time we do a 'read-modify-write' cycle, everything needs to start again from the read part because the modify (and write) often depend on it. Sometimes the read and modify bits are within the same SQL statement as the write, in which case this is safe. But often this is part of application code before we get to SQL and that may need to be restarted. In this situation only the application knows what to do.

Irrespective of this, whatever we do here also needs to end up in the .save_table() implementations: these don't all pass through .execute() and the same thing can happen there.
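
To make the hazard concrete, a hypothetical read-modify-write split across application code and SQL (the counters table and the backend handle are invented for illustration): retrying only the final statement after a concurrent-modification failure would rewrite a stale value and silently drop the other writer's change; only the caller can redo the read.

# `backend` is any lsql SqlBackend; the counters table is hypothetical.
row = next(iter(backend.fetch("SELECT n FROM counters WHERE id = 1")))  # read
new_value = row.n + 1                                                   # modify
backend.execute(f"UPDATE counters SET n = {new_value} WHERE id = 1")    # write
# Blindly retrying just this UPDATE after a conflict persists a stale new_value.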

@nfx (Collaborator):
  1. We need to support this in RuntimeBackend, separately.
  2. I think we need to throw a predefined common exception, per @asnare's point.

@JCZuurmond (Member, Author):
I will move the retry to the RuntimeBackend. What "predefined common exception" would you suggest throwing? I looked at sdk.errors and did not see one that really applies here.

tests/integration/test_backends.py (outdated review thread; resolved)
github-actions bot commented Sep 30, 2024

❌ 35/37 passed, 2 failed, 4 skipped, 7m59s total

❌ test_runtime_backend_handles_concurrent_append: databricks.sdk.errors.platform.BadRequest: [INSUFFICIENT_PERMISSIONS] Insufficient privileges: (1.544s)
databricks.sdk.errors.platform.BadRequest: [INSUFFICIENT_PERMISSIONS] Insufficient privileges:
User does not have permission CREATE on CATALOG. SQLSTATE: 42501
08:48 DEBUG [databricks.sdk] Loaded from environment
08:48 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
08:48 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
08:48 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
08:48 INFO [databricks.sdk] Using Databricks Metadata Service authentication
08:48 DEBUG [databricks.labs.lsql.backends] [api][execute] CREATE SCHEMA hive_metastore.dummy_su8gf WITH DBPROPERTIES (RemoveAfter=2024100110)
08:48 DEBUG [databricks.labs.lsql.core] Executing SQL statement: CREATE SCHEMA hive_metastore.dummy_su8gf WITH DBPROPERTIES (RemoveAfter=2024100110)
08:48 DEBUG [databricks.sdk] POST /api/2.0/sql/statements/
> {
>   "format": "JSON_ARRAY",
>   "statement": "CREATE SCHEMA hive_metastore.dummy_su8gf WITH DBPROPERTIES (RemoveAfter=2024100110)",
>   "warehouse_id": "DATABRICKS_WAREHOUSE_ID"
> }
< 200 OK
< {
<   "statement_id": "01ef7fd1-ed57-1594-b307-39ce767ae1fa",
<   "status": {
<     "error": {
<       "error_code": "BAD_REQUEST",
<       "message": "[INSUFFICIENT_PERMISSIONS] Insufficient privileges:\nUser does not have permission CREATE on CATA... (20 more bytes)"
<     },
<     "state": "FAILED"
<   }
< }
08:48 DEBUG [databricks.labs.pytester.fixtures.baseline] clearing 0 table fixtures
08:48 DEBUG [databricks.labs.pytester.fixtures.baseline] clearing 0 schema fixtures
[gw5] linux -- Python 3.10.15 /home/runner/work/lsql/lsql/.venv/bin/python
❌ test_runtime_backend_handles_concurrent_append: databricks.sdk.errors.platform.BadRequest: [INSUFFICIENT_PERMISSIONS] Insufficient privileges: (654ms)
databricks.sdk.errors.platform.BadRequest: [INSUFFICIENT_PERMISSIONS] Insufficient privileges:
User does not have permission CREATE on CATALOG. SQLSTATE: 42501
08:48 DEBUG [databricks.sdk] Loaded from environment
08:48 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
08:48 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
08:48 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
08:48 INFO [databricks.sdk] Using Databricks Metadata Service authentication
08:48 DEBUG [databricks.labs.lsql.backends] [api][execute] CREATE SCHEMA hive_metastore.dummy_sdw8g WITH DBPROPERTIES (RemoveAfter=2024100110)
08:48 DEBUG [databricks.labs.lsql.core] Executing SQL statement: CREATE SCHEMA hive_metastore.dummy_sdw8g WITH DBPROPERTIES (RemoveAfter=2024100110)
08:48 DEBUG [databricks.sdk] POST /api/2.0/sql/statements/
> {
>   "format": "JSON_ARRAY",
>   "statement": "CREATE SCHEMA hive_metastore.dummy_sdw8g WITH DBPROPERTIES (RemoveAfter=2024100110)",
>   "warehouse_id": "DATABRICKS_WAREHOUSE_ID"
> }
< 200 OK
< {
<   "statement_id": "01ef7fd1-ee62-1a84-817c-5d4634fe170b",
<   "status": {
<     "error": {
<       "error_code": "BAD_REQUEST",
<       "message": "[INSUFFICIENT_PERMISSIONS] Insufficient privileges:\nUser does not have permission CREATE on CATA... (20 more bytes)"
<     },
<     "state": "FAILED"
<   }
< }
08:48 DEBUG [databricks.labs.pytester.fixtures.baseline] clearing 0 table fixtures
08:48 DEBUG [databricks.labs.pytester.fixtures.baseline] clearing 0 schema fixtures
[gw5] linux -- Python 3.10.15 /home/runner/work/lsql/lsql/.venv/bin/python

Running from acceptance #433

tests/integration/test_backends.py (two more outdated review threads; resolved)

src/databricks/labs/lsql/core.py (review thread; resolved)

@JCZuurmond (Member, Author) left a review: See comments

@@ -119,6 +120,10 @@ def __repr__(self):
return f"Row({', '.join(f'{k}={v!r}' for (k, v) in zip(self.__columns__, self, strict=True))})"


class DeltaConcurrentAppend(DatabricksError):
@JCZuurmond (Member, Author):
Introduced this error

@nfx (Collaborator):
ConcurrentModification, to be precise - we can concurrently delete, append, or update.
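
One possible shape for that naming - purely a sketch of the suggestion, not what this PR merged; the subclasses are assumptions:

from databricks.sdk.errors import DatabricksError

class ConcurrentModification(DatabricksError):
    """Another transaction changed the table while this statement ran."""

class ConcurrentAppend(ConcurrentModification): ...   # hypothetical subclass
class ConcurrentDelete(ConcurrentModification): ...   # hypothetical subclass
class ConcurrentUpdate(ConcurrentModification): ...   # hypothetical subclass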

@@ -139,6 +172,27 @@ def test_runtime_backend_errors_handled(ws, query):
assert result == "PASSED"


def test_runtime_backend_handles_concurrent_append(ws, make_random, make_table) -> None:
@JCZuurmond (Member, Author):
@nfx: I copied this from the tests above to integration-test the runtime backend. However, it does not really exercise the RuntimeBackend. Is this the correct approach for integration testing the runtime backend? Alternatively, I would introduce a local Spark session using pytest-spark, as sketched below.
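
A rough sketch of that pytest-spark alternative, under two stated assumptions: that RuntimeBackend guards on the DATABRICKS_RUNTIME_VERSION environment variable (faked here), and that it discovers the active session via SparkSession.builder.getOrCreate(), so it would pick up the local spark_session fixture.

from databricks.labs.lsql.backends import RuntimeBackend

def test_runtime_backend_locally(spark_session, monkeypatch):  # fixture from pytest-spark
    monkeypatch.setenv("DATABRICKS_RUNTIME_VERSION", "15.4")   # satisfy the runtime guard (assumed)
    backend = RuntimeBackend()  # assumed to find the fixture's local SparkSession
    backend.execute("CREATE OR REPLACE TEMP VIEW dummy AS SELECT 1 AS x")
    assert next(iter(backend.fetch("SELECT x FROM dummy"))).x == 1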

commands.run(CONCURRENT_APPEND.format(table_full_name=table.full_name))

try:
Threads.strict("concurrent appends", [update_table, update_table])
@JCZuurmond (Member, Author):
I think this does not fail because of the lock in the CommandExecutor; see the sketch below for one way around it.
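
A hypothetical way around that lock: give each writer its own StatementExecutionBackend so the two UPDATEs genuinely race. Here `ws`, the warehouse id, and the UPDATE predicate are stand-ins from the surrounding fixtures, not this PR's code.

import os
from databricks.labs.blueprint.parallel import Threads
from databricks.labs.lsql.backends import StatementExecutionBackend

def make_appender(table_full_name: str):
    # One backend per task, so no shared executor serializes the statements.
    backend = StatementExecutionBackend(ws, os.environ["DATABRICKS_WAREHOUSE_ID"])
    def update_table():
        backend.execute(f"UPDATE {table_full_name} SET y = y * 2 WHERE x > 0")
    return update_table

Threads.strict("concurrent appends", [make_appender(table.full_name) for _ in range(2)])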

3 participants