Retry backend execute on concurrent append #303

Draft · wants to merge 24 commits into main
Conversation

@JCZuurmond (Member) commented Sep 30, 2024

Retry backend execute on concurrent append

@JCZuurmond (Member, Author) left a review: See comments

@@ -182,6 +189,7 @@ def __init__( # pylint: disable=too-many-arguments,too-many-positional-argument
ColumnInfoTypeName.TIMESTAMP: self._parse_timestamp,
}

@retried(is_retryable=_is_retryable_delta_concurrent_append, timeout=timedelta(seconds=10))
@JCZuurmond (Member, Author):

@nfx: Is this what you were thinking of?

I have to think about the implications of always retrying this in lsql; maybe we should only retry within UCX instead.
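
A minimal sketch of what the predicate behind this decorator could look like. The SDK's retried helper (databricks.sdk.retries) calls is_retryable with the raised exception and keeps retrying while it returns a reason string; the exact Delta error marker matched below is an assumption for illustration, not the code in this PR.

from databricks.sdk.errors import DatabricksError

def _is_retryable_delta_concurrent_append(e: BaseException) -> str | None:
    # Retry only the specific Delta write conflict; any other error propagates.
    if isinstance(e, DatabricksError) and "DELTA_CONCURRENT_APPEND" in str(e):
        return "concurrent append to Delta table"
    return None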

@asnare (Contributor):
I think there needs to be a flag to control this, and it should default to off: it's not necessarily safe to blindly retry arbitrary SQL.

In general any time we do a 'read-modify-write' cycle, everything needs to start again from the read part because the modify (and write) often depend on it. Sometimes the read and modify bits are within the same SQL statement as the write, in which case this is safe. But often this is part of application code before we get to SQL and that may need to be restarted. In this situation only the application knows what to do.

Irrespective of this, whatever we do here also needs to end up in the .save_table() implementations: these don't all pass through .execute() and the same thing can happen there.
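
To make the hazard concrete, a hypothetical read-modify-write split across application code and SQL (the counters table and the backend handle are invented for illustration): retrying only the final statement after a concurrent-modification failure would rewrite a stale value and silently drop the other writer's change; only the caller can redo the read.

# `backend` is any lsql SqlBackend; the counters table is hypothetical.
row = next(iter(backend.fetch("SELECT n FROM counters WHERE id = 1")))  # read
new_value = row.n + 1                                                   # modify
backend.execute(f"UPDATE counters SET n = {new_value} WHERE id = 1")    # write
# Blindly retrying just this UPDATE after a conflict persists a stale new_value.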

@nfx (Collaborator):
  1. We need to support this in RuntimeBackend, separately.
  2. I think we need to throw a predefined common exception, per @asnare's point.

@JCZuurmond (Member, Author):
I will move the retry to the RuntimeBackend. What "predefined common exception" would you suggest throwing? I looked at sdk.errors and did not see one that really applies here.

tests/integration/test_backends.py (outdated review thread; resolved)
github-actions bot commented Sep 30, 2024

❌ 35/37 passed, 2 failed, 4 skipped, 7m59s total

❌ test_runtime_backend_handles_concurrent_append: databricks.sdk.errors.platform.BadRequest: [INSUFFICIENT_PERMISSIONS] Insufficient privileges: (1.544s)
databricks.sdk.errors.platform.BadRequest: [INSUFFICIENT_PERMISSIONS] Insufficient privileges:
User does not have permission CREATE on CATALOG. SQLSTATE: 42501
08:48 DEBUG [databricks.sdk] Loaded from environment
08:48 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
08:48 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
08:48 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
08:48 INFO [databricks.sdk] Using Databricks Metadata Service authentication
08:48 DEBUG [databricks.labs.lsql.backends] [api][execute] CREATE SCHEMA hive_metastore.dummy_su8gf WITH DBPROPERTIES (RemoveAfter=2024100110)
08:48 DEBUG [databricks.labs.lsql.core] Executing SQL statement: CREATE SCHEMA hive_metastore.dummy_su8gf WITH DBPROPERTIES (RemoveAfter=2024100110)
08:48 DEBUG [databricks.sdk] POST /api/2.0/sql/statements/
> {
>   "format": "JSON_ARRAY",
>   "statement": "CREATE SCHEMA hive_metastore.dummy_su8gf WITH DBPROPERTIES (RemoveAfter=2024100110)",
>   "warehouse_id": "DATABRICKS_WAREHOUSE_ID"
> }
< 200 OK
< {
<   "statement_id": "01ef7fd1-ed57-1594-b307-39ce767ae1fa",
<   "status": {
<     "error": {
<       "error_code": "BAD_REQUEST",
<       "message": "[INSUFFICIENT_PERMISSIONS] Insufficient privileges:\nUser does not have permission CREATE on CATA... (20 more bytes)"
<     },
<     "state": "FAILED"
<   }
< }
08:48 DEBUG [databricks.labs.pytester.fixtures.baseline] clearing 0 table fixtures
08:48 DEBUG [databricks.labs.pytester.fixtures.baseline] clearing 0 schema fixtures
[gw5] linux -- Python 3.10.15 /home/runner/work/lsql/lsql/.venv/bin/python
❌ test_runtime_backend_handles_concurrent_append: databricks.sdk.errors.platform.BadRequest: [INSUFFICIENT_PERMISSIONS] Insufficient privileges: (654ms)
databricks.sdk.errors.platform.BadRequest: [INSUFFICIENT_PERMISSIONS] Insufficient privileges:
User does not have permission CREATE on CATALOG. SQLSTATE: 42501
08:48 DEBUG [databricks.sdk] Loaded from environment
08:48 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
08:48 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
08:48 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
08:48 INFO [databricks.sdk] Using Databricks Metadata Service authentication
08:48 DEBUG [databricks.labs.lsql.backends] [api][execute] CREATE SCHEMA hive_metastore.dummy_sdw8g WITH DBPROPERTIES (RemoveAfter=2024100110)
08:48 DEBUG [databricks.labs.lsql.core] Executing SQL statement: CREATE SCHEMA hive_metastore.dummy_sdw8g WITH DBPROPERTIES (RemoveAfter=2024100110)
08:48 DEBUG [databricks.sdk] POST /api/2.0/sql/statements/
> {
>   "format": "JSON_ARRAY",
>   "statement": "CREATE SCHEMA hive_metastore.dummy_sdw8g WITH DBPROPERTIES (RemoveAfter=2024100110)",
>   "warehouse_id": "DATABRICKS_WAREHOUSE_ID"
> }
< 200 OK
< {
<   "statement_id": "01ef7fd1-ee62-1a84-817c-5d4634fe170b",
<   "status": {
<     "error": {
<       "error_code": "BAD_REQUEST",
<       "message": "[INSUFFICIENT_PERMISSIONS] Insufficient privileges:\nUser does not have permission CREATE on CATA... (20 more bytes)"
<     },
<     "state": "FAILED"
<   }
< }
08:48 DEBUG [databricks.labs.pytester.fixtures.baseline] clearing 0 table fixtures
08:48 DEBUG [databricks.labs.pytester.fixtures.baseline] clearing 0 schema fixtures
[gw5] linux -- Python 3.10.15 /home/runner/work/lsql/lsql/.venv/bin/python

Running from acceptance #433

tests/integration/test_backends.py (two more outdated review threads; resolved)

src/databricks/labs/lsql/core.py (review thread; resolved)

@JCZuurmond (Member, Author) left a review: See comments

@@ -119,6 +120,10 @@ def __repr__(self):
return f"Row({', '.join(f'{k}={v!r}' for (k, v) in zip(self.__columns__, self, strict=True))})"


class DeltaConcurrentAppend(DatabricksError):
@JCZuurmond (Member, Author):
Introduced this error

@nfx (Collaborator):
ConcurrentModification, to be precise - we can concurrently delete, append, or update.
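
One possible shape for that naming - purely a sketch of the suggestion, not what this PR merged; the subclasses are assumptions:

from databricks.sdk.errors import DatabricksError

class ConcurrentModification(DatabricksError):
    """Another transaction changed the table while this statement ran."""

class ConcurrentAppend(ConcurrentModification): ...   # hypothetical subclass
class ConcurrentDelete(ConcurrentModification): ...   # hypothetical subclass
class ConcurrentUpdate(ConcurrentModification): ...   # hypothetical subclass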

@@ -139,6 +172,27 @@ def test_runtime_backend_errors_handled(ws, query):
assert result == "PASSED"


def test_runtime_backend_handles_concurrent_append(ws, make_random, make_table) -> None:
@JCZuurmond (Member, Author):
@nfx: I copied this from the tests above to integration-test the runtime backend. However, it does not really exercise the RuntimeBackend. Is this the correct approach for integration testing the runtime backend? Alternatively, I would introduce a local Spark session using pytest-spark, as sketched below.
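
A rough sketch of that pytest-spark alternative, under two stated assumptions: that RuntimeBackend guards on the DATABRICKS_RUNTIME_VERSION environment variable (faked here), and that it discovers the active session via SparkSession.builder.getOrCreate(), so it would pick up the local spark_session fixture.

from databricks.labs.lsql.backends import RuntimeBackend

def test_runtime_backend_locally(spark_session, monkeypatch):  # fixture from pytest-spark
    monkeypatch.setenv("DATABRICKS_RUNTIME_VERSION", "15.4")   # satisfy the runtime guard (assumed)
    backend = RuntimeBackend()  # assumed to find the fixture's local SparkSession
    backend.execute("CREATE OR REPLACE TEMP VIEW dummy AS SELECT 1 AS x")
    assert next(iter(backend.fetch("SELECT x FROM dummy"))).x == 1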

commands.run(CONCURRENT_APPEND.format(table_full_name=table.full_name))

try:
Threads.strict("concurrent appends", [update_table, update_table])
@JCZuurmond (Member, Author):
I think this does not fail because of the lock in the CommandExecutor; see the sketch below for one way around it.
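
A hypothetical way around that lock: give each writer its own StatementExecutionBackend so the two UPDATEs genuinely race. Here `ws`, the warehouse id, and the UPDATE predicate are stand-ins from the surrounding fixtures, not this PR's code.

import os
from databricks.labs.blueprint.parallel import Threads
from databricks.labs.lsql.backends import StatementExecutionBackend

def make_appender(table_full_name: str):
    # One backend per task, so no shared executor serializes the statements.
    backend = StatementExecutionBackend(ws, os.environ["DATABRICKS_WAREHOUSE_ID"])
    def update_table():
        backend.execute(f"UPDATE {table_full_name} SET y = y * 2 WHERE x > 0")
    return update_table

Threads.strict("concurrent appends", [make_appender(table.full_name) for _ in range(2)])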

3 participants