refactor: deprecate usage of cursor.execute statements in favor of the in class implementation of execute. #60748

Draft
wants to merge 5 commits into
base: main
from

Conversation

gmcrocetti

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@gmcrocetti gmcrocetti force-pushed the refactor-io-sql-execute branch 5 times, most recently from 04fee59 to e9cbf63 Compare January 22, 2025 02:15
@gmcrocetti
Author

gmcrocetti commented Jan 22, 2025

Hello @WillAyd.
So this is the follow up of #60376.
I updated the code base to use pandas' execute implementation as much as possible. I couldn't replace every call site, as a simple git grep '.exec' -- pandas/io/sql.py will show. Anyway, the places I think are worth mentioning and reviewing are the following:

  1. SQLTable._execute_insert
  2. SQLTable._execute_insert_multi
  3. SQLiteTable._execute_insert
  4. SQLiteTable._execute_insert_multi

This should be no problem because we can always wrap that execution in a try-except block (as I did in SQLiteTable._execute_insert).
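To make the try-except idea concrete, here is a minimal standalone sketch using the standard-library sqlite3 module. The helper name execute_insert and the local DatabaseError class are hypothetical stand-ins for illustration; this is not the code from the PR.

```python
import sqlite3


class DatabaseError(OSError):
    """Hypothetical local stand-in for the error type pandas raises."""


def execute_insert(con, sql, rows):
    """Run a multi-row insert, converting driver errors to DatabaseError."""
    cur = con.cursor()
    try:
        cur.executemany(sql, rows)
        return cur.rowcount
    except sqlite3.Error as exc:
        # Undo the partial work before surfacing the wrapped error.
        con.rollback()
        raise DatabaseError(f"Execution failed on sql '{sql}': {exc}") from exc
    finally:
        # The cursor is closed whether or not the insert succeeded.
        cur.close()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER PRIMARY KEY)")
inserted = execute_insert(con, "INSERT INTO t VALUES (?)", [(1,), (2,)])
con.commit()
```

Inserting a duplicate primary key afterwards would raise the wrapped DatabaseError instead of the raw sqlite3 exception.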

Would you mind taking a look and letting me know what you think while this is still in draft?

@gmcrocetti gmcrocetti force-pushed the refactor-io-sql-execute branch from e9cbf63 to ff41294 Compare January 22, 2025 02:16
@WillAyd WillAyd added Refactor Internal refactoring of code IO SQL to_sql, read_sql, read_sql_query labels Jan 22, 2025
@gmcrocetti gmcrocetti force-pushed the refactor-io-sql-execute branch from ff41294 to 9e0f436 Compare January 22, 2025 23:53
Member

@WillAyd WillAyd left a comment

One comment but generally this looks good. @mroeschke can you take a look as well?

pandas/io/sql.py Outdated
      for stmt in self.table:
    -     conn.execute(stmt)
    +     self.pd_sql.execute(stmt).close()
Member

What is the need for .close() here? I am hoping we can avoid anything that implicitly changes the transaction state

Author

I wouldn't say there's a need for that, but it is usually good practice to close any cursor that is opened. We can remove it to maintain the status quo; no problem with that.

Member

Oh OK, I misread this as closing the transaction. Does whatever pd_sql.execute returns not follow the context manager protocol? We should prefer with statements whenever a context like that needs to be managed.

Author

Oh, no worries. I'm super thankful for all the help and attentive comments.

Yeah, it does implement the context manager protocol. This was a personal choice and has no technical reason:

    with self.pd_sql.run_transaction():
        for stmt in self.table:
            self.pd_sql.execute(stmt).close()

Reads better to me than

    with self.pd_sql.run_transaction():
        for stmt in self.table:
            with self.pd_sql.execute(stmt):
                pass

What do you think? :)

Member

I'm always in favor of using the context manager over calling close manually.

Member

Looking at this some more, I'm still somewhat concerned about the object lifetime management. For example, the SQLiteDatabase implementation of execute looks like this:

    def execute(self, sql: str | Select | TextClause, params=None):
        if not isinstance(sql, str):
            raise TypeError("Query must be a string unless using sqlalchemy.")
        args = [] if params is None else [params]
        cur = self.con.cursor()
        try:
            cur.execute(sql, *args)
            return cur
        except Exception as exc:
            ...

So self.con.cursor() is tied to the lifetime of the SQLiteDatabase instance. Are you sure we should be closing this here?

Author

Okay. Let me try to illustrate that with an example from the codebase.
We can say that the ADBCDatabase implementation is the same as that of SQLiteDatabase:

pandas/pandas/io/sql.py

Lines 2110 to 2128 in c0c778b

    def execute(self, sql: str | Select | TextClause, params=None):
        if not isinstance(sql, str):
            raise TypeError("Query must be a string unless using sqlalchemy.")
        args = [] if params is None else [params]
        cur = self.con.cursor()
        try:
            cur.execute(sql, *args)
            return cur
        except Exception as exc:
            try:
                self.con.rollback()
            except Exception as inner_exc:  # pragma: no cover
                ex = DatabaseError(
                    f"Execution failed on sql: {sql}\n{exc}\nunable to rollback"
                )
                raise ex from inner_exc
            ex = DatabaseError(f"Execution failed on sql '{sql}': {exc}")
            raise ex from exc

Alright, so now take this line as example:

                sql_statement = f"DROP TABLE {table_name}"
                self.execute(sql_statement)

If we leave it like that ☝️ then an exception is raised because the cursor was left open:
[image]

I'm no specialist here, but I believe that, despite the differences between drivers, the DBAPI is respected; if that is the case, the SQLite cursor should also be closed. The SQLite cursor will get closed anyway when __del__ is called, so if we're uncertain about side effects, it might be safer to maintain the status quo?

Author

@gmcrocetti gmcrocetti Jan 28, 2025

@mroeschke, @WillAyd, just a heads up regarding self.pd_sql.execute(stmt).close() vs with self.pd_sql.execute(stmt). The latter is not feasible because the SQLite Cursor object does not implement the context manager protocol.
Options are calling .close() or using closing from contextlib.
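For reference, a small sketch of the contextlib.closing option against the standard-library sqlite3 module. It uses a plain in-memory connection rather than pandas' pd_sql object, so the table and statements are illustrative assumptions only.

```python
import sqlite3
from contextlib import closing

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER)")

# sqlite3.Cursor defines neither __enter__ nor __exit__, so it cannot be
# used directly in a with statement.
has_cm = hasattr(sqlite3.Cursor, "__enter__") and hasattr(sqlite3.Cursor, "__exit__")

# closing() wraps any object that has a close() method and calls it on exit.
with closing(con.execute("INSERT INTO t VALUES (1)")) as cur:
    affected = cur.rowcount

# After the block the cursor is closed; using it again raises ProgrammingError.
try:
    cur.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True
```

This keeps the with-statement style requested in review while still calling close() on the cursor, at the cost of one extra import.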

Member

Alright, so now take this line as example:

                sql_statement = f"DROP TABLE {table_name}"
                self.execute(sql_statement)

If we leave it like that ☝️ then an exception is raised because the cursor was left open:

Thanks for this example. So previously we were explicitly opening and closing a cursor, but with the switch to self.execute we are re-using the cursor attached to the class instance and not controlling its lifecycle.

So how is that lifecycle being managed? It seems like there is a gap or inconsistency that is making this all more complicated than it should be.

Author

@gmcrocetti gmcrocetti Jan 28, 2025

Hello @WillAyd ,

I don't think we have a lifecycle management problem, but I have changed the implementation nonetheless, since making things complicated is not the goal here.
The implementation is back to its original version, with one small naming change: cur (instead of conn) now represents the cursor, since a cursor is what run_transaction returns in SQLiteDatabase's implementation.
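As an illustration of the pattern described here, the following standalone sketch shows a run_transaction-style context manager that yields a cursor. It is written as a free function over a plain sqlite3 connection and is an assumption-laden approximation, not a copy of the pandas method.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def run_transaction(con):
    """Yield a cursor; commit on success, roll back on error, always close."""
    cur = con.cursor()
    try:
        yield cur
        con.commit()
    except Exception:
        con.rollback()
        raise
    finally:
        cur.close()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER)")

# Hypothetical statements standing in for the contents of self.table.
statements = ["INSERT INTO t VALUES (1)", "INSERT INTO t VALUES (2)"]

with run_transaction(con) as cur:
    for stmt in statements:
        cur.execute(stmt)

count = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

With this shape, the cursor's lifetime is bounded by the with block, so the loop body needs no per-statement close().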

@gmcrocetti gmcrocetti requested a review from WillAyd January 27, 2025 19:42
@gmcrocetti gmcrocetti requested a review from mroeschke January 29, 2025 01:53