Enable lower-case (or rather, case-sensitive) field names in Snowflake #1085

Closed

soltanianalytics opened this issue Mar 12, 2024 · 4 comments

Labels: community (This issue came from slack community workspace), enhancement (New feature or request)

@soltanianalytics

Feature description

Request

I request a configuration option so that fields are named exactly as they are returned by the generator, i.e. case-sensitively.

While at it, the configuration should also allow using schema and table names case-sensitively.

Status Quo

Currently, field names in the Snowflake destination are upper-cased.

Example

Let's say the data generator returns a dict like this:

{
  "id": 1,
  "field": "a value",
  "I am also a field": "some other value"
}

dlt will create the table schema such that the column names are ID, FIELD, and I AM ALSO A FIELD. You will notice that these names already need to be enclosed in quotes to work, and they are quoted. Yet dlt upper-cases them anyway. This probably relates to how the metadata tables work under the hood.
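For illustration, the observed behaviour is functionally equivalent to the following sketch (my reconstruction, not dlt's actual code):

# Sketch of the observed escaping behaviour: quote, but upper-case first
def observed_escape(v: str) -> str:
    return f'"{v.upper()}"'

print(observed_escape("I am also a field"))  # -> "I AM ALSO A FIELD"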

Additional Info

Note that I used version 0.3.25; any code links use the corresponding code tag, and the code may now live elsewhere. Also note that I use the direct normalizer setting. Finally, note that I wrote the following down from my notes without further tests or checks, so it may not be fully correct and/or accurate, but it should give a rough idea of what I did and what happened.

I attempted to monkeypatch dlt to work around this, which failed, but only due to second-order effects. Basically, the following almost worked:

import dlt
from dlt.destinations.snowflake import sql_client

### Monkeypatch Snowflake identifier behaviour ###

# 1st monkeypatch: quote identifiers, but skip the upper-casing
def escape_snowflake_identifier(v: str) -> str:
    # Don't do anything other than quoting
    return f'"{v}"'

dlt.common.data_writers.escape.escape_snowflake_identifier = escape_snowflake_identifier

# 2nd monkeypatch: keep the dataset name exactly as provided
def fqdn(self, escape: bool = True) -> str:
    if escape:
        return self.capabilities.escape_identifier(self.dataset_name)
    return self.dataset_name  # No `.upper()` call!

sql_client.SnowflakeSqlClient.fully_qualified_dataset_name = fqdn

The 1st monkeypatch overrides dlt's escape_snowflake_identifier in order to avoid upper-casing identifiers.

The 2nd monkeypatch overrides SnowflakeSqlClient.fully_qualified_dataset_name to avoid forcing .upper() on the dataset name, i.e. it enables a dataset name that is exactly as I provided it.
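For completeness, a minimal sketch of how I exercised the patches; the resource, pipeline, and dataset names are illustrative placeholders, and the monkeypatches above must be applied before the pipeline is created:

import dlt

# Hypothetical resource yielding the example record from above
@dlt.resource(name="my_table")
def rows():
    yield {"id": 1, "field": "a value", "I am also a field": "some other value"}

# With the two monkeypatches already applied at this point:
pipeline = dlt.pipeline(
    pipeline_name="case_sensitive_test",
    destination="snowflake",
    dataset_name="MyDataset",  # should now survive verbatim, no .upper()
)
pipeline.run(rows())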

Together, this achieved that my schema, table, and field identifiers became case-sensitive. The very first pipeline run of a table into a new, previously non-existent schema even worked as intended. Unfortunately, the patches also applied to the dlt metadata tables (_dlt_loads etc.). Any follow-on pipeline run within that schema (whether for the same table or a completely different one) subsequently failed, presumably because the metadata queries do not properly quote identifiers in their SQL. I tried a few additional monkeypatches, some of which seemed to solve a particular problem, but I never managed to make everything work together and finally gave up.

The code below tries to monkeypatch SqlJobClientBase.get_stored_state because, while state_table and loads_table are properly quoted with the monkeypatches above, pipeline_name, created_at, status, load_id, and _dlt_load_id are not. It was also not obvious to me whether self.state_table_columns was properly quoted, so I played around with it a bit (see the commented-out lines).

import pendulum

# `StateInfo` must also be in scope for the patched method below; in 0.3.x it
# lives in dlt's internals (the exact import path may vary by version).

def gst(self, pipeline_name: str) -> "StateInfo":
    state_table = self.sql_client.make_qualified_table_name(self.schema.state_table_name)
    loads_table = self.sql_client.make_qualified_table_name(self.schema.loads_table_name)
    # columns = '"' + '", "'.join(self.state_table_columns.lower().split(", ")) + '"'
    # columns = self.state_table_columns.lower()
    columns = self.state_table_columns
    # Quote the metadata column names explicitly so the query survives
    # case-sensitive identifiers
    query = (
        f"SELECT {columns} FROM {state_table} AS s JOIN {loads_table} "
        'AS l ON l."load_id" = s."_dlt_load_id" WHERE "pipeline_name" = %s '
        'AND l."status" = 0 ORDER BY "created_at" DESC'
    )
    print("\n\n\n\n", query, "\n\n\n\n")  # remove me
    with self.sql_client.execute_query(query, pipeline_name) as cur:
        row = cur.fetchone()
    if not row:
        return None
    return StateInfo(row[0], row[1], row[2], row[3], pendulum.instance(row[4]))

dlt.destinations.job_client_impl.SqlJobClientBase.get_stored_state = gst

I think the above worked well enough to surface further errors relating to the metadata fields, which I tried to solve as follows, but that did not work either, and I finally gave up:

dlt.destinations.job_client_impl.SqlJobClientBase._VERSION_TABLE_SCHEMA_COLUMNS = (
    "VERSION_HASH",
    "SCHEMA_NAME",
    "VERSION",
    "ENGINE_VERSION",
    "INSERTED_AT",
    "SCHEMA",
)

dlt.destinations.job_client_impl.SqlJobClientBase._STATE_TABLE_COLUMNS = (
    "VERSION",
    "ENGINE_VERSION",
    "PIPELINE_NAME",
    "STATE",
    "CREATED_AT",
    "_DLT_LOAD_ID",
)

from dlt.pipeline import state_sync

state_sync.STATE_TABLE_COLUMNS = {
    k.upper(): {k2: v2.upper() if k2 == "name" else v2 for k2, v2 in v.items()}
    for k, v in state_sync.STATE_TABLE_COLUMNS.items()
}
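For illustration, assuming the column entries have dlt's usual shape with a "name" key (an assumption based on my notes), the comprehension rewrites them like this:

# Sketch with an assumed entry shape; real STATE_TABLE_COLUMNS may differ
before = {"pipeline_name": {"name": "pipeline_name", "data_type": "text"}}
after = {
    k.upper(): {k2: v2.upper() if k2 == "name" else v2 for k2, v2 in v.items()}
    for k, v in before.items()
}
assert after == {"PIPELINE_NAME": {"name": "PIPELINE_NAME", "data_type": "text"}}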

Are you a dlt user?

Yes, I run dlt in production.

Use case

Push data from source to destination without any opinion on what the data should look like, either at the source or the destination. This feels extremely important (at least as an optional configuration) if dlt wants to be adopted in the real world. I know it sounds crazy, but real data does include sources where a single row has two fields, one called Id and one called ID. This is not a joke; it is a real-life example. Currently, dlt fails because it wants to create two fields, both called ID.
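A minimal sketch of that collision (values are illustrative):

# Two fields differing only in case; upper-casing folds both onto ID
row = {"Id": 1, "ID": 2}
print({k.upper() for k in row})  # {'ID'}: one name left for two fields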

Proposed solution

Enable case-sensitive field names for all destinations, but at the very least for Snowflake, such that field names produced by the direct normalizer are not changed at all, not even upper-cased.

Related issues

No response

@rudolfix
Collaborator

@soltanianalytics thank you for investigating this and giving us both a proposed fix and a use case to test. We are working on this problem quite intensively, and there's a PR (#998) that finally applies the naming convention to all identifiers (also internal ones + some regexed in the schema :)) and allows controlling case sensitivity at the destination.

In particular, you'll be able to configure or create the destination with the desired naming convention and casefolding, e.g.:

from dlt.destinations import snowflake

snow = snowflake(naming="direct", casefold="str")
pipeline = dlt.pipeline(destination=snow)

to get a case-sensitive Snowflake destination enforcing the "direct" naming convention on all dlt components (the exact interface will probably be slightly different).

I'll ping you when it is ready so you can give us feedback if the PR solves your problem.

@rudolfix rudolfix added enhancement New feature or request community This issue came from slack community workspace labels Mar 15, 2024
@rudolfix rudolfix moved this from Todo to In Progress in dlt core library Mar 15, 2024
@rudolfix rudolfix self-assigned this Mar 18, 2024
@soltanianalytics
Author

That would be great indeed! Looking forward to this functionality.

@rudolfix
Collaborator

will be available with 0.5.x release!

@github-project-automation github-project-automation bot moved this from In Progress to Done in dlt core library Jun 27, 2024
@soltanianalytics
Author

awesome!
