Enable lower-case (or rather, case-sensitive) field names in Snowflake #1085

Closed

soltanianalytics opened this issue Mar 12, 2024 · 4 comments

Labels: community (This issue came from slack community workspace), enhancement (New feature or request)

@soltanianalytics

Feature description

Request

I request a configuration option so that fields are named exactly as they are returned by the generator, i.e. case-sensitively.

While at it, the configuration should also allow using schema and table names case-sensitively.

Status Quo

Currently, field names in the Snowflake destination are upper-cased.

Example

Let's say the data generator returns a dict like this:

{
  "id": 1,
  "field": "a value",
  "I am also a field": "some other value"
}

dlt will create the table schema such that the column names are ID, FIELD, and I AM ALSO A FIELD. You will notice that these names already need to be enclosed in quotes to work, and they are quoted. Yet dlt upper-cases them anyway. This probably relates to how the metadata tables work under the hood.
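For illustration, the observed behaviour is functionally equivalent to the following sketch (my reconstruction, not dlt's actual code):

# Sketch of the observed escaping behaviour: quote, but upper-case first
def observed_escape(v: str) -> str:
    return f'"{v.upper()}"'

print(observed_escape("I am also a field"))  # -> "I AM ALSO A FIELD"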

Additional Info

Note that I used version 0.3.25; any code links use the corresponding code tag, and the code may now live elsewhere. Also note that I use the direct normalizer setting. Finally, note that I wrote the following down from my notes without further tests or checks, so it may not be fully correct and/or accurate, but it should give a rough idea of what I did and what happened.

I attempted to monkeypatch dlt to work around this, which failed, but only due to second-order effects. Basically, the following almost worked:

import dlt
from dlt.destinations.snowflake import sql_client

### Monkeypatch Snowflake identifier behaviour ###

# 1st monkeypatch: quote identifiers, but skip the upper-casing
def escape_snowflake_identifier(v: str) -> str:
    # Don't do anything other than quoting
    return f'"{v}"'

dlt.common.data_writers.escape.escape_snowflake_identifier = escape_snowflake_identifier

# 2nd monkeypatch: keep the dataset name exactly as provided
def fqdn(self, escape: bool = True) -> str:
    if escape:
        return self.capabilities.escape_identifier(self.dataset_name)
    return self.dataset_name  # No `.upper()` call!

sql_client.SnowflakeSqlClient.fully_qualified_dataset_name = fqdn

The 1st monkeypatch overrides dlt's escape_snowflake_identifier in order to avoid upper-casing identifiers.

The 2nd monkeypatch overrides SnowflakeSqlClient.fully_qualified_dataset_name to avoid forcing .upper() on the dataset name, i.e. it enables a dataset name that is exactly as I provided it.
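For completeness, a minimal sketch of how I exercised the patches; the resource, pipeline, and dataset names are illustrative placeholders, and the monkeypatches above must be applied before the pipeline is created:

import dlt

# Hypothetical resource yielding the example record from above
@dlt.resource(name="my_table")
def rows():
    yield {"id": 1, "field": "a value", "I am also a field": "some other value"}

# With the two monkeypatches already applied at this point:
pipeline = dlt.pipeline(
    pipeline_name="case_sensitive_test",
    destination="snowflake",
    dataset_name="MyDataset",  # should now survive verbatim, no .upper()
)
pipeline.run(rows())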

Together, this achieved that my schema, table, and field identifiers became case-sensitive. The very first pipeline run of a table into a new, previously non-existent schema even worked as intended. Unfortunately, the patches also applied to the dlt metadata tables (_dlt_loads etc.). Any follow-on pipeline run within that schema (whether for the same table or a completely different one) subsequently failed, presumably because the metadata queries do not properly quote identifiers in their SQL. I tried a few additional monkeypatches, some of which seemed to solve a particular problem, but I never managed to make everything work together and finally gave up.

The code below tries to monkeypatch SqlJobClientBase.get_stored_state because, while state_table and loads_table are properly quoted with the monkeypatches above, pipeline_name, created_at, status, load_id, and _dlt_load_id are not. It was also not obvious to me whether self.state_table_columns was properly quoted, so I played around with it a bit (see the commented-out lines).

import pendulum

# `StateInfo` must also be in scope for the patched method below; in 0.3.x it
# lives in dlt's internals (the exact import path may vary by version).

def gst(self, pipeline_name: str) -> "StateInfo":
    state_table = self.sql_client.make_qualified_table_name(self.schema.state_table_name)
    loads_table = self.sql_client.make_qualified_table_name(self.schema.loads_table_name)
    # columns = '"' + '", "'.join(self.state_table_columns.lower().split(", ")) + '"'
    # columns = self.state_table_columns.lower()
    columns = self.state_table_columns
    # Quote the metadata column names explicitly so the query survives
    # case-sensitive identifiers
    query = (
        f"SELECT {columns} FROM {state_table} AS s JOIN {loads_table} "
        'AS l ON l."load_id" = s."_dlt_load_id" WHERE "pipeline_name" = %s '
        'AND l."status" = 0 ORDER BY "created_at" DESC'
    )
    print("\n\n\n\n", query, "\n\n\n\n")  # remove me
    with self.sql_client.execute_query(query, pipeline_name) as cur:
        row = cur.fetchone()
    if not row:
        return None
    return StateInfo(row[0], row[1], row[2], row[3], pendulum.instance(row[4]))

dlt.destinations.job_client_impl.SqlJobClientBase.get_stored_state = gst

I think the above worked well enough to surface further errors relating to the metadata fields, which I tried to solve as follows, but that did not work either, and I finally gave up:

dlt.destinations.job_client_impl.SqlJobClientBase._VERSION_TABLE_SCHEMA_COLUMNS = (
    "VERSION_HASH",
    "SCHEMA_NAME",
    "VERSION",
    "ENGINE_VERSION",
    "INSERTED_AT",
    "SCHEMA",
)

dlt.destinations.job_client_impl.SqlJobClientBase._STATE_TABLE_COLUMNS = (
    "VERSION",
    "ENGINE_VERSION",
    "PIPELINE_NAME",
    "STATE",
    "CREATED_AT",
    "_DLT_LOAD_ID",
)

from dlt.pipeline import state_sync

state_sync.STATE_TABLE_COLUMNS = {
    k.upper(): {k2: v2.upper() if k2 == "name" else v2 for k2, v2 in v.items()}
    for k, v in state_sync.STATE_TABLE_COLUMNS.items()
}
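For illustration, assuming the column entries have dlt's usual shape with a "name" key (an assumption based on my notes), the comprehension rewrites them like this:

# Sketch with an assumed entry shape; real STATE_TABLE_COLUMNS may differ
before = {"pipeline_name": {"name": "pipeline_name", "data_type": "text"}}
after = {
    k.upper(): {k2: v2.upper() if k2 == "name" else v2 for k2, v2 in v.items()}
    for k, v in before.items()
}
assert after == {"PIPELINE_NAME": {"name": "PIPELINE_NAME", "data_type": "text"}}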

Are you a dlt user?

Yes, I run dlt in production.

Use case

Push data from source to destination without any opinion on what the data should look like, either at the source or the destination. This feels extremely important (at least as an optional configuration) if dlt wants to be adopted in the real world. I know it sounds crazy, but real data does include sources where a single row has two fields, one called Id and one called ID. This is not a joke; it is a real-life example. Currently, dlt fails because it wants to create two fields, both called ID.
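A minimal sketch of that collision (values are illustrative):

# Two fields differing only in case; upper-casing folds both onto ID
row = {"Id": 1, "ID": 2}
print({k.upper() for k in row})  # {'ID'}: one name left for two fields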

Proposed solution

Enable case-sensitive field names for all destinations, but at the very least for Snowflake, such that field names produced by the direct normalizer are not changed at all, not even upper-cased.

Related issues

No response

@rudolfix
Collaborator

@soltanianalytics thank you for investigating this and giving us both a proposed fix and a use case to test. We are working on this problem quite intensively, and there's a PR (#998) that finally applies the naming convention to all identifiers (also internal ones + some regexed in the schema :)) and allows controlling case sensitivity at the destination.

In particular, you'll be able to configure or create the destination with the desired naming convention and casefolding, e.g.:

from dlt.destinations import snowflake

snow = snowflake(naming="direct", casefold="str")
pipeline = dlt.pipeline(destination=snow)

to get a case-sensitive Snowflake destination enforcing the "direct" naming convention on all dlt components (the exact interface will probably be slightly different).

I'll ping you when it is ready so you can give us feedback if the PR solves your problem.

@rudolfix rudolfix added enhancement New feature or request community This issue came from slack community workspace labels Mar 15, 2024
@rudolfix rudolfix moved this from Todo to In Progress in dlt core library Mar 15, 2024
@rudolfix rudolfix self-assigned this Mar 18, 2024
@soltanianalytics
Author

That would be great indeed! Looking forward to this functionality.

@rudolfix
Collaborator

will be available with 0.5.x release!

@github-project-automation github-project-automation bot moved this from In Progress to Done in dlt core library Jun 27, 2024
@soltanianalytics
Author

awesome!
