Feature description
Request
I request a configuration option so that fields are named exactly as they are returned by the generator, i.e. case-sensitively.
While at it, the configuration should also make schema and table names case-sensitive.
Status Quo
Currently, field names in the Snowflake destination are upper-cased.
Example
Let's say the data generator returns a dict like this:
```json
{
    "id": 1,
    "field": "a value",
    "I am also a field": "some other value"
}
```
dlt will create the schema of the table such that the column names will be `ID`, `FIELD`, `I AM ALSO A FIELD`. You will notice that these fields already need to be enclosed in quotes to work, and they are. Yet dlt upper-cases them anyway. This probably relates to how the metadata tables work under the hood.
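For illustration, here is a minimal sketch of that behaviour (a simplified stand-in, not dlt's actual implementation): the escaping both quotes and upper-cases the identifier, as the monkeypatch further down suggests.

```python
# Simplified stand-in for dlt's Snowflake identifier escaping (illustrative
# only, not the actual implementation): quote AND fold to upper case.
def escape_and_fold(identifier: str) -> str:
    return f'"{identifier.upper()}"'

for name in ("id", "field", "I am also a field"):
    print(escape_and_fold(name))
# "ID"
# "FIELD"
# "I AM ALSO A FIELD"
```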
Additional Info
Note that I used version `0.3.25`; any code links will use the related code tag, and the code may now live elsewhere. Also note that I use the `direct` normalizer setting. Finally, note that I wrote down the following based on my notes without any further tests or checks, so it may not be fully correct and/or accurate, but it should give you a rough idea of what I did and what happened.
I attempted to monkeypatch dlt to work around it, which failed, but only due to 2nd-order effects. Basically, the following almost worked:
```python
import dlt

### Monkeypatch Snowflake Identifier Behaviour ###

# 1st monkeypatch
def escape_snowflake_identifier(v: str) -> str:
    # Don't do anything other than quoting
    return f'"{v}"'

dlt.common.data_writers.escape.escape_snowflake_identifier = escape_snowflake_identifier

# 2nd monkeypatch
from dlt.destinations.snowflake import sql_client

def fqdn(self, escape: bool = True) -> str:
    if escape:
        return self.capabilities.escape_identifier(self.dataset_name)
    return self.dataset_name  # No `.upper()` call!

sql_client.SnowflakeSqlClient.fully_qualified_dataset_name = fqdn
```
The 1st monkeypatch overwrites this function to avoid upper-casing identifiers.
The 2nd monkeypatch overwrites this class method to avoid forcing `.upper()`-casing of the dataset name, i.e. it enables a dataset name that is exactly as I provided it.
Together, these made my schema, table, and field identifiers case-sensitive. The very first pipeline run of a table into a new, previously non-existent schema even worked as intended. Unfortunately, this also applied to the dlt metadata tables (`_dlt_loads` etc.). Any follow-on pipeline run within that schema (whether for the same table or a completely different one) subsequently failed, presumably because the metadata queries do not properly quote identifiers in their SQL. I tried a few additional monkeypatches, some of which seemed to solve a particular problem, but I never managed to make everything work together and finally gave up.
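To picture the failure mode, here is a tiny simulation of Snowflake's standard identifier resolution (a sketch of the general rule, not dlt code): unquoted identifiers fold to upper case, quoted ones match exactly, so a table created case-sensitively is invisible to unquoted queries.

```python
# Tiny simulation of Snowflake identifier resolution: unquoted identifiers
# fold to upper case, quoted identifiers match exactly.
def resolve(identifier: str) -> str:
    if identifier.startswith('"') and identifier.endswith('"'):
        return identifier[1:-1]  # quoted: case preserved
    return identifier.upper()    # unquoted: folded to upper case

# The patched pipeline creates "_dlt_loads" case-sensitively, but an internal
# query that omits the quotes resolves to a different object:
print(resolve('"_dlt_loads"'))  # _dlt_loads
print(resolve("_dlt_loads"))    # _DLT_LOADS -> "object does not exist"
```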
The code below tries to monkeypatch this method because, while `state_table` and `loads_table` are properly quoted with the monkeypatches above, `pipeline_name`, `created_at`, `status`, `load_id`, and `_dlt_load_id` are not. Also, it was not obvious to me whether `self.state_table_columns` was properly quoted, so I played around with it a bit (see the commented-out lines).
```python
import pendulum  # assumed import; dlt uses pendulum for timestamps
# `StateInfo` is assumed to be in scope (its import path varies by dlt version)

def gst(self, pipeline_name: str) -> "StateInfo":
    state_table = self.sql_client.make_qualified_table_name(self.schema.state_table_name)
    loads_table = self.sql_client.make_qualified_table_name(self.schema.loads_table_name)
    # columns = '"' + '", "'.join(self.state_table_columns.lower().split(", ")) + '"'
    # columns = self.state_table_columns.lower()
    columns = self.state_table_columns
    query = (
        f"SELECT {columns} FROM {state_table} AS s JOIN {loads_table} "
        'AS l ON l."load_id" = s."_dlt_load_id" WHERE "pipeline_name" = %s '
        'AND l."status" = 0 ORDER BY "created_at" DESC'
    )
    print("\n\n\n\n", query, "\n\n\n\n")  # remove me
    with self.sql_client.execute_query(query, pipeline_name) as cur:
        row = cur.fetchone()
        if not row:
            return None
        return StateInfo(row[0], row[1], row[2], row[3], pendulum.instance(row[4]))

dlt.destinations.job_client_impl.SqlJobClientBase.get_stored_state = gst
```
I think the above worked well enough to surface further errors relating to the metadata fields, which I then tried to solve with yet more patches, but that did not work and I finally gave up.
Are you a dlt user?
Yes, I run dlt in production.
Use case
Push data from source to destination without any opinion on what the data should look like, either at source or destination. This feels extremely important (at least as an optional configuration) if dlt wants to be adopted in the real world. I know it sounds crazy, but real data sources do exist where a single row of data has two fields, one called `Id` and one called `ID`. This is not a joke, it is a real-life example. Currently, dlt fails because it wants to create two fields, both called `ID`.
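Here is a minimal sketch of that collision (illustrative Python, not dlt code): under default upper-casing, both source keys map to the same column name.

```python
# Minimal sketch of the Id/ID collision under default upper-casing
# (illustrative only, not dlt code).
row = {"Id": 1, "ID": 2}

folded: dict[str, list[str]] = {}
for key in row:
    folded.setdefault(key.upper(), []).append(key)

clashes = {col: keys for col, keys in folded.items() if len(keys) > 1}
print(clashes)  # {'ID': ['Id', 'ID']} -> dlt tries to create the column twice
```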
Proposed solution
Enable case-sensitive field names for all destinations, but at the very least for Snowflake, such that `direct`-normalized field names are not changed at all, not even upper-cased.
Related issues
No response
@soltanianalytics thank you for investigating this and giving us both a proposed fix and a use case to test against. We are working on this problem quite intensively, and there's a PR (#998) that finally applies the naming convention to all identifiers (also internal ones + some regexed in the schema :)) and allows controlling case sensitivity at the destination.
In particular, you'll be able to configure or create the destination with the desired naming convention and case folding, e.g. to get a case-sensitive Snowflake destination enforcing the "direct" naming convention on all dlt components (the exact interface will probably be slightly different).
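A purely hypothetical sketch of what that could look like (the parameter names below are illustrative assumptions, not the final API from #998):

```python
# Hypothetical interface sketch -- `naming_convention` is an assumed parameter
# name; the final API from PR #998 may differ.
import dlt
from dlt.destinations import snowflake

pipeline = dlt.pipeline(
    pipeline_name="case_sensitive_demo",
    destination=snowflake(naming_convention="direct"),  # assumed parameter
    dataset_name="MyDataset",  # would be preserved exactly as written
)
```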
I'll ping you when it is ready so you can give us feedback on whether the PR solves your problem.