Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Materialize: allow to work without destination PK #10160

Open
derekperkins opened this issue Apr 27, 2022 · 6 comments
Open

Materialize: allow to work without destination PK #10160

derekperkins opened this issue Apr 27, 2022 · 6 comments
Labels
Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)

Comments

@derekperkins
Copy link
Member

I'm materializing and summing from our raw usage table to a daily aggregate. I can't use a sequence to get a globally unique id, but a bounded unique id is sufficient for my use case, so I'm using auto-increment. I'm grouping on the source table columns to get a unique value, which works, but errors out if the PK on the target table is the autoinc value.

Materialize Config

{
  "workflow": "billing__usage",
  "source_keyspace": "workspaces",
  "target_keyspace": "workspaces",
  "table_settings": [
    {
      "target_table": "billing__usage",
      "source_expression": "select workspace_id as account_id, date(from_unixtime(requested)) as requested_date, date(created_at) as created_date, 1 as type, count(*) as usage from workspaces_rankings__pulls group by account_id, requested_date, created_date, type"
    }
  ],
  "tablet_types": "REPLICA"
}

Error Message

primary key column \u0026{usage_id bigint bigint true false} not found in select list

Destination Table Schema

CREATE TABLE `billing__usage` (
  `usage_id` bigint NOT NULL AUTO_INCREMENT,
  `account_id` bigint NOT NULL,
  `created_date` date NOT NULL,
  `requested_date` date NOT NULL,
  `type` smallint NOT NULL,
  `usage` bigint NOT NULL,
  `reported_to_chargebee_at` bigint DEFAULT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`account_id`,`requested_date`,`created_date`,`type`), # errors if I switch these two keys
  UNIQUE KEY `usage_id` (`usage_id`),
  CONSTRAINT `fk__workspaces__billing_usage__workspaces__billing` FOREIGN KEY (`account_id`) REFERENCES `billing__accounts` (`account_id`)
) ENGINE=InnoDB AUTO_INCREMENT=335837959 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED

Not urgent at all, and maybe there's a technical reason why the PK has to exist at the source table.

Tested on v13

@derekperkins derekperkins added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication Needs Triage This issue needs to be correctly labelled and triaged labels Apr 27, 2022
@harshit-gangal
Copy link
Member

@vitessio/vreplication

@harshit-gangal harshit-gangal removed the Needs Triage This issue needs to be correctly labelled and triaged label Apr 28, 2022
@mattlord
Copy link
Contributor

mattlord commented May 3, 2022

I'm working on a related issue here: #10192

I wonder if I should fold this in...

This specific failure occurs here:

// analyzePK builds tpb.pkCols.
// Input cols must include all columns which participate in the PRIMARY KEY or the chosen UniqueKey.
// It's OK to also include columns not in the key.
// Input cols should be ordered according to key ordinal.
// e.g. if "UNIQUE KEY(c5,c2)" then we expect c5 to come before c2
func (tpb *tablePlanBuilder) analyzePK(cols []*ColumnInfo) error {
for _, col := range cols {
if !col.IsPK {
continue
}
if col.IsGenerated {
// It's possible that a GENERATED column is part of the PRIMARY KEY. That's valid.
// But then, we also know that we don't actually SELECT a GENERATED column, we just skip
// it silently and let it re-materialize by MySQL itself on the target.
continue
}
cexpr := tpb.findCol(sqlparser.NewColIdent(col.Name))
if cexpr == nil {
// TODO(shlomi): at some point in the futue we want to make this check stricter.
// We could be reading a generated column c1 which in turn selects some other column c2.
// We will want t oensure that `c2` is found in select list...
return fmt.Errorf("primary key column %v not found in select list", col)
}
if cexpr.operation != opExpr {
return fmt.Errorf("primary key column %v is not allowed to reference an aggregate expression", col)
}
cexpr.isPK = true
cexpr.dataType = col.DataType
cexpr.columnType = col.ColumnType
tpb.pkCols = append(tpb.pkCols, cexpr)
}
return nil
}

@mattlord
Copy link
Contributor

mattlord commented May 3, 2022

As I look into it more, now I'm wondering if this is not also fixed by #10192 as it looks like we use MoveTables for materializations.

@derekperkins do you happen to have a simple test case, e.g. using the example commerce and customer keyspaces? I could then do some manual testing in my PR branch and I can go from there.

@mattlord mattlord self-assigned this May 3, 2022
@mattlord mattlord changed the title vreplication: allow to work without destination PK Materialize: allow to work without destination PK May 4, 2022
@mattlord
Copy link
Contributor

mattlord commented May 4, 2022

After looking into this more it's orthogonal to #10192 so I won't be looping this into that work.

@mattlord mattlord removed their assignment May 4, 2022
@derekperkins
Copy link
Member Author

I think this config should trigger the error

Tables

create table customer (
  customer_id bigint not null auto_increment,
  email varbinary(128),
  primary key(customer_id)
);

create table customer_with_new_ids (
  customer_id bigint not null auto_increment,
  email varbinary(128),
  primary key(customer_id)
);

Materialize Config

{
  "workflow": "customer_with_new_ids",
  "source_keyspace": "commerce",
  "target_keyspace": "commerce",
  "table_settings": [
    {
      "target_table": "customer_with_new_ids",
      "source_expression": "select email from customer"
    }
  ],
  "tablet_types": "REPLICA"
}

@mattlord
Copy link
Contributor

mattlord commented Aug 31, 2023

Noting that this is still an issue on main/18.0:

git checkout main
make build
pushd examples/local

./101_initial_cluster.sh

mysql commerce -e '
drop table customer;

create table customer (
  customer_id bigint not null auto_increment,
  email varbinary(128),
  primary key(customer_id)
);

create table customer_with_new_ids (
  customer_id bigint not null auto_increment,
  email varbinary(128),
  primary key(customer_id)
);
'

vtctlclient Materialize '{
  "workflow": "customer_with_new_ids",
  "source_keyspace": "commerce",
  "target_keyspace": "commerce",
  "table_settings": [
    {
      "target_table": "customer_with_new_ids",
      "source_expression": "select email from customer"
    }
  ],
  "tablet_types": "REPLICA"
}'

vtctlclient workflow -- commerce.customer_with_new_ids show

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

No branches or pull requests

3 participants