Join datasets have values that seem off #21

Open
MrPowers opened this issue Dec 6, 2024 · 7 comments · May be fixed by #24

@MrPowers
Collaborator

MrPowers commented Dec 6, 2024

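The id1, id2, and id4 columns in the main table and the join tables don't seem to match up: in the main table id1 and id2 are strings and id4 is an integer, while in the join tables it's the other way around.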

Here's the main table:

┌───────┬─────────┬──────────────┬─────┬───┬───────┬─────┬─────┬───────────┐
│ id1   ┆ id2     ┆ id3          ┆ id4 ┆ … ┆ id6   ┆ v1  ┆ v2  ┆ v3        │
│ ---   ┆ ---     ┆ ---          ┆ --- ┆   ┆ ---   ┆ --- ┆ --- ┆ ---       │
│ str   ┆ str     ┆ str          ┆ i64 ┆   ┆ i64   ┆ i64 ┆ i64 ┆ f64       │
╞═══════╪═════════╪══════════════╪═════╪═══╪═══════╪═════╪═════╪═══════════╡
│ id038 ┆ id85082 ┆ id0000083703 ┆ 90  ┆ … ┆ 89817 ┆ 4   ┆ 15  ┆ 28.133477 │
│ id095 ┆ id7331  ┆ id0000031245 ┆ 3   ┆ … ┆ 17720 ┆ 1   ┆ 12  ┆ 91.555302 │
│ id055 ┆ id24810 ┆ id0000014164 ┆ 12  ┆ … ┆ 13241 ┆ 1   ┆ 3   ┆ 64.543029 │
│ id046 ┆ id75326 ┆ id0000061395 ┆ 2   ┆ … ┆ 25    ┆ 1   ┆ 14  ┆ 23.049223 │
│ id052 ┆ id4569  ┆ id0000011446 ┆ 3   ┆ … ┆ 96734 ┆ 1   ┆ 7   ┆ 87.987183 │
│ …     ┆ …       ┆ …            ┆ …   ┆ … ┆ …     ┆ …   ┆ …   ┆ …         │
│ id013 ┆ id66079 ┆ id0000051775 ┆ 8   ┆ … ┆ 93287 ┆ 4   ┆ 14  ┆ 87.804319 │
│ id055 ┆ id84022 ┆ id0000019517 ┆ 28  ┆ … ┆ 68045 ┆ 4   ┆ 4   ┆ 11.484207 │
│ id006 ┆ id78451 ┆ id0000052738 ┆ 66  ┆ … ┆ 29370 ┆ 5   ┆ 9   ┆ 81.052285 │
│ id064 ┆ id23530 ┆ id0000023096 ┆ 38  ┆ … ┆ 34837 ┆ 4   ┆ 11  ┆ 99.93739  │
│ id070 ┆ id51799 ┆ id0000008809 ┆ 58  ┆ … ┆ 46152 ┆ 4   ┆ 6   ┆ 62.117956 │
└───────┴─────────┴──────────────┴─────┴───┴───────┴─────┴─────┴───────────┘

Here's J1_1e7_1e1_0.parquet:

┌─────┬─────┬───────────┐
│ id1 ┆ id4 ┆ v2        │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ str ┆ f64       │
╞═════╪═════╪═══════════╡
│ 4   ┆ id4 ┆ 60.635302 │
│ 9   ┆ id9 ┆ 61.462762 │
│ 3   ┆ id3 ┆ 11.638566 │
│ 5   ┆ id5 ┆ 32.557228 │
│ 9   ┆ id9 ┆ 13.04837  │
│ 4   ┆ id4 ┆ 45.650663 │
│ 8   ┆ id8 ┆ 35.343098 │
│ 4   ┆ id4 ┆ 17.648019 │
│ 2   ┆ id2 ┆ 98.806282 │
│ 1   ┆ id1 ┆ 13.350346 │
└─────┴─────┴───────────┘

Here's J1_1e7_1e4_0.parquet:

┌─────┬───────┬─────┬─────────┬───────────┐
│ id1 ┆ id2   ┆ id4 ┆ id5     ┆ v2        │
│ --- ┆ ---   ┆ --- ┆ ---     ┆ ---       │
│ i64 ┆ i64   ┆ str ┆ str     ┆ f64       │
╞═════╪═══════╪═════╪═════════╪═══════════╡
│ 4   ┆ 10548 ┆ id4 ┆ id10548 ┆ 58.047598 │
│ 9   ┆ 5478  ┆ id9 ┆ id5478  ┆ 28.344673 │
│ 3   ┆ 5478  ┆ id3 ┆ id5478  ┆ 43.711834 │
│ 9   ┆ 10463 ┆ id9 ┆ id10463 ┆ 13.04837  │
│ 4   ┆ 10463 ┆ id4 ┆ id10463 ┆ 4.459861  │
│ …   ┆ …     ┆ …   ┆ …       ┆ …         │
│ 4   ┆ 10978 ┆ id4 ┆ id10978 ┆ 85.268812 │
│ 8   ┆ 10548 ┆ id8 ┆ id10548 ┆ 12.755955 │
│ 4   ┆ 4344  ┆ id4 ┆ id4344  ┆ 96.08827  │
│ 6   ┆ 417   ┆ id6 ┆ id417   ┆ 13.815532 │
│ 5   ┆ 10463 ┆ id5 ┆ id10463 ┆ 13.843241 │
└─────┴───────┴─────┴─────────┴───────────┘

Here's J1_1e7_1e7_NA.parquet:

┌─────┬──────┬─────────┬─────┬────────┬────────┬───────────┐
│ id1 ┆ id2  ┆ id3     ┆ id4 ┆ id5    ┆ id6    ┆ v2        │
│ --- ┆ ---  ┆ ---     ┆ --- ┆ ---    ┆ ---    ┆ ---       │
│ i64 ┆ i64  ┆ i64     ┆ str ┆ str    ┆ str    ┆ f64       │
╞═════╪══════╪═════════╪═════╪════════╪════════╪═══════════╡
│ 4   ┆ 1607 ┆ 8624889 ┆ id4 ┆ id1607 ┆ id1607 ┆ 32.761295 │
│ 5   ┆ 3972 ┆ 83754   ┆ id5 ┆ id3972 ┆ id3972 ┆ 17.648019 │
│ 2   ┆ 49   ┆ 5152803 ┆ id2 ┆ id49   ┆ id49   ┆ 94.688198 │
│ 2   ┆ 4833 ┆ 7623547 ┆ id2 ┆ id4833 ┆ id4833 ┆ 77.909412 │
│ 5   ┆ 5733 ┆ 6155714 ┆ id5 ┆ id5733 ┆ id5733 ┆ 2.269674  │
│ …   ┆ …    ┆ …       ┆ …   ┆ …      ┆ …      ┆ …         │
│ 2   ┆ 1402 ┆ 5541869 ┆ id2 ┆ id1402 ┆ id1402 ┆ 78.53926  │
│ 9   ┆ 1849 ┆ 4288916 ┆ id9 ┆ id1849 ┆ id1849 ┆ 34.115661 │
│ 6   ┆ 7407 ┆ 323953  ┆ id6 ┆ id7407 ┆ id7407 ┆ 71.674646 │
│ 4   ┆ 9078 ┆ 431080  ┆ id4 ┆ id9078 ┆ id9078 ┆ 76.78765  │
│ 1   ┆ 2991 ┆ 4564333 ┆ id1 ┆ id2991 ┆ id2991 ┆ 19.238275 │
└─────┴──────┴─────────┴─────┴────────┴────────┴───────────┘
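
To make the mismatch concrete, here is a minimal sketch that compares the dtypes of the columns the two tables share. The dtypes are copied by hand from the printed tables above (the column elided with "…" in the main table is left out), and `dtype_mismatches` is a hypothetical helper for illustration, not part of falsa.

```python
# Dtypes copied by hand from the tables printed in this issue.
# The main-table column hidden behind "…" is deliberately omitted.
MAIN_TABLE = {
    "id1": "str", "id2": "str", "id3": "str", "id4": "i64",
    "id6": "i64", "v1": "i64", "v2": "i64", "v3": "f64",
}
J1_1E7_1E7_NA = {
    "id1": "i64", "id2": "i64", "id3": "i64", "id4": "str",
    "id5": "str", "id6": "str", "v2": "f64",
}


def dtype_mismatches(left, right):
    """Return {column: (left_dtype, right_dtype)} for shared columns whose dtypes differ."""
    return {
        col: (left[col], right[col])
        for col in left
        if col in right and left[col] != right[col]
    }


mismatches = dtype_mismatches(MAIN_TABLE, J1_1E7_1E7_NA)
for col, (left_dtype, right_dtype) in mismatches.items():
    print(f"{col}: main={left_dtype}, join={right_dtype}")
```

Every column the two tables have in common shows up with a different dtype, so a join on id1, id2, or id4 would not line up without casting.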
@MrPowers
Collaborator Author

MrPowers commented Dec 6, 2024

I just ran the original script and this is what it output:

(rscript) ~/D/c/c/d/_data ❯❯❯ Rscript join-datagen.R 1e7 0 0 0
Generate join data of 1e7 rows
Producing keys for LHS and RHS data
Producing LHS 1e7 data from keys
Writing LHS 1e7 data J1_1e7_NA_0_0
Producing RHS 1e1 data from keys
Writing RHS 1e1 data J1_1e7_1e1_0_0
Producing RHS 1e4 data from keys
Writing RHS 1e4 data J1_1e7_1e4_0_0
Producing RHS 1e7 data from keys
Writing RHS 1e7 data J1_1e7_1e7_0_0
Join datagen of 1e7 rows finished in 24s

So perhaps J1_1e7_1e7_NA.parquet is the "left" table.

@MrPowers
Collaborator Author

MrPowers commented Dec 6, 2024

So here are the files generated by the official script:

  • J1_1e7_NA_0_0
  • J1_1e7_1e1_0_0
  • J1_1e7_1e4_0_0
  • J1_1e7_1e7_0_0

Here are the files that are created when I run this command: falsa join --path-prefix=~/data --size SMALL --data-format PARQUET:

  • J1_1e7_1e1_0
  • J1_1e7_1e4_0
  • J1_1e7_1e7_NA

So I guess we're just missing one of the files.
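
A rough sketch of that comparison, using only the file names listed above. It assumes (my guess, not something either tool documents) that the two fields after `J1` identify a table and that the trailing fields can be ignored because the two tools use different suffixes; `table_key` is a hypothetical helper.

```python
# File names exactly as listed in this comment.
OFFICIAL = ["J1_1e7_NA_0_0", "J1_1e7_1e1_0_0", "J1_1e7_1e4_0_0", "J1_1e7_1e7_0_0"]
FALSA = ["J1_1e7_1e1_0", "J1_1e7_1e4_0", "J1_1e7_1e7_NA"]


def table_key(name):
    """Assumed key: the two fields after the 'J1' prefix; trailing suffixes are ignored."""
    parts = name.split("_")
    return (parts[1], parts[2])


official_keys = {table_key(name) for name in OFFICIAL}
falsa_keys = {table_key(name) for name in FALSA}

# Tables the official script writes that have no counterpart in the falsa output.
missing = official_keys - falsa_keys
print(sorted(missing))
```

Under that assumption the only key without a counterpart is the `NA` one, i.e. the LHS table.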

@zhuqi-lucas

Hi @SemyonSinchenko @MrPowers, what's the progress on this issue? Thanks!

@SemyonSinchenko
Collaborator

@zhuqi-lucas Hello! Sorry, I've been a bit busy. This one is now the top priority; you can expect the PR by the end of the week.

@zhuqi-lucas

Got it, thank you @SemyonSinchenko! Looking forward to the PR.

@SemyonSinchenko
Collaborator

Hello there. I wasn't able to complete this task during the week. I thought one of the files was just not exposed in the CLI, but the problem looks bigger. I can't remember why I did it this way, but it looks like all of the J1_* files are generated incorrectly. Even the existing files have the wrong schema. I hope to complete the task next week; sorry for the delay.

@zhuqi-lucas

Thanks a lot @SemyonSinchenko for the update!
