Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand allow.cartesian error hints for non-equi joins #2086

Closed
franknarf1 opened this issue Mar 28, 2017 · 7 comments
Closed

Expand allow.cartesian error hints for non-equi joins #2086

franknarf1 opened this issue Mar 28, 2017 · 7 comments
Labels
enhancement non-equi joins rolling, overlapping, non-equi joins
Milestone

Comments

@franknarf1
Copy link
Contributor

Taken from a good question on SO...

library(data.table)
DT = setDT(structure(list(City = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L), .Label = c("LA", "NY"), class = "factor"), YearFrom = c(2001L, 
2003L, 2002L, 2006L, 2008L, 2004L, 2005L, 2005L, 2002L), YearTo = c(NA, 
2005L, NA, NA, 2009L, NA, 2008L, NA, NA), quasiYearTo = c(2017L, 
2005L, 2017L, 2017L, 2009L, 2017L, 2008L, 2017L, 2017L)), .Names = c("City", 
"YearFrom", "YearTo", "quasiYearTo"), row.names = c(NA, -9L), class = "data.frame"))


   City YearFrom YearTo quasiYearTo
1:   NY     2001     NA        2017
2:   NY     2003   2005        2005
3:   NY     2002     NA        2017
4:   NY     2006     NA        2017
5:   NY     2008   2009        2009
6:   LA     2004     NA        2017
7:   LA     2005   2008        2008
8:   LA     2005     NA        2017
9:   LA     2002     NA        2017

The goal is to count, for every city and year in the data, how many firms are active (have YearFrom <= Year < quasiYearTo). This code works:

df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE), 
  on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE, 
  .N
, by=.EACHI][, .(City, Year = YearFrom, N)]


    City Year N
 1:   LA 2001 0
 2:   LA 2002 1
 3:   LA 2003 1
 4:   LA 2004 2
 5:   LA 2005 4
 6:   LA 2006 4
 7:   LA 2007 4
 8:   LA 2008 3
 9:   LA 2009 3
10:   NY 2001 1
11:   NY 2002 2
12:   NY 2003 3
13:   NY 2004 3
14:   NY 2005 2
15:   NY 2006 3
16:   NY 2007 3
17:   NY 2008 4
18:   NY 2009 3

However, if I drop allow.cartesian, I get the usual error message:

DT[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE), 
  on=.(City, YearFrom <= Year, quasiYearTo > Year), .N, by=.EACHI]

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 47 rows; more than 27 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

So my i has no duplicates:

DT[, CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE)][, 
  anyDuplicated(.SD)] # 0

...and I am using by=.EACHI already. It would be nice if the warning message said gave a more relevant hint here (though I don't know what that might be, since I'm not actually clear on why I need allow.cartesian on here). Anyway, a minor suggestion.

An alternative would be to turn off the cartesian check (or downgrade it from error to verbose message) if it's a non-equi join and by=.EACHI, since that case seems pretty safe.

@arunsrinivasan
Copy link
Member

Fetching indices for non-equi joins is slightly more complicated and requires (at times) a reordering, which requires populating them entirely first. Even though it is rare, there might be cases where this triggers the allow.cartesian threshold. It could be sidestepped by checking for by=.EACHI.

@arunsrinivasan arunsrinivasan added enhancement non-equi joins rolling, overlapping, non-equi joins labels Mar 28, 2017
@fedyakov
Copy link

I'm trying to non-equi join two really big tables to count sizes of groups. Cartesian join exceeds the 2^31 limit. So the only possibility for me is to group by .EACHI. Unfortunately, it fails with the mentioned error.

It's a critical bug for me. Please, consider fixing it.

Here is a tiny example to reproduce the error:

dt1 = data.table(a1 = 4:1, b1 = 1:2)
dt2 = data.table(a2 = 5:6, b2 = 3)
dt1[dt2, .N, on = .(a1 < a2, b1 < b2), by = .EACHI]

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 8 rows; more than 6 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

@DavidArenburg
Copy link
Member

DavidArenburg commented Mar 6, 2019

@fedyakov as temporary fix just add , allow.cartesian = TRUE as in

dt1[dt2, .N, on = .(a1 < a2, b1 < b2), by = .EACHI, allow.cartesian = TRUE]
#    a1 b1 N
# 1:  5  3 4
# 2:  6  3 4

Unless you've tried that and it fails?

@fedyakov
Copy link

fedyakov commented Mar 7, 2019

Unless you've tried that and it fails?

I don't have the reproducible example now, but as far as I remember it did not work. Cartesian join exceeds the 2^31 limit.

@ColeMiller1
Copy link
Contributor

Closed by #4493. Related to #4489. 1.12.9 will make allow.cartesian = TRUE for non-equi joins which is even better than a plain old hint :)

@fedyakov, there's also #3009 for joins that exceed 2^31 but are aggregated for .N. Then separately, there is #3957 which is explicitly about data.table allowing for more than 2^31 rows.

@ColeMiller1 ColeMiller1 added this to the 1.12.9 milestone Jul 15, 2020
@fedyakov
Copy link

@ColeMiller1 My case is about .EACHI aggregation primarily. It doesn't require 64-bit vectors at all since aggregation could be done alongside joining in a single stage and the result fits in а 32-bit vector. Could you consider a separate issue to fix this special case?

@ColeMiller1
Copy link
Contributor

@fedyakov you are more than welcome to open an issue. However, issue #3009 is the exact use case you referred to earlier in this thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement non-equi joins rolling, overlapping, non-equi joins
Projects
None yet
Development

No branches or pull requests

5 participants