
Support ARRAY data type #178

Closed
jtaylor-sfdc opened this issue Apr 27, 2013 · 15 comments

@jtaylor-sfdc
Contributor

Benchmark outline from Sudarshan Kadambi describing a good use case for an ARRAY data type. Supporting an ARRAY data type, an UNNEST built-in function, plus derived tables would allow the following standard query to be used (as opposed to creating custom aggregate functions):

select avg(v)
from (select unnest(value) v from t
where object_id in (O1,O2,...O250K) and field_type = 'F1' and attrib_id = 'A1')

It'd be nice if you could just do an average over an array directly, but this would be non-standard.

On 04/26/2013 11:17 AM, Sudarshan Kadambi (BLOOMBERG) wrote:
Hi James:

Yes, I saw the email. Thank you for this generous offer. I wanted some time to make sure the benchmark correctly represents my use case.

If you wish, here's a benchmark setup you could use:

  1. 1 Billion keys in table
  2. Query contains 250K randomly chosen object_ids and a single field_type (F1)
  3. Aggregation done on a single attribute (A1)
  4. Query:
    select avg(value) from t where object_id in (O1,O2,...O250K) and field_type = 'F1' and attrib_id = 'A1'

We would want the test done with and without the skip scan filter for the purpose of comparison.

The reason why I wanted some time to think about it is that the values within each attribute are a JSON number array. So an avg across 2 values is an average of the averages.

For example: Object_id=O1, Field_type=F1, Attrib_id=A1, Value 1: {1,2,3,4,5}
Object_id=O2, Field_type=F1, Attrib_id=A1, Value 2: {1,2,3,4,5}

The query should produce: Avg{Avg{Value1}, Avg{Value2}} = Avg{3,3} = 3
If we were doing a sum, the query would produce: Sum{Avg{Value1}, Avg{Value2}} = Sum{3,3} = 6.

This might require customization of the aggregate function code.
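Note that the UNNEST query sketched at the top of this issue computes a flat average over all unnested elements, which weights each row by its array length; the semantics described here call for an average of per-row averages instead. A nested derived-table form (a sketch only, assuming UNNEST and derived-table support) would match:

  select avg(row_avg)
  from (select object_id, avg(v) as row_avg
        from (select object_id, unnest(value) v from t
              where object_id in (O1,O2,...O250K)
                and field_type = 'F1' and attrib_id = 'A1')
        group by object_id)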

@ghost ghost assigned JingchengDu Jun 7, 2013
@jtaylor-sfdc
Contributor Author

For this, you'd want to look at implementing java.sql.Array. I think we'd just have all array values in a single KeyValue column and not yet allow an array in the primary key. As a model for how to surface the ARRAY access in SQL, I'd use what Postgres does as a guide.

We don't need all of that implemented yet, just the bare essentials:

  • support for an array declaration in CREATE TABLE
  • support for inserting an array value using the ARRAY[1,2,3] syntax
  • support for accessing an array element in SELECT
  • a few built-ins like ARRAY_LENGTH, ARRAY_UPPER, and ARRAY_LOWER
  • a built-in for UNNEST so that you can compute aggregates over an array
  • nice to have would be support for ANY and ALL
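
Taken together, the bare essentials above might look like this in use. A sketch only: the syntax follows the Postgres model discussed here, and none of these built-ins existed in Phoenix at the time of this comment:

  CREATE TABLE t (k VARCHAR PRIMARY KEY, a INTEGER ARRAY);
  UPSERT INTO t VALUES ('key1', ARRAY[1, 2, 3]);
  SELECT a[1], ARRAY_LENGTH(a) FROM t;             -- element access and length
  SELECT AVG(v) FROM (SELECT UNNEST(a) v FROM t);  -- aggregate over array elements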

@ghost ghost assigned anoopsjohn Jul 30, 2013
@ghost ghost assigned ramkrish86 Aug 16, 2013
@jtaylor-sfdc
Contributor Author

One thing that'll be a bit tricky is knowing/maintaining the element type of the ARRAY. The easiest thing I can think of is to create a new PDataType for each element type (essentially doubling the number of types). We likely want to switch our current PDataType sqlType to be a typeId. The typeId would basically combine the ARRAY sqlType (which is above 1000) with the element type, while getSqlType() would just return Types.ARRAY.
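
As a rough sketch of that encoding (the class name, base offset, and use of addition rather than a bitwise OR are all illustrative assumptions, not actual Phoenix code; addition sidesteps the fact that some java.sql.Types codes are negative):

  import java.sql.Types;

  public class ArrayTypeId {
      // Assumed base offset, chosen well above the java.sql.Types codes in use.
      static final int ARRAY_TYPE_BASE = 3000;

      // Combine the array base with the element's sqlType to get the typeId,
      // e.g. an INTEGER array would be 3000 + Types.INTEGER.
      static int of(int elementSqlType) {
          return ARRAY_TYPE_BASE + elementSqlType;
      }

      // The margin below the base tolerates negative element codes
      // such as Types.BIGINT == -5.
      static boolean isArray(int typeId) {
          return typeId >= ARRAY_TYPE_BASE - 100;
      }

      // getSqlType() for any ARRAY PDataType would just report Types.ARRAY.
      static int getSqlType(int typeId) {
          return isArray(typeId) ? Types.ARRAY : typeId;
      }

      // Recover the element type from an array typeId.
      static int elementSqlType(int typeId) {
          return typeId - ARRAY_TYPE_BASE;
      }
  }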

I think we should take this opportunity to do a bit of cleanup on PDataType too - maybe to the point of having the enum delegate everything to a new interface or abstract class so that we can get better code reuse.

@ramkrish86
Contributor

Going through the related links and the existing code.
For now I will check CREATE TABLE with an array as a column data type.
-> Array support would be done for every data type. So does this mean that I would have INT_ARRAY, DOUBLE_ARRAY, BOOLEAN_ARRAY, etc., and the same would be used in the data type creation (i.e. in the CREATE TABLE syntax)? Also, the size of the array should be mentioned in the CREATE TABLE syntax. This would require some parser changes to the antlr files?
I can see the need for using another typeId along with sqlType in PDataType, which would tell what type of array the given array is. This would mean that all other existing data types would have typeId == sqlType.

@jtaylor-sfdc
Contributor Author

Yes, INT_ARRAY, DOUBLE_ARRAY, and BOOLEAN_ARRAY as the PDataType enum names. It wouldn't be exposed like that in the CREATE TABLE statement, though. I'd shoot for syntax like this (see the Postgres docs referenced above):

  CREATE TABLE t (k VARCHAR PRIMARY KEY,
      a INTEGER ARRAY,
      b VARCHAR ARRAY[5]
  )

I think limiting array usage to key value columns is fine for now. That way we can serialize the length as a varint as the first thing in the value. Also, limiting to single dimension is fine for a first version.

I think we can still just store a single typeID, but for arrays (which would be bigger than 1000), we could store the element type in the first few digits. But the implementation of getSqlType() for arrays would return Types.ARRAY.

Also (as you've probably noticed), there's too much copy/paste code in PDataType. I'd like to aim to have that be as thin as possible and have it delegate everything to a new interface. Then we can have a hierarchy and get better code reuse for all the types we have.

@ramkrish86
Contributor

I have done the basic lexer changes for the CREATE TABLE syntax, but need to test them along with some core changes.
While starting on the implementation for ARRAY support in scans, I observed the following things with respect to how PreparedStatements can be created.

-> The existing PDataType cannot be used directly, because the Codecs are for primitive types only.
-> In the case of arrays we need PreparedStatement.setArray(java.sql.Array), so we need to provide an implementation of Array. This can be done using connection.createArrayOf() (sketched below).
-> Every element in this Array implementation should map to the primitive data type. So we have to implement java.sql.Array, through which we would access the array elements.
-> As per the PostgreSQL docs, although the syntax allows a size specification, it is not used; the arrays are unbounded (maybe internally there is a limit). So integer array[5] and integer array[] are the same.
-> Once we identify that the data types are convertible and coercible, we need to add them to the Puts that we create. This still needs to be checked (I have not yet come to this part).
@jtaylor-sfdc
We may need to link the PDataType and the Array impl.
Do the above observations make sense?
I can carry on with my implementation if you feel the approach is correct. I also went through the PostgreSQL code to see how they handle arrays.
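
For reference, the JDBC flow described above uses only standard java.sql APIs. A sketch (the JDBC URL, table, and column values are hypothetical):

  import java.sql.Array;
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;

  public class ArrayUpsertSketch {
      public static void main(String[] args) throws Exception {
          // Hypothetical connection string, for illustration only.
          try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
              // The driver's createArrayOf() hands back its own java.sql.Array
              // implementation (the PhoenixArray discussed below).
              Array arr = conn.createArrayOf("INTEGER", new Integer[] {1, 2, 3});
              try (PreparedStatement stmt =
                      conn.prepareStatement("UPSERT INTO t VALUES (?, ?)")) {
                  stmt.setString(1, "key1");
                  stmt.setArray(2, arr);
                  stmt.executeUpdate();
              }
              conn.commit();
          }
      }
  }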

@jtaylor-sfdc
Contributor Author

Good questions, @ramkrish86. We should look at the SQL92 spec and see how an impl is supposed to handle an ARRAY with a specific dimension. I would have thought that should be an upper bound on the size. Let me do some research on that.

I agree, we'll need a PhoenixArray that implements java.sql.Array. I can also see that we'll likely need to tweak the UpsertCompiler code, which is fine. It's ok if a PDataType doesn't have a codec; only the primitive ones do. For the ARRAY PDataTypes, I'd expect the toObject method to return either a PhoenixArray or an array of the element type (e.g. long[] or BigDecimal[]).
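
A bare-bones sketch of what such a class might look like (illustrative only, not the actual Phoenix implementation; the remaining accessors are stubbed):

  import java.sql.Array;
  import java.sql.ResultSet;
  import java.util.Map;

  public class PhoenixArray implements Array {
      private Object elements;      // e.g. a long[] or BigDecimal[]
      private final int baseType;   // java.sql.Types code of the element type
      private final String baseTypeName;

      public PhoenixArray(Object elements, int baseType, String baseTypeName) {
          this.elements = elements;
          this.baseType = baseType;
          this.baseTypeName = baseTypeName;
      }

      @Override public Object getArray() { return elements; }
      @Override public int getBaseType() { return baseType; }
      @Override public String getBaseTypeName() { return baseTypeName; }
      @Override public void free() { elements = null; }

      // The remaining accessors could be filled in as needed.
      @Override public Object getArray(Map<String, Class<?>> map) {
          throw new UnsupportedOperationException();
      }
      @Override public Object getArray(long index, int count) {
          throw new UnsupportedOperationException();
      }
      @Override public Object getArray(long index, int count, Map<String, Class<?>> map) {
          throw new UnsupportedOperationException();
      }
      @Override public ResultSet getResultSet() {
          throw new UnsupportedOperationException();
      }
      @Override public ResultSet getResultSet(Map<String, Class<?>> map) {
          throw new UnsupportedOperationException();
      }
      @Override public ResultSet getResultSet(long index, int count) {
          throw new UnsupportedOperationException();
      }
      @Override public ResultSet getResultSet(long index, int count, Map<String, Class<?>> map) {
          throw new UnsupportedOperationException();
      }
  }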

Feel free to spin up a pull request for your partial work so we can discuss more specifics. It's ok if it's not working yet. If you think the scope is big enough, we can also create a branch for this, just let me know.

@ramkrish86
Contributor

Sure, James. Let me come up with some basic changes and implementations (they may not work, though); on seeing those, we can decide the number of changes required and take a decision accordingly. Thanks for your inputs.

@ramkrish86
Contributor

Working on the toBytes() of Array. How should it look? Currently toBytes() returns the bytes of that data type, but for arrays we need the entire data inside the array. So we may have to serialize the dimension, the length of every dimension, and then the individual elements.
Should isCoercible() check every element in the array?
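
One possible single-dimension layout, following the earlier suggestion to lead with the element count as a varint (a sketch only; the class and method names are made up, and the real toBytes() would dispatch through the element PDataType):

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;

  import org.apache.hadoop.io.WritableUtils;

  public class IntArraySerializerSketch {
      public static byte[] toBytes(int[] elements) throws IOException {
          ByteArrayOutputStream baos = new ByteArrayOutputStream();
          DataOutputStream out = new DataOutputStream(baos);
          // Varint element count first, as suggested above; a multi-dimensional
          // version would also need to serialize the dimensions.
          WritableUtils.writeVInt(out, elements.length);
          for (int e : elements) {
              out.writeInt(e); // fixed-width elements for a primitive type
          }
          out.flush();
          return baos.toByteArray();
      }
  }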

@ramkrish86
Contributor

I have completed an implementation for INTEGER_ARRAY as a POC, and basic upserts and select queries work with INTEGER_ARRAY. I will put up a pull request so that it can be reviewed.

@jtaylor-sfdc
Contributor Author

Thanks so much for putting this intermediate pull request together - very helpful. Please see my comments and let me know if you have outstanding questions.

@jtaylor-sfdc
Contributor Author

The main outstanding work necessary for this is to add some built-in functions that allow array element access and return the length of an array.

@ramkrish86
Contributor

Will take this up later next month.

@jtaylor-sfdc
Contributor Author

Thanks to @ramkrish86, we now support the SQL ARRAY construct. Fantastic work, Ram!

@ramkrish86
Contributor

Thanks a lot, James, for your patient reviews. Hope to improve next time. Is the build failing? I am not able to view the build output from the link provided in the auto-generated mail.

@anoopsjohn
Contributor

Wow! Great work, Ram!
