
Support ARRAY data type #178

Closed
jtaylor-sfdc opened this issue Apr 27, 2013 · 15 comments

@jtaylor-sfdc
Contributor

Benchmark outline from Sudarshan Kadambi describing a good use case for an ARRAY data type. Supporting an ARRAY data type, an UNNEST built-in function, plus derived tables would allow the following standard query to be used (as opposed to creating custom aggregate functions):

select avg(v)
from (select unnest(value) v from t
where object_id in (O1,O2,...O250K) and field_type = 'F1' and attrib_id = 'A1')

It'd be nice if you could just do an average over an array directly, but this would be non-standard.

On 04/26/2013 11:17 AM, Sudarshan Kadambi (BLOOMBERG) wrote:
Hi James:

Yes, I saw the email. Thank you for this generous offer. I wanted some time to make sure the benchmark correctly represents my use case.

If you wish, here's a benchmark setup you could use:

  1. 1 Billion keys in table
  2. Query contains 250K randomly chosen object_ids and a single field_type (F1)
  3. Aggregation done on a single attribute (A1)
  4. Query:
    select avg(value) from t where object_id in (O1,O2,...O250K) and field_type = 'F1' and attrib_id = 'A1'

We would want the test done with and without the skip scan filter for the purpose of comparison.

The reason why I wanted some time to think about it is that the values within each attribute are a JSON number array. So an avg across 2 values is an average of the averages.

For example: Object_id=O1, Field_type=F1, Attrib_id=A1, Value 1: {1,2,3,4,5}
Object_id=O2, Field_type=F1, Attrib_id=A1, Value 2: {1,2,3,4,5}

The query should produce: Avg{Avg{Value1}, Avg{Value2}} = Avg{3,3} = 3
If we were doing a sum, the query would produce: Sum{Avg{Value1}, Avg{Value2}} = Sum{3,3} = 6.

This might require customization of the aggregate function code.
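Note that the UNNEST query sketched at the top of this issue computes a flat average over all unnested elements, which weights each row by its array length; the semantics described here call for an average of per-row averages instead. A nested derived-table form (a sketch only, assuming UNNEST and derived-table support) would match:

  select avg(row_avg)
  from (select object_id, avg(v) as row_avg
        from (select object_id, unnest(value) v from t
              where object_id in (O1,O2,...O250K)
                and field_type = 'F1' and attrib_id = 'A1')
        group by object_id)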

@ghost ghost assigned JingchengDu Jun 7, 2013
@jtaylor-sfdc
Contributor Author

For this, you'd want to look at implementing java.sql.Array. I think we'd just have all array values in a single KeyValue column and not yet allow an array in the primary key. As a model for how to surface the ARRAY access in SQL, I'd use what Postgres does as a guide.

We don't need all of that implemented yet, just the bare essentials:

  • support for an array declaration in CREATE TABLE
  • support for inserting an array value using the ARRAY[1,2,3] syntax
  • support for accessing an array element in SELECT
  • a few built-ins like ARRAY_LENGTH, ARRAY_UPPER, and ARRAY_LOWER
  • a built-in for UNNEST so that you can compute aggregates over an array
  • nice to have would be support for ANY and ALL
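
Taken together, the bare essentials above might look like this in use. A sketch only: the syntax follows the Postgres model discussed here, and none of these built-ins existed in Phoenix at the time of this comment:

  CREATE TABLE t (k VARCHAR PRIMARY KEY, a INTEGER ARRAY);
  UPSERT INTO t VALUES ('key1', ARRAY[1, 2, 3]);
  SELECT a[1], ARRAY_LENGTH(a) FROM t;             -- element access and length
  SELECT AVG(v) FROM (SELECT UNNEST(a) v FROM t);  -- aggregate over array elements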

@ghost ghost assigned anoopsjohn Jul 30, 2013
@ghost ghost assigned ramkrish86 Aug 16, 2013
@jtaylor-sfdc
Contributor Author

One thing that'll be a bit tricky is knowing/maintaining the element type of the ARRAY. The easiest thing I can think of is to create a new PDataType for each element type (essentially doubling the number of types). We likely want to switch our current PDataType sqlType to be a typeId. The typeId would basically combine the ARRAY sqlType (which is above 1000) with the element type, while getSqlType() would just return Types.ARRAY.
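
As a rough sketch of that encoding (the class name, base offset, and use of addition rather than a bitwise OR are all illustrative assumptions, not actual Phoenix code; addition sidesteps the fact that some java.sql.Types codes are negative):

  import java.sql.Types;

  public class ArrayTypeId {
      // Assumed base offset, chosen well above the java.sql.Types codes in use.
      static final int ARRAY_TYPE_BASE = 3000;

      // Combine the array base with the element's sqlType to get the typeId,
      // e.g. an INTEGER array would be 3000 + Types.INTEGER.
      static int of(int elementSqlType) {
          return ARRAY_TYPE_BASE + elementSqlType;
      }

      // The margin below the base tolerates negative element codes
      // such as Types.BIGINT == -5.
      static boolean isArray(int typeId) {
          return typeId >= ARRAY_TYPE_BASE - 100;
      }

      // getSqlType() for any ARRAY PDataType would just report Types.ARRAY.
      static int getSqlType(int typeId) {
          return isArray(typeId) ? Types.ARRAY : typeId;
      }

      // Recover the element type from an array typeId.
      static int elementSqlType(int typeId) {
          return typeId - ARRAY_TYPE_BASE;
      }
  }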

I think we should take this opportunity to do a bit of cleanup on PDataType too - maybe to the point of having the enum delegate everything to a new interface or abstract class so that we can get better code reuse.

@ramkrish86
Contributor

Going through the related links and the existing code.
For now I will check CREATE TABLE with an array as a column data type.
-> Array support would be done for every data type. So does this mean that I would have INT_ARRAY, DOUBLE_ARRAY, BOOLEAN_ARRAY, etc., and the same would be used in the data type creation (i.e. in the CREATE TABLE syntax)? Also, the size of the array should be mentioned in the CREATE TABLE syntax. This would require some parser changes to the antlr files?
I can see the need for using another typeId along with sqlType in PDataType, which would tell what type of array the given array is. This would mean that all other existing data types would have typeId == sqlType.

@jtaylor-sfdc
Contributor Author

Yes, INT_ARRAY, DOUBLE_ARRAY, and BOOLEAN_ARRAY as the PDataType enum names. It wouldn't be exposed like that in the CREATE TABLE statement, though. I'd shoot for syntax like this (see the Postgres docs referenced above):

  CREATE TABLE t (k VARCHAR PRIMARY KEY,
      a INTEGER ARRAY,
      b VARCHAR ARRAY[5]
  )

I think limiting array usage to key value columns is fine for now. That way we can serialize the length as a varint as the first thing in the value. Also, limiting to single dimension is fine for a first version.

I think we can still just store a single typeID, but for arrays (which would be bigger than 1000), we could store the element type in the first few digits. But the implementation of getSqlType() for arrays would return Types.ARRAY.

Also (as you've probably noticed), there's too much copy/paste code in PDataType. I'd like to aim to have that be as thin as possible and have it delegate everything to a new interface. Then we can have a hierarchy and get better code reuse for all the types we have.

@ramkrish86
Contributor

I have done the basic lexer changes for the CREATE TABLE syntax, but need to test them along with some core changes.
While starting on the implementation for ARRAY support in scans, I observed the following things with respect to how PreparedStatements can be created.

-> The existing PDataType cannot be used directly, because the Codecs are for primitive types only.
-> In the case of arrays we need PreparedStatement.setArray(java.sql.Array), so we need to provide an implementation of Array. This can be done using connection.createArrayOf() (sketched below).
-> Every element in this Array implementation should map to the primitive data type. So we have to implement java.sql.Array, through which we would access the array elements.
-> As per the PostgreSQL docs, although the syntax allows a size specification, it is not used; the arrays are unbounded (maybe internally there is a limit). So integer array[5] and integer array[] are the same.
-> Once we identify that the data types are convertible and coercible, we need to add them to the Puts that we create. This still needs to be checked (I have not yet come to this part).
@jtaylor-sfdc
We may need to link the PDataType and the Array impl.
Do the above observations make sense?
I can carry on with my implementation if you feel the approach is correct. I also went through the PostgreSQL code to see how they handle arrays.
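
For reference, the JDBC flow described above uses only standard java.sql APIs. A sketch (the JDBC URL, table, and column values are hypothetical):

  import java.sql.Array;
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;

  public class ArrayUpsertSketch {
      public static void main(String[] args) throws Exception {
          // Hypothetical connection string, for illustration only.
          try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
              // The driver's createArrayOf() hands back its own java.sql.Array
              // implementation (the PhoenixArray discussed below).
              Array arr = conn.createArrayOf("INTEGER", new Integer[] {1, 2, 3});
              try (PreparedStatement stmt =
                      conn.prepareStatement("UPSERT INTO t VALUES (?, ?)")) {
                  stmt.setString(1, "key1");
                  stmt.setArray(2, arr);
                  stmt.executeUpdate();
              }
              conn.commit();
          }
      }
  }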

@jtaylor-sfdc
Contributor Author

Good questions, @ramkrish86. We should look at the SQL92 spec and see how an impl is supposed to handle an ARRAY with a specific dimension. I would have thought that should be an upper bound on the size. Let me do some research on that.

I agree, we'll need a PhoenixArray that implements java.sql.Array. I can also see that we'll likely need to tweak the UpsertCompiler code, which is fine. It's ok if a PDataType doesn't have a codec; only the primitive ones do. For the ARRAY PDataTypes, I'd expect the toObject method to return either a PhoenixArray or an array of the element type (e.g. long[] or BigDecimal[]).
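
A bare-bones sketch of what such a class might look like (illustrative only, not the actual Phoenix implementation; the remaining accessors are stubbed):

  import java.sql.Array;
  import java.sql.ResultSet;
  import java.util.Map;

  public class PhoenixArray implements Array {
      private Object elements;      // e.g. a long[] or BigDecimal[]
      private final int baseType;   // java.sql.Types code of the element type
      private final String baseTypeName;

      public PhoenixArray(Object elements, int baseType, String baseTypeName) {
          this.elements = elements;
          this.baseType = baseType;
          this.baseTypeName = baseTypeName;
      }

      @Override public Object getArray() { return elements; }
      @Override public int getBaseType() { return baseType; }
      @Override public String getBaseTypeName() { return baseTypeName; }
      @Override public void free() { elements = null; }

      // The remaining accessors could be filled in as needed.
      @Override public Object getArray(Map<String, Class<?>> map) {
          throw new UnsupportedOperationException();
      }
      @Override public Object getArray(long index, int count) {
          throw new UnsupportedOperationException();
      }
      @Override public Object getArray(long index, int count, Map<String, Class<?>> map) {
          throw new UnsupportedOperationException();
      }
      @Override public ResultSet getResultSet() {
          throw new UnsupportedOperationException();
      }
      @Override public ResultSet getResultSet(Map<String, Class<?>> map) {
          throw new UnsupportedOperationException();
      }
      @Override public ResultSet getResultSet(long index, int count) {
          throw new UnsupportedOperationException();
      }
      @Override public ResultSet getResultSet(long index, int count, Map<String, Class<?>> map) {
          throw new UnsupportedOperationException();
      }
  }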

Feel free to spin up a pull request for your partial work so we can discuss more specifics. It's ok if it's not working yet. If you think the scope is big enough, we can also create a branch for this, just let me know.

@ramkrish86
Contributor

Sure, James. Let me come up with some basic changes and implementations (they may not work, though); on seeing those, we can decide the number of changes required and take a decision accordingly. Thanks for your inputs.

@ramkrish86
Contributor

Working on the toBytes() of Array. How should it look? Currently toBytes() returns the bytes of that data type, but for arrays we need the entire data inside the array. So we may have to serialize the dimension, the length of every dimension, and then the individual elements.
Should isCoercible() check every element in the array?
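
One possible single-dimension layout, following the earlier suggestion to lead with the element count as a varint (a sketch only; the class and method names are made up, and the real toBytes() would dispatch through the element PDataType):

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;

  import org.apache.hadoop.io.WritableUtils;

  public class IntArraySerializerSketch {
      public static byte[] toBytes(int[] elements) throws IOException {
          ByteArrayOutputStream baos = new ByteArrayOutputStream();
          DataOutputStream out = new DataOutputStream(baos);
          // Varint element count first, as suggested above; a multi-dimensional
          // version would also need to serialize the dimensions.
          WritableUtils.writeVInt(out, elements.length);
          for (int e : elements) {
              out.writeInt(e); // fixed-width elements for a primitive type
          }
          out.flush();
          return baos.toByteArray();
      }
  }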

@ramkrish86
Contributor

I have completed an implementation for INTEGER_ARRAY as a POC, and basic upserts and select queries work with INTEGER_ARRAY. I will put up a pull request so that it can be reviewed.

@jtaylor-sfdc
Contributor Author

Thanks so much for putting this intermediate pull request together - very helpful. Please see my comments and let me know if you have outstanding questions.

@jtaylor-sfdc
Contributor Author

The main outstanding work necessary for this is to add some built-in functions that allow array element access and return the length of an array.

@ramkrish86
Contributor

Will take this up later next month.

@jtaylor-sfdc
Contributor Author

Thanks to @ramkrish86, we now support the SQL ARRAY construct. Fantastic work, Ram!

@ramkrish86
Contributor

Thanks a lot, James, for your patient reviews. Hope to improve next time. Is the build failing? I am not able to view the build output from the link provided in the auto-generated mail.

@anoopsjohn
Contributor

Wow! Great work, Ram!
