Support ARRAY data type #178
For this, you'd want to look at implementing java.sql.Array. I think we'd just have all array values in a single KeyValue column and not yet allow an array in the primary key. As a model for how to surface ARRAY access in SQL, I'd use what Postgres does as a guide. We don't need all of that implemented yet, just the bare essentials:
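As a rough illustration, here is a minimal sketch of what such a class might look like, assuming a one-dimensional array backed by a Java object array (the class shape and field names are hypothetical, not the eventual Phoenix implementation):

```java
import java.sql.Array;
import java.sql.JDBCType;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Arrays;
import java.util.Map;

// Hypothetical sketch of a PhoenixArray holding a one-dimensional array of
// element values; the real class would be wired into PDataType.
public class PhoenixArray implements Array {
    private final int baseSqlType;   // java.sql.Types code of the element type
    private final Object[] elements;

    public PhoenixArray(int baseSqlType, Object[] elements) {
        this.baseSqlType = baseSqlType;
        this.elements = elements;
    }

    @Override public int getBaseType() { return baseSqlType; }

    @Override public String getBaseTypeName() {
        return JDBCType.valueOf(baseSqlType).getName();
    }

    @Override public Object getArray() { return elements; }

    @Override public Object getArray(long index, int count) {
        // JDBC array indices are 1-based.
        int from = (int) index - 1;
        return Arrays.copyOfRange(elements, from, from + count);
    }

    // Custom type mappings and ResultSet views can come later; only the
    // bare essentials are supported in this sketch.
    @Override public Object getArray(Map<String, Class<?>> map) throws SQLException {
        throw new SQLFeatureNotSupportedException();
    }
    @Override public Object getArray(long index, int count, Map<String, Class<?>> map)
            throws SQLException {
        throw new SQLFeatureNotSupportedException();
    }
    @Override public ResultSet getResultSet() throws SQLException {
        throw new SQLFeatureNotSupportedException();
    }
    @Override public ResultSet getResultSet(Map<String, Class<?>> map) throws SQLException {
        throw new SQLFeatureNotSupportedException();
    }
    @Override public ResultSet getResultSet(long index, int count) throws SQLException {
        throw new SQLFeatureNotSupportedException();
    }
    @Override public ResultSet getResultSet(long index, int count, Map<String, Class<?>> map)
            throws SQLException {
        throw new SQLFeatureNotSupportedException();
    }
    @Override public void free() { /* nothing to release */ }
}
```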
One thing that'll be a bit tricky is knowing/maintaining the element type of the ARRAY. The easiest thing I can think of is to create a new PDataType for each element type (essentially doubling the number of types). We likely want to switch our current PDataType sqlType to be a typeId. The typeId would basically OR together the ARRAY sqlType (which is above 1000) plus the element type, while the getSqlType() would just return Types.ARRAY. I think we should take this opportunity to do a bit of cleanup on PDataType too - maybe to the point of having the enum delegate everything to a new interface or abstract class so that we can get better code reuse.
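A small sketch of that typeId scheme, for illustration; the base offset of 3000 and the helper names here are assumptions, not what Phoenix necessarily ends up using:

```java
import java.sql.Types;

// Hypothetical sketch of combining an array marker with the element's
// java.sql.Types code into a single typeId.
public final class TypeIdCodec {
    // Any base comfortably above all scalar java.sql.Types codes works;
    // 3000 is an assumption for this sketch.
    private static final int ARRAY_TYPE_BASE = 3000;

    /** Combine the array marker with the element's sql type into one typeId. */
    public static int arrayTypeId(int elementSqlType) {
        return ARRAY_TYPE_BASE + elementSqlType;
    }

    /** Recover the element type from a combined typeId. */
    public static int elementSqlType(int typeId) {
        return typeId - ARRAY_TYPE_BASE;
    }

    public static boolean isArrayTypeId(int typeId) {
        return typeId >= ARRAY_TYPE_BASE;
    }

    /** getSqlType() for any array type would still report plain Types.ARRAY. */
    public static int getSqlType(int typeId) {
        return isArrayTypeId(typeId) ? Types.ARRAY : typeId;
    }
}
```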
Going through the related links and the existing code.
Yes, for INT_ARRAY, DOUBLE_ARRAY, BOOLEAN_ARRAY as the PDataType enum name. It wouldn't be exposed like that in the CREATE TABLE statement, though. I'd shoot for syntax like this (see link above from Postgres docs):
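For example, following the Postgres array syntax linked above, a declaration might look like the sketch below. The `[]` suffix is illustrative only; the grammar Phoenix ultimately adopts may differ:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative only: Postgres-style '[]' suffix for the array column;
// not confirmed Phoenix grammar at this point in the discussion.
public class ArraySyntaxExample {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE t (" +
                "object_id VARCHAR PRIMARY KEY, " +
                "scores INTEGER[])"); // array in a KeyValue column, not the PK
        }
    }
}
```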
I think limiting array usage to key value columns is fine for now. That way we can serialize the length as a varint as the first thing in the value. Also, limiting to single dimension is fine for a first version. I think we can still just store a single typeId, but for arrays (which would be bigger than 1000), we could store the element type in the first few digits. But the implementation of getSqlType() for arrays would return Types.ARRAY. Also (as you've probably noticed), there's too much copy/paste code in PDataType. I'd like to aim to have that be as thin as possible and have it delegate everything to a new interface. Then we can have a hierarchy and get better code reuse for all the types we have.
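A minimal sketch of that value layout for a one-dimensional INTEGER array, using Hadoop's WritableUtils for the varint; the layout is just the suggestion above, not a final wire format:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.WritableUtils;

// Sketch: serialize a one-dimensional int[] as [vint length][4-byte elements...].
public class ArrayValueWriter {
    public static byte[] toBytes(int[] elements) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        WritableUtils.writeVInt(out, elements.length); // length first, as a varint
        for (int element : elements) {
            out.writeInt(element); // fixed-width encoding for INTEGER elements
        }
        out.flush();
        return baos.toByteArray();
    }
}
```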
For the CREATE TABLE syntax, I have made the basic lexer changes but need to test them along with some core changes. The existing PDataType cannot work directly, because the Codecs are only for primitive types.
Good questions, @ramkrish86. We should look at the SQL92 spec and see how an impl is supposed to handle an ARRAY with a specific dimension. I would have thought that should be an upper bound on the size. Let me do some research on that. I agree, we'll need a PhoenixArray that implements java.sql.Array. I also can see that we'll likely need to tweak the UpsertCompiler code, which is fine. It's ok if a PDataType doesn't have a codec. Only the primitive ones do. For the ARRAY PDataTypes, I'd expect the toObject method to return either a PhoenixArray or an array of the element type (e.g. long[] or BigDecimal[], etc.). Feel free to spin up a pull request for your partial work so we can discuss more specifics. It's ok if it's not working yet. If you think the scope is big enough, we can also create a branch for this, just let me know.
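For instance, a toObject for an INTEGER_ARRAY could deserialize straight into an int[], assuming the varint-length-prefixed layout sketched earlier; the method shape here is illustrative, not the actual PDataType signature:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

import org.apache.hadoop.io.WritableUtils;

// Sketch: the read-side counterpart of the varint-prefixed layout above.
public class ArrayValueReader {
    public static int[] toObject(byte[] bytes, int offset, int length) throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(bytes, offset, length));
        int count = WritableUtils.readVInt(in); // element count comes first
        int[] elements = new int[count];
        for (int i = 0; i < count; i++) {
            elements[i] = in.readInt();
        }
        return elements; // or wrap in a PhoenixArray for java.sql.Array access
    }
}
```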
Sure, James. Let me come up with some basic changes and an implementation (it may not work, though); once you see that, we can decide on the number of changes required and take a decision accordingly. Thanks for your input.
Working on the toBytes() of Array. What should it look like? Currently toBytes() returns the bytes of a single value of the data type, but for arrays we need to serialize all the data inside the array. So we may have to serialize the dimension, the length of every dimension, and then the individual elements.
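One possible shape for that, generalizing the single-dimension sketch above with a small header; this is purely illustrative, and per the earlier comment the first version can stay single-dimensional:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.WritableUtils;

// Sketch: [vint numDimensions][vint length per dimension...][elements, row-major].
public class MultiDimArrayWriter {
    public static byte[] toBytes(int[][] matrix) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        WritableUtils.writeVInt(out, 2);                // number of dimensions
        WritableUtils.writeVInt(out, matrix.length);    // length of dimension 1
        WritableUtils.writeVInt(out, matrix[0].length); // length of dimension 2 (rectangular)
        for (int[] row : matrix) {
            for (int element : row) {
                out.writeInt(element);
            }
        }
        out.flush();
        return baos.toByteArray();
    }
}
```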
I have completed an implementation for INTEGER_ARRAY as a POC, and basic upserts and select queries work with INTEGER_ARRAY. I'll open a pull request so that it can be reviewed.
Thanks so much for putting this intermediate pull request together - very helpful. Please see my comments and let me know if you have outstanding questions. |
The main outstanding work necessary for this is to add some built-in functions that allow array element access and return the length of an array. |
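Hypothetically, usage might look like this once such functions exist. The ARRAY_LENGTH name and the 1-based subscript form are assumptions modeled on Postgres-style syntax; the built-ins Phoenix actually adds may differ:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative queries against the hypothetical 'scores' array column above.
public class ArrayBuiltinsExample {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            // Element access by (1-based) index:
            ResultSet rs = stmt.executeQuery("SELECT scores[1] FROM t");
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
            // Length of the array:
            rs = stmt.executeQuery("SELECT ARRAY_LENGTH(scores) FROM t");
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}
```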
Will take this up later next month. |
Thanks to @ramkrish86, we now support the SQL ARRAY construct. Fantastic work, Ram! |
Thanks a lot, James, for your patient reviews. I hope to improve next time. Is the build failing? I am not able to view the build output from the link provided in the autogenerated mail.
Wow, great work, Ram!
Benchmark outline from Sudarshan Kadambi which describes a good use case for supporting an ARRAY data type. Supporting an ARRAY data type, an UNNEST built-in function, plus derived tables would allow the following standard query to be used (as opposed to creating custom aggregate functions):
```sql
select avg(v)
from (select unnest(value) v from t
      where object_id in (O1,O2,...O250K) and field_type = 'F1' and attrib_id = 'A1')
```
It'd be nice if you could just do an average over an array directly, but this would be non-standard.
On 04/26/2013 11:17 AM, Sudarshan Kadambi (BLOOMBERG) wrote:
Hi James:
Yes, I saw the email. Thank you for this generous offer. I wanted some time to make sure the benchmark correctly represents my use case.
If you wish, here's a benchmark setup you could use:
```sql
select avg(value) where object_id in (O1,O2,...O250K) and field_type = 'F1' and attrib_id = 'A1'
```
We would want the test done with and without the skip scan filter for the purpose of comparison.
The reason why I wanted some time to think about it is that the value within each attribute is a JSON number array. So an avg across 2 values is an average of the averages.
For example: Object_id=O1, Field_type=F1, Attrib_id=A1, Value 1: {1,2,3,4,5}
Object_id=O2, Field_type=F1, Attrib_id=A1, Value 2: {1,2,3,4,5}
The query should produce: Avg{Avg{Value1}, Avg{Value2}} = Avg{3,3} = 3
If we were doing a sum, the query would produce: Sum{Avg{Value1}, Avg{Value2}} = Sum{3,3} = 6.
This might require customization of the aggregate function code.