What went wrong?
Since we are building the index in a distributed way and we have an overall desiredCubeSize, we need to calculate the desiredGroupCubeSize for each partition in order to reach that size.
Right now we are doing it with the method estimateGroupCubeSize.
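Roughly, its logic looks like the following (a paraphrased sketch, not the exact source; the parameter names and the bufferCapacity argument are assumptions):

// Sketch: split the data into groups (at least one per partition, more if a
// partition would overflow the weight buffer) and divide the desired cube size
// among them.
def estimateGroupCubeSize(
    desiredCubeSize: Int,
    numPartitions: Int,
    numElements: Long,
    bufferCapacity: Long): Int = {
  val numGroups = math.max(numPartitions.toLong, numElements / bufferCapacity)
  val groupCubeSize = desiredCubeSize / numGroups
  // With a large dataset numGroups can exceed desiredCubeSize, the integer
  // division yields 0, and the current lower bound of 1 is returned.
  math.max(1L, groupCubeSize).toInt
}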
However, we have seen that with a large dataset the estimated group size ends up being 1. This causes a poor estimation of the CubeWeights, with many cubes that are never used in the final result.
One solution is to raise the minimum up to 1000, so that each group keeps a more representative sample of the elements.
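A minimal sketch of that change, following the structure above (minGroupCubeSize is a hypothetical name; 1000 is the floor proposed in this issue):

// Sketch: clamp the estimate so that a group never holds fewer than 1000
// elements, keeping each group's sample representative when building CubeWeights.
val minGroupCubeSize = 1000
def estimateGroupCubeSizeWithFloor(desiredCubeSize: Int, numGroups: Long): Int =
  math.max(minGroupCubeSize.toLong, desiredCubeSize / numGroups).toInt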
How to reproduce?
You can use different tests in the code.
1. Code that triggered the bug, or steps to reproduce:
// Runs inside the project's test scope: `options` and `oTreeAlgorithm`
// are provided by the surrounding test fixtures.
case class Client3(id: Long, name: String, age: Int, val2: Long, val3: Double)

val spark = SparkSession.active
val rdd =
  spark.sparkContext.parallelize(
    0.to(100000)
      .map(i => Client3(i * i, s"student-$i", i, i * 1000 + 123, i * 2567.3432143)))
val df = spark.createDataFrame(rdd)
val rev = SparkRevisionFactory.createNewRevision(QTableID("test"), df.schema, options)
val (indexed, tc: BroadcastedTableChanges) = oTreeAlgorithm.index(df, IndexStatus(rev))

tc.cubeWeights.size // check this number to see how many cubes are estimated
indexed.select("_qbeastCube").distinct().count() // check this number to see how many cubes are written in the data
2. Branch and commit id:
main at 55442d9
3. Spark version:
On the spark shell run spark.version.
3.1.2
4. Hadoop version:
On the spark shell run org.apache.hadoop.util.VersionInfo.getVersion().
3.2.0
5. How are you running Spark?
Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local computer?
Local Spark.
6. Stack trace:
Trace of the log/error messages.
None