Add CDFQuantile type #416

Closed
osopardo1 opened this issue Sep 16, 2024 · 0 comments
Labels
type: enhancement Improvement of existing feature or code

osopardo1 commented Sep 16, 2024

Right now, we split the Transformations (and Transformers) into:

  • LinearTransformation
  • HashTransformation
  • StringHistogramTransformation
  • NullToZeroTransformation
  • IdentityToZeroTransformation

We wanted to implement a QuantileTransformation (see closed issue #338), which would make indexing more flexible by calling a streaming algorithm to update and provide the rank of a specific point while writing new data. However, while trying to implement it, we noticed a few things:

  1. The HistogramTransformation was already mapping the elements as if they were quantiles.
  2. For computing the histogram, we required an external method to be called before indexing:
import io.qbeast.spark.utils.QbeastUtils

val brandStats = QbeastUtils.computeHistogramForColumn(df, "brand", 50)
val statsStr = s"""{"brand_histogram":$brandStats}"""

(df
  .write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "brand:histogram")
  .option("columnStats", statsStr)
  .save(targetPath))
  3. For computing the quantiles in PR #413 (Issue #416: Add CDFQuantile Transformers and Transformations), we were also implementing the same methodology.
  4. We need a broader abstraction that covers Histogram, Quantiles, and other algorithms related to a CDF (Cumulative Distribution Function).
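
To make the shared idea concrete: both a histogram and a set of quantiles can be treated as an ordered list of bin boundaries, and a value is mapped to [0, 1] according to its position among them. The following is a minimal sketch of that idea and not the Qbeast implementation; `CdfTransformationSketch`, `NumericCdfSketch`, `StringCdfSketch`, and the `boundaries` field are hypothetical names introduced only for illustration.

```scala
// Hypothetical sketch: none of these names belong to the Qbeast API.
object CdfSketch {

  trait CdfTransformationSketch[T] {
    // Sorted bin boundaries, e.g. approximate quantiles of the column.
    def boundaries: IndexedSeq[T]

    // Type-specific comparison (numeric or lexicographic).
    def compare(a: T, b: T): Int

    // Map a value to [0, 1] by its position among the boundaries.
    // Values below the first boundary map to 0.0, values above the last to 1.0.
    def transform(value: T): Double =
      if (boundaries.size < 2) 0.0
      else {
        val idx = math.max(0, countLessOrEqual(value) - 1)
        idx.toDouble / (boundaries.size - 1)
      }

    // Binary search: number of boundaries that are <= value.
    private def countLessOrEqual(value: T): Int = {
      var lo = 0
      var hi = boundaries.size
      while (lo < hi) {
        val mid = (lo + hi) >>> 1
        if (compare(boundaries(mid), value) <= 0) lo = mid + 1
        else hi = mid
      }
      lo
    }
  }

  final case class NumericCdfSketch(boundaries: IndexedSeq[Double])
      extends CdfTransformationSketch[Double] {
    def compare(a: Double, b: Double): Int = java.lang.Double.compare(a, b)
  }

  final case class StringCdfSketch(boundaries: IndexedSeq[String])
      extends CdfTransformationSketch[String] {
    def compare(a: String, b: String): Int = a.compareTo(b)
  }
}
```

With this shape, the numeric and string variants differ only in how values are compared and in how the boundaries are produced, which is the part a common CDF type could factor out.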

This issue is to reorganize the Transformers and Transformations to have the following nomenclatures:

  • CDFQuantilesTransformation
  • CDF<implementation>Transformation

Of these, we would only provide an implementation for QuantilesTransformation, covering both the String and Numeric cases, with a different initialization of the bins for each case.
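
The issue does not spell out how the bins are initialized in each case. As an illustration only, and assuming the boundaries are taken from a sorted in-memory sample, the type-specific part can be reduced to the ordering (numeric versus lexicographic). `QuantileBinsSketch` and `quantileBoundaries` are hypothetical names, not part of Qbeast.

```scala
// Hypothetical sketch: derive numBins + 1 boundaries from a sample.
object QuantileBinsSketch {

  def quantileBoundaries[T](sample: Seq[T], numBins: Int)(implicit
      ord: Ordering[T]): IndexedSeq[T] = {
    require(sample.nonEmpty, "sample must not be empty")
    require(numBins >= 1, "numBins must be at least 1")
    // The implicit Ordering decides numeric vs. lexicographic order.
    val sorted = sample.sorted.toIndexedSeq
    val last = sorted.size - 1
    // Pick elements at evenly spaced ranks, from the minimum to the maximum.
    (0 to numBins).map(i => sorted((i.toLong * last / numBins).toInt))
  }
}
```

For example, `QuantileBinsSketch.quantileBoundaries(Seq("pear", "apple", "fig", "kiwi", "plum"), 2)` picks the first, middle, and last elements of the lexicographically sorted sample. A real implementation on large data would more likely rely on an approximate or streaming quantile algorithm than on a full sort.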

With an API such as:

df.write.format("qbeast").option("columnsToIndex", "id:quantiles").save(..)
osopardo1 self-assigned this Sep 16, 2024
osopardo1 added the "type: enhancement" (Improvement of existing feature or code) label Sep 16, 2024
osopardo1 changed the title from "Refactor Transformers and Transformations to include CDF as a type" to "Add CDF type to Transformers" Sep 16, 2024
osopardo1 changed the title from "Add CDF type to Transformers" to "Add CDFQuantile type" Sep 18, 2024
JosepSampe added a commit that referenced this issue Oct 24, 2024
* Issue #424: Add sampling fraction option for optimization (#426)

* Add sampling fraction option for optimization and remove analyze from QbeastTable

* Issue #430: Simplify denormalized blocks creation (#431)

* Simplify Denormalized Blocks

* Issue #416: Add CDFQuantile Transformers and Transformations (#413)

* Issue 264: Update qviz for multiblock files (#437)

* Update Qbeast Visualiser (qviz) with multiblock files

---------

Co-authored-by: Jorge Marín <jorge.marin.rodenas@estudiantat.upc.edu>
Co-authored-by: Jorge Marín <100561030+jorgeMarin1@users.noreply.github.com>

* Issue #441: Fix dataChange flag in optimize (#444)

* Merge from main branch

---------

Co-authored-by: jiawei <47899566+Jiaweihu08@users.noreply.github.com>
Co-authored-by: Paola Pardo <paolapardoat@gmail.com>
Co-authored-by: Jorge Marín <jorge.marin.rodenas@estudiantat.upc.edu>
Co-authored-by: Jorge Marín <100561030+jorgeMarin1@users.noreply.github.com>