Add Table properties to Qbeast #124
Conversation
Hi @eavilaes! I have done a potential workaround for this issue. A lot of things need to be reworked; it's still a WIP. But due to timing, you can try the new SaveAsTable functionality and be the Q&A tester hehe.
Please feel free to suggest any other changes or report problems you face during execution. Many thanks!
Just tested, working smoooooth in my case 🥳
Codecov Report
```
@@            Coverage Diff             @@
##             main     #124      +/-   ##
==========================================
+ Coverage   91.81%   92.86%   +1.05%
==========================================
  Files          62       73      +11
  Lines        1453     1709     +256
  Branches      114      126      +12
==========================================
+ Hits         1334     1587     +253
- Misses        119      122       +3
```
…ts, add tests for both INSERT INTO and OVERWRITE using real data
Calling 'INSERT INTO' operates correctly with qbeast. When calling either …, the following works just fine on a VIEW: … However, when a TABLE is created using …, … In the case of delta, records are eliminated in both scenarios, with a second JSON log file containing … For Qbeast, when using …
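For context, here is a minimal sketch of the two statements under discussion. The names qbeast_table and source_view are illustrative, not taken from the actual test suite, and df and spark are assumed to be in scope:

```scala
// Illustrative only: qbeast_table and source_view are hypothetical names.
// Register a view over some new records.
df.createOrReplaceTempView("source_view")

// INSERT INTO appends the selected records to the existing table.
spark.sql("INSERT INTO qbeast_table SELECT * FROM source_view")

// INSERT OVERWRITE is expected to replace the existing records instead.
spark.sql("INSERT OVERWRITE TABLE qbeast_table SELECT * FROM source_view")
```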
…ltaCatalog
We need the QbeastCatalog to be independent from other existing Catalog solutions.
Since we want the Catalog implementation to be independent of the underlying formats, we decided to extend CatalogExtension. To implement all the methods required (listNamespaces(), listTables()...), we will use the delegated catalog (the one configured with the V2_SESSION_CATALOG_IMPLEMENTATION setting). From Spark's CatalogManager:

```scala
/**
 * If the V2_SESSION_CATALOG config is specified, we try to instantiate the user-specified v2
 * session catalog. Otherwise, return the default session catalog.
 *
 * This catalog is a v2 catalog that delegates to the v1 session catalog. it is used when the
 * session catalog is responsible for an identifier, but the source requires the v2 catalog API.
 * This happens when the source implementation extends the v2 TableProvider API and is not listed
 * in the fallback configuration, spark.sql.sources.useV1SourceList
 */
private[sql] def v2SessionCatalog: CatalogPlugin = {
  conf.getConf(SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION).map { _ =>
    catalogs.getOrElseUpdate(SESSION_CATALOG_NAME, loadV2SessionCatalog())
  }.getOrElse(defaultSessionCatalog)
}
```

In that way, the user can configure the QbeastCatalog either as the default session catalog (spark_catalog) or as a separate, named catalog:
```
spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
spark.sql.catalog.qbeast_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
spark.sql.catalog.qbeast_catalog.warehouse=/tmp/dir
...
```
```scala
// Write data with qbeast_catalog prefix
data.write
  .format("qbeast")
  .option("columnsToIndex", "id")
  .saveAsTable("qbeast_catalog.default.qbeast_table")
```
This commit also includes the removal of Delta dependencies from the catalog/sources classes.
We delegate a lot of methods to the Session Catalog; we need to verify they work properly with unit tests. It also adds some minor fixes.
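A rough sketch of that delegation pattern (the class and constructor here are hypothetical, not the actual QbeastCatalog code; only the TableCatalog method signatures come from Spark):

```scala
import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}

// Hypothetical illustration: methods the Qbeast catalog does not need to
// customize are simply forwarded to the delegated session catalog.
class DelegatingCatalogSketch(sessionCatalog: TableCatalog) {
  def listTables(namespace: Array[String]): Array[Identifier] =
    sessionCatalog.listTables(namespace)

  def loadTable(ident: Identifier): Table =
    sessionCatalog.loadTable(ident)
}
```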
Made some changes in the tests; hope the Codecov report is satisfied this time hehe.
…CatalogUtils and general QbeastCatalog behaviour
Hello! I think this PR is ready to merge since everyone has tried the new SaveAsTable functionality.
Description
Solves issue #42.
The problem with saveAsTable() goes beyond a simple override of the method. It requires a lot of reworking in the design and implementation of QbeastDataSource and its related classes.
A DataSource is the main entry point for writing and reading with Qbeast Format through Spark. Apache Spark has two different versions of this API:
- DataSource V1: RelationProvider and CreatableRelationProvider.
- DataSource V2: TableProvider and Table, implemented with traits like SupportsWrite, SupportsRead, SupportsOverwrite...
This separation of APIs has been profitable in terms of optimization, but it lacks consistency: some SQL statements and operations are implemented for V1 but not V2, and vice versa.
There's a nice series of blog posts around this topic that you can read to get the full picture: http://blog.madhukaraphatak.com/categories/datasource-v2-series/
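To make the two API shapes concrete, here is a minimal sketch. The class names are made up; only the interfaces and method signatures come from Spark:

```scala
import java.util
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, RelationProvider}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// V1 entry point: the provider itself builds BaseRelations for read and write.
class ExampleV1Source extends RelationProvider with CreatableRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    throw new UnsupportedOperationException("read path goes here")

  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation =
    throw new UnsupportedOperationException("write path goes here")
}

// V2 entry point: the provider exposes a Table, whose capabilities
// (SupportsRead, SupportsWrite, ...) are declared as mix-in traits on it.
class ExampleV2Source extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType()

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table =
    throw new UnsupportedOperationException("table implementation goes here")
}
```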
Type of change
This change consists of:
- QbeastTableImpl. This class extends Table and adds writing capabilities to the V2 Qbeast DataSource.
- QbeastWriterBuilder. This class is in charge of building the write process. It extends V1WriteBuilder, which makes the conversion easier.
- QbeastCatalog. An extension of CatalogExtension with SupportsNamespaces (supporting creation, renaming and deletion of namespaces) and StagingTableCatalog. A StagingTableCatalog is for creating a table before committing any metadata along with the content of CREATE TABLE AS SELECT or REPLACE TABLE AS SELECT operations, as described in the Spark documentation. The QbeastCatalog can be used as the default catalog through the spark_catalog configuration, or as an external catalog that you can combine with other catalogs (see the configuration sketch after this list).
- QbeastStagedTableImpl. This class contains the code to commit the staged changes atomically. It creates the underlying log and saves any data that may have been processed in the SELECT AS statement.
- SaveAsTableRule, to make sure the columnsToIndex option is passed to the QbeastTableImpl.
- QbeastAnalysis rules, to read with V1 optimizations.
- A cleanup of QbeastDataSource, reworking some of its methods.
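The two ways of configuring the catalog, as sketched earlier in this thread (the warehouse path is just an example):

```
# As the default session catalog:
spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog

# Or as an additional, named catalog alongside others:
spark.sql.catalog.qbeast_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
spark.sql.catalog.qbeast_catalog.warehouse=/tmp/dir
```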
How Has This Been Tested? (Optional)
We added a few tests to check how the units of code behave.
Under the io.qbeast.spark.internal.sources.catalog package, you will find different tests:
- DefaultStagedTableTest. Tests the table returned by default in case qbeast cannot operate.
- QbeastCatalogTest. Checks all the methods called by the CatalogManager and makes sure they work as expected.
- QbeastCatalogIntegrationTest. Checks the integration with other catalogs and the behavior of the Qbeast implementation.
- QbeastStagedTableTest. Contains method tests for the staged table implementation.
Under io.qbeast.spark.utils, some tests have been refactored:
- QbeastSparkCorrectnessTest. Checks that the results are correct (writing, reading, sampling...).
- QbeastSparkIntegrationTest. Checks that the Spark DataFrame API behaves properly.
- QbeastSQLIntegrationTest. Checks that the SQL commands are performed correctly.