-
Notifications
You must be signed in to change notification settings - Fork 447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: added support for Databricks Unity Catalog #1331
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,132 @@ | ||
//! Databricks Unity Catalog. | ||
//! | ||
//! This module is gated behind the "unity-experimental" feature. | ||
use super::{DataCatalog, DataCatalogError}; | ||
use reqwest::header; | ||
use reqwest_middleware::{ClientBuilder, ClientWithMiddleware}; | ||
use reqwest_retry::{policies::ExponentialBackoff, RetryTransientMiddleware}; | ||
use serde::Deserialize; | ||
|
||
/// Databricks Unity Catalog - implementation of the `DataCatalog` trait | ||
pub struct UnityCatalog { | ||
client_with_retry: ClientWithMiddleware, | ||
workspace_url: String, | ||
} | ||
|
||
impl UnityCatalog { | ||
/// Creates a new UnityCatalog | ||
pub fn new() -> Result<Self, DataCatalogError> { | ||
let token_var = "DATABRICKS_ACCESS_TOKEN"; | ||
let access_token = | ||
std::env::var(token_var).map_err(|_| DataCatalogError::MissingEnvVar { | ||
var_name: token_var.into(), | ||
})?; | ||
|
||
let auth_header_val = header::HeaderValue::from_str(&format!("Bearer {}", &access_token)) | ||
.map_err(|_| DataCatalogError::InvalidAccessToken)?; | ||
|
||
let headers = header::HeaderMap::from_iter([(header::AUTHORIZATION, auth_header_val)]); | ||
let client = reqwest::Client::builder() | ||
.default_headers(headers) | ||
.build()?; | ||
|
||
let retry_policy = ExponentialBackoff::builder().build_with_max_retries(10); | ||
|
||
let client_with_retry = ClientBuilder::new(client) | ||
.with(RetryTransientMiddleware::new_with_policy(retry_policy)) | ||
.build(); | ||
|
||
let workspace_var = "DATABRICKS_WORKSPACE_URL"; | ||
let workspace_url = | ||
std::env::var(workspace_var).map_err(|_| DataCatalogError::MissingEnvVar { | ||
var_name: workspace_var.into(), | ||
})?; | ||
|
||
Ok(Self { | ||
client_with_retry, | ||
workspace_url, | ||
}) | ||
} | ||
} | ||
|
||
#[derive(Deserialize)] | ||
#[serde(untagged)] | ||
enum TableResponse { | ||
Success { storage_location: String }, | ||
Error { error_code: String, message: String }, | ||
} | ||
|
||
/// Possible errors from the unity-catalog/tables API call | ||
#[derive(thiserror::Error, Debug)] | ||
pub enum GetTableError { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are we using these error variants somewhere? While still very much a long term things, we have been trying to reduce the number of error variants we expose to users in this crate and also make them more actionable. Maybe we can make some common errors explicit - like 403? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I did it in a similar way to the existing AWS glue module but I understand it's not perfect. Didn't want to start refactoring existing code either. At least not yet. |
||
#[error("GET request error: {source}")] | ||
/// Error from reqwest library | ||
RequestError { | ||
/// The underlying reqwest_middleware::Error | ||
#[from] | ||
source: reqwest_middleware::Error, | ||
}, | ||
|
||
/// Request returned error response | ||
#[error("Invalid table error: {error_code}: {message}")] | ||
InvalidTable { | ||
/// Error code | ||
error_code: String, | ||
/// Error description | ||
message: String, | ||
}, | ||
} | ||
|
||
impl From<reqwest_middleware::Error> for DataCatalogError { | ||
fn from(value: reqwest_middleware::Error) -> Self { | ||
value.into() | ||
} | ||
} | ||
|
||
impl From<reqwest::Error> for DataCatalogError { | ||
fn from(value: reqwest::Error) -> Self { | ||
reqwest_middleware::Error::Reqwest(value).into() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems a bit convoluted. We are wrapping the actual error in another error to then convert it again. Not sure how many variants the middleware error has, but maybe its viable to unpack the middleware error? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There should be only two variants, unpacking sounds like a good idea. I was just trying to solve a type checking error, didn't give it much thought... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, the middleware error is defined like this:
I could flatten the hierarchy by adding the same variants into There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I could also convert There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suppose it would be nice to unify some of the |
||
} | ||
} | ||
|
||
#[async_trait::async_trait] | ||
impl DataCatalog for UnityCatalog { | ||
/// Get the table storage location from the UnityCatalog | ||
async fn get_table_storage_location( | ||
&self, | ||
catalog_id: Option<String>, | ||
database_name: &str, | ||
table_name: &str, | ||
) -> Result<String, DataCatalogError> { | ||
let resp = self | ||
.client_with_retry | ||
.get(format!( | ||
"{}/api/2.1/unity-catalog/tables/{}.{}.{}", | ||
&self.workspace_url, | ||
catalog_id.as_deref().unwrap_or("main"), | ||
&database_name, | ||
&table_name | ||
)) | ||
.send() | ||
.await?; | ||
|
||
let parsed_resp: TableResponse = resp.json().await?; | ||
match parsed_resp { | ||
TableResponse::Success { storage_location } => Ok(storage_location), | ||
TableResponse::Error { | ||
error_code, | ||
message, | ||
} => Err(GetTableError::InvalidTable { | ||
error_code, | ||
message, | ||
} | ||
.into()), | ||
} | ||
} | ||
} | ||
|
||
impl std::fmt::Debug for UnityCatalog { | ||
fn fmt(&self, fmt: &mut std::fmt::Formatter<'_>) -> Result<(), std::fmt::Error> { | ||
write!(fmt, "UnityCatalog") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For our clients etc, we usually adopt a builder pattern, that allows us to configure options on the client before building the client. We would then have some methos like
UnitiCatalogClientBuilder::from_env()
to support parsing the configuration from the environment.That said, haven't worked with our catalogs for a long time, so not sure if we properly pass things through right now to make that work. IF not, this maybe something we want to address in a a follow-up.