
Add test_dicom.py to integration test metadata extract UDF #42

Open
dmoore247 opened this issue Feb 22, 2024 · 1 comment

@dmoore247 (Collaborator)

Is your feature request related to a problem? Please describe.
Add a unit test to validate the DICOM metadata extraction UDF.

Describe the solution you'd like
The real challenge will be shipping all of the Python artifacts needed to run via db-connect v2; one possible approach is sketched after the test code below.
This is the test_dicom.py that was started and then had to be abandoned:

import pytest
from databricks.connect import DatabricksSession
from pyspark.sql import SparkSession

from dbx.pixels import Catalog
from dbx.pixels.dicom import DicomMetaExtractor, DicomThumbnailExtractor  # The Dicom transformers
from dbx.pixels.version import __version__

# Public sample DICOM files and the target object catalog table
path = "s3://hls-eng-data-public/dicom/ddsm/benigns/patient0007/"
table = "main.pixels_solacc.object_catalog"

@pytest.fixture
def spark() -> SparkSession:
    """
    Create a SparkSession (the entry point to Spark functionality) on
    the cluster in the remote Databricks workspace. Unit tests do not
    have access to this SparkSession by default.
    """
    return DatabricksSession.builder.getOrCreate()

def test_dicom_happy(spark):
    # Catalog the sample DICOM files, then extract their metadata with the UDF
    catalog = Catalog(spark, table=table)
    catalog_df = catalog.catalog(path=path)
    meta_df = DicomMetaExtractor(catalog).transform(catalog_df)

    assert meta_df.count() == 4
    assert len(meta_df.columns) == 9

    meta_df.explain(extended=True)  # prints the query plan; returns None

    # The extracted metadata (a JSON string) should be populated
    row = meta_df.selectExpr("left(meta, 500)").collect()[0]
    assert row[0] is not None
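
One possible shape for the db-connect v2 artifact shipping mentioned above (a sketch only, not from the issue; the archive_dir and pixels_zip names and the "dbx" source-directory layout are assumptions) is to zip the local package inside the spark fixture and register it on the remote session:

import shutil
import tempfile

import pytest
from databricks.connect import DatabricksSession
from pyspark.sql import SparkSession

@pytest.fixture
def spark() -> SparkSession:
    """Remote SparkSession with the local dbx.pixels package shipped as a dependency."""
    spark = DatabricksSession.builder.getOrCreate()

    # Zip the local source tree so its modules can be imported inside UDFs on the cluster.
    archive_dir = tempfile.mkdtemp()
    pixels_zip = shutil.make_archive(f"{archive_dir}/dbx_pixels", "zip", root_dir=".", base_dir="dbx")

    # addArtifact(..., pyfile=True) puts the zip on the executors' Python path
    # (available on Spark Connect / Databricks Connect v2 sessions).
    spark.addArtifact(pixels_zip, pyfile=True)
    return spark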

Describe alternatives you've considered
The automated notebook execution jobs do cover these code paths at a higher level.

Additional context
See internal Slack conversations.

@dmoore247 (Collaborator, Author)

Fundamentally, db-connect needs to ship dependencies (as an archive) to the cluster with .addArtifact()
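
For example (a hedged sketch; the dist/dbx_pixels.zip path is an assumption, not from the issue), on a Databricks Connect v2 session:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Ship a pre-built zip archive of the package to the cluster; with pyfile=True it is
# added to the executors' Python path so `from dbx.pixels import ...` resolves in UDFs.
spark.addArtifact("dist/dbx_pixels.zip", pyfile=True)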
