Verify the Spark Connector captures rejected/exception rows while saving the DataFrame to Vertica #434

Closed
alexey-temnikov opened this issue Jun 20, 2022 · 1 comment
alexey-temnikov commented Jun 20, 2022

Summarize how the Spark Connector saves or reports rejected/exception rows.

For reference, see the following tickets:

jeremyprime changed the title from "Verify - Spark Connector captures rejected/exception rows while saving the dataframe to vertica." to "Verify the Spark Connector captures rejected/exception rows while saving the DataFrame to Vertica" on Jun 21, 2022
jeremyprime self-assigned this on Jun 21, 2022
@jeremyprime

The following is a summary of how rejected rows are currently handled in the Spark Connector:

Each time an operation runs, its status is saved to the job status table (S2V_JOB_STATUS_USER_${USER}) if save_job_status_table=true (defaults to false). This table contains metadata about the operation, including whether it succeeded and the percentage of failed rows. The user can also set the error tolerance with failed_rows_percent_tolerance (defaults to 0.10, or 10%).
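
As a rough illustration of how these two options fit together, here is a minimal Python sketch. The option names save_job_status_table and failed_rows_percent_tolerance come from the summary above; the helper name `job_passes`, the `write_options` dict, and the assumption that the tolerance boundary is inclusive are hypothetical and are not taken from the connector's code.

```python
# Hypothetical sketch: not the connector's implementation.

# Options as they might be passed to the connector's DataFrame writer.
# Only these two keys are named in the summary above; connection
# options (host, credentials, table, ...) are omitted here.
write_options = {
    "save_job_status_table": "true",          # default is false
    "failed_rows_percent_tolerance": "0.10",  # 0.10 means 10%
}


def job_passes(rejected_rows: int, total_rows: int, tolerance: float = 0.10) -> bool:
    """Return True when the fraction of rejected rows is within tolerance.

    Assumption: a rejected fraction exactly equal to the tolerance passes.
    """
    if total_rows == 0:
        # Nothing was written, so nothing can have been rejected.
        return True
    return rejected_rows / total_rows <= tolerance


tolerance = float(write_options["failed_rows_percent_tolerance"])
within = job_passes(3, 100, tolerance)   # 3% rejected
exceeded = job_passes(3, 10, tolerance)  # 30% rejected
```

With a tolerance of 0.10, 3 rejected rows out of 100 stays within the limit, while 3 out of 10 does not.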

Currently the rejected rows themselves are not persisted to a table (see #293). However, a summary of the rejected rows is printed to the logs. Up to 10 of the most common errors are printed, showing the number of rejected rows, an example, and the rejected reason. For example:

2022-06-21 17:45:32 ERROR VerticaDistributedFilesystemWritePipe:393 - Found 3 rejected rows, displaying up to 10 of the most common reasons:
2022-06-21 17:45:32 ERROR VerticaDistributedFilesystemWritePipe:394 - count | example_data | rejected_reason
2022-06-21 17:45:32 ERROR VerticaDistributedFilesystemWritePipe:396 - 3 | NULL | In column 1: Cannot set NULL value in NOT NULL column
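
Since the rejected rows are only available in the logs, one way to collect them is to parse the summary lines. The following is a minimal sketch that assumes the log layout matches the example above exactly (`count | example_data | rejected_reason` after the ` - ` separator); the names `ROW_PATTERN` and `parse_rejected_summary` are hypothetical and not part of the connector.

```python
import re

# Sample lines copied from the log output shown above.
LOG_LINES = [
    "2022-06-21 17:45:32 ERROR VerticaDistributedFilesystemWritePipe:393 - "
    "Found 3 rejected rows, displaying up to 10 of the most common reasons:",
    "2022-06-21 17:45:32 ERROR VerticaDistributedFilesystemWritePipe:394 - "
    "count | example_data | rejected_reason",
    "2022-06-21 17:45:32 ERROR VerticaDistributedFilesystemWritePipe:396 - "
    "3 | NULL | In column 1: Cannot set NULL value in NOT NULL column",
]

# Matches only data rows: a numeric count, then two " | "-separated fields.
# Caveat: example data that itself contains " | " would be split incorrectly.
ROW_PATTERN = re.compile(r" - (\d+) \| (.*?) \| (.*)$")


def parse_rejected_summary(lines):
    """Extract (count, example_data, rejected_reason) records from log lines."""
    rows = []
    for line in lines:
        match = ROW_PATTERN.search(line)
        if match:
            rows.append({
                "count": int(match.group(1)),
                "example_data": match.group(2),
                "rejected_reason": match.group(3),
            })
    return rows


summary = parse_rejected_summary(LOG_LINES)
```

The "Found ..." line and the column header line are skipped because neither starts with a numeric count after the separator.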

Note that in some cases the write to Vertica fails before the rows can be evaluated, for example when there is a schema mismatch between the source data and the Vertica table. In that case no rejected row information is printed.

Rejected row information is only produced when an error occurs while the rows are being processed. For example, a column has a NOT NULL constraint in Vertica but no such constraint in the source data, and the source contains null values that violate the Vertica constraint.
