Tdl 14376 pagination failure #50
if stream == "sadsheet-pagination":
    # verify the data for the "sadsheet-pagination" stream is free of any duplicates or breaks by checking
    # our fake pk value ('id')
Add a detailed comment about the data present in this sheet.
Added
@@ -514,7 +514,7 @@ def sync(client, config, catalog, state):
     from_row=from_row,
     columns=columns,
     sheet_data_rows=sheet_data_rows)
-if row_num < to_row:
+if not sheet_data_rows: # If a whole blank page found, then stop looping.
@prijendev Can you please explain what the earlier condition `row_num < to_row` meant?
Here, `to_row` is initialized to the smaller of 200 (the max page size) and `max_row`, and then grows by 200 per page until it reaches `max_row`. `from_row` starts at 2 and, from the next page on, is set to `to_row` + 1 (201 on the second page). `row_num` is `from_row` plus the number of records returned in the response. The earlier condition checked whether `row_num` was less than `to_row` and, if so, set `is_last_row` to true. But the API does not return trailing empty rows in the response.
For example, suppose rows 199 and 200 are empty and the sheet has 400 rows in total. Then, in the 1st iteration:
`to_row` = 200
`from_row` = 2
`row_num` = 2 + 197 = 199 (the 1st row contains the header)
So the earlier condition becomes true and the loop breaks before the second page is read.
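The walkthrough above can be reproduced with a small simulation. This is a minimal sketch, not the tap's actual code: `fetch_page`, `paginate`, and the dictionary-based `sheet` are hypothetical stand-ins, the loop termination at `max_row` is an assumption, and the only API behavior modeled is that trailing empty rows are dropped from a response.

```python
PAGE_SIZE = 200  # max page size quoted in the thread


def fetch_page(sheet, from_row, to_row):
    """Hypothetical stand-in for the API call: trailing empty rows are dropped."""
    rows = [sheet.get(r) for r in range(from_row, to_row + 1)]
    while rows and rows[-1] is None:
        rows.pop()
    return rows


def paginate(sheet, max_row, stop_on_blank_page):
    """Count synced data rows using either the old or the new stop condition."""
    from_row, to_row = 2, min(PAGE_SIZE, max_row)  # row 1 is the header
    synced = 0
    while True:
        page = fetch_page(sheet, from_row, to_row)
        synced += sum(row is not None for row in page)
        row_num = from_row + len(page)
        if stop_on_blank_page:
            is_last_row = not page           # new: stop on a whole blank page
        else:
            is_last_row = row_num < to_row   # old: stop on a short page
        if is_last_row or to_row >= max_row:
            return synced
        from_row, to_row = to_row + 1, min(to_row + PAGE_SIZE, max_row)


# 400-row sheet: header in row 1, rows 199 and 200 empty.
sheet = {r: f"row-{r}" for r in range(2, 401) if r not in (199, 200)}

old_count = paginate(sheet, 400, stop_on_blank_page=False)
new_count = paginate(sheet, 400, stop_on_blank_page=True)
print(old_count, new_count)
```

Under these assumptions the old condition stops after the first page with 197 rows synced, while the new condition goes on to the second page and syncs all 397 data rows.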
- Please add comments in the code explaining why `if row_num < to_row` is replaced with the check on `sheet_data_rows`.
- For example, if a sheet contains only 200 rows and rows 99 and 100 are empty, it will continue to process the remaining rows because the old condition has been removed, and the new empty-`sheet_data_rows` condition will check whether the whole sheet is empty. Is that correct?
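The scenario in the second bullet can be checked with a short sketch. It assumes that empty rows in the middle of a page are still returned (as empty rows) and that only trailing empty rows are omitted; `api_rows`, `first_page`, `old_stop`, and `new_stop` are hypothetical helpers written for illustration, not functions from the tap.

```python
def api_rows(rows):
    """Model of the response: trailing empty rows omitted, interior ones kept."""
    trimmed = list(rows)
    while trimmed and not trimmed[-1]:
        trimmed.pop()
    return trimmed


def first_page(empty_rows, from_row=2, to_row=200):
    """Build rows from_row..to_row of a hypothetical sheet and apply the model."""
    raw = [[] if r in empty_rows else [f"value-{r}"]
           for r in range(from_row, to_row + 1)]
    return api_rows(raw)


def old_stop(page, from_row=2, to_row=200):
    """Old condition: stop when fewer rows than requested came back."""
    return from_row + len(page) < to_row


def new_stop(page):
    """New condition: stop only when the whole page is blank."""
    return not page


mid_blank = first_page({99, 100})     # empty rows in the middle of the page
tail_blank = first_page({199, 200})   # empty rows at the end of the page

print(len(mid_blank), old_stop(mid_blank), new_stop(mid_blank))
print(len(tail_blank), old_stop(tail_blank), new_stop(tail_blank))
```

In this model, blank rows in the middle of a page never trigger either condition; only blank rows at the end of a page shorten the response, and only the old condition treats that as the end of the data.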
We are striving to write up defects found by QA in a way that makes reproducing issues as simple as uncommenting/commenting lines marked with BUG. Unless there is an explicit TODO left in the test, or a bullet in the DoD of the card, there shouldn't be any missing test cases. I believe that is the case here and the test additions can be removed. If you find this is not the case, please raise this to us so we can adjust our bug reporting process.
@@ -61,6 +61,43 @@ def test_run(self):
 # verify that we can paginate with all fields selected
 self.assertGreater(record_count_by_stream.get(stream, 0), self.API_LIMIT)

+record_count_sync = record_count_by_stream.get(stream, 0)
No test changes should have been needed here besides adding the failing sheet back to the test. I don't think these test additions are adding any test coverage. The sheets have been set up in a way that relies on a specific column with incrementing values, which is compared against the sdc column. See `fake_pk_list` above.
Removed the extra assertion. The current test is written in such a way that it covers only the `Pagination` stream.
tap-google-sheets/tests/test_google_sheets_pagination.py
Lines 80 to 87 in 25136fc
# verify the data for the "Pagination" stream is free of any duplicates or breaks by checking
# our fake pk value ('id')
# THIS ASSERTION CAN BE MADE BECAUSE WE SETUP DATA IN A SPECIFIC WAY. DONT COPY THIS
self.assertEqual(list(range(1, 239)), fake_pk_list)
# verify the data for the "Pagination" stream is free of any duplicates or breaks by checking
# the actual primary key values (__sdc_row)
self.assertEqual(list(range(2, 240)), actual_pk_list)
If we add back `sadsheet-pagination`, then we have to write assertions for that sheet as well.
That comment is misleading; the test is set up to apply to both sheets:
testable_streams = {"Pagination", "sadsheet-pagination"}
Additionally, there is a large comment at the top of this test that should be removed:
tap-google-sheets/tests/test_google_sheets_pagination.py
Lines 9 to 13 in 25136fc
# BUG_TDL-14376 | https://jira.talendforge.org/browse/TDL-14376
# Expectation: Tap will pick up next page (200 rows) iff there is a non-null value on that page
# We observed a BUG where the tap does not paginate properly on sheets where the last two rows in a batch
# are empty values. The tap does not capture anything on the subsequent pages when this happens.
#
Removed the large comment at the top of this test. The `sadsheet-pagination` sheet has a total of 238 rows, of which rows 199 and 200 are empty, whereas the `Pagination` sheet has a total of 239 rows with no empty rows.
So, because two rows are empty in `sadsheet-pagination`, we need to write a separate assertion that excludes rows 199 and 200.
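A separate expected list of this kind might be built as follows. This is only a sketch: `expected_sdc_rows` is a hypothetical helper, and it assumes that the quoted row totals include the header in row 1 and that the tap emits no record for an empty row.

```python
def expected_sdc_rows(total_rows, empty_rows=()):
    """Row numbers expected in __sdc_row: every row after the header, minus empty ones."""
    skipped = set(empty_rows)
    return [row for row in range(2, total_rows + 1) if row not in skipped]


# "Pagination": 239 rows, none empty (matches list(range(2, 240)) in the test).
pagination_rows = expected_sdc_rows(239)

# "sadsheet-pagination": 238 rows, with rows 199 and 200 empty.
sadsheet_rows = expected_sdc_rows(238, empty_rows=(199, 200))
```

In the test, a list like `sadsheet_rows` could then be compared against `actual_pk_list` for the `sadsheet-pagination` stream in the same way the existing assertion does for `Pagination`.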
Responded to comments
* type of emailaddress corrected (#53)
* Tdl 16079 check best practices (#51)
* Initial commit for best practice update
* updated setup.py and start_date test
* Updated test cases
* Updated start_date test case
* Updated start_date test case
* Updated comment
* Revert back test case changes
* Added new line
* Tdl 14376 pagination failure (#50)
* Initial commit for pagination failer
* Fixed pagination test cases
* Added comments
* Added detail comment into the code
* Removed unnecessary comment
* Removed unnecessary assertion
* Removed extra comment
* added comment for bug (#49)
* TDL-14475 added unsupported feature and unittests (#47)
* added unsupported feature and unittests
* added code comments
* fixed indent
* fixed indentation
* resolved a bug of writing md when 2 consecutive empty headers
* updated the logic for consecutive empty headers
* rsolved comments
* added test case for consecutive empty headers
* added comments
* resolved circleci errors
* resolved comments
Co-authored-by: namrata270998 <namrata.brahmbhatt@crestdatasystems.com>
Co-authored-by: prijendev <prijen.khokhani@crestdatasys.com>
* TDL-14397-Add skipped log when first row is empty (#46)
* added logger message and unittests
* added code comments
* changed the logger message and logic
* resolved comments
Co-authored-by: namrata270998 <namrata.brahmbhatt@crestdatasystems.com>
Co-authored-by: prijendev <prijen.khokhani@crestdatasys.com>
* TDL-16054 added code comments (#52)
* TDL-16054 added code comments
* rsolved comments
Co-authored-by: namrata270998 <namrata.brahmbhatt@crestdatasystems.com>
Co-authored-by: prijendev <prijen.khokhani@crestdatasys.com>
* Tdl 16280 implement request timeout (#54)
* TDL-16280 added request timeout
* TDL-16280: Added factor 3 to add more wait time between 2 calls
* TDL-16280: Updated Connection error as it wasn't defined.
* added backoff for access token
* updated readme
* updated request timeout and added jitter
* added comment for jitter
* added code coverage
* added testcase for connection error
* addd request timeout in config example
* updated the json example
* removed the client initialization outside with
Co-authored-by: dbshah1212 <dhruvin.shah@crestdatasys.com>
Co-authored-by: prijendev <prijen.khokhani@crestdatasys.com>
Co-authored-by: namrata270998 <75604662+namrata270998@users.noreply.github.com>
Co-authored-by: namrata270998 <namrata.brahmbhatt@crestdatasystems.com>
Co-authored-by: dbshah1212 <dhruvin.shah@crestdatasys.com>
Description of change
Note: A unit test case was not possible for this PR. We replaced just one condition, and the method containing it is very large and calls many other methods. The method is `sync`, inside the `sync.py` module, so mocking the other methods was not possible because the method and the module share the same name. So, we skipped it.
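If the obstacle is that the package attribute `sync` resolves to the function rather than to the `sync.py` module (an assumption about the package layout that this thread does not confirm), one possible workaround is to reach the module object through `sys.modules` and patch it with `mock.patch.object`. The sketch below reproduces the name clash with a throwaway package; `demo_pkg`, `get_data`, and the rest are hypothetical names, and it has not been tried against this repository.

```python
import sys
import types
from unittest import mock

# Build a throwaway package whose layout mimics the name clash (hypothetical names).
pkg = types.ModuleType("demo_pkg")
mod = types.ModuleType("demo_pkg.sync")


def get_data():
    return "real"


def sync():
    # Looks up get_data on the module at call time, as a module-level call would.
    return mod.get_data()


mod.get_data = get_data
mod.sync = sync
pkg.sync = sync  # the package attribute now points at the function, not the module
sys.modules["demo_pkg"] = pkg
sys.modules["demo_pkg.sync"] = mod

# Patch through sys.modules so the module object is reached despite the shadowing.
with mock.patch.object(sys.modules["demo_pkg.sync"], "get_data", return_value="mocked"):
    patched_result = sync()

restored_result = sync()  # the original helper is back once the context exits
print(patched_result, restored_result)
```

Whether this is enough to unit-test the changed condition depends on how many collaborators the real `sync` method calls, which the note above describes as the larger difficulty.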
Manual QA steps
Risks
Rollback steps