feat: Output record count metric from batch files insert #267
Conversation
target_snowflake/sinks.py (outdated):

```python
    full_table_name=full_table_name,
    schema=self.schema,
    sync_id=sync_id,
    file_format=file_format,
)

with self.record_counter_metric as counter:
    counter.increment(record_count)
```
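For context, `record_counter_metric` comes from the Singer SDK's metrics module: used as a context manager, the counter logs a `METRIC` line when the block exits. A minimal sketch of the pattern, assuming the `singer_sdk.metrics.record_counter` helper and an illustrative stream name:

```python
from singer_sdk import metrics

# Sketch of the SDK metrics pattern this change relies on. On exiting
# the with-block, the counter logs something like:
#   INFO METRIC: {"type": "counter", "metric": "record_count", "value": 42, ...}
with metrics.record_counter(stream="users") as counter:
    counter.increment(42)
```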
Wouldn't this add records to the count twice? Once when the record is processed in https://github.com/meltano/sdk/blob/409a40b48c442e1382611d3a69b2f95df2e073d3/singer_sdk/target_base.py#L362 and again when they're batched in target-snowflake/target_snowflake/sinks.py (lines 151 to 154 in 8919a66):

```python
self.insert_batch_files_via_internal_stage(
    full_table_name=full_table_name,
    files=files,
)
```
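To make the concern concrete, here is a simplified, self-contained illustration (not actual SDK or target code) of how the same records would pass through two increment sites:

```python
# Stand-ins for the two counting sites; the real code lives in the SDK's
# target_base and this target's sink, respectively.
count = 0

def process_record_message(record: dict) -> dict:
    """Mimics the SDK counting each RECORD message as it is processed."""
    global count
    count += 1  # first increment, done by the SDK core
    return record

def insert_batch_files(records: list[dict]) -> None:
    """Mimics counting again when the records are written out as batch files."""
    global count
    count += len(records)  # second increment, as added by this PR

records = [process_record_message({"id": i}) for i in range(3)]
insert_batch_files(records)
print(count)  # 6, not 3: every record was counted twice
```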
Hmm, I guess so. The intention of this change was to work with `BATCH` messages, but you're referring to when a user supplies `batch_config` on the target side specifically to batch together `RECORD` data into files for insert via internal stage? Just to check my understanding: if `batch_config` is supplied on the tap side, it emits a `BATCH` message and then doesn't hit this issue?
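For reference, a tap-side `batch_config` is supplied in the tap's JSON settings; per the SDK's batch messaging docs it looks roughly like this (the storage root and prefix are illustrative):

```json
{
  "batch_config": {
    "encoding": {
      "format": "jsonl",
      "compression": "gzip"
    },
    "storage": {
      "root": "file:///tmp/batches",
      "prefix": "batch-"
    }
  }
}
```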
Maybe the full context of the function I linked is useful, target-snowflake/target_snowflake/sinks.py (lines 109 to 156 in 7c9a1fb):

```python
def bulk_insert_records(
    self,
    full_table_name: str,
    schema: dict,
    records: t.Iterable[dict[str, t.Any]],
) -> int | None:
    """Bulk insert records to an existing destination table.

    The default implementation uses a generic SQLAlchemy bulk insert operation.
    This method may optionally be overridden by developers in order to provide
    faster, native bulk uploads.

    Args:
        full_table_name: the target table name.
        schema: the JSON schema for the new table, to be used when inferring column
            names.
        records: the input records.

    Returns:
        True if table exists, False if not, None if unsure or undetectable.
    """
    # prepare records for serialization
    processed_records = (
        conform_record_data_types(
            stream_name=self.stream_name,
            record=rcd,
            schema=schema,
            level="RECURSIVE",
            logger=self.logger,
        )
        for rcd in records
    )
    # serialize to batch files and upload
    # TODO: support other batchers
    batcher = JSONLinesBatcher(
        tap_name=self.target.name,
        stream_name=self.stream_name,
        batch_config=self.batch_config,
    )
    batches = batcher.get_batches(records=processed_records)
    for files in batches:
        self.insert_batch_files_via_internal_stage(
            full_table_name=full_table_name,
            files=files,
        )
    # if records list, we can quickly return record count.
    return len(records) if isinstance(records, list) else None
```
`RECORD` messages are processed by this logic, even with the default `batch_config`.
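For comparison, a `BATCH` message received from a tap points at already-serialized files and reaches `insert_batch_files_via_internal_stage` without going through `bulk_insert_records` above. Per the SDK docs it looks roughly like this (stream name and file path illustrative):

```json
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {"format": "jsonl", "compression": "gzip"},
  "manifest": ["file:///tmp/batches/batch-users-0001.jsonl.gz"]
}
```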
I see - I wasn't aware there was a default `batch_config` here. So processing of `RECORD` and `BATCH` messages both end up calling `insert_batch_files_via_internal_stage`, where I've made this change, except `RECORD` messages are already counted by the SDK before they are batched, so they would now get counted twice.

Setting that issue aside, how do you feel in principle about this target emitting a record count when it receives a `BATCH` message? Is that OK, or does it conflate two concepts?
Force-pushed from 0dddac6 to 684d362: `RECORD` messages are already counted by the SDK, and processing of them here goes through `insert_batch_files_via_internal_stage` by default.
Force-pushed from 684d362 to fb4cc82.
```python
with self.record_counter_metric as counter:
    counter.increment(record_count)
```
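How the final revision derives `record_count` from the staged files isn't shown in this excerpt, but as a minimal self-contained sketch, assuming the batch files are gzip-compressed JSONL (one record per line), counting could look like:

```python
import gzip

def count_jsonl_records(path: str) -> int:
    """Count records in a gzip-compressed JSONL batch file, one per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)

# e.g. record_count = sum(count_jsonl_records(f) for f in files)
```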
Yeah, this should only count records inserted from batch files!

Thanks @ReubenFrankel!
A bit of an opinionated change, since it might conflate batches and records, but it has been helpful for us to know how many rows (records) were created or updated from batch processing.