perf: Cache CSV stream schema (#363)
The stream's `schema` property is accessed multiple times for each
record (see `Stream._generate_record_messages()` for instance). Since
the schema should be static, this change caches it, resulting in a
significant performance improvement.

Testing with a sample 2,000,000 row dataset (`people-2000000` from
https://github.com/datablist/sample-csv-files) reduced the read time
from 441 seconds to 48 seconds, roughly a 9x improvement in throughput.
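
To illustrate why the change helps, here is a minimal sketch that counts how often a schema is rebuilt under `@property` versus `functools.cached_property`. The classes `PropertyStream` and `CachedStream`, the `builds` counter, and the `count_builds` helper are hypothetical stand-ins, not code from `tap_csv`; the real `schema` property inspects the CSV file, which is far more expensive than the dictionary literal used here.

```python
from functools import cached_property


class PropertyStream:
    """Hypothetical stream that rebuilds its schema on every access."""

    def __init__(self) -> None:
        self.builds = 0  # number of times the schema body has run

    @property
    def schema(self) -> dict:
        self.builds += 1  # stands in for expensive schema detection
        return {"properties": {"id": {"type": "integer"}}}


class CachedStream:
    """Hypothetical stream that builds its schema once per instance."""

    def __init__(self) -> None:
        self.builds = 0

    @cached_property
    def schema(self) -> dict:
        self.builds += 1  # runs only on the first access
        return {"properties": {"id": {"type": "integer"}}}


def count_builds(stream, records: int) -> int:
    """Access stream.schema once per simulated record and report the build count."""
    for _ in range(records):
        _ = stream.schema
    return stream.builds


uncached_builds = count_builds(PropertyStream(), 1000)
cached_builds = count_builds(CachedStream(), 1000)
print(uncached_builds, cached_builds)  # prints "1000 1"

# The cached value lives in the instance __dict__; deleting the attribute
# forces a rebuild on the next access if the schema ever needs refreshing.
refreshed = CachedStream()
_ = refreshed.schema
del refreshed.schema
_ = refreshed.schema
refreshed_builds = refreshed.builds  # 2
```

The trade-off is that `cached_property` assumes the value does not change for the lifetime of the instance, which matches the commit's premise that the schema is static.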

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
atl-ggregson and pre-commit-ci[bot] authored Jan 16, 2025
1 parent dfd07ea commit 142afd1
Showing 2 changed files with 6 additions and 3 deletions.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "tap-csv"
-version = "1.1.0"
+version = "1.2.0"
description = "Singer tap for CSV, built with the Meltano SDK for Singer Taps."
authors = ["Pat Nadolny"]
keywords = [
7 changes: 5 additions & 2 deletions tap_csv/client.py
@@ -6,6 +6,7 @@
import os
import typing as t
from datetime import datetime, timezone
from functools import cached_property

from singer_sdk import typing as th
from singer_sdk.streams import Stream
@@ -121,12 +122,14 @@ def get_rows(self, file_path: str) -> t.Iterable[list]:
with open(file_path, encoding=encoding) as f:
yield from csv.reader(f, dialect="tap_dialect")

-@property
+@cached_property
def schema(self) -> dict:
"""Return dictionary of record schema.
Dynamically detect the json schema for the stream.
This is evaluated prior to any records being retrieved.
This property is accessed multiple times for each record
so it's important to cache the result.
"""
properties: list[th.Property] = []
self.primary_keys = self.file_config.get("keys", [])