Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Implement msgspec encoding #2541

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

edgarrmondragon
Copy link
Collaborator

@edgarrmondragon edgarrmondragon commented Jul 17, 2024

  • Docs
  • UX. Relying on mro magic is not great, e.g. why SQLiteTap(MsgSpecWriter, SQLTap) and not SQLiteTap(SQLTap, MsgSpecWriter). A better approach might be to make the IO implementation an attribute of the Singer class.
  • Fix tests

📚 Documentation preview 📚: https://meltano-sdk--2541.org.readthedocs.build/en/2541/

Copy link

codspeed-hq bot commented Jul 17, 2024

CodSpeed Performance Report

Merging #2541 will improve performances by ×12

Comparing edgarrmondragon/refactor/msgspec-impl-naive (815f056) with main (a8e6bf0)

Summary

⚡ 2 improvements
✅ 5 untouched benchmarks

Benchmarks breakdown

Benchmark main edgarrmondragon/refactor/msgspec-impl-naive Change
test_bench_deserialize_json 23.8 ms 5.4 ms ×4.4
test_bench_format_message 52.6 ms 4.4 ms ×12

Copy link

codecov bot commented Jul 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.35%. Comparing base (a8e6bf0) to head (815f056).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2541      +/-   ##
==========================================
+ Coverage   91.29%   91.35%   +0.06%     
==========================================
  Files          62       63       +1     
  Lines        5190     5227      +37     
  Branches      669      669              
==========================================
+ Hits         4738     4775      +37     
  Misses        319      319              
  Partials      133      133              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from 97de9f7 to 1d9e947 Compare July 17, 2024 02:00
@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from 1d9e947 to 4febeba Compare July 17, 2024 02:59
@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from 4febeba to cbe10bd Compare July 17, 2024 03:00
@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from 8876b80 to f691f78 Compare July 25, 2024 15:45
@edgarrmondragon edgarrmondragon added this to the 0.41.0 milestone Aug 14, 2024
@BuzzCutNorman
Copy link
Contributor

I found that it helped to add a defualt_output to the for the SingerWriter to use. This allows you to make the write message a little generic.

default_output = sys.stdout.buffer

def write_message(self, message: Message) -> None:
	"""Write a message to stdout.

	Args:
		message: The message to write.
	"""
	self.default_output.write(self.format_message(message))
	self.default_output.flush()

@BuzzCutNorman
Copy link
Contributor

BuzzCutNorman commented Aug 16, 2024

In the json.py file I found to match the msgspec performance suggestions and fit into the framework you put in place I created a function for generating jsonl so I could keep the functionality of seralize_json to return strings. This way the serialize_json can be used in the connector engine creation and also in process_batch_files.

https://jcristharif.com/msgspec/perf-tips.html#line-delimited-json

json.py:

def serialize_json(obj: object, **kwargs: t.Any) -> str:
    """Serialize a dictionary into a line of json.

    Args:
        obj: A Python object usually a dict.
        **kwargs: Optional key word arguments.

    Returns:
        A string of serialized json.
    """
    return encoder.encode(obj).decode()

msg_buffer = bytearray(64)

def serialize_jsonl(obj: object, **kwargs: t.Any) -> bytes:
        """Serialize a dictionary into a line of jsonl.

        Args:
            obj: A Python object usually a dict.
            **kwargs: Optional key word arguments.

        Returns:
            A bytes of serialized json.
        """
        encoder.encode_into(obj, msg_buffer)
        msg_buffer.extend(b"\n")
        return msg_buffer

SingerWriter:

    def serialize_message(self, message: Message) -> str | bytes:
        """Serialize a dictionary into a line of json.

        Args:
            message: A Singer message object.

        Returns:
            A string of serialized json.
        """
        return serialize_jsonl(message.to_dict())

@edgarrmondragon edgarrmondragon added the Release Highlight Call this out in the release notes label Aug 22, 2024
@edgarrmondragon
Copy link
Collaborator Author

edgarrmondragon commented Sep 6, 2024

I found that it helped to add a defualt_output to the for the SingerWriter to use. This allows you to make the write message a little generic.

default_output = sys.stdout.buffer

def write_message(self, message: Message) -> None:
	"""Write a message to stdout.

	Args:
		message: The message to write.
	"""
	self.default_output.write(self.format_message(message))
	self.default_output.flush()

Do you mean in

class MsgSpecWriter(GenericSingerWriter[bytes, Message]):
"""Interface for all plugins writing Singer messages to stdout."""
def serialize_message(self, message: Message) -> bytes: # noqa: PLR6301
"""Serialize a dictionary into a line of json.
Args:
message: A Singer message object.
Returns:
A string of serialized json.
"""
return encoder.encode(message.to_dict())
def write_message(self, message: Message) -> None:
"""Write a message to stdout.
Args:
message: The message to write.
"""
sys.stdout.buffer.write(self.format_message(message) + b"\n")
sys.stdout.flush()

?

@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from 29dea7a to d23a8ab Compare September 6, 2024 18:56
@BuzzCutNorman
Copy link
Contributor

BuzzCutNorman commented Sep 6, 2024

Yes, that is exactly what I meant. Could have definitely been stated clearer on my part😅.

 class MsgSpecWriter(GenericSingerWriter[bytes, Message]): 
     """Interface for all plugins writing Singer messages to stdout.""" 
     
     default_output = sys.stdout.buffer
     
     def serialize_message(self, message: Message) -> bytes:  # noqa: PLR6301 
         """Serialize a dictionary into a line of json. 
  
         Args: 
             message: A Singer message object. 
  
         Returns: 
             A string of serialized json. 
         """ 
         return serialize_jsonl(message.to_dict()) 
  
     def write_message(self, message: Message) -> None: 
         """Write a message to stdout. 
  
         Args: 
             message: The message to write. 
         """ 
 	self.default_output.write(self.format_message(message))
	self.default_output.flush()

@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from d23a8ab to 3169b58 Compare September 6, 2024 19:15
@edgarrmondragon edgarrmondragon changed the title refactor: Implement (naive) msgspec encoding refactor: Implement msgspec encoding Sep 6, 2024
@edgarrmondragon
Copy link
Collaborator Author

Naive of me to think I could get this across in 1/2 a day of work 😅. I'll come back to this later, there's plenty of time until the planned release date.

@BuzzCutNorman
Copy link
Contributor

Like the pun 😊. Great dad joke material. Kind an inside joke now since you dropped (naive) from the title of the PR.

@edgarrmondragon
Copy link
Collaborator Author

Might have to punt this if jcrist/msgspec#711 doesn't get merged before we're ready to officially declare Python 3.13 support

@edgarrmondragon edgarrmondragon modified the milestones: v0.41.0, v0.42.0 Oct 2, 2024
@edgarrmondragon edgarrmondragon modified the milestones: v0.42.0, v0.43.0 Oct 9, 2024
@edgarrmondragon edgarrmondragon modified the milestones: v0.43, v0.44 Nov 19, 2024
@BuzzCutNorman
Copy link
Contributor

I found that it helped to add a defualt_output to the for the SingerWriter to use. This allows you to make the write message a little generic.

 class MsgSpecWriter(GenericSingerWriter[bytes, Message]): 
     """Interface for all plugins writing Singer messages to stdout.""" 
     
     default_output = sys.stdout.buffer
     
     def serialize_message(self, message: Message) -> bytes:  # noqa: PLR6301 
         """Serialize a dictionary into a line of json. 
  
         Args: 
             message: A Singer message object. 
  
         Returns: 
             A string of serialized json. 
         """ 
         return serialize_jsonl(message.to_dict()) 
  
     def write_message(self, message: Message) -> None: 
         """Write a message to stdout. 
  
         Args: 
             message: The message to write. 
         """ 
 	self.default_output.write(self.format_message(message))
	self.default_output.flush()

Well, found that making this generic by using default_output was a bad move. It is best to stick with sys.stdout or sys.stdout.buffer. I couldn't get the built-in SDK Tap tests to work until I changed back to using sys.stdout.buffer in a tap utilizing msgspec. I borrowed your use of io.TextIOWrapper(buffer=io.BytesIO()) in the test_msgspec.py file that is part of this PR and added it to singer_sdk.testing.runners.TapTestRunner._execute_sync so the redirect_stdout doesn't error because of stdout_buf = io.StringIO() lack of a .buffer to write to.

    def _execute_sync(self) -> tuple[str, str]:
        """Invoke a Tap object and return STDOUT and STDERR results in StringIO buffers.

        Returns:
            A 2-item tuple with StringIO buffers from the Tap's output: (stdout, stderr)
        """
        stdout_buf = io.TextIOWrapper(buffer=io.BytesIO())
        stderr_buf = io.TextIOWrapper(buffer=io.BytesIO())
        with redirect_stdout(stdout_buf), redirect_stderr(stderr_buf):
            self.run_sync_dry_run()
        stdout_buf.seek(0)
        stderr_buf.seek(0)
        return stdout_buf.read(), stderr_buf.read()

In summary I had to change the tap's overwrite of write_message to utilize sys.stdout.buffer.write() and sys.stdout.flush(). Then changed singer_sdk.testing.runners.TapTestRunner._execute_sync to use io.TextIOWrapper(buffer=io.BytesIO()) instead of io.StringIO() for stdout_buff and stderr_buff. The Singer-SDK built in tap tests ran correctly after the change to the tap and installing a local development branch of the singer-sdk with the change to TapTestRunner._execute_sync in the virtual environment I ran the tap test from.

@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from 712466f to 07f348e Compare December 28, 2024 19:24
@edgarrmondragon
Copy link
Collaborator Author

Ok, the tests are passing.

Now I want to think of how to make it easy and straightforward for a developer to use msgspec as the SerDe layer, and also keep the door open to the user being the one deciding which serialization layer to use.

@edgarrmondragon edgarrmondragon self-assigned this Jan 13, 2025
@edgarrmondragon edgarrmondragon modified the milestones: v0.44, v0.45 Jan 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants