Searching a stream

When searching for something in a stream, there's a little gotcha at chunk boundaries. Say we were searching for "service workers". Because things arrive in chunks, one chunk could be "this next section is about se" and the next one "rvice workers". Neither chunk contains "service workers", so naively checking each chunk on its own isn't good enough.

To work around this, a buffer is kept. After each chunk, the buffer needs to retain at least the last (search term length - 1) characters, so a match split across a boundary isn't missed.
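As a minimal sketch of that buffering idea, assuming the chunks are plain strings: `streamIncludes` is a hypothetical helper name, not part of any streams API.

```javascript
// Hypothetical helper: reports whether `term` occurs anywhere in a sequence
// of string chunks, including across chunk boundaries.
function streamIncludes(chunks, term) {
  if (term.length === 0) return true;

  // Minimum number of trailing characters to carry over between chunks.
  const keep = term.length - 1;
  let buffer = "";

  for (const chunk of chunks) {
    buffer += chunk;
    if (buffer.includes(term)) return true;

    // Keep only the tail that could still be the start of a match.
    // (slice(-0) would return the whole string, so handle keep === 0.)
    buffer = keep > 0 ? buffer.slice(-keep) : "";
  }
  return false;
}

const chunks = ["this next section is about se", "rvice workers"];
const naive = chunks.some((chunk) => chunk.includes("service workers"));
const buffered = streamIncludes(chunks, "service workers");
console.log(naive, buffered);
```

With the two chunks from the example, the per-chunk check misses the match and the buffered check finds it.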

Search & replace in a stream

In addition to the above, there's another gotcha:

Say we were replacing "lol" with "goal" in "lolol". We maintain a buffer of two chars because that's "lol".length - 1, meaning we don't miss matches between boundaries. It could go like this:

Buffer is ""
Chunk arrives "lolol", add to buffer
Buffer is "lolol"
Replace "lol" with "goal" in buffer
Buffer is "goalol"
Send buffer up to position buffer.length - ("lol".length - 1) - "goal"
Set buffer to the remainder of the previous step - "ol"
Incoming stream ends
Send remaining buffer "ol"

This sends "goalol", which is correct. But what if:

Buffer is ""
Chunk arrives "lol", add to buffer
Buffer is "lol"
Replace "lol" with "goal" in buffer
Buffer is "goal"
Send buffer up to position buffer.length - ("lol".length - 1) - "go"
Set buffer to the remainder of the previous step - "al"
Chunk arrives "ol", add to buffer
Buffer is "alol"
Replace "lol" with "goal" in buffer
Buffer is "agoal"
Send buffer up to position buffer.length - ("lol".length - 1) - "ago"
Set buffer to the remainder of the previous step - "al"
Incoming stream ends
Send remaining buffer "al"

This sends "goagoal", which is wrong: part of the replacement text ("al") was held back in the buffer and then matched again. To fix this, the buffer should be flushed up to the end of the last replacement, or up to position buffer.length - ("lol".length - 1), whichever's greater. So:

Buffer is ""
Chunk arrives "lol", add to buffer
Buffer is "lol"
Replace "lol" with "goal" in buffer
Buffer is "goal"
Send buffer until the end of the last replacement, or buffer.length - ("lol".length - 1), whichever's greater - "goal"
Set buffer to the remainder of the previous step - ""
Chunk arrives "ol", add to buffer
Buffer is "ol"
Replace "lol" with "goal" in buffer
Buffer is "ol"
Send buffer until the end of the last replacement, or buffer.length - ("lol".length - 1), whichever's greater - ""
Set buffer to the last "lol".length - 1 chars of buffer - "ol"
Incoming stream ends
Send remaining buffer "ol"
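
The steps above can be sketched roughly as follows, assuming string chunks and a plain string search term. `streamReplace` and its `naive` flag are hypothetical names for illustration. The flag switches back to the fixed hold-back rule from the earlier walkthroughs so the two behaviours can be compared.

```javascript
// Hypothetical helper: replaces every `search` with `replacement` across
// string chunks and returns the pieces that would be sent downstream.
function streamReplace(chunks, search, replacement, naive = false) {
  if (search.length === 0) throw new Error("search term must not be empty");

  const keep = search.length - 1; // hold-back needed to catch boundary matches
  const sent = [];
  let buffer = "";

  for (const chunk of chunks) {
    buffer += chunk;

    // Replace left to right, remembering where the last replacement ends.
    let result = "";
    let lastEnd = 0; // end of the last replacement, as an index into `result`
    let from = 0;
    let index;
    while ((index = buffer.indexOf(search, from)) !== -1) {
      result += buffer.slice(from, index) + replacement;
      lastEnd = result.length;
      from = index + search.length;
    }
    result += buffer.slice(from);

    // Naive rule: always hold back `keep` chars, even replacement text.
    // Fixed rule: never hold back anything before the last replacement's end.
    const flushTo = naive
      ? Math.max(0, result.length - keep)
      : Math.max(lastEnd, result.length - keep);

    if (flushTo > 0) sent.push(result.slice(0, flushTo));
    buffer = result.slice(flushTo);
  }

  // Incoming stream ended: send whatever is left.
  if (buffer.length > 0) sent.push(buffer);
  return sent;
}

const fixedOutput = streamReplace(["lol", "ol"], "lol", "goal").join("");
const naiveOutput = streamReplace(["lol", "ol"], "lol", "goal", true).join("");
console.log(fixedOutput, naiveOutput);
```

For the chunks "lol" then "ol", the fixed rule yields "goalol" and the naive rule reproduces the incorrect "goagoal".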

Phew! Because of this, doing a regex replacement in a stream is tricky: a regex can match varying lengths, e.g. /clo+ud/, so there's no obvious buffer size to choose.
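
A small illustration of why a fixed buffer size doesn't carry over to regex, using the pattern from the text. The sample input is made up for the demonstration.

```javascript
// The same pattern matches strings of different lengths, so there is no
// single "search.length - 1" to use as the hold-back.
const pattern = /clo+ud/g;
const samples = ["cloud", "cl" + "o".repeat(3) + "ud", "cl" + "o".repeat(8) + "ud"];
const text = samples.join(" ");

const matchLengths = [...text.matchAll(pattern)].map((match) => match[0].length);
console.log(matchLengths);

// A match can also span more than two chunks, which a hold-back sized for
// the shortest match ("cloud".length - 1) would not cover.
const regexChunks = ["cl", "ooo", "ooo", "ud"];
const joinedMatches = /clo+ud/.test(regexChunks.join(""));
const anyChunkMatches = regexChunks.some((chunk) => /clo+ud/.test(chunk));
console.log(joinedMatches, anyChunkMatches);
```

Each sample matches with a different length, and in the second part the full text matches even though no individual chunk does.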
