When searching for something in a stream, there's a little gotcha when it comes to chunk boundaries. Say we were searching for "service workers". Because things arrive in chunks, one chunk could be "this next section is about se" and the next one "rvice workers". Neither chunk contains "service workers", so a naive check of each chunk isn't good enough.
To work around this, a buffer is kept. The buffer needs to hold at least the last (search term length - 1) characters seen so far, to avoid missing matches across boundaries.
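As a rough sketch of the idea, here is a per-chunk check next to a buffered one. The function names naiveIncludes and bufferedIncludes are made up for this illustration, and the chunks are plain strings in an array rather than a real stream.

```javascript
// Naive check: look at each chunk on its own.
function naiveIncludes(chunks, term) {
  return chunks.some((chunk) => chunk.includes(term));
}

// Buffered check: carry the last term.length - 1 characters over to the
// next chunk, so a match split across a boundary is still seen.
function bufferedIncludes(chunks, term) {
  let buffer = '';
  for (const chunk of chunks) {
    buffer += chunk;
    if (buffer.includes(term)) return true;
    // Keep only the tail that could still be the start of a match.
    // (slice(-0) would return the whole string, hence the guard.)
    buffer = term.length > 1 ? buffer.slice(-(term.length - 1)) : '';
  }
  return false;
}

const chunks = ['this next section is about se', 'rvice workers'];
console.log(naiveIncludes(chunks, 'service workers')); // false
console.log(bufferedIncludes(chunks, 'service workers')); // true
```

Keeping only the tail means memory use stays bounded by the search term length plus one chunk, however long the stream is.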
In addition to the above, there's another gotcha:
Say we were replacing "lol" with "goal" in "lolol". We maintain a buffer of two chars because that's "lol".length - 1, meaning we don't miss matches between boundaries. It could go like this:
- Buffer is ""
- Chunk arrives "lolol", add to buffer
- Buffer is "lolol"
- Replace "lol" with "goal" in buffer
- Buffer is "goalol"
- Send all but the last "lol".length - 1 chars of buffer - "goal"
- Set buffer to the remainder of the previous step - "ol"
- Incoming stream ends
- Send remaining buffer "ol"
This sends "goalol", which is correct. But what if:
- Buffer is ""
- Chunk arrives "lol", add to buffer
- Buffer is "lol"
- Replace "lol" with "goal" in buffer
- Buffer is "goal"
- Send all but the last "lol".length - 1 chars of buffer - "go"
- Set buffer to the remainder of the previous step - "al"
- Chunk arrives "ol", add to buffer
- Buffer is "alol"
- Replace "lol" with "goal" in buffer
- Buffer is "agoal"
- Send all but the last "lol".length - 1 chars of buffer - "ago"
- Set buffer to the remainder of the previous step - "al"
- Incoming stream ends
- Send remaining buffer "al"
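This sends "goagoal", which is wrong. The correct output is "goalol", but the trailing "l" of the replacement "goal" was kept in the buffer and joined the next chunk's "ol" to form a second, bogus match. To fix this, the buffer should be flushed until the end of the last replacement, or up to position buffer.length - ("lol".length - 1), whichever's greater. So: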
- Buffer is ""
- Chunk arrives "lol", add to buffer
- Buffer is "lol"
- Replace "lol" with "goal" in buffer
- Buffer is "goal"
- Send buffer until the end of the last replacement, or buffer.length - ("lol".length - 1), whichever's greater - "goal"
- Set buffer to the remainder of the previous step - ""
- Chunk arrives "ol", add to buffer
- Buffer is "ol"
- Replace "lol" with "goal" in buffer
- Buffer is "ol"
- Send buffer until the end of the last replacement, or buffer.length - ("lol".length - 1), whichever's greater - ""
- Set buffer to the last "lol".length - 1 chars of buffer - "ol"
- Incoming stream ends
- Send remaining buffer "ol"
This sends "goalol", which is correct.
Phew! Because of this, doing a regex replacement in a stream is tricky: a regex such as /clo+ud/ can match strings of varying lengths, so there's no obvious buffer size to choose.
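Going back to the fixed-string case, the two flushing strategies walked through above can be sketched roughly as follows. This is an illustration under stated assumptions, not necessarily how any particular library does it: the names naiveReplace and replaceInChunks are hypothetical, chunks are plain strings in an array, and each function returns the list of pieces it would send downstream.

```javascript
// Naive strategy: replace in the buffer, then always hold back the last
// search.length - 1 characters, even if they came from a replacement.
function naiveReplace(chunks, search, replacement) {
  const sent = [];
  let buffer = '';
  for (const chunk of chunks) {
    buffer = (buffer + chunk).replaceAll(search, replacement);
    const flushTo = Math.max(0, buffer.length - (search.length - 1));
    sent.push(buffer.slice(0, flushTo));
    buffer = buffer.slice(flushTo);
  }
  sent.push(buffer); // Incoming stream ended: send the remaining buffer.
  return sent;
}

// Corrected strategy: flush up to the end of the last replacement, or
// everything except the last search.length - 1 characters, whichever
// is greater.
function replaceInChunks(chunks, search, replacement) {
  if (search === '') throw new Error('search must not be empty');
  const sent = [];
  let buffer = '';
  for (const chunk of chunks) {
    buffer += chunk;
    // Replace every match, remembering where the last replacement ends.
    let result = '';
    let lastReplacementEnd = 0;
    let from = 0;
    let index;
    while ((index = buffer.indexOf(search, from)) !== -1) {
      result += buffer.slice(from, index) + replacement;
      lastReplacementEnd = result.length;
      from = index + search.length;
    }
    result += buffer.slice(from);

    const flushTo = Math.max(
      lastReplacementEnd,
      result.length - (search.length - 1)
    );
    sent.push(result.slice(0, flushTo));
    // Only text that has not been replaced is ever kept for the next chunk.
    buffer = result.slice(flushTo);
  }
  sent.push(buffer); // Incoming stream ended: send the remaining buffer.
  return sent;
}

console.log(naiveReplace(['lol', 'ol'], 'lol', 'goal').join('')); // "goagoal"
console.log(replaceInChunks(['lol', 'ol'], 'lol', 'goal').join('')); // "goalol"
console.log(replaceInChunks(['lolol'], 'lol', 'goal')); // ["goal", "ol"]
```

The sketch pushes empty strings when there is nothing to flush, to mirror the steps above; a real stream would probably skip those.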