This repository has been archived by the owner on May 25, 2022. It is now read-only.

fix data loss between logrotate #140

Closed
wants to merge 3 commits

Conversation

rockb1017
Contributor

@rockb1017 rockb1017 commented May 15, 2021

fixes #85

The existing unit tests check for no missing logs, no duplicates, checkpoint recovery, etc. But in all of them, every file is captured by the Include parameter.

Parameter: {Include: temp/*}
Filenames:
temp/file1.log -> temp/file1.rot1.log
temp/file2.log -> temp/file2.rot1.log

But with Kubernetes, we tail symlinked files whose names carry other important k8s metadata. When the underlying file is rotated, the rotated file falls outside the Include match. So in essence, it is like this:

Parameter: {Include: temp/*}
temp/file.log -> not_temp/file.log.1

This results in a lot of data loss in k8s environments.

When a file is opened with os.Open(), we can keep track of it and continue to tail whatever gets appended to it even after it is moved away. That only works as long as we don't close it. So the main thing I did is keep the files open across polling cycles. Before this change, they were closed at the end of each poll and re-opened at the beginning of the next cycle.
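As a rough illustration of that behavior on Unix-like systems (a minimal sketch with hypothetical file names and error handling omitted, not the operator's code), a descriptor obtained from os.Create/os.Open keeps following the file even after a Move/Create rotation renames it:

```go
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	// Hypothetical paths; error handling omitted for brevity.
	w, _ := os.Create("file.log")
	w.WriteString("line written before rotation\n")

	// Tail-style reader opened on the original path.
	r, _ := os.Open("file.log")

	// Simulate Move/Create rotation: the path changes, but both open
	// descriptors still refer to the same underlying file.
	os.Rename("file.log", "file.log.1")

	// A line written after the move is still visible through the reader
	// we kept open, because the descriptor follows the file, not the path.
	w.WriteString("line written after rotation\n")
	w.Close()

	data, _ := io.ReadAll(r)
	fmt.Print(string(data)) // prints both lines
	r.Close()
}
```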

I want to point this out: there are two methods of file rotation, "Move/Create" and "Copy/Truncate". This approach is only possible with the "Move/Create" method.
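For contrast, here is a minimal sketch of what Copy/Truncate rotation does (a hypothetical helper, not this PR's code). Because the original path is truncated in place, a retained handle just ends up pointing at an emptied file, and the saved read offset can land past the new end of file:

```go
package main

import (
	"io"
	"log"
	"os"
)

// copyTruncate sketches the Copy/Truncate rotation scheme: copy the contents
// to the rotated name, then truncate the original file in place.
func copyTruncate(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	if _, err := io.Copy(out, in); err != nil {
		return err
	}
	// Truncation happens on the original path, so a reader holding an open
	// descriptor on src now sees an emptied file rather than the moved-away
	// data, and its saved offset may point past the new end of file.
	return os.Truncate(src, 0)
}

func main() {
	if err := copyTruncate("file.log", "file.log.1"); err != nil {
		log.Fatal(err)
	}
}
```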

@rockb1017 rockb1017 requested a review from a team May 15, 2021 05:10
@tigrannajaryan
Member

Please also add a new test that verifies the bug fix. The test presumably should fail with the implementation before the fix and should pass after the fix.

@djaglowski
Member

In an effort to gain confidence in fingerprint comparison logic, I added another test here in #145.

I have yet to see a case where fingerprint.StartsWith fails when it should pass, so I'm strongly suspecting that this is not the problem. If the problem is with this method, we should be able to reproduce it in a test.

I strongly suspect that the issue is not directly with fingerprint comparison.

Tomorrow I will get my head further into this problem.

@rockb1017
Contributor Author

rockb1017 commented May 17, 2021

posting some details from what I see with failing tests.

 logger.go:130: 2021-05-14T20:44:03.582-0700	INFO	Started watching file	{"operator_id": "$.testfile", "operator_type": "file_input", "path": "/var/folders/8q/f5bglm2n0kn889xy1tc63n6c0000gn/T/382875241/file0.log", "byte": "ZmlsZSAwLTAsIG1lc3NhZ2UgMApmaWxlIDAtMCwgbWVzc2FnZSAxCmZpbGUgMC0wLCBtZXNzYWdlIDIKZmlsZSAwLTAsIG1lc3NhZ2UgMwpmaWxlIDAtMCwgbWVzc2FnZSA0CmZpbGUgMC0wLCBtZXNzYWdlIDUKZmlsZSAwLTAsIG1lc3NhZ2UgNgpmaWxlIDAtMCwgbWVzc2FnZSA3CmZpbGUgMC0wLCBtZXNzYWdlIDgKZmlsZSAwLTAsIG1lc3NhZ2UgOQpmaWxlIDAtMCwgbWVzc2FnZSAxMApmaWxlIDAtMCwgbWVzc2FnZSAxMQpmaWxlIDAtMCwgbWVzc2FnZSAxMgpmaWxlIDAtMCwgbWVzc2FnZSAxMwpmaWxlIDAtMCwgbWVzc2FnZSAxNApmaWxlIDAtMCwgbWVzc2FnZSAxNQpmaWxlIDAtMCwgbWVzc2FnZSAxNgo=", "text": "file 0-0, message 0\nfile 0-0, message 1\nfile 0-0, message 2\nfile 0-0, message 3\nfile 0-0, message 4\nfile 0-0, message 5\nfile 0-0, message 6\nfile 0-0, message 7\nfile 0-0, message 8\nfile 0-0, message 9\nfile 0-0, message 10\nfile 0-0, message 11\nfile 0-0, message 12\nfile 0-0, message 13\nfile 0-0, message 14\nfile 0-0, message 15\nfile 0-0, message 16\n"}
logger.go:130: 2021-05-14T20:44:03.882-0700	INFO	Started watching file	{"operator_id": "$.testfile", "operator_type": "file_input", "path": "/var/folders/8q/f5bglm2n0kn889xy1tc63n6c0000gn/T/382875241/file0.rot0.log", "byte": "ZmlsZSAwLTAsIG1lc3NhZ2UgMApmaWxlIDAtMCwgbWVzc2FnZSAxCmZpbGUgMC0wLCBtZXNzYWdlIDIKZmlsZSAwLTAsIG1lc3NhZ2UgMwpmaWxlIDAtMCwgbWVzc2FnZSA0CmZpbGUgMC0wLCBtZXNzYWdlIDUKZmlsZSAwLTAsIG1lc3NhZ2UgNgpmaWxlIDAtMCwgbWVzc2FnZSA3CmZpbGUgMC0wLCBtZXNzYWdlIDgKZmlsZSAwLTAsIG1lc3NhZ2UgOQpmaWxlIDAtMCwgbWVzc2FnZSAxMApmaWxlIDAtMCwgbWVzc2FnZSAxMQpmaWxlIDAtMCwgbWVzc2FnZSAxMgpmaWxlIDAtMCwgbWVzc2FnZSAxMwpmaWxlIDAtMCwgbWVzc2FnZSAxNApmaWxlIDAtMCwgbWVzc2FnZSAxNQpmaWxlIDAtMCwgbWVzc2FnZSAxNgpmaWxlIDAtMCwgbWVzc2FnZSAxNwpmaWxlIDAtMCwgbWVzc2FnZSAxOApmaWxlIDAtMCwgbWVzc2FnZSAxOQpmaWxlIDAtMCwgbWVzc2FnZSAyMApmaWxlIDAtMCwgbWVzc2FnZSAyMQpmaWxlIDAtMCwgbWVzc2FnZSAyMgpmaWxlIDAtMCwgbWVzc2FnZSAyMwpmaWxlIDAtMCwgbWVzc2FnZSAyNApmaWxlIDAtMCwgbWVzc2FnZSAyNQpmaWxlIDAtMCwgbWVzc2FnZSAyNgpmaWxlIDAtMCwgbWVzc2FnZSAyNwpmaWxlIDAtMCwgbWVzc2FnZSAyOApmaWxlIDAtMCwgbWVzc2FnZSAyOQo=", "text": "file 0-0, message 0\nfile 0-0, message 1\nfile 0-0, message 2\nfile 0-0, message 3\nfile 0-0, message 4\nfile 0-0, message 5\nfile 0-0, message 6\nfile 0-0, message 7\nfile 0-0, message 8\nfile 0-0, message 9\nfile 0-0, message 10\nfile 0-0, message 11\nfile 0-0, message 12\nfile 0-0, message 13\nfile 0-0, message 14\nfile 0-0, message 15\nfile 0-0, message 16\nfile 0-0, message 17\nfile 0-0, message 18\nfile 0-0, message 19\nfile 0-0, message 20\nfile 0-0, message 21\nfile 0-0, message 22\nfile 0-0, message 23\nfile 0-0, message 24\nfile 0-0, message 25\nfile 0-0, message 26\nfile 0-0, message 27\nfile 0-0, message 28\nfile 0-0, message 29\n"}

these two FirstBytes are the same up until the last 2 characters

@djaglowski
Member

> posting some details from what I see with failing tests

Can you clarify exactly which test is failing?

> these two FirstBytes are the same up until the last 2 characters

The printed bytes appear to be encoded, possibly base64. If this is the case, the string representation may be misleading, as the difference (padding) may have been introduced in the encoding process. In any case, it would be helpful to have the exact code used to create this issue.
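If it helps, here is a throwaway sketch (not part of this PR) that decodes two base64-encoded "byte" values like the ones in the log lines above and compares the raw bytes, so any difference coming only from base64 padding drops out:

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"os"
)

// Pass the two base64-encoded "byte" fields as command-line arguments,
// shorter fingerprint first. The program reports whether the shorter value
// is a raw-byte prefix of the longer one.
func main() {
	short, err := base64.StdEncoding.DecodeString(os.Args[1])
	if err != nil {
		panic(err)
	}
	long, err := base64.StdEncoding.DecodeString(os.Args[2])
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.HasPrefix(long, short))
}
```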

I agree with Tigran that we really need to have a specific test that demonstrates the problem.

@rockb1017
Contributor Author

I added more logging for whenever it starts tailing a new file:
f.Infow("Started watching file", "path", path, "byte", reader.Fingerprint.FirstBytes, "text", string(reader.Fingerprint.FirstBytes))
The output above is from running this test:
go test -v -run TestMultiCopyTruncateSlow

@djaglowski
Member

djaglowski commented May 18, 2021

These are some pretty significant code changes, and frankly I can't recommend specific improvements because I can't tell whether the original problem was solved or not.

If I understand correctly, you believe you have identified the root cause of the original issue. If so, I believe a reasonable path forward would be:

  1. Update the original issue with a detailed explanation of what is wrong with the current implementation.
  2. Start a new branch from main, which includes only one new test. This test should demonstrate the problem as consistently as possible. Open a draft PR to share your progress.
  3. Rebase this PR onto that branch.

This approach will allow us to establish whether the changes are solving the problem.

@rockb1017
Contributor Author

> These are some pretty significant code changes, and frankly I can't recommend specific improvements because I can't tell whether the original problem was solved or not.
>
> If I understand correctly, you believe you have identified the root cause of the original issue. If so, I believe a reasonable path forward would be:
>
>   1. Update the original issue with a detailed explanation of what is wrong with the current implementation.
>   2. Start a new branch from main, which includes only one new test. This test should demonstrate the problem as consistently as possible. Open a draft PR to share your progress.
>   3. Rebase this PR onto that branch.
>
> This approach will allow us to establish whether the changes are solving the problem.

Thank you for suggesting a great process. It makes sense. I have done exactly that.

Member

@djaglowski djaglowski left a comment


This PR is really complex, and there are a lot of changes that have implications I still do not fully understand. I do not see the cause of test failures. However, I left some feedback which may help the design.

}
f.tailingReaders[path] = []*Reader{reader}
} else { // If path exists
f.Infow("Log rotation detected. Started watching file", "path", path)
Member


Does this necessarily mean the file has rotated? Wouldn't this happen any time we have seen the same file path before?

Contributor Author


In every poll cycle, we check the same files we have been tailing. Only when the path is one we have been tailing but the fingerprint is different does it mean log rotation.
Did I address your concern well here?
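Roughly what is being described, as an illustrative sketch (the names and the prefix check are placeholders, not this PR's actual code):

```go
package main

import (
	"bytes"
	"fmt"
)

// rotationDetected reports whether a path we already tail should be treated
// as rotated: the path is known, but the file's current fingerprint no longer
// starts with the fingerprint we recorded for it.
func rotationDetected(tracked map[string][]byte, path string, current []byte) bool {
	previous, seen := tracked[path]
	if !seen {
		return false // first time seeing this path; just a new file
	}
	if bytes.HasPrefix(current, previous) {
		return false // same underlying file, fingerprint merely grew
	}
	return true // known path with different leading bytes => rotation
}

func main() {
	tracked := map[string][]byte{"temp/file.log": []byte("old first line\n")}
	fmt.Println(rotationDetected(tracked, "temp/file.log", []byte("new first line\n"))) // true
}
```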

@@ -151,10 +151,11 @@ func (c InputConfig) Build(context operator.BuildContext) ([]operator.Operator,
firstCheck: true,
cancel: func() {},
knownFiles: make([]*Reader, 0, 10),
tailingReaders: map[string][]*Reader{},
Member


What does the slice of Readers represent?

Member


If it is all previous rotations of the file, then do we really need to keep them? When we detect a rotation, can we just read the rest of the old file and then close it? Do we expect that rotated files are still written to? (My understanding is that the issue we have is that a file is written to immediately before being rotated.)

Contributor Author


Good point. Once a file is rotated, it won't be written to anymore. I should add a workflow to stop tailing them then, yeah?


count := min0(f.MaxConcurrentFiles, readerCount)
count = min0(count, len(f.readerQueue))
polledReaders := f.readerQueue[:count]
Member


Unfortunately, I don't think max_concurrent_files is really being respected with this design.

The original problem this setting was introduced for is that when there are so many matching files, we run out of file descriptors. By leaving all the files open, we are not limiting the number of file descriptors.

I'm not certain that it's possible to accomplish both objectives though (limiting file descriptors and tracking rotations), so perhaps we would need to set a reasonably high default for max_concurrent_files and then only keep that many open. (This would mean that if you are matching too many files, you may lose a small percent of logs upon rotation)

Member


Obviously not an ideal situation, but I think users must be able to allocate a fixed number of file descriptors to the receiver.

Contributor Author


You are right. So set it to something reasonably high, like 5,000? We definitely need a workflow to close out already-rotated files to support this. But even then, in a situation where there are 6,000 actively written files and the maximum is 5,000, it will only tail 5,000 files and not tail the remaining 1,000 at all. A rough sketch of such a close-out workflow is below.
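A hedged sketch of one possible shape for that workflow (the reader type and field names are hypothetical, not this PR's code): once the number of open readers exceeds the limit, close already-rotated readers first and accept that anything written to them afterwards is lost.

```go
package main

import (
	"fmt"
	"os"
)

// reader is a stand-in for the operator's reader type; only the fields
// needed for this sketch are shown.
type reader struct {
	file    *os.File
	rotated bool // set once the file has been moved away by rotation
}

// enforceOpenFileLimit closes rotated readers, in slice order, until at most
// limit readers remain open, and returns the readers that were kept.
func enforceOpenFileLimit(open []*reader, limit int) []*reader {
	excess := len(open) - limit
	if excess <= 0 {
		return open
	}
	kept := open[:0]
	for _, r := range open {
		if excess > 0 && r.rotated {
			r.file.Close() // descriptor released; later appends to it are lost
			excess--
			continue
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	open := []*reader{{rotated: true}, {rotated: false}, {rotated: true}}
	fmt.Println(len(enforceOpenFileLimit(open, 2))) // 2
}
```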

@rockb1017
Contributor Author

Closing. It was fixed with #168.

@rockb1017 rockb1017 closed this Jun 10, 2021

Successfully merging this pull request may close these issues.

file_input - Data loss when tailing symlink file with log rotation (k8s container logs)
3 participants