Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scan older files to be harvested based on mod time #4374

Merged
merged 11 commits into from
Jun 27, 2017

Conversation

codeperfector
Copy link
Contributor

@elasticmachine
Copy link
Collaborator

Can one of the admins verify this patch?

@elasticmachine
Copy link
Collaborator

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.

@karmi
Copy link

karmi commented May 22, 2017

Hi @waynemz, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in yout Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

@ruflin
Copy link
Contributor

ruflin commented May 23, 2017

@waynemz Thanks for sharing your implementation. This is definitively something we should discuss. Some thoughts:

  • In case of a large amount of files, this could have quite a performance impact as all files need to be touched to fetch the ModTime().
  • This feature should be off by default
  • If we introduce this feature, we should allow desc and ascending order as there were also dicussions where people were interested in the most recent files first.
  • In general this mainly becomes relevant if harvester_limit is used. Assuming harvester_limit is 10, it will still not guarantee which of the 10 oldest files is harvested at which speed.
  • Some ideas for the config naming if we allow both sort orders?

@codeperfector
Copy link
Contributor Author

@ruflin I agree the feature should be off by default given the additional performance impact. Instead of a boolean the config option could be a string with values off/ascending/descending.
I believe some documentation might be needed to explain the performance impact and the fact this would only be most useful in combination with the harvester_limit option, especially when harvester_limit=1
As for the config name I would suggest something like prospector_scan_order but I'm open to any suggestions you have.

@karmi I added my other email to my github account.

@ruflin
Copy link
Contributor

ruflin commented May 26, 2017

I like the suggestion of scan_order but would leave out the prospector part. It goes in line with scan_frequency. It's not really the scan that happens in this order but the starting of harvester, but I think that is more a technical thing. scan_order should be pretty simple to understand.

I have in the back of my head that if we add this file, pretty soon people will ask for it to be sorted by filename instead of date. Could be perhaps take this already into account now when we design the config. Or is it overkill?

@codeperfector
Copy link
Contributor Author

@ruflin Thats a great suggestion. I'll incorporate the filename as well. That will bring the possible values for scan_order to five: off(default), oldest_file_first, newest_file_first, ascending_by_name, descending_by_name
Can you suggest better value strings?

@codeperfector
Copy link
Contributor Author

codeperfector commented May 29, 2017

I'll correct the go imports style issue but I could not figure out what went wrong with the test:
[fail] 1.09% test_base.Test.test_invalid_config_cli_param: 0.0296s
I didn't introduce a new cli param here

BTW I'm using Gogland Early Access as my editor

@ruflin
Copy link
Contributor

ruflin commented May 29, 2017

For the config options I was more thinking of

scan.order: "desc", "asc"
scan.sort: "modtime", "filename", "none"

This kind of implies we should switch to scan.frequency one day ;-)

@ruflin
Copy link
Contributor

ruflin commented May 29, 2017

For the failing check: Run make fmt in the filebeat directory. This should do the cleanup for you.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this forward. I left a few comments.

Could you also please add a changelog entry? We also will have to add docs for this feature. Could you also add a very short doc entry to it? The config docs are in https://github.com/elastic/beats/blob/master/filebeat/docs/reference/configuration/filebeat-options.asciidoc

@@ -7,6 +7,9 @@ import (
"path/filepath"
"time"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove newline?

}

func getSortedFiles(scanOrder string, scanSort string, sortInfos []FileSortInfo) []FileSortInfo {
switch scanOrder + "_" + scanSort {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you perhaps to that in two switch / case? So first check the scanSort and then the order inside each? It feels a bit strange to create a string for comparison. Perhaps there is an even "shorter" option.

You could move the sort.Slice part to after switch and only define the function which will be used inside the switch cases. This will make the cod cleaner I think.

// Scan starts a scanGlob for each provided path/glob
func (p *Prospector) scan() {

for path, info := range p.getFiles() {
paths := p.getFiles()
files := getSortInfos(paths)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not happen in every call, only if sorting is enabled as this could have a performance impact. I would expect something like:

if config.sort.Enabled() { paths := paths.Sort()

This is heavily simplified but you get what I'm trying to explain? Do we even need to expose the fileinfo outside of the sort?

@ruflin ruflin added the in progress Pull request is currently in progress. label Jun 1, 2017
@codeperfector
Copy link
Contributor Author

@ruflin I made the suggested changes. Can you resolve the conflict with changelog?

@codeperfector
Copy link
Contributor Author

I fixed the goimports style but there is some build failure with the full yml file. Not sure how to resolve that.

@tsg
Copy link
Contributor

tsg commented Jun 6, 2017

@waynemz You can fix that by running make update. Now there is also a conflict in the CHANGELOG (sorry, we've merged a large changelog modification) so you need to sync with master.

@codeperfector
Copy link
Contributor Author

@ruflin @tsg Fixed all conflicts. Please approve.

@tsg
Copy link
Contributor

tsg commented Jun 12, 2017

jenkins, test it

sortFunc = func(i, j int) bool {
return sortInfos[i].info.ModTime().After(sortInfos[j].info.ModTime())
}
default:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to return an explicit error on these default branches?

Copy link
Contributor Author

@codeperfector codeperfector Jun 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsg what do you want to happen if the default case is reached? we can abort with an error message (panic), or log an error and use the default unsorted behavior.

paths := p.getFiles()
if strings.ToLower(p.config.ScanSort) != "none" {
sortInfos = getSortedFiles(strings.ToLower(p.config.ScanOrder),
strings.ToLower(p.config.ScanSort),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of calling ToLower here, as this creates some leniency in the config parsing which we don't have in other places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@waynemz Did you see this comment?

var path string
var info os.FileInfo

if strings.ToLower(p.config.ScanSort) != "none" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing ToLower here should also mean less work on the default case.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left 2 minor comments. Also check comments from @tsg


Specifies if files should be harvested in order and how to determine the order. Possible values are modtime, filename and none. To sort by file modification time use modtime otherwise use filename.

If you specify a value other than none for this setting you can determine whether to use ascending or descending order using `scan.order`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add . dots at the end of the sentences? Also check below.

@@ -132,6 +132,8 @@ https://github.com/elastic/beats/compare/v6.0.0-alpha1...v6.0.0-alpha2[View comm
- Add `logging.files` `permissions` option. {pull}4295[4295]

*Filebeat*
- Added ability to sort harvested files. {pull}4374[4374]
- Add experimental Redis slow log prospector type. {pull}4180[4180]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line probably sneaked in during rebasing?

@ruflin
Copy link
Contributor

ruflin commented Jun 13, 2017

@waynemz I really appreciate all the work you put into this. Thanks a lot for adding this feature.

@codeperfector
Copy link
Contributor Author

Thanks @ruflin I appreciate the approval.
@tsg please take a look

@codeperfector
Copy link
Contributor Author

Someone please take a look, there are no conflicts as of now.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review. I left a few additional comments. It's mainly about the error management we should improve.

return sortInfos
}

func getSortedFiles(scanOrder string, scanSort string, sortInfos []FileSortInfo) ([]FileSortInfo, string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return argument should be error and not string. This is quite standard in Golang. Like this you can check

if err != nil {
 ...
}

}

func getSortedFiles(scanOrder string, scanSort string, sortInfos []FileSortInfo) ([]FileSortInfo, string) {
var sortFunc func(i, j int) bool = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to set it to nil here

return sortInfos[i].info.ModTime().After(sortInfos[j].info.ModTime())
}
default:
return nil, "Unexpected value for scan.order: " + scanOrder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use fmt.Errorf("Unexpected value for scan.order: %v", scanOrder) here

return strings.Compare(sortInfos[i].info.Name(), sortInfos[j].info.Name()) > 0
}
default:
return nil, "Unexpected value for scan.order: " + scanOrder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above

return nil, "Unexpected value for scan.order: " + scanOrder
}
default:
return nil, "Unexpected value for scan.sort: " + scanSort
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above

sort.Slice(sortInfos, sortFunc)
}

return sortInfos, ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return sortInfos, nil

var absolutePath string
absolutePath, err = filepath.Abs(path)
if err != nil {
logp.Err("could not fetch abs path for file %s: %s", absolutePath, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return here on error? Means this also can return an error?


paths := p.getFiles()

var err string = ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var err error

if p.config.ScanSort != "none" {
sortInfos, err = getSortedFiles(strings.ToLower(p.config.ScanOrder),
strings.ToLower(p.config.ScanSort),
getSortInfos(paths))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should handle the error directly inside the if clause. In case of error should we return? what are we doing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we are going to print an error saying why sorting failed and continue the regular unsorted route.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

// Scan starts a scanGlob for each provided path/glob
func (p *Prospector) scan() {

for path, info := range p.getFiles() {
var sortInfos []FileSortInfo = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to set it to nil, also follow up below

@codeperfector
Copy link
Contributor Author

@ruflin @tsg added error handling

@codeperfector
Copy link
Contributor Author

@ruflin @tsg fixed it.

@ruflin
Copy link
Contributor

ruflin commented Jun 26, 2017

@waynemz Thanks for your work on this. LGTM.

Two things that came to my mind which can be done a follow up PR:

  • We should use Validate on the config to validate only valid scan options are set
  • We should mark the feature in this Validate part experimental at first so we get some feedback early on and can still make heavy changes on it if needed.

@waynemz Do you want to add this to this PR or better as a follow up?

@ruflin ruflin merged commit b0edcbe into elastic:master Jun 27, 2017
@ruflin
Copy link
Contributor

ruflin commented Jun 27, 2017

@waynemz Merged. Thanks a lot for the work on this one. I will open a follow up PR with the above suggestions.

@ruflin
Copy link
Contributor

ruflin commented Jun 27, 2017

@waynemz Here is the follow up PR. I made a few minor changes. It would be great to also add some system tests for this feature to confirm it works as expected: #4564

@codeperfector
Copy link
Contributor Author

Thanks @ruflin

ruflin added a commit to ruflin/beats that referenced this pull request Jun 28, 2017
andrewkroh pushed a commit that referenced this pull request Jul 12, 2017
codeperfector added a commit to codeperfector/beats that referenced this pull request Sep 6, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Filebeat Filebeat in progress Pull request is currently in progress. review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants