Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Filebeat] Enable non-AWS S3 buckets for aws-s3 input #28234

Merged
merged 9 commits into from
Oct 26, 2021

Conversation

legoguy1000
Copy link
Contributor

@legoguy1000 legoguy1000 commented Oct 4, 2021

What does this PR do?

This PR adds the ability to use non-AWS S3 bucket services with the aws-s3 input. With the new polling feature the input can now poll other S3 providers like Minio that use both virtual host style and path style buckets as well as the ability to use non SSL enabled buckets (if wanted). This also updates the log.file.path field to use the actual URL used to download the file instead of manually generating it with hard coded values like amazonaws.com.

Why is it important?

To be able to ingest data from other S3 providers besides AWS.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 4, 2021
@mergify
Copy link
Contributor

mergify bot commented Oct 4, 2021

This pull request does not have a backport label. Could you fix it @legoguy1000? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Oct 4, 2021
@elasticmachine
Copy link
Collaborator

elasticmachine commented Oct 4, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2021-10-26T12:16:37.460+0000

  • Duration: 109 min 44 sec

  • Commit: dbf0f21

Test stats 🧪

Test Results
Failed 0
Passed 24788
Skipped 1653
Total 26441

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

@legoguy1000
Copy link
Contributor Author

@andrewkroh can u let me know what u think of this?

@aspacca
Copy link

aspacca commented Oct 4, 2021

@legoguy1000 thanks for the contribution, the feature is welcome.

I think we can anyway simplify the implementation and changes required with no need to introduce a non_aws_bucket_name that can be confusing.

My suggestion is to let the endpoint accept a full URI and then parse it and check if it contains a scheme.
In this case we will just use it without using the endpoint enrichment.

With pseudo-code something along this:

endpoint := awscommon.EnrichAWSConfigWithEndpoint(in.config.AWSConfig.Endpoint, s3ServiceName, in.awsConfig.Region, in.awsConfig)
if url.Parse(in.config.AWSConfig.Endpoint).Scheme != "" {
    endpoint = awssdk.ResolveWithEndpointURL(in.config.AWSConfig.Endpoint)
}

path_style support is it fine to be added.

@legoguy1000
Copy link
Contributor Author

If not add something like non_aws_bucket_name would u just use the bucket_arn to set the bucket name? My biggest reason to add a new field was to be able to easily identify that we don't want to use AWS specific settings and options.

@legoguy1000 legoguy1000 marked this pull request as ready for review October 5, 2021 00:32
@legoguy1000 legoguy1000 force-pushed the 28222-nonaws-s3 branch 2 times, most recently from 7752a17 to 66f7fcb Compare October 5, 2021 00:34
@legoguy1000
Copy link
Contributor Author

@aspacca i removed the non_aws_bucket_name config and added validation via the enpoint field. I also added some config overrides for the region and provider to make sure the cloud.* are relatively accurate and not misleading/hardcoded.

@aspacca
Copy link

aspacca commented Oct 5, 2021

@legoguy1000

If not add something like non_aws_bucket_name would u just use the bucket_arn to set the bucket name?

you are right about this: we had a discussion internally and between introducing a breaking change renaming bucket_arn to bucket_name and use for both cases we decided it is actually better your initial solution using non_aws_bucket_name

I apologise for asking you to revert your change for it: you can add back non_aws_bucket_name, thanks

x-pack/filebeat/input/awss3/config.go Outdated Show resolved Hide resolved
@@ -33,6 +34,9 @@ type config struct {
AWSConfig awscommon.ConfigAWS `config:",inline"`
FileSelectors []fileSelectorConfig `config:"file_selectors"`
ReaderConfig readerConfig `config:",inline"` // Reader options to apply when no file_selectors are used.
PathStyle bool `config:"path_style"`
RegionOverride string `config:"region_override"`
ProviderOverride string `config:"provider_override"`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anyway we can infer the provider from the endpoint or any API request?
If we can avoid to specify it explicitly and let the code find it it would be better, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be hard unless you just Didi it by domain name but even then it would make aws into amazonaws

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an override config but also some auto detection based on domain names from the endpoints. Let me know what you think.

x-pack/filebeat/input/awss3/input.go Outdated Show resolved Hide resolved
@mergify
Copy link
Contributor

mergify bot commented Oct 7, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 28222-nonaws-s3 upstream/28222-nonaws-s3
git merge upstream/master
git push upstream 28222-nonaws-s3

@legoguy1000 legoguy1000 force-pushed the 28222-nonaws-s3 branch 2 times, most recently from 570e9b5 to 71dd1d4 Compare October 7, 2021 00:23
@mergify
Copy link
Contributor

mergify bot commented Oct 7, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 28222-nonaws-s3 upstream/28222-nonaws-s3
git merge upstream/master
git push upstream 28222-nonaws-s3

@legoguy1000
Copy link
Contributor Author

legoguy1000 commented Oct 8, 2021

@aspacca let me know what u think of the changes. Instead of non_aws_bucket_name I just added back bucket_name. It will then work with AWS or non AWS buckets especially since the ARN value isn't really used except to pull the bucket name from it. I also found that if the 3rd party service uses the same URL format as AWS for it's buckets, u can just use the domain name as the endpoint as the function to build the full URL will work for that.

@legoguy1000 legoguy1000 force-pushed the 28222-nonaws-s3 branch 3 times, most recently from 5c33b48 to 5bdce86 Compare October 11, 2021 01:31
x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc Outdated Show resolved Hide resolved
The `aws-s3` input can also poll 3rd party S3 compatible services such as the self hosted Minio.
Uisng non-AWS S3 compatible buckets require the use of `access_key_id` and `secret_access_key` for authentication.
To specify the S3 bucket name, use the `bucket_name` config and the `endpoint` must be set to replace the default API endpoint.
Services that have endpoints in the standard form of `https://s3.<region>.<domain>` only need the endpoint config set like `endpoint: <domain>`.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rephrase this to something like that.

`endpoint` can be either a full URI with scheme, that will be used as it is as endpoint of the service to connect to, or a single domain, that will be used to build the full endpoint URI for native AWS S3 along wit the region param.
If you use the native AWS S3 service you need to set only the domain and only in case your S3 bucket url is hosted on a custom domain.
No `endpoint` is needed to be configured if you use the native AWS S3 service without a custom domain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, let me know what you think of my reword. I will be the first to admit, i'm terrible at writing docs.

if c.FIPSEnabled && endpoint.Scheme != "" {
return errors.New("fips_enabled cannot be used with a non-AWS S3 bucket.")
}
if c.PathStyle && c.AWSConfig.Endpoint == "" {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c.AWSConfig.Endpoint can be not empty and still we are using native AWS S3 service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct but short of checking all the other AWS domains I thought this was a quick way to make sure that the 99% of users that use the normal AWS S3 service don't accidently use path_style

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would still be a problem.
let's separate bucket_arn from bucket_name and use the first for the native AWS buckets and the second for compatible AWS buckets. you can rename back to non_aws_bucket_name to clarify better the difference.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this validation logic is still not 100% correct.
please, now that we have c.NonAWSBucketName test on either it or c.BucketARN for the other config param applicable only to native/not native AWS buckets

if c.PathStyle && c.AWSConfig.Endpoint == "" {
return errors.New("Cannot use path style when using AWS native S3 services")
}
if c.ProviderOverride != "" && c.AWSConfig.Endpoint != "" {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c.AWSConfig.Endpoint can be not empty and still we are using native AWS S3 service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

x-pack/filebeat/input/awss3/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/awss3/input.go Outdated Show resolved Hide resolved
@mergify
Copy link
Contributor

mergify bot commented Oct 12, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 28222-nonaws-s3 upstream/28222-nonaws-s3
git merge upstream/master
git push upstream 28222-nonaws-s3

@kaiyan-sheng
Copy link
Contributor

/test

@mergify
Copy link
Contributor

mergify bot commented Oct 14, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 28222-nonaws-s3 upstream/28222-nonaws-s3
git merge upstream/master
git push upstream 28222-nonaws-s3

@legoguy1000
Copy link
Contributor Author

I will try to fix the conflict and CI test failures today so we can get this merged. Is there any other changes wanted @kaiyan-sheng @aspacca ??

@kaiyan-sheng
Copy link
Contributor

Looks good on my side @legoguy1000 ! Could you add the new config parameter into documentation in x-pack/filebeat/_meta/config/filebeat.inputs.reference.xpack.yml.tmpl please? Thanks!!

Copy link

@aspacca aspacca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please, look at my comments @legoguy1000 , thanks

x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc Outdated Show resolved Hide resolved
if c.FIPSEnabled && endpoint.Scheme != "" {
return errors.New("fips_enabled cannot be used with a non-AWS S3 bucket.")
}
if c.PathStyle && c.AWSConfig.Endpoint == "" {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would still be a problem.
let's separate bucket_arn from bucket_name and use the first for the native AWS buckets and the second for compatible AWS buckets. you can rename back to non_aws_bucket_name to clarify better the difference.

x-pack/filebeat/input/awss3/input.go Outdated Show resolved Hide resolved
@legoguy1000
Copy link
Contributor Author

@aspacca @kaiyan-sheng The unit tests seem to be failing here https://github.com/elastic/beats/pull/28234/files#diff-f345fd6a1f5ea9523117d4ead2e5f1d13fb82eb1c65a089fd34fcdd514916a96R173 with respect to the SQS tests. I don't see why it would error out as the code to download an object for S3 is no different whether polling or SQS. Any ideas??

@aspacca
Copy link

aspacca commented Oct 25, 2021

@legoguy1000 resp at line 173 on x-pack/filebeat/input/awss3/s3_objects.go is mocked and the mock is probably missing SDKResponseMetdata() method that therefore return nil that causes the deference panic in the test.
Different tests mocks the call in different way and this might be the reason why not every test fails

@aspacca
Copy link

aspacca commented Oct 25, 2021

you can debug/print resp

here and see where exactly is the nil value, one of the following:

  • SDKResponseMetdata()
  • Request
  • HTTPRequest
  • URL

@legoguy1000
Copy link
Contributor Author

you can debug/print resp

here and see where exactly is the nil value, one of the following:

  • SDKResponseMetdata()
  • Request
  • HTTPRequest
  • URL

Looking at it, SDKResponseMetdata() returns an internal variable response that I have been unsuccessful to mock. I added logic to check if the response from that function is nil and if so, just set the variable to a blank string. If there is a better way to do it, please let me know.

@aspacca
Copy link

aspacca commented Oct 25, 2021

@legoguy1000 I'm a little skeptical about the nil check you introduced: this won't happen in real usage but only in the tests we were not able to mock properly and also there should probably not be a case where s3RequestURL is empty, especially since we are fetching it from the SDKResponseMetdata() because of support for non AWS S3 bucket, but before was built as a string with all the info we already have.

Let me do some test on the mocking side

@legoguy1000
Copy link
Contributor Author

@legoguy1000 I'm a little skeptical about the nil check you introduced: this won't happen in real usage but only in the tests we were not able to mock properly and also there should probably not be a case where s3RequestURL is empty, especially since we are fetching it from the SDKResponseMetdata() because of support for non AWS S3 bucket, but before was built as a string with all the info we already have.

Let me do some test on the mocking side

I concur and appreciate the help. If u get it working, feel free to push the fix to the branch.

@jsoriano jsoriano added the Team:Integrations Label for the Integrations team label Oct 25, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 25, 2021
@aspacca
Copy link

aspacca commented Oct 25, 2021

@legoguy1000 benchmark should not fail now, I pushed a commit on your branch

@mergify
Copy link
Contributor

mergify bot commented Oct 26, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 28222-nonaws-s3 upstream/28222-nonaws-s3
git merge upstream/master
git push upstream 28222-nonaws-s3

@aspacca aspacca merged commit 7fe0e57 into elastic:master Oct 26, 2021
@aspacca
Copy link

aspacca commented Oct 26, 2021

thanks for the patience @legoguy1000 :)

@legoguy1000
Copy link
Contributor Author

thanks for the patience @legoguy1000 :)

I feel like I should say that to u. Thanks for the help.

Icedroid pushed a commit to Icedroid/beats that referenced this pull request Nov 1, 2021
* Update `aws-s3` input to support non-AWS S3 buckets
@legoguy1000 legoguy1000 deleted the 28222-nonaws-s3 branch December 13, 2021 13:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify Team:Integrations Label for the Integrations team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Filebeat] Enable S3 support for other clouder than AWS's
5 participants