[FLINK-21700][security] Add an option to disable credential retrieval on a secure cluster #15131
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community. Automated checks: last check on commit 3c7129c (Sat Aug 28 11:21:03 UTC 2021). Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required. Bot commands: the @flinkbot bot supports the following commands:
Hi @lirui-apache, @XComp @rmetzger, could you help to review this PR?
Thanks for the PR. I'm still trying to understand this change. Left some minor comments. Also, you should change the commit message/title to follow the format "[FLINK-21700][yarn] xxxx".
@@ -344,6 +344,13 @@
        .withDescription(
                "A comma-separated list of additional Kerberos-secured Hadoop filesystems Flink is going to access. For example, yarn.security.kerberos.additionalFileSystems=hdfs://namenode2:9002,hdfs://namenode3:9003. The client submitting to YARN needs to have access to these file systems to retrieve the security tokens.");

public static final ConfigOption<Boolean> YARN_SECURITY_ENABLED =
        key("yarn.security.kerberos.fetch.delegationToken.enabled")
We also need to generate the doc with mvn clean package -Dgenerate-config-docs -pl flink-docs -am -nsu -DskipTests -Dcheckstyle.skip
Got it.
@@ -1081,13 +1081,22 @@ private ApplicationReport startAppMaster(
    if (UserGroupInformation.isSecurityEnabled()) {
        // set HDFS delegation tokens when security is enabled
        LOG.info("Adding delegation token to the AM container.");
If yarnFetchDelegationTokenEnabled == false, we should not print this log.
Got it.
I verified it: the log message is still valid since Utils.setTokensFor also sets user-related tokens. We only disable HDFS and HBase when SecurityOptions.KERBEROS_FETCH_DELEGATION_TOKEN is disabled.
final Text alias = new Text(token.getService());
LOG.info("Adding user token " + alias + " with " + token);
credentials.addToken(alias, token);
Could you help me understand why we need this, and what the difference is between token.getService() and token.getIdentifier()? Thanks.
I found that the identifier is not the only unique id of the token. So when getting existing tokens from UserGroupInformation.getCurrentUser().getTokens(), these tokens may have the same identifier, which causes tokens to be overwritten by credentials.addToken(identifier, token). Actually, I think we should use the service as the alias.
I replied with more detailed info in an email to you.
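The overwrite issue described here comes down to the fact that Hadoop's Credentials stores tokens in a map keyed by the alias passed to addToken(). Below is a minimal plain-Java sketch of that behavior; the HashMap, string "tokens", and alias values are stand-ins of my own (not the actual Hadoop Text/Token classes), used only to show why keying by a shared identifier loses tokens while keying by the per-endpoint service does not:

```java
import java.util.HashMap;
import java.util.Map;

public class TokenAliasSketch {
    // Adds two distinct "tokens" under the given pair of aliases and
    // returns how many survive in the map.
    static int addTwoTokens(Map<String, String> credentials, String aliasA, String aliasB) {
        credentials.put(aliasA, "token-for-active-namenode");
        credentials.put(aliasB, "token-for-standby-namenode");
        return credentials.size();
    }

    public static void main(String[] args) {
        // Keyed by a shared identifier: the second put silently overwrites the first.
        System.out.println(addTwoTokens(new HashMap<>(), "hdfs-dt-1", "hdfs-dt-1")); // 1

        // Keyed by the per-endpoint service string: both tokens are retained.
        System.out.println(addTwoTokens(new HashMap<>(), "ha-hdfs:active-nn", "ha-hdfs:standby-nn")); // 2
    }
}
```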
Looks like the token fetching for HBase also relies on the service name instead of the identifier to add the token. I don't understand your argument, though: in what situation would the identifier not be unique, considering you called the identifier "not the only unique id"? Could you elaborate a bit more?
After digging into the Hadoop source code, I found that the alias is just the key of the internal token HashMap in Credentials.
> In what situation would the identifier not be unique
In our production environment, I found that the identifier is the same in the active and standby namenode HDFS delegation tokens. But I haven't found any detailed Hadoop doc to support it.
Are there different delegation tokens for the different namenodes Flink has to deal with? I would have assumed that there's only a single token for Flink accessing the Yarn cluster. Could you point me to the Hadoop source code you're referring to?
Sorry if it's a bit tedious. I haven't worked with Yarn/Kerberos yet.
Refer to https://issues.apache.org/jira/browse/HDFS-9276.
May not be very relevant.
Ok, I will force-push to correct the commit log later.
@zuston Could you squash your PR into one commit? Also, the CI failed because of the style check; you could execute
Done. Thanks
Hi @zuston and thanks for your contribution.
I'm wondering whether we can test that Flink is behaving in the right way when disabling the newly introduced flag. Right now, no test is covering this specific behavior.
<td><h5>yarn.security.kerberos.fetch.delegationToken.enabled</h5></td>
<td style="word-wrap: break-word;">true</td>
<td>Boolean</td>
<td>When this is true Flink will fetch HDFS/HBase delegation token injected into AM container.</td>
We might want to elaborate a bit more on what's necessary to do when this flag is disabled.
Got it.
Why do we have "kerberos" in the config key? Does it mean the delegation token can only be obtained when Kerberos is enabled?
I think the answer is YES. The delegation token mechanism is indeed an extension of Kerberos.
+1
It only works when credentials exist in the UGI and no keytab is specified. I think it's hard to test; could you give me some ideas on it? @XComp
That's a good question. @rmetzger How did we deal with these kinds of problems in the past? Was it sufficient to have a manual test with the external system to see that it worked properly?
Do I need to submit another PR about changing
Putting it in a separate commit within this PR is good enough if we go with this change.
Could you come up with a manual test which is documented properly in the PR to make it reproducible?
Thanks @XComp.
Could you visualize how the Oozie -> Flink -> Kerberos -> HDFS/Hbase communication works in contrast to the default one? That might help me understand the change in a better way.
// for HBase
obtainTokenForHBase(credentials, conf);

if (yarnFetchDelegationEnabled) {
I think it's also worth adding an info log message here for both cases, making it explicit in the logs whether delegation token fetching is enabled or disabled.
Got it.
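The suggested logging could look roughly like the sketch below. The flag name, message wording, and use of java.util.logging (instead of Flink's SLF4J logger) are all assumptions of mine, not the PR's actual code:

```java
import java.util.logging.Logger;

public class DelegationTokenLogSketch {
    private static final Logger LOG = Logger.getLogger(DelegationTokenLogSketch.class.getName());

    // Returns the message so it can be inspected; real code would just log it.
    static String logTokenFetchState(boolean fetchDelegationTokensEnabled) {
        String msg =
                fetchDelegationTokensEnabled
                        ? "Delegation token fetching is enabled; obtaining HDFS/HBase tokens for the AM container."
                        : "Delegation token fetching is disabled; assuming tokens are provided externally.";
        LOG.info(msg);
        return msg;
    }

    public static void main(String[] args) {
        logTokenFetchState(true);
        logTokenFetchState(false);
    }
}
```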
Thanks for creating this ticket. After reading some implementation of Spark, I think it is a correct direction to disable fetching the delegation tokens. I left some comments and please have a look.
        ContainerLaunchContext amContainer,
        List<Path> paths,
        Configuration conf,
        boolean yarnFetchDelegationEnabled)
Suggested change:
-        boolean yarnFetchDelegationEnabled)
+        boolean obtainingDelegationTokens)
final Text id = new Text(token.getIdentifier());
LOG.info("Adding user token " + id + " with " + token);
credentials.addToken(id, token);
final Text alias = new Text(token.getService());
I think using token.getService() makes sense here. However, I would suggest factoring out these changes into a separate commit.
Got it.
List<Path> yarnAccessList =
        ConfigUtils.decodeListFromConfig(
                configuration, YarnConfigOptions.YARN_ACCESS, Path::new);
List<Path> yarnAccessList = new ArrayList<>();
I would suggest documenting that yarn.security.kerberos.additionalFileSystems should be unset when obtaining delegation tokens is disabled.
Maybe we could also add a precondition check instead of processing it silently.
You mean that when YARN_ACCESS and YARN_SECURITY_ENABLED are both set, the Flink client should fail fast by throwing an exception or exiting with a non-zero code. Right? @wangyang0918
If we disable obtaining the delegation tokens (HDFS/HBase), then yarn.security.kerberos.additionalFileSystems is expected to be unset. Right?
If this is the case, I think throwing an exception here makes sense.
Yes. I agree with you; I will add a precheck.
Done. Please review it.
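The fail-fast precheck agreed on here could look roughly like this simplified sketch. It is an assumption-laden stand-in: plain parameters and IllegalArgumentException replace Flink's Configuration and IllegalConfigurationException, and the method name is my own:

```java
public class ConfigPrecheckSketch {
    // Fail fast instead of silently ignoring yarn.security.kerberos.additionalFileSystems
    // when delegation token fetching is disabled.
    static void checkYarnAccessConsistency(
            boolean fetchDelegationTokenEnabled, String additionalFileSystems) {
        if (!fetchDelegationTokenEnabled && additionalFileSystems != null) {
            throw new IllegalArgumentException(
                    "When delegation token fetching is disabled, "
                            + "yarn.security.kerberos.additionalFileSystems must be unset.");
        }
    }

    public static void main(String[] args) {
        checkYarnAccessConsistency(true, "hdfs://namenode2:9002"); // valid combination
        checkYarnAccessConsistency(false, null);                   // valid combination
        try {
            checkYarnAccessConsistency(false, "hdfs://namenode2:9002");
        } catch (IllegalArgumentException e) {
            System.out.println("fast fail: " + e.getMessage());
        }
    }
}
```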
@@ -344,6 +344,13 @@
        .withDescription(
                "A comma-separated list of additional Kerberos-secured Hadoop filesystems Flink is going to access. For example, yarn.security.kerberos.additionalFileSystems=hdfs://namenode2:9002,hdfs://namenode3:9003. The client submitting to YARN needs to have access to these file systems to retrieve the security tokens.");

public static final ConfigOption<Boolean> YARN_SECURITY_ENABLED =
I think whether to fetch a delegation token and whether "security is enabled" are different things. So I would suggest renaming this option to something like KERBEROS_USE_DELEGATION_TOKEN.
Besides, delegation tokens are not specific to yarn; I think Spark supports delegation tokens for both yarn and mesos. So I also suggest moving this option to SecurityOptions in flink-core, alongside the other kerberos configurations.
Yes, renaming is ok. But KERBEROS_USE_DELEGATION_TOKEN may not be suitable.
Actually, this option indicates whether Flink should fetch the delegation token actively, not whether to use delegation tokens.
Hmm, could you please explain in which situation a user would want to fetch delegation tokens but not use them? Or what does a user need to do to really use delegation tokens?
Hadoop delegation tokens are used to access HDFS/HBase/Hive services. So when you want to interact with HDFS, you will need one.
Two options can be used to access the above services.
First is to fetch a delegation token using a keytab.
Second is to get it from the current UGI, if the current UGI already has a delegation token that can be reused.
Refer to: https://blog.cloudera.com/hadoop-delegation-tokens-explained/
If this option only controls whether we fetch delegation tokens, then maybe we can call it KERBEROS_FETCH_DELEGATION_TOKEN?
+1. Got it.
After rechecking the delegation token usage scope, I think it's only specific to Hadoop and related frameworks, like Hive/HBase. Right?
If not, can you give me a doc or issue link on Spark Mesos delegation tokens? Thanks.
Delegation tokens are usually used by a distributed job to authenticate with a Hadoop-based service like Hive or HBase. But it should be orthogonal to the resource management framework, like yarn, mesos, or k8s. The Spark docs indicate that delegation tokens are supported for yarn and mesos. And its delegation token implementations (e.g. HadoopDelegationTokenManager, HadoopDelegationTokenProvider) are in spark-core.
Actually, other projects may also implement their own delegation token mechanisms, like Kafka does. But I guess that exceeds the scope of this ticket.
Done. Please review it.
Updated commits. Please check. @XComp @lirui-apache @wangyang0918 @KarmaGYZ
.defaultValue(true)
.withDescription(
        "Indicates whether to fetch delegation token. If true, Flink will fetch "
                + "HDFS/HBase delegation tokens and inject into AM container."
I think we should have a more generic description for this option, maybe something like "whether to fetch delegation tokens for external services the Flink job needs to contact". We can mention only HDFS and HBase are supported, and only works for flink-yarn at the moment.
Done. Please take a look. Thanks.
+ "Only HDFS and HBase are supported, and only works for flink-yarn at the moment. "
+ "If true, Flink will fetch HDFS/HBase delegation tokens and inject into Yarn AM container. "
+ "If false, Flink will assume that the job has delegation tokens and will not fetch them. "
+ "This applies to submission mechanisms like Oozie, which will obtain delegation tokens "
Could you please elaborate on what it means by saying "This applies to submission mechanisms like Oozie"? IIUC, Flink can't control how Oozie submits jobs, right?
AFAIU, Oozie schedules Flink job submissions and utilizes Hadoop's ProxyUser feature to obtain a delegation token to access the components (in our case HDFS and HBase). As far as I understand it, the delegation token request usually triggered by Flink would fail if Apache Oozie handles the delegation tokens, as it uses its own ProxyUser that impersonates the actual users.
IMHO, "This applies to submission mechanisms like Oozie" seems to imply that Oozie also respects this config option. Alternatively, maybe we can say something like: "You may want to disable this option if you rely on submission mechanisms, e.g. Apache Oozie, to handle delegation tokens."
Sorry, the description is not clear.
> You may want to disable this option, if you rely on submission mechanisms, e.g. Apache Oozie, to handle delegation tokens.
This is good.
Hi @zuston, sorry for the late reply. I didn't have time to look into the delegation token topic until now.
I'm still not comfortable with this change considering that it's not tested. Have you had the chance to check whether it works? Could you provide this manual test in a reproducible fashion (e.g. docker), or is this too much of an effort?
Based on what I read about it, the issue is that Apache Oozie utilizes Apache Hadoop's ProxyUser, which impersonates the actual user that has access to the actual data. I still don't understand why the delegation token fetching causes an error. Is it because the Flink job would still be submitted under the "normal" user instead of the Oozie user?
"Indicates whether to fetch delegation token for external services the Flink job needs to contact. "
        + "Only HDFS and HBase are supported, and only works for flink-yarn at the moment. "
        + "If true, Flink will fetch HDFS/HBase delegation tokens and inject into Yarn AM container. "
        + "If false, Flink will assume that the job has delegation tokens and will not fetch them. "
        + "This applies to submission mechanisms like Oozie, which will obtain delegation tokens "
Suggested change:
- "Indicates whether to fetch delegation token for external services the Flink job needs to contact. "
-         + "Only HDFS and HBase are supported, and only works for flink-yarn at the moment. "
-         + "If true, Flink will fetch HDFS/HBase delegation tokens and inject into Yarn AM container. "
-         + "If false, Flink will assume that the job has delegation tokens and will not fetch them. "
-         + "This applies to submission mechanisms like Oozie, which will obtain delegation tokens "
+ "Indicates whether to fetch the delegation tokens for external services the Flink job needs to contact. "
+         + "Only HDFS and HBase are supported. It is used in YARN deployments. "
+         + "If true, Flink will fetch HDFS and HBase delegation tokens and inject them into Yarn AM containers. "
+         + "If false, Flink will assume that the delegation tokens are managed outside of Flink. As a consequence, it will not fetch delegation tokens for HDFS and HBase. "
+         + "This applies to submission mechanisms like Apache Oozie which will obtain delegation tokens "
Here's a proposal to make the docs more explicit.
Thanks
It looks like the documentation has not been updated yet.
// for HBase
obtainTokenForHBase(credentials, conf);

if (obtainingDelegationTokens) {
Would it make sense to have separate parameters for each of the services similar to how Spark is implementing this feature? That's just an idea I came up with after browsing the Spark sources where it's separated like that.
Maybe we should open another issue to separate it.
I guess it's not necessary for now considering that we don't have a use-case where we want to disable them individually. So, I'm fine with keeping it like that considering that it makes configuration easier.
boolean yarnAccessFSEnabled =
        flinkConfiguration.get(YarnConfigOptions.YARN_ACCESS) != null;
if (!kerberosFetchDTEnabled && yarnAccessFSEnabled) {
    throw new RuntimeException(
Wouldn't a warning be sufficient? The YARN_ACCESS setting does not harm the system; it would just be ignored if delegation tokens are disabled.
YARN_ACCESS and KERBEROS_FETCH_DELEGATION_TOKEN are exclusive, so it's better to fail fast.
If it's silently ignored, users will be confused about whether YARN_ACCESS really works.
True, I missed the thread. Thanks for pointing to it.
Boolean kerberosFetchDelegationTokenEnabled =
        configuration.getBoolean(SecurityOptions.KERBEROS_FETCH_DELEGATION_TOKEN);

if (kerberosFetchDelegationTokenEnabled) {
    // set HDFS delegation tokens when security is enabled
    LOG.info("Adding delegation token to the AM container.");
    yarnAccessList =
            ConfigUtils.decodeListFromConfig(
                    configuration, YarnConfigOptions.YARN_ACCESS, Path::new);
}

Utils.setTokensFor(
        amContainer,
        ListUtils.union(yarnAccessList, fileUploader.getRemotePaths()),
-       yarnConfiguration);
+       yarnConfiguration,
+       kerberosFetchDelegationTokenEnabled);
I'm wondering whether this change is actually necessary. Is there a specific reason for it? Aren't we implicitly ignoring the YARN_ACCESS setting in Utils.setTokensFor? Or am I missing something?
YARN_ACCESS and KERBEROS_FETCH_DELEGATION_TOKEN are exclusive.
It's the Kerberos mechanism. Please check the related Spark doc: https://spark.apache.org/docs/latest/running-on-yarn.html#launching-your-application-with-apache-oozie
Got it...
@XComp Thanks for your reply.
We have applied and tested this PR in our production. It looks fine. Sorry, I am not very familiar with the Flink code. Can you provide a link to a test case involving kerberized HDFS? I want to refer to it to add a test case. How can I reproduce it?
No. Oozie will submit the Flink job without a keytab and rely only on the delegation token to access HDFS. And why do these exceptions occur? Actually, there is no need for Flink to fetch the token; Flink can directly use the token Oozie has fetched.
Force-pushed 6988932 to c5b8d96 (Compare)
Thanks for the update @zuston. I'm just wondering why you squashed the two commits into one again. Considering that the two changes (service name as alias and new parameter) are independent of each other, I would favor @wangyang0918's proposal of keeping the two commits separate.
@XComp Thanks for your reply.
Gentle ping @XComp
Thanks for recreating the split between the two commits. Two things slipped under my radar in the last review and still need to be addressed.
throw new IllegalConfigurationException(
        "When security.kerberos.fetch.delegation-token is set, "
                + "yarn.security.kerberos.additionalFileSystems must be unset.");
Suggested change:
- throw new IllegalConfigurationException(
-         "When security.kerberos.fetch.delegation-token is set, "
-                 + "yarn.security.kerberos.additionalFileSystems must be unset.");
+ throw new IllegalConfigurationException(
+         String.format(
+                 "When %s is disabled, %s must be disabled as well.",
+                 SecurityOptions.KERBEROS_FETCH_DELEGATION_TOKEN.key(),
+                 YarnConfigOptions.YARN_ACCESS.key()));
The message was wrong, wasn't it? Additionally, it's better to use references to the parameters instead of plain text.
Done
        amContainer,
        ListUtils.union(yarnAccessList, fileUploader.getRemotePaths()),
        yarnConfiguration);
List<Path> pathsToObtainToken = null;
Suggested change:
- List<Path> pathsToObtainToken = null;
+ List<Path> pathsToObtainToken = Collections.emptyList();
I'm not that comfortable passing null into Utils.setTokensFor, as it's not a @Nullable parameter. It does not harm the execution because it's only used in Utils.setTokensFor if SecurityOptions.KERBEROS_FETCH_DELEGATION_TOKEN is enabled (and in that case we overwrite pathsToObtainToken). But still, I would vote for switching to an empty list instead.
Done~
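The defensive choice argued for above (empty list over null for a non-@Nullable parameter) can be illustrated with a minimal sketch; the method name and signature here are illustrative, not the actual Flink ones.

```java
import java.util.Collections;
import java.util.List;

// Passing Collections.emptyList() instead of null lets downstream code
// iterate or query the list unconditionally, with no null check needed.
public class EmptyListOverNull {
    static int countTokenPaths(List<String> pathsToObtainToken) {
        // Safe even when the caller has nothing to fetch tokens for.
        return pathsToObtainToken.size();
    }

    public static void main(String[] args) {
        List<String> pathsToObtainToken = Collections.emptyList();
        System.out.println(countTokenPaths(pathsToObtainToken)); // prints 0
        // countTokenPaths(null) would throw a NullPointerException instead.
    }
}
```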
@XComp Thanks~. All resolved
Thanks for responding so quickly, @zuston. I did some code analysis on the getIdentifier/getService issue, which makes me wonder whether we should do it in a dedicated ticket instead of sneaking it in here as part of FLINK-21700. This solution does not require the change from getIdentifier to getService, does it? See my comments below.
final Text id = new Text(token.getIdentifier());
LOG.info("Adding user token " + id + " with " + token);
credentials.addToken(id, token);
LOG.info("Adding user token " + token.getService() + " with " + token);
Iterating over it once more, I realized that HadoopModule uses the identifier as well. In contrast, the HBase delegation token has always used getService. Do we have to align that? Moreover, I'm wondering whether we should make this a dedicated ticket to align it, as it's not necessary for this feature. That would make this issue more transparent. Do you have any objections to that, @zuston?
Couldn't agree more. It will be clearer to separate the PRs. I will submit a new one and we can discuss the details there.
IIUC, we will move the second commit to a separate ticket. Right?
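The getIdentifier/getService aliasing question debated above can be modeled with a toy example (plain Java maps, not the Hadoop Credentials API). Hadoop stores tokens under an alias; if two tokens ever carried the same identifier bytes but targeted different services, keying by getIdentifier() would let one overwrite the other, while keying by getService() keeps one entry per target service. Whether such a collision actually occurs in practice is exactly what the dedicated follow-up ticket should clarify, so treat this purely as an illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of token aliasing: each token is {service, identifier, value}.
public class TokenAliasSketch {
    static Map<String, String> addTokens(String[][] tokens, boolean keyByService) {
        Map<String, String> credentials = new HashMap<>();
        for (String[] t : tokens) {
            String alias = keyByService ? t[0] : t[1]; // service vs identifier
            credentials.put(alias, t[2]);              // later puts overwrite same alias
        }
        return credentials;
    }

    public static void main(String[] args) {
        String[][] tokens = {
            {"ha-hdfs:ns1", "id-0", "token-a"},
            {"ha-hdfs:ns2", "id-0", "token-b"}, // hypothetical identifier collision
        };
        System.out.println(addTokens(tokens, true).size());  // 2: one entry per service
        System.out.println(addTokens(tokens, false).size()); // 1: second token overwrote the first
    }
}
```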
+ "If false, Flink will assume that the delegation tokens are managed outside of Flink. " | ||
+ "As a consequence, it will not fetch delegation tokens for HDFS and HBase. " | ||
+ "You may need to disable this option, if you rely on submission mechanisms, e.g. Apache Oozie, " | ||
+ "to handle delegation tokens. "); |
+ "to handle delegation tokens. "); | |
+ "to handle delegation tokens."); |
a minor thing
Got it.
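For reference, a sketch of what disabling the retrieval could look like in flink-conf.yaml, assuming the option ends up under SecurityOptions as `security.kerberos.fetch.delegation-token` (the key name was still being discussed in this PR, so verify it against your Flink version):

```yaml
# flink-conf.yaml -- delegation tokens are managed outside of Flink
# (e.g. by Apache Oozie), so Flink must not fetch them itself.
security.kerberos.fetch.delegation-token: false
```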
TokenCache.obtainTokensForNamenodes(credentials, paths.toArray(new Path[0]), conf);
// for HBase
obtainTokenForHBase(credentials, conf);
}
Have we discussed adding an else branch containing an info log message pointing out that delegation token retrieval for HDFS and HBase is disabled?
Sorry, I missed that. So a message like: LOG.info("Delegation token retrieval for HDFS and HBase is disabled.");
@XComp Done.
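The control flow agreed on above can be sketched like this. The actual Hadoop and HBase calls (TokenCache.obtainTokensForNamenodes, obtainTokenForHBase) are stubbed out with placeholders, and the method signature is simplified, so this is a sketch of the branching logic only, not the real Utils.setTokensFor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Only obtain HDFS/HBase delegation tokens when fetching is enabled;
// otherwise log that retrieval is disabled (tokens are managed externally).
public class TokenFetchSketch {
    private static final Logger LOG = Logger.getLogger("TokenFetchSketch");

    static List<String> setTokensFor(boolean fetchToken, List<String> paths) {
        List<String> obtained = new ArrayList<>();
        if (fetchToken) {
            // stand-in for TokenCache.obtainTokensForNamenodes(credentials, paths, conf)
            obtained.addAll(paths);
            // stand-in for obtainTokenForHBase(credentials, conf)
            obtained.add("hbase");
        } else {
            LOG.info("Delegation token retrieval for HDFS and HBase is disabled.");
        }
        return obtained;
    }

    public static void main(String[] args) {
        System.out.println(setTokensFor(false, List.of("hdfs://nn1")).size()); // prints 0
        System.out.println(setTokensFor(true, List.of("hdfs://nn1")).size());  // prints 2
    }
}
```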
Thanks for splitting it up, @zuston. I mentioned one minor issue below. But other than that, I guess we can finalize it. @wangyang0918, would you be fine with confirming this PR?
flink-core/src/main/java/org/apache/flink/configuration/SecurityOptions.java
@XComp Thanks for your review. All done.
@XComp Gentle ping. Could you help merge it?
Sure, I will initiate it as soon as @wangyang0918 has given it another pass. He has it on his to-do list and will get to it soon-ish.
@zuston Thanks for updating this PR. It looks really good to me. I just left some minor comments. Please have a look.
BTW, we have unrelated documentation changes in the second commit. And if we agree to move the second commit to a separate ticket, then this PR needs to be refined again.
@@ -529,6 +529,18 @@ public void killCluster(ApplicationId applicationId) throws FlinkException {
"Hadoop security with Kerberos is enabled but the login user "
+ "does not have Kerberos credentials or delegation tokens!");
}
boolean fetchToken = |
nit: fetchToken and yarnAccessFSEnabled could be final.
Done
|
boolean fetchToken =
        flinkConfiguration.getBoolean(SecurityOptions.KERBEROS_FETCH_DELEGATION_TOKEN);
boolean yarnAccessFSEnabled = |
We could use CollectionUtil.isNullOrEmpty to check whether YARN_ACCESS is not set.
Done
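The helper suggested above folds the null check and the emptiness check into one call. Flink ships it as CollectionUtil.isNullOrEmpty; it is re-implemented here with the JDK only so the example is self-contained.

```java
import java.util.Collection;
import java.util.Collections;

// Minimal re-implementation of Flink's CollectionUtil.isNullOrEmpty,
// used in the PR to test whether YARN_ACCESS is configured.
public class NullOrEmpty {
    static boolean isNullOrEmpty(Collection<?> collection) {
        return collection == null || collection.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isNullOrEmpty(null));                    // prints true
        System.out.println(isNullOrEmpty(Collections.emptyList())); // prints true
        System.out.println(
                isNullOrEmpty(Collections.singletonList("hdfs://nn:9002"))); // prints false
    }
}
```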
amContainer,
ListUtils.union(yarnAccessList, fileUploader.getRemotePaths()),
yarnConfiguration);
List<Path> pathsToObtainToken = Collections.emptyList();
nit: I would like to create a final list and add the elements explicitly. This avoids pathsToObtainToken being changed unexpectedly. Also, the IDE complains about an unchecked assignment when using ListUtils.union.
final List<Path> pathsToObtainToken = new ArrayList<>();
...
pathsToObtainToken.addAll(yarnAccessList);
pathsToObtainToken.addAll(fileUploader.getRemotePaths());
Done
final Text id = new Text(token.getIdentifier());
LOG.info("Adding user token " + id + " with " + token);
credentials.addToken(id, token);
LOG.info("Adding user token " + token.getService() + " with " + token);
IIUC, we will move the second commit to a separate ticket. Right?
… on a secure cluster
@wangyang0918 All done. Thanks
+1 for merging.
Thanks @wangyang0918 for giving it another pass. @rmetzger, could you take over and merge the PR, please?
Thanks for the contribution and the reviews. Merging.
* [hotfix][docs][python] Fix the invalid url in state.md * [hotfix][table-planner-blink] Fix bug for window join: plan is wrong if join condition contains 'IS NOT DISTINCT FROM' Fix Flink-22098 caused by a mistake when rebasing This closes apache#15695 * [hotfix][docs] Fix the invalid urls in iterations.md * [FLINK-16384][table][sql-client] Support 'SHOW CREATE TABLE' statement This closes apache#13011 * [FLINK-16384][table][sql-client] Improve implementation of 'SHOW CREATE TABLE' statement. This commit tries to 1. resolve the conflicts 2. revert the changes made on old planner 3. apply spotless formatting 4. fix DDL missing `TEMPORARY` keyword for temporary table 5. display table's full object path as catalog.db.table 6. support displaying the expanded query for view 7. add view test in CatalogTableITCase, and adapt sql client test to the new test framework 8. adapt docs This closes apache#13011 * [hotfix][docs][python] Add missing pages for PyFlink documentation * [FLINK-22348][python] Fix DataStream.execute_and_collect which doesn't declare managed memory for Python operators This closes apache#15665. * [FLINK-22350][python][docs] Add documentation for user-defined window in Python DataStream API This closes apache#15702. * [FLINK-22354][table-planner] Fix the timestamp function return value precision isn't matching with declared DataType This closes apache#15689 * [FLINK-22354][table] Remove the strict precision check of time attribute field This closes apache#15689 * [FLINK-22354][table] Fix TIME field display in sql-client This closes apache#15689 * [hotfix] Set default value of resource-stabilization-timeout to 10s Additionally, set speed up the AdaptiveScheduler ITCases by configuring a very low jobmanager.adaptive-scheduler.resource-stabilization-timeout. * [FLINK-22345][coordination] Remove incorrect assertion in scheduler When this incorrect assertion is violated, the scheduler can trip into an unrecoverable failover loop. 
* [hotfix][table-planner-blink] Fix the incorrect ExecNode's id in testIncrementalAggregate.out After FLINK-22298 is finished, the ExecNode's id should always start from 1 in the json plan tests, while the testIncrementalAggregate.out was overrided by FLINK-20613 This closes apache#15698 * [FLINK-20654] Fix double recycling of Buffers in case of an exception on persisting Exception can be thrown for example if task is being cancelled. This was leading to same buffer being recycled twice. Most of the times that was just leading to an IllegalReferenceCount being thrown, which was ignored, as this task was being cancelled. However on rare occasions this buffer could have been picked up by another task after being recycled for the first time, recycled second time and being picked up by another (third task). In that case we had two users of the same buffer, which could lead to all sort of data corruptions. * [hotfix] Regenerate html configuration after 4be9aff * [FLINK-22001] Fix forwarding of JobMaster exceptions to user [FLINK-XXXX] Draft separation of leader election and creation of JobMasterService [FLINK-XXXX] Continued work on JobMasterServiceLeadershipRunner [FLINK-XXXX] Integrate RunningJobsRegistry, Cancelling state and termination future watching [FLINK-XXXX] Delete old JobManagerRunnerImpl classes [FLINK-22001] Add tests for DefaultJobMasterServiceProcess [FLINK-22001][hotfix] Clean up ITCase a bit [FLINK-22001] Add missing check for empty job graph [FLINK-22001] Rename JobMasterServiceFactoryNg to JobMasterServiceFactory This closes apache#15715. 
* [FLINK-22396][legal] Remove unnecessary entries from sql hbase connector NOTICE file This closes apache#15706 * [FLINK-22396][hbase] Remove unnecessary entries from shaded plugin configuration * [FLINK-19980][table-common] Add a ChangelogMode.upsert() shortcut * [FLINK-19980][table-common] Add a ChangelogMode.all() shortcut * [FLINK-19980][table] Support fromChangelogStream/toChangelogStream This closes apache#15699. * [hotfix][table-api] Properly deprecate StreamTableEnvironment.execute * [FLINK-22345][coordination] Catch pre-mature state restore for Operator Coordinators * [FLINK-22168][table] Partition insert can not work with union all This closes apache#15608 * [FLINK-22341][hive] Fix describe table for hive dialect This closes apache#15660 * [FLINK-22000][io] Set a default character set in InputStreamReader * [FLINK-22385][runtime] Fix type cast error in NetworkBufferPool This closes apache#15693. * [FLINK-22085][tests] Update TestUtils::tryExecute() to cancel the job after execution failure. This closes apache#15713 * [hotfix][runtime] fix typo in ResourceProfile This closes apache#15723 * [hotfix] Fix typo in TestingTaskManagerResourceInfoProvider * [FLINK-21174][coordination] Optimize the performance of DefaultResourceAllocationStrategy This commit optimize the performance of matching requirements with available/pending resources in DefaultResourceAllocationStrategy. 
This closes apache#15668 * [FLINK-22177][docs][table] Add documentation for time functions and time zone support This closes apache#15634 * [FLINK-22398][runtime] Fix incorrect comments in InputOutputFormatVertex (apache#15705) * [FLINK-22384][docs] Fix typos in Chinese "fault-tolerance/state" page (apache#15694) * [FLINK-21903][docs-zh] Translate "Importing Flink into an IDE" page into Chinese This closes apache#15721 * [FLINK-21659][coordination] Properly expose checkpoint settings for initializing jobs * [hotfix][tests] Remove unnecessary field * [FLINK-20723][tests] Ignore expected exception when retrying * [FLINK-20723][tests] Allow retries to be defined per class * [FLINK-20723][cassandra][tests] Retry on NoHostAvailableException * [hotfix][docs] Fix typo in jdbc execution options * [hotfix][docs] Fix typos * [FLINK-12351][DataStream] Fix AsyncWaitOperator to deep copy StreamElement when object reuse is enabled apache#8321 (apache#8321) * [FLINK-22119][hive][doc] Update document for hive dialect This closes apache#15630 * [FLINK-20720][docs][python] Add documentation about output types for Python DataStream API This closes apache#15733. * [FLINK-20720][python][docs] Add documentation for ProcessFunction in Python DataStream API This closes apache#15733. * [FLINK-22294][hive] Hive reading fail when getting file numbers on different filesystem nameservices This closes apache#15624 * [FLINK-22356][hive][filesystem] Fix partition-time commit failure when watermark is applied defined TIMESTAMP_LTZ column This closes apache#15709 * [FLINK-22433][tests] Make CoordinatorEventsExactlyOnceITCase work with Adaptive Scheduler. The test previously relied on an implicit contract that instances of OperatorCoordinators are never recreated on the same JobManager. That implicit contract is no longer true with the Adaptive Scheduler. This change adjusts the test to no longer make that assumption. 
This closes apache#15739 * [hotfix][tests] Remove timout rule from KafkaTableTestBase That way, we get thread dumps from the CI infrastructure when the test hangs. * [FLINK-21247][table] Fix problem in MapDataSerializer#copy when there exists custom MapData This closes apache#15569 * [FLINK-21923][table-planner-blink] Fix ClassCastException in SplitAggregateRule when a query contains both sum/count and avg function This closes apache#15341 * [FLINK-19449][doc] Fix wrong document for lead and lag * [FLINK-19449][table] Introduce LinkedListSerializer * [FLINK-19449][table] Pass isBounded to AggFunctionFactory * [FLINK-19449][table-planner] LEAD/LAG cannot work correctly in streaming mode This closes apache#15747 * [FLINK-18199][doc] translate FileSystem SQL Connector page into chinese This closes apache#13459 * [FLINK-22444][docs] Drop async checkpoint description of state backends in Chinese docs * [FLINK-22445][python][docs] Add documentation for row-based operations This closes apache#15757. * [FLINK-22463][table-planner-blink] Fix IllegalArgumentException in WindowAttachedWindowingStrategy when two phase is enabled for distinct agg This closes apache#15759 * [hotfix] Do not use ExecutorService.submit since it can swallow exceptions This commit changes the KubernetesLeaderElector to use ExecutorService.execute instead of submit which ensures that potential exceptions are forwarded to the fatal uncaught exeception handler. This closes apache#15740. * [FLINK-22431] Add information when and why the AdaptiveScheduler restarts or fails jobs This commit adds info log statements to tell the user when and why it restarts or fails a job. This closes apache#15736. * [hotfix] Add debug logging to the states of the AdaptiveScheduler This commit adds debug log statements to the states of the AdaptiveScheduler to log whenever we ignore a global failure. 
* [hotfix] Harden against FLINK-21376 by checking for null failure cause In order to harden the AdaptiveScheduler against FLINK-21376, this commit checks whether a task failure cause is null or not. In case of null, it will replace the failure with a generic cause. * [FLINK-22301][runtime] Statebackend and CheckpointStorage type is not shown in the Web UI [FLINK-22301][runtime] Statebackend and CheckpointStorage type is not shown in the Web UI This closes apache#15732. * [hotfix][network] Remove unused method BufferPool#getSubpartitionBufferRecyclers * [FLINK-22424][network] Prevent releasing PipelinedSubpartition while Task can still write to it This bug was happening when a downstream tasks were failing over or being cancelled. If all of the downstream tasks released subpartition views, underlying memory segments/buffers could have been recycled, while upstream task was still writting some records. The problem is fixed by adding the writer (result partition) itself as one more reference counted user of the result partition * [FLINK-22085][tests] Remove timeouts from KafkaSourceLegacyITCase * [FLINK-19606][table-runtime-blink] Refactor utility class JoinConditionWithFullFilters from AbstractStreamingJoinOperator This closes apache#15752 * [hotfix][coordination] Add log for slot allocation in FineGrainedSlotManager This closes apache#15748 * [FLINK-22074][runtime][test] Harden FineGrainedSlotManagerTest#testRequirementCheckOnlyTriggeredOnce in case deploying on a slow machine This closes apache#15751 * [FLINK-22470][python] Make sure that the root cause of the exception encountered during compiling the job was exposed to users in all cases This closes apache#15766. * [hotfix][docs] Removed duplicate word This closes apache#15756 . * [hotfix][python][docs] Add missing debugging page for PyFlink documentation * [FLINK-22136][e2e] Odd parallelism for resume_externalized_checkpoints was added to run-nightly-tests.sh. 
* [FLINK-22136][e2e] Exception instead of system out is used for errors in DataStreamAllroundTestProgram * [hotfix][python][docs] Clarify python.files option (apache#15779) * [FLINK-22476][docs] Extend the description of the config option `execution.target` This closes apache#15777. * [FLINK-18952][python][docs] Add "10 minutes to DataStream API" documentation This closes apache#15769. * [hotfix][table-planner-blink] Fix unstable itcase in OverAggregateITCase#testRowNumberOnOver This closes apache#15782 * [FLINK-22378][table] Derive type of SOURCE_WATERMARK() from time attribute This closes apache#15730. * [hotfix][python][docs] Fix compile issues * [FLINK-22469][runtime-web] Fix NoSuchFileException in HistoryServer * [FLINK-22289] Update JDBC XA sink docs * [FLINK-22428][docs][table] Translate "Timezone" page into Chinese (apache#15753) * [FLINK-22471][connector-elasticsearch] Remove commads from list This is a responsibility of the Formatter implementation. * [FLINK-22471][connector-elasticsearch] Do not repeat default value * [FLINK-22471][connector-jdbc] Improve spelling and avoid repeating default values * [FLINK-22471][connector-kafka] Use proper Description for connector options * [FLINK-22471][connector-kinesis] Remove required from ConfigOption This is defined by where the factory includes the option. * [FLINK-22471][connector-kinesis] Use proper Description for connector options * [FLINK-22471][table-runtime-blink] Remove repetition of default values * [FLINK-22471][table-runtime-blink] Use proper Description for connector options This closes apache#15764. * [hotfix][python][docs] Fix flat_aggregate example in Python Doc * [hotfix][python][docs] Add documentation to remind users to bundle Python UDF definitions when submitting the job This closes apache#15790. 
* [FLINK-17783][orc] Add array,map,row types support for orc row writer This closes apache#15746 * [FLINK-22489][webui] Fix displaying individual subtasks backpressure-level Previously (incorrectly) backpressure-level from the whole task was being displayed for each of the subtasks. * [FLINK-22479[Kinesis][Consumer] Potential lock-up under error condition * [FLINK-22304][table] Refactor some interfaces for TVF based window to improve the extendability This closes apache#15745 * [hotfix][python][docs] Fix python.archives docs This closes apache#15783. * [FLINK-20086][python][docs] Add documentation about how to override open() in UserDefinedFunction to load resources This closes apache#15795. * [FLINK-22438][metrics] Add numRecordsOut metric for Async IO This closes apache#15791. * [FLINK-22373][docs] Add Flink 1.13 release notes This closes apache#15687 * [FLINK-22495][docs] Add Reactive Mode section to K8s * [FLINK-21967] Add documentation on the operation of blocking shuffle This closes apache#15701 * [FLINK-22232][tests] Add UnalignedCheckpointsStressITCase * [FLINK-22232][network] Add task name to output recovery tracing. * [FLINK-22232][network] Add task name to persist tracing. * [FLINK-22232][network] More logging of network stack. * [FLINK-22109][table-planner-blink] Resolve misleading exception message in invalid nested function This closes apache#15523. * [FLINK-22426][table] Fix several shortcomings that prevent schema expressions This fixes a couple of critical bugs in the stack that prevented Table API expressions to be used for schema declaration. Due to time constraints, this PR could not make it into the 1.13 release but should be added to the next bugfix release for a smoother user experience. For testing and consistency, it exposes the Table API expression sourceWatermark() in Java, Scala, and Python API. This closes apache#15798 * [FLINK-22493] Increase test stability in AdaptiveSchedulerITCase. 
This addresses the following problem in the testStopWithSavepointFailOnFirstSavepointSucceedOnSecond() test. Once all tasks are running, the test triggers a savepoint, which intentionally fails, because of a test exception in a Task's checkpointing method. The test then waits for the savepoint future to fail, and the scheduler to restart the tasks. Once they are running again, it performs a sanity check whether the savepoint directory has been properly removed. In the reported run, there was still the savepoint directory around. The savepoint directory is removed via the PendingCheckpoint.discard() method. This method is executed using the i/o executor pool of the CheckpointCoordinator. There is no guarantee that this discard method has been executed when the job is running again (and the executor shuts down with the dispatcher, hence it is not bound to job restarts). * [FLINK-22510][core] Format durations with highest unit When converting a configuration value to a string, durations were formatted in nanoseconds regardless of their values. This produces serialized outputs which are hard to understand for humans. The functionality of formatting in the highest unit which allows the value to be an integer already exists, thus we can simply defer to it to produce a more useful result. This closes apache#15809. 
* [FLINK-22250][sql-parser] Add missing 'createSystemFunctionOnlySupportTemporary' entry in ParserResource.properties (apache#15582) * [FLINK-22524][docs] Fix the incorrect Java groupBy clause in Table API docs (apache#15802) * [FLINK-22522][table-runtime-blink] BytesHashMap prints many verbose INFO level logs (apache#15801) * [FLINK-22539][python][docs] Restructure the Python dependency management documentation (apache#15818) * [FLINK-22544][python][docs] Add the missing documentation about the command line options for PyFlink * Update japicmp configuration for 1.13.0 * [hotfix][python][docs] Correct a few invalid links and typos in PyFlink documentation * [FLINK-22368] Deque channel after releasing on EndOfPartition ...and don't enqueue the channel if it received EndOfPartition previously. Leaving a released channel enqueued may lead to CancelTaskException which can prevent EndOfPartitionEvent propagation and the job being stuck. * [FLINK-14393][webui] Add an option to enable/disable cancel job in web ui This closes apache#15817. 
* [FLINK-22323] Fix typo in JobEdge#connecDataSet (apache#15647) Co-authored-by: yunhua@dtstack.com <123456Lq> * [hotfix][python] Add missing space to exception message * [hotfix][docs] Fix typo * [FLINK-22535][runtime] CleanUp is invoked for task even when the task fail during the restore * [FLINK-22535][runtime] CleanUp is invoked despite of fail inside of cancelTask * [FLINK-22432] Update upgrading.md with 1.13.x * [FLINK-22253][docs] Update back pressure monitoring docs with new WebUI changes * [hotfix][docs] Update unaligned checkpoint docs * [hotfix][network] Remove redundant word from comment * [FLINK-21131][webui] Alignment timeout displayed in checkpoint configuration(WebUI) * [FLINK-22548][network] Remove illegal unsynchronized access to PipelinedSubpartition#buffers After peeking last buffer, this last buffer could have been processed and recycled by the task thread, before NetworkActionsLogger.traceRecover would manage to check it's content causing IllegalReferenceCount exceptions. Further more, this log seemed excessive and was accidentally added after debugging session, so it's ok to remove it. * [FLINK-22563][docs] Add migration guide for new StateBackend interfaces Co-authored-by: Nico Kruber <nico.kruber@gmail.com> This closes apache#15831 * [FLINK-22379][runtime] CheckpointCoordinator checks the state of all subtasks before triggering the checkpoint * [FLINK-22488][hotfix] Update SubtaskGatewayImpl to specify the cause of sendEvent() failure when triggering task failover * [FLINK-22442][CEP] Using scala api to change the TimeCharacteristic of the PatternStream is invalid This closes apache#15742 * [FLINK-22573][datastream] Fix AsyncIO calls timeout on completed element. As long as the mailbox is blocked, timers are not cancelled, such that a completed element might still get a timeout. The fix is to check the completed flag when the timer triggers. * [FLINK-22573][datastream] Fix AsyncIO calls timeout on completed element. 
As long as the mailbox is blocked, timers are not cancelled, such that a completed element might still get a timeout. The fix is to check the completed flag when the timer triggers. * [FLINK-22233][table] Fix "constant" typo in PRIMARY KEY exception messages This closes apache#15656 Co-authored-by: wuys <wuyongsheng@qtshe.com> * [hotfix][hive] Add an ITCase that checks partition-time commit pretty well This closes apache#15754 * [FLINK-22512][hive] Fix issue of calling current_timestamp with hive dialect for hive-3.1 This closes apache#15819 * [FLINK-22362][network] Improve error message when taskmanager can not connect to jobmanager * [FLINK-22581][docs] Keyword CATALOG is missing in sql client doc (apache#15844) * [hotfix][docs] Replace local failover with partial failover * [FLINK-22266][hotfix] Do not clean up jobs in AbstractTestBase if the MiniCluster is not running This closes apache#15829 * [hotfix][docs] Fix code tabs for state backend migration * [hotfix][docs][python] Fix the example in intro_to_datastream_api.md * [hotfix][docs][python] Add an overview page for Python UDFs * [hotfix] Make reactive warning less strong, clarifications * [hotfix][docs] Re-introduce note about FLINK_CONF_DIR This closes apache#15845 * [hotfix][docs][python] Add introduction about the open method in Python DataStream API * [FLINK-21095][ci] Remove legacy slot management profile * [FLINK-22406][tests] Add RestClusterClient to MiniClusterWithClientResource * [FLINK-22406][coordination][tests] Stabilize ReactiveModeITCase * [FLINK-22419][coordination][tests] Rework RpcEndpoint delay tests * [FLINK-22560][build] Move generic filters/transformers into general shade-plugin configuration * [FLINK-22560][build] Filter maven metadata directory * [FLINK-22560][build] Add dedicated name to flink-dist shade-plugin execution * [FLINK-22555][build][python] Exclude leftover jboss files * [hotfix] Ignore failing test reported in FLINK-22559 * [hotfix][docs] Fix typo in 
dependency_management.md * [hotfix][docs] Mention new StreamTableEnvironment.fromDataStream in release notes * [hotfix][docs] Mention new StreamTableEnvironment.fromDataStream in Chinese release notes * [FLINK-22505][core] Limit the scale of Resource to 8 This closes apache#15815 * [FLINK-19606][table-runtime-blink] Introduce WindowJoinOperator and WindowJoinOperatorBuilder This closes apache#15760 * [FLINK-22536][runtime] Promote the critical log in FineGrainedSlotManager to INFO level This closes apache#15850 * [FLINK-22355][docs] Fix simple task manager memory model image This closes apache#15862 * [FLINK-19606][table-planner] Introduce StreamExecWindowJoin and window join it cases This closes apache#15479 * [hotfix][docs] Correct the examples in Python DataStream API * [FLINK-17170][kinesis] Move KinesaliteContainer to flink-connector-kinesis. This testcontainer will be used in an ITCase in the next commit. Also move system properties required for test into pom.xml. * [FLINK-17170][kinesis] Fix deadlock during stop-with-savepoint. During stop-with-savepoint cancel is called under lock in legacy sources. Thus, if the fetcher is trying to emit a record at the same time, it cannot obtain the checkpoint lock. This behavior leads to a deadlock while cancel awaits the termination of the fetcher. The fix is to mostly rely on the termination inside the run method. As a safe-guard, close also awaits termination where close is always caused without lock. * [FLINK-21181][runtime] Wait for Invokable cancellation before releasing network resources * [FLINK-22609][runtime] Generalize AllVerticesIterator * [hotfix][docs] Fix image links * [FLINK-22525][table-api] Fix gmt format in Flink from GMT-8:00 to GMT-08:00 (apache#15859) * [FLINK-22559][table-planner] The consumed DataType of ExecSink should only consider physical columns This closes apache#15864. 
* [hotfix][core] Remove unused import
* [FLINK-22537][docs] Add documentation how to interact with DataStream API. This closes apache#15837.
* [FLINK-22596] Active timeout is not triggered if there were no barriers. The active timeout did not take effect if it elapsed before the first barrier arrived. The reason is that we did not reset the future for checkpoint complete on barrier announcement. Therefore we considered the completed status for the previous checkpoint when evaluating the timeout for the current checkpoint.
* [hotfix][test] Adds -e flag to interpret newline in the right way
* [FLINK-22566][test] Adds log extraction for the worker nodes. We struggled to get the logs of the node manager, which made it hard to investigate FLINK-22566 where there was a lag between setting up the YARN containers and starting the TaskExecutor. Hopefully, the nodemanager logs located on the worker nodes will help next time to investigate something like that.
* [FLINK-22577][tests] Harden KubernetesLeaderElectionAndRetrievalITCase. This commit introduces closing logic to the TestingLeaderElectionEventHandler, which would otherwise forward calls after the KubernetesLeaderElectionDriver is closed. This closes apache#15849.
* [FLINK-22604][table-runtime-blink] Fix NPE on bundle close when task failover after a failed task open. This closes apache#15863.
* [hotfix] Disable broken savepoint tests tracked in FLINK-22067
* [FLINK-22407][build] Bump log4j to 2.24.1 - CVE-2020-9488
* [FLINK-22313][table-planner-blink] Redundant CAST in plan when selecting window start and window end in window agg (apache#15806)
* [FLINK-22523][table-planner-blink] Window TVF should throw helpful exception when specifying offset parameter (apache#15803)
* [FLINK-22624][runtime] Utilize the remaining resources of new pending task managers to fulfill requirement in DefaultResourceAllocationStrategy. This closes apache#15888.
* [FLINK-22413][WebUI] Hide Checkpointing page for batch jobs
* [FLINK-22364][doc] Translate the page of "Data Sources" to Chinese. (apache#15763)
* [FLINK-22586][table] Improve the precision derivation for decimal arithmetics. This closes apache#15848.
* [hotfix][e2e] Output and collect the logs for Kubernetes IT cases
* [FLINK-17857][test] Make K8s e2e tests run on Mac. This closes apache#14012.
* [hotfix][ci] Use static methods
* [FLINK-22556][ci] Extend JarFileChecker to search for traces of incompatible licenses
* [FLINK-21700][security] Add an option to disable credential retrieval on a secure cluster (apache#15131)
* [FLINK-22408][sql-parser] Fix SqlDropPartitions unparse error. This closes apache#15894.
* [FLINK-21469][runtime] Implement advanceToEndOfEventTime for MultipleInputStreamTask. For stop with savepoint, StreamTask#advanceToEndOfEventTime() is called (in source tasks) to advance to the max watermark. This PR implements advanceToEndOfEventTime for MultipleInputStreamTask chained sources.
* [hotfix][docs] Fix all broken images. Co-authored-by: Xintong Song <6509172+xintongsong@users.noreply.github.com>. This closes apache#15865.
* [hotfix][docs] Fix typo in k8s docs. This closes apache#15886.
* [FLINK-22628][docs] Update state_processor_api.md. This closes apache#15892.
* [hotfix] Add missing TestLogger to Kinesis tests
* [FLINK-22574] Adaptive Scheduler: Fix cancellation while in Restarting state. The Canceling state of Adaptive Scheduler was expecting the ExecutionGraph to be in state RUNNING when entering the state. However, the Restarting state is cancelling the ExecutionGraph already, thus the ExecutionGraph can be in state CANCELING or CANCELED when entering the Canceling state. Calling the ExecutionGraph.cancel() method in the Canceling state while being in ExecutionGraph.state = CANCELED || CANCELLED is not a problem. The change is guarded by a new ITCase, as this issue affects the interplay between different AS states. This closes apache#15882.
* [FLINK-22640][datagen] Fix DataGen SQL connector not supporting the min/max options for fields of decimal type. This closes apache#15900.
* [FLINK-22534][runtime][yarn] Set delegation token's service name as credential alias. This closes apache#15810.
* [FLINK-22618][runtime] Fix incorrect free resource metrics of task managers. This closes apache#15887.
* [FLINK-15064][table-planner-blink] Remove XmlOutput util class in blink planner since Calcite has fixed the issue. This closes apache#15911.
* [FLINK-19796][table] Explicit casting should be made if the type of an element in `ARRAY/MAP` does not equal the derived component type. This closes apache#15906.
* [FLINK-22475][table-common] Document usage of '#' placeholder in option keys
* [FLINK-22475][table-common] Exclude options with '#' placeholder from validation of required options
* [FLINK-22475][table-api-java-bridge] Add placeholder options for datagen connector. This closes apache#15896.
* [FLINK-22511][python] Fix the bug of non-composite result type in Python TableAggregateFunction. This closes apache#15796.
* [hotfix][docs] Changed argument for toDataStream to Table. This closes apache#15923.
* [FLINK-22654][sql-parser] Fix SqlCreateTable#toString()/unparse() losing CONSTRAINTS and watermarks. This closes apache#15918.
* [FLINK-22658][table] Remove deprecated util class TableConnectorUtil (apache#15914)
* [FLINK-22592][runtime] numBuffersInLocal is always zero when using unaligned checkpoints. This closes apache#15915.
* [FLINK-22400][hive] Fix NPE when converting Flink object for Map. This closes apache#15712.
* [FLINK-22649][python][table-planner-blink] Support StreamExecPythonCalc json serialization/deserialization. This closes apache#15913.
* [FLINK-22502][checkpointing] Don't tolerate checkpoint retrieval failures on recovery. Ignoring such failures and running with an incomplete set of checkpoints can lead to consistency violation. Instead, transient failures should be mitigated by automatic job restart.
* [FLINK-22667][docs] Add missing slash
* [FLINK-22650][python][table-planner-blink] Support StreamExecPythonCorrelate json serialization/deserialization. This closes apache#15922.
* [FLINK-22652][python][table-planner-blink] Support StreamExecPythonGroupWindowAggregate json serialization/deserialization. This closes apache#15934.
* [FLINK-22666][table] Make structured type's fields more lenient during casting. Compare children individually for anonymous structured types. This fixes issues with primitive fields and Scala case classes. This closes apache#15935.
* [hotfix][table-planner-blink] Give more helpful exception for codegen structured types
* [FLINK-22620][orc] Drop BatchTableSource OrcTableSource and related classes. This removes the OrcTableSource and related classes including OrcInputFormat. Use the filesystem connector with an ORC format as a replacement. It is possible to read via Table & SQL API and convert the Table to DataStream API if necessary. DataSet API is not supported anymore. This closes apache#15891.
* [FLINK-22651][python][table-planner-blink] Support StreamExecPythonGroupAggregate json serialization/deserialization. This closes apache#15928.
* [FLINK-22653][python][table-planner-blink] Support StreamExecPythonOverAggregate json serialization/deserialization. This closes apache#15937.
* [FLINK-12295][table] Fix comments in MinAggFunction and MaxAggFunction. This closes apache#15940.
* [FLINK-20695][ha] Clean ha data for job if globally terminated. At the moment Flink only cleans up the ha data (e.g. K8s ConfigMaps, or Zookeeper nodes) while shutting down the cluster. This is not enough for a long running session cluster to which you submit multiple jobs. In this commit, we clean up the data for the particular job if it reaches a globally terminal state. This closes apache#15561.
* [FLINK-22067][tests] Wait for vertices to using API
* [hotfix][tests] Remove try/catch from SavepointWindowReaderITCase.testApplyEvictorWindowStateReader
* [FLINK-22622][parquet] Drop BatchTableSource ParquetTableSource and related classes. This removes the ParquetTableSource and related classes including various ParquetInputFormats. Use the filesystem connector with a Parquet format as a replacement. It is possible to read via Table & SQL API and convert the Table to DataStream API if necessary. DataSet API is not supported anymore. This closes apache#15895.
* [hotfix] Ignore failing KinesisITCase tracked in FLINK-22613
* [FLINK-19545][e2e] Add e2e test for native Kubernetes HA. The HA e2e test will start a Flink application first and wait for three successful checkpoints. Then kill the JobManager. A new JobManager should be launched and recover the job from the latest successful checkpoint. Finally, cancel the job and all the K8s resources should be cleaned up automatically. This closes apache#14172.
* [FLINK-22656] Fix typos
* [hotfix][runtime] Fixes JavaDoc for RetrievableStateStorageHelper. RetrievableStateStorageHelper is not only used by ZooKeeperStateHandleStore but also by KubernetesStateHandleStore.
* [hotfix][runtime] Cleans up unnecessary annotations
* [FLINK-22494][kubernetes] Introduces PossibleInconsistentStateException. We experienced cases where the ConfigMap was updated but the corresponding HTTP request failed due to connectivity issues. PossibleInconsistentStateException is used to reflect cases where it's not clear whether the data was actually written or not.
* [FLINK-22494][ha] Refactors TestingLongStateHandleHelper to operate on references. The previous implementation stored the state in the StateHandle. This causes problems when deserializing the state, creating a new instance that does not point to the actual state but is a copy of this state. This refactoring introduces LongStateHandle handling the actual state and LongRetrievableStateHandle referencing this handle.
* [FLINK-22494][ha] Introduces PossibleInconsistentState to StateHandleStore
* [FLINK-22494][runtime] Refactors CheckpointsCleaner to also handle discardOnFailedStoring
* [FLINK-22494][runtime] Adds PossibleInconsistentStateException handling to CheckpointCoordinator. This closes apache#15832.
* [FLINK-22515][docs] Add documentation for GSR-Flink Integration
* [FLINK-20487][table-planner-blink] Remove restriction on StreamPhysicalGroupWindowAggregate which only supports insert-only input node. This closes apache#14830.
* [FLINK-22623][hbase] Drop BatchTableSource/Sink HBaseTableSource/Sink and related classes. This removes the HBaseTableSource/Sink and related classes including various HBaseInputFormats and HBaseSinkFunction. It is possible to read via Table & SQL API and convert the Table to DataStream API (or vice versa) if necessary. DataSet API is not supported anymore. This closes apache#15905.
* [hotfix][hbase] Fix warnings around decimals in HBaseTestBase
* [FLINK-22636][zk] Group job specific zNodes under /jobs zNode. In order to better clean up job specific HA services, this commit changes the layout of the zNode structure so that the JobMaster leader, checkpoints and checkpoint counter are now grouped below the jobs/ zNode. Moreover, this commit groups the leaders of the cluster components (Dispatcher, ResourceManager, RestServer) under /leader/process/latch and /leader/process/connection-info. This closes apache#15893.
* [FLINK-22696][tests] Enable Confluent Schema Registry e2e test on jdk 11

Co-authored-by: Dian Fu <dianfu@apache.org>
Co-authored-by: Jing Zhang <beyond1920@126.com>
Co-authored-by: Shengkai <1059623455@qq.com>
Co-authored-by: Jane Chan <qingyue.cqy@gmail.com>
Co-authored-by: huangxingbo <hxbks2ks@gmail.com>
Co-authored-by: Leonard Xu <xbjtdcq@gmail.com>
Co-authored-by: Till Rohrmann <trohrmann@apache.org>
Co-authored-by: Stephan Ewen <sewen@apache.org>
Co-authored-by: godfreyhe <godfreyhe@163.com>
Co-authored-by: Piotr Nowojski <piotr.nowojski@gmail.com>
Co-authored-by: Robert Metzger <rmetzger@apache.org>
Co-authored-by: Dawid Wysakowicz <dwysakowicz@apache.org>
Co-authored-by: Timo Walther <twalthr@apache.org>
Co-authored-by: Jingsong Lee <jingsonglee0@gmail.com>
Co-authored-by: Rui Li <lirui@apache.org>
Co-authored-by: dbgp2021 <80257284+dbgp2021@users.noreply.github.com>
Co-authored-by: sharkdtu <sharkdtu@tencent.com>
Co-authored-by: Dong Lin <lindong28@gmail.com>
Co-authored-by: chuixue <chuixue@dtstack.com>
Co-authored-by: Yangze Guo <karmagyz@gmail.com>
Co-authored-by: kanata163 <35188210+kanata163@users.noreply.github.com>
Co-authored-by: zhaoxing <zhaoxing@dayuwuxian.com>
Co-authored-by: Chesnay Schepler <chesnay@apache.org>
Co-authored-by: Cemre Mengu <cemremengu@gmail.com>
Co-authored-by: paul8263 <xzhangyao@126.com>
Co-authored-by: Jark Wu <jark@apache.org>
Co-authored-by: zhangjunfan <zuston.shacha@gmail.com>
Co-authored-by: Shuo Cheng <mf1533007@smail.nju.edu.cn>
Co-authored-by: Tartarus0zm <zhangmang1@163.com>
Co-authored-by: JingsongLi <lzljs3620320@aliyun.com>
Co-authored-by: Michael Li <brighty916@gmail.com>
Co-authored-by: Yun Tang <myasuka@live.com>
Co-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: mans2singh <mans2singh@users.noreply.github.com>
Co-authored-by: Anton Kalashnikov <kaa.dev@yandex.ru>
Co-authored-by: yiksanchan <evan.chanyiksan@gmail.com>
Co-authored-by: Tony Wei <tony19920430@gmail.com>
Co-authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Co-authored-by: Yuan Mei <yuanmei.work@gmail.com>
Co-authored-by: hackergin <jinfeng1995@gmail.com>
Co-authored-by: Ingo Bürk <ingo.buerk@tngtech.com>
Co-authored-by: wangwei1025 <jobwangwei@yeah.net>
Co-authored-by: Danny Cranmer <dannycranmer@apache.org>
Co-authored-by: shuo.cs <shuo.cs@alibaba-inc.com>
Co-authored-by: zhangzhengqi3 <zhangzhengqi5@jd.com>
Co-authored-by: Yun Gao <gaoyunhenhao@gmail.com>
Co-authored-by: Arvid Heise <arvid@ververica.com>
Co-authored-by: MaChengLong <592577182@qq.com>
Co-authored-by: zhaown <51357674+chaozwn@users.noreply.github.com>
Co-authored-by: Huachao Mao <huachaomao@gmail.com>
Co-authored-by: GuoWei Ma <guowei.mgw@gmail.com>
Co-authored-by: Roman Khachatryan <khachatryan.roman@gmail.com>
Co-authored-by: fangyue1 <fangyuefy@163.com>
Co-authored-by: Jacklee <lqjacklee@126.com>
Co-authored-by: zhang chaoming <72908278+ZhangChaoming@users.noreply.github.com>
Co-authored-by: sasukerui <199181qr@sina.com>
Co-authored-by: Seth Wiesman <sjwiesman@gmail.com>
Co-authored-by: chennuo <chennuo@didachuxing.com>
Co-authored-by: wysstartgo <wysstartgo@163.com>
Co-authored-by: wuys <wuyongsheng@qtshe.com>
Co-authored-by: 莫辞 <luoyuxia.luoyuxia@alibaba-inc.com>
Co-authored-by: gentlewangyu <yuwang0917@gmail.com>
Co-authored-by: Youngwoo Kim <ywkim@apache.org>
Co-authored-by: HuangXiao <hx36w35@163.com>
Co-authored-by: Roc Marshal <64569824+RocMarshal@users.noreply.github.com>
Co-authored-by: Leonard Xu <xbjtdcq@163.com>
Co-authored-by: Matthias Pohl <matthias@ververica.com>
Co-authored-by: lincoln lee <lincoln.86xy@gmail.com>
Co-authored-by: Senhong Liu <amazingliux@gmail.com>
Co-authored-by: wangyang0918 <danrtsey.wy@alibaba-inc.com>
Co-authored-by: aidenma <aidenma@tencent.com>
Co-authored-by: Kevin Bohinski <kbohinski@users.noreply.github.com>
Co-authored-by: yangqu <quyanghaoren@126.com>
Co-authored-by: Terry Wang <zjuwangg@foxmail.com>
Co-authored-by: wangxianghu <wangxianghu@apache.org>
Co-authored-by: hehuiyuan <hehuiyuan@jd.com>
Co-authored-by: lys0716 <luystu@gmail.com>
Co-authored-by: Yi Tang <ssnailtang@gmail.com>
Co-authored-by: LeeJiangchuan <lijiangchuan_ljc@yeah.net>
Co-authored-by: Linyu <yaoliny@amazon.com>
Co-authored-by: Fabian Paul <fabianpaul@ververica.com>
Linked to ISSUE.
Why
I want to support Flink Batch Action on Oozie.
Oozie obtains the HDFS/HBase delegation tokens via the Hadoop proxy-user mechanism before it starts the Flink submitter CLI. If Flink then tries to fetch delegation tokens again on the client, the submission fails with an exception, because a proxy user has no Kerberos credentials of its own to fetch tokens with.
Spark already supports disabling delegation token retrieval on the Spark client (see the related Spark documentation).
So Flink should likewise allow disabling Hadoop delegation token retrieval on YARN.
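With the option introduced in this PR (the key is taken from the diff above), disabling client-side token retrieval would look roughly like the following `flink-conf.yaml` fragment. This is a sketch: it assumes the default of the new option stays `true`, so existing secure deployments are unaffected unless they opt out explicitly.

```yaml
# flink-conf.yaml (sketch) — let an external scheduler such as Oozie supply
# the HDFS/HBase delegation tokens instead of fetching them in the Flink client.
# Option key taken from this PR's diff; the default is assumed to remain `true`.
yarn.security.kerberos.fetch.delegationToken.enabled: false
```

In an Oozie workflow, the launcher would ship the pre-obtained tokens to the Flink submitter, and this setting keeps Flink from attempting a second (and failing) token fetch.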