UI: Change Run Job availability based on ACLs #5944
Conversation
This isn’t a valid solution, but it gets closer to one by ensuring the token is loaded when the application boots. Then I can add another step to load and parse the token’s policies.
The endpoint doesn’t actually support this 😳
This should have been part of dc98403.
It’s cut off by the edge of the viewport for now!
These unrelated tests were failing because they assumed that the first 404 was stable.
Here’s another similar instance.
My placeholder test was incorrect.
The “greatest number of matched characters” part is still forthcoming.
It’s too bad reduce is so unwieldy in JavaScript… maybe this would be better as a for loop ☹️
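For what the “greatest number of matched characters” selection might look like as a plain for loop: a minimal sketch under the assumption that, among globs matching the active namespace, the one with the most literal (non-wildcard) characters wins. All names here (`findMatchingNamespace`, `globToRegExp`) are illustrative, not the PR’s actual code.

```javascript
// Hypothetical sketch, not the PR's implementation: pick the matching glob
// with the most literal (non-*) characters, written as a for loop.
function globToRegExp(glob) {
  // Escape regex metacharacters, then turn each * wildcard into .*
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&').replace(/\*/g, '.*');
  return new RegExp(`^${escaped}$`);
}

function findMatchingNamespace(globs, namespace) {
  let best = null;
  let bestLiteralLength = -1;
  for (const glob of globs) {
    if (!globToRegExp(glob).test(namespace)) continue;
    // Approximate "greatest number of matched characters" by counting the
    // glob's literal characters; more literals = a more specific match.
    const literalLength = glob.replace(/\*/g, '').length;
    if (literalLength > bestLiteralLength) {
      best = glob;
      bestLiteralLength = literalLength;
    }
  }
  return best; // null when nothing matches
}
```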
This is written assuming #6017 is merged as-is. It’s trivial to change the property name if needed!
…pect-acl

# Conflicts:
#	ui/mirage/scenarios/default.js
#	ui/package.json
#	ui/yarn.lock
Since the API expands a policy shorthand like “write” into its constituent capabilities, examining the policy is no longer needed.
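With shorthands already expanded server-side, the client-side check reduces to simple membership. A hypothetical sketch (the function name and rule shape are illustrative, not the PR’s code):

```javascript
// Sketch only: "write" has already been expanded by the API into concrete
// capabilities such as "submit-job", so no policy interpretation is needed.
function supportsRunning(rulesForNamespace) {
  const capabilities = (rulesForNamespace && rulesForNamespace.Capabilities) || [];
  return capabilities.includes('submit-job');
}
```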
I’m hesitant to add storage-clearing before every test.
I made this gist with scripts and configuration to help test this out locally. Note that after changing tokens, a refresh is required, due to #6492. Some implementation notes:
Hey @backspace !
I left quite a few comments I think! Most are just me thinking out loud though. The most important one is the CSS selector one; you'll probably want to change that before merging. The others are kind of up to you, but I'd defo consider using a disabled <button> here rather than a <div>, although maybe there's a reason I'm not aware of as to why you can't use a button?
I did a little run of the app, but I don't know how to set things up so I can see things working (the default dev setup just gives me default, namespace-1 and namespace-2), no biggie though as Michael can probably give it a once-over also.
Oh also, is there test coverage here for when you're running Nomad without namespace support?
P.S. Guess who missed a couple of commits off the end of here cos he didn't pull 😆 ! Ignore the CSS selector one!
ui/app/templates/jobs/index.hbs
Outdated
{{#if (can "run job")}}
  {{#link-to "jobs.run" data-test-run-job class="button is-primary"}}Run Job{{/link-to}}
{{else}}
  <div data-test-run-job class="button tooltip is-right-aligned" aria-label="You don’t have permission to run jobs" disabled>Run Job</div>
Noticed the disabled here. I think I've seen this working on form-like elements such as fieldsets; does it work with divs also? Actually, should this be a button rather than a div? I suppose it's a disabled, non-interactive button which may as well be a div 😁, but I'm not sure what accessibility concerns might come into play here. If it were me, personally I'd prefer semantic HTML.
Also, just noticed the curly apostrophe. Super nit, I know; I'm guessing it will come out OK on all platforms, but thought I'd check just in case. Do you use curly punctuation elsewhere in Nomad?
I guess I chose a div over a button because the thing it’s in parallel to is an a, but it’s true that having it be a true button makes the most sense, so I’ve changed it, thanks.
Re the curly apostrophe: maybe it’s the only one that isn’t inside code. I type this way automatically, so it doesn’t occur to me. Maybe @DingoEatingFuzz has a preference, but I like them haha
I am a fan of using proper typographic marks, therefore I am a fan of the curly apostrophe. That said, I haven't been disciplined about ensuring things like curly quotes, ellipses, en dashes, and the like.
@@ -49,6 +49,7 @@
   "d3-transition": "^1.1.0",
   "ember-ajax": "^5.0.0",
   "ember-auto-import": "^1.2.21",
+  "ember-can": "^2.0.0",
I am literally about to work on something very similar to this PR, so thanks for the pointer to ember-can!
ui/app/services/token.js
Outdated
    .catch(() => []);
  }
} catch (e) {
  return null;
Pretty sure you did this because you want things to fail silently here. I also noticed the empty catch up above somewhere; just thought I'd check that you don't want any user-visible errors here, which I'm guessing is what this means.
Also, just curious more than anything, but there is a mix of try/catch and .then()/.catch() code here. Is there any way to write it so that you always use one style? No prob if not, just wondering really.
It’s possible that something should happen if policy-fetching fails… I’m not sure what that would be, it’s maybe somewhat of a design question. But it’s true that it was overly convoluted, I’ve removed the redundancy so it’s more idiomatic, thanks.
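A single-style version using async/await might look roughly like this. This is a sketch with hypothetical names, not the PR’s actual service code; `fetchImpl` is injected only to keep the example self-contained.

```javascript
// Sketch: one way to collapse mixed try/catch and .then()/.catch() styles
// into a single async/await style. Names and shapes here are hypothetical.
async function fetchSelfTokenPolicies(fetchImpl) {
  try {
    const response = await fetchImpl('/v1/acl/token/self');
    const token = await response.json();
    return token.Policies || [];
  } catch (e) {
    // Fail silently, mirroring the PR's behaviour: no token or a failed
    // request simply yields no policies.
    return [];
  }
}
```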
systemService.set('activeNamespace.name', '000-abc-999');
assert.ok(jobAbility.canRun, 'expected to be able to match against more than one wildcard');
});
});
Bit of a nitpick here, but to me these are all integration tests for canRun; they are testing more than a 'unit'. A unit test for canRun would be a test to see if or works (Ember owns or, so it wouldn't be worth unit testing it yourself). A lot of what you are testing here is Ember code and how you integrate with it.
I'd say the best thing to unit test here would be the _findMatchingNamespace method, which seems to be the actual logic for this feature; the rest of the code is pretty much integrating that into the framework/app. If you could make _findMatchingNamespace importable you could even test it without all the mocking code you have here, so your test code would be way shorter.
Testing taxonomy/nomenclature is always subjective, but to me this does qualify as a unit test, because I’m mocking the interfaces with the rest of the application that the unit interacts with; for me, an integration test wouldn’t use mocking. I don’t tend to extract things like the namespace-matching until they’re useful elsewhere in the application (which this likely never will be) unless I need to for testing reasons, which I don’t think I do in this case. But I recognise this is a realm of many opinions and little consensus 🤓
Cool, thanks for the thoughts!
Thanks to @johncowen for pointing out that this is on the way to being deprecated: #5944 (comment) emberjs/rfcs#554
Great work and excellent test coverage!
I'm happy with this as is but I left a bunch of "food for thought" type comments too.
canRun: or('selfTokenIsManagement', 'policiesSupportRunning'),

selfTokenIsManagement: equal('token.selfToken.type', 'management'),
I can see this pattern being repeated a bunch. Might be worth one of us thinking about it as we repeat it for both exec and node drain.
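The repeated "management token OR matching capability" pattern could conceivably be factored into a small helper. A plain-JS sketch with hypothetical names; the real abilities express this with Ember computed macros (or, equal) inside ember-can ability classes:

```javascript
// Hypothetical sketch: one factory for the check repeated across run,
// exec, and node-drain abilities. Plain JS for illustration only.
function makeAbilityCheck(capability) {
  return ({ selfToken, capabilitiesForActiveNamespace }) =>
    // Management tokens can do everything; otherwise require the capability.
    (selfToken && selfToken.type === 'management') ||
    capabilitiesForActiveNamespace.includes(capability);
}

const canRunJob = makeAbilityCheck('submit-job');
```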
} else if (namespaceNames.includes('default')) {
  return 'default';
}
},
After reading through this whole file, I could make some nitpicks about potential performance gains from caching intermediate values and sorting lists before scanning them, but instead I won't 😛 There are unlikely ever to be so many namespaces that this becomes performance-critical code.
However I will point out that there is a lot going on in this ability file. It speaks to the possible need for a better policy primitive that can be used to make authoring abilities easier.
I'm curious what your thoughts here are since you have spent more time working through policy parsing and traversing.
I agree that there’s a lot happening and that it’ll be worth extracting; I just tend to push that kind of thing into the future, to when it’s actually needed, to avoid premature abstraction. There was a time when I’d create elaborate generalised structures in anticipation and end up with something that didn’t work as well when it came time to use it elsewhere, so now I err on the side of solving the immediate problem and generalising once it becomes useful, so the solution can be informed by real needs.
I’m planning to check ACLs for the exec button so that time isn’t far off 😆
  {{#link-to "jobs.run" data-test-run-job class="button is-primary"}}Run Job{{/link-to}}
{{else}}
  <button data-test-run-job class="button tooltip is-right-aligned" aria-label="You don’t have permission to run jobs" disabled>Run Job</button>
{{/if}}
The duplication is a bummer, but it's also not the worst. It's right at the point where it could be dangerous, but at least the two occurrences are co-located.
I also wouldn't be opposed to a job-run-button component.
This must have seemed/been necessary at some point but doesn’t break anything when removed!
I’m going to merge! 😯
* actually always canonicalize alloc.Job alloc.Job may be stale as well and need to migrate it. It does cost extra cycles but should be negligible. * e2e: improve reusability of provisioning scripts (hashicorp#6942) This changeset is part of the work to improve our E2E provisioning process to allow our upgrade tests: * Move more of the setup into the AMI image creation so it's a little more obvious to provisioning config authors which bits are essential to deploying a specific version of Nomad. * Make the service file update do a systemd daemon-reload so that we can update an already-running cluster with the same script we use to deploy it initially. * Avoid unnecessary golang version reference * add a script to update golang version * Update golang to 1.12.15 * Update ecs.html.md * Update configuring-tasks.html.md * ui: Change Run Job availability based on ACLs (hashicorp#5944) This builds on API changes in hashicorp#6017 and hashicorp#6021 to conditionally turn off the “Run Job” button based on the current token’s capabilities, or the capabilities of the anonymous policy if no token is present. If you try to visit the job-run route directly, it redirects to the job list. * Update changelog * e2e: use valid jobspec for group check test (hashicorp#6967) Group service checks cannot interpolate task fields, because the task fields are not available at the time the script check hook is created for the group service. When f31482a was merged this e2e test began failing because we are now correctly matching the script check ID to the service ID, which revealed this jobspec was invalid. * UI: Migrate to Storybook (hashicorp#6507) I originally planned to add component documentation, but as this dragged on and I found that JSDoc-to-Markdown sometimes needed hand-tuning, I decided to skip it and focus on replicating what was already present in Freestyle. Adding documentation is a finite task that can be revisited in the future. 
My goal was to migrate everything from Freestyle with as few changes as possible. Some adaptations that I found necessary: • the DelayedArray and DelayedTruth utilities that delay component rendering until slightly after initial render because without them: ◦ charts were rendering with zero width ◦ the JSON viewer was rendering with empty content • Storybook in Ember renders components in a routerless/controllerless context by default, so some component stories needed changes: ◦ table pagination/sorting stories access to query params, which necessitates some reaching into Ember internals to start routing and dynamically generate a Storybook route/controller to render components into ◦ some stories have a faux controller as part of their Storybook context that hosts setInterval-linked dynamic computed properties • some jiggery-pokery with anchor tags ◦ inert href='#' had to become href='javascript:; ◦ links that are actually meant to navigate need target='_parent' so they don’t navigate inside the Storybook iframe Maybe some of these could be addressed by fixes in ember-cli-storybook but I’m wary of digging around in there any more than I already have, as I’ve lost a lot of time to Storybook confusion and frustrations already 😞 The STORYBOOK=true environment variable tweaks some environment settings to get things working as expected in the Storybook context. I chose to: • use angle bracket invocation within stories rather than have to migrate them soon after having moved to Storybook • keep Freestyle around for now for its palette and typeface components * e2e: update framework to allow deploying Nomad (hashicorp#6969) The e2e framework instantiates clients for Nomad/Consul but the provisioning of the actual Nomad cluster is left to Terraform. The Terraform provisioning process uses `remote-exec` to deploy specific versions of Nomad so that we don't have to bake an AMI every time we want to test a new version. 
But Terraform treats the resulting instances as immutable, so we can't use the same tooling to update the version of Nomad in-place. This is a prerequisite for upgrade testing. This changeset extends the e2e framework to provide the option of deploying Nomad (and, in the future, Consul/Vault) with specific versions to running infrastructure. This initial implementation is focused on deploying to a single cluster via `ssh` (because that's our current need), but provides interfaces to hook the test run at the start of the run, the start of each suite, or the start of a given test case. Terraform work includes: * provides Terraform output that written to JSON used by the framework to configure provisioning via `terraform output provisioning`. * provides Terraform output that can be used by test operators to configure their shell via `$(terraform output environment)` * drops `remote-exec` provisioning steps from Terraform * makes changes to the deployment scripts to ensure they can be run multiple times w/ different versions against the same host. * e2e: ensure group script check tests interpolation (hashicorp#6972) Fixes a bug introduced in 0aa58b9 where we're writing a test file to a taskdir-interpolated location, which works when we `alloc exec` but not in the jobspec for a group script check. This changeset also makes the test safe to run multiple times by namespacing the file with the alloc ID, which has the added bonus of exercising our alloc interpolation code for group script checks. * Return FailedTGAlloc metric instead of no node err If an existing system allocation is running and the node its running on is marked as ineligible, subsequent plan/applys return an RPC error instead of a more helpful plan result. This change logs the error, and appends a failedTGAlloc for the placement. * update changelog * extract leader step function * Handle Nomad leadership flapping Fixes a deadlock in leadership handling if leadership flapped. 
Raft propagates leadership transition to Nomad through a NotifyCh channel. Raft blocks when writing to this channel, so channel must be buffered or aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader` until the channel is consumed[1] and does not move on to executing follower related logic (in `raft.runFollower`). While Raft `runLeader` defer function blocks, raft cannot process any other raft operations. For example, `run{Leader|Follower}` methods consume `raft.applyCh`, and while runLeader defer is blocked, all raft log applications or config lookup will block indefinitely. Sadly, `leaderLoop` and `establishLeader` makes few Raft calls! `establishLeader` attempts to auto-create autopilot/scheduler config [3]; and `leaderLoop` attempts to check raft configuration [4]. All of these calls occur without a timeout. Thus, if leadership flapped quickly while `leaderLoop/establishLeadership` is invoked and hit any of these Raft calls, Raft handler _deadlock_ forever. Depending on how many times it flapped and where exactly we get stuck, I suspect it's possible to get in the following case: * Agent metrics/stats http and RPC calls hang as they check raft.Configurations * raft.State remains in Leader state, and server attempts to handle RPC calls (e.g. node/alloc updates) and these hang as well As we create goroutines per RPC call, the number of goroutines grow over time and may trigger a out of memory errors in addition to missed updates. 
[1] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/config.go#L190-L193 [2] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/raft.go#L425-L436 [3] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L198-L202 [4] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L877 * e2e: document e2e provisioning process (hashicorp#6976) * Add the digital marketing team as the code owners for the website dir * Mock the eligibility endpoint in mirage * Implement eligibility toggling in the data layer * Add isMigrating property to the allocation model * Mock the drain endpoint * drain and forceDrain adapter methods * Update drain methods to properly wrap DrainSpec params * cancelDrain adapter method * Reformat the client detail page to use the two-row header design * Add tooltip to the eligibility control * Update the underlying node model when toggling eligibility in mirage * Eligibility toggling behavior * PopoverMenu component * Update the dropdown styles to be more similar to button styles * Multiline modifier for tooltips * More form styles as needed for the drain form * Initial layout of the drain options popover * Let dropdowns assume their full width * Add triggerClass support to the popover menu * Factor out the drain popover and implement its behaviors * Extract the duration parsing into a util * Test coverage for the parse duration util * Refactor parseDuration to support multi-character units * Polish for the drain popover * Stub out all the markup for the new drain strategy view * Fill in the drain strategy ribbon values * Fill out the metrics and time since values in the drain summary * Drain complete notification * Drain stop and update and notifications * Modifiers to the two-step-button * Make outline buttons have a solid white background * Force drain button in the drain info box * 
New toggle component * Swap the eligiblity checkbox out for a toggle * Toggle bugs: focus and multiline alignment * Switch drain popover checkboxes for toggles * Clear all notifications when resetting the controller * Model the notification pattern as a page object component * Update the client detail page object * Integration tests for the toggle component * PopoverMenu integration tests * Update existing tests * New test coverage for the drain capabilities * Stack the popover menu under the subnav * Use qunit-dom where applicable * Increase the size and spacing of the toggle component * Remove superfluous information from the client details ribbon * Tweak vertical spacing of headings * Update client detail test given change to the compositeStatus property * Replace custom parse-duration implementation with an existing lib * fix comment * consul: add support for canary meta * website: add canary meta to api docs * docs: add Go versioning policy * consul: fix var name from rebase * docs: reseting bootstrap doesn't invalidate token * consul: fix var name from rebase * Update website/source/guides/security/acl.html.markdown Co-Authored-By: Tim Gross <tim@0x74696d.com> * e2e: packer builds should not be public (hashicorp#6998) * docs: tweaks * include test and address review comments * handle channel close signal Always deliver last value then send close signal. * tweak leadership flapping log messages * tests: defer closing shutdownCh * client: canonicalize alloc.Job on restore There is a case for always canonicalizing alloc.Job field when canonicalizing the alloc. I'm less certain of implications though, and the job canonicalize hasn't changed for a long time. Here, we special case client restore from database as it's probably the most relevant part. When receiving an alloc from RPC, the data should be fresh enough. 
* Support customizing full scheduler config * tests: run_for is already a string * canary_meta will be part of 0.10.3 (not 0.10.2) I assume this is just an oversight. I tried adding the `canary_meta` stanza to an existing v0.10.2 setup (Nomad v0.10.2 (0d2d6e3) and it did show the error message: ``` * group: 'ggg', task: 'tttt', invalid key: canary_meta ``` * use golang 1.12.16 * Allow nomad monitor command to lookup server UUID Allows addressing servers with nomad monitor using the servers name or ID. Also unifies logic for addressing servers for client_agent_endpoint commands and makes addressing logic region aware. rpc getServer test * fix tests, update changelog * e2e: add a -suite flag to e2e.Framework This change allows for providing the -suite=<Name> flag when running the e2e framework. If set, only the matching e2e/Framework.TestSuite.Component will be run, and all ther suites will be skipped. * Document default_scheduler_config option * document docker's disable_log_collection flag * batch mahmood's changelog entries [ci skip] * incorporate review feedback * core: add limits to unauthorized connections Introduce limits to prevent unauthorized users from exhausting all ephemeral ports on agents: * `{https,rpc}_handshake_timeout` * `{http,rpc}_max_conns_per_client` The handshake timeout closes connections that have not completed the TLS handshake by the deadline (5s by default). For RPC connections this timeout also separately applies to first byte being read so RPC connections with TLS enabled have `rpc_handshake_time * 2` as their deadline. The connection limit per client prevents a single remote TCP peer from exhausting all ephemeral ports. The default is 100, but can be lowered to a minimum of 26. Since streaming RPC connections create a new TCP connection (until MultiplexV2 is used), 20 connections are reserved for Raft and non-streaming RPCs to prevent connection exhaustion due to streaming RPCs. 
All limits are configurable and may be disabled by setting them to `0`. This also includes a fix that closes connections that attempt to create TLS RPC connections recursively. While only users with valid mTLS certificates could perform such an operation, it was added as a safeguard to prevent programming errors before they could cause resource exhaustion. * docs: document limits Taken more or less verbatim from Consul. * Merge pull request hashicorp#160 from hashicorp/b-mtls-hostname server: validate role and region for RPC w/ mTLS * docs: bump 0.10.2 -> 0.10.3 * docs: add v0.10.3 release to changelog * Add an ability for client permissions * Refactor ability tests to use a setup hook for ability lookup * Enable the eligibility toggle conditionally based on acls * Refetch all ACL things when the token changes * New disabled buttons story * Disabled button styles * Disable options for popover and drain-popover * hclfmt a test jobspec (hashicorp#7011) * Update disabled 'Run Job' button to use standard disabled style * Add an explanatory tooltip to the unauthorized node drain popover * Fix token referencing from the token controller, as well as resetting * Handle the case where ACLs aren't enabled in abilities * Account for disabled ACLs in ability tests * Acceptance test for disabled node write controls * Use secret ID for NOMAD_TOKEN Use secret ID for NOMAD_TOKEN as the accessor ID doesn't seem to work. I tried with a local micro cluster following the tutorials, and if I do: ```console $ export NOMAD_TOKEN=85310d07-9afa-ef53-0933-0c043cd673c7 ``` Using the accessor ID as in this example, I get an error: ``` Error querying jobs: Unexpected response code: 403 (ACL token not found) ``` But when using the secret ID in that env var it seems to work correctly. * Pass stats interval colleciton to executor This fixes a bug where executor based drivers emit stats every second, regardless of user configuration. 
When serializing the Stats request across grpc, the nomad agent dropped the Interval value, and then executor uses 1s as a default value. * changelog * Some fixes to connection pooling Pick up some fixes from Consul: * If a stream returns an EOF error, clear session from cache/pool and start a new one. * Close the codec when closing StreamClient * Allow for an icon within the node status light * Add an icon inside the node status light * Assign icons to node statuses * New node initializing icon * Redo the node-status-light CSS to be icon-based * Add an animation for the initializing state * Call out the 'down' status too, since it's a pretty bad one * command, docs: create and document consul token configuration for connect acls (hashicorpgh-6716) This change provides an initial pass at setting up the configuration necessary to enable use of Connect with Consul ACLs. Operators will be able to pass in a Consul Token through `-consul-token` or `$CONSUL_TOKEN` in the `job run` and `job revert` commands (similar to Vault tokens). These values are not actually used yet in this changeset. * nomad: ensure a unique ClusterID exists when leader (hashicorpgh-6702) Enable any Server to lookup the unique ClusterID. If one has not been generated, and this node is the leader, generate a UUID and attempt to apply it through raft. The value is not yet used anywhere in this changeset, but is a prerequisite for hashicorpgh-6701. * client: enable nomad client to request and set SI tokens for tasks When a job is configured with Consul Connect aware tasks (i.e. sidecar), the Nomad Client should be able to request from Consul (through Nomad Server) Service Identity tokens specific to those tasks. * nomad: proxy requests for Service Identity tokens between Clients and Consul Nomad jobs may be configured with a TaskGroup which contains a Service definition that is Consul Connect enabled. These service definitions end up establishing a Consul Connect Proxy Task (e.g. envoy, by default). 
In the case where Consul ACLs are enabled, a Service Identity token is required for these tasks to run & connect, etc. This changeset enables the Nomad Server to recieve RPC requests for the derivation of SI tokens on behalf of instances of Consul Connect using Tasks. Those tokens are then relayed back to the requesting Client, which then injects the tokens in the secrets directory of the Task. * client: enable envoy bootstrap hook to set SI token When creating the envoy bootstrap configuration, we should append the "-token=<token>" argument in the case where the sidsHook placed the token in the secrets directory. * nomad: fixup token policy validation * nomad: handle SI token revocations concurrently Be able to revoke SI token accessors concurrently, and also ratelimit the requests being made to Consul for the various ACL API uses. * agent: re-enable the server in dev mode * client: remove unused indirection for referencing consul executable Was thinking about using the testing pattern where you create executable shell scripts as test resources which "mock" the process a bit of code is meant to fork+exec. Turns out that wasn't really necessary in this case. * client: skip task SI token file load failure if testing as root The TestEnvoyBootstrapHook_maybeLoadSIToken test case only works when running as a non-priveleged user, since it deliberately tries to read an un-readable file to simulate a failure loading the SI token file. * comments: cleanup some leftover debug comments and such * nomad,client: apply smaller PR suggestions Apply smaller suggestions like doc strings, variable names, etc. 
Co-Authored-By: Nick Ethier <nethier@hashicorp.com> Co-Authored-By: Michael Schurter <mschurter@hashicorp.com> * nomad,client: apply more comment/style PR tweaks * client: set context timeout around SI token derivation The derivation of an SI token needs to be safegaurded by a context timeout, otherwise an unresponsive Consul could cause the siHook to block forever on Prestart. * client: manage TR kill from parent on SI token derivation failure Re-orient the management of the tr.kill to happen in the parent of the spawned goroutine that is doing the actual token derivation. This makes the code a little more straightforward, making it easier to reason about not leaking the worker goroutine. * nomad: fix leftover missed refactoring in consul policy checking * nomad: make TaskGroup.UsesConnect helper a public helper * client: PR cleanup - shadow context variable * client: PR cleanup - improved logging around kill task in SIDS hook * client: additional test cases around failures in SIDS hook * tests: skip some SIDS hook tests if running tests as root * e2e: e2e test for connect with consul acls Provide script for managing Consul ACLs on a TF provisioned cluster for e2e testing. Script can be used to 'enable' or 'disable' Consul ACLs, and automatically takes care of the bootstrapping process if necessary. The bootstrapping process takes a long time, so we may need to extend the overall e2e timeout (20 minutes seems fine). Introduces basic tests for Consul Connect with ACLs. * e2e: remove forgotten unused field from new struct * e2e: do not use eventually when waiting for allocs This test is causing panics. Unlike the other similar tests, this one is using require.Eventually which is doing something bad, and this change replaces it with a for-loop like the other tests. 
Failure:

```
=== RUN   TestE2E/Connect
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest/TestConnectDemo
=== RUN   TestE2E/Connect/*connect.ConnectE2ETest/TestMultiServiceConnect
=== RUN   TestE2E/Connect/*connect.ConnectClientStateE2ETest
panic: Fail in goroutine after TestE2E/Connect/*connect.ConnectE2ETest has completed

goroutine 38 [running]:
testing.(*common).Fail(0xc000656500)
    /opt/google/go/src/testing/testing.go:565 +0x11e
testing.(*common).Fail(0xc000656100)
    /opt/google/go/src/testing/testing.go:559 +0x96
testing.(*common).FailNow(0xc000656100)
    /opt/google/go/src/testing/testing.go:587 +0x2b
testing.(*common).Fatalf(0xc000656100, 0x1512f90, 0x10, 0xc000675f88, 0x1, 0x1)
    /opt/google/go/src/testing/testing.go:672 +0x91
github.com/hashicorp/nomad/e2e/connect.(*ConnectE2ETest).TestMultiServiceConnect.func1(0x0)
    /home/shoenig/go/src/github.com/hashicorp/nomad/e2e/connect/multi_service.go:72 +0x296
github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually.func1(0xc0004962a0, 0xc0002338f0)
    /home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1494 +0x27
created by github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert.Eventually
    /home/shoenig/go/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/assert/assertions.go:1493 +0x272
FAIL    github.com/hashicorp/nomad/e2e    21.427s
```

* e2e: uncomment test case that is not broken
* e2e: use hclfmt on consul acls policy config files
* e2e: agent token was only being set for server0
* e2e: remove redundant extra API call for getting allocs
* e2e: set up consul ACLs a little more correctly
* tests: set consul token for nomad client for testing SIDS TR hook
* nomad: min cluster version for connect ACLs is now v0.10.4
* nomad: remove unused default scheduler variable
  This is from a merge conflict resolution that went the wrong direction.
I assumed the block had been added, but really it had been removed. Now, it is removed once again.

* docs: update changelog to mention connect with acls
* nomad/docs: increment version number to 0.10.4
* sentinel: copy jobs to prevent mutation
  It's unclear whether Sentinel code can mutate values passed to the eval, so ensure it cannot by copying the job.
* ignore computed diffs if node is ineligible
  Test flaky; add temp sleeps for debugging. Fix computed class.
* make diffSystemAllocsForNode aware of eligibility
  diffSystemAllocs -> diffSystemAllocsForNode; this function is only used for diffing system allocations, but lacked awareness of eligible nodes and the node ID that the allocation was going to be placed on. This change now ignores a change if its existing allocation is on an ineligible node. For a new allocation, it also checks tainted and ineligible nodes in the same function instead of nil-ing out the diff after computation in diffSystemAllocs.
* add test for node eligibility
* comment for filtering reason
* update changelog
* vagrant: disable audio interference
  Avoid Vagrant/VirtualBox interfering with host audio when the VM boots.
* prehook: fix enterprise repo remote value
* dev: tweaks to cluster dev scripts
  Consolidate all Nomad data dirs in a single root `/tmp/nomad-dev-cluster` to ease cleanup. Allow running the script from any path; don't require devs to cd into the `dev/cluster` directory first. Also, block while nomad processes are running and propagate SIGTERM/SIGINT to nomad processes to shut down.
* e2e: remove leftover debug println statement
* run "make hclfmt"
* make: emit explanation for /api isolation
  Emit a slightly helpful message when /api depends on nomad internal packages.
* pool: clear connection before releasing
  This is to be consistent with other connection clean-up handlers, as well as consul's https://github.com/hashicorp/consul/blob/v1.6.3/agent/pool/pool.go#L468-L479 .
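The eligibility-aware diff described above boils down to skipping an update when its existing allocation sits on a node that is no longer eligible. A much-simplified sketch with hypothetical types (the real `diffSystemAllocsForNode` works on full scheduler structures and also handles tainted nodes and new placements):

```go
package main

import "fmt"

// Alloc is a simplified stand-in for a system-job allocation.
type Alloc struct {
	ID     string
	NodeID string
}

// filterSystemUpdates drops "update" diffs whose existing allocation lives on
// a node that is no longer eligible, mirroring the idea of making the
// per-node diff eligibility-aware instead of nil-ing results afterwards.
func filterSystemUpdates(updates []Alloc, eligible map[string]bool) []Alloc {
	var keep []Alloc
	for _, a := range updates {
		if !eligible[a.NodeID] {
			continue // existing alloc is on an ineligible node: ignore the change
		}
		keep = append(keep, a)
	}
	return keep
}

func main() {
	eligible := map[string]bool{"node-a": true, "node-b": false}
	updates := []Alloc{{ID: "a1", NodeID: "node-a"}, {ID: "a2", NodeID: "node-b"}}
	fmt.Println(filterSystemUpdates(updates, eligible)) // only a1 survives
}
```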
* Fix panic when monitoring a local client node
  Fixes a panic when accessing a.agent.Server() when the agent is a client instead. This PR removes a redundant ACL check since ACLs are validated at the RPC layer. It also nil-checks the agent server and uses Client() when appropriate.
* agent: Profile req nil check
  s.agent.Server() clean-up logic and tests
* update changelog
* docs: fix misspelling
* keep placed canaries aligned with alloc status
* nomad state store must be modified through raft, rm local state change
* add state store test to ensure PlacedCanaries is updated
* docs: add link & reorg hashicorp#6690 in changelog
* docs: fix typo, ordering, & style in changelog
* e2e: turn no-ACLs connect tests back on
  Also clean up more missed debugging things >.>
* e2e: improve provisioning defaults and documentation (hashicorp#7062)
  This changeset improves the ergonomics of running the Nomad e2e test provisioning process by defaulting to a blank `nomad_sha` in the Terraform configuration. By default, a user will now need to pass in one of the Nomad version flags. But they won't have to manually edit the `provisioning.json` file for the common case of deploying a released version of Nomad, and won't need to put dummy values for `nomad_sha`. Includes general documentation improvements.
* e2e: rename linux runner to avoid implicit build tag (hashicorp#7070)
  Go implicitly treats files ending with `_linux.go` as build-tagged for Linux only. This broke the e2e provisioning framework on macOS once we tried importing it into the `e2e/consulacls` module.
* e2e: wait 2m rather than 10s after disabling consul acls
  Pretty sure Consul / Nomad clients are often not ready yet after the ConsulACLs test disables ACLs by the time the next test starts running. Running locally things tend to work, but in TeamCity this seems to be a recurring problem.
However, when running locally I sometimes do see in the "show status" step after disabling ACLs that some nodes are still initializing, suggesting we're right on the border of not waiting long enough:

```
nomad node status
ID        DC   Name              Class   Drain  Eligibility  Status
0e4dfce2  dc1  EC2AMAZ-JB3NF9P   <none>  false  eligible     ready
6b90aa06  dc2  ip-172-31-16-225  <none>  false  eligible     ready
7068558a  dc2  ip-172-31-20-143  <none>  false  eligible     ready
e0ae3c5c  dc1  ip-172-31-25-165  <none>  false  eligible     ready
15b59ed6  dc1  ip-172-31-23-199  <none>  false  eligible     initializing
```

Going to try waiting a full 2 minutes after disabling ACLs; hopefully that will help things Just Work. In the future, we should probably be parsing the output of the status checks and actually confirming all nodes are ready. Even better, maybe that's something shipyard will have built in.

* add e2e test for system sched ineligible nodes
* get test passing, new util func to wait for not pending
* clean up
* rm unused field
* fix check
* simplify job, better error
* docs: hashicorp#6065 shipped in v0.10.0, not v0.9.6
  PR hashicorp#6065 was intended to be backported to v0.9.6 to fix issue hashicorp#6223. However it appears to have not been backported:
  * https://github.com/hashicorp/nomad/blob/v0.9.6/client/allocrunner/taskrunner/task_runner.go#L1349-L1351
  * https://github.com/hashicorp/nomad/blob/v0.9.7/client/allocrunner/taskrunner/task_runner.go#L1349-L1351
  The fix was included in v0.10.0:
  * https://github.com/hashicorp/nomad/blob/v0.10.0/client/allocrunner/taskrunner/task_runner.go#L1363-L1370
* e2e: add --quiet flag to s3 copy to reduce log spam (hashicorp#7085)
* Explicit transparent bg on popover actions
* Override the max-width on mobile to avoid losing space due to non-existent gutter menu
* changelog: windows binaries being signed
  Note that as of 0.10.4, Nomad Windows binaries will be signed.
[ci skip]

* changelog for remote pprof endpoints
* nomad: unset consul token on job register
* nomad: assert consul token is unset on job register in tests
* command: use consistent CONSUL_HTTP_TOKEN name
  The Consul CLI uses CONSUL_HTTP_TOKEN, so Nomad should use the same. Note that consul-template uses CONSUL_TOKEN, which Nomad also uses, so be careful to preserve any reference to that in the consul-template context.
* docs: update changelog mentioning consul token passthrough
* release: prep 0.10.4
* Generate files for 0.10.4 release
* Release v0.10.4

Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Tim Higgison <TimHiggison@users.noreply.github.com>
Co-authored-by: Buck Doyle <buck@hashicorp.com>
Co-authored-by: Drew Bailey <2614075+drewbailey@users.noreply.github.com>
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Michael Lange <dingoeatingfuzz@gmail.com>
Co-authored-by: Nick Ethier <nethier@hashicorp.com>
Co-authored-by: Tim Gross <tim@0x74696d.com>
Co-authored-by: Shantanu Gadgil <shantanugadgil@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@hashicorp.com>
Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>
Co-authored-by: Nomad Release bot <nomad@hashicorp.com>
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This builds on API changes in #6017 and #6021 to conditionally turn off the
“Run Job” button based on the current token’s capabilities, or the capabilities
of the anonymous policy if no token is present.
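Since the API expands a shorthand like "write" into its constituent capabilities before the UI sees them, the check reduces to looking for `submit-job` in the expanded capability list. A sketch of that decision (the capability name mirrors Nomad's ACL vocabulary, but `canRunJob` itself is a hypothetical helper, not the UI's actual code):

```go
package main

import "fmt"

// canRunJob reports whether a token's expanded namespace capabilities permit
// submitting a job, which is what gates the Run Job button. Because the API
// already expands shorthands like "write" into concrete capabilities, no
// policy parsing is needed here.
func canRunJob(capabilities []string) bool {
	for _, c := range capabilities {
		if c == "submit-job" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canRunJob([]string{"list-jobs", "read-job"}))   // false: disable Run Job
	fmt.Println(canRunJob([]string{"list-jobs", "submit-job"})) // true: enable it
}
```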
If you try to visit the job-run route directly, it redirects to the job list.
Here’s a GIF to get a sense: