-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
admin: add /runtime_modify endpoint and make runtime work without fs backing #2837
admin: add /runtime_modify endpoint and make runtime work without fs backing #2837
Conversation
Per new release note policy please add a release note with any applicable feature docs to https://github.com/envoyproxy/data-plane-api/blob/master/docs/root/intro/version_history.rst. Thank you! |
@mattklein123 lost you, you linked to data-plane-api there? I added a note in |
@mattklein123 nvm, just saw the new docs, willfix |
@jsedgwick this needs docs also, so you can just do docs/release in same data-plane-api PR if that is easier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great thanks. Well done. Main comment is around operational usability.
source/common/runtime/runtime_impl.h
Outdated
|
||
NullSnapshotImpl snapshot_; | ||
Filesystem::WatcherPtr watcher_; | ||
std::string root_path_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you copied tis code, but can you constify root_path_
and override_path_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks good catch
|
||
void LoaderImpl::loadNewSnapshot() { | ||
ThreadLocal::ThreadLocalObjectSharedPtr ptr = createNewSnapshot(); | ||
tls_->set([ptr_copy = ptr](Event::Dispatcher&)->ThreadLocal::ThreadLocalObjectSharedPtr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: as this is a shared_ptr, I think you can just capture ptr
directly and don't need ptr_copy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, yeah that was supposed to be ptr = std::move(ptr)
, thx
source/server/http/admin.cc
Outdated
@@ -650,7 +665,9 @@ AdminImpl::AdminImpl(const std::string& access_log_path, const std::string& prof | |||
{"/stats", "print server stats", MAKE_ADMIN_HANDLER(handlerStats), false, false}, | |||
{"/listeners", "print listener addresses", MAKE_ADMIN_HANDLER(handlerListenerInfo), false, | |||
false}, | |||
{"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false}}, | |||
{"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern with the overrides is that if people are doing local tests, it would be very nice for /runtime
to be able to print out the original values, and the override value, and which one is in effect (roughly). Is there a way to make that happen? I think that would make this substantially easier to use. In the future you could also imagine something like POST /runtime_reset
which would reset all changed values to the original (whether on disk or in the base implementation nothing).
Can you take a look at what would be required to modify /runtime
to show state in a more obvious way? For runtime_reset
I would just throw a TODO somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hear you on that. I'm thinking of formalizing/abstracting the concept of an override and then at the endpoint you can see all three (or more, in the future, who knows) layered - disk, override on disk, override in memory.
My concern is whether it's worth the complexity to make that happen in this PR as opposed to just waiting for more structured admin stuff to land and implementing a better, proto-ized version of this. It would make a good case study in to think about how mutations would interact with structured admin.
So, WDYT about getting this in more-or-less as-is and combining your suggestions with proto-izing these endpoints? It's not about the (mild) engineering effort but more that any complexity might be immediately refactored/deleted anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to fix this as part of this change, even if it's not a super optimal solution. It's going to be too confusing if the user does not know what was overriden using /runtime_modify
. They will basically have to restart Envoy to get back to a fresh state. We need to be able to view the memory overrides and set them back if desired.
Along these lines, one additional request: Please add a gauge that is something like runtime.memory_overrides_active
which is set to 1 if any are set, and 0 otherwise. This will allow alarms for when someone changes something if desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ggreenway can you have a look as well? |
OK. I'll add something minimal to be able to view the layers of overrides.
ack on the gauge.
On Mon, Mar 19, 2018 at 1:11 PM Matt Klein ***@***.***> wrote:
***@***.**** commented on this pull request.
In source/server/http/admin.cc
<#2837 (comment)>:
> @@ -650,7 +665,9 @@ AdminImpl::AdminImpl(const std::string& access_log_path, const std::string& prof
{"/stats", "print server stats", MAKE_ADMIN_HANDLER(handlerStats), false, false},
{"/listeners", "print listener addresses", MAKE_ADMIN_HANDLER(handlerListenerInfo), false,
false},
- {"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false}},
+ {"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false},
I think we need to fix this as part of this change, even if it's not a
super optimal solution. It's going to be too confusing if the user does not
know what was overriden using /runtime_modify. They will basically have
to restart Envoy to get back to a fresh state. We need to be able to view
the memory overrides and set them back if desired.
Along these lines, one additional request: Please add a gauge that is
something like runtime.memory_overrides_active which is set to 1 if any
are set, and 0 otherwise. This will allow alarms for when someone changes
something if desired.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2837 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAc4InXJlIKnFDO0y48ZXfYxSenLTl2Fks5tf-bTgaJpZM4SuQ2_>
.
--
*James Sedgwick*
Software Engineer, Networking Infrastructure
973.738.0737 <+19737380737>
[image: Lyft] <http://www.lyft.com/>
|
{"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false}}, | ||
{"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false}, | ||
{"/runtime_modify", "modify runtime values", MAKE_ADMIN_HANDLER(handlerRuntimeModify), | ||
false, true}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattklein123 was I think commenting on this line in #2771
Sorry I didn't notice this till just now, but per the docs this should only respond to a POST. Can we fix that? Since we are probably going to have more POST methods in the future (cc @jmarantz) it might be worth it to build this into the registration system somehow, but up to you.
Can I just write a separate PR that adds POST checking whenever UrlHandler.mutates_server_state_ is true, which is the case here? That would go into AdminImpl::runCallback inside if (path_and_query.compare(0, query_index, handler.prefix_) == 0) {
The advantage of doing that there is that it solves this issue, as well as some aspects of #2763. The disadvantage is that it would require changes to existing scripts that use curl GET for mutations, when they should be POST. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point I think it's probably OK. Which are the endpoints that would be switched to POST?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all the ones whose last initializer-arg is true:
/cpuprofiler
/healthcheck/fail
/healthcheck/ok
/logging
/quitquitquit
/reset_counters
/runtime_modify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/healthcheck/fail
/healthcheck/ok
Is going to cause issues for Lyft, and potentially others. I'm wondering if we should leave those as GET with a deprecation cycle? (Like support both GET and POST and get people to migrate).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. Currently the http method is ignored so GET and POST will both work. Do you want also a warning logged when POST is issued?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, warning SGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all these settings going to be subsumed by xds at some point or runtime will continue to exist? |
Runtime will continue to exist. It fills a parallel function as a kill switch/operational system. However, as a best practice all settings should be configurable via normal config, and not just runtime. |
@mattklein123 PTAL, updated to expose override "layers;" the Left Please see a couple of TODOs addressed to reviewers in the code, especially the ones around structured output for |
test/server/http/admin_test.cc
Outdated
@@ -243,9 +243,9 @@ TEST_P(AdminInstanceTest, Runtime) { | |||
std::unordered_map<std::string, Runtime::Snapshot::Entry> entries2{ | |||
{"string_key", {"override", {}}}, {"extra_key", {"bar", {}}}}; | |||
|
|||
ON_CALL(*layer1, name()).WillByDefault(testing::ReturnRef("layer1")); | |||
ON_CALL(*layer1, name()).WillByDefault(testing::ReturnRefOfCopy(std::string{"layer1"})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super-fluent in gmock for various reasons, but this looks funny. What's the lifetime of expression std::string{"layer1"} such that it makes sense to return a ref to it?
I'd feel better if you made std::string{"layer1"} a method-scoped const.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to return a reference because the name()
method returns const std::string&
, but it's also nice to just pass the temporary string there for brevity. This is exactly what ReturnRefOfCopy
is for; as opposed to ReturnRef
, to which your suspicion applies, it will copy and hold on to the temporary, so you don't end up with a dangling reference. See https://github.com/google/googlemock/blob/master/googlemock/docs/CheatSheet.md#returning-a-value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, very cool. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to add, full disclosure I just learned about ReturnRefOfCopy
today as well! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general looks great. Some comments to get started. Will need to take another pass. @jmarantz do you mind also reviewing once the next round of updates are done?
include/envoy/runtime/runtime.h
Outdated
@@ -116,10 +135,20 @@ class Snapshot { | |||
virtual uint64_t getInteger(const std::string& key, uint64_t default_value) const PURE; | |||
|
|||
/** | |||
* TODO: Reviewers, shall I delete getAll()? It's no longer used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, nuke if not used.
include/envoy/runtime/runtime.h
Outdated
* to top; for instance, the second layer's entries override the first layer's entries, and so on. | ||
* Any layer can add a key in addition to overriding keys in layers below. The layer vector is | ||
* safe only for the lifetime of the Snapshot. | ||
* @return const std::unordered_map<std::string, const Entry>& the raw map of loaded values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: looks like returning vector, not map
source/common/runtime/runtime_impl.h
Outdated
@@ -55,41 +56,72 @@ struct RuntimeStats { | |||
}; | |||
|
|||
/** | |||
* Implementation of Snapshot that reads from disk. | |||
* Implementation of Snapshot whose source the vector of layers passed to the constructor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "whose source is ..."
source/common/runtime/runtime_impl.h
Outdated
|
||
/** | ||
* Extension of OverrideLayerImpl that maintains an in-memory set of values. These values can be | ||
* modified programmatically via mergeValues(). AdminLayer is so named because it can be access and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "can be accessed"
source/common/runtime/runtime_impl.h
Outdated
// Load a new Snapshot into TLS | ||
void loadNewSnapshot(); | ||
|
||
std::shared_ptr<AdminLayer> adminLayer_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: admin_layer_
@@ -44,7 +44,8 @@ class RandomGeneratorImpl : public RandomGenerator { | |||
COUNTER(override_dir_not_exists) \ | |||
COUNTER(override_dir_exists) \ | |||
COUNTER(load_success) \ | |||
GAUGE (num_keys) | |||
GAUGE (num_keys) \ | |||
GAUGE (admin_overrides_active) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add tests for proper set/unset of this gauge.
include/envoy/runtime/runtime.h
Outdated
virtual const std::string& name() const PURE; | ||
}; | ||
|
||
typedef std::shared_ptr<OverrideLayer> OverrideLayerSharedPtr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can this be const OverrideLayer
and OverrideLayerConstSharedPtr
LoaderImpl::LoaderImpl(DoNotLoadSnapshot /* unused */, RandomGenerator& generator, | ||
Stats::Store& store, ThreadLocal::SlotAllocator& tls) | ||
: generator_(generator), stats_(generateStats(store)), tls_(tls.allocateSlot()) { | ||
adminLayer_ = std::make_shared<AdminLayer>(stats_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move to initializer list
stats_.override_dir_not_exists_.inc(); | ||
} | ||
} catch (EnvoyException& e) { | ||
stats_.load_error_.inc(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the behavior before when we had a load error. Did we have any runtime values at all? Just wondering if this is a change in behavior or not. It seems reasonable to have admin layer, or even root, if some layer doesn't load, but just want to be clear of what behavior was before and now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this PR, no values at all. And yes, the admin layer will be put in place even if one or both of the disk layers doesn't load. Now that I think about it, should it be all or nothing for the two disk layers? What behavior do you think is best?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be simplest to just do all or nothing for all layers. If not that, I would definitely do all or nothing for the disk layers IMO. I'm fine either way, but I think I would avoid the incremental behavior we have now.
source/common/runtime/runtime_impl.h
Outdated
SnapshotImpl(const std::string& root_path, const std::string& override_path, RuntimeStats& stats, | ||
RandomGenerator& generator, Api::OsSysCalls& os_sys_calls); | ||
SnapshotImpl(RandomGenerator& generator, RuntimeStats& stats, | ||
const std::vector<OverrideLayerSharedPtr>& layers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we pass std::vector<OverrideLayerSharedPtr>&&
and move. I think all callers are move capable.
…t fs backing Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
PTAL |
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@jsedgwick reviewing now. In the future please do not rebase once you open a PR, it makes it much harder to review. Once opened, just merge master and add commits. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks great, thanks. Just a few small comments.
|
||
void LoaderImpl::loadNewSnapshot() { | ||
ThreadLocal::ThreadLocalObjectSharedPtr ptr = createNewSnapshot(); | ||
tls_->set([ptr = std::move(ptr)](Event::Dispatcher&)->ThreadLocal::ThreadLocalObjectSharedPtr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You can just capture ptr
directly without move. With C++14 you might even be able to just do ptr = createNewSnapshot()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::move
avoids a shared_ptr copy so I did this by instinct. And no, since, we need ptr
to be the TLS shared_ptr, not a unique_ptr<Snapshot>
, since the TLS system requires this lambda to be copyable. IMO this is the cleanest way to get the casts/copyability right.
*edit: to be clear, you can do this without the first line but it would make the capture clause really long and unwieldy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
source/common/runtime/runtime_impl.h
Outdated
/** | ||
* Copy-constructible so that it can snapshotted. | ||
*/ | ||
AdminLayer(const AdminLayer& o) : AdminLayer{o.stats_} { values_ = o.values(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/o/admin_layer
source/common/runtime/runtime_impl.h
Outdated
|
||
private: | ||
RuntimeStats generateStats(Stats::Store& store); | ||
ThreadLocal::SlotPtr tls_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline before this line (whitespace between methods and members)
source/server/server.cc
Outdated
@@ -289,12 +289,13 @@ Runtime::LoaderPtr InstanceUtil::createRuntime(Instance& server, | |||
ENVOY_LOG(info, "runtime override subdirectory: {}", override_subdirectory); | |||
|
|||
Api::OsSysCallsPtr os_sys_calls(new Api::OsSysCallsImpl); | |||
return Runtime::LoaderPtr{new Runtime::LoaderImpl( | |||
return Runtime::LoaderPtr{new Runtime::DiskBackedLoaderImpl( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: std::make_unique (not your code but might as well fix). Same below.
test/server/http/admin_test.cc
Outdated
"layer2" | ||
], | ||
"entries": { | ||
"extra_key": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make each entry look like:
"extra_key": {
"final_value": "...",
"layers": [ ... ]
}
I think it would make it easier to see what is the current value quickly.
source/server/http/admin.cc
Outdated
} else { | ||
response.add("usage: /runtime?format=json\n"); | ||
rc = Http::Code::BadRequest; | ||
rapidjson::Document document; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a TODO here to protoize this? I should be able to get to this soon when I finish up the other config_dump stuff.
…time_modification
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet, thanks.
Description:
Adds
/runtime_modify
endpoint to modify runtime values. Values added/modified via this endpoint are effectively another override layer. Deleting one by setting it to empty string only deletes a previously added override, not any underlying fs-backed key. The added tests demonstrate this behavior, as do added comments.This ended up mostly being a refactor of runtime_impl such that the "null" runtime actually operates on an in memory map manipulated by this endpoint instead of on an empty map. The disk backed implementation becomes an extension of the in memory implementation where the disk values are primary.
Risk Level: Low. New admin endpoint, largely a refactor.
Testing:
Added unit. Did a little manual sanity testing as well
Fixes #514