Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

admin: add /runtime_modify endpoint and make runtime work without fs backing #2837

Merged
merged 11 commits into from
Apr 2, 2018
Merged

admin: add /runtime_modify endpoint and make runtime work without fs backing #2837

merged 11 commits into from
Apr 2, 2018

Conversation

jsedgwick
Copy link

Description:
Adds /runtime_modify endpoint to modify runtime values. Values added/modified via this endpoint are effectively another override layer. Deleting one by setting it to empty string only deletes a previously added override, not any underlying fs-backed key. The added tests demonstrate this behavior, as do added comments.

This ended up mostly being a refactor of runtime_impl such that the "null" runtime actually operates on an in memory map manipulated by this endpoint instead of on an empty map. The disk backed implementation becomes an extension of the in memory implementation where the disk values are primary.

Risk Level: Low. New admin endpoint, largely a refactor.

Testing:
Added unit. Did a little manual sanity testing as well

Fixes #514

@mattklein123
Copy link
Member

Per new release note policy please add a release note with any applicable feature docs to https://github.com/envoyproxy/data-plane-api/blob/master/docs/root/intro/version_history.rst. Thank you!

@jsedgwick
Copy link
Author

@mattklein123 lost you, you linked to data-plane-api there? I added a note in RAW_RELEASE_NOTES.md in this PR. Please advise.

@jsedgwick
Copy link
Author

@mattklein123 nvm, just saw the new docs, willfix

@mattklein123
Copy link
Member

@mattklein123
Copy link
Member

@jsedgwick this needs docs also, so you can just do docs/release in same data-plane-api PR if that is easier.

@jsedgwick
Copy link
Author

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great thanks. Well done. Main comment is around operational usability.


NullSnapshotImpl snapshot_;
Filesystem::WatcherPtr watcher_;
std::string root_path_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you copied tis code, but can you constify root_path_ and override_path_

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks good catch


void LoaderImpl::loadNewSnapshot() {
ThreadLocal::ThreadLocalObjectSharedPtr ptr = createNewSnapshot();
tls_->set([ptr_copy = ptr](Event::Dispatcher&)->ThreadLocal::ThreadLocalObjectSharedPtr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: as this is a shared_ptr, I think you can just capture ptr directly and don't need ptr_copy ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, yeah that was supposed to be ptr = std::move(ptr), thx

@@ -650,7 +665,9 @@ AdminImpl::AdminImpl(const std::string& access_log_path, const std::string& prof
{"/stats", "print server stats", MAKE_ADMIN_HANDLER(handlerStats), false, false},
{"/listeners", "print listener addresses", MAKE_ADMIN_HANDLER(handlerListenerInfo), false,
false},
{"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false}},
{"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern with the overrides is that if people are doing local tests, it would be very nice for /runtime to be able to print out the original values, and the override value, and which one is in effect (roughly). Is there a way to make that happen? I think that would make this substantially easier to use. In the future you could also imagine something like POST /runtime_reset which would reset all changed values to the original (whether on disk or in the base implementation nothing).

Can you take a look at what would be required to modify /runtime to show state in a more obvious way? For runtime_reset I would just throw a TODO somewhere.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear you on that. I'm thinking of formalizing/abstracting the concept of an override and then at the endpoint you can see all three (or more, in the future, who knows) layered - disk, override on disk, override in memory.

My concern is whether it's worth the complexity to make that happen in this PR as opposed to just waiting for more structured admin stuff to land and implementing a better, proto-ized version of this. It would make a good case study in to think about how mutations would interact with structured admin.

So, WDYT about getting this in more-or-less as-is and combining your suggestions with proto-izing these endpoints? It's not about the (mild) engineering effort but more that any complexity might be immediately refactored/deleted anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to fix this as part of this change, even if it's not a super optimal solution. It's going to be too confusing if the user does not know what was overriden using /runtime_modify. They will basically have to restart Envoy to get back to a fresh state. We need to be able to view the memory overrides and set them back if desired.

Along these lines, one additional request: Please add a gauge that is something like runtime.memory_overrides_active which is set to 1 if any are set, and 0 otherwise. This will allow alarms for when someone changes something if desired.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment in #2771 about making this POST. cc @jmarantz

@mattklein123 mattklein123 self-assigned this Mar 17, 2018
@zuercher
Copy link
Member

@ggreenway can you have a look as well?

@jsedgwick
Copy link
Author

jsedgwick commented Mar 19, 2018 via email

{"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false}},
{"/runtime", "print runtime values", MAKE_ADMIN_HANDLER(handlerRuntime), false, false},
{"/runtime_modify", "modify runtime values", MAKE_ADMIN_HANDLER(handlerRuntimeModify),
false, true}},
Copy link
Contributor

@jmarantz jmarantz Mar 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattklein123 was I think commenting on this line in #2771

Sorry I didn't notice this till just now, but per the docs this should only respond to a POST. Can we fix that? Since we are probably going to have more POST methods in the future (cc @jmarantz) it might be worth it to build this into the registration system somehow, but up to you.

Can I just write a separate PR that adds POST checking whenever UrlHandler.mutates_server_state_ is true, which is the case here? That would go into AdminImpl::runCallback inside if (path_and_query.compare(0, query_index, handler.prefix_) == 0) {

The advantage of doing that there is that it solves this issue, as well as some aspects of #2763. The disadvantage is that it would require changes to existing scripts that use curl GET for mutations, when they should be POST. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point I think it's probably OK. Which are the endpoints that would be switched to POST?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the ones whose last initializer-arg is true:

/cpuprofiler
/healthcheck/fail
/healthcheck/ok
/logging
/quitquitquit
/reset_counters
/runtime_modify

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/healthcheck/fail
/healthcheck/ok

Is going to cause issues for Lyft, and potentially others. I'm wondering if we should leave those as GET with a deprecation cycle? (Like support both GET and POST and get people to migrate).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. Currently the http method is ignored so GET and POST will both work. Do you want also a warning logged when POST is issued?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, warning SGTM.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mandarjog
Copy link
Contributor

Are all these settings going to be subsumed by xds at some point or runtime will continue to exist?

@mattklein123
Copy link
Member

Are all these settings going to be subsumed by xds at some point or runtime will continue to exist?

Runtime will continue to exist. It fills a parallel function as a kill switch/operational system. However, as a best practice all settings should be configurable via normal config, and not just runtime.

@jsedgwick
Copy link
Author

@mattklein123 PTAL, updated to expose override "layers;" the /runtime endpoint now prints a table that includes all overrides, including disk override and admin overrides

Left /runtime_reset as a TODO but it would be trivial on top of this.

Please see a couple of TODOs addressed to reviewers in the code, especially the ones around structured output for /runtime in light of the whole layers thing

@@ -243,9 +243,9 @@ TEST_P(AdminInstanceTest, Runtime) {
std::unordered_map<std::string, Runtime::Snapshot::Entry> entries2{
{"string_key", {"override", {}}}, {"extra_key", {"bar", {}}}};

ON_CALL(*layer1, name()).WillByDefault(testing::ReturnRef("layer1"));
ON_CALL(*layer1, name()).WillByDefault(testing::ReturnRefOfCopy(std::string{"layer1"}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super-fluent in gmock for various reasons, but this looks funny. What's the lifetime of expression std::string{"layer1"} such that it makes sense to return a ref to it?

I'd feel better if you made std::string{"layer1"} a method-scoped const.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to return a reference because the name() method returns const std::string&, but it's also nice to just pass the temporary string there for brevity. This is exactly what ReturnRefOfCopy is for; as opposed to ReturnRef, to which your suspicion applies, it will copy and hold on to the temporary, so you don't end up with a dangling reference. See https://github.com/google/googlemock/blob/master/googlemock/docs/CheatSheet.md#returning-a-value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, very cool. Thanks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to add, full disclosure I just learned about ReturnRefOfCopy today as well! :)

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks great. Some comments to get started. Will need to take another pass. @jmarantz do you mind also reviewing once the next round of updates are done?

@@ -116,10 +135,20 @@ class Snapshot {
virtual uint64_t getInteger(const std::string& key, uint64_t default_value) const PURE;

/**
* TODO: Reviewers, shall I delete getAll()? It's no longer used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, nuke if not used.

* to top; for instance, the second layer's entries override the first layer's entries, and so on.
* Any layer can add a key in addition to overriding keys in layers below. The layer vector is
* safe only for the lifetime of the Snapshot.
* @return const std::unordered_map<std::string, const Entry>& the raw map of loaded values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: looks like returning vector, not map

@@ -55,41 +56,72 @@ struct RuntimeStats {
};

/**
* Implementation of Snapshot that reads from disk.
* Implementation of Snapshot whose source the vector of layers passed to the constructor.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "whose source is ..."


/**
* Extension of OverrideLayerImpl that maintains an in-memory set of values. These values can be
* modified programmatically via mergeValues(). AdminLayer is so named because it can be access and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "can be accessed"

// Load a new Snapshot into TLS
void loadNewSnapshot();

std::shared_ptr<AdminLayer> adminLayer_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: admin_layer_

@@ -44,7 +44,8 @@ class RandomGeneratorImpl : public RandomGenerator {
COUNTER(override_dir_not_exists) \
COUNTER(override_dir_exists) \
COUNTER(load_success) \
GAUGE (num_keys)
GAUGE (num_keys) \
GAUGE (admin_overrides_active)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add tests for proper set/unset of this gauge.

virtual const std::string& name() const PURE;
};

typedef std::shared_ptr<OverrideLayer> OverrideLayerSharedPtr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can this be const OverrideLayer and OverrideLayerConstSharedPtr

LoaderImpl::LoaderImpl(DoNotLoadSnapshot /* unused */, RandomGenerator& generator,
Stats::Store& store, ThreadLocal::SlotAllocator& tls)
: generator_(generator), stats_(generateStats(store)), tls_(tls.allocateSlot()) {
adminLayer_ = std::make_shared<AdminLayer>(stats_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move to initializer list

stats_.override_dir_not_exists_.inc();
}
} catch (EnvoyException& e) {
stats_.load_error_.inc();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the behavior before when we had a load error. Did we have any runtime values at all? Just wondering if this is a change in behavior or not. It seems reasonable to have admin layer, or even root, if some layer doesn't load, but just want to be clear of what behavior was before and now.

Copy link
Author

@jsedgwick jsedgwick Mar 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR, no values at all. And yes, the admin layer will be put in place even if one or both of the disk layers doesn't load. Now that I think about it, should it be all or nothing for the two disk layers? What behavior do you think is best?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be simplest to just do all or nothing for all layers. If not that, I would definitely do all or nothing for the disk layers IMO. I'm fine either way, but I think I would avoid the incremental behavior we have now.

SnapshotImpl(const std::string& root_path, const std::string& override_path, RuntimeStats& stats,
RandomGenerator& generator, Api::OsSysCalls& os_sys_calls);
SnapshotImpl(RandomGenerator& generator, RuntimeStats& stats,
const std::vector<OverrideLayerSharedPtr>& layers);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we pass std::vector<OverrideLayerSharedPtr>&& and move. I think all callers are move capable.

James Sedgwick added 8 commits March 29, 2018 18:25
…t fs backing

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@jsedgwick
Copy link
Author

PTAL

Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@mattklein123
Copy link
Member

@jsedgwick reviewing now. In the future please do not rebase once you open a PR, it makes it much harder to review. Once opened, just merge master and add commits. Thanks!

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks great, thanks. Just a few small comments.


void LoaderImpl::loadNewSnapshot() {
ThreadLocal::ThreadLocalObjectSharedPtr ptr = createNewSnapshot();
tls_->set([ptr = std::move(ptr)](Event::Dispatcher&)->ThreadLocal::ThreadLocalObjectSharedPtr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You can just capture ptr directly without move. With C++14 you might even be able to just do ptr = createNewSnapshot()?

Copy link
Author

@jsedgwick jsedgwick Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::move avoids a shared_ptr copy so I did this by instinct. And no, since, we need ptr to be the TLS shared_ptr, not a unique_ptr<Snapshot>, since the TLS system requires this lambda to be copyable. IMO this is the cleanest way to get the casts/copyability right.

*edit: to be clear, you can do this without the first line but it would make the capture clause really long and unwieldy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

/**
* Copy-constructible so that it can snapshotted.
*/
AdminLayer(const AdminLayer& o) : AdminLayer{o.stats_} { values_ = o.values(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/o/admin_layer


private:
RuntimeStats generateStats(Stats::Store& store);
ThreadLocal::SlotPtr tls_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: newline before this line (whitespace between methods and members)

@@ -289,12 +289,13 @@ Runtime::LoaderPtr InstanceUtil::createRuntime(Instance& server,
ENVOY_LOG(info, "runtime override subdirectory: {}", override_subdirectory);

Api::OsSysCallsPtr os_sys_calls(new Api::OsSysCallsImpl);
return Runtime::LoaderPtr{new Runtime::LoaderImpl(
return Runtime::LoaderPtr{new Runtime::DiskBackedLoaderImpl(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: std::make_unique (not your code but might as well fix). Same below.

"layer2"
],
"entries": {
"extra_key": [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make each entry look like:

"extra_key": {
  "final_value": "...",
  "layers": [ ... ]
}

I think it would make it easier to see what is the current value quickly.

} else {
response.add("usage: /runtime?format=json\n");
rc = Http::Code::BadRequest;
rapidjson::Document document;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a TODO here to protoize this? I should be able to get to this soon when I finish up the other config_dump stuff.

James Sedgwick added 2 commits April 2, 2018 12:49
Signed-off-by: James Sedgwick <jsedgwick@lyft.com>
@jsedgwick
Copy link
Author

PTAL

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet, thanks.

@mattklein123 mattklein123 merged commit 45a126d into envoyproxy:master Apr 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants