Added support for using Consul 0.6.0 as a Membership Provider #1267
Conversation
Hi @PaulNorth, I'm your friendly neighborhood .NET Foundation Pull Request Bot (you can call me DNFBOT). Thanks for your contribution! TTYL, DNFBOT.
This is the code discussed in #1226. In that discussion @gabikliot raised some points, which I will address here. For the record, I am not affiliated with Consul or its creator HashiCorp; the following is simply a dump of my experience of using Consul.

Consul is a distributed, highly available, datacenter-aware service discovery solution (https://consul.io/). Consumers register their services under a string key, which clients can then discover in a number of ways, including via HTTP by appending the key to Consul's root URI. This PR uses the DeploymentId as the Consul key.

Consul works by adding nodes to a cluster that uses a gossip protocol to maintain awareness of node availability and distribute the service directory around the cluster. To do this it uses two other projects: https://github.com/hashicorp/raft and https://github.com/hashicorp/serf. Raft ensures there is consensus within the cluster when nodes are added or services are registered, and Serf optimises the cluster connections and inter-communication (gossip). Although it is feasible that Orleans could incorporate these technologies directly, that would require a separate and more complex development effort. We, and I suspect most others, have non-Orleans requirements for service discovery, so it made sense to use Consul directly rather than reimplement Raft and Serf in Orleans.

To support geo-distribution, Consul simply includes the concept of each node being in a "datacenter" (https://consul.io/docs/guides/datacenters.html). By having each node take a -dc "" parameter at startup, nodes in the cluster can be grouped into the same datacenter, and this information is used both to optimise the cluster and to route clients to the closest resource. Registered services are assumed to be co-located with the node in the cluster they registered with, so a client connecting to an Orleans silo would include the DeploymentId (and in theory a datacenter) in the client SystemStore config element. The list of silos Consul returns is then ordered by their proximity to the client. Please note, datacenter support is not implemented in this PR, as it would break the abstraction that the SystemStore places over the existing membership options (and might conflict with Orleans' own geo-distribution effort?).

Example Consul service registration for a 2-silo Orleans cluster (running on the same machine, for dev only :) where each silo has been stopped and restarted once.

Let me know if this answers the questions you had.
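To make the discovery flow concrete, here is a minimal sketch of mine (not code from this PR) that queries Consul's HTTP catalog API for the silos registered under a DeploymentId; the agent address and deployment id are placeholders matching the example config at the end of this thread.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ConsulDiscoveryExample
{
    static async Task Main()
    {
        // Local Consul agent and deployment id are placeholders; the
        // deployment id doubles as the Consul service name in this PR.
        var consulRoot = new Uri("http://localhost:8500/");
        var deploymentId = "TestDeployment";

        using (var http = new HttpClient { BaseAddress = consulRoot })
        {
            // GET /v1/catalog/service/<name> returns a JSON array of the
            // nodes (address, port, ...) registered under that service name.
            var json = await http.GetStringAsync($"v1/catalog/service/{deploymentId}");
            Console.WriteLine(json);
        }
    }
}
```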
Thank you very much @PaulNorth!
@PaulNorth, you can write a page about using Orleans with Consul and submit it as a PR to https://github.com/dotnet/orleans/tree/gh-pages. A good place would be http://dotnet.github.io/orleans/Advanced-Concepts/.
AFAIK, we haven't considered Consul in Gigya; @rore can probably provide a definite answer. I personally wasn't aware of its existence until now.
I might be (probably am) wrong here, so add
@jthelin Thanks, I have now read that page and will make the required change.
Thanks for the excellent feedback.
Just a small comment about Consul: we looked at it as a possibility for service discovery, though we're not sure yet whether we will actually use it. But I believe it will be a great addition as a membership provider alongside ZooKeeper, as it is gaining adoption and becoming a common tool.
@PaulNorth, Thanks for signing the contribution license agreement so quickly! Actual humans will now validate the agreement and then evaluate the PR.
@PaulNorth, just my 2 cents. Because
THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS
OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
We recently removed all license verbiage from source files and left it only in https://github.com/dotnet/orleans/blob/master/LICENSE. Please remove it from the sources to reduce noise.
It seems to me it should be trivial to make Consul async, by adding an async method (
I think I can craft a PR for https://github.com/PlayFab/consuldotnet if somebody could help test it.
@sergeybykov You may want to hold fire for a while: they have a PR open that switches to using HttpClient (PlayFab/consuldotnet#31), and after I asked the question (PlayFab/consuldotnet#33) it looks like the author is merging it in; he also mentioned preferring to create a separate AsyncClient. It's up to you, but as @alfeg and I have commented, it may be cleaner to just use HttpClient async calls directly, as I am only using a small subset of Consul commands.
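For what it's worth, here is a minimal sketch of the "just use HttpClient directly" idea for the KV subset (my own illustration, not the PR's code). It assumes Newtonsoft.Json, which is already part of the dependency discussion in this thread, and that `http.BaseAddress` points at the Consul agent.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

static class ConsulKvExample
{
    // Reads a single KV entry without any Consul client library.
    // GET /v1/kv/<key> returns a JSON array; the Value field is base64-encoded.
    public static async Task<string> ReadKeyAsync(HttpClient http, string key)
    {
        var json = await http.GetStringAsync($"v1/kv/{key}");
        var entry = JArray.Parse(json)[0];
        var raw = Convert.FromBase64String((string)entry["Value"]);
        return Encoding.UTF8.GetString(raw);
    }
}
```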
I'm almost done with the PR. I might submit it anyway. :-)
OK, let's discuss the options for implementing membership table integration. Here are all the options I see:
If I had to choose out of all those options, I would start with 2. It is the easiest to implement and the safest, since it has already been done at least three times (Azure Table, ZK, SQL).
Thanks for your thoughts on this, @gabikliot; it's very useful information, as we worked on this yesterday and have been doing a PoC on a solution best described as a variation on options 2+4.

Correct me if I'm wrong, but my interpretation is that the only silo information subject to contention is: TableVersion, TableVersionEtag, SiloStatus, SiloEtag, SuspectingSilos. Building on that, my assumption is that no silo will ever try to update a different silo's other details; in particular its IAmAliveTime.

Our solution is based on this assumption and registers the basic silo information as a service registration. The assumption is that only the silo itself will ever need to update this registration when it calls UpdateIAmAlive(), and it is further assumed that the silo will not make multiple competing calls to UpdateIAmAlive(). All the data subject to contention (as above) is stored in a KV entry, and writes are guaranteed consistent by using Consul's CAS. This partitioning has the benefits of minimising the length of the KV value (which is limited to 512KB), using Consul's service registration, which is a natural fit for the silo endpoint information, and reducing contention on the KV entry caused by silos updating their IAmAliveTime.

On read, this solution requires one call to get the service registrations (which returns all known silos) and a second call to get the cluster "liveness" KV; the ConsulMembershipProvider then rehydrates this data into the format that Orleans expects.

What are your thoughts on this hybrid approach? We do have some issues at present:
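Separately from the open issues above, here is an illustration of the CAS-guarded KV write described in this comment (a sketch of mine, not the PR's implementation). Consul exposes check-and-set as a `?cas=` query parameter that compares against the ModifyIndex observed on read; the key name is a placeholder.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

static class ConsulCasExample
{
    // Check-and-set write: the PUT only succeeds if the key's ModifyIndex
    // still equals the index observed when the value was read, so competing
    // writers cannot silently overwrite each other. Consul replies with a
    // bare "true" or "false" body. Assumes http.BaseAddress is the agent URI.
    public static async Task<bool> TryCasWriteAsync(
        HttpClient http, string key, string value, ulong observedModifyIndex)
    {
        var content = new StringContent(value, Encoding.UTF8);
        var response = await http.PutAsync($"v1/kv/{key}?cas={observedModifyIndex}", content);
        var body = await response.Content.ReadAsStringAsync();
        return bool.Parse(body.Trim());
    }
}
```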
I have updated the PR; it addresses every issue raised and implements the approach I proposed today.

Known Limitations / Issues (incorporates & supersedes my previous list of issues :) ):
About item 2: there's an issue open for that (PlayFab/consuldotnet#28) but no real progress has been made on it. The problem is that the exact version doesn't actually matter (it uses an extremely simple subset of Newtonsoft.Json), but there's no good way that I know of to tell .NET "use any Newtonsoft.Json.dll over version 4.0.0.0; whatever you've got is fine".
Hi @PaulNorth. Great work, but let me suggest a slightly different approach. I don't like the single KV entry for ALL silo data (even if it is only the dynamic part that is subject to contention). It limits the size of the deployment and creates the need to shorten/clean it up. Let me suggest an alternative: use one KV entry per silo and just don't implement table version at all. As you saw in the protocol description, table version is only used in the extended protocol. The basic protocol will work without it, just like that. What you will lose is really not much:

So my suggestion is to completely ignore table version. Upon update/insert, just don't write it anywhere at all; only write the silo KV row. Upon read, always return table version 0 with the same etag. That way, in the future, if we do decide to use it somewhere else, it will be very clear that Consul membership does not support table version, and we will be able to look for alternative solutions (use a semaphore, use your approach, ...). I would not be worried about that.

Now, if you agree with the above, you should not be concerned about KV size. That means you can further simplify your protocol and only use KV, and not register the service. I actually prefer you do it this way, since by utilizing both you have created too tight a coupling between those two systems (KV and registration), and this may be fragile. For example, what will happen if service registration thinks a silo is down, but Orleans membership does not think so? You may not be able to read all silos' info from service registration. Also, from the standpoint of taking strong dependencies, it is better to depend on one system rather than two. It will also simplify debugging/testing/troubleshooting if something does not work. If you still want silos to be registered as a service, do it like in option 4, but don't make Orleans membership dependent on two different systems. That is at least my intuition.
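A rough sketch of what the suggested "one KV entry per silo, table version always 0" read could look like. The tuple shape and the placeholder parsing are hypothetical stand-ins of mine, not the actual Orleans IMembershipTable signatures.

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical stand-ins, not the actual Orleans IMembershipTable API.
static class PerSiloKvReadSketch
{
    public static async Task<(List<string> Silos, int TableVersion, string VersionEtag)>
        ReadAllAsync(HttpClient http, string deploymentId)
    {
        // GET /v1/kv/<prefix>?recurse lists every key under the prefix,
        // i.e. one KV entry per silo; JSON parsing is elided for brevity.
        var json = await http.GetStringAsync($"v1/kv/{deploymentId}/?recurse");
        var silos = new List<string> { json }; // placeholder for real parsing

        // Table version is not stored anywhere: always report version 0 with
        // a constant etag, making "table version unsupported" explicit.
        return (silos, 0, "0");
    }
}
```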
I think you can update. I will look at the configuration. I think we should support different system stores for liveness and reminders.
I agree with everything @gabikliot said. Just wanted to add something about:

A safe table cleanup can be implemented as follows:
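Purely as an illustration of one possible scheme (my own assumption, not necessarily the procedure intended above): periodically sweep entries whose IAmAliveTime is far in the past, so only long-dead silos are ever removed. The SiloRecord type and delete callback are hypothetical stand-ins.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: an expiry-based sweep over membership entries.
static class TableCleanupSketch
{
    public sealed class SiloRecord
    {
        public string Key;
        public DateTime IAmAliveTimeUtc;
    }

    public static void Sweep(IEnumerable<SiloRecord> records, Action<string> delete,
        TimeSpan maxAge)
    {
        var cutoff = DateTime.UtcNow - maxAge;
        foreach (var record in records)
        {
            // A generous threshold (hours, not minutes) keeps the sweep safe:
            // a live silo refreshes IAmAliveTime far more often than this.
            if (record.IAmAliveTimeUtc < cutoff)
                delete(record.Key);
        }
    }
}
```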
@gabikliot I will give this approach a go today. Is there any policy on upgrading Newtonsoft.Json to the same version as Consul (7.0.1) or to the latest current version (8.0.2)? |
That's probably a question for @jthelin.
I see no reason for not upgrading to 8.0.2.
Let's look at the data:
Also, a "patch" release within 10 days of v8 availability is not an encouraging sign (too much "Christmas Cheer" maybe?) and suggests to me that v8 is maybe not completely "stable" just yet? I would favor staying at the tried and trusted 7.0.1 stability point for now.
```csharp
}
catch (Exception ex)
{
    _logger.Verbose("ConsulMembershipProvider failed to insert registration for silo {0}; {1}.", entry.SiloAddress, ex);
```
I recommend logging this and all exception at least on Info level, maybe even warning here.
I have gone with Info level for exceptions, for two reasons: 1. The exception is rethrown anyway, and Orleans' MembershipOracle looks to be already logging these exceptions as errors. 2. Logger.Warn() requires an ErrorCode, and every example I can find uses the Orleans.ErrorCode enum, which I assume I shouldn't modify for the benefit of an extension, or arbitrarily overload an existing error code.
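For reference, a sketch of the resulting pattern, mirroring the fragment quoted above; it assumes the logger exposes an Info overload analogous to the Verbose one used there, and the DoInsertAsync call is a hypothetical stand-in for the real insert.

```csharp
try
{
    await DoInsertAsync(entry); // hypothetical stand-in for the real insert call
}
catch (Exception ex)
{
    // Info rather than Warn: the Warn overloads want an ErrorCode, and the
    // caller (MembershipOracle) already logs the rethrown exception as an error.
    _logger.Info("ConsulMembershipProvider failed to insert registration for silo {0}; {1}.",
        entry.SiloAddress, ex);
    throw;
}
```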
Insert and Update look good. We are very close now!
Notes:
👍
Looks good!
@gabikliot We are just starting to work with GitHub here and have never had to rebase and squash changes. So, following guidance, I pulled and merged the latest from orleans/master into PaulNorth/master and used "Rebase... Interactively" in SourceTree on my branch, choosing to squash the commits during the rebase. This hasn't created the commit history I was expecting to see, so please advise further if it is not what you wanted.
I am not a Git expert either; there might be other ways, but the way I do it is the following:
Cheers for the advice, the commit should be what you are after now.
Great! It would be very helpful if you could write a documentation page. You can put it at: https://github.com/dotnet/orleans/tree/gh-pages/Runtime-Implementation-Details/Consul
Big thanks to @PaulNorth and everyone who helped shape this into another important option for running Orleans anywhere you need it, the way you need it! It shows the true power of OSS collaboration and of leveraging other people's work.
Glad I could help. I will keep an eye on Consul development and update the provider when a stable release supporting atomic operations is available. |
@PaulNorth we have multi-key transactions slated for the upcoming 0.7 major release!
At some point a package update should be done for the Consul/Orleans integration, because I've fixed quite a few bugs in Consul.NET and cleaned up a ton of things; see the Changelog (everything from 2016-02-07 and newer is not in the Consul/Orleans package). I'll look into making a PR for the update myself, but it may be a lot faster/cleaner if @PaulNorth does a PR when he has time, since the CLA has been signed and all that jazz; it should only involve updating the package. Of course I'll also be updating the .NET API for Consul 0.7 whenever it comes out. Cool to hear that multi-key transactions are going to be a thing, @armon.
@PaulNorth If you'd like to update
@sergeybykov It appears that the Orleans master branch already uses the latest Consul.NET client (0.6.4.1) (thanks @highlyunavailable), and HashiCorp have not yet released version 0.7 of the server. The changes to take advantage of multi-key transactions will have to wait until the next release.
👍
@PaulNorth @sergeybykov Just a heads up that Consul now supports multi-key transactions in master.
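A speculative sketch of what an atomic membership write could look like against the /v1/txn endpoint once 0.7 ships: several KV operations submitted as one all-or-nothing transaction. The payload shape follows the transaction API's KV operations, but the key names and JSON payloads here are placeholders of mine.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

static class ConsulTxnSketch
{
    // Writes a silo entry and a shared version entry in one transaction.
    // Assumes http.BaseAddress points at the Consul agent.
    public static async Task<bool> WriteSiloAndVersionAtomicallyAsync(
        HttpClient http, string deploymentId, string siloKey,
        string siloJson, string versionJson)
    {
        string B64(string s) => Convert.ToBase64String(Encoding.UTF8.GetBytes(s));

        // Each item is {"KV": {"Verb": ..., "Key": ..., "Value": <base64>}}.
        var payload =
            "[{\"KV\":{\"Verb\":\"set\",\"Key\":\"" + deploymentId + "/" + siloKey +
            "\",\"Value\":\"" + B64(siloJson) + "\"}}," +
            "{\"KV\":{\"Verb\":\"set\",\"Key\":\"" + deploymentId + "/version" +
            "\",\"Value\":\"" + B64(versionJson) + "\"}}]";

        var response = await http.PutAsync("v1/txn",
            new StringContent(payload, Encoding.UTF8, "application/json"));
        return response.IsSuccessStatusCode; // 200 = all operations applied
    }
}
```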
Nice!
Config Usage:

```xml
<Globals>
  <SystemStore SystemStoreType="Consul"
               DataConnectionString="http://localhost:8500/"
               DeploymentId="TestDeployment" />
</Globals>
```