
Add bonus payment for seed node operators when node has 99.9% availability #102

Closed
ManfredKarrer opened this issue Jul 12, 2019 · 15 comments


@ManfredKarrer
Contributor

This is a Bisq Network proposal. Please familiarize yourself with the submission and review process.

We recently had many problems with seed nodes, which could even have caused a failed DAO voting round, besides problems with disputes and trades. We need to ensure that the nodes run with 99.9% stability. It is understandable that operators are not always able to deal with the problems, as they can be complex and can require some dev background, but in the end analysing the log file was enough to find the problem. It is clear that the relatively small BSQ payment only covers the server costs and basic maintenance like updates or restarts if needed, but not further investigation work.

So to fix the incentive structure for operators I propose two changes to the compensation model for node operators:

  1. If extra time is spent on investigating problems, just add that to the compensation request in addition to the normal server costs.
  2. If a node runs very stably (metrics need to be defined, but something like >99% uptime and no DAO state conflicts), the operator should get an extra bonus. I suggest a bonus of 400 BSQ/month, but what a good value is here is up for discussion.
@ghost

ghost commented Jul 12, 2019

It's a good idea imo.
But also, enlarging the base compensation would be good. For my part, I'll vote for doubling the reward for running a seednode (and other special nodes).
Running special nodes is a useful task, and not so easy; it should be paid accordingly (imo).


@freimair

freimair commented Jul 14, 2019

I played around a bit with the monitor and created that: http://monitor.bisq.network/d/wTDl6T_iz/p2p-stats?refresh=30s&panelId=16&fullscreen&orgId=1

100% at the moment only indicates the node that has been online the most.

@ripcurlx

I played around a bit with the monitor and created that: http://monitor.bisq.network/d/wTDl6T_iz/p2p-stats?refresh=30s&panelId=16&fullscreen&orgId=1

[Screenshot from 2019-07-14 20-03-39]

100% at the moment only indicates the node that has been online the most.

That's a good start, but for the bonus bounty it must be a real uptime metric, so no one is paid if everyone has a similarly bad uptime 😉

@freimair

hm :). you sure?

no, honestly, Grafana might not be the correct tool to get such a metric as it compresses data over time. I have set up different ways of getting such a metric here and I am not particularly happy with them...

@freimair

Since we are confronted with a lot of guesswork when a seed node is offline, a more professional monitoring system would benefit the cause as well.

I therefore suggest making the setup of a system health reporting system a mandatory requirement for compensation, or something along those lines. (@ManfredKarrer maybe you want to add that to your proposal?)

I went ahead and got such a system up and running for our monitor (live):
[Screenshot from 2019-07-19 11-23-36]

This example is collectd collecting information every 30 seconds on the monitor server itself and sending it to the Grafana database. The JVM heap space is gathered from the Bisq monitor process, everything else is gathered from the OS. These metrics can all be gathered without the need to change the Bisq code. It is minimal effort to set up (I guess 1h of your time), and collectd is made for heavy-load environments, so we do not lose a great amount of computing power. All in all, we would gain the ability to correlate seed nodes going offline with their system health (excessive CPU usage, out of memory, bandwidth limitations, ...) and maybe take a lot of guesswork out of our debugging efforts.
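
To make the idea concrete, here is a minimal Java sketch of how a JVM process could report its own heap usage in Graphite's plaintext line format, which a Graphite-compatible backend behind Grafana can ingest. This is not the actual collectd setup described above; the host, port and metric name are assumptions for illustration only.

```java
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class HeapMetricReporter {
    public static void main(String[] args) throws Exception {
        // Host, port and metric path are assumptions for this sketch, not the real monitor setup.
        String host = "monitor.example.org";
        int port = 2003; // default Graphite plaintext listener port

        while (true) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            long now = System.currentTimeMillis() / 1000L;
            // Graphite plaintext format: <metric.path> <value> <unix-timestamp>\n
            String line = String.format("bisq.monitor.jvm.heap.used %d %d%n", heap.getUsed(), now);

            try (Socket socket = new Socket(host, port);
                 Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
                out.write(line);
                out.flush();
            }

            Thread.sleep(30_000); // same 30-second interval as the collectd setup above
        }
    }
}
```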

I am currently working on getting the monitor ready for such input and piecing together a readme on how to make collectd talk to the monitor. I would then integrate that into the How to seednode guide for everyone to see when the time comes.

@freimair

Can we have some specifics? Here is a suggestion. Would be nice to have input from people who actually run a seed node 😉

  • base compensation: operating costs + 50 BSQ?
    • might prevent the use of cheap hosting providers
    • but might need an upper limit though. Maybe insert common operating costs of DigitalOcean, Linode, ...?
  • setup 400 BSQ (so people do a decent job)
  • bonus on 95% availability: 150 BSQ (to be discussed on how we evaluate that)
  • bonus on 99% availability: 350 BSQ (to be discussed on how we evaluate that)

@mrosseel

* base compensation: operating costs + 50 BSQ?
  * might prevent the use of cheap hosting providers
  * but might need an upper limit though. Maybe insert common operating costs of DigitalOcean, Linode, ...?

sounds good, it does not prevent the use of cheap hosting providers but also doesn't encourage it. People will just choose the best tool for the job. Upper limit probably around 70/100 BSQ

* setup 400 BSQ (so people do a decent job)

Not so sure: not doing a good job during setup will impact availability, which is covered by the monthly bounty, so a setup fee is not that important. I propose not to pay this.

* bonus on 95% availability: 150 BSQ (to be discussed on how we evaluate that)

this is 36 hours of outage (if I calculated correctly). Should this get a bounty?

* bonus on 99% availability: 350 BSQ (to be discussed on how we evaluate that)

This is 7 hours of outage. Probably ok to give a bounty.
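
For reference, a quick sketch (in Java, just to make the arithmetic explicit) converting availability percentages into allowed downtime over a 30-day month (720 hours):

```java
public class DowntimeBudget {
    public static void main(String[] args) {
        double hoursPerMonth = 30 * 24; // 720 hours in a 30-day month
        double[] availabilities = {0.95, 0.99, 0.999};
        for (double a : availabilities) {
            double allowedDowntime = (1 - a) * hoursPerMonth;
            System.out.printf("%.1f%% availability -> %.1f hours of allowed downtime per month%n",
                    a * 100, allowedDowntime);
        }
        // 95.0% -> 36.0 h, 99.0% -> 7.2 h, 99.9% -> 0.7 h
    }
}
```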

In short I'm fine with this. Of course, what if your seed node is down due to bugs, Tor errors, ...? Measuring will be the hard part. You can easily measure if you're better than needed, but if you're worse than needed you might need to give some proof in your compensation request that the downtime was due to Bisq and not to you or your hosting situation.

In the long run my opinion is that seednodes should not be precious porcelain servers that need 99% uptime, too easy to disrupt the network. Ideally they are just the longest running bisq instances. If they are down, clients connect to the next one. The BSQ changes with full nodes/blocknotify didn't bring this dream any closer of course, but IMHO it should be a long term goal, so we shouldn't get too attached to our bonuses ;)

@christophsturm

we could also go the other way and use something like chaos monkey to ensure that seed nodes are fragile, and then fix the software to deal with that scenario. https://github.com/Netflix/chaosmonkey

@mrosseel

we could also go the other way and use something like chaos monkey to ensure that seed nodes are fragile, and then fix the software to deal with that scenario. https://github.com/Netflix/chaosmonkey

With seed nodes there's also the issue of working but not fully synced, slow, incorrect, etc.
But that's indeed the spirit. Maybe a discussion for another ticket ;)

@ManfredKarrer
Contributor Author

base compensation: operating costs + 50 BSQ?

I think we can use a fixed number. With the space requirement I think it's 50-100 EUR/month hosting costs on most providers.
On DigitalOcean it's 20 EUR for 4 GB and 40 EUR for 400 GB space, so 60 EUR in total. I think 100 EUR should be a fair payment.

might prevent the use of cheap hosting providers

I think it is not really the price but those hosters who offer anonymous accounts (payable in BTC). They are usually used by shady clients as well, and then the hosters react more harshly (shutting down the service without notification), or they get DDoS'ed, etc. We basically always had problems in the past when people used such hosters. I would recommend using only high-quality hosters like DigitalOcean, Linode, AWS, ...

this is 36 hours of outage

That is way too much....

This is 7 hours of outage.

This is still a lot.

I think a good seed should only have its downtimes while it restarts (as long as that is required), which is about 2 minutes each day -> 1 hour/month. If we count in one update with rebuild, it's another 10 minutes max. So an offline time of 2 hours max is what I think we should aim for, and that is what deserves a bounty. Not sure if we need 2 levels of bounties. All seeds should run as smoothly as possible. If not, it has to be fixed. If the operator cannot fix it, a dev needs to fix it.

Of course the problems are often not the operator's fault, but if there is a financial incentive I think they will push the devs faster than in the past. So I recommend keeping it strict: if not < 2 hours offline/month, no bounty. The bounty should for now be quite high, high enough that an operator gets frustrated enough with the devs to push them if there are code problems, as they risk losing their bounty. Maybe 500-1000 BSQ might be a level where that happens?

In the long run my opinion is that seednodes should not be precious porcelain servers that need 99% uptime, too easy to disrupt the network. Ideally they are just the longest running bisq instances. If they are down, clients connect to the next one.

Yes, of course. There is built-in resilience (you connect to multiple nodes, and if all are offline, to any normal node), but the BSQ block delivery is still not at the resilience level it should be, and that is currently the main issue. Due to a lack of dev resources we cannot count on this getting improved soon, so keeping the servers stable (as they have been in April and May) seems the easier way to guarantee that people do not experience problems with BSQ.

And don't forget that the biggest bonus/malus is implicit for all contributors. If BSQ is not working well, people cannot use it for fee payment and will not buy it from contributors, which drives down the BSQ price. So there is a bonus/malus for any BSQ stakeholder, and that should motivate everyone to get that infrastructure as stable as possible.

to ensure that seed nodes are fragile

Yes, basically the seeds are not considered servers but nodes which might be offline as well. So the Bisq node connects to multiple nodes anyway, and if one fails it picks the next. But with the BSQ block request it is a bit more complicated, as once you receive the blocks you process them; if you had multiple requests in flight, the processing must not be interrupted... There is some timeout after which, if you don't get a response, you request from another node, but it seems that is too long or somehow not working well... All in all, that is definitely code to be improved, but it's complex and tricky and there might be more important tasks atm (protection tools, new trade protocol, ...). But maybe you want to look into it if you feel drawn to it... RequestBlocksHandler is the main class for requesting BSQ blocks.
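
For anyone who does want to pick this up, the failure mode above boils down to a timeout-plus-failover pattern: request blocks from one seed and, if no response arrives in time, fall back to the next peer. The sketch below is only a generic illustration of that pattern; SeedNode, Block and requestBlocksFrom() are made-up stand-ins and not the actual RequestBlocksHandler API.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: SeedNode, Block and requestBlocksFrom() are hypothetical stand-ins,
// not Bisq's actual RequestBlocksHandler API.
public class BlockRequester {
    interface Block {}
    interface SeedNode { List<Block> requestBlocksFrom(int startHeight); }

    private static final long TIMEOUT_SECONDS = 30;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    Optional<List<Block>> requestBlocks(List<SeedNode> seeds, int startHeight) {
        for (SeedNode seed : seeds) {
            Future<List<Block>> future = executor.submit(() -> seed.requestBlocksFrom(startHeight));
            try {
                // Wait a bounded time for this seed; on timeout or failure move on to the next one.
                return Optional.of(future.get(TIMEOUT_SECONDS, TimeUnit.SECONDS));
            } catch (TimeoutException | ExecutionException e) {
                future.cancel(true); // give up on this seed and try the next
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
        return Optional.empty(); // no seed responded in time
    }
}
```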

@freimair

@mrosseel @ManfredKarrer I agree with both of you on the downtime. However, and that is the big however, I do not see a way to create that good of a measurement, for the following reasons:

With seed nodes there's also the issue of working but not fully synced, slow, incorrect, etc.

  • furthermore, even if we find a solution that provides a resolution of <2h over 31 days, it would be quite an effort to keep it up, running and reliable
  • such a system would most probably stress our seed nodes even more, only to get an accurate enough uptime metric
  • 500-1000 BSQ is quite an amount. I sense a strong source of conflict as long as the metric deciding whether a seed node operator receives the bonus or not is not 100% bullet-proof and transparent

That is why I suggested smaller bonuses for even 95% uptime. Just to give the whole thing a gentle push in the right direction, and without the need for the metric to be that accurate.

Maybe we need another metric? For example, the time from an issue being reported until it is fixed? Or simply the response time of the operator? What if the operator cannot fix the issue because something is wrong with the Bisq software - in such a case, the operator does not receive a bonus because of Bisq, not because he was unresponsive. All of that only becomes a source of conflict when 500-1000 BSQ is at stake.

That is not an easy challenge to solve.

@ripcurlx I suggest this very topic for a future dev-call.

@wiz

wiz commented Sep 16, 2019

@freimair did a great job with the bisq-monitor node implementation, as well as the server metrics and DAO state monitoring system using grafana. The beautiful graphs are an excellent tool for visualizing past monitoring data.

For active monitoring with real-time alerts for service uptime, I mentioned on a dev call a while back that we probably need a new and separate monitoring system that utilizes basic service checks to check availability and response time, powered by icinga2, which has algorithms to filter out false positives and is better suited for this goal.

This new Icinga 2 powered monitoring system is now live for btcnodes, reporting issues to #alerts on slack, as btcnodes were the lower hanging fruit with a much simpler service check to implement. As this seems to be working well so far, I've started working to expand this monitoring to seednodes, by gathering data from the bisq-monitor node. Please join the slack channel #alerts and check it out.

After enough data is collected, it will be possible to analyze the Icinga2 data to create a monthly report that will list the actual availability and response time for each btcnode or seednode. For example, "node X was up for 98.54% with average response time 142ms during DAO cycle 5"

Therefore I think we should expand this proposal to include both btcnodes and seednodes. We can now also define acceptable service criteria, something like:

< 99% uptime: de-listed as a btcnode or seednode

99.9% uptime: high-availability bonus of 2x compensation

This should ensure all nodes have 99%+ uptime, and incentivize nodes to achieve HA uptimes appropriately. So far it has already identified nodes that were unreliable, and these have been removed from the hard-coded list in Bisq.
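
As a sketch of how such a monthly report and the bonus/de-listing decision could be derived from the collected check results: the CheckResult shape below is an assumption for illustration, not Icinga2's actual data model, while the thresholds mirror the criteria above.

```java
import java.util.List;

public class AvailabilityReport {
    // Hypothetical shape of a single service-check result; not Icinga2's real API.
    record CheckResult(boolean up, long responseTimeMs) {}

    static String report(String node, List<CheckResult> results) {
        long upCount = results.stream().filter(CheckResult::up).count();
        double availability = 100.0 * upCount / results.size();
        double avgResponse = results.stream()
                .filter(CheckResult::up)
                .mapToLong(CheckResult::responseTimeMs)
                .average()
                .orElse(Double.NaN);

        // Thresholds taken from the criteria proposed in this comment.
        String verdict;
        if (availability < 99.0) {
            verdict = "below 99% -> candidate for de-listing";
        } else if (availability >= 99.9) {
            verdict = "99.9%+ -> high-availability bonus (2x compensation)";
        } else {
            verdict = "standard compensation";
        }
        return String.format("%s was up %.2f%% with average response time %.0fms (%s)",
                node, availability, avgResponse, verdict);
    }
}
```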

@mpolavieja

@ManfredKarrer Should I close this proposal as stalled?

@wiz

wiz commented Dec 12, 2019

Yeah, I suppose we resolved a lot of the underlying issues so seednode uptime isn't an issue anymore these days. Probably fine to close this issue IMO
