Add bonus payment for seed node operators when node has 99.9% availability #102
It's a good idea imo.
@freimair has added an uptime screen: http://monitor.bisq.network/d/wTDl6T_iz/p2p-stats?refresh=30s&panelId=16&fullscreen&orgId=1
I played around a bit with the monitor and created this: http://monitor.bisq.network/d/wTDl6T_iz/p2p-stats?refresh=30s&panelId=16&fullscreen&orgId=1 At the moment, 100% only indicates the node that has been online the most.
That's a good start, but for the bonus bounty it must be a real uptime metric, so no one is paid if everyone has a similarly bad uptime 😉
Hm :) Are you sure? No, honestly, Grafana might not be the correct tool to get such a metric, as it compresses data over time. I have set up different ways of getting such a metric here and I am not particularly happy with any of them...
Since we are confronted with a lot of guesswork when a seed node is offline, a more professional monitoring system would benefit the cause as well. I therefore suggest making the setup of a system health reporting system a mandatory requirement for compensation, or something similar. (@ManfredKarrer maybe you want to add that to your proposal?)

I went ahead and got such a system up and running for our monitor (live): this example is collectd collecting information every 30 seconds on the monitor server itself and sending it to the Grafana database. The JVM heap space is gathered from the Bisq monitor process; everything else is gathered from the OS. These metrics can all be gathered without the need to change the Bisq code. It is minimal effort to set up (I guess 1h of your time), and collectd is made for heavy-load environments, so we do not lose a great amount of computing power.

All in all, we would gain the ability to correlate seed nodes going offline with their system health (excessive CPU usage, out of memory, bandwidth limitations, ...) and maybe take a lot of guesswork out of our debugging efforts. I am currently working on getting the monitor ready for such input and piecing together a readme on how to make collectd talk to the monitor. I would then integrate that into the How to seednode guide for everyone to see when the time comes.
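For reference, a minimal collectd sketch of the kind of setup described here, assuming the monitor's time-series backend accepts the Graphite line protocol; the host name and prefix below are placeholders, not the real monitor endpoint:

```
# /etc/collectd/collectd.conf (sketch) - gather basic OS metrics every 30 s
Interval 30

LoadPlugin cpu
LoadPlugin memory
LoadPlugin load
LoadPlugin interface
LoadPlugin write_graphite

# Ship the metrics to the monitoring server. JVM heap data from the Bisq
# monitor process would additionally need the java/GenericJMX plugin.
<Plugin write_graphite>
  <Node "bisq-monitor">
    Host "monitor.example.org"   # placeholder, not the real endpoint
    Port "2003"
    Protocol "tcp"
    Prefix "collectd."
  </Node>
</Plugin>
```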
Can we have some specifics? Here is a suggestion. Would be nice to have input from people who actually run a seed node 😉
Sounds good; it does not prevent the use of cheap hosting providers but also doesn't encourage it. People will just choose the best tool for the job. Upper limit probably around 70/100 BSQ.
Not so sure; not doing a good job during setup will impact availability, which is covered by the monthly bounty, so the setup fee is not so important. I propose not to pay this.
This is 36 hours of outage (if I calculated correctly). Should this get a bounty?
This is 7 hours of outage. Probably ok to give a bounty. In short I'm fine with this; of course, what if your seednode is down due to bugs, Tor errors, ...? Measuring will be the hard part. You can easily measure if you're better than needed, but if you're worse than needed you might need to give some proof in your compensation request that the downtime was due to Bisq and not you or your hosting situation. In the long run my opinion is that seednodes should not be precious porcelain servers that need 99% uptime - that makes it too easy to disrupt the network. Ideally they are just the longest-running Bisq instances: if they are down, clients connect to the next one. The BSQ changes with full nodes/blocknotify didn't bring this dream any closer of course, but IMHO it should be a long-term goal, so we shouldn't get too attached to our bonuses ;)
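As a back-of-envelope check of the figures above, assuming a 30-day month (720 hours): 95% uptime allows 0.05 × 720 = 36 hours of outage, 99% allows about 7.2 hours, and the 99.9% target from the title allows only about 43 minutes.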
We could also go the other way and use something like Chaos Monkey to deliberately make seed nodes fragile, and then fix the software to deal with that scenario. https://github.com/Netflix/chaosmonkey
With seed nodes there's also the issue of a node that is working but not fully synced, slow, returning incorrect data, etc.
I think we can use a fixed number. With the space requirement I think it's 50-100 EUR/month hosting costs on most providers.
I think it is not really the price but those hosters who offer anonymous accounts (payable in BTC). They are usually used by shady clients as well, and then the hosters react more harshly (shutting down the service without notification), get DDoSed, etc. We have basically always had problems in the past when people used such hosters. I would recommend using only high-quality hosters like DigitalOcean, Linode, AWS, ...
I think a good seed should only have downtime while it restarts (as long as that is required), which is about 2 minutes each day -> roughly 1 hour/month. If we count in one update with a rebuild, it's another 10 minutes max. So offline time of 2 hours max is what I think we should aim for, and that is what deserves a bounty. Not sure if we need 2 levels of bounties. All seeds should run as smoothly as possible. If not, it has to be fixed. If the operator cannot fix it, a dev needs to fix it. Of course the problems are often not the operator's fault, but if there is a financial incentive I think they will push the devs faster than in the past. So I recommend keeping it strict: if not < 2 hours offline/month - no bounty. The bounty should for now be quite high, high enough that an operator gets frustrated enough to push the devs if there are code problems, as they risk losing their bounty. Maybe 500-1000 BSQ might be a level where that happens?
Yes of course. There is built-in resilience (you connect to multiple nodes and, if all are offline, to any normal node), but the BSQ block delivery is still not at the resilience level it should be, and that is currently the main issue. Due to a lack of dev resources we cannot count on this being improved soon, so keeping the servers stable (as they were in April and May) seems the easier way to guarantee that people do not experience problems with BSQ. And don't forget, the biggest bonus/malus is implicit for all contributors: if BSQ is not working well, people cannot use it for fee payment and will not buy it from contributors, which drives down the BSQ price. So there is a bonus/malus for every BSQ stakeholder, and it should motivate all of us to get that infrastructure as stable as possible.
Yes, basically the seeds are not considered servers but nodes which might be offline as well. So the Bisq node connects to multiple nodes anyway and if one fails it picks the next. But with the BSQ block request it is a bit more complicated, because once you receive the blocks you process them; if you had multiple requests in flight, the processing must not be interrupted... There is some timeout so that when you don't get a response you request from another node, but it seems that timeout is too long or somehow not working well... All in all, that is definitely a code area to be improved, but it's complex and tricky and there might be more important tasks at the moment (protection tools, new trade protocol, ...). But maybe you want to look into that if you feel attracted... RequestBlocksHandler is the main class for requesting BSQ blocks.
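This is not the actual Bisq code, just a minimal self-contained Java sketch of the pattern described here (request blocks from one seed with a timeout and fall back to the next seed while no response has arrived); the names SeedNode, requestWithFailover and BlockRequestSketch are made up for illustration:

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockRequestSketch {

    // Illustrative stand-in for a seed node connection; not a real Bisq type.
    interface SeedNode {
        List<String> requestBlocksFrom(int startHeight) throws Exception; // blocking network call
    }

    static List<String> requestWithFailover(List<SeedNode> seeds,
                                            int startHeight,
                                            long timeoutSeconds) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            for (SeedNode seed : seeds) {
                Future<List<String>> future =
                        executor.submit(() -> seed.requestBlocksFrom(startHeight));
                try {
                    // Once a response arrives, block processing runs to completion;
                    // we only fail over while still waiting for a response.
                    return future.get(timeoutSeconds, TimeUnit.SECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true); // give up on this seed and try the next one
                }
            }
            throw new IllegalStateException("No seed node delivered the requested blocks");
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The real handler obviously has to deal with much more (Tor transport, batching, DAO state processing), which is part of why this area is tricky to get right.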
@mrosseel @ManfredKarrer I agree on the downtime with both of you. However, and that is the big however, I do not see a way to create that good of a measurement, for the following reasons:
That is why I suggested smaller bonuses for even 95% uptime - just to give the whole thing a gentle push in the right direction, without the need for the metric to be that accurate. Maybe we need another metric? For example, the time between an issue being reported and it being fixed? Or simply the response time of the operator? What if the operator cannot fix the issue because something is wrong with the Bisq software - in such a case, the operator does not receive a bonus because of Bisq and not because of him being unresponsive. That all only becomes a source of conflict when 500-1000 BSQ is at stake. That is not an easy challenge to solve. @ripcurlx I suggest this very topic for a future dev call.
@freimair did a great job with the bisq-monitor node implementation, as well as the server metrics and DAO state monitoring system using Grafana. The beautiful graphs are an excellent tool for visualizing past monitoring data.

For active monitoring with real-time alerts for service uptime, I mentioned on a dev call a while back that we probably need a new and separate monitoring system that uses basic service checks for availability and response time, powered by Icinga 2, which has algorithms to filter out false positives and is better suited for this goal. This new Icinga 2 powered monitoring system is now live for btcnodes, reporting issues to #alerts on Slack, as btcnodes were the lower-hanging fruit with a much simpler service check to implement. As this seems to be working well so far, I've started working on expanding this monitoring to seednodes by gathering data from the bisq-monitor node. Please join the Slack channel #alerts and check it out.

After enough data is collected, it will be possible to analyze the Icinga 2 data to create a monthly report listing the actual availability and response time for each btcnode or seednode, for example: "node X was up for 98.54% with average response time 142ms during DAO cycle 5". Therefore I think we should expand this proposal to include both btcnodes and seednodes. We can now also define acceptable service criteria, something like: < 99% uptime: de-listed as a btcnode or seednode. This should ensure all nodes have 99%+ uptime, and incentivize nodes to maintain high-availability uptimes. So far it has already identified nodes that were unreliable, and these have been removed from the hard-coded list in Bisq.
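As a rough illustration only (not the actual configuration of the live system), a basic Icinga 2 service check for a hypothetical clearnet btcnode could look roughly like this; all names, addresses and intervals are placeholders, and onion-only nodes would need a Tor-aware check command instead of the plain TCP check:

```
object Host "btcnode-example" {
  address = "btcnode.example.org"   // placeholder address
  check_command = "hostalive"
}

object Service "bitcoin-p2p" {
  host_name = "btcnode-example"
  check_command = "tcp"             // built-in wrapper around check_tcp
  vars.tcp_port = 8333
  check_interval = 1m
  retry_interval = 30s
  max_check_attempts = 3            // re-check before alerting, to filter one-off false positives
}
```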
@ManfredKarrer Should I close this proposal as stalled?
Yeah, I suppose we resolved a lot of the underlying issues, so seednode uptime isn't an issue anymore these days. Probably fine to close this issue IMO.
We recently had many problems with seed nodes, and that could even cause a failed DAO voting round, besides problems with disputes and trades. We need to ensure that the nodes are running 99.9% stable. It is understandable that operators are not always able to deal with the problems, as they can be complex and can require some dev background, but in the end analysing the log file was enough to find the problem. It is clear that the relatively small BSQ payment only covers the server costs and basic maintenance like updates or restarts if needed, but not further investigation work.
So to fix the incentive structure for operators I propose 2 changes to the compensation model for node operators: