Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some consensus node only proposal one miner transaction when there are transactions in the network #817

Closed
ZhangTao1596 opened this issue Jun 12, 2019 · 43 comments
Labels
Bug Used to tag confirmed bugs Critical Issues (bugs) that need to be fixed ASAP

Comments

@ZhangTao1596
Copy link

ZhangTao1596 commented Jun 12, 2019

Recently we find that some consensus nodes only proposal one minter transaction. you can find details on neoscan from 3846424 to 3856930. During this time, there only 2 out of 7 consensus nodes proposal non-miner transactions.
And there is an old problem that sometimes one certain consensus node won't proposal transaction except miner transaction. This issue has been discussed before at #474 . But now seems not solved.
I will do some test first to find the reason.
First, I will make certain whether the transactions arrived in the mempool of the problem cn.
Do you have any idea about this? Is a solution to this currently in progress?
@vncoelho @igormcoelho @jsolman

@vncoelho
Copy link
Member

Nice, my friend, good innitiave.

Their mempool will be probably empty, it is a problem in the tx rebroadcasting.
It should be P2P optimizations with extra filter details problems.

@erikzhang erikzhang added the Discussion Initial issue state - proposed but not yet accepted label Jun 12, 2019
@ZhangTao1596
Copy link
Author

ZhangTao1596 commented Jun 19, 2019

I have tested on testnet.

  1. I use a tool keep sending transaction to seed9.ngd.network:20333 via p2p.
  2. When I find ngd consensus node produce empty block, I get mempool of the node via rpc and most
    of the time get nothing. But sometimes get 1 or 2 transaction. I use show state check the peers. There are more than 10 connected.
  3. Then I use my tool to send tx to the consensus node directly via p2p. The node works again and relay
    transactions too and never produce empty block.
  4. Then I stop sending directly and send transactions to seed9.ngd.network:20333, the issue reproduce.
  5. Then I send transaction to seed1 testnet and seed5 testnet, the issue still exists.
  6. I restart my consensus node. It works again.

All in all, I think it seems we have route problem?
Once a node started, its peers keep unchanging, right?
Do you guys have any idea, why the cn can't get transactions from its peers?

@lock9
Copy link
Contributor

lock9 commented Jun 19, 2019

Good work!
Now we need to know if the problem is with the node itself, or it's peers. It's funny because it should affect non-CN nodes too, right?
If it always produces blocks when you send tx to it, it means they are not "processing" these transactions.
It is weird because we know it is receiving network messages 🤔 Maybe it is discarding these transactions?

Did you manage to log the messages received 'per node'? We now need confirmation that:

  • TXs messages are not arriving; OR
  • TXs are being discarded.

Is there another possible cause? I can only think about these two considering the evidence you provided.

Edit: If you have to reboot the node to make it work, is it evidence that the error is in the node itself, and not with other peers?

@vncoelho
Copy link
Member

vncoelho commented Jun 19, 2019

@KickSeason, it is this. This was one of the hypotheses, now you tested and found a fact. Nice work.
In general, the classes connection, peer, taskmanager and protocolhandle need a better organization and possible some improvements.

The class remotenode has an strange think related to the ack state, in which it is set to false every time data is sent on the TCP, correct, @erikzhang ?

private void SendMessage(Message message)
{
ack = false;
SendData(ByteString.FromBytes(message.ToArray()));
}

After that, it is again set to true OnAck:

protected override void OnAck()
{
ack = true;
CheckMessageQueue();
}

Which is a message usually created from here:

protected void SendData(ByteString data)
{
if (tcp != null)
{
tcp.Tell(Tcp.Write.Create(data, Ack.Instance));
}
else
{
ArraySegment<byte> segment = new ArraySegment<byte>(data.ToArray());
ws.SendAsync(segment, WebSocketMessageType.Binary, true, CancellationToken.None).PipeTo(Self,
success: () => Ack.Instance,
failure: ex => new Tcp.ErrorClosed(ex.Message));
}
}

I believe we need to better arrange the timeouts and this requirement set of false all the time looks strange.

@vncoelho
Copy link
Member

@lock9, aligned with this we have the priorities on messages, which might be one of the reasons in which they still participate in the consensus but "do not have time" to receive the other types of TCP data.

@igormcoelho
Copy link
Contributor

igormcoelho commented Jun 19, 2019

Interesting observations... perhaps message is received but not internally passed on Akka?

@ZhangTao1596
Copy link
Author

I will continue testing and record all p2p messages. If the node received transaction message, we can be sure it's akka problem or node internal problem.
@lock9 NGD have tried to reproduce on privatenet but not easy. I will carry on test in testnet.

@shargon
Copy link
Member

shargon commented Jun 23, 2019

any news @KickSeason ?

@vncoelho
Copy link
Member

@shargon, can you take a look at me last message above about the ack flag?

@ZhangTao1596
Copy link
Author

no progress. I'm testing on testnet. I can only access one consensus node. I have to wait it reproducing this issue.

@ZhangTao1596
Copy link
Author

I built a p2p-plugin and filter tx inv and tx message:
p2ptx.log
my code:

    public bool OnP2PMessage(Message message)  
    {  
        if (message.Command == "inv") {  
            InvPayload inv = message.GetPayload<InvPayload>();  
            if (inv.Type == InventoryType.TX) {  
                var hs = inv.Hashes.Where(h => Blockchain.Singleton.GetTransaction(h) == null && !Blockchain.Singleton.MemPool.ContainsKey(h)).ToList();  
                if (0 < hs.Count())  
                {  
                    string hashes = "[";
                    foreach (var hash in hs)
                    {
                        if (1 < hashes.Length)
                        {
                            hashes += ",";
                        }
                        hashes += hash.ToString();
                    }
                    hashes += "]";
                    log("inv", $"receive transaction inv, type: tx, count: {hs.Count()}, hashs: {hashes}");
                }                    
            }
        }
        if (message.Command == "tx") {
            Transaction tx = message.GetTransaction();
            string txid = tx.Hash.ToString();
            log("tx", $"receive transaction {txid}");
        }
        return true;
    }

I wonder why there are so many hashes that cn didn't get tx from? I searched those hashes. They aren't on blockchain.
Do you guys have any idea?

@shargon
Copy link
Member

shargon commented Jun 24, 2019

Those hashes are on know hashes but no received by inv or tx ?

@ZhangTao1596
Copy link
Author

ZhangTao1596 commented Jun 24, 2019

[inv] With this tag, It is an inv message.
[tx] With this, It is a tx message.
I only log the tx inv and tx message.
If you guys need consensus log, I will add.

If there are many nodes without SimplePolicy, the network will be full of illegal tx. I think this may have a negative effect.

image
Look. Although consensus nodes are producing block, 2 out of them pack more than 10 every time. Others only a few.

@ZhangTao1596
Copy link
Author

Ngd have set up another testnet for repoduce the issue.
I add log into v2.10.2 and deploy on testnet as consensus node to record p2p messages and tasks info. Hope I will help.

@shargon
Copy link
Member

shargon commented Jun 27, 2019

Could NGD test it too with the last patch #865?

@ZhangTao1596
Copy link
Author

We just set up another testnet to try reproducing this issue. But it haven't happened yet.
If we can't reproduce, we can't tell whether it works or not.
We will test this patch a few days later.

@shargon
Copy link
Member

shargon commented Jun 27, 2019

Thanks for all @KickSeason !

@superboyiii
Copy link
Member

@shargon It's already happened on Testnet No.2, we're making analysis for consensus log.

@shargon
Copy link
Member

shargon commented Jul 2, 2019

Good news, the first step is "reproduce it"

@ZhangTao1596
Copy link
Author

ZhangTao1596 commented Jul 4, 2019

hello guys. I add log in neo based on v2.10.2 and reprodued the issue. Here is the log file:

2019-07-02.log
and neo-scan of this net.

At height 31922, this node is Primary. It proposal empty block.
There are a few inv tx hashes received. But this node only rqeust to get one tx data.
it's tx 0x6e5ff1844dc4050e5fd85d3be495c6a200e491b3f8da12fd7c5f415f92d844de. Which is confirmed at 31914. But you cann't find inv of this tx. Maybe it's long time before.

@shargon @vncoelho @jsolman @igormcoelho
I took a quick view at the file just now. Now we can be sure the node received tx hash. But it didn't try to get them.
We should do more deep investigate.
If you need more information I will add more log and reproduce

I think there are so many high priority messages in TaskManagerMailbox that no new task for tx.
Am I right?

I will add mailbox queue count logs.

And the block messages. This node can handle block messages. Block message is also hight priority letter.
Poor network node can only handle cosensus and block messages. It may have no more capacity for more work.

@shargon
Copy link
Member

shargon commented Jul 5, 2019

I took a quick view at the file just now. Now we can be sure the node received tx hash. But it didn't try to get them.

It is like a knownhash problem?

If you need more information I will add more log and reproduce

Add more logs on memorypool please

@jsolman
Copy link
Contributor

jsolman commented Jul 5, 2019

Hmm, if it receives the transaction but doesn’t think it is valid, but adds it in known hashes, but it later becomes valid, that could be a problem if knownhashes won’t allow it to receive it again. We should check if this is what is happening.

@jsolman
Copy link
Contributor

jsolman commented Jul 5, 2019

However that would not cause all blocks by that CN to be completely empty of all but the miner transaction. So that may not explain what is happening.

@ZhangTao1596
Copy link
Author

There are knownhashes count in log file. About 154000 hashes, it's around 4MB.

@ZhangTao1596
Copy link
Author

Will add mempool log and test

@shargon
Copy link
Member

shargon commented Jul 5, 2019

Could you test in parallel, with and without patch?

@ZhangTao1596
Copy link
Author

ZhangTao1596 commented Jul 5, 2019

You mean the knownhashes fix patch?

@shargon
Copy link
Member

shargon commented Jul 5, 2019

Yes, just for confirm if this solve the problem, at least in the same time frame

@ZhangTao1596
Copy link
Author

Btw

private void OnInvMessageReceived(InvPayload payload)
        {
            UInt256[] hashes = payload.Hashes.Where(p => knownHashes.Add(p)).ToArray();
            if (hashes.Length == 0) return;
            switch (payload.Type)
            {
                case InventoryType.Block:
                    using (Snapshot snapshot = Blockchain.Singleton.GetSnapshot())
                        hashes = hashes.Where(p => !snapshot.ContainsBlock(p)).ToArray();
                        Logger.Info("ProtocolHandler", $"receive inv, type: block, hashes: {hashes.HashesToString()}");
                    break;
                case InventoryType.TX:
                    using (Snapshot snapshot = Blockchain.Singleton.GetSnapshot())
                        hashes = hashes.Where(p => !snapshot.ContainsTransaction(p)).ToArray();
                        Logger.Info("ProtocolHandler", $"receive inv, type: tx, hashes: {hashes.HashesToString()}");
                    break;
            }
            if (hashes.Length == 0) return;
            system.TaskManager.Tell(new TaskManager.NewTasks { Payload = InvPayload.Create(payload.Type, hashes) }, Context.Parent);
        }

This is where I add inv tx log. I think it is nothing to do with knownHases in ProtocolHandler if there is a print log.

@ZhangTao1596
Copy link
Author

ZhangTao1596 commented Jul 8, 2019

I think we make a great progress. I add more logs and reproduced again in our testnet2.
2017-07-07.log. This all logs of a fault node.
consensus.log. Consensus log I filtered out.
TaskManagerMailbox.log. TaskManagerMailbox log I filtered out.
you can have a look where I add log here: https://github.com/KickSeason/neo/tree/test/log

you can look at the TaskManagerMailbox.log first. It's is crowding by a lot of low priority messages.
image
@vncoelho @igormcoelho @shargon @jsolman

@shargon
Copy link
Member

shargon commented Jul 8, 2019

13.300 low priority messages????

@ZhangTao1596
Copy link
Author

yes

@shargon
Copy link
Member

shargon commented Jul 8, 2019

TaskManagerMailbox.log. TaskManagerMailbox log I filtered out.

Is the same url, is this right?

@ZhangTao1596
Copy link
Author

Sorry, mistake.

@shargon
Copy link
Member

shargon commented Jul 8, 2019

I would like to know how many duplicates, we should check the hash before append to the queue

@shargon
Copy link
Member

shargon commented Jul 8, 2019

How many hours until the issue appear?

@ZhangTao1596
Copy link
Author

ZhangTao1596 commented Jul 8, 2019

3 days after restart.Only 1 node behave like this.
Now only queue messages count, if you need detail, I will add extra log.

@shargon
Copy link
Member

shargon commented Jul 8, 2019

I would like to know the type of the messages in the queue, something like
{ "GetBlock": 1123, "GetPeers": 999 ... }

is that possible?

Thanks for your work!

@ZhangTao1596
Copy link
Author

Will do.

@jsolman
Copy link
Contributor

jsolman commented Jul 10, 2019

We should cache the consensus messages on the CN nodes in case they arrive out of order. Other nodes should cache also for a time and drop duplicates and prevent storms of consensus messages.

@shargon
Copy link
Member

shargon commented Jul 10, 2019

Yes, we can have a consensus message pool

@jsolman
Copy link
Contributor

jsolman commented Jul 10, 2019

Also, any consensus message from a height less than the current block can just be dropped immediately and not forwarded. This alone will probably fix the issue.

@lock9 lock9 added Critical Issues (bugs) that need to be fixed ASAP Bug Used to tag confirmed bugs and removed Discussion Initial issue state - proposed but not yet accepted labels Jul 11, 2019
@ZhangTao1596
Copy link
Author

I will lose this because #900 #899 ?

Thacryba pushed a commit to simplitech/neo that referenced this issue Feb 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Used to tag confirmed bugs Critical Issues (bugs) that need to be fixed ASAP
Projects
None yet
Development

No branches or pull requests

8 participants