UDP packet drops and strange core usage behavior #152
Hi @vigodeltoro. Could you confirm: if I'm reading correctly, a single machine is currently processing 150k flows? Is there any possibility to scale horizontally (e.g., with ECMP)? Are you running the Docker version with host networking, or is it a build with Docker? Because the Dockerfile uses Alpine, an issue could be coming from the musl library. With IPFIX it could also be the templates lock, if traffic bursts regularly.
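For context on the templates-lock concern: IPFIX/NetFlow v9 decoding relies on a template cache that is shared across packets, and if that cache sits behind a single lock, bursts of data records can serialize on it. The sketch below is illustrative only and not goflow2's actual implementation; the type names, key layout, and simplified field representation are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// templateCache is an illustrative shared cache keyed by observation domain and template ID.
// Every decoded data record needs a read and every template record needs a write,
// so a single lock can become a contention point under bursty IPFIX traffic.
type templateCache struct {
	mu        sync.RWMutex
	templates map[uint32]map[uint16][]uint16 // domain -> template ID -> field lengths (simplified)
}

func (c *templateCache) lookup(domain uint32, id uint16) ([]uint16, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	fields, ok := c.templates[domain][id]
	return fields, ok
}

func (c *templateCache) store(domain uint32, id uint16, fields []uint16) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.templates[domain] == nil {
		c.templates[domain] = make(map[uint16][]uint16)
	}
	c.templates[domain][id] = fields
}

func main() {
	c := &templateCache{templates: make(map[uint32]map[uint16][]uint16)}
	c.store(1, 256, []uint16{4, 4, 2, 2})
	if f, ok := c.lookup(1, 256); ok {
		fmt.Println("template 256 field lengths:", f)
	}
}
```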
Hi Louis, I'm on holiday for one week starting this evening, but I pinged my colleagues. Thanks for the fast reply :)
Hi Louis, okay, I can't hold my fingers ;)
That sounds very promising. Yes, at the moment we try to run 150k on one server. Horizontal scaling is an option, but it needs a bit more time to set up.
So, 12 sockets are opened, and I'm now seeing 11 goflow2 processes (htop), but core usage (~70%) on only 8 cores.
Drop amount: 382 drops at udp_queue_rcv_skb+3df (0xffffffff972d7c6f)
If I double net.core.rmem_max I see 12 goflow2 processes.
dropwatch:
Last test without Docker, native goflow2 (commit f542b64):
./goflow2 -reuseport -format=pb -format.protobuf.fixedlen=true -listen=netflow://IP:2055?count=12 -mapping=/root/goflow2/compose/kcg/mapping/mapping.yml -transport=kafka -transport.kafka.brokers=IP:9094 -transport.kafka.topic=ipfix -transport.kafka.hashing=true -format.hash=SrcMac -workers=12
net.core.rmem_max=33554432
ss -ulpn shows only one socket, which is interesting. Trying this command instead:
./goflow2 -reuseport -format=pb -format.protobuf.fixedlen=true -listen=netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055 -mapping=/root/goflow2/compose/kcg/mapping/mapping.yml -transport=kafka -transport.kafka.brokers=IP:9094 -transport.kafka.topic=ipfix -transport.kafka.hashing=true -format.hash=SrcMac -workers=12
$ ss -ulpn
dropwatch shows what seem to be fewer drops. That is the most promising setup so far.
best
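For reference, the `count=N` option and the repeated `-listen` entries both come down to opening several UDP sockets on the same address with SO_REUSEPORT, so the kernel can spread incoming datagrams across them. A minimal sketch of that pattern in Go follows; it uses golang.org/x/sys/unix, the address and socket count are placeholders, and this is not goflow2's actual listener code.

```go
package main

import (
	"context"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReusePort opens one UDP socket with SO_REUSEPORT set, so several
// sockets can bind the same addr:port and the kernel load-balances datagrams.
func listenReusePort(addr string) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var serr error
			err := c.Control(func(fd uintptr) {
				serr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			})
			if err != nil {
				return err
			}
			return serr
		},
	}
	return lc.ListenPacket(context.Background(), "udp", addr)
}

func main() {
	const sockets = 12 // placeholder, mirrors count=12
	for i := 0; i < sockets; i++ {
		conn, err := listenReusePort(":2055")
		if err != nil {
			log.Fatal(err)
		}
		go func(id int, c net.PacketConn) {
			buf := make([]byte, 9000)
			for {
				n, src, err := c.ReadFrom(buf)
				if err != nil {
					return
				}
				log.Printf("socket %d: %d bytes from %s", id, n, src)
			}
		}(i, conn)
	}
	select {} // block forever
}
```

With this layout, `ss -ulpn` should show one UDP socket per goroutine bound to the same port, which is what the `count=12` form is expected to produce.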
Would you be able to run the tests using #150?
./goflow2 -listen sflow://:6343?count=10,netflow://:2055?count=10
Hi, I tried to run the test, but I'm struggling to install the new release. After compiling the new flow.proto file with "make proto" and the goflow2 binary with "make build", I try to start goflow2 but I get:
INFO[0000] Starting GoFlow2
goroutine 38 [running]:
I'm on CentOS 7.9, kernel 3.10.0-1160.71.1.el7.x86_64, GO111MODULE="".
The previous goflow versions were functional. If I use the prebuilt RPM from this repo I get:
goflow2: /lib64/libc.so.6: version
I tried it knowing that this is only a test, because I need to add our changed flow.proto for a fully functional goflow2. Do you have a clue what's wrong?
Best and thanks
Hi @vigodeltoro, |
Hi @lspgn |
Thank you for confirming :)
I tried to keep as many fields as possible while clearing out ones that could be replaced by a custom mapping.
Hi, so it took a little longer, sorry for that. We had major problems fixing our pipeline with the new protobuf schema.
With the goflow2 f542b64 commit and 12 cores we have nearly 0 UDP receive errors at 200-250k messages/s, but with the new 1.3.3 release we lose around 40% of the messages to UDP receive errors. I don't have a clue why; maybe you have an idea.
These are our kernel values, which run smoothly with the old commit:
net.core.netdev_budget = 300
We split the stream to 2 sockets (with both goflow2 versions) and start goflow2 with the following parameters.
Old commit (start command):
./goflow2 -reuseport -format=pb -format.protobuf.fixedlen=true -listen=netflow://xxx.xxx.xxx.xxx:2055,netflow://xxx.xxx.xxx.xxx:2055,netflow://xxx.xxx.xxx.xxx:2055,netflow://xxx.xxx.xxx.xxx:2055 -mapping=/etc/goflow2/mapping.yml -transport=kafka -transport.kafka.brokers=xxx.xxx.xxx.xxx:xxx -transport.kafka.topic=ipfix -transport.kafka.hashing=true -transport.kafka.version=3.4.0 -format.hash=SrcMac -metrics.addr=xxx.xxx.xxx.xxx:8081
v1.3.3 start commands:
./goflow2 -reuseport -format=pb -format.protobuf.fixedlen=true -listen=netflow://xxx.xxx.xxx.xxx:2056?count=8 -mapping=/etc/goflow2/mapping.yml -transport=kafka -transport.kafka.brokers=xxx.xxx.xxx.xxx:xxx -transport.kafka.topic=ipfix -transport.kafka.hashing=true -transport.kafka.version=3.4.0 -format.hash=SrcMac -metrics.addr=xxx.xxx.xxx.xxx:8082 -workers 8
./goflow2 -reuseport -format=pb -format.protobuf.fixedlen=true -listen=netflow://xxx.xxx.xxx.xxx:2055?count=4 -mapping=/etc/goflow2/mapping.yml -transport=kafka -transport.kafka.brokers=xxx.xxx.xxx.xxx:xxx -transport.kafka.topic=ipfix -transport.kafka.hashing=true -transport.kafka.version=3.4.0 -format.hash=SrcMac -metrics.addr=xxx.xxx.xxx.xxx:8081 -workers 4
BTW: if I send the IPFIX packets read by goflow2 to /dev/null I also get the UDP receive errors, so I can exclude a back-pressure issue with Kafka.
Best and thanks
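One way to quantify the ~40% loss independently of goflow2's own metrics is to watch the kernel's UDP counters (InErrors / RcvbufErrors) while the collector runs. The following is a small diagnostic sketch, not part of goflow2; it polls /proc/net/snmp and parses field positions from the header line rather than assuming them.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// readUDPCounters parses the two "Udp:" lines in /proc/net/snmp
// (header line with field names, then the values line) into a map.
func readUDPCounters() (map[string]string, error) {
	data, err := os.ReadFile("/proc/net/snmp")
	if err != nil {
		return nil, err
	}
	lines := strings.Split(string(data), "\n")
	for i := 0; i+1 < len(lines); i++ {
		if strings.HasPrefix(lines[i], "Udp:") && strings.HasPrefix(lines[i+1], "Udp:") {
			names := strings.Fields(lines[i])[1:]
			values := strings.Fields(lines[i+1])[1:]
			counters := make(map[string]string, len(names))
			for j := range names {
				if j < len(values) {
					counters[names[j]] = values[j]
				}
			}
			return counters, nil
		}
	}
	return nil, fmt.Errorf("Udp lines not found in /proc/net/snmp")
}

func main() {
	// Print InDatagrams, InErrors and RcvbufErrors every 5 seconds;
	// rising InErrors/RcvbufErrors while the collector runs points at socket overflow.
	for {
		c, err := readUDPCounters()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("InDatagrams=%s InErrors=%s RcvbufErrors=%s\n",
			c["InDatagrams"], c["InErrors"], c["RcvbufErrors"])
		time.Sleep(5 * time.Second)
	}
}
```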
Hi @vigodeltoro, would you be able to test other versions? In the case of v1.3.3, have you tried without ? In regards to v2, were the issues with ClickHouse? I shuffled the schema quite a bit; is the timestamp posing an issue?
Hi @lspgn, no problem. We have a lot of traffic, so it's hard to test with that amount of load. Stay tuned, I will come back with new results.
Hi @vigodeltoro ! |
Hi @lspgn |
Hi @lspgn, I tested on my new server (8 cores, 8 GiB RAM) with the following kernel parameter configuration:
net.core.rmem_max=2048000000
net.core.rmem_default=2048000000
I'm on it, but I need to ask for more capacity for the testing server (like the one we are using for production now). I write to /dev/null for testing.
A first preliminary result... but handle these results carefully; I will repeat the tests when I get more capacity and am able to run goflow2 f542b64 without packet loss.
Best regards
Hi @lspgn, I now have 16 cores (Intel(R) Xeon(R) CPU E5-2630L v4 @ 1.80GHz), but still packet loss, even with the old version. We are trying to split traffic per switch with iptables and NAT. Let's see.
BTW: what does the worker parameter do in detail? The documentation is not really clear about it.
Thanks a lot, best
I use a worker pool design where a worker is allocated for decoding a sample (in v2 there is one worker per "socket"). Regarding performance: the only way forward might be to provide patches and do live testing.
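As an illustration of that worker-pool idea (a minimal sketch, not goflow2's actual code: the channel size, worker count, and decode function are placeholder assumptions), packets read from the socket are handed to a fixed number of decoding goroutines over a channel; if the decoders or the transport fall behind, the channel fills up and the socket buffer overflows, which shows up as UDP drops.

```go
package main

import (
	"fmt"
	"sync"
)

type packet struct {
	payload []byte
}

// decode stands in for NetFlow/IPFIX decoding plus serialization to the transport.
func decode(p packet) {
	_ = len(p.payload) // placeholder work
}

func main() {
	const workers = 12               // corresponds to the idea behind the -workers flag
	queue := make(chan packet, 1024) // bounded queue between socket reader(s) and decoders

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range queue {
				decode(p)
			}
		}()
	}

	// In a real collector this loop would be fed by ReadFrom on the UDP socket(s).
	for i := 0; i < 10000; i++ {
		queue <- packet{payload: make([]byte, 1500)}
	}
	close(queue)
	wg.Wait()
	fmt.Println("done")
}
```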
Hi Louis,
we have some strange behavior we can't explain; maybe you can help.
We are missing IPFIX data compared with another system in our company that receives the same data through a different type of IPFIX collector.
Further investigation led us to UDP receive errors / drops at the UDP receive queue on our goflow2 IPFIX collector server (monitored with dropwatch):
8 drops at tpacket_rcv+5f (0xffffffff973642df)
464 drops at udp_queue_rcv_skb+3df (0xffffffff972d7c6f)
988 drops at tpacket_rcv+5f (0xffffffff973642df)
5 drops at tcp_rcv_state_process+1bc (0xffffffff972c096c)
10 drops at tcp_v4_do_rcv+80 (0xffffffff972cafc0)
3 drops at tpacket_rcv+5f (0xffffffff973642df)
8 drops at tpacket_rcv+5f (0xffffffff973642df)
7 drops at tpacket_rcv+5f (0xffffffff973642df)
11 drops at tpacket_rcv+5f (0xffffffff973642df)
7 drops at tpacket_rcv+5f (0xffffffff973642df)
7 drops at tpacket_rcv+5f (0xffffffff973642df)
6 drops at tpacket_rcv+5f (0xffffffff973642df)
8 drops at tpacket_rcv+5f (0xffffffff973642df)
7 drops at tpacket_rcv+5f (0xffffffff973642df)
1049 drops at tpacket_rcv+5f (0xffffffff973642df)
434 drops at udp_queue_rcv_skb+3df (0xffffffff972d7c6f)
5 drops at tcp_v4_do_rcv+80 (0xffffffff972cafc0)
3 drops at tcp_v4_rcv+87 (0xffffffff972cc147)
The system is a 12-core server with 16 GB RAM.
Oracle Linux, kernel 3.10.0-1160.71.1.el7.x86_64
net.core.rmem_max=16777216
net.core.rmem_default=212992
net.core.wmem_max=16777216
net.core.netdev_max_backlog = 8000
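Those rmem values matter because a UDP socket only gets as large a receive buffer as it requests, capped by net.core.rmem_max (unless the process uses SO_RCVBUFFORCE with the right privileges). Below is a small sketch, not goflow2 code, that requests a large buffer and prints what the kernel actually granted; the 16 MiB request and the port are placeholders.

```go
package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 2055}) // placeholder port
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Request a 16 MiB receive buffer; the kernel silently caps this at
	// net.core.rmem_max for unprivileged processes.
	if err := conn.SetReadBuffer(16 * 1024 * 1024); err != nil {
		log.Fatal(err)
	}

	// Read back SO_RCVBUF to see what was actually granted
	// (Linux reports double the requested value due to bookkeeping overhead).
	raw, err := conn.SyscallConn()
	if err != nil {
		log.Fatal(err)
	}
	raw.Control(func(fd uintptr) {
		granted, err := unix.GetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUF)
		if err != nil {
			log.Println("getsockopt:", err)
			return
		}
		log.Printf("SO_RCVBUF granted: %d bytes", granted)
	})
}
```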
Our goflow2 is running in a Docker container, and it's the branch with the Nokia fix you did for us last year (#105, #106).
The strange behavior is that when I run the container, I can see only 8 cores used by goflow2, together with the mentioned drops. Core usage of those 8 cores averages 55-60% at 130-150k IPFIX messages/s (protobuf to Kafka).
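One thing worth ruling out for the "only 8 cores used" observation is the Go scheduler's view of the machine inside the container. A quick diagnostic sketch (not part of goflow2) compares what the runtime thinks it can use against the CPU count exposed to the process; GOMAXPROCS can be overridden via the environment if the two differ.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

func main() {
	// NumCPU is what the OS exposes to the process; GOMAXPROCS is how many
	// OS threads Go will run Go code on simultaneously. If GOMAXPROCS is
	// lower than expected inside the container, decoding parallelism is capped.
	fmt.Println("NumCPU:        ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:    ", runtime.GOMAXPROCS(0)) // 0 queries without changing the value
	fmt.Println("GOMAXPROCS env:", os.Getenv("GOMAXPROCS"))
}
```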
Our compose config:
version: "3"
services:
goflow:
build:
context: ../../
dockerfile: Dockerfile
network_mode: host
ports:
- IP:8080:8080
- IP:2055:2055/udp
restart: always
command:
- -reuseport
- -format=pb
- -format.protobuf.fixedlen=true
- -listen=netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055,netflow://IP:2055
- -mapping=/etc/mapping/mapping.yml
- -transport=kafka
- -transport.kafka.brokers=IP:PORT
- -transport.kafka.topic=ipfix
- -transport.kafka.hashing=true
- -format.hash=SrcMac
volumes:
- ./mapping:/etc/mapping
- ./logs:/tmp/logs
Our next idea was to update the goflow2 version, but because of the camel-case fixes the last version I can use is commit f542b64.
So we tested it both as a Docker container and as a directly compiled goflow2 without Docker.
With the "new" Docker container I have the same core usage issue but a lower usage percentage, fewer IPFIX packets read (around 80-100k in the Grafana goflow2 metrics) and about 10 times more drops.
With the directly compiled process all cores are used! That is interesting, but the numbers of drops and packets read are the same (80-100k read and 10 times more drops). To reach that I have to set workers to 12;
with fewer workers it is less.
command to run:
./goflow2 -reuseport -format=pb -format.protobuf.fixedlen=true -listen=netflow://IP:2055?count=12 -mapping=/root/goflow2/compose/kcg/mapping/mapping.yml -transport=kafka -transport.kafka.brokers=IP:9094 -transport.kafka.topic=ipfix -transport.kafka.hashing=true -format.hash=SrcMac -workers=12
Maybe you have another idea of what could explain this strange behavior.
Thanks a lot and best regards
Christian