Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TLS connections hang after PR #86 #89

Closed
a2geek opened this issue Aug 12, 2019 · 17 comments
Closed

TLS connections hang after PR #86 #89

a2geek opened this issue Aug 12, 2019 · 17 comments

Comments

@a2geek
Copy link

a2geek commented Aug 12, 2019

Uncertain exactly why this occurs, but between the 6/26 commit and 7/15 commit (PR #86) the TLS connectivity no longer works. Commit hashes are shown in the go.mod file below if that assists.

Please let me know if I need to alter behavior of the code. For now, I'll lock into the 6/26 code.

My environment:

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"
$ virsh version
Compiled against library: libvirt 4.0.0
Using library: libvirt 4.0.0
Using API: QEMU 4.0.0
Running hypervisor: QEMU 2.11.1

Bad run sample output (the hang lasts for hours - we left town today and it was still hanging when I returned):

$ go run main.go
TCP connection to 'odin:16514' established
Libvirt client created
^Csignal: interrupt

Good run sample output:

$ go run main.go
TCP connection to 'odin:16514' established
Libvirt client created
Connected to libvirt
Version: 4.0.0
ID	Name		UUID
--------------------------------------------------------
17	vm-852ea340-3a46-471e-7b33-16b30df9bcba	852ea3403a46471e7b3316b30df9bcba

go.mod file:

module tls

go 1.12

// GOOD
require github.com/digitalocean/go-libvirt v0.0.0-20190626172931-4d226dd6c437

// BAD
//require github.com/digitalocean/go-libvirt v0.0.0-20190715144809-7b622097a793

main.go sample code (sorry, a bit long and hard-coded to locations on my filesystem):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"log"

	libvirt "github.com/digitalocean/go-libvirt"
)

const connectstring = "odin:16514"
const keyfile = "/home/rob/libvirt-cert/clientkey.pem"
const certfile = "/home/rob/libvirt-cert/clientcert.pem"
const cafile = "/home/rob/libvirt-cert/cacert.pem"

func main() {
	keyfilexml, err := ioutil.ReadFile(keyfile)
	if err != nil {
		panic(err)
	}

	certfilexml, err := ioutil.ReadFile(certfile)
	if err != nil {
		panic(err)
	}

	cafilexml, err := ioutil.ReadFile(cafile)
	if err != nil {
		panic(err)
	}

	cert, err := tls.X509KeyPair([]byte(certfilexml), []byte(keyfilexml))
	if err != nil {
		panic(err)
	}

	roots := x509.NewCertPool()
	roots.AppendCertsFromPEM([]byte(cafilexml))

	cfg := &tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      roots,
	}

	conn, err := tls.Dial("tcp", connectstring, cfg)
	defer conn.Close()
	if err != nil {
		panic(err)
	}
	fmt.Printf("TCP connection to '%s' established\n", connectstring)

	client := libvirt.New(conn)
	fmt.Printf("Libvirt client created\n")
	err = client.Connect()
	defer client.Disconnect()
	if err != nil {
		panic(err)
	}
	fmt.Printf("Connected to libvirt\n")

	v, err := client.Version()
	if err != nil {
		log.Fatalf("failed to retrieve libvirt version: %v", err)
	}
	fmt.Println("Version:", v)

	domains, err := client.Domains()
	if err != nil {
		log.Fatalf("failed to retrieve domains: %v", err)
	}

	fmt.Println("ID\tName\t\tUUID")
	fmt.Printf("--------------------------------------------------------\n")
	for _, d := range domains {
		fmt.Printf("%d\t%s\t%x\n", d.ID, d.Name, d.UUID)
	}

	if err := client.Disconnect(); err != nil {
		log.Fatalf("failed to disconnect: %v", err)
	}
}
a2geek added a commit to a2geek/libvirt-bosh-cpi that referenced this issue Aug 12, 2019
@trapgate
Copy link
Member

I think what's happening here is that the change from PR #86 has affected the listen() function in rpc.go. I'd suggest instrumenting that function to see what's happening. It may be that the call to pktlen() or extractHeader() is now returning an error, where it once returned a short read instead. That could cause the listening goroutine to exit, or just to ignore an important packet. This hasn't affected non-TLS connections, at least to localhost; those are still succeeding, at least for me.

@a2geek
Copy link
Author

a2geek commented Aug 12, 2019

I've never even debugged a Go application, believe it or not. :-)

But, the debugger worked for me and luckily the code was pretty easy to track down.

All of this is in rpc.go as you surmise.

On line 158 of listen(), the packet returned length is 16MB. Which seems fishy. At some point, it then does a io.ReadFull(...) for that size buffer. Which never shows up.

Tracing into pktlen(...) I added comments for the return values, if that's useful:

func pktlen(r io.Reader) (uint32, error) {
	buf := make([]byte, constants.PacketLengthSize /* 4 bytes? */)

	// read exactly constants.PacketLengthSize bytes
	_, err := io.ReadFull(r, buf)
       /* tracing through ReadFull indicates this is 4 bytes, as would be expected */
	if err != nil {
		return 0, err
	}

	/* buf is 1,0,0,0 (index 0,1,2,3) */
	return binary.BigEndian.Uint32(buf), nil
}

@trapgate
Copy link
Member

I agree that a returned packet length of 16MB is pretty fishy.

The listen() function works like this:

  1. Read 4 bytes from the input stream, decode that as an integer packet length.
  2. Read 24 bytes from the input stream, decode that as a header.
  3. Read (PacketLength - 28) bytes from the stream, that's the payload.

But, notice the error handling: if step 1 or 2 returns an error, the routine just does a continue. That's almost certainly not correct: if step 2 returns an error for some reason, then we will re-enter the loop and read the next 4 bytes as the length of the following packet, without having read the payload of the previous packet. I think something like this is happening in your case, so I'm curious whether either pktlen or extractHeader returned an error prior to the call where pktlen returned a 16MB packet.

@a2geek
Copy link
Author

a2geek commented Aug 27, 2019

Running through the debugger, notes:

  • No errors at all.
  • Looks to be 1st packet ever exchanged

From listen():

  • listen = 16777216; the buffer in ptklen(...) is [1,0,0,0].
  • header h:
    • Program = 606077056
    • Version = 2248146944
    • Procedure = 16777216
    • Type = 1107296256
    • Serial = 16777216
    • Status = 16777216

... so the values are definitely funky.

@a2geek
Copy link
Author

a2geek commented Aug 27, 2019

Ran the debug session again, pasted in the old loop into pktlen(...) and observed something interesting!

The loop in pktlen ran twice:

  1. Reads 1 byte with a value of 1. But the bytes read (1) is less than the cap(buf) (4).
  2. Reads 4 bytes with the values of 0,0,0,36. Bytes read (5) is larger than cap(buf).
  3. Returns length value of 36 which is much more sensible.

So... is packet length 4 or 5 bytes? Or is there a different part of the header (maybe a protocol version or something) hidden in there?

@a2geek
Copy link
Author

a2geek commented Aug 27, 2019

Finally, just in case this is useful, I created an Echo connection with a basic Read function like this:

func (e Echo) Read(b []byte) (n int, err error) {
	n, err = e.conn.Read(b)
	if n > 0 {
		fmt.Printf("b=")
		for i := 0; i < n; i++ {
			fmt.Printf("%02x ", b[i])
		}
		fmt.Printf("\n")
	}
	return n, err
}

The output is like this (byte sequences, each line is a Read):

$ go run -mod=vendor main.go
TCP connection to 'odin:16514' established
Libvirt client created
b=01 
b=00 00 00 24 20 00 80 86 00 00 00 01 00 00 00 42 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 
^Csignal: interrupt

(it hung at the ^C)

Not certain why there is an initial 0x01 in the connection.

@a2geek
Copy link
Author

a2geek commented Aug 29, 2019

As another experiment, I ran a simple program with a socket connection followed by the TLS connection. The echo output suggests that the initial 0x01 in the stream is the only one:

Connected to '/var/run/libvirt/libvirt-sock'
Libvirt client created
write: 00 00 00 38 20 00 80 86 00 00 00 01 00 00 00 42 00 00 00 00 00 00 00 01 00 00 00 00 01 00 00 00 00 00 00 0e 71 65 6d 75 3a 2f 2f 2f 73 79 73 74 65 6d 00 00 00 00 00 00 
read: 00 00 00 24 20 00 80 86 00 00 00 01 00 00 00 42 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 
write: 00 00 00 38 20 00 80 86 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00 02 00 00 00 00 01 00 00 00 00 00 00 0e 71 65 6d 75 3a 2f 2f 2f 73 79 73 74 65 6d 00 00 00 00 00 00 
read: 00 00 00 1c 20 00 80 86 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00 02 00 00 00 00 
Connected to libvirt
write: 00 00 00 1c 20 00 80 86 00 00 00 01 00 00 00 9d 00 00 00 00 00 00 00 03 00 00 00 00 
read: 00 00 00 24 20 00 80 86 00 00 00 01 00 00 00 9d 00 00 00 01 00 00 00 03 00 00 00 00 00 00 00 00 00 3d 09 00 
Version: 4.0.0
write: 00 00 00 1c 20 00 80 86 00 00 00 01 00 00 00 02 00 00 00 00 00 00 00 04 00 00 00 00 
read: 00 00 00 1c 20 00 80 86 00 00 00 01 00 00 00 02 00 00 00 01 00 00 00 04 00 00 00 00 


TLS connection to 'odin:16514' established
Libvirt client created
write: 00 00 00 38 20 00 80 86 00 00 00 01 00 00 00 42 00 00 00 00 00 00 00 01 00 00 00 00 01 00 00 00 00 00 00 0e 71 65 6d 75 3a 2f 2f 2f 73 79 73 74 65 6d 00 00 00 00 00 00 
read: 01 
read: 00 00 00 24 20 00 80 86 00 00 00 01 00 00 00 42 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 
^Csignal: interrupt

If I then swallow that initial byte, it seems to work:

TLS connection to 'odin:16514' established
Libvirt client created
write: 00 00 00 38 20 00 80 86 00 00 00 01 00 00 00 42 00 00 00 00 00 00 00 01 00 00 00 00 01 00 00 00 00 00 00 0e 71 65 6d 75 3a 2f 2f 2f 73 79 73 74 65 6d 00 00 00 00 00 00 
read: 00 00 00 24 20 00 80 86 00 00 00 01 00 00 00 42 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 
write: 00 00 00 38 20 00 80 86 00 00 00 01 00 00 00 01 00 00 00 00 00 00 00 02 00 00 00 00 01 00 00 00 00 00 00 0e 71 65 6d 75 3a 2f 2f 2f 73 79 73 74 65 6d 00 00 00 00 00 00 
read: 00 00 00 1c 20 00 80 86 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00 02 00 00 00 00 
Connected to libvirt
write: 00 00 00 1c 20 00 80 86 00 00 00 01 00 00 00 9d 00 00 00 00 00 00 00 03 00 00 00 00 
read: 00 00 00 24 20 00 80 86 00 00 00 01 00 00 00 9d 00 00 00 01 00 00 00 03 00 00 00 00 00 00 00 00 00 3d 09 00 
Version: 4.0.0
write: 00 00 00 1c 20 00 80 86 00 00 00 01 00 00 00 02 00 00 00 00 00 00 00 04 00 00 00 00 
read: 00 00 00 1c 20 00 80 86 00 00 00 01 00 00 00 02 00 00 00 01 00 00 00 04 00 00 00 00 

Is this something with TLS I need to deal with? I can't figure out where that little bit of noise is coming from...

@trapgate
Copy link
Member

It sure looks to me like that leftover byte comes from the TLS handshake, and should have been consumed by the TLS library before it completed. I did some googling to see if there were any bugs reported against the golang TLS library that looked like this, but I came up empty. What version of go are you using?

One approach would be to run tcpdump and observe the packets transferred during the TLS handshake. The last one would have to end in a byte with a value of 0x01, and that value would have to not be consumed by the TLS library.

Another possibility would be a bug in libvirtd, but such a bug would probably affect non-TLS connections too, so that seems unlikely.

Here's the documentation on the wire protocol libvirt uses: https://libvirt.org/internals/rpc.html.

Hope this helps!

@a2geek
Copy link
Author

a2geek commented Aug 30, 2019

I guess that's possible. But what confuses me is that prior to PR #86, everything works fine. However, I think my debugging might suggest the original code was buggy (instead of reading 4 bytes, that pktlen routine actually read 5). I'm sitting on the commit prior to that PR and I have observed no issues.

I don't think it's Libvirt, as virsh is happy with the TLS connection:

$ virsh -c qemu+tls://odin/system?pkipath=/home/rob/libvirt-cert
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # version
Compiled against library: libvirt 4.0.0
Using library: libvirt 4.0.0
Using API: QEMU 4.0.0
Running hypervisor: QEMU 2.11.1

virsh # list
 Id    Name                           State
----------------------------------------------------
 30    vm-e7ceede4-b0e6-4b3d-566e-8490c7168d22 paused

virsh # quit

As far as libraries, I'm using Go 1.12. Current version is:

$ go version
go version go1.12.9 linux/amd64

Originally, this was 1.12.6, but I updated based on the suggestion there may be a bug.

Is there sample code somewhere that I can review for making a correct TLS connection? Maybe I've got something wonked, but I don't see it.

@a2geek
Copy link
Author

a2geek commented Aug 30, 2019

I tried Wireshark and was able to capture the sequence. There are 3 packets, the initiating packet, this 0x01 packet and then the expected packet. Uncertain what specifically to capture, but this packet comes after the initiation packet to Libvirt (includes the qemu:///system content) - so this appears to be all Libvirt and the first response after...

Frame 22: 96 bytes on wire (768 bits), 96 bytes captured (768 bits) on interface 0
Ethernet II, Src: 00:00:00_00:00:00 (00:00:00:00:00:00), Dst: 00:00:00_00:00:00 (00:00:00:00:00:00)
Internet Protocol Version 4, Src: 127.0.1.1, Dst: 127.0.0.1
Transmission Control Protocol, Src Port: 16514, Dst Port: 45632, Seq: 1830, Ack: 1959, Len: 30
Secure Sockets Layer
    TLSv1.2 Record Layer: Application Data Protocol: Application Data
        Content Type: Application Data (23)
        Version: TLS 1.2 (0x0303)
        Length: 25
        Encrypted Application Data: 000000000000000118ff54e9f96357ef1c7cda384027ba8f...

Frame:
0000   00 00 00 00 00 00 00 00 00 00 00 00 08 00 45 00
0010   00 52 5f e7 40 00 40 06 db bc 7f 00 01 01 7f 00
0020   00 01 40 82 b2 40 8a 6b c9 08 ec f8 6b 83 80 18
0030   05 5d ff 46 00 00 01 01 08 0a 62 1f cd e8 f0 63
0040   3d 02[17/03 03/00 19/00 00 00 00 00 00 00 01 18  <== type, TLS 1.2, length, enc. packet
0050   ff 54 e9 f9 63 57 ef 1c 7c da 38 40 27 ba 8f 1b] <== continuation of encrypted packet

Decrypted:
0000   01

@trapgate
Copy link
Member

trapgate commented Sep 4, 2019

I've reproduced this with a go program that doesn't import go-libvirt. After the call to tls.Dial() connects successfully to the libvirt daemon, a call to conn.Read() with a 1-byte buffer will return 1 byte. Subsequent Read calls will block waiting for the daemon to send something.

This is either a bug in libvirt, or a bug in the go TLS library. The old version of go-libvirt masked this bug - it accidentally ignored that odd 1-byte packet.

I should have some more time to look at this tomorrow.

@a2geek
Copy link
Author

a2geek commented Sep 7, 2019

Good to hear I'm not crazy! (Although it would be easier if I just had a bug in my code.)

Let me know if you need me to test anything or look into something.

@benlemasurier
Copy link
Contributor

It looks like @trapgate's investigation points to an upstream bug in libvirtd/tls. I'm going to close the issue as I don't think we should handle the problem here in go-libvirt. Please feel free to continue commenting.

@weizhouapache
Copy link
Contributor

@a2geek @benlemasurier @trapgate

Gents, I face this issue recently.

I use latest go 1.15.6 and Ubuntu 20.04.
The following works

github.com/digitalocean/go-libvirt v0.0.0-20190626172931-4d226dd6c437

Any chance to test other OSes like centos ?
if same issue happens with some other OSes, it would be nice to find a workaround in go-libvirt.
Thanks !

@trapgate
Copy link
Member

trapgate commented Jan 5, 2021

That older version of go-libvirt does avoid this particular bug, but it suffers from crashes when responses from libvirt are fragmented into multiple packets - that's the issue fixed by #86.

I have no reason to believe that this bug is OS-dependent.

The workaround to this problem would be to drain any packets that are received by the client after the tls.Dial call completes, before calling libvirt.New() with the tls connection.

@weizhouapache
Copy link
Contributor

@trapgate thanks a lot.

per you suggestion, I add a line before libvirt.New(conn) to drop a byte.

conn.Read(make([]byte, 1))

it works fine now

@berrange
Copy link
Contributor

I've reproduced this with a go program that doesn't import go-libvirt. After the call to tls.Dial() connects successfully to the libvirt daemon, a call to conn.Read() with a 1-byte buffer will return 1 byte. Subsequent Read calls will block waiting for the daemon to send something.

This is either a bug in libvirt, or a bug in the go TLS library. The old version of go-libvirt masked this bug - it accidentally ignored that odd 1-byte packet.

This is not a bug, it is part of the libvirt RPC protocol when running over TLS.

https://gitlab.com/libvirt/libvirt/-/blob/1dddd444519717c785a9fd93f393c3313c87c342/src/rpc/virnetserverclient.c#L300

    /* Checks have succeeded.  Write a '\1' byte back to the client to
     * indicate this (otherwise the socket is abruptly closed).
     * (NB. The '\1' byte is sent in an encrypted record).
     */

And on the client side

https://gitlab.com/libvirt/libvirt/-/blob/1dddd444519717c785a9fd93f393c3313c87c342/src/rpc/virnetclient.c#L986

    /* At this point, the server is verifying _our_ certificate, IP address,
     * etc.  If we make the grade, it will send us a '\1' byte.
     */

...snip...

    len = virNetTLSSessionRead(client->tls, buf, 1);
    if (len < 0 && errno != ENOMSG) {
        virReportSystemError(errno, "%s",
                             _("Unable to read TLS confirmation"));
        goto error;
    }
    if (len != 1 || buf[0] != '\1') {
        virReportError(VIR_ERR_RPC, "%s",
                       _("server verification (of our certificate or IP "
                         "address) failed"));
        goto error;
    }

Yes, we can argue this extra byte is pretty much redundant, but its there in the protocol for coming on for 15 years, so we can't remove it due to backcompat.

A compliant client must read this single byte after TLS handshake is done and verify that it is the value 0x1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants