Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LWIP memory leak? #3419

Closed
kovikti opened this issue Jul 14, 2017 · 7 comments
Closed

LWIP memory leak? #3419

kovikti opened this issue Jul 14, 2017 · 7 comments

Comments

@kovikti
Copy link

kovikti commented Jul 14, 2017

Basic Infos

Main question: can there be a situation during a tcp communication session, where lwip crashes or memory is not freed up? If yes, when, how to avoid it?
Can anything cause this on the server/network side?

Hardware

Hardware: ESP-12E
Core Version: most recent git master, also tried git sdk 2.1.0

Description

I'm experiencing some strange issues with (assuming) the TCP/IP library.
It is very difficult to debug and and I'm still investigating it, however it already took more time than I wanted. Sorry in advance...
So I have a rather large piece of software, POSTing data to a server regularly, every 10 sec.
Everything was working fine (weeks without any issues), now I changed the server to a private one instead of a shared one at a hosting company. PC to his server connection is fine.
Now what I see is that sometimes (quite random, from every few requests to one out of several hours), data is not sent by WiFiClient (debug does not show ":sent"), and the buffer seems to be not freed. client.stop() s called after a timeout period. So slowly memory runs out and no more request can be made. The system is still running (more or less), but useless.
I debugged my heap usage, I free up everything that I allocated, I'm pretty sure the problem is in LWIP or close to it.
On the server side (Apache) I see missing requests (nothing appears in the access log) or request timeouts as no data arrived to the server (such as [12/Jul/2017:12:22:41 +0200] "-" 408 0 "-" "-").
At the first case the connection did not build up successfully, at the second, connection is ok, but packets were lost. Normally this shouldn't happen with tcp.
I did tcp dumps on the server that confirms this, some packets didn't reach the server. (However providing an ssh tunnel to the server from a localhost provided better results - still investigating).
I added delay after the client.write() function, it seemed to help, sending smaller chucks to write, also helped, but the problem still persists.
It occures pretty random, so not easy to tell.
According to wireshark if the connection was built up and the problem arises, I saw several duped ACKs, overlapping ACKs, strange things, but simetimes the tcp stream is just cut off during the session:
Server side: SYN, SYN ACK, ACK, PACKET, ACK, and nothing more, however an other packet should have been sent by the ESP.

Earlier I assumed WiFiClient.write is blocking, until ack received, so I assumed the missing packes were sent from the ESP but it looks like not. I also had some suspicion of the dynamic memory allocation for ClientContext in WiFiClient, maybe it is not freed up?
I also tried switching to the async tcp library, in order to have better control over the packets, which seemes great, but the problem seems to persist. For now it seems much more rare, but difficult to tell. I haven't yet seen memory leaks here, only crashes.
See the log at the bottom, an ack should have arrived.

I assume there might be some strange constellation between some kind of a network error (?) and a bug in lwip: not handling retransmission or acks in some strange situation.

Still investigating, but I wanted to know if anyone has experienced anything like this.

EDIT: I managed to reproduce the problem with a simple sketch. See below.
Try changing host and chunksize.
With google.com only small chunk size works (<=1460), but with for example httpbin.org size 2000 works without problems.
However, if I connect the ESP to mobile hotspot, it works!
I'm really confused now.

Settings in IDE

IDE: VMICRO+Visual Studio
Module: Generic ESP8266 Module
Flash Size: 4MB
CPU Frequency: 80Mhz
Flash Mode: qio
Flash Frequency: 40Mhz
Upload Using: SERIAL
Reset Method: nodemcu

Sketch

#include <ESP8266WiFi.h>

const char* ssid = "...";
const char* password = "...";

//const char* host = "httpbin.org";
const char* host = "google.com";

int chunksize = 1000;// 2000;

int contentLength = 2000;

void setup() {
	Serial.begin(115200);
	delay(10);

	// We start by connecting to a WiFi network

	Serial.println();
	Serial.println();
	Serial.print("Connecting to ");
	Serial.println(ssid);

	WiFi.mode(WIFI_STA);
	WiFi.begin(ssid, password);

	while (WiFi.status() != WL_CONNECTED) {
		delay(500);
		Serial.print(".");
	}

	Serial.println("");
	Serial.println("WiFi connected");
	Serial.println("IP address: ");
	Serial.println(WiFi.localIP());
}

WiFiClient client;

void loop() {
	//client.setNoDelay(true);
	delay(2000);
	Serial.printf("\nHeap: %d\n", ESP.getFreeHeap());
	++value;
	
	Serial.printf("Time: %d\n", millis()/1000);

	Serial.print("connecting to ");
	Serial.println(host);

	const int httpPort = 80;
	if (!client.connect(host, httpPort)) {
		Serial.println("connection failed");
		return;
	}
	Serial.println("Connected.");
	
	String url = "/";

	Serial.print("Requesting URL: ");
	Serial.println(url);

	
	String s = String("POST ") + url + " HTTP/1.1\r\n" +
		"Host: " + host + "\r\n" +
		"Content-type: application/octet-stream\r\n" +
		"Connection: close" + "\r\nContent-length: " + (String)contentLength + "\r\n\r\n";

	int dataLen = contentLength + s.length();
	uint8_t data[dataLen];

	memcpy(data, s.c_str(), s.length());

	int pos = 0;

	auto time1 = millis();
	while (true){
		int len = dataLen - pos;
		if (len == 0){
			break;
		}
		if (len > chunksize){ len = chunksize; }
		client.write(data+pos, len);
		pos += len;
		yield();
	}
	auto time2 = millis();

	Serial.printf("send time: %d\n", time2-time1);

	unsigned long timeout = millis();
	while (client.available() == 0) {
		if (millis() - timeout > 5000) {
			Serial.println(">>> Client Timeout !");
			client.stop();
			return;
		}
	}

	// Read all the lines of the reply from server and print them to Serial
	while (client.available()){
		String line = client.readStringUntil('\r');
		Serial.print(line);
	}

	Serial.println();
	Serial.println("closing connection");
	client.stop();
}

Debug Messages

Async Httpclient start connecting
Async Httpclient connected
Async Httpclient write len: 1400
Async Httpclient ACK len: 1400 time: 58
Async Httpclient write len: 748
Async Httpclient send finished
mac 985

ets Jan 8 2013,rst cause:4, boot mode:(1,6)

wdt reset

(second ACK should have arrived!)

@d-a-v
Copy link
Collaborator

d-a-v commented Jul 15, 2017

I tried your sketch with chunksize=4000 and lwIPv2 PR #3362.
I'm not sure I run it the right way, but it seems stable (heap is stable, google data length is > 1460).

Heap: 35832 (or 35664)
Time: ...768...(edit:)40000
connecting to google.com
send time: 19 (or 18)
Allow: GET, HEAD
Content-Length: 1589
Connection: close

(is this similar to #2925 ?)

@kovikti
Copy link
Author

kovikti commented Jul 16, 2017

Thanks for the info, https://github.com/esp8266/Arduino/issues/2925 looks similar, but I don't have high latency.
I tried branches lwip2 and lwip2v2, it sure seemd to work better, however it worked only in STA mode, whenever I changed to AP_STA, I got crashed, wdt or Panic.

Log for the sample sketch (git, master version, not lwip2):

WiFi connected
IP address: 
192.168.1.110

Heap: 43408
Time: 1
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 0
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 0
send time: 2
>>> Client Timeout !
:ur 1
:close
:del

Heap: 38384
Time: 8
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 0
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 0
send time: 2
pm open,type:2 0
>>> Client Timeout !
:ur 1
:close
:del

Heap: 32616
Time: 15
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 0
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 0
send time: 2
>>> Client Timeout !
:ur 1
:close
:del

Heap: 27624
Time: 22
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 0
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 0
send time: 2
>>> Client Timeout !
:ur 1
:close
:del

Heap: 22632
Time: 29
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 0
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 0
send time: 2
>>> Client Timeout !
:ur 1
:close
:del

Heap: 17640
Time: 36
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 0
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 0
send time: 2
>>> Client Timeout !
:ur 1
:close
:del

Heap: 12648
Time: 43
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 0
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 0
send time: 2
>>> Client Timeout !
:ur 1
:close
:del

Heap: 7656
Time: 50
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 0
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 0
send time: 3
>>> Client Timeout !
:ur 1
:close
:del

Heap: 2664
Time: 57
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 -1
:sent 1280
:wr 464 464 1280
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 -1
send time: 29
:sent 464
>>> Client Timeout !
:ur 1
:close
:del

Heap: 2304
Time: 64
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 -1
:sent 1280
:wr 464 464 1280
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 -1
send time: 19
:sent 464
>>> Client Timeout !
:ur 1
:close
:del

Heap: 2304
Time: 71
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 -1
:sent 1280
:wr 464 464 1280
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 -1
send time: 19
:sent 464
>>> Client Timeout !
:ur 1
:close
:del

Heap: 2304
Time: 78
connecting to xyz.xyz.com
:ref 1
Connected.
Requesting URL: /
:wr 2000 2000 0
:wrc 256 2000 0
:wrc 256 1744 0
:wrc 256 1488 0
:wrc 256 1232 0
:wrc 256 976 0
:wrc 256 720 -1
:sent 1280
:wr 464 464 1280
:wrc 256 464 0
:wrc 208 208 0
:wr 126 126 0
:wrc 126 126 -1
send time: 28
:sent 464
>>> Client Timeout !
:ur 1
:close
:del

@kovikti
Copy link
Author

kovikti commented Jul 16, 2017

Also strange, that the server that does not work is a subdomain (ie abc.xyz.com), sending to the top level (ie xyz.com) works fine! Same IP address.

@d-a-v
Copy link
Collaborator

d-a-v commented Jul 16, 2017

lwip2v2 is supposed to work better. did you experience ap+sta crash with it ?
with the same sketch as above ?

@kovikti
Copy link
Author

kovikti commented Jul 16, 2017

yup, same sketch, ap+sta exception.

However I recompiled the original lwip with MSS=536 and memory limited to 8000 from 16000 now it seems to work better, MSS is truly limited. Now I'll run it for a long time and will see.

@d-a-v
Copy link
Collaborator

d-a-v commented Jul 16, 2017

I tried with lwip2v2, ap+sta is working like with sta alone, even with my mobile connected to AP.
Are you sure you experienced this crash with lwip2v2 not the other one ?
Let us know if limiting mss to 536 with original lwip works for you.

@devyte
Copy link
Collaborator

devyte commented Feb 8, 2018

Closing due to lack of feedback in several months.

@devyte devyte closed this as completed Feb 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants