-
Notifications
You must be signed in to change notification settings - Fork 410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Self-test failures when using non-default page/sector/block sizes. #77
Comments
Hi, Yes you're most certainly safe. These testcases are pretty complicated and Just to make sure, after each failed test the framework spits out "last
|
I get -10001 (ERR_FULL). However, these come out of the fs consistency check. Test case result for just '-f long_run' |
I did a bunch more runs with different combinations: All sector sizes from 4096 - 65536, block = sector and block = sector * 2, and page between 128 and 1024. Every single combination fails at least one test (except for 65536, 65536 * 2, 256). The output s=12 means sector_size=2^12, block=0 is block_size = sector_size * 2^0, page=10 is page_size=2^10 all-res/s=12,block=0,page=10:9 failed The various error codes from the various runs: all-res/s=12,block=0,page=10: spiffs errno:-10001 |
I'm not surprised. It is very hard finding test case params for suitable for all possible combinations of configs. E.g. the nodemcu config is something i wouldn't dreamed off. 😉 Cheers |
Okidoki, let's begin with the error codes: errors -10001 ( errors -10002 ( errors -10007 ( errors -10019 ( errors -10024 ( error -10072 ( So why the heck didn't I implement testcases that takes all possible configs? Well, because it is darn complicated. Depending on pagesize and blocksize, you get different amount of metadata, and different amount of effective data in the fs. Yes, this can be calculated and used in the testcases, but that would definitely make the cases pretty unreadable. Also depending on page- and blocksizes, some config switches simply won't work since there isn't one spare bit to store stuff (e.g. magic). E.g. all-res/s=16,block=1,page=9 with config magic off makes all testcases pass where 60 fails with magic on. Calculating beforehand what switches work with what configs would produce a pretty massive matrix full of maths. Of course it is possible to do all this, but as of now, I simply do not have the time to write that kind of rigid test framework. If someone pays me, I might (but just might) :) As for your original question - the test framework is stupid in a (or many) way(s). It outputs So, the original problem is you getting Different config, more metadata, less data than expected in the testcases - it won't fit. Bam. You're as safe as anyone running spiffs when all tests pass (da-da-daaaah). As an epilogue to all this rambling, if I were a NodeMCU maintainer, I would consider trying |
Nodemcu has two filesystem sizes -- a really small one in the range 32k to maybe 100k (this is when there is a 512k flash chip) and about a 3.5Mb file system when we have a 4MB cache. I guess that different parameters may be appropriate for each size. I'm going to dig into these errors more tonight and see if I can futz with the tests to get the nodemcu configurations to pass..... In particular, we run without MAGIC so I'll focus on those. Thanks for your comments above. The error -10019 all happen when the page size is set to 128 (and almost any config of block and sector). This may be too small to do anything decent. May be worth rejecting this size..... |
Ok, thanks for the input. Yes, having different configs for different sizes of flash is a good idea, but not mandatory of course. Good luck with the futzing, yell if you need help or find something. I'll check the 128 byte size here, I've been running a 128 byte page size spiffs on an ESP for some time now. It is odd, when defining test pagesize to128 some tests fail that specifically set the pagesize to 256. I smell a bug somewhere... |
The test SECTOR=4096, BLOCK=4096, PAGE=1024, MAGIC=0 fails in test write_small_files_chunks_1 with failed size, 0 != 512 I managed to get most of the tests to pass by reducing the size of the LARGE file to 1/12 of the flash size (from 1/3 of the flash size). In some cases, I can get away with the LARGE file at 1/6 of flash if I increase the FD_BUF_SIZE. |
I now understand the problem -- but the solution is not immediately obvious..... It turns out that when we want to write the LARGE file (as part of the test), essentially the entire file system consists of deleted pages. The GC is being asked for around 350k, and after five tries, it has only freed up 5 blocks. There are plenty of other blocks that could be freed, and the GC knows that they are there, but the logic only frees blocks once per "try". This is why, presumably, things work better with large block sizes. I could increase the number of tries for the GC (quite a lot), but that limit is probably there for a reason. The GC might be smarter and be able to erase more blocks in a single pass. I'm not sure what the right approach is. BLOCK=SECTOR=4096 PAGE=256 MAGIC=0 FD_BUF_SIZE=1024 Failing test: long_run_config_many_medium |
Ah yes, I tracked it down to the same. Good finding! Your reasoning is correct above. Two things: first, when writing the test cases I had 64k blocks in mind. Secondly, this influenced the cap to being 5, as a 64k block takes more time to erase than a 4k block, and I assumed that ~300k would be a "very large" file in an embedded system, if I remember correctly. The cap is there for safety. If there is a bug in the GC algo making it go into an eternal loop, we would wear the flash hard and never return. This is still unseen though. With this in mind, you could crank it up to something more valid for a 2 megabyte, 4k block sized system. I'd say 256 perhaps would be a better figure. It all depends how the system is used. I also took a review of the test bench and found some ugly bugs. Fix will come up in an hour or two. This fix will probably make you're exhaustive list look a lot better ;) Also had some thinking on the nodemcu config. With 4096 bytes blocks and 256 bytes pages, you get 16 pages per block. In each block, one page is stolen for the lookup table. Cranking up the block size to the double would give better usage of the spi flash I think. EDIT: Oh, and I am sorry to say, no - the GC is not smarter. Actually, in this sense it might be a bit stupid as it sometimes actually frees full block, yielding no free pages. The reason is that spiffs honor wear leveling before freeing space. Say you create a large file spanning many blocks. Then you create and delete a lot of smaller files during a longer period. All this time, the large file will hog its blocks, which will lead to more wear on the other blocks. Instead, spiffs recognizes this by cleaning blocks not being cleaned for a long time. The assumption is that by then, the file system will be a bit more fragmented, which will break up the huge hog of a file over the system, making wear leveling fair. |
Thanks for the test changes. I re-ran my battery of tests, and now 51 pass cleanly and 39 fail. I have put my code in pjsg/spiffs:tests -- there is a perl script at the top level which will run a battery of 90 configurations and put the results in all-gcf subdirectory.
then reveals all the failed configurations. For example, with configuration: Sector=4096, Block=8192, Page=1024, Magic=0
It looks as though all the page=1024 failed. A number failed with -10024 with magic=1. I suspect that this is due to an invalid combination of page/block/sector sizes. There are a few other failures that may be related to large blocks and small pages. |
I'm one of the nodemcu-firmware maintainers and we use spiffs as our filesystem. I wanted to upgrade to the latest version, but when I tried to run your self-tests with our page/sector/block parameters, I get self-test failures.
My diff against master is:
After doing make clean all test, I get:
I did try some other combinations of block/sector/page sizes, and I get other sets of tests failing. If I use the values from master, then all the tests pass.
Before I dig into this, is this something that you would expect with these values? Do you think it is more likely to be an issue with the test case?
Thanks
The text was updated successfully, but these errors were encountered: