Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pointer's copy_from|copy_to|move_from|move_to are inherently broken. #4589

Closed
funny-falcon opened this issue Jun 18, 2017 · 27 comments · Fixed by #4960
Closed

Pointer's copy_from|copy_to|move_from|move_to are inherently broken. #4589

funny-falcon opened this issue Jun 18, 2017 · 27 comments · Fixed by #4960
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler topic:stdlib
Milestone

Comments

@funny-falcon
Copy link
Contributor

Following lines certainly lead to wrong behavior on 64bit platforms with large arrays cause of #to_u32.

Intrinsics.memcpy(self.as(Void*), source.as(Void*), (count * sizeof(T)).to_u32, 0_u32, false)

Intrinsics.memmove(self.as(Void*), source.as(Void*), (count * sizeof(T)).to_u32, 0_u32, false)

Intrinsics.memset(self.as(Void*), 0_u8, (count * sizeof(T)).to_u32, 0_u32, false)

BTW, why Crystal has neither size_t nor intptr_t equivalent? Some places uses UInt64, some uses UInt32 (like this one). It is just... infantile.

For example, Go defines int to be ssize_t and uint to be size_t, and has intptr and uintptr (though, looks like on all supported platforms they have same size to int).
Having platform dependent int has profit in low-level programming, and dis-convenient for application level programming, so it certainly should not be default type.
But still any place that deals with low-level memory handling should operate on platform dependent types/aliases instead of raw UInt32 and/or UInt64. This bogus lines are just tip of iceberg.

@refi64
Copy link
Contributor

refi64 commented Jun 18, 2017

I wouldn't really consider it inherently broken, since, if you're copying >2GB of data at once, your program is probably screwed up beyond belief anyway.

EDIT: lol I just randomly found this 5 years later; younger me was being extremely stupid here.

@refi64
Copy link
Contributor

refi64 commented Jun 18, 2017

Actually, no; it'd have to be four gigabytes of data in one copy in order to cause trouble.

@funny-falcon
Copy link
Contributor Author

If you think, real programs doesn't do such "crazy" things, you are not right.
It is broken. You may say "it is workable on useful sizes". But it is still broken on "acceptable" size.
Either error should be thrown, or behavior should be fixed.

@ysbaddaden
Copy link
Contributor

Crystal isn't C. Please keep this in mind and be nice when asking questions or reporting issues.

All sizes in Crystal (e.g. String, Array, StaticArray, Slice or Pointer) are clamped at Int32::MAX. This is by design and that won't change. We use a 32 bit number because larger values are considered broken (i.e. don't bloat memory, stream). We use a signed number because we don't want to leak unsigned integers that would lead to confusing behaviors (e.g. when operating on Slice#size). A maximum of 2GB is already plenty enough.

Now, since we clamp values, and users may try inconsiderate sizes, we should have bound checks and raise an ArgumentError when values are too large. Would you be willing to open a pull request, with passing specs?

Relates to #3209.

@RX14
Copy link
Member

RX14 commented Jun 18, 2017

@ysbaddaden I think that String and Array should use Int32 but Slice should have a 64-bit size. Slice is more low level than the other types and isn't resizable so it makes sense for it to support larger sizes: it's a "window" on some memory, you don't always control the memory or it's size. Pointer obviously supports 64bit already.

@ysbaddaden
Copy link
Contributor

Reading pointer.cr you're right. It does support 64 bit addresses, but it's always 64 bit (even on 32 bit platforms?), it sometimes casts sizes to UInt64 (malloc, realloc) but to UInt32 in other places (clear, copy, move). This seems illogical and incompatible from a quick look.

@RX14 that would defeat the purpose: if I can allocate a Slice with Int64 size, while should String or Array be limited?

@funny-falcon
Copy link
Contributor Author

@RX14, @ysbaddaden unfortunately mentioned lines are from pointer.cr , so it is Pointer broken.
At least you should fix this lines of Pointer.

We use a 32 bit number because larger values are considered broken (i.e. don't bloat memory, stream).
A maximum of 2GB is already plenty enough.

I'm not against using Int32 as base type for size of all standard-library containers. That is really ok.
But language that claims it suitable for some low-level programming have to have convenient types for.
Lets have Int32 for container's size and IntSize for allocation and IntPtr for pointer arithmetic.
I don't think It is ok to use UInt64 in many places if your platform is 32bit.

All sizes in Crystal (e.g. String, Array, StaticArray, Slice or Pointer) are clamped at Int32::MAX

Sizes, but not memory size of allocated pointer. How much will consume a = Array({Float64, Float64, Float64, Float64}).new(1<<29 - 1)? It will allocate 1<<34 bytes. Yes, it will be correctly allocated now cause in appropriate places you use to_u64 (hope it never breaks on 32bit platforms). But a.unshift 1.0 is currently broken for such array.

@RX14
Copy link
Member

RX14 commented Jun 19, 2017

@ysbaddaden because array uses realloc while slice doesn't. Array owns the memory it allocates and uses realloc assuming the memory is small. String because 2gb strings you've gone too far. Whereas slice represents a pointer and a size. A window on memory. It makes no assumptions about what's in that memory it just let's you access it. That memory could come from a C program which allocates more than 2gb of memory. A slice could be used to represent a mmap of a file on disk. These are perfectly valid usecases for a slice larger than 2gb.

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Jun 19, 2017

@RX14 This is a valid claim, but Crystal doesn't aim for this, and we shouldn't care about potential large allocations C. If we assume Int32 is plaintly enough, then Slice must follow. Developers are discouraged to use Pointer and to use Slice instead. If Slice.new(5.gibabytes) is allowed, we'll get complains that making a String from a large slice is broken... and so on.

We're already kinda forced to have Pointer handle large sizes, due to Int32::MAX * sizeof(T) being possible —though maybe we should just raise, because nobody should do that; I mean with T=Int32, that's 8GB, with T=Int64, that's 16GB... seriously?

@funny-falcon
Copy link
Contributor Author

Either you should raise, or just fix move_from, copy_from and clear.
I agree with you intention to simplify application programming.
But since language provides low-level features, they should just work.

Lets be clear: Slice, Array, String - they all are part of Standard Library, cause they could be implemented without intrisicts/primitives.

Pointer is a part of language, cause it is known by compiler and implemented with intrisicts.

@akzhan
Copy link
Contributor

akzhan commented Jun 19, 2017

Reading very large files is a common task even for applications with 32-bit addressing. These are tools such as grep, diff, databases, video processing and much more. In this case, the Flle MMAP is fine and yes, it requires 64-bit addressing in files and memory.

@zatherz
Copy link
Contributor

zatherz commented Jun 23, 2017

Disappointed with these decisions. "Crystal is not C" isn't a valid excuse for not making a low level feature work properly.

akzhan added a commit to akzhan/crystal that referenced this issue Jun 23, 2017
@oprypin
Copy link
Member

oprypin commented Jun 23, 2017

I hope that these are just leftovers from the past. There certainly has been an effort to clean up these "32isms". Also note that LibC::SizeT etc. do already exist.
#823 4d35089

And yeah, some comments here have been worrying. It's never OK to have randomly breaking code when someone happens to allocate a memory region over 2GB.

akzhan added a commit to akzhan/crystal that referenced this issue Jun 23, 2017
akzhan added a commit to akzhan/crystal that referenced this issue Jun 23, 2017
akzhan added a commit to akzhan/crystal that referenced this issue Jun 26, 2017
…g two pointers.

ptrdiff_t is used for pointer arithmetic and array indexing, if negative values are possible.

Refs crystal-lang#823, crystal-lang#4589.
akzhan added a commit to akzhan/crystal that referenced this issue Jun 26, 2017
…g two pointers.

ptrdiff_t is used for pointer arithmetic and array indexing, if negative values are possible.

Refs crystal-lang#823, crystal-lang#4589.
@asterite
Copy link
Member

All of this should be fixed, but it's hard. The main issue is that if Pointer#address changes of type depending on the architecture, then some methods might break. For example maybe a method accepts UInt32 but Pointer#address is sometimes UInt32, sometimes UInt64. I don't see an easy way to fix this. But having Pointer#address be UInt64 all the time seems good, because UInt32 fits in UInt64.

Those to_u32 calls are wrong, they are probably there from before Instrinsics.memcpy was updated to provide a 64-bit alternative.

As @oprypin says, most of these things are left-over from the past. We should fix them slowly, though I don't know what maximum sizes should Slice, String, Array, Hash, etc. have. Maybe Slice can have a size of Int64. When trying to create a big string from it we can simply raise an exception.

@zatherz
Copy link
Contributor

zatherz commented Sep 11, 2017

There's an easy fix for this, it's making a new numeric class like PointerAddress that uses 32 or 64 bit ints under the hood.

@akzhan
Copy link
Contributor

akzhan commented Sep 12, 2017

@zatherz As we pointed - there is LibC::SizeT and we need simply LibC::PtrDiffT too. But it need some time to rework things inside the compiler (#4624 (review))

@asterite
Copy link
Member

So, what's the real issue here? What code breaks?

I think we can:

  • Keep Pointer#address as UInt64 (in 32 bits systems, this will just be a value less than or equal to UInt32::MAX)
  • Fix Pointer#copy_from, etc., to not cast some values with to_u32
  • Maybe make Slice have a size that's UInt64, and also update String to check that the slice given to it in the constructor has a size less than some maximum allowed.

Is there any other thing that needs a fix?

@funny-falcon
Copy link
Contributor Author

So, what's the real issue here? What code breaks?

$ # still correct, cause sizeof(array) = 2 GB
$ crystal eval 'a=Array(Int64).new((1<<28)-1){|i| i.to_i64}; p a.reduce{|s,v|s+v}; a.unshift(1_i64); p a.reduce{|s,v|s+v}'
36028796616310785
36028796616310786
$ # wrong, cause sizeof(array) = 4 GB (but number of elements = 512M)
$ crystal eval 'a=Array(Int64).new((1<<29)-1){|i| i.to_i64}; p a.reduce{|s,v|s+v}; a.unshift(1_i64); p a.reduce{|s,v|s+v}'
144115187270549505
144115186733678596
$ # segfault?
$ crystal eval 'a=Array(Int64).new(1<<28){|i| i.to_i64}; p a.reduce{|s,v|s+v}; a.unshift(1_i64); p a.reduce{|s,v|s+v}'
36028796884746240
Invalid memory access (signal 11) at address 0x20000000
[0x55cc41f78e55] *CallStack::print_backtrace:Int32 +117
[0x55cc41f6d71d] __crystal_sigfault_handler +61
[0x7f972e499670] ???
[0x55cc41fb505e] *Pointer(Int64) +14
[0x55cc41fb4ecb] *Array(Int64) +267
[0x55cc41fb4db2] *Array(Int64) +18
[0x55cc41f5e0cf] ???
[0x55cc41f6d619] main +41
[0x7f972dabe3f1] __libc_start_main +241
[0x55cc41f5d76a] _start +42
[0x0] ???

@asterite
Copy link
Member

We can probably limit the size of an array. Why do you want an array that big? What's a use case for that?

@funny-falcon
Copy link
Contributor Author

@asterite you asked wrong question.
Whenever one write program that accepts arbitrary data, there is a chance to have enough data to trigger limit.
512M elements is not too big array. I had to deal with such arrays at least twice solving real world issues. (I didn't use crystal for solving them). I don't remember exact tasks. Something about collecting and aggregating data.

What will change if it will be not Int64, but Tuple{Int64, Int64, Int32, String} ? Array size will be limited by (1<<32)/sizeof(Tuple{Int64, Int64, Int32, String}) ?

2G elements is annoying, but acceptable limit. If there will be stricter limit, Crystal will loose a lot of usability.

And, any way, Pointer is not container. Pointer shall not have any restrictions beside hardware/operating system. Otherwise Crystal is just a toy.

@asterite
Copy link
Member

Java's List#size is int: https://docs.oracle.com/javase/7/docs/api/java/util/List.html#size()

So I guess Java is a toy?

@asterite
Copy link
Member

I understand now. We can have Int32::MAX elements, but it still behaves wrong. The problem is that I can't seem to understand the issue from those one-liners. Would you care to explain what's wrong with Array implementation that makes it work wrong, and why my comments here ( #4589 (comment) ) wouldn't fix that issue?

@funny-falcon
Copy link
Contributor Author

Would you care to explain what's wrong with Array implementation that makes it work wrong

I think, problem is on Pointer side.

why my comments here ( #4589 (comment) ) wouldn't fix that issue?

I think, it will fix. (Note, if flag? in intrinsics.cr should check aarch64 as well).

Part of discussion is about: isn't it better to use SizeT and IntPtrT in such places?

@RX14
Copy link
Member

RX14 commented Sep 12, 2017

@funny-falcon adding type aliases for integers which will probably change between platforms probably isn't something we want to do, as unlike C our integers don't automatically convert (a good thing). We already have that issue with some C types (for example, the bigint PR that broke on 32bit), lets not spread it around the stdlib too.

I agree with @asterite's suggestion of what to do entirely.

@asterite
Copy link
Member

I'll send a PR to fix all of this later today.

@funny-falcon
Copy link
Contributor Author

@RX14
bigint PR is a bad example: it was broken exactly cause every one forget about distinction between types on different platforms, and how libgmp uses this type.
Now there is a risk to forget, how external libraries uses intptr_t, uintptr_t, ptrdiff_t. So, not using native platform type increases risk of bugs, not decrease, as you assume.

@asterite
Copy link
Member

Sent PR: #4960

@asterite asterite added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler topic:stdlib labels Sep 13, 2017
@asterite asterite added this to the Next milestone Sep 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:compiler topic:stdlib
Projects
None yet
8 participants