Improve consistancy of some base codes #114

ben221199 · 2023-08-12T12:35:42Z

I know that many of you will go wild when seeing this pull request, but I think it could be useful. Also, I dare to do it, because base2 is a candidate and not a default.

The reason for this change is simple:

Base 2 (binary), having the digits 0 and 1. Taking the highest to use as code, 1. This is not the case, so lets change it.
Base 8 (octal), having the digits 0, 1, 2, 3, 4, 5, 6 and 7. Taking the higest, 7 to use as code. This is already the case.
Base 10 (decimal), having the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. Taking the higest, 9 to use as code. This is already the case.
Base 16 (hexadecimal), having the alphanumerals 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E and F. Taking the higest, F to use as code. This is already the case. Same for lowercase.

If people disagree on changing it, we can also add a new base2 record with 1 as code. Eventually, we later decide if we drop code 0 or give it another purpose. (For example, unary. I like that one.)

vmx

I would say base2 encoding is more of a theoretical exercise. I'd be OK with changing it. I don't think it's widely used. I also don't know if there's any good or historic reason to use 0. @Stebalien do you know?

Stebalien · 2023-08-14T15:08:47Z

I don't think there's a good reason one way or the other, no. And honestly, nobody should use these and I'd prefer to just remove them. But I'd have no objection to changing it.

ben221199 · 2023-08-14T15:39:12Z

I find them all useful, certainly because binary and octal are also used in programming.

vmx

At the last IPLD community/sync call we talked about this issue. The folks in the call see Base2 useful for educational purpose and agree that it should be made consistent with other base encodings.

@bumblefudge do you want this to get in before or after #109?

ben221199 · 2023-08-16T14:36:01Z

Did I saw some IANA considerations? That would indeed be nice.

vmx · 2023-08-16T15:08:33Z

Did I saw some IANA considerations? That would indeed be nice.

Yes. Though only the ones we currently have labeled as "default" will be moved over there (if it is accepted).

aschmahmann · 2023-08-16T16:16:24Z

@vmx @rvagg

base2 - Rod: educational/experimental, but not implemented by anyone

https://github.com/multiformats/go-multibase/blob/master/base2.go. So this is used by anyone that currently imports the base encodings from go-multibase.

Does this mean there are tons of users of ipfs.io, kubo, or others use base2? Are there lots of educational materials showing people the base2 multibase? Probably not, but I don't know not sure if anyone's done much investigating here.

For example: https://cloudflare-ipfs.com/ipfs/00000000101010101000000000000000101000001

Out of curiosity though. I feel like most times I've seen unary represented it's with a 1 rather than a 0 (e.g. https://en.wikipedia.org/wiki/Unary_numeral_system). So perhaps despite not fitting the pattern, even aside from the multibase history here, it makes sense for (0,1) to be binary and (1) to be unary with 0 as the binary prefix and 1 as the unary prefix.

ben221199 · 2023-08-16T17:29:28Z

Did I saw some IANA considerations? That would indeed be nice.

Yes. Though only the ones we currently have labeled as "default" will be moved over there (if it is accepted).

Agreed. Hopefully it will be accepted. The others could follow if becoming default.

ben221199 · 2023-08-16T18:01:24Z

@vmx @rvagg

base2 - Rod: educational/experimental, but not implemented by anyone

https://github.com/multiformats/go-multibase/blob/master/base2.go. So this is used by anyone that currently imports the base encodings from go-multibase.

Does this mean there are tons of users of ipfs.io, kubo, or others use base2? Are there lots of educational materials showing people the base2 multibase? Probably not, but I don't know not sure if anyone's done much investigating here.

For example: https://cloudflare-ipfs.com/ipfs/00000000101010101000000000000000101000001

I didn't investigate anything at the moment. I wasn't even aware of this Cloudflare URL either. If the amount of usage is very little, or above URL is the only real usage (which is just decoding, because last byte 01000001 = A), I think we just can make 0-code binary multibase invalid. The other option is keep 0, but add 1 with the same format (and then eventually invalidate 0 later). I agree on investigating that some little more.

Out of curiosity though. I feel like most times I've seen unary represented it's with a 1 rather than a 0 (e.g. https://en.wikipedia.org/wiki/Unary_numeral_system). So perhaps despite not fitting the pattern, even aside from the multibase history here, it makes sense for (0,1) to be binary and (1) to be unary with 0 as the binary prefix and 1 as the unary prefix.

Many cultures would agree, because | (capital I) and 𓏤 (Egyptian one), etc. are all strokes. Mathematically, however, using 1 seems incorrect to me. Also, https://en.wikipedia.org/wiki/Tally_marks#Writing_systems is using 0 here. Also came across https://math.stackexchange.com/questions/2157887/how-to-write-zero-in-the-unary-numeral-system.

If we all disagree on unary, we also can give prefix 0 a total other purpose. Random base, for example. 😳😂

aschmahmann · 2023-08-16T18:48:46Z

I mostly don't understand the why this is worth doing. Yeah, it arguably should've been 1 for binary, but someone chose 0 a while ago.

Does this matter?
It might actually be correct to give 1 unary and therefore 0 binary (i.e. they swap because unary is more commonly represented by 1s than 0s).

If people really want to break things because it's not nice and no one is using base2 in practice (I hope not), and it's not in use in tutorials/educational materials (I don't know, but could certainly see it happen) then 🤷, ok I guess. It seems bad practice to break things because they seem slightly off (e.g. https://en.wikipedia.org/wiki/HTTP_referer) especially if the alternative is potentially just worse (to introduce unary we now have to use 0 or choose another less obvious character at random).

That link was mostly just evidence that code that does base2 multibase decoding is/has been deployed. You can take any CID and chuck it in there and it'll work https://cloudflare-ipfs.com/ipfs/0000000010111000000010010001000001111111000011100101100010110011010100000000001101111001010101000000111110000100010000100110001011000010010111101010010001111011101011110010100111101101111001110111011010001100001111011001110010001101101101001101011010001111110001111011101111101110111100000.

vmx · 2023-08-17T08:29:20Z

I mostly don't understand the why this is worth doing. Yeah, it arguably should've been 1 for binary, but someone chose 0 a while ago.
1. Does this matter?

I don't have a strong opinion on it, but I think consistency within specs is always great. There are sometimes those little small details that don't match up in multiformats/ipld/ipfs specs that are only there for historic reasons as someone once made some decision before they had the full picture. I think we should be free to break those things more often. For me this is such a case.

bumblefudge · 2023-08-21T10:47:10Z

I mostly don't understand the why this is worth doing. Yeah, it arguably should've been 1 for binary, but someone chose 0 a while ago.

Does this matter?

It might, or it might not, depending on how important to you is the convention that each code is the "highest codepoint" allowed in that encoding alphabet, a convention that seems to have been a guiding principle at some point but never really enforced or codified. In the current spec, it's barely mentioned, and I've tried to make it more explicit but still nonbinding in my big PR that makes everything I can find more explicit to force differences of interpretation to the fore before taking things to IETF.

It might actually be correct to give 1 unary and therefore 0 binary (i.e. they swap because unary is more commonly represented by 1s than 0s).

This is actually a valid argument, IMHO, for not being consistent about the "highest codepoint" rule, since it seems to reflect ergonomic preferences of people who wade around in lower-level programming and raw bytes all day (this is not my personal experience so I defer to the experts here!). I think "whatever will be intuitive and ergonomic for the developers who will use that codec in that use case" definitely trumps the "highest codepoint" rule, if you think it's the case with binary and unary!

If people really want to break things because it's not nice and no one is using base2 in practice (I hope not), and it's not in use in tutorials/educational materials (I don't know, but could certainly see it happen) then 🤷, ok I guess. It seems bad practice to break things because they seem slightly off (e.g. https://en.wikipedia.org/wiki/HTTP_referer) especially if the alternative is potentially just worse (to introduce unary we now have to use 0 or choose another less obvious character at random).

That link was mostly just evidence that code that does base2 multibase decoding is/has been deployed. You can take any CID and chuck it in there and it'll work https://cloudflare-ipfs.com/ipfs/0000000010111000000010010001000001111111000011100101100010110011010100000000001101111001010101000000111110000100010000100110001011000010010111101010010001111011101011110010100111101101111001110111011010001100001111011001110010001101101101001101011010001111110001111011101111101110111100000.

Wow, this research is turning up so many implementation details I never knew about! It's super cool that Cloudflare's gateway already has this multibase at scale, I think asking them to break it in prod would be a huge ask. I'm calling this "strong argument #2" for leaving binary and unary as they are :D

rvagg

Here's another rabbit hole for you: https://github.com/libp2p/specs/blob/master/peer-ids/peer-ids.md#string-representation

I don't believe we can do this, and the README even states as much under "Reserved":

1 - Base58 encoded identity multihashes used by libp2p peer IDs.

Even in current libp2p code it makes these peer IDs: https://github.com/libp2p/go-libp2p/blob/37319a699f336e3e062c7894f44d47db3fea4dc2/core/peer/peer.go#L179-L190

If the key is <42 bytes it "inlines" it using identity multicodec (0x00) and then just does a straight base58btc encode of that string, which turns 0x00 into 1...

These strings are everywhere and it means there's code like this: https://github.com/libp2p/go-libp2p/blob/37319a699f336e3e062c7894f44d47db3fea4dc2/core/peer/peer.go#L132

And in the spec linked above it has to have this line:

If it starts with 1 or Qm, it's a bare base58btc encoded multihash. Decode it according to the base58btc algorithm.

So I think that's a decisive no for this PR, and the reservation of 1 has to stay alive for "historical reasons", just like the reservation of the null byte.

vmx · 2023-08-22T09:29:16Z

Thanks everyone who dug deeper into the matter. I only had a quick look. I'm now convinced that we just keep the base 2 encoding as it is.

ben221199 · 2023-08-22T10:00:54Z

I completely missed the reserved 1. 😅

ben221199 · 2023-08-23T07:27:51Z

Yeah, I think it should stay 0, because of 1 reserved for P2P. However, now we have a discussion to point to if it comes up somewhere in the future. Maybe it is possible to add the reserved codes (/, Q and 1) into the multibase list, but then with a description Reserved by ... or something. I think that it will likely prevent such confusion in the future.

rvagg · 2023-08-24T00:40:06Z

I think so @ben221199; but good news! With #109 merged, all 3 are now listed in the table as "reserved" and there's a section describing why.

ben221199 · 2023-08-24T07:29:50Z

I see a extra column with codepoint information. That is nice. The status reserved is exactly what I had in mind, but maybe the column name could be state instead of status. Also I don't know if having none as encoding name is the right way to do. I think I prefer it being empty. Also I think that a relevant description is more useful than (No base encoding).

ben221199 · 2023-08-24T07:32:21Z

Also, identity is removed for 0x00. Is something planned to add it back with another codepoint?

Change code of base2

bf05d86

vmx requested a review from rvagg August 14, 2023 10:09

vmx reviewed Aug 14, 2023

View reviewed changes

vmx approved these changes Aug 16, 2023

View reviewed changes

rvagg requested changes Aug 22, 2023

View reviewed changes

rvagg closed this Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve consistancy of some base codes #114

Improve consistancy of some base codes #114

ben221199 commented Aug 12, 2023 •

edited

Loading

vmx left a comment

Stebalien commented Aug 14, 2023

ben221199 commented Aug 14, 2023

vmx left a comment

ben221199 commented Aug 16, 2023

vmx commented Aug 16, 2023

aschmahmann commented Aug 16, 2023 •

edited

Loading

ben221199 commented Aug 16, 2023

ben221199 commented Aug 16, 2023

aschmahmann commented Aug 16, 2023

vmx commented Aug 17, 2023

bumblefudge commented Aug 21, 2023

rvagg left a comment

vmx commented Aug 22, 2023

ben221199 commented Aug 22, 2023

ben221199 commented Aug 23, 2023

rvagg commented Aug 24, 2023

ben221199 commented Aug 24, 2023

ben221199 commented Aug 24, 2023

Improve consistancy of some base codes #114

Improve consistancy of some base codes #114

Conversation

ben221199 commented Aug 12, 2023 • edited Loading

vmx left a comment

Choose a reason for hiding this comment

Stebalien commented Aug 14, 2023

ben221199 commented Aug 14, 2023

vmx left a comment

Choose a reason for hiding this comment

ben221199 commented Aug 16, 2023

vmx commented Aug 16, 2023

aschmahmann commented Aug 16, 2023 • edited Loading

ben221199 commented Aug 16, 2023

ben221199 commented Aug 16, 2023

aschmahmann commented Aug 16, 2023

vmx commented Aug 17, 2023

bumblefudge commented Aug 21, 2023

rvagg left a comment

Choose a reason for hiding this comment

vmx commented Aug 22, 2023

ben221199 commented Aug 22, 2023

ben221199 commented Aug 23, 2023

rvagg commented Aug 24, 2023

ben221199 commented Aug 24, 2023

ben221199 commented Aug 24, 2023

ben221199 commented Aug 12, 2023 •

edited

Loading

aschmahmann commented Aug 16, 2023 •

edited

Loading