Improve consistency of some base codes #114

Closed

@ben221199 ben221199 commented Aug 12, 2023

I know that many of you will go wild when seeing this pull request, but I think it could be useful. Also, I dare to do it, because base2 is a candidate and not a default.

The reason for this change is simple:

  • Base 2 (binary) has the digits 0 and 1. The highest digit, 1, should be the code. This is not the case, so let's change it.
  • Base 8 (octal) has the digits 0, 1, 2, 3, 4, 5, 6 and 7. The highest digit, 7, is used as the code. This is already the case.
  • Base 10 (decimal) has the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. The highest digit, 9, is used as the code. This is already the case.
  • Base 16 (hexadecimal) has the alphanumerals 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E and F. The highest one, F, is used as the code. This is already the case. Same for lowercase.

If people disagree with changing it, we can also add a new base2 record with 1 as its code. We can then decide later whether to drop code 0 or give it another purpose. (For example, unary. I like that one.)
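
For anyone who wants to check the pattern in the list above mechanically, here is a minimal Python sketch. The names `BASES`, `follows_highest_digit_rule` and `report` are made up for illustration and are not part of any multibase library. The codes and alphabets are the ones listed above, with `base16upper` assumed as the name of the uppercase hex entry.

```python
# Prefix code and alphabet for each base discussed above.
# The table is an illustration, not a copy of the multibase registry.
BASES = {
    "base2": ("0", "01"),
    "base8": ("7", "01234567"),
    "base10": ("9", "0123456789"),
    "base16": ("f", "0123456789abcdef"),
    "base16upper": ("F", "0123456789ABCDEF"),
}


def follows_highest_digit_rule(code: str, alphabet: str) -> bool:
    """Return True if the prefix code is the highest character of the alphabet."""
    return code == max(alphabet)


def report() -> dict:
    """Map each base name to whether it follows the 'highest digit' pattern."""
    return {
        name: follows_highest_digit_rule(code, alphabet)
        for name, (code, alphabet) in BASES.items()
    }


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'consistent' if ok else 'inconsistent'}")
```

With this table, only base2 comes out as inconsistent, which is the case this pull request is about.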

@vmx vmx requested a review from rvagg August 14, 2023 10:09
Member

@vmx vmx left a comment

I would say base2 encoding is more of a theoretical exercise. I'd be OK with changing it. I don't think it's widely used. I also don't know if there's any good or historic reason to use 0. @Stebalien do you know?

@Stebalien
Member

I don't think there's a good reason one way or the other, no. And honestly, nobody should use these and I'd prefer to just remove them. But I'd have no objection to changing it.

@ben221199
Contributor Author

I find them all useful, especially because binary and octal are also used in programming.

Member

@vmx vmx left a comment

At the last IPLD community/sync call we talked about this issue. The folks on the call see base2 as useful for educational purposes and agree that it should be made consistent with the other base encodings.

@bumblefudge do you want this to get in before or after #109?

@ben221199
Contributor Author

Did I see some IANA considerations? That would indeed be nice.

@vmx
Member

vmx commented Aug 16, 2023

Did I see some IANA considerations? That would indeed be nice.

Yes. Though only the ones we currently have labeled as "default" will be moved over there (if it is accepted).

@aschmahmann

aschmahmann commented Aug 16, 2023

@vmx @rvagg

base2 - Rod: educational/experimental, but not implemented by anyone

https://github.com/multiformats/go-multibase/blob/master/base2.go. So this is used by anyone that currently imports the base encodings from go-multibase.

Does this mean there are tons of users of ipfs.io, kubo, or others that use base2? Are there lots of educational materials showing people the base2 multibase? Probably not, but I'm not sure anyone has done much investigating here.

For example: https://cloudflare-ipfs.com/ipfs/00000000101010101000000000000000101000001

Out of curiosity though. I feel like most times I've seen unary represented it's with a 1 rather than a 0 (e.g. https://en.wikipedia.org/wiki/Unary_numeral_system). So perhaps despite not fitting the pattern, even aside from the multibase history here, it makes sense for (0,1) to be binary and (1) to be unary with 0 as the binary prefix and 1 as the unary prefix.

@ben221199
Contributor Author

Did I see some IANA considerations? That would indeed be nice.

Yes. Though only the ones we currently have labeled as "default" will be moved over there (if it is accepted).

Agreed. Hopefully it will be accepted. The others could follow if they become default.

@ben221199
Contributor Author

@vmx @rvagg

base2 - Rod: educational/experimental, but not implemented by anyone

https://github.com/multiformats/go-multibase/blob/master/base2.go. So this is used by anyone that currently imports the base encodings from go-multibase.

Does this mean there are tons of users of ipfs.io, kubo, or others that use base2? Are there lots of educational materials showing people the base2 multibase? Probably not, but I'm not sure anyone has done much investigating here.

For example: https://cloudflare-ipfs.com/ipfs/00000000101010101000000000000000101000001

I haven't investigated anything yet, and I wasn't aware of this Cloudflare URL. If there is very little usage, or the URL above is the only real usage (which is just decoding, because the last byte 01000001 = A), I think we can simply make the 0-code binary multibase invalid. The other option is to keep 0 but add 1 with the same format (and then eventually invalidate 0 later). I agree that this needs a bit more investigation.
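
To make the "just decoding" point concrete, here is a rough Python sketch of base2 multibase with the current 0 code: a leading 0, then eight bits per byte. The function names and the five sample bytes are assumptions for illustration (the bytes were picked so that the last one is 01000001, i.e. A). This is not the go-multibase implementation, and I have not checked it against the gateway.

```python
BASE2_CODE = "0"  # current multibase prefix for base2, as discussed in this thread


def encode_base2(data: bytes) -> str:
    """Prefix code followed by 8 bits per byte, most significant bit first."""
    return BASE2_CODE + "".join(f"{byte:08b}" for byte in data)


def decode_base2(text: str) -> bytes:
    """Strip the prefix code and turn each group of 8 bits back into a byte."""
    if not text.startswith(BASE2_CODE):
        raise ValueError("not a base2 multibase string (expected leading '0')")
    bits = text[len(BASE2_CODE):]
    if len(bits) % 8 != 0:
        raise ValueError("bit string length is not a multiple of 8")
    if set(bits) - {"0", "1"}:
        raise ValueError("bit string contains characters other than 0 and 1")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    # Hypothetical sample payload; only the final byte (0x41, 'A') is taken
    # from the remark above.
    payload = bytes([0x01, 0x55, 0x00, 0x01, 0x41])
    encoded = encode_base2(payload)
    print(encoded)
    print(decode_base2(encoded))
```

Switching the code to 1 would only change `BASE2_CODE`, but every string already encoded with 0 would then fail the prefix check.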

Out of curiosity though. I feel like most times I've seen unary represented it's with a 1 rather than a 0 (e.g. https://en.wikipedia.org/wiki/Unary_numeral_system). So perhaps despite not fitting the pattern, even aside from the multibase history here, it makes sense for (0,1) to be binary and (1) to be unary with 0 as the binary prefix and 1 as the unary prefix.

Many cultures would agree, because | (capital I) and 𓏤 (Egyptian one), etc. are all strokes. Mathematically, however, using 1 seems incorrect to me. Also, https://en.wikipedia.org/wiki/Tally_marks#Writing_systems is using 0 here. Also came across https://math.stackexchange.com/questions/2157887/how-to-write-zero-in-the-unary-numeral-system.

If we all disagree on unary, we can also give prefix 0 a totally different purpose. Random base, for example. 😳😂

@aschmahmann

I mostly don't understand why this is worth doing. Yeah, it arguably should've been 1 for binary, but someone chose 0 a while ago.

  1. Does this matter?
  2. It might actually be correct to give 1 unary and therefore 0 binary (i.e. they swap because unary is more commonly represented by 1s than 0s).

If people really want to break things because it's not nice and no one is using base2 in practice (I hope not), and it's not in use in tutorials/educational materials (I don't know, but could certainly see it happen) then 🤷, ok I guess. It seems bad practice to break things because they seem slightly off (e.g. https://en.wikipedia.org/wiki/HTTP_referer) especially if the alternative is potentially just worse (to introduce unary we now have to use 0 or choose another less obvious character at random).

That link was mostly just evidence that code that does base2 multibase decoding is/has been deployed. You can take any CID and chuck it in there and it'll work https://cloudflare-ipfs.com/ipfs/0000000010111000000010010001000001111111000011100101100010110011010100000000001101111001010101000000111110000100010000100110001011000010010111101010010001111011101011110010100111101101111001110111011010001100001111011001110010001101101101001101011010001111110001111011101111101110111100000.

@vmx
Member

vmx commented Aug 17, 2023

I mostly don't understand why this is worth doing. Yeah, it arguably should've been 1 for binary, but someone chose 0 a while ago.

1. Does this matter?

I don't have a strong opinion on it, but I think consistency within specs is always great. There are sometimes small details in the multiformats/ipld/ipfs specs that don't match up and are only there for historic reasons, because someone once made a decision before they had the full picture. I think we should be free to break those things more often. For me this is such a case.

@bumblefudge
Contributor

I mostly don't understand why this is worth doing. Yeah, it arguably should've been 1 for binary, but someone chose 0 a while ago.

  1. Does this matter?

It might, or it might not, depending on how important it is to you that each code is the "highest codepoint" allowed in that encoding's alphabet. That convention seems to have been a guiding principle at some point but was never really enforced or codified. In the current spec it's barely mentioned, and I've tried to make it more explicit but still nonbinding in my big PR, which makes everything I can find more explicit to force differences of interpretation to the fore before taking things to IETF.

  2. It might actually be correct to give 1 unary and therefore 0 binary (i.e. they swap because unary is more commonly represented by 1s than 0s).

This is actually a valid argument, IMHO, for not being consistent about the "highest codepoint" rule, since it seems to reflect ergonomic preferences of people who wade around in lower-level programming and raw bytes all day (this is not my personal experience so I defer to the experts here!). I think "whatever will be intuitive and ergonomic for the developers who will use that codec in that use case" definitely trumps the "highest codepoint" rule, if you think it's the case with binary and unary!

If people really want to break things because it's not nice and no one is using base2 in practice (I hope not), and it's not in use in tutorials/educational materials (I don't know, but could certainly see it happen) then 🤷, ok I guess. It seems bad practice to break things because they seem slightly off (e.g. https://en.wikipedia.org/wiki/HTTP_referer) especially if the alternative is potentially just worse (to introduce unary we now have to use 0 or choose another less obvious character at random).

That link was mostly just evidence that code that does base2 multibase decoding is/has been deployed. You can take any CID and chuck it in there and it'll work https://cloudflare-ipfs.com/ipfs/0000000010111000000010010001000001111111000011100101100010110011010100000000001101111001010101000000111110000100010000100110001011000010010111101010010001111011101011110010100111101101111001110111011010001100001111011001110010001101101101001101011010001111110001111011101111101110111100000.

Wow, this research is turning up so many implementation details I never knew about! It's super cool that Cloudflare's gateway already has this multibase at scale, I think asking them to break it in prod would be a huge ask. I'm calling this "strong argument #2" for leaving binary and unary as they are :D

Member

@rvagg rvagg left a comment

Here's another rabbit hole for you: https://github.com/libp2p/specs/blob/master/peer-ids/peer-ids.md#string-representation

I don't believe we can do this, and the README even states as much under "Reserved":

  • 1 - Base58 encoded identity multihashes used by libp2p peer IDs.

Even in current libp2p code it makes these peer IDs: https://github.com/libp2p/go-libp2p/blob/37319a699f336e3e062c7894f44d47db3fea4dc2/core/peer/peer.go#L179-L190

If the key is <42 bytes it "inlines" it using identity multicodec (0x00) and then just does a straight base58btc encode of that string, which turns 0x00 into 1...

These strings are everywhere and it means there's code like this: https://github.com/libp2p/go-libp2p/blob/37319a699f336e3e062c7894f44d47db3fea4dc2/core/peer/peer.go#L132

And in the spec linked above it has to have this line:

If it starts with 1 or Qm, it's a bare base58btc encoded multihash. Decode it according to the base58btc algorithm.

So I think that's a decisive no for this PR, and the reservation of 1 has to stay alive for "historical reasons", just like the reservation of the null byte.
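
A small Python sketch of why those bare strings start with 1 or Qm. The base58btc encoder below is the usual big-integer version written out by hand, and `identity_multihash`, `sha256_multihash` and `is_bare_base58_multihash` are hypothetical helper names, not go-libp2p APIs. It assumes the multihash layout of code byte, length byte, then digest, with 0x00 for identity and 0x12 for sha2-256, and only handles lengths below 128 so that each varint fits in one byte.

```python
import hashlib

# base58btc alphabet: no 0, O, I or l
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58btc_encode(data: bytes) -> str:
    """Plain base58btc: big-endian integer conversion, leading zero bytes become '1'."""
    number = int.from_bytes(data, "big")
    out = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        out = B58_ALPHABET[remainder] + out
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + out


def identity_multihash(payload: bytes) -> bytes:
    """Assumed layout: 0x00 (identity), one-byte length, then the payload itself."""
    assert len(payload) < 128, "sketch only handles single-byte varint lengths"
    return bytes([0x00, len(payload)]) + payload


def sha256_multihash(payload: bytes) -> bytes:
    """Assumed layout: 0x12 (sha2-256), 0x20 (32 bytes), then the digest."""
    return bytes([0x12, 0x20]) + hashlib.sha256(payload).digest()


def is_bare_base58_multihash(peer_id: str) -> bool:
    """The rule quoted from the peer ID spec: starts with 1 or Qm."""
    return peer_id.startswith("1") or peer_id.startswith("Qm")


if __name__ == "__main__":
    small_key = b"example key bytes"  # made-up stand-in for a short public key
    print(base58btc_encode(identity_multihash(small_key)))  # starts with 1
    print(base58btc_encode(sha256_multihash(small_key)))    # starts with Qm
```

The leading 0x00 of the identity multihash is what becomes the leading 1, so a multibase code of 1 would be ambiguous with these existing strings.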

@vmx
Member

vmx commented Aug 22, 2023

Thanks to everyone who dug deeper into the matter. I only had a quick look. I'm now convinced that we should just keep the base2 encoding as it is.

@ben221199
Contributor Author

I completely missed the reserved 1. 😅

@rvagg rvagg closed this Aug 23, 2023
@ben221199
Contributor Author

Yeah, I think it should stay 0, because 1 is reserved for P2P. However, now we have a discussion to point to if it comes up again in the future. Maybe it is possible to add the reserved codes (/, Q and 1) to the multibase list, with a description like Reserved by ... or something. I think that would likely prevent such confusion in the future.

@rvagg
Member

rvagg commented Aug 24, 2023

I think so @ben221199; but good news! With #109 merged, all 3 are now listed in the table as "reserved" and there's a section describing why.

@ben221199
Contributor Author

I see an extra column with codepoint information. That is nice. The status reserved is exactly what I had in mind, but maybe the column name could be state instead of status. Also, I don't know if having none as the encoding name is the right way to do it. I think I prefer it being empty. Also, I think that a relevant description is more useful than (No base encoding).

@ben221199
Contributor Author

Also, identity is removed for 0x00. Is there a plan to add it back with another codepoint?
