Replace hashed email with manifest-based offline check #31
An idea I had is to use a CLI tool to authenticate to GitHub and prompt to install the SponsorLink app, and then get a certificate from a backend service and put it in the certificate store. This way we address two of the biggest complaints with the existing approach:
This is what I believe the author should consider as well. The validation process must work in offline mode. If I haven't forgotten how asymmetric encryption works, achieving this should be possible.
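For illustration, a minimal sketch of that asymmetric idea, assuming a vendor-held RSA key pair (the license payload and names are hypothetical, not SponsorLink's actual scheme): the license is signed once with the private key, and the client verifies it entirely offline with the embedded public key.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

class OfflineLicenseDemo
{
    static void Main()
    {
        // Vendor side: sign the license once; the private key never
        // leaves the vendor's backend.
        using var vendorKey = RSA.Create(2048);
        byte[] license = Encoding.UTF8.GetBytes(
            "{\"sponsor\":\"alice\",\"expires\":\"2023-09-30\"}");
        byte[] signature = vendorKey.SignData(
            license, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);

        // Client side: ships only the public key and verifies fully offline.
        using var verifier = RSA.Create();
        verifier.ImportSubjectPublicKeyInfo(
            vendorKey.ExportSubjectPublicKeyInfo(), out _);
        bool valid = verifier.VerifyData(
            license, signature, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
        Console.WriteLine(valid); // True; any tampering flips this to False
    }
}
```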
If the premise is that users need the SponsorLink app installed on their dev boxes, perhaps that app could require the user to sign into GitHub (e.g. like the Copilot plugin does) and then it could query GitHub for the projects the user sponsors. That information could then be made available locally on the machine to the analyzer running on it. That way, there is no "behind the scenes" communication going on at build time: the SponsorLink app downloads and caches a list of sponsored projects to the user's machine. It could also enable users that are part of an organization that sponsors a project to be counted as sponsors, i.e. the organizations are the sponsors and not the individual devs working for the company. In many places, organizations are able to deduct sponsorships from taxes, whereas individuals are not, and would thus have to go through expensing this to their employer if the employer is to cover the sponsorship.
My thought was to put the certificate in the OS's certificate store, marking it non-exportable if possible.
This is of course not at all true. You can collect all the data you want in the EU as long as you just ask the user and the user gives consent. That is the entire problem with the current implementation. One or more emails are collected without asking permission first.
Read the source code: nothing happens if you don't install the GitHub sponsor app and give explicit consent, or if you are not using an IDE.
What about API keys? NuGet, Azure and all the others have API keys. I'm not sure how a hash of an email is somehow less private than an API key that directly maps to your email on another service. However, just in case, maybe people can just use a GitHub access token to read? So the token is the key? Then corporates can just set an envvar on CI/local and the build can do that?
You make two faulty assumptions.
The point is that you have to get consent from a user before collecting their info. A 3rd-party app or web site should not just go and collect your info without your consent, nor should it go and fish out API keys you may have access to without asking permission. For example, if SponsorLink uses OAuth2 to log in to GitHub, it can request the user's email address, and the user has to click the big green "Install and authorize" button. That is the user giving consent, and in that case, SponsorLink is not doing anything wrong.
If I read the explanation in the main readme, that does not seem to be true: check number 3 explicitly uses the email collected in 2 to send a remote request, all without any user consent.
@macco3k the SHA256 of the email. But point taken; that's what this issue is about. @mattleibow I was thinking of just depending on the GH CLI, so I don't have to deal with auth at all.
No licensing system sends you a certificate with a private key. They sign the license with their private key.
And obviously the private key material is stored somewhere, and an administrator can back up/export the private key.
Hashing does not anonymize data. A hash of PII should be treated as PII, as it effectively produces an easily-reproducible unique identifier. Email addresses are such a small domain of values that there is an infinitesimal chance of collisions, and one can produce the hashes for simpler email addresses (combinations of common names @ common domains) very rapidly. For those that require FIPS compliance in the US, this is not an approved use of any hash algorithm. Hash algorithms are only approved for verifying message integrity, in approved digital signature algorithms, and in approved key-derivation functions.

The current implementation is not sound from a technical standpoint. A license, digitally signed by you and verifiable at runtime, would be a better implementation. Though do allow for multiple types of algorithms, as nothing is universally acceptable. I'm not even going to get into the privacy laws you'll now need to comply with, nor the legal liability you'll be taking on with it being a paid product. Those are better discussed with your attorney, and not random people on the internet.

As a developer of some FOSS libraries that next to nobody uses, this move sucks. I don't monetize those projects, and consequently won't use any that require any monetary investment from me. As a manager of a development team, this would never pass muster, and would be an instant "pin/fork a version that doesn't require payment or else eliminate it as a dependency". And I say that not because we're cheap and don't want to pay for things; I say that because the implementation is poor, the value provided is minimal, and there is no option to sponsor at an organizational level to unburden my team. And perhaps most importantly, it makes me look like an idiot to my boss when we suddenly have an unplanned expense (no matter how small). For example:
While I am certain there are some organizations that prioritize getting their builds out with no delay or interference, mine is not one of them. If you add 1, 2, or even 10 seconds to my CI builds, you're wasting an almost-free amount of compute time. But if you add a millisecond to my developers' unit test times, we're going to break out the pitchforks and torches. Moq was already painfully slow, and only gets a pass because it was historically more convenient to write tests with than arranging an integration test. But with AI making integration tests far less tedious to write, and with them typically executing as fast or faster than Moq while providing much greater assurances of correctness with greater readability... Moq's current value lies almost entirely in not needing to rewrite existing tests. And while that is a tangible and measurable amount of value, it's not an intolerable one.

But the biggest issue, the underlying one that I haven't seen mentioned in this discussion, is the stain this places on the FOSS community in general. People want to monetize their work, and that is understandable. And while I don't know what the right way is, this is definitely one of the wrong ways. In this case, suddenly charging for something that was previously free is exactly what businesses that were wary of using FOSS were afraid would happen. There was so much FUD around FOSS in years past that it took ages to be allowed to use it in our code. Once we got past that (look, even Microsoft is doing open source now), it put developers in the driver's seat, and we became able to stop re-inventing the wheel and use existing packages because they saved us time. We didn't use free ones simply because we were cheap; we used them because it was the path of least resistance. They offer value to our productivity that doesn't require justifying a purchase order. Once money is involved, things get painful.

But today, FOSS has become yet another vector for malware. And where once we feared submarine patents or unfixed defects in abandoned projects, we're now being scalded by submarine licensing fees. We either need to admit we made a poor judgement call by asking the company to pay for it, or else pay for it out of our own pockets. It's a breach of trust that affects the whole community.
And now I present the most overkill solution to preserving anonymity: use a k-anonymity model with the repo name (this prevents giving out email digests willy-nilly; the digests are always scoped to the repo). The digest is calculated as H(sponsorEmail || sponsoredRepo), where H is the hash function. The API exposes an endpoint that takes the first 5 characters of the digest and returns the suffixes of matching digests.
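A minimal client-side sketch of that range query, assuming SHA-256 as H and a hypothetical range endpoint (the URL and plain-text response format are borrowed from the pwned-passwords API, not an actual SponsorLink service):

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

class KAnonymitySponsorCheck
{
    static async Task<bool> IsSponsorAsync(string email, string repo)
    {
        // Digest = H(sponsorEmail || sponsoredRepo), hex-encoded.
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(email + repo));
        string digest = Convert.ToHexString(hash);

        // Only the first 5 characters of the digest ever leave the machine.
        string prefix = digest[..5];
        using var http = new HttpClient();
        string body = await http.GetStringAsync(
            $"https://example.org/range/{prefix}"); // hypothetical endpoint

        // The server returns every suffix sharing that prefix; the full
        // match is established locally, so the server never learns which
        // digest (if any) the client was actually interested in.
        return body.Split('\n').Select(s => s.Trim())
            .Contains(digest[5..], StringComparer.OrdinalIgnoreCase);
    }
}
```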
There is no solution that prevents the storage of PII, given that what you want to do here is identify individual persons (sponsors). But you can at least stop users who have not agreed to share their email with you (by linking their account) from sending their PII to you.
IIRC, haveibeenpwned.com does something similar to this to avoid ever sending email/password hashes over the wire: https://www.troyhunt.com/understanding-have-i-been-pwneds-use-of-sha-1-and-k-anonymity/
I've been exploring creating a GH CLI extension (initial proof of concept here). It might be the easiest to integrate since it automatically conveys that you have to be signed in to the GH CLI to operate. The idea is that the CLI tool would let you sync a local manifest file. At (IDE-only, like today) build time, the analyzer would look for this file and verify your (local git configured) email against it, entirely offline, and at most (say) once a day. So the email would never leave your machine if you hadn't installed the GH app at all, and neither would it if you had. Since a calendar month is the cut-off for sponsorships (one-time or recurring), you'd need to download this file periodically, as it will therefore have an "expiration" built in. The file would need to contain sufficient information to:
So, based on @Perksey's suggestion, the algorithm that would also support org-wide sponsorships might look like the following:
Does that sound correct? From @DavidAtImpactCubed's link to Troy's most excellent blog post, I take it that this setup would be fast, PII-preserving and accurate, even if we switched to SHA-1? Thanks a lot for the feedback!
I would probably switch to SHA256 or SHA3-256 given the known cryptographic problems with SHA1, to cover your backside a bit, but otherwise sounds good to me.

Troy's fairly recent post on the topic makes a strong case for actually sticking to SHA1 though: https://www.troyhunt.com/understanding-have-i-been-pwneds-use-of-sha-1-and-k-anonymity/
On reflection, I'm not sure that k-anonymity (à la haveibeenpwned.com) is achievable in your use case. As I understand it, it relies on being able to segment the reference data set (in your case, sponsors) into groups which share a common hash prefix and which are large enough that the response contains many possible matches (the "k" in k-anonymity). In the case of haveibeenpwned, the responses contain a large number of matching hashes (see an example here: api.pwnedpasswords.com/range/11111). In your case:
The flip side of this is that if the number of sponsors is public, and is a short list, then why not respond with a list of hashes of all sponsors, and do all checks client-side? If your list of sponsors is not public, none of these approaches are valid.
@kzu The use cases are different. HIBP protects known-public data, so the SHA-1 just acts as a minor obfuscation as well as handling the compromised data in a uniform manner. I strongly discourage using a known-insecure hash for data that is not strictly public. @DavidAtImpactCubed As discussed, the fact that sponsors will have PII stored and exposed as pseudonymised is unavoidable; k-anonymity isn't being used to prevent this. This model is only used to prevent entire hashes of emails being sent to the server by those who have not agreed to it (those who have their data stored agree to it at the time of linking the GitHub app).
When you say "people" do you mean the people actively proposing solutions on this thread or the people just saying it's not done properly and not doing a single thing proactive to suggest a compliant solution to be implemented? |
Translation:
I'm sorry that GDPR's lack of wiggle room offends you. I've offered a compliant solution: Don't do any checking and just put in the message regardless. Guaranteed 100% compliant with GDPR. |
@cmjdiff you're not the only GDPR-aware developer here. I'm in the UK and I'm in the data business. I've done this before, as have countless others. Just because you don't know how it can be done doesn't mean no one else does. Maybe @kzu knows how it's done, maybe he doesn't. Advise him on how it can be done, then. Not how it's impossible. We've all given him a list of "One Million Things That You Can't Do Under GDPR". We've all criticised him to hell and back, because he has done it wrong. Hell, I've given 6 different metaphors to explain what's wrong. He gets it. He's trying to improve it. Give the man a break.
Thanks @harunlegoz for the suggestions. The auth I was talking about was not about getting (locally) the user's sponsors list. The tricky part is coming up with some sort of SponsorLink-backend-signed manifest, not something just done on the client side. Otherwise, anyone could generate a manifest, share it with the whole team, and there would be no way to assert a given manifest has been validated/signed by SL itself. So, gathering sponsors locally is already doable (it's what my PR to the other gh-sponsors repo does already), since the user running the GH CLI is already authenticated to GH. On the backend, similarly, the user has to install the GH App (via github.com), which grants the permissions to the SL backend to read the user's email(s). The missing piece is matching up the locally calculated sponsors with the backend-determined sponsors (this info would be shared by the sponsorable account when they set up the webhook to notify SL of active sponsors) for the "current user" (which is authenticated to the GH CLI but not to the SL endpoint). 🤔 Does it make sense?
@kzu, ah, I see. Okay, now I understand. Sorry for making the wrong assumption. Why not have the gh sponsors app run a background agent that exposes a localhost API (with CORS set to only accept localhost requests)? It could accept the package name and return true/false to the caller, never store the values anywhere other than memory (and maybe in a SecureString?), and do checks on behalf of the calling packages. If done right, it wouldn't take more than maybe a couple MB of memory, and literally no CPU usage. Again, I know I'm not well-versed in SponsorLink's internal workings and only know it by your descriptions, so apologies if I'm not hitting the target.
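For concreteness, a rough sketch of such an agent, assuming an arbitrary port and a made-up `/sponsors/check` route (this illustrates the commenter's suggestion only, not anything SponsorLink implements):

```csharp
using System;
using System.Net;
using System.Text;

class LocalSponsorAgent
{
    static void Main()
    {
        var listener = new HttpListener();
        // Bound to localhost only, so no remote host can ever reach it.
        listener.Prefixes.Add("http://localhost:49234/sponsors/");
        listener.Start();
        while (true)
        {
            // e.g. GET http://localhost:49234/sponsors/check?package=moq
            HttpListenerContext ctx = listener.GetContext();
            string package = ctx.Request.QueryString["package"] ?? "";
            byte[] reply = Encoding.UTF8.GetBytes(
                IsSponsored(package) ? "true" : "false");
            ctx.Response.ContentType = "text/plain";
            ctx.Response.OutputStream.Write(reply, 0, reply.Length);
            ctx.Response.Close();
        }
    }

    // Kept in memory only, as suggested; nothing is persisted to disk.
    static bool IsSponsored(string package) => false; // placeholder
}
```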
@Perksey GDPR rules are not intricate. They are very straightforward, and not aimed at lawyers; they are aimed at us, the processors of data. If you think the GDPR is intricate, oh man, just wait for user stories.
A number of solutions have been proposed. One should not need to propose a solution, because this is a solved problem, and has been for a very long time. If you want to get paid, charge for your product. That means changing the license it is published under, and using a license key. It doesn't require a rocket scientist, brain surgeon, or rocket surgeon. It may require a lawyer, because you no longer get that "get out of lawsuit free" card that FOSS licenses tend to give you. If you publish under a FOSS license and then waste a developer's time or a business's compute time, you are undermining FOSS. So in short, the proper way to handle this is:
Full stop.
Well, previously I worked at a company in the health sector which involved sending/transferring estimates that contained A LOT of very critical PII, not just about the patient but also about their diagnosis and such, so I know a thing or two about GDPR, since the company HAD to comply with it. But no, not a lawyer. What's your professional background on GDPR?
No, it wasn't. There was no mention of it until people started asking and all of that boiled up. And even that only happened because the pipelines failed after upgrading (for one, because of it being emitted as a warning, which fails builds on "Warnings as Errors" pipelines, and some others even threw exceptions during the build, preventing people from shipping stuff). In the worst case, no one would even have noticed the PII being leaked to kzu!
@harunlegoz that wouldn't be an improvement over just checking a local file, I think. If anything, it would look even MORE suspicious to have a running process/agent.
Awesome, finally someone understands what this is about, since that's precisely what is being proposed. The "literally anything else if they don't" part, in particular. If there's no SponsorLink file, do nothing, just ask for a donation. Now, if the user IS sponsoring, the analyzer should be able to do something with that information. It can be changing the diagnostics message (or removing it entirely), or even feature-toggling some IDE-only feature. That's the part that needs solving in order for SL to actually provide value beyond just showing a static diagnostics message begging for sponsorships.
I very much doubt that was the case. The analyzer NEVER ran in CI. You can see the extensive checks for editor-only conditions I've performed since day 1: https://github.com/devlooped/SponsorLink/blob/main/src/Package/SessionManager.cs#L51-L74
How well tested is that?
It ALSO has to be `IsEditor`
@kzu if you did want some more checks in this area, you can find a number of CI providers that are checked for as part of the Cake Build Orchestration system. You can find that code here: https://github.com/cake-build/cake/tree/develop/src/Cake.Common/Build Each one has a
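In that spirit, a hedged sketch of environment-variable-based CI detection (the variable names below are ones those CI services document; the helper itself is illustrative, not Cake's or SponsorLink's actual code):

```csharp
using System;

static class BuildEnvironment
{
    static bool Has(string name) =>
        !string.IsNullOrEmpty(Environment.GetEnvironmentVariable(name));

    // True when any of the well-known CI markers is present.
    public static bool IsCI =>
        Has("CI")                  // GitHub Actions, GitLab CI, Travis, ...
        || Has("TF_BUILD")         // Azure Pipelines
        || Has("TEAMCITY_VERSION") // TeamCity
        || Has("JENKINS_URL")      // Jenkins
        || Has("APPVEYOR");        // AppVeyor
}
```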
In addition to the hashing, which might still be concerning since knowing your sponsors' emails and sponsored accounts might still allow reconstructing the hash, adding a random, per-install GUID completely removes this possibility. The new Session handles these environment variables so we don't even incur any I/O down the road from the analyzer:

* SPONSORLINK_INSTALLATION: a GUID created if not already present (can be cleared by the user at any time to completely change all future hashes as needed), used for salting all hashes.
* SPONSORLINK_TOKEN: an access token used to invoke the SponsorLink API to sign the manifest hashes. This is done only to allow integrity verification at analyzer/check time.
* SPONSORLINK_MANIFEST: last synced manifest JWT to check for sponsorships.

Since the hashes are now effectively opaque to the server, all the server would do is sign the JWT received at the `/sign` endpoint with the corresponding private key; otherwise, the JWT remains intact (only the expiration date is also set server-side when signing).

Related to devlooped/SponsorLink#31
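A sketch of the salted hashing that commit describes; the exact concatenation order and the hex encoding are assumptions, not the shipped implementation:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class SaltedHash
{
    // The salt is the per-install SPONSORLINK_INSTALLATION GUID; since the
    // server never sees it, it cannot recompute hashes from known emails
    // and sponsored accounts.
    public static string Compute(string salt, string email, string sponsorable)
    {
        byte[] bytes = Encoding.UTF8.GetBytes($"{salt}|{email}|{sponsorable}");
        return Convert.ToHexString(SHA256.HashData(bytes));
    }
}
```

Usage might look like `SaltedHash.Compute(installationGuid, "dev@contoso.com", "devlooped")` (hypothetical values), with the result carried as a `"hash"` claim in the manifest JWT; clearing the envvar regenerates the GUID and invalidates every previously derived hash.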
Since we're moving to an entirely offline and manifest-based SL check, we don't need any of the complex stuff we had before. We just need a SINGLE endpoint to sign the JWT manifest sent by the gh-sponsors extension (see devlooped/gh-sponsors#8). Everything else is unnecessary now. The functionality is reduced but simplified significantly, with a massive improvement in PII/GDPR compliance. There is no longer any direct or indirect telemetry tracked, since the user must explicitly run a command and install an extension on their machine before getting a signed manifest. We also no longer provide a webhook for the GH apps, which will be deprecated too. Fixes #31
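A minimal sketch of what such a single signing endpoint could look like, assuming .NET minimal APIs and the System.IdentityModel.Tokens.Jwt package (authentication and key management are elided; none of this is the actual SponsorLink backend):

```csharp
using System.IdentityModel.Tokens.Jwt;
using System.Security.Cryptography;
using Microsoft.IdentityModel.Tokens;

var app = WebApplication.CreateBuilder(args).Build();

var rsa = RSA.Create(); // in practice, load the service's persisted private key
var credentials = new SigningCredentials(
    new RsaSecurityKey(rsa), SecurityAlgorithms.RsaSha256);
var handler = new JwtSecurityTokenHandler();

app.MapPost("/sign", async (HttpRequest request) =>
{
    // The CLI POSTs the unsigned manifest JWT as the request body.
    using var reader = new StreamReader(request.Body);
    var incoming = handler.ReadJwtToken(await reader.ReadToEndAsync());

    // Claims pass through untouched; only the expiration is set server-side,
    // covering the current calendar month (sponsorships expire monthly).
    var now = DateTime.UtcNow;
    var signed = new JwtSecurityToken(
        claims: incoming.Claims,
        notBefore: now,
        expires: new DateTime(now.Year, now.Month, 1).AddMonths(1),
        signingCredentials: credentials);

    return handler.WriteToken(signed);
});

app.Run();
```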
This was previously ignored from a glob too. Fixes #31
While I do agree with you that this project is a horrible idea, I don't think your GDPR claims are valid. In fact, I think the proposed solution in the OP is GDPR-compliant in principle. No personal data is processed unless the data subject consents to it for the sake of verifying they are a paid customer. The consequences of the refusal to provide consent are arguably minor: this is akin to a user buying an ad-free version of the open source software (what a dystopian sentence to say).
@Moxinilian consent is only a small part of GDPR. It doesn't matter if there is consent and the technical implementation is ironclad if there aren't well-documented processes for how the data is stored, extracted and deleted according to the tenets of transparency, portability and the right to be forgotten. In all fairness, GDPR also applies to a commercial license model where there are paying customers with persisted information somewhere. The administrative overhead must still be handled.
Yes, hence why I said it is compliant in principle. The rest is formalities.
It was brought to my attention that this wasn't sufficiently anonymizing: especially for corporations, the pattern for emails is not hard to probe if you have a list of emails from somewhere, by attempting to access the cloud URL for the sponsorship.
The current proposal (client-side CLI implementation PR) would work as follows:

1. User installs the `gh-sponsors` CLI extension (basically running `gh extension install devlooped/gh-sponsors`).
2. User runs `gh sponsors sync` (`sync` being the default command). On the first run, the tool explains again what's going to happen and performs the following steps:
   a. Creates a user envvar with a random GUID to use for salting all hashes
   b. Gets the user's active sponsored accounts
   c. Gets the user's verified emails
   d. Gets the user's organizations and their verified domain(s), as well as their sponsored accounts
   e. Hashes each user email from c) with each sponsored account from b) (salted with a)) and turns them into JWT claims ("hash"=hash)
   f. Hashes each verified org domain from d) with sponsored accounts too (salted with a) too), and turns them into JWT claims too
   g. POSTs this to a GitHub-authenticated SponsorLink endpoint that signs the JWT with the SL private key. All the endpoint validates is that the logged-in GitHub user (via Auth0) is the same as for the GH CLI.
   h. The backend responds with a signed JWT with an expiration that covers the current month (sponsorships expire at the end of each month).
   i. The token is saved to the envvar checked in 1)
3. On a subsequent build (a sketch of this check follows the list):
   - If the token is missing or expired, the user is prompted to run `gh sponsors` again.
   - The analyzer computes `hash(salt, email, sponsorable)` and tries to find that claim within the JWT (all local). It also does a fallback check for `hash(salt, domain(email), sponsorable)`, to support org sponsorships.
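A sketch of that subsequent-build check, reading everything from the envvars above (the claim name, hash layout and helper are assumptions; signature validation is shown separately below):

```csharp
using System;
using System.IdentityModel.Tokens.Jwt;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class OfflineSponsorCheck
{
    // Same assumed layout as the earlier salted-hash sketch.
    static string Hash(string salt, string value, string sponsorable) =>
        Convert.ToHexString(SHA256.HashData(
            Encoding.UTF8.GetBytes($"{salt}|{value}|{sponsorable}")));

    public static bool IsSponsoring(string email, string sponsorable)
    {
        var salt = Environment.GetEnvironmentVariable("SPONSORLINK_INSTALLATION");
        var manifest = Environment.GetEnvironmentVariable("SPONSORLINK_MANIFEST");
        if (salt is null || manifest is null)
            return false; // never synced: just show the donation request

        var jwt = new JwtSecurityTokenHandler().ReadJwtToken(manifest);
        if (jwt.ValidTo < DateTime.UtcNow)
            return false; // expired: suggest running `gh sponsors` again

        // Direct sponsorship first, then the org-wide fallback by domain.
        var domain = email[(email.IndexOf('@') + 1)..];
        return jwt.Claims.Any(c => c.Type == "hash" &&
            (c.Value == Hash(salt, email, sponsorable) ||
             c.Value == Hash(salt, domain, sponsorable)));
    }
}
```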
Of note: the goal is for integrators to just have a documented standard mechanism for verifying the JWT manifest token, even without any SL-provided code. But a simple loose-file "helper" should be provided for simplicity.
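For the integrator side, verifying the manifest's signature needs nothing SL-specific beyond a published public key; a sketch using a stock JWT library (key distribution and the validation parameters here are assumptions):

```csharp
using System;
using System.IdentityModel.Tokens.Jwt;
using System.Security.Cryptography;
using Microsoft.IdentityModel.Tokens;

static class ManifestVerifier
{
    public static bool TryValidate(string manifest, byte[] publicKeyDer)
    {
        // The public key would be published/embedded; only the backend
        // holds the matching private key used at /sign.
        var rsa = RSA.Create();
        rsa.ImportSubjectPublicKeyInfo(publicKeyDer, out _);
        try
        {
            new JwtSecurityTokenHandler().ValidateToken(manifest,
                new TokenValidationParameters
                {
                    IssuerSigningKey = new RsaSecurityKey(rsa),
                    ValidateIssuer = false,
                    ValidateAudience = false,
                    ValidateLifetime = true, // sponsorships expire monthly
                }, out _);
            return true;
        }
        catch (SecurityTokenException)
        {
            return false; // tampered, forged, or expired manifest
        }
    }
}
```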
GH CLI experience is similar to the following: