-
-
Notifications
You must be signed in to change notification settings - Fork 9.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch LGPL'd chardet for MIT licensed charset_normalizer #5797
Conversation
That looks like an easy fix to a long-standing problem and licencing problem. It would be really great if this change gets merged and released. For the maintainers of requests - I know You have limited time for testing but If you think there is any need for extra testing, I am happy to help to get it approved. |
ea671c4
to
d8cf70e
Compare
Dear maintainers of requests library. Would it be possible to hear from you what you think about this idea? It's super-important for us - basically for the whole Apache Software Foundation but for Apache Airflow and Apache Liminal projects particularly. It would be great if we know whether this or similar change will be possible to accept by you or not and whether we should try some alternatives. Implementing non-LGPL bound dependency by This is just a kind request for feedbck and comments, rather than asking to merge it immediately, we just need to know what our options are. |
The library is under a rather restrictive (for the maintainers' own sanity) feature-freeze at the moment and unfortunately changing a dependency which can have a ripple-effect on the usage of the library is very unlikely to be accepted - regardless of licensing |
Is there anything we can help with to make this happen ? Maybe we can somehow step-in and help you with testing/changes and step-up gradually to become maintainers eventually? I think we could probably try to bring it to the attention of more experienced devs that could help with that? One alternative I can think of is to maintain fork of requests in the Apache-owned organization (and any of the library that uses that uses requests that we use) but obviously this is not something that we'd love - might be good for short-term, tactical approach but not really great for long-term maintenance/strategic solution. Is there any timeline for the feature freeze you mentioned ? |
Can we also know the reason for the feature-freeze, we like Ash & Jarek said would be happy to help in any testing and maintaining if that is one of the main concerns. This change should help most of the downstream projects which are used at lot of organization who have strict licensing policies. |
@sigmavirus24 I can appreciate that. Is this a feature freeze "for ever", or is there some planned time frame? |
Hi everyone, Like @sigmavirus24 said, this is highly unlikely to happen. The absence of any kind of answer is a good index. I worked in several companies that have a license check mechanism, sometimes used right sometimes not. In those companies, I proposed a really dirty way to fix this by using their internal proxy switching chardet to charset-normalizer via a hacky package. Not a good way to go, but worked. The only concern I would have regarding this PR is the deployment scale, although this alternative package is stable and has been used by many it is no match to requests usage. But we have to ponder this with the actual usage of the 'apparent_encoding' property. This concern can be eliminated with many solutions at the maintainers' disposal like using the 'pre-release' mark from Pypi, using a separate branch automatically synced with the master one, etc... If this project has any future major release I would vouch in favor of removing charset detection from it as it is not 'HTTP Client' related. Like httpx did sometime this year. Otherwise, technically, correct me if I am wrong, I feel confident to say that charset-normalizer is way more reliable than chardet. And can be proven easily. (ex. i. poor charset coverage, ii. sometimes return encoding that cannot decode given bytes) Maintainer @nateprewitt did already take a risk when bumping chardet from 3.x to 4.0 earlier this year. (In a matter of ~80 hours after the tag (4.0 chardet) was published to Github) So one could argue that such a change is not impossible. In case this PR would fail, I like the idea of @potiuk having a fork in the Apache org with this patch and synced with upstream at all times. Regards, |
@sigmavirus24 - what do you think? Any chance we can avoid forking? I think we will need to solve it before the next release of Airflow which is likely going to happen next week, so we do not have a lot of time. |
@potiuk This is unlikely to be accepted and released within a week. |
Within a week is tight, yeah 🙂 but even just a "yes, we are working to accept this" is probably enough for us to not have to fork it in the short term (which we'd clearly like to avoid.) |
Hey @sigmavirus24 . FYI: We have started to discuss this in LEGAL JIRA of the Apache Software Foundation. Seems that in the ASF we have now ~50 projects that are using requests library and the discussion came to the conclusion that basically we have to migrate out of the requests library because of the Here is the discussion: https://issues.apache.org/jira/browse/LEGAL-572 We are discussing about the solution. Forking requests and asking others to use the fork is one thing. One more thing that came out in the discussion is to propose you to donate What do you think? Would you be willing to have a second thought about getting rid of the LGPL dependency and merging the PR ? I think we are really close to start a bigger effort of not only converting 50 ASF projects but also asking a number of 3rd-party libraries to switch to our non-LGPL depending fork that we are planning to make. We really want to play it nicely, please don't treat it as a hostile move, but I think we have no other choice here. |
This started with the release of 2.0. The idea was to slow down development to a reasonable pace. Lots of features were being thrown over the wall at us and the library's reach was beginning to sprawl way too much. It's been a feature freeze for quite a few years.
To be clear, you're suggesting becoming maintainers to stop a feature freeze created to keep the surface area minimal, to keep churn down to provide a stable library, and not because we're inexperienced as developers or maintainers. I'm sure of course you weren't trying to call us inexperienced because we have a default policy of "No" that doesn't belong in Requests. That said, this appears to have been sent as a PR in sudden urgency that feels manufactured as chardet has been a dependency of this library almost as long as the library has existed. It was also created with the message of "We can do testing if needed" as in you did no testing expecting us to just merge it because the ASF needed it, which defeats the purpose of designing for stability.
So, The PSF is fairly established but doesn't force things.
As I said, I don't trust the PR. Y'all have given me 0 confidence that this actually doesn't break things for users. So that's a hard pass for me on merging this. Now that Pip is off of Python 2, however, I think a Requests 3 that's Python 3 only is well within sight and just dropping the character detection altogether is the right choice. I don't think it works well either way with any dependency in particular. As for when Requests 3 might ship, who knows. I don't particularly have a lot of time for that and neither does Nate. Also it probably won't be requests 3. But that's a separate thing altogether |
I am truly amazed by the situation. Lots of things learned in the past month. You do even realize that the frustration lived by the community is born out of those kinds of weird situations?
OpenSource can only grow and evolve by confronting each other's opinions, being as truthful as possible, only together that we will succeed. I am truly convinced that honesty is gold. Even if brutal and cold as long as said cordially.
Yes, maybe requests3 is meant to be created by others. Who knows.
That is AN opinion, feel free to check out actual facts. Finally, thank you for giving us a response even if negative. Others are still hoping. Hopefully, there is a better future for requests and for us all. |
Contrary to the story you seem to have constructed about my opinion (and only mine, I don't speak for Nate), I was keeping an open mind that someone would say "We did these tests to verify this library is easily swapped for chardet, we think this is safe as an alternative". The whole conversation hasn't been about backwards compatibility but instead about a license which has been present since before Requests 1.0.
chardet has had consistently top-notch quality releases. It's much like we upgrade urllib3 pretty quickly. I feel like I'm missing something in this conversation though since you seem very hung up on this. 4.0, to the best of my knowledge (which is likely incomplete) didn't cause any issues. Dropping Python 2 will get us onto the latest idna but once again, no one has done any testing to indicate that they've found it to have backwards compatibility. Like this PR, folks just send it and expect it to get merged or expect us to do that testing ourselves. If we were to write that kind of testing into our CI, we'd get low quality bug reports from linux distro maintainers about those tests talking to the open internet or even worse, failing as they package incompatible versions together. Just smashing merge doesn't save us any time and only irritates users who are broken by those changes.
I'm genuinely confused by this. Are you arguing that the automatic character detection works well? I have years of issues, stackoverflow questions, emails, and blog posts indicating that it's terrible for a great number of people. Maybe not 100% of users, but without any kind of telemetry I can only look at the data available to me and make the determination that users are liable to be less confused by Requests' behaviour if something like chardet wasn't used. That's also orthogonal to the only concern the ASF seems to have which is the license. |
Text is not a good form of exchange, with too much incomprehension.
I don't doubt your good intentions. What I am saying today is that something could have been handled in a better way. Many things you oppose to this PR are stated as things that could have been saying earlier. Not admitting that is not the way to go I think.
Comparing chardet to urllib3 is a bit of a stretch. Many open-source programs are well released. Ok for urllib3 but not chardet, I have studied the code and it does not answer the "top quality" safe content.
Future tense. Took the risk and waited. Could have been a disaster. As you said, I hung on that. The release calendar is abrupt. This is reasonable to raise the question.
I think one of the main issues is your feeling regarding those PRs. A contribution could be meet halfway, if you wish nothing to be done to your end, that okay. Be a mentor, guide them. Let them do the actual work, share your knowledge, expand your horizon with them. If you wish to be a solo reviewer/merger, that okay for everyone.
Everyone knows that and agrees with it. You saw in people's message only that. But let me assure you that it wasn't just about that.
No, never said that. Some solution brings more stability than other for sure. Chardet is far behind cchardet/uchardet, should be completely blind to not see it. I am as well placed as you to answer things on this matter.
You really need to make some more research. There is more data than "what you have". You would know it if you looked at what was made. Otherwise, I insist that if you opened up earlier. No one would have insisted for weeks...
Yes, this is true. But alternatives were not there until recently and would have been ridiculous to propose it sooner. Needed more time to gain maturity across usage. cchardet excluded due to binding/build constraint. Regards, |
@Ousret I think you've interpreted things and added context to my actions that simply isn't there and made assumptions that have led you to believe I've wasted time or disrespected others' time. You mention the CoC PR. I did all of that while on my phone picking up groceries or walking my dog because it was easy and didn't need nuanced communication. This issue, however, I have wanted to provide clear communication about and doing that from my phone is nearly impossible. I had time today at a computer, but genuinely most of my time at my computer these days is not spent on Open Source or even this website. What little time my personal computer is on it is in search of something I or my family needs or some other task that needs to be done and then I'm away from it rather immediately. You don't get to see that. This 'social' platform doesn't give you that context. Some things I can handle quickly without needing to communicate clearly, others not so much. Early on, I misread that folks were planning to do more testing. Only later on a re-read after the umpteenth direct ping about this did I realize I'd misread it and that was an afterthought on the author's part. Any contribution I've made to another project I've ensured was tested and I've tried my best to test. I've even reached out to the author's in advance of sending the PR to see if they have advice or if my tests would be sufficient.
"Far behind" and the three, from the last several times I tested it returned wildly different results. Swapping out dependencies that produce different results is not backwards compatible. But apparently I'm blind because I have constraints that those projects haven't bothered to research or consider or even talk to me about before declaring themselves the clearly superior project in all cases, especially this one. |
@sigmavirus24 I'm sorry for all the raised tempers that this PR has envoked, this was far from my intent. My understanding is then that the requests library is in critical fixes only mode, and that you don't view this as a critical fix, so I'll close this PR as it's clear you won't accept it because you don't want to break code for existing users, which is something I can entirely appreciate! If there is any chance you would accept this PR I will do the work to make it happen, but I don't think it'll happen. Please tell me if I'm mistaken. |
@ashb I'm sorry for the miscommunications and how little time I have to test this. One thing I can think of is that if the ASF can run a test against some of the most popular websites and determine for the ones that don't declare an encoding does this dependency and chardet agree? There are open lists of these websites, sadly I expect most of them to actually declare their encoding but it's worth a shot as a sample. This is one of the tests I've run in the past but neither have sponsored hardware to use nor time to orchestrate |
@sigmavirus24 No problem, I understand the pressures of being an opensource maintainer! Perhaps another test we can do is to capture some sites in various encodings and strip out the encoding headers/meta tags? |
Yes. The main thing we need is reasonable confidence this is mostly backwards compatible (I don't expect full backwards compatibility because I know chardet's algorithm is far from perfect and I know the alternatives are also far from perfect). I do except something greater than 75% compatibility roughly. The last time I tested things the compatibility was roughly 60% even accounting for the fact that chardet reports some encodings differently (because the models are a little more dated). Sadly when Rackspace killed their F/OSS credit program I lost that data because I hadn't backed it up elsewhere |
This is great that we can discuss now how we can help with testing :). I would be very happy to help. Especially that better part of yesterday and today I implemented and tested replacing requests with httpx for Apache Airflow (we need it in order to release it). Actually one other thing that the ASF projects can do - if we have alpha/beta release of |
I ran a quick test using both the Alexa top 50 domains and a list of 500 domains I grabbed from moz.com. The results were fairly positive, though it would be worth actually grabbing the full Alexa list for a few countries (US, JP, RU etc.) to repeat the test. Quick and dirty script is here: https://gist.github.com/da1910/79c168294a8dfe2957a8cbc61daa1710 The reported encoding, chardet's and charset-normalizer's determinations are included in the csv files, and the table below shows the results where the two packages differ. Alexa Top 50 (2 timed out, 12 reported no encoding, 3 had a different result):
Moz's top 500 domains (17 timed out, 92 reported no encoding, 11 had different results):
Note: photos1.blogger.com returned a gif for me, so it's most definitely not windows-1252... |
I will try to run more tests tomorrow with more sites. Just for @sigmavirus24 and others who read that - so that you was aware why it is important for us and what is the problem we are trying to address. Maybe you simply are not aware what is the extent of the problem. For Airflow - we are just about to release 2.1, and I just finished the PR (took me more than a day) where I removed requests as core dependency (basically we replaced But as PMC of Airflow we have no choice now - we are obliged to do it to release new version (now that we are aware of the problem with Luckily we could do it rather "quickly" because we've already split airflow into core and optional parts and this OK with the policy of ASF to have optional dependency on LGPL. But we cannot have mandatory one. But this does not even touch all the optional dependencies and transitive ones. We have 43 3rd-party packages that use requests (https://issues.apache.org/jira/browse/LEGAL-572?focusedCommentId=17341767&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17341767) where requests is used (GCP/Azure/Docker/Kubernetes - they are optional but it's hard to imagine anyone using Airflow withou tthose). Unfotunately some other projects of ASF are using Airflow as dependency (for example Apache Liminal and they require docker, Kubernetes optional parts of Airflfow. So they are in much worse situation - because for them pretty much a chain of dependencies would have to be updated. So for now Liminal is pretty much blocked. Those ~50 projects of ASF could either do the same as I just did, or test requests with |
Hi, First of all, I would like to thanks @sigmavirus24 Then @ashb The results are also fairly positives. The remaining only problem is that you missed filtering out the warning about "trying to detect from.." as did @da1910 Here are what I got from running I took into account that We have a firm 78 % backward compatibility exact results. Based on +400 files. And these numbers increase if we tolerate different encoding that produces equal Unicode output. Even more, if we tolerate minor differences.
There is the question of You may find my JSON outputs :
To retrieve your own outputs: Run Regards, |
@potiuk Requests 2.26.0 should now be generally available on PyPI with these changes. |
Thanks @nateprewitt. Already informed everyone! Thanks again for being responsive to the ASF needs ! |
Some test results are coming already with thumbs up! apache/trafficcontrol#6011 (comment) |
Following merging the psf/requests#5797 and requests 2.26.0 release without LGPL chardet dependency, we can now bring back http as pre-installed provider as it does not bring chardet automatically any more.
) Following merging the psf/requests#5797 and requests 2.26.0 release without LGPL chardet dependency, we can now bring back http as pre-installed provider as it does not bring chardet automatically any more.
Although using the (non-vendored) chardet library is fine for requests itself, but using a LGPL dependency the story is a lot less clear for downstream projects, particularly ones that might like to bundle requests (and thus chardet) in to a single binary -- think something similar to what docker-compose is doing. By including an LGPL'd module it is no longer clear if the resulting artefact must also be LGPL'd. By changing out this dependency for one under MIT we remove all license ambiguity. As an "escape hatch" I have made the code so that it will use chardet first if it is installed, but we no longer depend upon it directly, although there is a new extra added, `requests[lgpl]`. This should minimize the impact to users, and give them an escape hatch if charset_normalizer turns out to be not as good. (In my non-exhaustive tests it detects the same encoding as chartdet in every case I threw at it) Co-authored-by: Jarek Potiuk <jarek@potiuk.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Although using the (non-vendored) chardet library is fine for requests itself, but using a LGPL dependency the story is a lot less clear for downstream projects, particularly ones that might like to bundle requests (and thus chardet) in to a single binary -- think something similar to what docker-compose is doing. By including an LGPL'd module it is no longer clear if the resulting artefact must also be LGPL'd. By changing out this dependency for one under MIT we remove all license ambiguity. As an "escape hatch" I have made the code so that it will use chardet first if it is installed, but we no longer depend upon it directly, although there is a new extra added, `requests[lgpl]`. This should minimize the impact to users, and give them an escape hatch if charset_normalizer turns out to be not as good. (In my non-exhaustive tests it detects the same encoding as chartdet in every case I threw at it) Co-authored-by: Jarek Potiuk <jarek@potiuk.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Some of our unit tests that create mock HTTP responses were affected by this. Here's a minimal repro: from requests import models
import io
data = '{}' # Empty JSON body
resp = models.Response()
resp.raw = io.BytesIO(data.encode())
print(resp.json()) This used to work, but with the latest release raises the following error:
Setting |
Someone already reported this defect. jawah/charset_normalizer#58 Thanks for your vigilence, |
Although using the (non-vendored) chardet library is fine for requests itself, but using a LGPL dependency the story is a lot less clear for downstream projects, particularly ones that might like to bundle requests (and thus chardet) in to a single binary -- think something similar to what docker-compose is doing. By including an LGPL'd module it is no longer clear if the resulting artefact must also be LGPL'd. By changing out this dependency for one under MIT we remove all license ambiguity. As an "escape hatch" I have made the code so that it will use chardet first if it is installed, but we no longer depend upon it directly, although there is a new extra added, `requests[lgpl]`. This should minimize the impact to users, and give them an escape hatch if charset_normalizer turns out to be not as good. (In my non-exhaustive tests it detects the same encoding as chartdet in every case I threw at it) Co-authored-by: Jarek Potiuk <jarek@potiuk.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
…che#16974) Following merging the psf/requests#5797 and requests 2.26.0 release without LGPL chardet dependency, we can now bring back http as pre-installed provider as it does not bring chardet automatically any more.
…che#16974) Following merging the psf/requests#5797 and requests 2.26.0 release without LGPL chardet dependency, we can now bring back http as pre-installed provider as it does not bring chardet automatically any more. (cherry picked from commit c46e841)
…che#16974) Following merging the psf/requests#5797 and requests 2.26.0 release without LGPL chardet dependency, we can now bring back http as pre-installed provider as it does not bring chardet automatically any more. (cherry picked from commit c46e841)
) Following merging the psf/requests#5797 and requests 2.26.0 release without LGPL chardet dependency, we can now bring back http as pre-installed provider as it does not bring chardet automatically any more. (cherry picked from commit c46e841)
) Following merging the psf/requests#5797 and requests 2.26.0 release without LGPL chardet dependency, we can now bring back http as pre-installed provider as it does not bring chardet automatically any more. (cherry picked from commit c46e841)
At least for Python 3 -- charset_normalizer doesn't support Python2, so for that chardet is still used -- this means the "have chardet" path is also still tested.
Although using the (non-vendored) chardet library is fine for requests itself, but using a LGPL dependency the story is a lot less clear for downstream projects, particularly ones that might like to bundle requests (and thus chardet) in to a single binary -- think something similar to what docker-compose is doing. By including an LGPL'd module it is no longer clear if the resulting artefact must also be LGPL'd.
By changing out this dependency for one under MIT we remove all license ambiguity.
As an "escape hatch" I have made the code so that it will use chardet first if it is installed, but we no longer depend upon it directly, although there is a new extra added,
requests[lgpl]
. This should minimize the impact to users, and give them an escape hatch if charset_normalizer turns out to be not as good. (In my non-exhaustive tests it detects the same encoding as chartdet in every case I threw at it)I've read #4115, #3389, and chardet/chardet#36 (comment) so I'm aware of the history, but I hope that the approach in this PR will allow this to be merged, as right now, the Apache Software Foundation doesn't allow projects to depend upon LGPL'd code (this is something I'm trying to get changed, but it is a very slow process)