get_transcript not working #117

Closed
salonygupta76 opened this issue May 31, 2021 · 29 comments

@salonygupta76

salonygupta76 commented May 31, 2021

Hi, I'm in a Linux environment and have verified the following before raising this issue:

  1. The version in use is youtube-transcript-api==0.4.1
  2. From my local Ubuntu 18.04.5 LTS, I'm able to get the desired response for https://www.youtube.com/watch?v=FStcqEIH9G0
  3. Hitting the API from the Ubuntu servers set up for my dev and prod environments returns no transcripts (see "Could not retrieve a transcript for the video" #74).
  4. I tried ping youtube.com from my dev and prod envs and both of them seem to default to an IPv6 address as can be seen in the screenshot:
    image

@jdepoix Any idea what could be the reason?

Thanks!

@jdepoix
Owner

jdepoix commented Jun 1, 2021

Hi @salonygupta76, could you maybe try running curl 'https://www.youtube.com/watch?v=FStcqEIH9G0' from your server, upload the resulting HTML somewhere and post the link here?
Also, since you posted that specific video, is the error only on that video or every video you try?

@salonygupta76
Author

salonygupta76 commented Jun 1, 2021

@jdepoix No, I get the issue with every video I try that has a transcript available. Here's the result of running the above command: link

@jdepoix
Owner

jdepoix commented Jun 1, 2021

@salonygupta76 are you sure you pulled that html from the same host the module was failing on? The html you uploaded seems just fine and I have no problem extracting transcripts from it.

@salonygupta76
Author

I've exposed this service as an API in the dev environment and I'm querying it using Postman from my local system. The HTML I shared comes from ssh'ing into dev and running the curl command you posted.

@jdepoix
Owner

jdepoix commented Jun 1, 2021

Just so that I am understanding you correctly: curling YouTube from your local machine returns the same html as curling YouTube from your server, yet this module works on your machine, but not on your server, right? That seems really odd! What python version are you using btw? And could you please post the exact error message which is returned by this module.

@vanyamlb

@jdepoix

Hello, I don't know if this is the right place to post; if not, I am very sorry, maybe I should have created a separate issue. I have problems with your tool when I run it from my command line from EU countries. When I run it on sites like replit or pythonanywhere, even with an EU country's VPN on, it works fine (I suppose the requests are sent from their IPs). When I use a VPN for a non-EU country, it works fine. When I (or my friends) run it from the EU, it doesn't work. Maybe some law has been applied there, because I have trouble with a YouTube tool for downloading live chats as well.

@jdepoix
Owner

jdepoix commented Jun 14, 2021

@vanyamlb could you please explain your infrastructure a bit more, I am not sure if I am understanding you correctly. Also, what do you mean by live chats? This module only supports transcripts. And what version of this module are you using?

@vanyamlb

@vanyamlb could you please explain your infrastructure a bit more, I am not sure if I am understanding you correctly. Also, what do you mean by live chats? This module only supports transcripts. And what version of this module are you using?

  1. Windows 10, 21H1, 64 bit. Python 3.9.1.
  2. I meant that I have another YouTube tool for live chats from another developer (not connected with this topic), but it also stops working in the same conditions.
  3. I live in Ukraine (so out of the EU sadly). Your tool (last version) and another tool for live chats from another developer work fine together, but once I turn EU country's VPN on - they stop working. When I change VPN to, for example, Israel - tools start working again.
  4. The same happens with my friends who actually live in those countries (for example, Germany and the UK).
  5. Thus, I came to the conclusion that there is some law about personal data maybe in the EU or something like that which changed the tools' behavior.

Date when the tools stopped working: the beginning of this year's April.

@vanyamlb

@jdepoix

Just checked everything again.

image

1 - under Italy's VPN or any other EU
2 - under Israel's VPN or just my own Ukrainian's IP (or any other out-of-EU)

@jdepoix
Owner

jdepoix commented Jun 15, 2021

Well, I live in Germany and I don't have a problem using this, so I'm pretty sure it's not a problem with EU law 😄
Sounds like it's a problem with your VPN to me. What happens if you just open youtube.com from one of those VPNs where it's not working?

@vanyamlb

vanyamlb commented Jun 15, 2021

@jdepoix the problem is that I am not alone: my friend from Germany has this issue too, as do friends from other EU countries :/ (without a VPN). If it only happened with a VPN, then sure, I would suspect that... Maybe it's the ISP blocking it, or YouTube...? I don't know the explanation... I checked YouTube with my VPN and it worked fine... But once more, when I use the same VPN app for a non-EU country, I get the tool working! :/

just wondering... could you please check one thing for me? my friend developed his own tool based on yours (it's called yxd and is installed through pip as well). with it, you can scan an entire channel to get transcripts using YouTube API key for video listing + your tool (by the way there was an issue about scanning the entire channel, you can tell that person she can use it). you just enter yxd, then enter your API, then enter yxd -c linktothechannel --first=10 and it starts downloading. Just interesting if it works for you living in Germany. Thanks in advance!

(if it says transcript unavailable while there is one, then it doesn't work, but if your tool works then it should lol)

@jdepoix
Owner

jdepoix commented Jun 15, 2021

I'm sorry @vanyamlb but I can't provide support for other modules. However, I might be able to help you if you upload the HTML you receive when accessing any given video (with subtitles) on youtube.com through curl or a browser.

@vanyamlb

@jdepoix with curl just got too many requests errors (429) with VPN :/ wondering why did they block the IP and how to avoid that..

@jdepoix
Owner

jdepoix commented Jun 15, 2021

@vanyamlb probably you're sharing an IP address with other users of that VPN and that IP has been blocked because of too many requests. Is there any way to change the IP address?

@vanyamlb

@jdepoix ok I realize that with VPN that's possible... but people who used this for themselves with their own IPs and got the block (while I used it so so much too and didn't get it :/)... Maybe they have static IPs while I have dynamic...
do you know any way to get your IP unlocked without changing IP address?

yes, when I changed, it started working, but idk how to change it within the country to check if that will help (but looks like it should)

@jdepoix
Owner

jdepoix commented Jun 15, 2021

Unfortunately, there is no way I know of to get around the block without changing the IP or simply waiting until the block gets removed. So there's not really anything you can do here.

I guess this issue lost track a bit. @salonygupta76 any news on your end?

@jdepoix
Owner

jdepoix commented Jun 25, 2021

@salonygupta76 any news? Otherwise I will close this issue.

@salonygupta76
Author

Hey @jdepoix Sorry about the delay in getting back. Unfortunately, at this point of time, I'm unsure what the root cause could be. For a video like this: url (where a transcript exists), sometimes the API simply throws this error instead of retrieving it:

Could not retrieve a transcript for the video https://www.youtube.com/watch?v=-em-_gFlDfQ! This is most likely caused by:

The video is no longer available

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem! - -em-_gFlDfQ

Note that this happens only in case of trying to access the code available on one of dev/prod environments (through an API) and not while testing on my local environment.

Also, possibly important: sometimes simply rebuilding the project from Jenkins resolves the problem.

@jdepoix
Owner

jdepoix commented Jun 25, 2021

@salonygupta76 where did you deploy your application? Maybe you are running into a problem similar to what @vanyamlb is describing?

@salonygupta76
Author

salonygupta76 commented Jun 30, 2021

@jdepoix
I'm deploying the code in Linux environments as a Flask application. I've been capturing the error tracebacks for a while now and some of them do look like the one @vanyamlb shared above. I'm even using proxies.

@jdepoix
Owner

jdepoix commented Jun 30, 2021

@salonygupta76 but what infrastructure are you hosting your application on? If it is a cloud provider like GCP, AWS etc. it is likely that you are sharing a public IP with other users and therefore are being blocked by YouTube. What happens when you curl https://www.youtube.com/watch?v=-em-_gFlDfQ from that environment as it is throwing that error? Do you get a 429 as well? Otherwise could you maybe upload the returned html so that I can have a look at it?

@salonygupta76
Author

@jdepoix Infra is AWS. Right now, the block has been lifted and I'm able to get results. Can share Curl response when it reverts to throwing errors.

@salonygupta76
Author

salonygupta76 commented Jul 12, 2021

Hey @jdepoix, I'm facing the issue yet again, and the error when running your code is mostly the "Video is not available..." one.

When I hit curl -L https://www.youtube.com/watch?v=-em-_gFlDfQ, I get the following as response:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type" content="text/html; charset=utf-8"><meta name="viewport" content="initial-scale=1"><title>https://www.youtube.com/watch?v=-em-_gFlDfQ</title></head>
<body style="font-family: arial, sans-serif; background-color: #fff; color: #000; padding:20px; font-size:18px;" onload="e=document.getElementById('captcha');if(e){e.focus();}">
<div style="max-width:400px;">
<hr noshade size="1" style="color:#ccc; background-color:#ccc;"><br>
<form id="captcha-form" action="index" method="post">
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
<script>var submitCallback = function(response) {document.getElementById('captcha-form').submit();};</script>
<div id="recaptcha" class="g-recaptcha" data-sitekey="6LfwuyUTAAAAAOAmoS0fdqijC2PbbdH4kjq62Y1b" data-callback="submitCallback" data-s="rnH4HjXBSULi9Z3KDh5b_oQ9pgFANNtapJKcTK-CjTXBg8Hqc9N8hByEEhbopeLD7xbVzfe7oU7OpTu2BP-qMb83fsobbLndnTRr7AeMtdfr4xMa_to3VWg8EcfI33aWd52OwNaJVeDnOCdveOlL-WN5BgA8hH-srYfpjrhxv10PbtXDkvAFHkspxsQ40iQm5wnjZjtABLJaV6Pulwc3FGYsbviqJYwUyBaobFE"></div>
<input type='hidden' name='q' value='EgQNOG6qGLnCsIcGIhBk7JGbIK-s219AkHLO2dTFMgFy'><input type="hidden" name="continue" value="https://www.youtube.com/watch?v=-em-_gFlDfQ">
</form>
<hr noshade size="1" style="color:#ccc; background-color:#ccc;">

<div style="font-size:13px;">
<b>About this page</b><br><br>

Our systems have detected unusual traffic from your computer network.  This page checks to see if it&#39;s really you sending the requests, and not a robot.  <a href="#" onclick="document.getElementById('infoDiv').style.display='block';">Why did this happen?</a><br><br>

<div id="infoDiv" style="display:none; background-color:#eee; padding:10px; margin:0 0 15px 0; line-height:1.4em;">
This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the <a href="//www.google.com/policies/terms/">Terms of Service</a>. The block will expire shortly after those requests stop.  In the meantime, solving the above CAPTCHA will let you continue to use our services.<br><br>This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests.  If you share your network connection, ask your administrator for help &mdash; a different computer using the same IP address may be responsible.  <a href="//support.google.com/websearch/answer/86640">Learn more</a><br><br>Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
</div>

IP address: DEV_IP_ADDRESS}<br>Time: 2021-07-12T11:02:18Z<br>URL: https://www.youtube.com/watch?v=-em-_gFlDfQ<br>
</div>
</div>
</body>
</html>

Is this the curl response you're looking for? There seems to be some request limit, which when surpassed throws this error.
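
A response like the one above can be told apart from a normal watch page with a simple string check. The following is only a minimal sketch: the function name looks_like_captcha_page and the choice of markers are assumptions for illustration, taken from the HTML posted above, and are not part of youtube-transcript-api.

```python
def looks_like_captcha_page(html: str) -> bool:
    """Heuristic: True if the HTML resembles Google's "unusual traffic" page.

    The markers are taken from the curl response posted in this thread;
    Google may change that page at any time, so treat this as a best guess.
    """
    markers = ('class="g-recaptcha"', 'id="captcha-form"')
    return any(marker in html for marker in markers)


# Shortened stand-ins for a blocked response and a regular page.
blocked = (
    '<form id="captcha-form" action="index" method="post">'
    '<div id="recaptcha" class="g-recaptcha"></div></form>'
)
normal = "<html><head><title>Some video - YouTube</title></head></html>"

print(looks_like_captcha_page(blocked))
print(looks_like_captcha_page(normal))
```

Running the check on the body returned by curl from the affected host would show whether a failure comes from an IP block rather than from the video itself.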

@jdepoix
Owner

jdepoix commented Jul 12, 2021

Hi @salonygupta76, thank you very much for the detailed information. That is exactly what I was looking for. Unfortunately, this confirms my assumption that you are being blocked by YouTube. The only ways to work around this are to:

  • Manually solve the captcha in a browser, then export the cookie and use it for future requests
  • Use a different IP address
  • Wait until the ban on your IP has been lifted

I am aware that none of these solutions are great, but it's all we can do unfortunately (at least afaik).
While this doesn't directly solve your problem, I could at least use this html to make sure a more suitable error is raised in this case. Maybe you could catch this error and implement a sleep, while hoping that the ban will be lifted in the meantime (unfortunately I haven't been able to figure out how long you'll have to wait until bans get lifted and I feel like it's not really consistent). If you want to implement a more sophisticated solution you could catch the error and trigger a change of IP address, which definitely will be the most reliable solution, but also the most expensive one.
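
The "catch the error and sleep" idea could be sketched roughly as follows. Everything here is an assumption for illustration: the helper name fetch_with_backoff, the retry count and the delays are made up, since nobody in this thread knows how long a ban lasts. The fetch callable is passed in so the sketch runs without network access.

```python
import time


def fetch_with_backoff(fetch, video_id, retries=4, base_delay=60.0, sleep=time.sleep):
    """Call fetch(video_id); on failure, wait and retry with doubling delays.

    Catching Exception is deliberately broad for this sketch. In real use you
    would narrow it to the exception your fetch function raises when blocked.
    """
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return fetch(video_id)
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            sleep(delay)
            delay *= 2


# Hypothetical stand-in for a fetch that is blocked twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(video_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("blocked")
    return [{"text": "hello", "start": 0.0, "duration": 1.0}]


waits = []  # record the delays instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, "-em-_gFlDfQ", base_delay=1.0, sleep=waits.append)
print(result, waits)
```

With the module itself, YouTubeTranscriptApi.get_transcript could be passed as the fetch argument. A hook that switches to a different IP address could take the place of the sleep call, matching the more expensive option described above.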

@salonygupta76
Author

salonygupta76 commented Jul 12, 2021 via email

@jdepoix
Owner

jdepoix commented Jul 12, 2021

@salonygupta76 Unfortunately, I don't know how long the "sleeping interval" would have to be to be sufficient. You'd have to play around with that. But if you do so, I would greatly appreciate if you could share your findings!

If you want to look deeper into the requests which are being sent, you'll have to check out the code and add some logs or run it in a debugger. More specifically, youtube_transcript_api._transcripts.TranscriptListFetcher._fetch_html is the method which does the actual request, so if you want to log the response, that's where you'll have to look.
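
One way to log what such a method returns without editing the installed package is to wrap it at runtime. This is a minimal sketch: log_return_value and FakeFetcher are hypothetical names, and FakeFetcher only stands in for the real class so the example runs on its own.

```python
import functools
import logging

logger = logging.getLogger("transcript_debug")


def log_return_value(cls, method_name, max_chars=500):
    """Replace cls.<method_name> with a wrapper that logs the start of its result."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        # Log only the first max_chars characters; watch pages are large.
        logger.debug("%s%r -> %s", method_name, args, str(result)[:max_chars])
        return result

    setattr(cls, method_name, wrapper)
    return original  # returned so the patch can be undone later


class FakeFetcher:
    """Hypothetical stand-in for the class that performs the HTTP request."""

    def _fetch_html(self, video_id):
        return "<html>captcha for " + video_id + "</html>"


logging.basicConfig(level=logging.DEBUG)
log_return_value(FakeFetcher, "_fetch_html")
html = FakeFetcher()._fetch_html("abc")
```

Applied to the real module, the call would presumably be log_return_value(TranscriptListFetcher, "_fetch_html") after importing the class from youtube_transcript_api._transcripts. That is a private API, so its name and signature may differ between versions.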

@AaditBhatia

I am using the statement transcripts=YouTubeTranscriptApi.get_transcripts((video_id)); in a for loop, however, there are a few videos which have their id disabled. Is there a success or failure call for YouTubeTranscriptApi.get_transcripts((video_id))

@jdepoix
Owner

jdepoix commented Apr 22, 2022

@AaditBhatia what do you mean by success/failure call? You can simply wrap your call in a try/except and ignore the exception.
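
A loop along those lines might look like the sketch below. The helper name collect_transcripts and the fake fetch function are assumptions for illustration. The fetch callable is injected so the example runs without the module installed.

```python
def collect_transcripts(video_ids, fetch):
    """Fetch a transcript per video id, recording failures instead of stopping."""
    transcripts, failures = {}, {}
    for video_id in video_ids:
        try:
            transcripts[video_id] = fetch(video_id)
        except Exception as error:  # broad on purpose; narrow it in real code
            failures[video_id] = error
    return transcripts, failures


# Hypothetical fetch function: one id behaves like a video without transcripts.
def fake_fetch(video_id):
    if video_id == "disabled":
        raise ValueError("transcripts are disabled for this video")
    return [{"text": "hi from " + video_id, "start": 0.0, "duration": 1.0}]


transcripts, failures = collect_transcripts(["a", "disabled", "b"], fake_fetch)
print(sorted(transcripts), sorted(failures))
```

Keeping the caught exceptions in a separate dict makes it possible to inspect afterwards why individual videos failed.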

@jdepoix
Owner

jdepoix commented Oct 7, 2022

I will close this issue now, as there isn't really much we can do here and the discussion went off the rails a bit.

@jdepoix jdepoix closed this as completed Oct 7, 2022