Could not retrieve a transcript for the video #74

Closed
iercetin opened this issue Aug 30, 2020 · 37 comments
Labels
bug Something isn't working

Comments

@iercetin

Could not retrieve a transcript for the video https://www.youtube.com/watch?v=98TQv5IAtY8! This is most likely caused by: The video is no longer available

It works on my local computer (Windows 10), but when I try to use it on Ubuntu 20.04 (DigitalOcean Droplet) I get this error!
I assume the error is caused by the sender's IP address.

I got a similar problem using youtube-dl on my droplet, and when I tried using "--force-ipv4" with youtube-dl it worked. Is there a similar solution to this?

Code
from youtube_transcript_api import YouTubeTranscriptApi
YouTubeTranscriptApi.get_transcript("98TQv5IAtY8", languages=['en'])

@jdepoix
Owner

jdepoix commented Aug 31, 2020

Hi @iercetin
my first guess would have been that you are running into some kind of rate limit, because the IP address of your cloud machine is also used by others (see #60 for more on rate limits). However, the IPv6 thing is a really interesting clue!
As of right now, there's no functionality in this module to force IPv4 connections, but I could implement something like this if it turns out to be a common problem with YouTube. Unfortunately, I don't have a machine or setup where I can replicate this error, so I'll need your help debugging this.
Could you do an IPv4 and an IPv6 ping to YouTube from your local machine and from your cloud machine and see how that goes on each?

ping youtube.com
ping6 youtube.com

@iercetin
Author

iercetin commented Sep 1, 2020

Sure,

-- Windows 10 --

ping youtube.com

Pinging youtube.com [172.217.169.142] with 32 bytes of data:
Reply from 172.217.169.142: bytes=32 time=44ms TTL=115
Reply from 172.217.169.142: bytes=32 time=68ms TTL=115
Reply from 172.217.169.142: bytes=32 time=49ms TTL=115
Reply from 172.217.169.142: bytes=32 time=55ms TTL=115

Ping statistics for 172.217.169.142:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 44ms, Maximum = 68ms, Average = 54ms

-- Ubuntu (Digital Ocean Droplet) --

ping youtube.com
PING youtube.com(ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e)) 56 data bytes
64 bytes from ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e): icmp_seq=1 ttl=117 time=7.78 ms
64 bytes from ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e): icmp_seq=2 ttl=117 time=7.01 ms
64 bytes from ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e): icmp_seq=3 ttl=117 time=6.94 ms
64 bytes from ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e): icmp_seq=4 ttl=117 time=6.98 ms
64 bytes from ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e): icmp_seq=5 ttl=117 time=7.07 ms
64 bytes from ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e): icmp_seq=6 ttl=117 time=6.98 ms
--- youtube.com ping statistics ---
9 packets transmitted, 9 received, 0% packet loss, time 8012ms
rtt min/avg/max/mdev = 6.935/7.100/7.779/0.243 ms

ping6 youtube.com

PING youtube.com(ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e)) 56 data bytes
64 bytes from ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e): icmp_seq=1 ttl=117 time=7.83 ms
64 bytes from ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e): icmp_seq=2 ttl=117 time=7.07 ms
64 bytes from ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e): icmp_seq=3 ttl=117 time=7.07 ms
64 bytes from ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e): icmp_seq=4 ttl=117 time=7.24 ms
64 bytes from ams16s32-in-x0e.1e100.net (2a00:1450:400e:80c::200e): icmp_seq=5 ttl=117 time=6.99 ms

--- youtube.com ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4007ms
rtt min/avg/max/mdev = 6.990/7.240/7.831/0.305 ms

@jdepoix
Owner

jdepoix commented Sep 2, 2020

Thanks @iercetin
It definitely seems that your Ubuntu machine defaults to IPv6 when connecting to YouTube, which proves your point.
Could you try and get the video using curl on your Ubuntu machine, forcing it to use IPv4/6?

 curl -L -4 "http://youtube.com/watch?v=98TQv5IAtY8"
 curl -L -6 "http://youtube.com/watch?v=98TQv5IAtY8"
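
As a complement to the ping and curl checks, the address family a machine prefers for a host can also be inspected from Python. The following is a minimal sketch using only the standard library; the helper names `first_family` and `resolve_first_family` are made up for illustration and are not part of this module.

```python
import socket

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def first_family(addrinfo):
    """Return "IPv4" or "IPv6" for the first entry of a getaddrinfo() result."""
    if not addrinfo:
        return None
    family = addrinfo[0][0]  # each entry is (family, type, proto, canonname, sockaddr)
    return FAMILY_NAMES.get(family, str(family))


def resolve_first_family(host, port=443):
    """Resolve `host` and report which address family the OS lists first."""
    return first_family(socket.getaddrinfo(host, port, type=socket.SOCK_STREAM))


# Example (needs network access; the result depends on the machine's resolver setup):
# print(resolve_first_family("youtube.com"))
```

Most clients try the resolved addresses in the order returned, so a first entry of "IPv6" would be consistent with the ping output from the droplet above.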

@iercetin
Author

iercetin commented Sep 3, 2020

Sure,

curl -L -4 "http://youtube.com/watch?v=98TQv5IAtY8"

https://github.com/iercetin/testtest/blob/master/4.html

curl -L -6 "http://youtube.com/watch?v=98TQv5IAtY8"

https://github.com/iercetin/testtest/blob/master/6.html

@jdepoix
Owner

jdepoix commented Sep 5, 2020

@iercetin mh, that's interesting. It does in fact seem that the IPv6 request returns a different response than the IPv4 request, which supports your point even further.
However, could it be that you mixed up the 6.html and 4.html file names? From what I can tell, 6.html is the file which contains the information needed to extract the subtitles, while 4.html doesn't. If that were true, IPv6 would be the working request and not the other way around. Could you please double-check that?

@sdtblck

sdtblck commented Sep 6, 2020

I'm having the same problem - works fine locally (OS X) but on my server (Ubuntu 18.04.4 LTS (Bionic Beaver)) I get the same error as @iercetin

@jdepoix
Owner

jdepoix commented Sep 6, 2020

@sdtblck could you please execute those commands on your Ubuntu machine and upload the results:

 curl -L -4 "http://youtube.com/watch?v=98TQv5IAtY8"
 curl -L -6 "http://youtube.com/watch?v=98TQv5IAtY8"

@jdepoix
Owner

jdepoix commented Sep 24, 2020

@iercetin @sdtblck any news on this?

@adongu

adongu commented Sep 30, 2020

Hi @jdepoix, I'm using this on GCP Cloud Functions and I think I'm facing a similar issue. Would I have to route all the egress traffic through an IPv4 VPC network with a static IP to test whether IPv4 connections would help with the rate limiting on the shared machine?

@adongu

adongu commented Sep 30, 2020

> Hi @jdepoix, I'm using this on GCP Cloud Functions and I think I'm facing a similar issue. Would I have to route all the egress traffic through an IPv4 VPC network with a static IP to test whether IPv4 connections would help with the rate limiting on the shared machine?

I was able to get the API to work again on GCP Cloud functions now with a static IP following a GCP guide! https://dev.to/alvardev/gcp-cloud-functions-with-a-static-ip-3fe9

@jdepoix
Owner

jdepoix commented Sep 30, 2020

Hi @adongu,
that's great news, thanks for sharing! So the static IP is IPv4 I guess? I am not sure whether the solution is the IP being static or it being IPv4 instead of IPv6. Is there any way for you to find out whether your requests will now default to IPv4 instead of IPv6?

@adongu

adongu commented Sep 30, 2020

Hey @jdepoix, I did some digging and found this tidbit in the documentation for Google VPC networks: https://cloud.google.com/vpc/docs/vpc#specifications

"VPC networks only support IPv4 unicast traffic. They do not support broadcast, multicast, or IPv6 traffic within the network; VMs in the VPC network can only send to IPv4 destinations and only receive traffic from IPv4 sources. However, it is possible to create an IPv6 address for a global load balancer."

It looks like all VPC traffic is IPv4, unless I create an IPv6 address on a global LB and route all service traffic through the LB first. The guide I followed didn't create any LB as far as I know, and the VPC network routing mode is regional.

@jdepoix
Owner

jdepoix commented Sep 30, 2020

So if I am understanding this correctly, you were probably doing IPv6 requests before setting up the VPC, while now you're doing IPv4 requests. Which would further support the assumption that this module can fail when sending IPv6 requests to YouTube. Thank you for sharing @adongu!

I guess my best bet would be to implement something which forces this module to use IPv4. I'll look into that when I have some time at hand.
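
Until the module offers such an option, one workaround callers could try is to restrict name resolution to IPv4 for the duration of a call. This is a sketch and not the module's API: `make_ipv4_only` and `force_ipv4` are hypothetical helpers, and the approach assumes the underlying HTTP stack resolves hosts through `socket.getaddrinfo` at call time.

```python
import socket
from contextlib import contextmanager


def make_ipv4_only(resolver):
    """Wrap a getaddrinfo-style callable so it only ever asks for IPv4 addresses."""
    def ipv4_resolver(host, port, family=0, type=0, proto=0, flags=0):
        # Ignore whatever family the caller asked for and force AF_INET.
        return resolver(host, port, socket.AF_INET, type, proto, flags)
    return ipv4_resolver


@contextmanager
def force_ipv4():
    """Temporarily replace socket.getaddrinfo with an IPv4-only version."""
    original = socket.getaddrinfo
    socket.getaddrinfo = make_ipv4_only(original)
    try:
        yield
    finally:
        socket.getaddrinfo = original  # always restore the real resolver


# Hypothetical usage with the call from the original report:
# with force_ipv4():
#     YouTubeTranscriptApi.get_transcript("98TQv5IAtY8", languages=["en"])
```

The patch is process-wide while the context is active, so it is not safe to rely on in multi-threaded code that also needs IPv6.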

jdepoix added the bug label on Sep 30, 2020

@adongu

adongu commented Sep 30, 2020

Hey @jdepoix, apologies if I was being vague. I think my issue might be related to #60 instead of this.

I'm not actually sure whether it was serving via IPv6 before, since I didn't have any global LB set up, as it was a light project. I think forcing the function to go through a reserved static IP stopped YouTube from limiting the shared machine my Cloud Function was running on. Apologies for spinning your wheels.

@jdepoix
Owner

jdepoix commented Sep 30, 2020

Thanks for clarifying @adongu.
I guess we will need further proof before forcing this module to use IPv4...

@cramdoulfa

> @sdtblck could you please execute those commands on your Ubuntu machine and upload the results:
>
> curl -L -4 "http://youtube.com/watch?v=98TQv5IAtY8"
> curl -L -6 "http://youtube.com/watch?v=98TQv5IAtY8"

I also have the same problem on AWS EC2.

To add my context: the requests initially worked fine on AWS. After a few thousand requests (at about 1 per second), I started getting the VideoUnavailable error, and I'm still getting that error 3 days later, even for a single request.

curl -L -4 is working fine though (it returns the HTML), while curl -L -6 returns curl: (7) Couldn't connect to server

Thanks @jdepoix for your diligence in fixing it - I can try to take over running tests from the server!

@jdepoix
Owner

jdepoix commented Nov 20, 2020

Hi @cramdoulfa,

Thanks for the information. Could you upload the HTML which is returned by calling curl -L -4 from the EC2 instance, so I can have a look at what is being returned?

@cramdoulfa

> Hi @cramdoulfa,
>
> Thanks for the information. Could you upload the HTML which is returned by calling curl -L -4 from the EC2 instance, so I can have a look at what is being returned?

curl_L4_return.txt

Here it is @jdepoix for curl -L -4 "http://youtube.com/watch?v=98TQv5IAtY8"

@jdepoix
Owner

jdepoix commented Nov 23, 2020

Thanks for the additional information @cramdoulfa!

This seems a bit odd though, as the information which is required by this module is actually being returned by your request. Are you sure that the module was still failing to retrieve this video at the time you made the requests? Maybe some rate limit had reset by then. Did you check this before executing the curl request?

@cramdoulfa

> Thanks for the additional information @cramdoulfa!
>
> This seems a bit odd though, as the information which is required by this module is actually being returned by your request. Are you sure that the module was still failing to retrieve this video at the time you made the requests? Maybe some rate limit had reset by then. Did you check this before executing the curl request?

Hmm, very good point, the package is actually working again now! I will start a batch of queries and post an update if it starts blocking again.
It's probably a matter of quota or rate limit.

@jacksonw765

I had this issue also myself. It's due to youtube blocking your IP. I switched on a VPN and everything worked as expected.

@jdepoix
Owner

jdepoix commented Nov 30, 2020

@jacksonw765 yeah, that's what I was guessing. It would be great though, if I could see what HTML YouTube returns after they blocked you, so that I can add a proper error message to this module.

@cramdoulfa

@jdepoix here is a sample HTML page for a video with available transcripts when the API seems to be blocked:
curl_result_blocked_transcript_API.txt

@cramdoulfa

> I had this issue also myself. It's due to youtube blocking your IP. I switched on a VPN and everything worked as expected.

Slight sidetrack but I'm curious @jacksonw765 do you use a commercial VPN or did you configure one yourself with openVPN?
I could not find a nice VPN client for Linux AMI

@jdepoix
Owner

jdepoix commented Dec 3, 2020

@cramdoulfa huh, that seems really odd. Once again the HTML seems to contain all the information needed by this module to retrieve the transcripts. The exception you got was a VideoUnavailable?

BTW you can try this out yourself by doing the following:

import requests
from youtube_transcript_api._transcripts import TranscriptListFetcher

html = '''HTML as string or load it from file'''
video_id = '<video_id>'

print(TranscriptListFetcher(requests.Session())._extract_captions_json(html, video_id))

If this returns a dict with data about the transcripts, without throwing an exception, it should work fine. This also is the only place where VideoUnavailable is thrown, so if that's the error you're getting the API must have been unblocked by the time you did the curl request, or there is something different when doing the curl request that I can't wrap my head around 🤔

@cramdoulfa

Oops, sorry for the false alarm - it seems that the API had indeed been unblocked in the meantime!

Thanks for the code snippet, I will verify next time it happens and notify you if I find HTML for which _extract_captions_json fails.

@jdepoix
Owner

jdepoix commented Dec 3, 2020

@cramdoulfa no worries, thanks for putting in the time trying to resolve this! 😊👍

@jacksonw765

> > I had this issue also myself. It's due to youtube blocking your IP. I switched on a VPN and everything worked as expected.
>
> Slight sidetrack but I'm curious @jacksonw765 do you use a commercial VPN or did you configure one yourself with openVPN?
> I could not find a nice VPN client for Linux AMI

I use PIA

@cramdoulfa

> @cramdoulfa no worries, thanks for putting in the time trying to resolve this! 😊👍

OK, I think this is the right one this time. The page actually says 'We have been receiving large amounts of requests from your network.'

curl_result_video_unavailable.txt

@jdepoix
Owner

jdepoix commented Dec 4, 2020

Perfect, that's exactly what I was looking for @cramdoulfa! Thank you very much! 👍
I will add a custom error for this, suggesting that the user wait for the rate limit to reset or use a VPN/change IP. I am quite busy right now, as I am in the last weeks of writing my master's thesis, so I probably won't be doing any coding on this module for a few weeks, but I'll try to get that done as soon as I can.
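
Until such an error exists, saved curl output can be triaged by hand. The sketch below is an illustration only: `classify_watch_html` is a hypothetical helper, the rate-limit phrase is taken from the wording reported in this thread (the exact text on the page may differ), and treating `"captions":` as the marker for transcript data is an assumption about the page format rather than a documented contract.

```python
# Wording as reported above; the real page text may differ slightly.
RATE_LIMIT_PHRASE = "We have been receiving large amounts of requests"
# Assumption: transcript data is embedded under a "captions" key in the page JSON.
CAPTIONS_MARKER = '"captions":'


def classify_watch_html(html):
    """Roughly classify a saved watch-page response."""
    if RATE_LIMIT_PHRASE in html:
        return "rate_limited"
    if CAPTIONS_MARKER in html:
        return "captions_present"
    return "unknown"


# Example with a file saved via: curl -L -4 "<watch url>" > 4.html
# with open("4.html", encoding="utf-8") as f:
#     print(classify_watch_html(f.read()))
```

Running this on both the `-4` and `-6` responses from a blocked machine would show at a glance whether the two address families are treated differently.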

The other thing which remains interesting is the IPv4 vs. IPv6 question raised above. It would be great if you could try executing an IPv4 and an IPv6 request the next time you run into the rate limit and upload the results here. The responses uploaded so far have contradicted each other a bit, and the people who posted them have unfortunately stopped replying.

@jdepoix
Owner

jdepoix commented Dec 4, 2020

Also, could you make any guesses on how long the rate limit persists until it is reset, or was it inconsistent for you @cramdoulfa?

@cramdoulfa

> Also, could you make any guesses on how long the rate limit persists until it is reset, or was it inconsistent for you @cramdoulfa?

Great!
I have been making requests every 1.5 seconds. Not sure how consistent it is, but it started blocking after about 15,000 videos this time. I will try to monitor how long it takes before it resets.
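
For reference, 15,000 requests spaced 1.5 seconds apart come to 22,500 seconds, or about 6.25 hours of continuous traffic. A fixed spacing like this could be enforced with a small pacer; the sketch below is hypothetical (`Pacer` is not part of this module), and nothing in this thread establishes an interval that reliably avoids blocking.

```python
import time


class Pacer:
    """Ensure consecutive calls to wait() are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval


# Hypothetical usage:
# pacer = Pacer(1.5)
# for video_id in video_ids:
#     pacer.wait()
#     ...fetch the transcript for video_id...
```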

@cramdoulfa

> Perfect, that's exactly what I was looking for @cramdoulfa! Thank you very much! 👍
> I will add a custom error for this, suggesting that the user wait for the rate limit to reset or use a VPN/change IP. I am quite busy right now, as I am in the last weeks of writing my master's thesis, so I probably won't be doing any coding on this module for a few weeks, but I'll try to get that done as soon as I can.
>
> The other thing which remains interesting is the IPv4 vs. IPv6 question raised above. It would be great if you could try executing an IPv4 and an IPv6 request the next time you run into the rate limit and upload the results here. The responses uploaded so far have contradicted each other a bit, and the people who posted them have unfortunately stopped replying.

curl -L -6 "http://youtube.com/watch?v=0vAfIcmpqzQ" returns curl: (7) Couldn't connect to server for me

Good luck with the thesis!

@jdepoix
Owner

jdepoix commented Mar 23, 2021

In v0.4.0 the Exception TooManyRequests has been added which is raised when running into rate limits. This could be used to further investigate the issue. My guess with the IPv4 vs. IPv6 thing @iercetin mentioned is that the IPv6 has been blocked due to extensive usage, while IPv4 still works as it hasn't been blocked yet.
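
A caller could react to such an exception by waiting and retrying. The following is a generic sketch, not something this module provides: `retry_with_backoff` is a hypothetical helper, and the default delays are arbitrary placeholders, since the thread never established how long a block lasts.

```python
import time


def retry_with_backoff(fetch, retry_on, delays=(60, 300, 900), sleep=time.sleep):
    """Call fetch(); whenever it raises `retry_on`, wait and try again.

    One attempt is made per entry in `delays`, plus a final attempt whose
    exception (if any) is allowed to propagate to the caller.
    """
    for delay in delays:
        try:
            return fetch()
        except retry_on:
            sleep(delay)
    return fetch()
```

With v0.4.0 this might be called as `retry_with_backoff(lambda: YouTubeTranscriptApi.get_transcript(video_id), TooManyRequests)`, assuming `TooManyRequests` can be imported from the package's top level.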

As this issue is kinda all over the place now, with different things being reported (most of them most likely due to rate limits, which now have a more descriptive error message), I will close this for now. If individual issues arise again, feel free to open a new issue with a title more specific to that issue.

@mgoldenbe

> I will add a custom error for this, suggesting that the user wait for the rate limit to reset or use a VPN/change IP.

@jdepoix Is it possible to use an IP rotation service like https://scrapingant.com with this module?

@jdepoix
Owner

jdepoix commented Jul 27, 2023

@mgoldenbe I haven't tried it yet, but in theory it should work. It would be great if you could report back what your experience has been in case you actually try it out! 😊

@mgoldenbe

mgoldenbe commented Jul 27, 2023

> I have been making requests every 1.5 seconds. Not sure how consistent it is, but it started blocking after about 15,000 videos this time.

@cramdoulfa Have you been able to determine what is the approximate maximal frequency of requests that does not result in being blocked?

7 participants