[BUG]: Youtube data collector "Failed to locate a transcript for this video!" #2597

Closed
stdestro opened this issue Nov 7, 2024 · 8 comments
Labels
possible bug Bug was reported but is not confirmed or is unable to be replicated.

Comments

@stdestro

stdestro commented Nov 7, 2024

How are you running AnythingLLM?

Docker (remote machine)

What happened?

I am trying to collect a transcript with the YouTube transcript data connector.
My local installation on macOS collects the transcript, while the Docker instance on Ubuntu gives the error:

Failed to locate a transcript for this video!

The video link is the same, so the video is not the problem (I tried different videos, with the same result).
I have the same LLM model and the same LLM agent in the desktop app and the Docker instance.
The Docker instance is started with --cap-add SYS_ADMIN.
I can scrape websites through the data collector without problems in the Docker instance.
The only problem is collecting transcripts from YouTube.

Are there known steps to reproduce?

No response

@stdestro stdestro added the possible bug Bug was reported but is not confirmed or is unable to be replicated. label Nov 7, 2024
@stdestro stdestro changed the title [BUG]: [BUG]: Youtube collector Failed to locate a transcript for this video! Nov 7, 2024
@stdestro stdestro changed the title [BUG]: Youtube collector Failed to locate a transcript for this video! [BUG]: Youtube data collector "Failed to locate a transcript for this video!" Nov 7, 2024
@timothycarambat
Member

The IP your Ubuntu instance is on is probably being blocked by Google from reaching https://www.youtube.com/watch URLs. That is the only thing that would prevent this, since it works on other platforms and you can scrape sites in general.

It is also possible that the video is blocked in the geography associated with the Ubuntu instance's IP.

@stdestro
Author

stdestro commented Nov 8, 2024

It's not a geo restriction; I tried different videos.
I can reach the video from the terminal using curl and lynx, so it seems YouTube is not blocking my IP:

ubuntu@instance-2024xxx-xxx:~$ curl -I https://www.youtube.com/watch?v=ugpFyDQexlA
HTTP/2 200 
content-type: text/html; charset=utf-8
x-content-type-options: nosniff
cache-control: no-cache, no-store, max-age=0, must-revalidate
pragma: no-cache
expires: Mon, 01 Jan 1990 00:00:00 GMT
date: Fri, 08 Nov 2024 07:13:56 GMT
content-length: 920191
x-frame-options: SAMEORIGIN
strict-transport-security: max-age=31536000
origin-trial: AmhMBR6zCLzDDxpW+HfpP67BqwIknWnyMOXOQGfzYswFmJe+fgaI6XZgAzcxOrzNtP7hEDsOo1jdjFnVr2IdxQ4AAAB4eyJvcmlnaW4iOiJodHRwczovL3lvdXR1YmUuY29tOjQ0MyIsImZlYXR1cmUiOiJXZWJWaWV3WFJlcXVlc3RlZFdpdGhEZXByZWNhdGlvbiIsImV4cGlkjlkjòlkjòlkjkjkzE5OSwiaXNTdWJkb21haW4iOnRydWV9
report-to: {"group":"youtube_main","max_age":2592000,"endpoints":[{"url":"https://csp.withgoogle.com/csp/report-to/youtube_main"}]}
content-security-policy: require-trusted-types-for 'script'
cross-origin-opener-policy: same-origin-allow-popups; report-to="youtube_main"
permissions-policy: ch-ua-arch=*, ch-ua-bitness=*, ch-ua-full-version=*, ch-ua-full-version-list=*, ch-ua-model=*, ch-ua-wow64=*, ch-ua-form-factors=*, ch-ua-platform=*, ch-ua-platform-version=*
p3p: CP="This is not a P3P policy! See http://support.google.com/accounts/answer/151657?hl=it for more info."
server: ESF
x-xss-protection: 0
set-cookie: YSC=-SoXXU7Xipk; Domain=.youtube.com; Path=/; Secure; HttpOnly; SameSite=none
set-cookie: __Secure-YEC=CgtZdkZXSlFtNUjfsjdksdjhsjkdhfkjhsjdhjfdjhwYGRobHB0eHw4PIBAREiEgEg%3D%3D; Domain=.youtube.com; Expires=Mon, 08-Dec-2025 07:13:55 GMT; Path=/; Secure; HttpOnly; SameSite=lax
set-cookie: VISITOR_PRIVACY_METADATA=CgJJVBIcEhgSFhMLhjjhgljhbvjhvjhHB0eHw4PIBAREiEgEg%3D%3D; Domain=.youtube.com; Expires=Mon, 08-Dec-2025 07:13:57 GMT; Path=/; Secure; HttpOnly; SameSite=none
set-cookie: VISITOR_INFO1_LIVE=; Domain=.youtube.com; Expires=Sat, 12-Feb-2022 07:13:57 GMT; Path=/; Secure; HttpOnly; SameSite=none
alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
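
A `200` from `curl -I` only shows that the watch URL is reachable; it does not show whether the page body served to this IP carries caption metadata. A minimal sketch of a body-level check follows. It assumes the watch page embeds a `"captionTracks"` key when captions are available, which is an assumption about YouTube's page markup and not something confirmed in this thread. `hasCaptionTracks` and `checkVideo` are hypothetical helper names, and this is not the collector's actual code.

```javascript
// Hypothetical diagnostic; not part of AnythingLLM.
// Assumption: a watch page that offers captions contains a "captionTracks" key.
function hasCaptionTracks(html) {
  return typeof html === "string" && html.includes('"captionTracks"');
}

// Fetches the full page body (unlike `curl -I`, which only requests headers).
// Requires Node 18+ for the global fetch API.
async function checkVideo(url) {
  const res = await fetch(url, { headers: { "Accept-Language": "en-US" } });
  const html = await res.text();
  return {
    status: res.status,
    length: html.length, // number of characters in the body
    captions: hasCaptionTracks(html),
  };
}

// Example (performs a real network request):
// checkVideo("https://www.youtube.com/watch?v=ugpFyDQexlA").then(console.log);
```

Running the same check on the macOS desktop and inside the Docker container would show whether the two environments receive different page bodies for the same URL.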

@timothycarambat
Member

So no YouTube video works on this instance?

@stdestro
Author

stdestro commented Nov 8, 2024

Exactly. I tried 10 different videos from different regions. It works on the desktop app, but not on the Docker instance on the Oracle VM.

@timothycarambat
Member

When viewing the Docker logs while attempting a collection, do we see a [collector] line that shows any more information about the error besides the user-facing one?

Hoping this error fires:

console.error(`YoutubeTranscript.#parseTranscriptEndpoint ${e.message}`);

@stdestro
Author

stdestro commented Nov 8, 2024

This is the Docker log after 3 tries:

[backend] info: [EncryptionManager] Loaded existing key & salt for encrypting arbitrary data.
[collector] info: -- Working YouTube https://www.youtube.com/watch?v=L1RMd96eHgo --
[backend] info: [EncryptionManager] Loaded existing key & salt for encrypting arbitrary data.
[collector] info: -- Working YouTube https://www.youtube.com/watch?v=eyVDMJN0sa8 --
[backend] info: [EncryptionManager] Loaded existing key & salt for encrypting arbitrary data.
[collector] info: -- Working YouTube https://www.youtube.com/watch?v=eyVDMJN0sa8 --

@timothycarambat
Member

@stdestro Does this thread apply?

The script we are using is a fork of that repo. We broke from it a long time ago to force-patch something in that data connector, but thinking about the network difference, I wonder if this is the issue: are the IPv4 and IPv6 responses from youtube.com different?
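
One way to probe the IPv4/IPv6 question from inside the container is to list which address families the resolver returns for the host. Below is a minimal sketch using Node's built-in `dns` module. `groupByFamily` and `resolveFamilies` are hypothetical helper names, and nothing here is taken from the collector's code.

```javascript
const dns = require("dns");

// Splits dns.lookup results ({ address, family }) into IPv4 and IPv6 lists.
function groupByFamily(records) {
  const out = { ipv4: [], ipv6: [] };
  for (const { address, family } of records) {
    if (family === 4) out.ipv4.push(address);
    else if (family === 6) out.ipv6.push(address);
  }
  return out;
}

// Resolves every address the OS resolver offers for a hostname.
async function resolveFamilies(hostname) {
  const records = await dns.promises.lookup(hostname, { all: true });
  return groupByFamily(records);
}

// Example (performs a real DNS lookup):
// resolveFamilies("www.youtube.com").then(console.log);
```

If the container reports IPv6 addresses that the host cannot actually route to, that would be consistent with requests behaving differently from the desktop machine.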

@stdestro
Author

stdestro commented Nov 8, 2024

So, for me, curl -L -4 works fine (returns the HTML), and curl -L -6 returns curl: (7) Couldn't connect to server.

I cannot ping over IPv6:

ubuntu@instance-20241107-1421:~$ ping6 youtube.com
ping6: connect: Network is unreachable

IPv4 works:


ubuntu@instance-20241107-1421:~$ ping youtube.com
PING youtube.com (216.58.205.46) 56(84) bytes of data.
64 bytes from mil04s24-in-f14.1e100.net (216.58.205.46): icmp_seq=1 ttl=117 time=8.07 ms
64 bytes from mil04s24-in-f14.1e100.net (216.58.205.46): icmp_seq=2 ttl=117 time=8.03 ms
64 bytes from mil04s24-in-f46.1e100.net (216.58.205.46): icmp_seq=3 ttl=117 time=8.02 ms
64 bytes from lhr48s23-in-f14.1e100.net (216.58.205.46): icmp_seq=4 ttl=117 time=8.06 ms

From the desktop I get this:

s@MacBookAir ~ % ping youtube.com       
PING youtube.com (142.251.209.14): 56 data bytes
64 bytes from 142.251.209.14: icmp_seq=0 ttl=118 time=12.378 ms
64 bytes from 142.251.209.14: icmp_seq=1 ttl=118 time=8.915 ms
64 bytes from 142.251.209.14: icmp_seq=2 ttl=118 time=9.180 ms
64 bytes from 142.251.209.14: icmp_seq=3 ttl=118 time=16.002 ms
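
Since IPv6 is unreachable from this host, one thing to try is making Node prefer IPv4 results when resolving hostnames. The sketch below assumes the collector's requests go through Node's default DNS lookup, which is not confirmed in this thread. `dns.setDefaultResultOrder` is available in Node 16.4+ (and 14.18+), and `preferIPv4` is a hypothetical helper name.

```javascript
const dns = require("dns");

// Asks Node to return IPv4 addresses before IPv6 ones from dns.lookup.
// Returns false on Node versions that lack setDefaultResultOrder.
function preferIPv4() {
  if (typeof dns.setDefaultResultOrder !== "function") return false;
  dns.setDefaultResultOrder("ipv4first");
  return true;
}

preferIPv4();
```

The same preference can be requested without code changes by starting Node with `--dns-result-order=ipv4first`. Whether this changes the collector's behavior is untested here.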
