Problems with some tumblr videos #103

Closed
KaMyKaSii opened this issue Sep 6, 2018 · 15 comments
Comments

@KaMyKaSii

KaMyKaSii commented Sep 6, 2018

Some Tumblr videos aren't detected (and consequently not downloaded), and some are detected but can't be downloaded.

Example blog where videos are not being detected: http://abzcasal.tumblr.com
Example blog where some videos can't be downloaded: http://delicinhadele.tumblr.com

Both contain adult content, but I don't know whether secure mode is enabled for these blogs (or whether it influences anything).
Attached are screenshots of the problem and my tumblr extraction setup.
screenshot_20180906-150100
screenshot_20180906-142945
img-20180906-wa0022

@mikf
Owner

mikf commented Sep 6, 2018

> Example blog where videos are not being detected: http://abzcasal.tumblr.com

These are inline videos and support for these was added just earlier today (see #102). Make sure you are using the latest git snapshot and try again.
I'm able to download all of the videos visible when visiting that blog in a browser. 5 are listed but there are only 3 unique ones and gallery-dl skips download URLs it has already seen.
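
The skipping of already-seen download URLs can be pictured with a small sketch. This is not gallery-dl's actual code; the function name and the example.com URLs are made up for illustration, and only the "5 listed, 3 unique" shape is taken from the comment above.

```python
def unique_urls(urls):
    """Yield each URL once, in first-seen order (hypothetical helper)."""
    seen = set()
    for url in urls:
        if url in seen:
            # Already yielded once before, so skip the repeat.
            continue
        seen.add(url)
        yield url


# Five listed video URLs, only three of them distinct (made-up values).
listed = [
    "https://example.com/video_a.mp4",
    "https://example.com/video_b.mp4",
    "https://example.com/video_a.mp4",
    "https://example.com/video_c.mp4",
    "https://example.com/video_b.mp4",
]

downloads = list(unique_urls(listed))
print(len(listed), "listed,", len(downloads), "downloaded")
```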

> Example blog where some videos can't be downloaded: http://delicinhadele.tumblr.com

All videos that can't be downloaded are not playable when visiting the blog in a browser, either.
The respective posts are either not shown at all (id 173861946703)
or the video doesn't play and the network inspector shows a 403 error (id 168771256454).
I don't really know what to do here, but I guess those videos just aren't accessible.

@KaMyKaSii
Author

@mikf I just updated gallery-dl. The first blog has three different videos, but gallery-dl downloads only two of them here. The one that isn't downloaded (https://abzcasal.tumblr.com/post/177483231154) is skipped because I configured gallery-dl not to include reblogs. The problem is that it is a reblog of a post that apparently no longer exists (deleted?). When gallery-dl finds a reblog of a post from the same blog, could it check whether the original post still exists before skipping the download?

As for the inaccessible videos, wouldn't it be better to suppress the error messages in that case?

@Hrxn
Contributor

Hrxn commented Sep 7, 2018

> [..] or the video doesn't play and the network inspector shows a 403 error (id 168771256454).
> I don't really know what to do here, but I guess those videos just aren't accessible.

Yeah, that issue sounds familiar. Typical for the great Tumblr video purge. Videos still appear online but only return a 403.

@wankio
Contributor

wankio commented Sep 7, 2018

Tested, everything downloaded. If you don't enable reblogs, you will sometimes miss some posts.

mikf added a commit that referenced this issue Sep 7, 2018
Setting 'reblogs' to "deleted" will check if the parent post of a
reblog has been deleted and download its media content if that is the
case, otherwise it will be skipped.

This is a rather costly operation (1 API request per reblogged post)
and should therefore be used with care.
@mikf
Owner

mikf commented Sep 7, 2018

@KaMyKaSii I've extended the functionality of the reblogs option. Setting its value to "deleted" should do what you described - skip reblogs in general and download only those with deleted parent posts - but keep in mind that this is rather costly in terms of API requests. If you plan on using it permanently, you should get your own API key and secret.

@KaMyKaSii
Author

KaMyKaSii commented Sep 7, 2018

@mikf I have not yet tested the new version but it means that the reblogged media will be downloaded only if the original post is from the same blog and has been deleted, right?
I also used OAuth to log in and put the generated information (as in my screenshot above) into the configuration file, so I'm using my own key and secret, right? I'm not a developer, so I'm sorry if I'm wrong, but it's not just necessary to check the result of "reblogged_root_name" instead of making more requests?
screenshot_20180907-180327

@mikf
Owner

mikf commented Sep 7, 2018

> but it means that the reblogged media will be downloaded only if the original post is from the same blog and has been deleted

It currently doesn't check if it is from the same blog, but otherwise it should work like you described.

> ... but it's not just necessary to check the result of "reblogged_root_name" instead of making more requests?

No, because you need to check if the original post still exists or not.

> ... so I'm using my own key and secret, right?

No, you aren't, it's sadly not that simple.
You generated an access-token and access-token-secret pair (not an api-key and api-secret pair, mind you) for the default gallery-dl application on Tumblr, which means you linked your Tumblr account to gallery-dl and granted it the means to issue API requests on your account's behalf. But that also means you are sharing the same rate limit as everyone else using the default API credentials (1k requests per hour, 5k per day).
You should therefore create your own Application, so you don't have to share this rate limit with everyone else, and then redo the oauth:tumblr step once you are done.

All in all there are 4 values you need:

  • api-key and api-secret, which describe the Application you are using. -> register your own Tumblr Application
  • access-token and access-token-secret, which link your personal Tumblr account to an Application -> gallery-dl oauth:tumblr
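
As a rough sketch of where these four values might end up, the snippet below builds a configuration fragment and prints it as JSON. The option names are the ones listed above; nesting them under "extractor" and "tumblr" is an assumption about gallery-dl's configuration layout and should be checked against its documentation. All values are placeholders.

```python
import json

# Placeholder values only; replace them with your own credentials.
# The "extractor" -> "tumblr" nesting is an assumption, not confirmed here.
config = {
    "extractor": {
        "tumblr": {
            # Describe the Application (from registering your own on Tumblr).
            "api-key": "<your application's key>",
            "api-secret": "<your application's secret>",
            # Link your account to that Application (from oauth:tumblr).
            "access-token": "<your access token>",
            "access-token-secret": "<your access token secret>",
        }
    }
}

print(json.dumps(config, indent=4))
```

The access token pair is only valid for the Application it was generated for, which is why the oauth:tumblr step has to be repeated after switching to your own api-key and api-secret.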

@KaMyKaSii
Author

KaMyKaSii commented Sep 7, 2018

@mikf Honestly, I don't want media from other blogs. Could you make it check whether the original post is from the same blog that the user wants to download?

@mikf
Owner

mikf commented Sep 9, 2018

I think I've come up with another solution to your problem that doesn't need any extra API requests:

gallery-dl -o reblogs=true --filter "not reblogged or reblogged_root_name == blog['name']" http://abzcasal.tumblr.com/

You enable reblogs (-o reblogs=true)
and then you filter all reblogged posts that aren't from the blog owner.

  • `not reblogged` allows for all original/not-reblogged posts
  • `reblogged_root_name == blog['name']` checks if original and reblogged post belong to the same person.

And since duplicate image and video URLs get filtered out automatically, you would download any reblogged media only once, regardless of whether the original has been deleted or not.
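
To see how the filter expression sorts posts, here is a minimal sketch that evaluates it as a plain Python expression against hand-written metadata. It only approximates what the --filter option might do internally; the three sample posts and their values are invented, and only the field names come from the expression itself.

```python
FILTER = "not reblogged or reblogged_root_name == blog['name']"
code = compile(FILTER, "<filter>", "eval")


def passes(post):
    """Evaluate the filter with the post's metadata as local names."""
    return bool(eval(code, {"__builtins__": {}}, post))


blog = {"name": "abzcasal"}

# Invented sample metadata for three kinds of posts.
original = {"blog": blog, "reblogged": False, "reblogged_root_name": ""}
self_reblog = {"blog": blog, "reblogged": True, "reblogged_root_name": "abzcasal"}
foreign_reblog = {"blog": blog, "reblogged": True, "reblogged_root_name": "someone-else"}

for label, post in [
    ("original", original),
    ("self reblog", self_reblog),
    ("foreign reblog", foreign_reblog),
]:
    print(label, "->", "download" if passes(post) else "skip")
```

Original posts and reblogs of the blog's own posts pass, while reblogs rooted in another blog are dropped.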

I also have a better implementation for reblogs="deleted" in mind (one that doesn't need any extra API requests), but I wanted to ask if this right here is OK too, before I do anything else.

@KaMyKaSii
Author

@mikf I just ran the command and it worked perfectly, both with "reblogs": false and with "reblogs": "deleted". This is already great for me, but if you can do the better implementation, thank you!
screenshot_20180909-101215

mikf added a commit that referenced this issue Sep 10, 2018
- rename "deleted" to "same-blog"
- change test for deleted original post to test if
  original post owner has the same UUID (full blog name) as the one
  being downloaded from
- add 'blog[uuid]' metadata to allow comparison with
  'reblogged_from_uuid'
@mikf
Owner

mikf commented Sep 10, 2018

So I changed things a bit and everything should now work as you wanted, I hope. The name changed from "deleted" to "same-blog" to better reflect what is being tested for.

It now just checks if the original post is from the same blog as the one being downloaded from. A check if the original post still exists isn't really necessary as explained above and it therefore doesn't need any extra API requests either.
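
The decision described here could look roughly like the sketch below. It is not the actual extractor code: the function name is hypothetical, the UUID values are made up, and treating a post as a reblog whenever it carries a 'reblogged_from_uuid' value is an assumption based on the commit message.

```python
def keep_post(post, blog_uuid, reblogs):
    """Decide whether a post's media should be downloaded (hypothetical helper).

    reblogs is True, False or "same-blog", mirroring the option values
    discussed in this thread.
    """
    source_uuid = post.get("reblogged_from_uuid")
    if source_uuid is None:
        # Not a reblog: always keep.
        return True
    if reblogs == "same-blog":
        # Keep only reblogs whose source is the blog being downloaded.
        # No extra API request is needed for this comparison.
        return source_uuid == blog_uuid
    return bool(reblogs)


BLOG_UUID = "abzcasal.tumblr.com"  # made-up UUID (full blog name)

original = {"id": 1}
self_reblog = {"id": 2, "reblogged_from_uuid": "abzcasal.tumblr.com"}
foreign_reblog = {"id": 3, "reblogged_from_uuid": "other.tumblr.com"}

for post in (original, self_reblog, foreign_reblog):
    print(post["id"], keep_post(post, BLOG_UUID, "same-blog"))
```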

@KaMyKaSii
Author

@mikf But then I believe there is a problem with the system that skips duplicates. I just ran the command "gallery-dl -i tumblrs.txt" (to download my favorite Tumblr blogs) and it downloaded duplicate content from various blogs. Maybe you can reproduce this by running "gallery-dl muyanna.tumblr.com" with "reblogs": false and then running the same command with "reblogs": "same-blog".

@mikf
Owner

mikf commented Sep 10, 2018

You will get "duplicate" files if you run it twice with different reblogs settings. That's to be expected.

For "reblogs": false you get the file from the original post and for "reblogs": "same-blog" it's the file from the newest reblog (the original file will be skipped). Those two files will be identical but have different filenames, since the default filename format includes the post id.

So either stick to one value for the reblogs option
or change the filename format to something where this doesn't happen. Maybe "{blog_name}_{hash}_{num:>02}.{extension}".
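
A small sketch of why the hash-based format avoids the duplicates. It uses Python's plain str.format as a stand-in for gallery-dl's own formatter, the id-based pattern is a hypothetical example of a format that includes the post id (not necessarily the real default), and the metadata values are invented. It also assumes both posts report the same hash for the identical file.

```python
ID_BASED = "{blog_name}_{id}_{num:>02}.{extension}"  # hypothetical id-based format
HASH_BASED = "{blog_name}_{hash}_{num:>02}.{extension}"  # format suggested above

# The same file, once from the original post and once from a later reblog.
original = {"blog_name": "muyanna", "id": 111, "hash": "abc123",
            "num": 1, "extension": "mp4"}
reblog = dict(original, id=222)

id_names = {ID_BASED.format(**post) for post in (original, reblog)}
hash_names = {HASH_BASED.format(**post) for post in (original, reblog)}

print(sorted(id_names))    # two different filenames, so two copies on disk
print(sorted(hash_names))  # one filename, so the second download is skipped
```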

@KaMyKaSii
Author

@mikf First, I'm sorry for taking so long to respond. I still have one question: is duplicate content being downloaded with "reblogs": "same-blog" expected behavior, or something that should be fixed? It happens even after deleting a Tumblr blog's folder and starting the download from scratch. I believe you can reproduce it with the http://muyanna.tumblr.com blog, since I noticed that it always happens there. Download it today with the same settings as mine, wait until its owner reblogs their own posts, and then download the blog again. You will see that the new reblogs will be downloaded again, thus creating duplicate content.

@mikf
Owner

mikf commented Oct 16, 2018

> You will see that the new reblogs will be downloaded again, thus creating duplicate content

That is expected behavior. gallery-dl can't really know that files from a new post are files it has downloaded before, but here are possible solutions you could try:

  • As said before, use a filename format where identical files from different posts have the same filename (e.g. "{blog_name}_{hash}_{num:>02}.{extension}")
  • Use a download archive and use an archive key format where identical files have the same key (e.g. "{blog_name}_{hash}_{num}")
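
The archive idea can be sketched with an SQLite table keyed by the formatted string. This is only an illustration of the principle: the table name, the helper functions and the metadata values are hypothetical, and the real archive layout is not described in this thread. The sketch also records the key when the decision is made, a simplification compared to recording it after a successful download.

```python
import sqlite3

KEY_FORMAT = "{blog_name}_{hash}_{num}"  # archive key format suggested above


def open_archive(path=":memory:"):
    """Open (or create) a tiny archive database (hypothetical schema)."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS archive (entry TEXT PRIMARY KEY)")
    return con


def should_download(con, metadata):
    """Record the file's key; return False if the key was already present."""
    key = KEY_FORMAT.format(**metadata)
    cur = con.execute(
        "INSERT OR IGNORE INTO archive (entry) VALUES (?)", (key,)
    )
    con.commit()
    # rowcount is 1 when the row was inserted and 0 when it was ignored.
    return cur.rowcount == 1


archive = open_archive()
original = {"blog_name": "muyanna", "id": 111, "hash": "abc123", "num": 1}
reblog = dict(original, id=222)  # new post id, identical file

first = should_download(archive, original)
second = should_download(archive, reblog)
print(first, second)
```

Because the key ignores the post id, the file from the new reblog maps to a key that is already stored and is not downloaded again.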
