Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More reliable favicon retrieval. #652

Merged
merged 9 commits into from
Jun 16, 2023

Conversation

aghoulcoder
Copy link
Contributor

Previously, the application looked for favicons in /favicon.ico. But favicons can be anywhere on a website and the location can be defined in HTML and JSON manifests. Websites can also have multiple favicons.

As a result, when a new site was added to stash-box which did not have a /favicon.ico, the site was created without a proper favicon. Additionally, since stash-box didn't check the filetype of the data returned by the /favicon.ico GET request, it stored various 404 and other redirected html pages.

A non-trivial amount of logic is required to discover all favicon locations, check filetypes, and sort them. It therefore makes sense to rely on a third-party package for this job. This patch uses go.deanishe.net/favicon.

@aghoulcoder
Copy link
Contributor Author

Merge conflict is because I wrote and tested this against v0.4.6 instead of master because master is currently broken due to #645

}

// Icons are sorted widest first. We currently get the first one (icons[0]).
req, err := http.NewRequestWithContext(ctx, http.MethodGet, icons[0].URL, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will panic with a runtime error if icons is empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super! thanks.

Previously, the application looked for favicons in `/favicon.ico`.
But favicons can be anywhere on a website and the location can be
defined in HTML and JSON manifests. Websites can also have multiple
favicons.

As a result, when a new site was added to stash-box which did not have a
`/favicon.ico`, the site was created without a proper favicon. Additionally,
since stash-box didn't check the filetype of the data returned by the
`/favicon.ico` GET request, it stored various 404 and other redirected
html pages.

A non-trivial amount of logic is required to discover all favicon
locations, check filetypes, and sort them. It therefore makes sense to
rely on a third-party package for this job. This patch uses
`go.deanishe.net/favicon`.
@aghoulcoder
Copy link
Contributor Author

aghoulcoder commented Jun 6, 2023

force-pushed after rebasing on master and resolving the merge conflict in go.mod.

@WithoutPants
Copy link
Collaborator

I haven't been able to get this to work reliably. There's a few issues with the icon downloading code which make debugging the issues quite difficult. I think the whole thing needs a bit of improving, but probably out of scope of what you're doing.

I tried creating a new site with URL https://twitter.com. The icon downloading code fails silently if the favicon_path config field isn't populated. I then populated that field and restarted. The second issue I found was that the downloadIcon code creates the icon file on the filesystem regardless of whether it successfully downloads the icon. This means that if the download process fails for any reason, then you have an empty file in the filesystem and you can't retrigger a download without deleting the file. I suspect that we probably shouldn't be trying to download the favicon via an http route, and there's no protection from multiple threads trying to download at once.

Finally, to the part that is the most relevant to you, the call to favicon.Find returns the following error fetch page: retrieve URL: Get \"/\": stopped after 10 redirects. This error is not logged any where and there's no indication that it failed or why without debugging. On the master branch, the twitter favicon is correctly downloaded and displayed.

@aghoulcoder
Copy link
Contributor Author

aghoulcoder commented Jun 6, 2023

There's a general absence of logging throughout this codebase even though there is a logger package implemented and used for only a couple of things. Logging is something I would like to help improve. However, I'm not sure if I should start using the current logger package as it relies on https://github.com/sirupsen/logrus which is no longer under development.

For now I'll look into the favicon.Find issue.
Thanks for checking.

EDIT: There are also two important config options logfile and loglevel which are not documented.

@aghoulcoder
Copy link
Contributor Author

the call to favicon.Find returns the following error fetch page: retrieve URL: Get \"/\": stopped after 10 redirects

Apparently, there's an issue with twitter.com:

$ curl -L https://twitter.com
curl: (47) Maximum (50) redirects followed

$ curl -v -L https://twitter.com                                                                                                                                                                                                                                  
*   Trying 104.244.42.65:443...                                                                                                                                                                                                                                                   
* Connected to twitter.com (104.244.42.65) port 443 (#0)                                                                                                                                                                                                                          
* ALPN: offers h2,http/1.1                                                                                                                                                                                                                                                        
* TLSv1.3 (OUT), TLS handshake, Client hello (1):                                                                                                                                                                                                                                 
*  CAfile: /etc/ssl/certs/ca-certificates.crt                                                                                                                                                                                                                                     
*  CApath: none                                                                                                                                                                                                                                                                   
* TLSv1.3 (IN), TLS handshake, Server hello (2):                                                                                                                                                                                                                                  
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):                                                                                                                                                                                                                          
* TLSv1.3 (IN), TLS handshake, Certificate (11):                                                                                                                                                                                                                                  
* TLSv1.3 (IN), TLS handshake, CERT verify (15):                                                                                                                                                                                                                                  
* TLSv1.3 (IN), TLS handshake, Finished (20):                                                                                                                                                                                                                                     
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):                                                                                                                                                                                                                       
* TLSv1.3 (OUT), TLS handshake, Finished (20):                                                                                                                                                                                                                                    
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384                                                                                                                                                                                                                           
* ALPN: server accepted h2                                                                                                                                                                                                                                                        
* Server certificate:                                                                                                                                                                                                                                                             
*  subject: C=US; ST=California; L=San Francisco; O=Twitter, Inc.; CN=twitter.com                                                                                                                                                                                                 
*  start date: Feb  5 00:00:00 2023 GMT                                                                                                                                                                                                                                           
*  expire date: Feb  5 23:59:59 2024 GMT                                                                                                                                                                                                                                          
*  subjectAltName: host "twitter.com" matched cert's "twitter.com"                                                                                                                                                                                                                
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert TLS Hybrid ECC SHA384 2020 CA1                                                                                                                                                                                                       
*  SSL certificate verify ok.                                                                                                                                                                                                                                                     
* using HTTP/2                                                                                                                                                                                                                                                                    
* h2 [:method: GET]                                                                                                                                                                                                                                                               
* h2 [:scheme: https]                                                                                                                                                                                                                                                             
* h2 [:authority: twitter.com]                                                                                                                                                                                                                                                    
* h2 [:path: /]                                                                                                                                                                                                                                                                   
* h2 [user-agent: curl/8.1.2]                                                                                                                                                                                                                                                     
* h2 [accept: */*]                                                                                                                                                                                                                                                                
* Using Stream ID: 1 (easy handle 0x55ff0af3cf50)                                                                                                                                                                                                                                 
> GET / HTTP/2                                                                                                                                                                                                                                                                    
> Host: twitter.com                                                                                                                                                                                                                                                               
> User-Agent: curl/8.1.2                                                                                                                                                                                                                                                          
> Accept: */*                                                                                                                                                                                                                                                                     
>                                                                                                                                                                                                                                                                                 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):                                                                                                                                                                                                                             
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):                                                                                                                                                                                                                             
* old SSL session ID is stale, removing                                                                                                                                                                                                                                           
< HTTP/2 302                                                                                                                                                                                                                                                                      
< date: Tue, 06 Jun 2023 09:18:22 GMT                                                                                                                                                                                                                                             
< perf: 7626143928                                                                                                                                                                                                                                                                
< vary: Accept                                                                                                                                                                                                                                                                    
< server: tsa_o
< location: /
< set-cookie: guest_id=v1%3A168604310292854621; Max-Age=34214400; Expires=Sat, 06 Jul 2024 09:18:22 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=None
< content-type: text/plain; charset=utf-8
< x-powered-by: Express
< cache-control: no-cache, no-store, max-age=0
< content-length: 23
< x-transaction-id: 1aae4a21b4a8748b
< strict-transport-security: max-age=631138519
< x-response-time: 111
< x-connection-hash: 325f3f2bcd4b00f8806219564526f8a3783bd7130251c6000a66c8b1090e3466
< 
[continues...]

Previously, an empty file was created at the start of the downloadIcon
function and would remain as an empty file even when favicon retreival
failed. This empty file would prevent future attempts to download an
icon for the same site ID.

Now the icon file is created at the end, after everything else has
succeeded.
@aghoulcoder
Copy link
Contributor Author

aghoulcoder commented Jun 6, 2023

Basically, the way twitter.com works is that on the first request it gives you a 302 redirect to / and a cookie named guest_id. Then you do the next request and send it the guest_id cookie at which point it returns 200. If you don't send it the cookie it redirects infinitely.

With curl you can store the cookie like this:

$ curl -c /tmp/cookies -v -L https://twitter.com

So that's what's happening. But I'm still not sure how to handle it.

@aghoulcoder
Copy link
Contributor Author

I think this can be solved by using the WithClient function to specify an http.Client with a CookieJar. Looking into it...

Cookies are required for certain websites, even for simple requests.
See code comment for details.
@aghoulcoder
Copy link
Contributor Author

aghoulcoder commented Jun 7, 2023

  • ✅ I fixed the twitter.com redirect loop issue by using an http.client with a cookiejar. fee8a3d
  • ✅ I fixed the empty file issue by moving file creation to the end of the function. f238b86
  • ⏳ Logging is a problem throughout the codebase. I have opened [RFC] Logging improvements #655 for further discussion and suggestions.
    @WithoutPants

This more closely mirrors the GitHub workflow build process and helps
avoid CI build failures on pull requests.
@WithoutPants WithoutPants merged commit f0bdb06 into stashapp:master Jun 16, 2023
@aghoulcoder aghoulcoder deleted the favicon-bugfix branch June 16, 2023 11:47
feederbox826 pushed a commit to feederbox826/stash-box that referenced this pull request Nov 15, 2023
* More reliable favicon retrieval.

Previously, the application looked for favicons in `/favicon.ico`.
But favicons can be anywhere on a website and the location can be
defined in HTML and JSON manifests. Websites can also have multiple
favicons.

As a result, when a new site was added to stash-box which did not have a
`/favicon.ico`, the site was created without a proper favicon. Additionally,
since stash-box didn't check the filetype of the data returned by the
`/favicon.ico` GET request, it stored various 404 and other redirected
html pages.

A non-trivial amount of logic is required to discover all favicon
locations, check filetypes, and sort them. It therefore makes sense to
rely on a third-party package for this job. This patch uses
`go.deanishe.net/favicon`.

* Check length of icons to avoid runtime error when using icons[0] later.

* Specify go.deanishe.net/favicon as direct requirement

* favicon: crate favicon file only after retreival has succeded

Previously, an empty file was created at the start of the downloadIcon
function and would remain as an empty file even when favicon retreival
failed. This empty file would prevent future attempts to download an
icon for the same site ID.

Now the icon file is created at the end, after everything else has
succeeded.

* retreive the smallest available favicon for sites

* use http.client with cookiejar for favicon finder

Cookies are required for certain websites, even for simple requests.
See code comment for details.

* Makefile: include generate in the default make target

This more closely mirrors the GitHub workflow build process and helps
avoid CI build failures on pull requests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants