Android Authority feed #614

Closed
PhilC813 opened this issue Dec 17, 2024 · 17 comments

@PhilC813
Contributor

PhilC813 commented Dec 17, 2024

Feed URL

https://www.androidauthority.com/feed/

Add any details, links, or screenshots about the article layout that's missing or wrong

In the following article, the names of the different sections, which are headers (h2 elements), are stripped out in full content mode.

Article:
https://www.androidauthority.com/new-android-apps-658839/

[Screenshot: 2024_12_17_02 38 54]

Here's the stripped-out HTML, which is rather simple:
<h2 id="[number]">[Header]</h2>

I would assume that the same would occur for other articles in the feed.

Unless I'm mistaken, header elements shouldn't be stripped out regardless of the feed; they are usually relevant to the content.

Thank you!!

@PhilC813 PhilC813 changed the title Headers stripped out in Android Authority article Elements stripped out in Android Authority article Dec 17, 2024
@PhilC813 PhilC813 changed the title Elements stripped out in Android Authority article Android Authority feed Dec 17, 2024
@PhilC813
Contributor Author

PhilC813 commented Dec 17, 2024

I also noticed that the embedded YouTube videos are stripped out, but unlike the headers, they seem oddly integrated into the HTML, as they only appear in the JSON data. I understand if this can't be improved in the parser.

"TED Tumblewords" app section:
{"resource":"nc-embed-youtube","video":{"youtubeId":"1Z9fVc6v2aY"}}

"Carrion" app section:
{"resource":"nc-embed-youtube","video":{"youtubeId":"M6NOM2-UZdw"}}
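
For illustration only, here is a minimal sketch of how one of these JSON fragments could be turned into a standard YouTube embed, assuming the fragment is available to the parser as a string. The function name `youtubeEmbedFromResource` is hypothetical and is not part of Capy Reader or the parser; only the `resource` and `video.youtubeId` fields come from the data above.

```javascript
// Hypothetical helper (an assumption for illustration, not an existing API).
// Takes one JSON fragment like the ones quoted above and returns iframe
// markup, or null when the fragment is not a YouTube embed resource.
function youtubeEmbedFromResource(json) {
  let data;
  try {
    data = JSON.parse(json);
  } catch (err) {
    return null; // not valid JSON
  }
  if (!data || data.resource !== 'nc-embed-youtube') return null;

  const id = data.video && data.video.youtubeId;
  // Only accept URL-safe ID characters so the value can be placed in a URL.
  if (typeof id !== 'string' || !/^[\w-]+$/.test(id)) return null;

  return `<iframe src="https://www.youtube.com/embed/${id}" allowfullscreen></iframe>`;
}

console.log(
  youtubeEmbedFromResource(
    '{"resource":"nc-embed-youtube","video":{"youtubeId":"M6NOM2-UZdw"}}'
  )
);
```

Whether the JSON can actually be reached from a non-browser client is a separate question that this sketch does not answer.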

@jocmp jocmp moved this to On Deck in Capy Reader Dec 17, 2024
@jocmp jocmp self-assigned this Dec 17, 2024
@jocmp
Owner

jocmp commented Dec 17, 2024

Thanks for all the details. The missing headers should be simple. I'll see what I can do for that YouTube video JSON.

@jocmp jocmp moved this from On Deck to In Progress in Capy Reader Dec 18, 2024
@jocmp
Owner

jocmp commented Dec 18, 2024

I have a custom parser I'll test a little more before adding to the next release (jocmp/mercury-parser#27).

I haven't found a way to grab the YouTube videos - it looks like they're doing something weird with JavaScript for non-browser clients (like Capy). That said, headers are working. Stay tuned!

[Screenshots: before and after]

@PhilC813
Contributor Author

Sounds promising! Thank you!

If you made a custom parser for this feed specifically, is it because you don't consider header elements to be relevant to the content for most feeds?

@jocmp
Owner

jocmp commented Dec 18, 2024

I think headers are important. I want to avoid breaking the core parser since I don't fully understand all the different parts yet. So adding a custom fix is easiest now.

In general, sites can misuse headers, which is why the previous maintainers built the parser that way. Luckily they left a note on why they built it that way.

Remove any headers that appear before all other p tags in the
document. This probably means that it was part of the title, a
subtitle or something else extraneous like a datestamp or byline,
all of which should be handled by other metadata handling.
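
The rule in that note can be sketched roughly as follows. This is a simplified illustration on a flat list of `{ tag, text }` nodes, not the parser's actual implementation; the function name and the node shape are assumptions, and leaving the list untouched when there is no paragraph at all is a choice made for this sketch.

```javascript
// Hypothetical illustration of the quoted rule: drop h1-h6 nodes that
// appear before the first <p> in the document.
function dropHeadersBeforeFirstParagraph(nodes) {
  const isHeader = (node) => /^h[1-6]$/.test(node.tag);
  const firstParagraph = nodes.findIndex((node) => node.tag === 'p');

  // Assumption for this sketch: with no paragraph, keep everything.
  if (firstParagraph === -1) return nodes.slice();

  // Keep every non-header node, and keep headers only after the first <p>.
  return nodes.filter((node, i) => i > firstParagraph || !isHeader(node));
}

const cleaned = dropHeadersBeforeFirstParagraph([
  { tag: 'h2', text: 'Subtitle or byline' },
  { tag: 'p', text: 'Intro paragraph' },
  { tag: 'h2', text: 'Section name' },
  { tag: 'p', text: 'Section body' },
]);
console.log(cleaned.map((node) => node.tag).join(', '));
```

A rule like this removes stray subtitles, but it also removes legitimate section headers on any page whose content is not preceded by a p tag.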

@jocmp
Owner

jocmp commented Dec 22, 2024

Updated as of 2024.12.1085-dev

@jocmp jocmp closed this as completed Dec 23, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Capy Reader Dec 23, 2024
@PhilC813
Contributor Author

PhilC813 commented Jan 7, 2025

So far, articles seem to render properly with the update to the parser! Thank you so much! It's a feed that I only check occasionally, however, so I'll let you know if I notice other things down the line.


One small thing, albeit not important:
For polls like the one in this article, the poll and voting choices get rendered in Capy but, naturally, can't be interacted with, presumably because the JavaScript is missing.

[Screenshots: Website (screenshot17362337972574627274782526763858) and Capy (Screenshot_20250107_021035)]

Here's the HTML for that section:
<div class="e_e"><div class="e_Oh e_Kp"><h3 class="e_Np">Which Android XR form factor are you most excited about?</h3><div class="e_Op">1343 votes</div><div class="e_Pp e_Qh"><button type="button" class="e_Qp"><div class="e_Rp">The headset! Give me that immersion!</div><div class="e_Mp">10<!-- -->%</div><div class="e_Sp"><div class="e_Lp" style="width:10%"></div></div></button><button type="button" class="e_Qp"><div class="e_Rp">The glasses! I want that lightweight design I can use everywhere.</div><div class="e_Mp">46<!-- -->%</div><div class="e_Sp"><div class="e_Lp" style="width:46%"></div></div></button><button type="button" class="e_Qp"><div class="e_Rp">Both.</div><div class="e_Mp">14<!-- -->%</div><div class="e_Sp"><div class="e_Lp" style="width:14%"></div></div></button><button type="button" class="e_Qp"><div class="e_Rp">None. I&#x27;m already invested in another XR/VR/AR platform.</div><div class="e_Mp">3<!-- -->%</div><div class="e_Sp"><div class="e_Lp" style="width:3%"></div></div></button><button type="button" class="e_Qp"><div class="e_Rp">I don&#x27;t care about XR, AR, or VR.</div><div class="e_Mp">28<!-- -->%</div><div class="e_Sp"><div class="e_Lp" style="width:28%"></div></div></button></div></div></div>

There's no special element or meaningful class name that would allow you to target these polls in the custom parser, but after comparing with some of their other articles that contain polls, it appears that the classes e_Oh and e_Kp are the ones used to define these polls.

Should these be stripped out? Or maybe it could also be useful to the reader to know that there's actually a poll there, I'm not sure.
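
As one possible middle ground, the poll could be reduced to a plain-text summary. The sketch below is an assumption-heavy illustration, not how Capy or the parser handles it: it relies only on the tag structure visible in the HTML above (an h3 question followed by button elements whose text ends in a percentage) rather than on the class names, and the helper names `decodeEntities`, `stripTags` and `summarizePoll` are hypothetical. Regex-based HTML handling is brittle and is used here only to keep the example self-contained.

```javascript
// Decode the few entity forms needed for this sketch (numeric, &quot;, &amp;).
function decodeEntities(text) {
  return text
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#(\d+);/g, (_, dec) => String.fromCodePoint(parseInt(dec, 10)))
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&');
}

// Remove comments and tags, then collapse whitespace.
function stripTags(html) {
  return decodeEntities(
    html
      .replace(/<!--[\s\S]*?-->/g, '')
      .replace(/<[^>]*>/g, ' ')
      .replace(/\s+/g, ' ')
      .trim()
  );
}

// Returns { question, options: [{ label, percent }] } or null if the
// markup does not look like a poll (no h3 or no buttons).
function summarizePoll(html) {
  const question = html.match(/<h3[^>]*>([\s\S]*?)<\/h3>/);
  const buttons = [...html.matchAll(/<button[^>]*>([\s\S]*?)<\/button>/g)];
  if (!question || buttons.length === 0) return null;

  const options = buttons.map((button) => {
    const text = stripTags(button[1]);
    // Split "Some label 46%" into its label and trailing percentage.
    const parts = text.match(/^(.*?)\s*(\d+)%$/);
    return parts
      ? { label: parts[1], percent: Number(parts[2]) }
      : { label: text, percent: null };
  });

  return { question: stripTags(question[1]), options };
}
```

Calling `summarizePoll` on the markup above would yield the question plus one entry per voting choice, which a reader app could render as a static list so the user at least knows a poll exists.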

@jocmp
Owner

jocmp commented Jan 8, 2025

@PhilC813 good find. Let me see if they're using JS for these. It would be nice to allow this functionality. If nothing else, I'll remove it to avoid the jarring markup.

@PhilC813
Contributor Author

In this article:
https://www.androidauthority.com/best-android-apps-2024-3506812/

The titles of the apps on display are stripped out in Capy. Could this be improved?

There's also another example of a poll for when you need to confirm your results with #699!

@jocmp
Owner

jocmp commented Jan 14, 2025

Yep! I'll roll that in with the polls since they also use h3 headings. (jocmp/mercury-parser#41)

Here's a preview from my markup tester. "Google Gemini" and "Mozilla Thunderbird" were previously hidden.

@jocmp
Owner

jocmp commented Jan 18, 2025

@PhilC813 the heading and poll updates are available as of 2025.01.1096-dev!

@PhilC813
Contributor Author

Not sure if it's the result of #699, but now none of the articles in the feed show anything below the header image when in full content mode, as you can see below.

Image

@jocmp
Owner

jocmp commented Jan 27, 2025

@PhilC813 unfortunately it looks like they changed their selectors again. I'm going to need to find a better way than using classes since they're unreliable for Android Authority.

https://github.com/jocmp/mercury-parser/blob/4e98d18a07eab2bd7030308ef368a83d0f5a06c0/src/extractors/custom/www.androidauthority.com/index.js#L23

@jocmp jocmp reopened this Jan 27, 2025
@jocmp jocmp closed this as completed Jan 27, 2025
@jocmp
Owner

jocmp commented Jan 27, 2025

Tracking via #779

@PhilC813
Contributor Author

unfortunately it looks like they changed their selectors again. I'm going to need to find a better way than using classes since they're unreliable for Android Authority.

Almost as if they want to annoy RSS reader devs. Thank you!

@PhilC813
Contributor Author

You're already spending way too much time optimizing this feed, but just in case: Android Authority's review pages also have a slightly broken layout in Capy:
https://www.androidauthority.com/garmin-instinct-3-amoled-review-3518583/

The "What we like/don't like" section, more specifically. There are also smaller things that could be improved in the general parser overall, like the fact that images are stacked vertically even though two of them could fit on one line, etc.

@jocmp
Owner

jocmp commented Jan 29, 2025

Good callout. This is something strange with SVG pictures. For some reason they're taking up more space than necessary. I may be able to strip them to avoid messing with custom styling.
