Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AO3-6719 Leave src URLs instead of emptiness for images stripped in AO3-6564 #4799

Merged
merged 20 commits into from
Apr 28, 2024

Conversation

sarken
Copy link
Collaborator

@sarken sarken commented Apr 25, 2024

Issue

https://otwarchive.atlassian.net/browse/AO3-6719

Purpose

Adds a keep_src keyword argument to the strip_images method that lets us have the option of replacing the stripped img tag with its src URL instead of empty space. I'm happy to give it a different name or make it a separate method. I very much hate the hack to make it work with tests, but I also felt like I'd get side-eyed for changing the tests. 😆 I can do that instead, though!

Replaces the strip_images keyword argument on sanitize_field with a new argument called image_safety_mode. We use it in the same places we were using strip_images, but it's a more flexible name in case we decide to switch to placeholders or go back to nothingness or something else entirely -- right now it uses strip_images with keep_src: true on the sanitized content. Again, delighted to give it a different name if you have one.

I assumed that we wouldn't mind including the image URL in the main content of support tickets and abuse reports, as well as the emailed copies of same. They were kind of a weird hybrid of "affected by AO3-6564" and "not affected by AO3-6564." Happy to go back to no images at all, though.

ETA: And the big one, make it so we can use the config to temporarily strip images in work comments without a deploy. (You can do tags, too, but that seems unlikely to be an issue.) We'll need to set PARENTS_WITH_IMAGE_SAFETY_MODE: ["AdminPost"] when this is deployed to keep the current behavior of stripping images from news post comments.

Testing Instructions

How can the Archive's QA team verify that this is working as you intended?

If you have a Jira account with access, please update or comment on the issue
with any new or missing testing instructions instead.

References

@sarken
Copy link
Collaborator Author

sarken commented Apr 25, 2024

Wait, when did we change Rubocop from wanting %r{} back to //? Was it intentional?

I don't care -- I actually like the slashes better -- but I don't want to change it until I'm sure Rubocop is doing what we want.

@Bilka2
Copy link
Contributor

Bilka2 commented Apr 25, 2024

I think we've always used the default configuration, which is slashes with AllowInnerSlashes false. That makes it want r%{} for expressions with slashes in them like on this old PR and // otherwise, like in this PR. You can also see that behaviour in this code revert.

Comment on lines 153 to 155
# The cleaned field will always use double quotes (") but the fields
# in tests aren't always cleaned, so we're allowing double or single
# quotes here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, how sure is it that it's only the tests where the fields aren't cleaned? An uncleaned field could have unexpected formatting beyond just different quotes, which would break/bypass the two regular expressions here.

A lot of the time, this function is called from sanitize_field. I think it would be more reliable to use Sanitize or nokogiri in sanitize_value to do the stripping. There we're already dealing with those libs and I'd expect them to be more reliable than making our own regexp (don't reinvent the wheel and all that).

Doing the image stripping in sanitize_value would also make things a bit nicer because we could use a config similar to ArchiveConfig.FIELDS_ALLOWING_VIDEO_EMBEDS to designate fields to clean (or rather, not clean), instead of calling strip_images during display. That also makes it less likely that some display spots will get missed like it already happened in AO3-6564.

With the source urls getting preserved, doing this in the sanitisation step also wouldn't lose any data, if that was a concern previously.

@sarken
Copy link
Collaborator Author

sarken commented Apr 26, 2024

Hold pls.

@sarken sarken closed this Apr 26, 2024
@sarken sarken changed the title AO3-6719 Leave src URLS instead of emptiness for images stripped in AO3-6564 AO3-6719 Leave src URLs instead of emptiness for images stripped in AO3-6564 Apr 26, 2024
@sarken sarken reopened this Apr 26, 2024
@sarken sarken added the Has Production Config Changes Modifies the config file and needs special attention when deploying label Apr 26, 2024
@sarken sarken closed this Apr 26, 2024
@sarken sarken reopened this Apr 26, 2024
spec/lib/html_cleaner_spec.rb Show resolved Hide resolved
end

it "removes the img tag entirely when the src uses single quotes" do
string = "Hi! <img src='http://example.org/image.png'> Bye"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behaviour seems a bit confusing. I know in real usage we're not really going to have single quotes, but for consistency would it be better to treat single/double quotes the same?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, idk. I originally treated them the same for consistency, even though I don't really like writing code just to make tests pass, but after Bilka's comment, I got more skittish about that approach.

I think maybe we should have different behavior so it forces us to think about what's going on, e.g., if we're seeing the wrong behavior on production, that's a big yikes because it means the text isn't getting sanitized properly, and if we're seeing the wrong behavior in tests, is it because we're making test data in a way that's bypassing the sanitizer, or because there's actually a problem.

That said, I'm happy to take the path of least resistance and do ['"] in the regex instead.

Copy link
Contributor

@Bilka2 Bilka2 Apr 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can think of ways to break this new regexp, from more spaces to newlines. But I don't think rewriting this regexp until I run out of obvious ways to break it will get us anywhere actually safe, that's why I suggested using a lib for it. The regexp will never quite protect from the case of some place accidentally doing sanitize_field(strip_images()) (perhaps hidden behind or separated by steps of abstraction) instead of using the arg or the correct order. Or one of the myriad of other ways unsanitized content could end up in strip_images (like in tests) while still appearing safe at first glance.

But I see that the dynamic config option means it can't happen during the sanitize step and tossing the whole thing into a whole nokogiri/Sanitize setup for a second time in strip_images isn't a great solution either. Furthermore, this PR is not the right place to rework comments to store unsanitized content and only sanitize (and strip images during sanitization) for display.

So, I'd say to make a regexp that's readable/feels good to you and only secondary that it handles unexpected input. And then in the integration tests where things should be sanitized beforehand, use an img tag that specifically would break the regexp if things run in the wrong order. That should be slightly more protection against the "big yikes" than the current setup. Not good, but better 😭

Copy link
Contributor

@Bilka2 Bilka2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three minor missed things in the tests, rest looks good!

end

context "when the comment is on a tag" do
let(:comment) { create(:comment, :on_tag) }

it_behaves_like "a notification email with a link to the comment"
it_behaves_like "a notification email with a link to reply to the comment"
it_behaves_like "a comment subject to image safety mode settings"
Copy link
Contributor

@Bilka2 Bilka2 Apr 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This it_behaves_like should also be added to the context "when the comment is a reply to another comment" a few lines below this one.


before do
comment.comment_content += image_tag
comment.save!
end

it "strips the image from the email message" do
expect(email).not_to have_html_part_content(image_tag)
context "when image safety mode is enabled for admin post comments" do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This context should check text content, like its counterpart for #edited_comment_notification below.

Comment on lines +9 to +11
When I view <commentable> with comments
Then I should see the image "src" text "https://example.com/image.jpg"
And I should not see "OMG! https://example.com/image.jpg"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test added to add_comment.feature in #4729 can be removed, because this tests the same.

@brianjaustin brianjaustin merged commit f59cb5e into otwcode:master Apr 28, 2024
26 checks passed
@sarken sarken deleted the AO3-6719 branch May 2, 2024 03:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Has Production Config Changes Modifies the config file and needs special attention when deploying Reviewed: Ready to Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants