Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check template UTF8 validity before parsing #1774

Merged
merged 1 commit into from
Jan 10, 2024
Merged

Conversation

ggmichaelgo
Copy link
Contributor

@ggmichaelgo ggmichaelgo commented Jan 5, 2024

What are you trying to solve?

Liquid relies on regex for parsing, and Liquid needs to check the template's String encoding validity.

Currently, Liquid is raising a ArgumentError when a template includes an invalid UTF8 byte sequence.

require 'liquid'
Liquid::Template.parse("{% assign foo = '\xC0' %}") 

# lib/liquid/tokenizer.rb:31:in `split': invalid byte sequence in UTF-8 (ArgumentError)

Instead of throwing the ArgumentError, this PR updates Liquid to raise a Syntax error when a template has a invalid encoding.

With this change, the developers won't have to catch the invalid encoding error like this:

begin
  Liquid::Template.parse("\xC0")
rescue ArugmentError => e
  if e.message == "invalid byte sequence in UTF-8"
     ...
  else
    raise e
  end
end

Instead, the error can be handled like this

begin
  Liquid::Template.parse("\xC0")
rescue Liquid::Error
     ...
end

@ggmichaelgo ggmichaelgo requested a review from a team January 5, 2024 22:05
@@ -107,6 +107,11 @@ def initialize
# Returns self for easy chaining
def parse(source, options = {})
parse_context = configure_options(options)

unless source.valid_encoding?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the regex always throw the UTF-8 error? If so, then this duplicates the validation work. What if we caught the exception and re-threw instead?

Copy link
Contributor Author

@ggmichaelgo ggmichaelgo Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is also for Liquid-C parser. Since Liquid-C doesn't use regex, it allowed Liquid templates like this:

{% assign foo = "Broken String\x00" %}
{% comment \xC0 %}
  \xC0
{% endcomment \xC0 %}

With this change, we won't have to implement UTF-8 validation check on Liquid-C.

Also, the UTF-8 error would have to be handled by checking error message, and I think it is more cleaner to check the validity by using String#valid_encoding?

rescue ArgumentError => e
  if e.message == "invalid byte sequence in UTF-8"

@ggmichaelgo ggmichaelgo requested a review from samdoiron January 9, 2024 18:49
@ggmichaelgo ggmichaelgo force-pushed the check-utf8-validity branch 2 times, most recently from b360602 to 89a3841 Compare January 10, 2024 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants