bug: Strange input results in false positive #49

HatScripts · 2024-01-07T05:02:13Z

Expected behavior

When I input the following string:

    "" ""
    "" ""
    "" ""
Assamese -> Assam

I expect that there should be no censoring.

Actual behavior

However, Assam becomes A*sam.

Strangely, modifying parts of the string, such as the quotes ("), results in no censoring.

Minimal reproducible example

import {
  RegExpMatcher,
  TextCensor,
  englishDataset,
  englishRecommendedTransformers,
  keepStartCensorStrategy,
  keepEndCensorStrategy,
  asteriskCensorStrategy
} from 'obscenity'

const matcher = new RegExpMatcher({
  ...englishDataset.build(),
  ...englishRecommendedTransformers
})

const strategy = keepStartCensorStrategy(keepEndCensorStrategy(asteriskCensorStrategy()))
const censor = new TextCensor().setStrategy(strategy)

const input = `    "" ""
    "" ""
    "" ""
Assamese -> Assam`

const matches = matcher.getAllMatches(input)
console.log(censor.applyTo(input, matches))

Steps to reproduce

Run the above code
View console

Additional context

No response

Node.js version

N/A

Obscenity version

0.2.0

Priority

Low
Medium
High

Terms

I agree to follow the project's Code of Conduct.
I have searched existing issues for similar reports.


HatScripts · 2024-01-07T05:07:31Z

const input = `    
    
    
Assamese -> Assam`

Additionally, removing the quotes and leaving just the whitespace (4 spaces on each line) still results in the unexpected censoring. If you delete any of these spaces, no censoring occurs.

jo3-l · 2024-01-07T05:54:05Z

The error seems to be in the whitelisted term matching logic. In particular, we are using an index into the original input where we should instead be using an index into the transformed input, which causes the second assa to be skipped over*. The following diff seems to fix it, if this is urgent for you:

diff --git a/src/matcher/regexp/RegExpMatcher.ts b/src/matcher/regexp/RegExpMatcher.ts
index 7f4fdb1..af31d87 100644
--- a/src/matcher/regexp/RegExpMatcher.ts
+++ b/src/matcher/regexp/RegExpMatcher.ts
@@ -161,7 +161,7 @@ export class RegExpMatcher implements Matcher {
                                }
 
                                matches.insert(indices[startIndex], endIndex);
-                               lastEnd = endIndex + 1;
+                               lastEnd = startIndex + whitelistedTerm.length;
                        }
                }

I will hold off on a patch release until I have time to look at this more carefully, though. The matching logic is fairly complex and I would like to refamiliarize myself with the implementation to ensure this is fully correct first (particularly in cases with non-ASCII characters.) Unfortunately, as I said in #46, this may have to wait until late this month or early February. Apologies.

*I verified that there is no security issue with OOB access due to this mismatch--it should be purely a matter of correctness.
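To illustrate the index mismatch described above, here is a minimal, self-contained sketch. It is not the library's actual code: the function `findWhitelisted`, its `useFix` flag, and the transformation (lowercasing and dropping all whitespace) are assumptions chosen to keep the example short. Only the two `lastEnd` assignments mirror the diff.

```typescript
// Hypothetical, simplified model of whitelisted-term matching.
// The whitespace-dropping transform is an assumption for illustration only.
type Interval = [number, number];

function findWhitelisted(input: string, term: string, useFix: boolean): Interval[] {
  // Build the transformed text and a map from each transformed index
  // back to the corresponding index in the original input.
  let transformed = '';
  const indices: number[] = [];
  for (let i = 0; i < input.length; i++) {
    if (/\s/.test(input[i])) continue;
    transformed += input[i].toLowerCase();
    indices.push(i);
  }

  const matches: Interval[] = [];
  let lastEnd = 0; // intended to be an index into `transformed`
  for (
    let startIndex = transformed.indexOf(term, lastEnd);
    startIndex !== -1;
    startIndex = transformed.indexOf(term, lastEnd)
  ) {
    // End of the match, expressed as an index into the ORIGINAL input.
    const endIndex = indices[startIndex + term.length - 1];
    matches.push([indices[startIndex], endIndex]);
    lastEnd = useFix
      ? startIndex + term.length // stays in transformed coordinates
      : endIndex + 1; // mixes an original-input index into the search
  }
  return matches;
}

// Three lines of four spaces each, as in the second report.
const input = '    \n    \n    \nAssamese -> Assam';
const buggy = findWhitelisted(input, 'assa', false);
const fixed = findWhitelisted(input, 'assa', true);
console.log(JSON.stringify(buggy)); // [[15,18]]
console.log(JSON.stringify(fixed)); // [[15,18],[27,30]]
```

In this model the leading whitespace pushes the original-input index of the first match past the end of the shorter transformed string, so the buggy variant never finds the second assa. That is consistent with the observation that deleting spaces makes the false positive go away.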

HatScripts added the "bug" (Something isn't working) label Jan 7, 2024

jo3-l added a commit that referenced this issue Aug 2, 2024

test: add failing test for #49

bb432e7

jo3-l closed this as completed in ebf95ad Aug 2, 2024
