make_unicode_tables.awk is now UnicodeTablesGenerator
UnicodeTablesGenerator uses Unicode data from ICU4J to generate Unicode
tables for consumption by RE2/J. Output is google-java-formatted before
it is written.

No new runtime dependencies are added to RE2/J.

The generator uses ICU4J 4.8.2, which bundles Unicode 6.0.0. This keeps
it compatible with Java 8, which RE2/J targets. We should consider how
to upgrade to later Unicode versions without introducing
inconsistencies (e.g. RE2/J matching something that shouldn't match
according to java.lang.Character data).
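
One way to probe for such inconsistencies is to derive code point ranges
directly from java.lang.Character and compare them against the generated
tables. The following is a minimal sketch only: the class
`JdkCategoryRanges` and its method `rangesFor` are hypothetical names
that are not part of RE2/J or of this commit, and the comparison against
UnicodeTables is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of RE2/J): derives ranges from the JDK's own Unicode data. */
public class JdkCategoryRanges {

  /**
   * Returns inclusive {lo, hi} ranges covering every code point whose general
   * category, as reported by Character.getType, equals the given type.
   */
  static List<int[]> rangesFor(int type) {
    List<int[]> ranges = new ArrayList<>();
    int start = -1; // first code point of the run being built, or -1 if none
    for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
      boolean inCategory = Character.getType(cp) == type;
      if (inCategory && start < 0) {
        start = cp; // a new run begins here
      } else if (!inCategory && start >= 0) {
        ranges.add(new int[] {start, cp - 1}); // the run ended at the previous code point
        start = -1;
      }
    }
    if (start >= 0) {
      ranges.add(new int[] {start, Character.MAX_CODE_POINT});
    }
    return ranges;
  }

  public static void main(String[] args) {
    // Print the JDK's view of the Cc (control) category.
    for (int[] r : rangesFor(Character.CONTROL)) {
      System.out.printf("U+%04X..U+%04X%n", r[0], r[1]);
    }
  }
}
```

Because the ranges come from whatever Unicode version the running JDK
ships, the same check can be repeated on each JDK that RE2/J supports.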

There are some differences in the generated tables:
  * The new tables do not contain binary property character ranges (e.g.
    ASCII_Hex_Digit), as those tables are currently unused in RE2/J.

  * The Cc (control) char class now contains NUL (U+0000). This is correct
    and was also the subject of #26.
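
To illustrate what the NUL change means for a lookup, here is a small
sketch of a range-table membership test. It assumes a table layout of
{lo, hi, stride} triples sorted by lo. The class `RangeTableLookup`, the
constant `CC`, and the method `is` are illustrative names written for
this example and are not copied from the generated UnicodeTables.java.

```java
/** Illustrative sketch (not RE2/J source): binary search over {lo, hi, stride} ranges. */
public class RangeTableLookup {

  // The Cc (control) code points as two ranges, with U+0000 included.
  static final int[][] CC = {
    {0x0000, 0x001F, 1},
    {0x007F, 0x009F, 1},
  };

  /** Reports whether code point r is covered by the sorted, non-overlapping table. */
  static boolean is(int[][] table, int r) {
    int lo = 0;
    int hi = table.length;
    while (lo < hi) {
      int mid = lo + (hi - lo) / 2;
      int[] range = table[mid];
      if (r < range[0]) {
        hi = mid; // r lies before this range
      } else if (r > range[1]) {
        lo = mid + 1; // r lies after this range
      } else {
        // r is within [lo, hi]; it is a member only if it falls on the stride.
        return (r - range[0]) % range[2] == 0;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(is(CC, 0x0000)); // NUL
    System.out.println(is(CC, 0x0020)); // SPACE
  }
}
```

With a table whose first range started at U+0001 instead, the same
lookup would report NUL as not being a control character.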

See https://github.com/google/re2j/files/4725343/diff.txt for a full
list of differences between the old tables and the new.
sjamesr committed Jun 3, 2020
1 parent 02f0c9f commit eedfe4b
Showing 7 changed files with 476 additions and 202 deletions.
17 changes: 0 additions & 17 deletions java/com/google/re2j/Unicode.java
@@ -75,23 +75,6 @@ static boolean isUpper(int r) {
     return is(UnicodeTables.Upper, r);
   }

-  // isLower reports whether the rune is a lower case letter.
-  static boolean isLower(int r) {
-    // See comment in isGraphic.
-    if (r <= MAX_LATIN1) {
-      return Character.isLowerCase((char) r);
-    }
-    return is(UnicodeTables.Lower, r);
-  }
-
-  // isTitle reports whether the rune is a title case letter.
-  static boolean isTitle(int r) {
-    if (r <= MAX_LATIN1) {
-      return false;
-    }
-    return is(UnicodeTables.Title, r);
-  }
-
   // isPrint reports whether the rune is printable (Unicode L/M/N/P/S or ' ').
   static boolean isPrint(int r) {
     if (r <= MAX_LATIN1) {
181 changes: 0 additions & 181 deletions java/com/google/re2j/make_unicode_tables.awk

This file was deleted.

7 changes: 3 additions & 4 deletions javatests/com/google/re2j/ParserTest.java
@@ -23,6 +23,7 @@
 import java.util.EnumMap;
 import java.util.Map;

+import com.google.common.truth.Truth;
 import org.junit.Test;

 /**
@@ -286,7 +287,7 @@ public boolean applies(int r) {
   // - Java UTF-16 things.

   @Test
-  public void testParseSimple() throws Exception {
+  public void testParseSimple() {
     testParseDump(PARSE_TESTS, TEST_FLAGS);
   }

@@ -346,9 +347,7 @@ private void testParseDump(String[][] tests, int flags) {
       try {
         Regexp re = Parser.parse(test[0], flags);
         String d = dump(re);
-        if (!test[1].equals(d)) {
-          fail(String.format("parse/dump of " + test[0] + " expected " + test[1] + ", got " + d));
-        }
+        Truth.assertWithMessage("parse/dump of " + test[0]).that(d).isEqualTo(test[1]);
       } catch (PatternSyntaxException e) {
         throw new RuntimeException("Parsing failed: " + test[0], e);
       }
1 change: 1 addition & 0 deletions settings.gradle
@@ -1,3 +1,4 @@
 rootProject.name = 're2j'

 include ':benchmarks'
+include ':unicode'
9 changes: 9 additions & 0 deletions unicode/README.md
@@ -0,0 +1,9 @@
Utilities for emitting Unicode tables used by RE2J.

To rebuild the Unicode tables, run:

```
./gradlew :unicode:run -q > java/com/google/re2j/UnicodeTables.java
```

from the project root directory.
17 changes: 17 additions & 0 deletions unicode/build.gradle
@@ -0,0 +1,17 @@
plugins {
  id 'java'
  id 'application'
}

mainClassName = 'com.google.re2j.UnicodeTablesGenerator'

repositories {
  mavenCentral()
}

dependencies {
  compile 'com.google.googlejavaformat:google-java-format:1.0'
  compile 'com.squareup:javapoet:1.12.1'
  compile 'com.ibm.icu:icu4j:4.8.2'
  compile 'com.google.guava:guava:29.0-jre'
}