Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Team3 regex Disjunctive Normal Form #127

Merged
merged 6 commits into from
Jun 15, 2016
Merged

Team3 regex Disjunctive Normal Form #127

merged 6 commits into from
Jun 15, 2016

Conversation

zuozhiw
Copy link
Collaborator

@zuozhiw zuozhiw commented Jun 9, 2016

The purpose of this PR is to transform the query from a very messy form, with possibly lots of redundant information to Disjunctive Normal Form(DNF). Here's an example.

Regex is abc?pqr?, which can possibly match: abpq, abpqr, abcpq, abcpqr

The original query generated by our program is (((bpq AND abp) OR (abc AND cpq AND bcp)) AND (bpq OR cpq) AND (bpq OR cpq OR (pqr AND cpq) OR (pqr AND bpq)) AND (pqr OR bpq OR cpq) AND (abc OR abp)). It is so long and complicated that we can only print it out, go through it manually and see if it's correct (it is correct BTW).

When the regex goes a little bit complicated, the query generated is like crazy and we can't write automated test on it. So professor Li suggested us to translate it to DNF, (I believe CNF, Conjunctive normal form won't make a difference).

So this is the code to transform the query to DNF, and then I can easily do some simplifications on it. (in simplifyDNF function).

Here's the query after transforming to DNF and applying simplification:
(abp AND bpq) OR (abc AND bcp AND cpq)

Remember the regex can match abpq, abpqr, abcpq, abcpqr,
abpqr is redundant because of abpq, abcpqr is redundant because of abcpq.

We now get a much better query :)

This is our last TODO, we (team 3) are mostly finished after merging this.

// "AND" two DNF tree
// Apply distributive laws:
// (a OR b) AND (c OR d) --> (a AND c) OR (a AND d) OR (b AND c) OR (c AND d)
private static GramBooleanQuery andDNF(GramBooleanQuery left, GramBooleanQuery right) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is basically: for each node in left query, AND it with each node in right query.
It's long because we put "leaf nodes" in a list, and "nodes with children" in another list. If we treat leaf nodes and non-leaf nodes as the same in the first place, this function will be much more simpler.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had an earlier discussion about this topic. I assume this logic is based on RE2J. Do you think it's better to use the cleaner design we talked before, in which we treat all such nodes uniformly? It can simplify this logic. We can do the refactoring as a separate PR. (It's optional though.)

@JavierJia
Copy link
Contributor

Some minor comments. A general request, do you have any reference about the classic algorithm to transform a general binary form to DNF?

@zuozhiw
Copy link
Collaborator Author

zuozhiw commented Jun 9, 2016

I googled around about the algorithm, but I can't find any. Some of them are just questions on StackOverflow. But the idea is just to apply boolean logical equivalence rules again and again.

I used 3 rules here:
Associative laws: (a OR b) OR c = a OR (b OR c), when transforming OR nodes,
Distributive laws: a AND (b OR c) = (a AND b) OR (a AND c), when transforming AND nodes,
Absorption laws: a OR (a AND b) = a, when simplifying DNF.

@JavierJia
Copy link
Contributor

Nice, could you add those laws in comments?

@chenlica
Copy link
Collaborator

chenlica commented Jun 9, 2016

Please add these comments to the code.

return true;
}

// "AND" two DNF tree
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"two DNF tree" -> "two DNF trees"

@chenlica
Copy link
Collaborator

chenlica commented Jun 9, 2016

General comments on this PR:

  • In the PR and codebase, add a high-level description of its purpose and the main idea of the algorithm.
  • Include examples in the comments.
  • Get rid of those "println()" statements.

After these changes, I will review it again.

Just curious: it will be good to use a separate PR to refactor the code to use a more consistent way to represent the children of a node in a boolean expression. Is it better to do that PR before or after THIS PR?

@JavierJia
Copy link
Contributor

JavierJia commented Jun 9, 2016

Just curious: it will be good to use a separate PR to refactor the code to use a more consistent way to represent the children of a node in a boolean expression. Is it better to do that PR before or after THIS PR?

IMHO, it should be a separate one after this one that every task will converge in an iterative model.

@chenlica
Copy link
Collaborator

chenlica commented Jun 9, 2016

@JavierJia : I am not following your comment. Can you elaborate?

@JavierJia
Copy link
Contributor

I mean that each PR doesn't need to be perfect to submit. It's OK to have pending issues, especially for the refactoring issues that usually happens after the code is functioning.
One principle I like is to merge fast and update fast 😄

@chenlica
Copy link
Collaborator

chenlica commented Jun 9, 2016

Agreed.

@zuozhiw
Copy link
Collaborator Author

zuozhiw commented Jun 13, 2016

Hi @chenlica @JavierJia , it's been a few days and I finally have time to finish this PR.
I just added:
examples and high-level explanations to the comments,
a few minor changes in the tests.

The print in the tests is just a helper function. It's there for debugging purposes. I can conveniently call that function when something goes wrong.

Since the code is working fine, let's merge this PR into master.

I did the refactoring on my flight, it's almost done, but there are still some bugs. I also need to do some cleanups. I'll finish it soon.

}

// "ab" can't form a trigram, so the result is an empty OR node.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also depends on the length of grams, which may not be 3.

@chenlica
Copy link
Collaborator

Agreed that we want to merge the PR soon. I still don't see the comments on (1) How to translate each boolean expression to a DNF, and (2) how to test the equivalence of two DNFs. Some high-level description will be very helpful. Can you add them?

…orm, 2, how transforming to DNF helps comparison of two trees
@zuozhiw
Copy link
Collaborator Author

zuozhiw commented Jun 15, 2016

@chenlica
Done.
High level explanations are added in comments.
(1) how to transform to DNF is added in toDNF() function,
(2) how to compare is added in RegexToGramQueryTranslatorTest.java

* Associative laws: (a OR b) OR c = a OR (b OR c) = a OR b OR c, when transforming OR nodes,
* Distributive laws: a AND (b OR c) = (a AND b) OR (a AND c), when transforming AND nodes,
*
* For each node, it's children will be transformed to DNF form first, then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"it's" -> "its". General comment: please proofread comments to fix simple typos.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll proofread more carefully in the future. I know I often make grammar errors.

@chenlica
Copy link
Collaborator

I left some minor comments. After taking care of them, please go ahead to do the merge!

@zuozhiw zuozhiw merged commit e8794cc into master Jun 15, 2016
@zuozhiw zuozhiw deleted the team3-regex-dnf branch June 15, 2016 03:10
@zuozhiw zuozhiw mentioned this pull request Jul 6, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants