-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Team3 regex Disjunctive Normal Form #127
Conversation
// "AND" two DNF tree | ||
// Apply distributive laws: | ||
// (a OR b) AND (c OR d) --> (a AND c) OR (a AND d) OR (b AND c) OR (c AND d) | ||
private static GramBooleanQuery andDNF(GramBooleanQuery left, GramBooleanQuery right) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is basically: for each node in left query, AND it with each node in right query.
It's long because we put "leaf nodes" in a list, and "nodes with children" in another list. If we treat leaf nodes and non-leaf nodes as the same in the first place, this function will be much more simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had an earlier discussion about this topic. I assume this logic is based on RE2J. Do you think it's better to use the cleaner design we talked before, in which we treat all such nodes uniformly? It can simplify this logic. We can do the refactoring as a separate PR. (It's optional though.)
Some minor comments. A general request, do you have any reference about the classic algorithm to transform a general binary form to DNF? |
I googled around about the algorithm, but I can't find any. Some of them are just questions on StackOverflow. But the idea is just to apply boolean logical equivalence rules again and again. I used 3 rules here: |
Nice, could you add those laws in comments? |
Please add these comments to the code. |
return true; | ||
} | ||
|
||
// "AND" two DNF tree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"two DNF tree" -> "two DNF trees"
General comments on this PR:
After these changes, I will review it again. Just curious: it will be good to use a separate PR to refactor the code to use a more consistent way to represent the children of a node in a boolean expression. Is it better to do that PR before or after THIS PR? |
IMHO, it should be a separate one |
@JavierJia : I am not following your comment. Can you elaborate? |
I mean that each PR doesn't need to be perfect to submit. It's OK to have pending issues, especially for the refactoring issues that usually happens after the code is functioning. |
Agreed. |
Hi @chenlica @JavierJia , it's been a few days and I finally have time to finish this PR. The print in the tests is just a helper function. It's there for debugging purposes. I can conveniently call that function when something goes wrong. Since the code is working fine, let's merge this PR into master. I did the refactoring on my flight, it's almost done, but there are still some bugs. I also need to do some cleanups. I'll finish it soon. |
} | ||
|
||
// "ab" can't form a trigram, so the result is an empty OR node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also depends on the length of grams, which may not be 3.
Agreed that we want to merge the PR soon. I still don't see the comments on (1) How to translate each boolean expression to a DNF, and (2) how to test the equivalence of two DNFs. Some high-level description will be very helpful. Can you add them? |
…orm, 2, how transforming to DNF helps comparison of two trees
@chenlica |
* Associative laws: (a OR b) OR c = a OR (b OR c) = a OR b OR c, when transforming OR nodes, | ||
* Distributive laws: a AND (b OR c) = (a AND b) OR (a AND c), when transforming AND nodes, | ||
* | ||
* For each node, it's children will be transformed to DNF form first, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"it's" -> "its". General comment: please proofread comments to fix simple typos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll proofread more carefully in the future. I know I often make grammar errors.
I left some minor comments. After taking care of them, please go ahead to do the merge! |
The purpose of this PR is to transform the query from a very messy form, with possibly lots of redundant information to Disjunctive Normal Form(DNF). Here's an example.
Regex is abc?pqr?, which can possibly match: abpq, abpqr, abcpq, abcpqr
The original query generated by our program is (((bpq AND abp) OR (abc AND cpq AND bcp)) AND (bpq OR cpq) AND (bpq OR cpq OR (pqr AND cpq) OR (pqr AND bpq)) AND (pqr OR bpq OR cpq) AND (abc OR abp)). It is so long and complicated that we can only print it out, go through it manually and see if it's correct (it is correct BTW).
When the regex goes a little bit complicated, the query generated is like crazy and we can't write automated test on it. So professor Li suggested us to translate it to DNF, (I believe CNF, Conjunctive normal form won't make a difference).
So this is the code to transform the query to DNF, and then I can easily do some simplifications on it. (in simplifyDNF function).
Here's the query after transforming to DNF and applying simplification:
(abp AND bpq) OR (abc AND bcp AND cpq)
Remember the regex can match abpq, abpqr, abcpq, abcpqr,
abpqr is redundant because of abpq, abcpqr is redundant because of abcpq.
We now get a much better query :)
This is our last TODO, we (team 3) are mostly finished after merging this.