Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Team 3) Testcases of the CSEARCH algorithm #99

Closed
laisycs opened this issue May 13, 2016 · 10 comments
Closed

(Team 3) Testcases of the CSEARCH algorithm #99

laisycs opened this issue May 13, 2016 · 10 comments

Comments

@laisycs
Copy link
Contributor

laisycs commented May 13, 2016

We are thinking about how to test the correctness of our implementation of CSEARCH algorithm. The input of this algorithm is a regex string, and the output is a boolean expression object (which has toString() function). For now we are considering two ways to test it.

1, We manually translate a regex to a trigram boolean query, and compare it to the result generated by the program. However, two boolean expressions might not be completely identical, but they are equivalent. And its order doesn't matter. For example "a AND (b OR c)" is equivalent to "(c OR b) AND a". It makes us hard to compare the query. The complexity of comparison is O(n!) because we need to try out every matching.

2, We compare the result instead of query. We check if the result of matching a regex remains the same after using our translator program. The problem of this approach is, if a test case fails, we don't know what goes wrong. We can't test the correctness of each individual component. It also relies on the correctness of RegexMatcher.

Does anyone has any suggestions on this issue? Does anyone know a Java library dealing with boolean expressions? Thanks.

@laisycs
Copy link
Contributor Author

laisycs commented May 13, 2016

@zuozhi @chenlica

@chenlica
Copy link
Collaborator

@laisycs The second approach is not acceptable since the correctness of RegexMatcher depends on the correctness of the regex-to-boolean translator. We cannot do "cyclic testing" here.

For the first approach, in general the problem of testing the equivalence of two boolean expressions can be viewed as a tree homomorphism problem, which is NP-complete (I think). Check https://en.wikipedia.org/wiki/Tree_homomorphism and http://www.ams.org/journals/tran/1996-348-04/S0002-9947-96-01537-1/S0002-9947-96-01537-1.pdf .

One approach to do the equivalence testing is do some pruning before a recursive call for a child pair, since we can easily avoid comparing two nodes that are not the same "type." The worst-case complexity of this approach is still exponential. I don't know in practice how good it is. It will help if you can do some exercise on paper by using two boolean expressions from two real regex queries.

Here's a simpler approach. We first convert each boolean expression to a disjunctive normal form (DNF). In other words, we "flatten" the boolean tree to a set of sets. Then we can test the equivalence of the two DNFs easily.

The article http://math.stackexchange.com/questions/228969/how-to-convert-formula-to-disjunctive-normal-form talked about how to translate a boolean expression to DNF. I think there should be an algorithm to do the translation.

Do you have any idea about the complexity of each boolean expression? How deep is the tree? What's the fanout of a node? The size of the DNS affects the complexity of the final testing.

@zuozhiw
Copy link
Collaborator

zuozhiw commented May 13, 2016

For the first approach, we mean to run RegexMatcher with translator, run it again without translator, and see if the results are the same. It this still cycle-testing?

Currently we think the tree's height should be 3. The first level is a single AND, the second level is multiple ORs, the third level is multiple ANDs. We are looking for third party libraries that handles boolean expressions. Besides this comparison issue, we also want to do some boolean simplifications to the query. For example, we can simplify "a AND (a OR b)" to "a". It can potentially speed up our query. (I don't know if Lucene does this kind of simplifications internally, I'll test it later.)

@chenlica
Copy link
Collaborator

It's not cyclic testing. If we do this way, we are doing "integration testing" that uses two pieces of logic. But we need unit testing for the translator logic. So this approach is not good.

Did you think about the approach based on DNF comparison?

@zuozhiw
Copy link
Collaborator

zuozhiw commented May 13, 2016

I'll do more research about DNF. As I said, besides testing, transforming boolean expressions can help us in simplification. So I think we should definitely try to do it.
For now, I need to carefully think about what the tree will eventually look like. In the implementation, besides AND and OR, we have ANY (match any string, it's basically scan based) and NONE (match no string, it's for handling errors). I need to figure out how ANY and NONE affect the boolean expression tree.

@chenlica
Copy link
Collaborator

Agreed. Do the planning carefully given the limited amount of time we have this quarter.

@laisycs
Copy link
Contributor Author

laisycs commented May 13, 2016

@zuozhi I think we could first implement comparison of boolean expressions, because we have to use it in our test cases. Since the tree is well formatted, it should be achievable.

For simplifying a boolean expression, we could first do what Russ Cox's did (remove duplicate etc.), and see how it performs.

@chenlica
Copy link
Collaborator

Agreed. If the generated boolean expression is not too "big," we could use a simple comparator based on recursion, assuming it's fast to run. To make the comparison fast, for the children of a node, we can first partition them into different groups based on their types, then compare two sets of children "group by group." This method should be faster than generating all permutations of the children, even though its worst complexity is still exponential.

laisycs added a commit that referenced this issue May 17, 2016
@laisycs
Copy link
Contributor Author

laisycs commented Jun 3, 2016

Next step:
Design and implement algorithm to compare two boolean expressions. The basic idea is to format boolean expressions into DNF, and compare two DNFs.

@zuozhiw
Copy link
Collaborator

zuozhiw commented Jun 15, 2016

Finished in #127. Issue closed.

@zuozhiw zuozhiw closed this as completed Jun 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants