-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(Team 3) Testcases of the CSEARCH algorithm #99
Comments
@laisycs The second approach is not acceptable since the correctness of RegexMatcher depends on the correctness of the regex-to-boolean translator. We cannot do "cyclic testing" here. For the first approach, in general the problem of testing the equivalence of two boolean expressions can be viewed as a tree homomorphism problem, which is NP-complete (I think). Check https://en.wikipedia.org/wiki/Tree_homomorphism and http://www.ams.org/journals/tran/1996-348-04/S0002-9947-96-01537-1/S0002-9947-96-01537-1.pdf . One approach to do the equivalence testing is do some pruning before a recursive call for a child pair, since we can easily avoid comparing two nodes that are not the same "type." The worst-case complexity of this approach is still exponential. I don't know in practice how good it is. It will help if you can do some exercise on paper by using two boolean expressions from two real regex queries. Here's a simpler approach. We first convert each boolean expression to a disjunctive normal form (DNF). In other words, we "flatten" the boolean tree to a set of sets. Then we can test the equivalence of the two DNFs easily. The article http://math.stackexchange.com/questions/228969/how-to-convert-formula-to-disjunctive-normal-form talked about how to translate a boolean expression to DNF. I think there should be an algorithm to do the translation. Do you have any idea about the complexity of each boolean expression? How deep is the tree? What's the fanout of a node? The size of the DNS affects the complexity of the final testing. |
For the first approach, we mean to run RegexMatcher with translator, run it again without translator, and see if the results are the same. It this still cycle-testing? Currently we think the tree's height should be 3. The first level is a single AND, the second level is multiple ORs, the third level is multiple ANDs. We are looking for third party libraries that handles boolean expressions. Besides this comparison issue, we also want to do some boolean simplifications to the query. For example, we can simplify "a AND (a OR b)" to "a". It can potentially speed up our query. (I don't know if Lucene does this kind of simplifications internally, I'll test it later.) |
It's not cyclic testing. If we do this way, we are doing "integration testing" that uses two pieces of logic. But we need unit testing for the translator logic. So this approach is not good. Did you think about the approach based on DNF comparison? |
I'll do more research about DNF. As I said, besides testing, transforming boolean expressions can help us in simplification. So I think we should definitely try to do it. |
Agreed. Do the planning carefully given the limited amount of time we have this quarter. |
@zuozhi I think we could first implement comparison of boolean expressions, because we have to use it in our test cases. Since the tree is well formatted, it should be achievable. For simplifying a boolean expression, we could first do what Russ Cox's did (remove duplicate etc.), and see how it performs. |
Agreed. If the generated boolean expression is not too "big," we could use a simple comparator based on recursion, assuming it's fast to run. To make the comparison fast, for the children of a node, we can first partition them into different groups based on their types, then compare two sets of children "group by group." This method should be faster than generating all permutations of the children, even though its worst complexity is still exponential. |
Next step: |
Finished in #127. Issue closed. |
We are thinking about how to test the correctness of our implementation of CSEARCH algorithm. The input of this algorithm is a regex string, and the output is a boolean expression object (which has
toString()
function). For now we are considering two ways to test it.1, We manually translate a regex to a trigram boolean query, and compare it to the result generated by the program. However, two boolean expressions might not be completely identical, but they are equivalent. And its order doesn't matter. For example "a AND (b OR c)" is equivalent to "(c OR b) AND a". It makes us hard to compare the query. The complexity of comparison is O(n!) because we need to try out every matching.
2, We compare the result instead of query. We check if the result of matching a regex remains the same after using our translator program. The problem of this approach is, if a test case fails, we don't know what goes wrong. We can't test the correctness of each individual component. It also relies on the correctness of RegexMatcher.
Does anyone has any suggestions on this issue? Does anyone know a Java library dealing with boolean expressions? Thanks.
The text was updated successfully, but these errors were encountered: