Character confusion fix suggestion #3144
Comments
Do you want to send a pull request with the suggested fix? |
What do you check |
I could create a PR yes, but the threshold might not be universal |
Just want to avoid empty space and null char |
Would |
Which other values beside 0.88 did you test? Would, for example, 0.75 or 0.9 also work fine? |
Yes we tested other values too, from 0.7 to 0.9 and found out that 0.88 behaves the best |
The 139 is a null char for us. |
I believe it will be a different number in other traineddata files. |
That's why I was asking. |
@EucliTs0, which language(s) / script(s) did you use in your tests? Did you use fast or best traineddata? I have just run a test on the TIFF files from test/testing and used this conditional:
This fixed several confusions, all similar to this one:
I would have expected Internally Tesseract has two preferred choices, with
So the new code picked the wrong choice. |
We use the best traineddata for the French language |
So, both apostrophes should be considered as OK in tesseract's output, right? |
If there is a confusion with two alternatives of similar confidence, I'd normally take the one with higher confidence, even if it is only slightly higher (unless there are other rules like for example a dictionary which suggest to take the second alternative). |
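The rule described in the comment above (between two alternatives, keep the one with the higher confidence, however small the margin) could be sketched roughly as follows. This is only an illustration: the `Choice` struct and `PickHigherConfidence` function are hypothetical names, not Tesseract types or API, and no dictionary-based override is modelled.

```cpp
#include <string>

// Hypothetical container for one recognition alternative (not a Tesseract type).
struct Choice {
  std::string text;   // recognized character(s), UTF-8
  float confidence;   // higher means more confident
};

// Returns the alternative with the higher confidence.
// On an exact tie the first alternative is kept.
const Choice& PickHigherConfidence(const Choice& a, const Choice& b) {
  return b.confidence > a.confidence ? b : a;
}
```

With this rule, a slightly better-scoring apostrophe variant would win over the other one regardless of any ratio threshold.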
Just to clarify, the suggested fix removes one confused character, but it is not necessarily the correct one (like the example with the apostrophe). One question, could you please provide me the exact code block where _null_char mapping is happening? Thanks. |
tesseract/src/lstm/lstmrecognizer.cpp Line 119 in 5761880
|
I hope it is OK for me to chime in and point out that this issue has affected many users for some years now. Even if the proposed fix does not choose the best candidate, it is still very much an improvement over the current situation. Could someone experienced in C++ and Tesseract please add a pull request to get the process started and the change reviewed? |
@stweil related to your question. I've already posted some images to #1060. Now I've collected more images with double characters. I'm posting them below. All are tested with on Windows 10 64bit example call: |
Service Im April auf |
From the results above, the character confusion is not fixed, right? Do you also have cases where it is fixed? Just to mention again, the fix is meant to solve this issue, but it does not guarantee that you get the correct character. Most of the time, though, you do get the correct character. |
@EucliTs0 I've just extracted images where one character becomes two characters. I didn't keep an exact list of where it was different before. But yes, there were some images that had two characters before and returned only one with the latest version. |
@TheSeiko Perhaps in your case you need to modify the threshold |
Hi EucliTs0: We have been experiencing the same behavior as yourself, with extra characters showing up in the Tesseract output stream. I am experimenting with the most recent master branch code, and I think that the line numbers in the source may be somewhat different from the version you are working with. So could you please do me the favor of providing the method name where you are putting your fix, and attaching the full recodebeam.cpp file so I can find it and try it out myself. Thanks, Dave |
Hello @woodjohndavid, We use the latest stable version, Tesseract 4.1.1 ([https://github.com/tesseract-ocr/tesseract/tree/4.1.1]). We added this block inside I cannot attach the .cpp file, because it is not supported here, so I will add it as plain text. |
@EucliTs0 thank you for trying to make Tesseract better! Since AFAICT no one is working on this long-standing issue, any hint to track down the actual cause is welcome. But please use GitHub facilities (or at least a diff/patch) for sharing next time! Here's your change in a reusable way:

diff --git a/src/lstm/recodebeam.cpp b/src/lstm/recodebeam.cpp
index 1c840569..bb34cd7a 100644
--- a/src/lstm/recodebeam.cpp
+++ b/src/lstm/recodebeam.cpp
@@ -615,6 +615,14 @@ void RecodeBeamSearch::ContinueContext(const RecodeNode* prev, int index,
if (prev != nullptr && prev->code == code && !is_simple_text_) continue;
float cert = NetworkIO::ProbToCertainty(outputs[code]) + cert_offset;
if (cert < kMinCertainty && code != null_char_) continue;
+
+ if (prev != nullptr and code > 0 and code != 139 and prev->code !=139 and prev->code > 0)
+ {
+ const float sum_proba_prev_current = std::max(outputs[code], outputs[prev->code]) + std::min(outputs[code], outputs[prev->code]);
+ const float ratio_scores = outputs[code] / sum_proba_prev_current;
+ if (ratio_scores < 0.88f) break;
+ }
+
full_code.Set(length, code);
int unichar_id = recoder_.DecodeUnichar(full_code);
// Map the null char to INVALID.

I have not tried it yet, but (in addition to @stweil's comments), a few problems stand out:
|
Hi EucliTs0: Thanks for the information. That will help me try out your fix in the context of the latest master version and see how it goes. I will report back on this thread with my results and any suggestions I might come up with. Regards, Dave |
Hi @bertsky
|
Hi EucliTs0: Please see my latest post here: #3477. If you like, you can try the solution I have proposed and see if it works in your situation. I did try out the fix that you have used, but it didn't work consistently in our case. I guess it depends on the specific mix of characters that are encountered. |
I have just created pull request #4211, which I consider to be an improved solution for diplopia. I encourage everyone on this thread to try it out and test it with as broad a range of cases as possible. Note, by the way, that there are some new configuration values that can only be set in code as things stand. These configuration values are: bool kRemoveDiplopia - if true, enables diplopia removal functionality. If false, my changes have no effect. Obviously, if my diplopia change is of value, then these configuration items should be made into settings. |
Environment
Hello,
We utilize Tesseract a lot in our platform, and we most often had the following issue:
For example, if we had a sequence "2032BA065" in the image, then we would get as output: "2032BA0O65".
But this happens to other characters too, for example B -> B8, 5 -> 5S. After some investigation and debugging, we came up with a fix that corrects all cases (at least in our dataset).
It happens at two very close time steps (t, t+1) on the characters. The confidence probabilities of the two characters are very close to each other at time steps t and t+1, whereas for non-confused characters the confidence is close to 1.0 at each time step. Unfortunately, Tesseract doesn't filter out this kind of duplication between confused characters. To fix this issue, let P(t) and P(t+1) be the probabilities of the recognized characters at consecutive time steps t and t+1, respectively.
D(t+1) = P(t+1) / (P(t) + P(t+1)),
where D(t+1) defines the confusion metric: if D(t+1) < threshold, we stop and ignore the confused character.
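As a rough illustration of the metric, here is a minimal standalone sketch. The helper names `ConfusionRatio` and `IsConfused` are hypothetical and are not part of Tesseract; in the patch quoted in this thread, the two probabilities correspond to `outputs[prev->code]` and `outputs[code]`. The zero-sum guard is an added assumption, not something the original fix contains.

```cpp
// Hypothetical helper (not Tesseract API): confusion ratio
// D = p_curr / (p_prev + p_curr) for two consecutive candidates.
float ConfusionRatio(float p_prev, float p_curr) {
  const float sum = p_prev + p_curr;
  // Assumption: treat a zero sum as "no evidence" and return 0.
  return sum > 0.0f ? p_curr / sum : 0.0f;
}

// Returns true when the current candidate looks like a confused duplicate,
// i.e. its ratio falls below the threshold (0.88 in the issue report).
bool IsConfused(float p_prev, float p_curr, float threshold = 0.88f) {
  return ConfusionRatio(p_prev, p_curr) < threshold;
}
```

For example, with P(t) = 0.50 and P(t+1) = 0.45 the ratio is 0.45 / 0.95, roughly 0.47, which is below 0.88, so the second character would be dropped. With P(t) = 0.01 and P(t+1) = 0.98 the ratio is roughly 0.99, so the character is kept.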
In src/lstm/recodebeam.cpp, between lines 907 and 908, we add:
Suggested Fix:
The threshold of 0.88 was set experimentally, but I hope that this can help address the issue in future versions and that it generalizes well.
Unfortunately, I cannot provide any documents because we work on sensitive data.
Thank you.