Missing some text #38
Comments
Interesting. I am trying locally too to see if I can find anything relevant. In my short tests it worked ok. |
You can try "english_test.wav" in the zip file I attached, where "english_test.txt" is the original text |
I'm thinking I'll first expose all params to scripting, and then see what causes the issue. Could it be duration_ms, which is set by default to 5000 ms (5 seconds)? |
Or maybe the problem is that it doesn't detect the end of a sentence correctly. This part could also drop text, and maybe we should make it configurable:
|
The problem occurs in gdscript after all, here:
With this example:
What happens here is that it looks for a ] and finds it, but since this message contains two sentences for some reason, it ignores the first part. Instead, it should just remove the special characters from the text and not skip anything. Testing locally to see if that fixes it. |
When I wrote this code, I thought every "transferred_msg" contained only one sentence, and I wanted to remove the markers at the beginning and end. So can a single message contain multiple identifiers? |
Yes, of course, I understand why the code is like this. The problem seems to be with the [_TT_388] tag. But we don't really care about any such tags, so I'm thinking of just removing everything that starts with [ and ends with ], possibly with a regex that we update over time (e.g. we can also add < > to that regex). |
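A minimal sketch of the regex idea above, written in Python rather than GDScript so it can be run standalone. The pattern, the function name `strip_tags`, and the sample string are assumptions for illustration, not the code that ended up in the plugin.

```python
import re

# Hypothetical pattern: drop anything between [ and ], or between < and >.
TAG_PATTERN = re.compile(r"\[[^\]]*\]|<[^>]*>")


def strip_tags(text: str) -> str:
    """Remove bracketed tags such as [_TT_388] without skipping any other text."""
    without_tags = TAG_PATTERN.sub(" ", text)
    # Collapse the whitespace left behind by the removed tags.
    return " ".join(without_tags.split())


if __name__ == "__main__":
    sample = "[_BEG_] Hello world.[_TT_388] Second sentence.[_TT_500]"
    print(strip_tags(sample))
```

Because every tag is removed wherever it occurs, a message that contains two sentences keeps both of them, instead of being cut at the first ].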
This code uses the [_TT_xxx] tags:
|
I doubt it, we don't need these tags for anything. Anyway, I am only removing them in GDScript, where we process the text, not in the server. Also, another issue I found (maybe this is the one that happens, though I saw multiple things happen): say you have 4 transcribed messages:
In the first one, it transcribes a long sentence but doesn't know it's 2 sentences. Then it doesn't recognise the second sentence (the message is still partial, though). Then it recognises it again. Then it ends the first sentence, but it doesn't continue the second one at all. |
I have changed the method, but there are still issues with missing text. The problem you discovered may be only a partial cause. Attached is my modified code:
|
You have explained it very clearly, and this should be the cause of the problem. We need to find a way to solve it |
Yup. I am still investigating it. I am guessing it's probably related to whisper_params.single_segment or probably something that makes it process just one sentence. |
I think it's related to the duration parameter after all. If I set it to something higher, like 20s, it just works. |
Ok, I think I have a fix, testing it and will put it on my branch and merge it if all is good. |
Put all on this branch. #39 |
After it builds on the main branch you can try again, or you can try building locally. Merged the change. |
I found that it may be related to |
Are you sure it's related? How do you think the issue is related to it? Should we make it configurable through node properties and test it? Also, I was thinking some or all properties should be project-level settings instead of node-level settings, as it doesn't make sense to expose all settings on one node, and there is just one singleton anyway. |
@Ughuuu #40 I have made some modifications to this logic. All text can now be recognized, and there will be no missing text. I enabled timestamps for this logic. I don’t know if it will have a big impact on performance. |
The modification logic is when it is found that and other data is retained for the next inference. After all, whisper.cpp takes the same inference time for any audio up to 30 seconds, so we do not need to delete all the data. In most cases this logic works, but in some cases the timestamp returned by Whisper seems to be wrong. I used token.t0 and token.t1 to obtain timestamps; perhaps there are other ways? The timestamp issue may be related to initial silence in the audio, or to an incomplete implementation of timestamps at present. |
This is the current test video: 2024-01-16.17-20-35.mp4 Note: I used whisper.cpp 1.5.4. It seems that the version upgrade has greatly reduced the timestamp errors. Perhaps we can consider upgrading. Attention: in current use, there is a small probability that processing takes up to 30 seconds. The reason is currently unclear. ("Stuck" here only means that inference results are missing; the application does not freeze.) |
Let's upgrade whisper.cpp then. |
I think it would be more reasonable for you to create the PR for upgrading the whisper.cpp version. In my testing, simply replacing all the files with version 1.5.4 and compiling works normally on Windows. |
By the way, I posted an issue about timestamps on whisper.cpp (ggerganov/whisper.cpp#1776). If there is any useful response, I will continue trying to fix this issue. |
@fire can you upgrade whisper.cpp? |
I made a pr that did |
@fire @Ughuuu It can trigger timestamp updates: https://github.com/V-Sekai/godot-whisper/blob/dc741ff637130dffd038dc72b383c9aba3af8d0b/thirdparty/whisper.cpp/whisper.cpp#L5753C78-L5753C92 In my actual testing, 9 out of 10 tests achieved near-perfect results, and 1 test lagged for unknown reasons. But there were no issues with incorrect timestamps. #42: we need to merge it. |
Awesome. Can't wait to test it out. |
I think I have found the cause of this problem. If you click on start_button, recording is turned on. From then on, even if there is no voice content, an array of all zeros is added every second through "_speech_to_text_singleton.add_audio_buffer(buffer)": buffer: [(0,0), (0,0), (0,0), (0,0)...] Whisper then hallucinates when it receives blank speech, and the output will be similar to: The current truncation logic actually has an implicit premise: This ensures that the duration of pcmf32 is always less than 30 seconds and also ensures multiple iterations over the sentence, improving its accuracy. But due to hallucinations, the blank speech is generated as In our current test cases, success case:
fail case
@fire @Ughuuu |
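One possible mitigation for the all-zero buffers described above is to check each buffer for silence before queuing it for inference. This is only a sketch under assumptions: the thread does not say that this is what the fix does, and the names `is_silent`, `append_if_audible`, and the threshold value are hypothetical. It is written in Python for easy testing; the buffer is modelled as a list of (left, right) sample pairs, as in the example above.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[float, float]  # one stereo sample: (left, right)


def is_silent(buffer: Sequence[Frame], threshold: float = 1e-4) -> bool:
    """Return True when every sample in the buffer is at or below the threshold."""
    return all(abs(left) <= threshold and abs(right) <= threshold
               for left, right in buffer)


def append_if_audible(pending: List[Frame], buffer: Sequence[Frame],
                      threshold: float = 1e-4) -> bool:
    """Queue the buffer for inference only if it contains audible samples.

    Returns True when the buffer was appended, False when it was skipped.
    """
    if is_silent(buffer, threshold):
        return False
    pending.extend(buffer)
    return True


if __name__ == "__main__":
    queue: List[Frame] = []
    print(append_if_audible(queue, [(0.0, 0.0)] * 4))           # skipped
    print(append_if_audible(queue, [(0.0, 0.0), (0.2, -0.1)]))  # appended
    print(len(queue))
```

A plain amplitude threshold like this would also drop very quiet speech, so the threshold would need tuning against real recordings.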
Nice, kudos for identifying the problem so quickly |
#43 @Ughuuu @fire But there is also an implicit issue: the timestamp returned by whisper.cpp is not very accurate in some cases, so the audio cutting may remove too much or too little audio. This can lead to duplicate or missing text in the returned text. The probability of this happening is not very high, though, and it should not be a problem in daily use. If we want to fix this problem thoroughly, the possible methods are:
|
Is the time stamp of Godot Engine better? What is DTW? |
Open a new task/issue. I am now also testing it locally to see what the issue is. In this example: Assume that the voice is saying those words, and then whisper.cpp generates this data: Token[0]:
Token[1]:
Token[2]:
Token[3]:
Token[4]:
|
The issue he is referring to is that sometimes the token timestamp is not correct. But that is a problem in whisper.cpp, nothing we can fix. |
Yes, this is a specific example: |
@fire So it is crucial to delete some audio content by some method to keep it within 15 seconds. Currently, the audio array is trimmed at the position given by the "audio timestamp of a certain segment" * "sampling rate". If there is a better way, please change it. |
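The "timestamp * sampling rate" cut described above can be sketched as follows. This is an illustration in Python, not the plugin's code, and it rests on two assumptions: that the audio is 16 kHz mono, and that the timestamp is expressed in units of 10 ms. The function names are hypothetical.

```python
from typing import List

# Assumption: the audio fed to whisper.cpp is 16 kHz mono.
SAMPLE_RATE = 16000


def samples_to_drop(t1_centiseconds: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert an end timestamp (assumed to be in 10 ms units) to a sample index."""
    return (t1_centiseconds * sample_rate) // 100


def trim_audio(samples: List[float], t1_centiseconds: int,
               sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Drop everything before the timestamp and keep the rest for the next inference."""
    cut = samples_to_drop(t1_centiseconds, sample_rate)
    # Clamp so that an inaccurate timestamp can never index outside the buffer.
    cut = min(max(cut, 0), len(samples))
    return samples[cut:]


if __name__ == "__main__":
    audio = [0.0] * (3 * SAMPLE_RATE)   # 3 seconds of audio
    remaining = trim_audio(audio, 250)  # cut at 2.5 s
    print(len(remaining) / SAMPLE_RATE)
```

The clamp only protects against out-of-range indices. If the timestamp itself is off, the cut still lands in the wrong place, which is the duplicate or missing text problem discussed in this thread.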
Do you want me to try merging in that pr? |
No, according to the author's own description, the PR has not been completed yet.
@fire @Ughuuu
The current example is a very short, repetitive sentence, so it cannot be used to test for text loss.
When I tried using this plugin day to day, I found that text was lost when I spoke long and varied passages.
Like this test wav:
Test example 1:
Test example 2:
test_wav.zip
Based on the above examples, text is clearly missing. Could this be caused by my PR?
I will see how to solve this problem. If you have any good suggestions, please let me know.