Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing some text #38

Closed
aiaimimi0920 opened this issue Jan 14, 2024 · 38 comments
Closed

Missing some text #38

aiaimimi0920 opened this issue Jan 14, 2024 · 38 comments

Comments

@aiaimimi0920
Copy link
Contributor

aiaimimi0920 commented Jan 14, 2024

@fire @Ughuuu
The current example is a very short repetitive sentence, so it is not possible to test for text loss issues.

When I tried to use this plugin in daily life, I found that text loss problems occurred when I entered long and different sections of text.

like this test wav:
Test example 1:

  • model: small.en.bin
  • lanuage: English
  • use_gpu:true
  • Source text:
 As a five-year-old girl ,Lin Qiaozhi was deeply affected by her mother's death.
 At age 18, instead of following the traditional path of marriage like the majority of girls, 
she chose to study medicine. 
"Why should we go through so much?  Finally a good husband should be their final goal!" 
her brother complained, thinking of the high tuition fees.
she responded, "I'd rather stay single to study all my life!"
  • result text:
    image

Test example 2:

  • model: small.bin
  • lanuage: Chinese
  • use_gpu:true
  • Source text:
骑士的战马的头部经过长戟手正前方,长戟手果断出手,长戟伸向骑士的身侧。
当长戟碰到骑士手臂的时候,一股巨大的力量瞬间通过长期同时作用于它和长戟手。
长棘手的双手如同铁钳一般牢牢攥紧木柄,双脚就像树根一样死死扎在地上。
上一秒,骑士以为自己将要得救。
下一秒,他就被长戟手从马上拖了下来。
骑士感觉他像是在飞,短暂的飞翔之后,他重重摔在松软的草地上。
  • result text:
    image

test_wav.zip

Based on the above example, it can be seen that there is a problem of missing text, which may be caused by my PR?

I will see how to solve this problem. If you have any good suggestions, please let me know

@aiaimimi0920 aiaimimi0920 changed the title Text loss Missing some text Jan 14, 2024
@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 14, 2024

Interesting. I am trying locally too to see if I can find anything relevant. In my short tests it worked ok.

@aiaimimi0920
Copy link
Contributor Author

You can try "english_test.wav" in the zip file I attached, where "english_test.txt" is the original text

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 14, 2024

I'm thinking first I'll expose all params to scripting, and then I'll see what generates the issue. Could the the duration_ms that is set by default to 5000 ms(5 seconds)

@Ughuuu Ughuuu mentioned this issue Jan 14, 2024
@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 14, 2024

Or maybe problem is it doesnt detect end of sentence correctly.

This part could also drop text, and maybe we should make it configurable:

if (token.p > 0.6 && token.plog < -0.5) {
						continue;
					}

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 14, 2024

The problem occurs in gdscript after all, here:

var cur_text = transcribed_msg["text"]
var token_index = cur_text.rfind("]")
if token_index!=-1:
	cur_text = cur_text.substr(token_index+1)

With this example:

[{ "is_partial": false, "text": " As a 5-year-old girl, Linozu was deeply affected by her mother\'s death.[_TT_388] At age 18, instead of following the traditional path of marriage like the majority of girls," }]

What happens here, it looks for a ] and it finds it, but since this sentence contains two sentences for some reason, it will ignore the first part. Instead, it should just try to remove from the text the special character, and not skip anything. Testing locally to see if it fixes it.

@aiaimimi0920
Copy link
Contributor Author

The purpose of writing this code at that time was because I thought every "transferred_msg" contains only one sentence, and I would like to remove the markers at the beginning and end.

So will there be a phenomenon where a sentence contains multiple identifiers?

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 14, 2024

Yes, of course, I understand why the code is like this. The phenomenon seems to be with the character [_TT_388]. But nevertheless, we don't really care about any such characters, so I'm thinking to just remove all things that start with [ and end with ], possibly with a regex instead that we update by time(eg. we can add to that regex also < >)

@aiaimimi0920
Copy link
Contributor Author

this code use the [[_TT_xxx]:
https://github.com/ggerganov/whisper.cpp/blob/6ebba525f1cc9393752906023a3385a2cc8062ed/whisper.cpp#L1261

  1. Could not find the introduction document, but it appears to be marking the beginning of a sentence?
  2. Are there any of these tags that we may need to use?

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 14, 2024

I doubt it, we don't need these tags for anything. But anyway, am only removing them in gdscript, where we process the text, not in the server.

Also, another issue I found(maybe this is the one that happens, though I had multiple things happen) is:

eg you have 4 transcribed messages:

[{ "is_partial": true, "text": " As a-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of." }]
[{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death at age 18" }]
[{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of marriage like the majority of" }]
[{ "is_partial": false, "text": " As a 5-year-old girl,ozu was deeply affected by her mother\'s death." }]

If the first one, it transcribed a large sentence, but it doesn't know it's 2 sentences. Then, it doesn't recognise the second sentence.(still partial though) Then, it recognises it again, then it ends the first sentence, but it doesn't continue the second one at all.

@aiaimimi0920
Copy link
Contributor Author

image

I have changed the method, but there are still issues with missing text,

The problem you have discovered may only be a partial cause of this problem.

Attached is my modified code:

		var regex_a = RegEx.new()
		regex_a.compile("\\[[^\\[\\]]*\\]")
		cur_text = regex_a.sub(cur_text,"",true)
		
		var regex_b = RegEx.new()
		regex_b.compile("\\<[^\\<\\>]*\\>")
		cur_text = regex_b.sub(cur_text,"",true)

@aiaimimi0920
Copy link
Contributor Author

我对此表示怀疑,我们不需要这些标签做任何事情。但无论如何,我只在 gdscript 中删除它们,我们在那里处理文本,而不是在服务器中。

另外,我发现的另一个问题(也许这就是发生的问题,尽管我发生了很多事情)是:

例如,您有 4 条转录的消息:

[{ "is_partial": true, "text": " As a-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of." }]
[{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death at age 18" }]
[{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of marriage like the majority of" }]
[{ "is_partial": false, "text": " As a 5-year-old girl,ozu was deeply affected by her mother\'s death." }]

如果是第一个,它转录了一个大句子,但它不知道它是 2 个句子。然后,它无法识别第二句话。(虽然仍然是部分的)然后,它再次识别它,然后它结束了第一句话,但它根本没有继续第二句话。

You have explained it very clearly, and this should be the cause of the problem. We need to find a way to solve it

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 14, 2024

Yup. I am still investigating it. I am guessing it's probably related to whisper_params.single_segment or probably something that makes it process just one sentence.

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 14, 2024

I think it's related to the duration parameter after all. If I set it to something higher, like 20s, it just works.
So I think the duration parameter shouldn't be hardcoded, but instead it should be calculated by the amount of frames it has and run on that.

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 14, 2024

Ok, I think I have a fix, testing it and will put it on my branch and merge it if all is good.

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 14, 2024

Put all on this branch. #39
Must also say, it works, but sometimes it still might cut some things, but less than before maybe. But now also there is interface to set parameters from UI, so that's nice.
I would say, if you try to make a bigger break when sentence ends, it will have a higher success rate.

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 14, 2024

After it builds on main branch u can try again, or u can try building locally. Merged the change.

@aiaimimi0920
Copy link
Contributor Author

I found that it may be related to pcmf32. size()>n_samples_iter_threshold,
I will try to fix it

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 15, 2024

Are you sure it's related? What do you think the issue is related to it? Should we make it configurable on node properties and test them?

Also, I was thinking some or all properties should be instead project level settings and not node level settings, as it doesn't make sense to expose all settings to 1 node, and there is just 1 singleton anyway.

@aiaimimi0920
Copy link
Contributor Author

@Ughuuu #40
Indeed, it is very likely that the problem is caused by the automatic truncation triggered after pcmf32.size()>n_samples_iter_threshold.

I have made some modifications to this logic.

All text can now be recognized, and there will be no missing text.

image

I enabled timestamps for this logic. I don’t know if it will have a big impact on performance.

@aiaimimi0920
Copy link
Contributor Author

The modification logic is when it is found that pcmf32. size()>n_samples_iter_threshold , PCMF32 is not directly cleared, but all data corresponding to the first segment in PCMF32 is deleted.

and other data is retained for the next inference. After all, the inference time of audio within 30 seconds is the same for Whisper.cpp, so we do not need to delete all data

In most cases, this logic is not a problem, but in some cases, the timestamp returned by Whisper seems to be problematic?

I used token.t0 and token.t1 to obtain timestamps, perhaps there are other ways?

The timestamp issue may be related to the initial silent voice, or it may be related to incomplete implementation of the current timestamp

  1. Implement word-level timestamps approach proposed by OpenAI ggerganov/whisper.cpp#375
  2. [MM-54242] Improve timestamp accuracy mattermost/calls-transcriber#3
  3. [DRAFT] Token level timestamps with DTW (#375)  ggerganov/whisper.cpp#1485

@aiaimimi0920
Copy link
Contributor Author

This is the current test video

2024-01-16.17-20-35.mp4

Note: I used 1.5.4 wheeler.cpp. It seems that the version upgrade has greatly alleviated the issue of timestamp errors. Perhaps we can consider upgrading the version

Attention: In the current usage process, there is a small probability that the processing time may reach up to 30 seconds. The reason is currently unclear. (Stuck is just a lack of inference results, the application will not be frozen)

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 16, 2024

Lets upgrade whisper.cpp then.

@aiaimimi0920
Copy link
Contributor Author

aiaimimi0920 commented Jan 16, 2024

Lets upgrade whisper.cpp then.

I think it would be more reasonable for you to create the PR for upgrading Whisper.cpp version.
I am not sure if you have made any custom modifications to Whistler.CPP before.

But in my testing, simply replacing all files with version 1.5.4 and compiling them can be used normally on Windows

@aiaimimi0920
Copy link
Contributor Author

By the way, I posted a issue about timestamp on wheeler.cpp(ggerganov/whisper.cpp#1776) .

If there is any valid response, I will try to continue fixing this issue.
Because there is still a probability that too many voice buffers will be discarded due to timestamp errors

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 16, 2024

@fire can you upgrade whisper cpp

@fire
Copy link
Member

fire commented Jan 16, 2024

I made a pr that did git subrepo pull --branch=master --force thirdparty/whisper.cpp

@aiaimimi0920
Copy link
Contributor Author

@fire @Ughuuu
I have delved a little deeper into the code of wheeler.cpp and found that although the document mentions single_segment=true is more suitable for stream usage, but if set to false,

it can trigger timestamp updates: https://github.com/V-Sekai/godot-whisper/blob/dc741ff637130dffd038dc72b383c9aba3af8d0b/thirdparty/whisper.cpp/whisper.cpp#L5753C78 -L5753C92

In my actual testing, 9 out of 10 tests were able to achieve near perfect results, with 1 test causing lag for unknown reasons. But there were no issues with incorrect timestamps

#42 we need merge it

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 16, 2024

Awesome. Cant wait to test it out.

@aiaimimi0920
Copy link
Contributor Author

with 1 test causing lag for unknown reasons

I think I have found the cause of this problem.

If you click on start_button: Turn on the recording switch. At this time, even if there is no voice content, an array of all zeros will be added every second through "_speech_to_text_singleton.add.audio_buffer (buffer)": buffer: [(0,0), (0,0), (0,0), (0,0)...]

Then the Whisper will generate an illusion when it receives blank speech, and the output will be similar to:
msg.text: [BEG] you[_TT_103][BEG] you[_TT_103][BEG] you[_TT_103][BEG] you[_TT_103][BEG] you[_TT_103][BEG] y]

The current truncation logic actually has an implicit premise:
When pcmf32. size()>n_samples_iter_threshold, delete the size of the first segment, so you can make pcmf32.size() smaller than n_samples_iter_threshold is then used for the next iteration.

This ensures that the duration of pcmf32 is always less than 30 seconds and also ensures multiple iterations of the sentence, improving its accuracy.

But due to hallucinations, the blank speech is generated as [2BEG_] you [TT-103]:
At this point, deleting only the length of the first segment of the speech does not significantly reduce the length of pcmf32, which may be greater than 30 seconds. Additionally, each iteration only deletes one segment of the [1BEG_] you [TT_103] corresponding to the pcmf32 array, and the remaining blank speech can only be deleted next time. This is why sometimes iterations are very slow

In our current test cases,

success case:

  1. Click on start_ Button,
  2. Click on AudioStreamPlayer2.play now
  3. You will obtain a relatively satisfactory inference result

fail case

  1. Click on start_ Button,
  2. Wait for a while, such as 5 seconds
  3. Click on AudioStreamPlayer2.play
  4. You will receive an excessively long delay and there is a probability that you will not be able to obtain the result

@fire @Ughuuu
This should be the last problem that needs to be solved for this PR, and I will come up with a solution. This is actually closely related to the illusion caused by silent audio

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 16, 2024

Nice, kudos for identifying the problem so quick

@aiaimimi0920
Copy link
Contributor Author

#43 @Ughuuu @fire
Basically solved the problem,

But there is also an implicit issue because the timestamp returned by wheeler.cpp is not very accurate in some cases, which can lead to the direct consequence of cutting too much or too little audio when doing audio cutting. There may be issues with duplicate or missing text in the returned text.

But the probability of this problem occurring is not very high, and it should not be a problem in daily use. If you want to deeply fix this problem. So the possible methods are:

  1. Modify the logic of audio truncation
  2. Waiting for Whisper.cpp to provide DTW

@fire
Copy link
Member

fire commented Jan 19, 2024

Is the time stamp of Godot Engine better? What is DTW?

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 19, 2024

Open a new task/issue. I am now also testing it locally to see what the issue is.
The timestamp he is refering to is actually the token timestamp(eg. at what time interval a token appears).

In this example:
I go home.

Assume that the voice is saying those words, and then whisper.cpp generates this data:

Token[0]:

  • text: [BEG] -- this is usually begin token
  • timestamp_start: 0ms
  • timestamp_end: 0ms

Token[1]:

  • text: I
  • timestamp_start: 0ms
  • timestamp_end: 100ms

Token[2]:

  • text: go
  • timestamp_start:120ms
  • timestamp_end: 200ms

Token[3]:

  • text: home
  • timestamp_start:230ms
  • timestamp_end: 400ms

Token[4]:

  • text: [END] -- this is usually end token, can also be [_TTL_400] or smth like that
  • timestamp_start:400ms
  • timestamp_end: 400ms

@Ughuuu
Copy link
Collaborator

Ughuuu commented Jan 19, 2024

The issue he is refering to is that sometimes the token_timestamp is not correct. But that is a problem in whisper.cpp, nothing we can fix.

@aiaimimi0920
Copy link
Contributor Author

yes, This is a specific example:
ggerganov/whisper.cpp#1776

@aiaimimi0920 aiaimimi0920 reopened this Jan 19, 2024
@aiaimimi0920
Copy link
Contributor Author

aiaimimi0920 commented Jan 19, 2024

@fire
Because real-time inference is required, I need to ensure that the content in PCMF32 is always less than 15 seconds, so that you can obtain an inference result within 1-3 seconds.

So it is crucial to delete some audio content through certain methods to keep it within 15 seconds.

Currently, audio arrays are deleted by obtaining the location to be deleted based on the "audio timestamp of a certain segment" * "sampling rate".
Then, due to issues with the timestamp returned in the segment, there were too many or too few audio arrays deleted
Finally, there will be inference results of text duplication or text loss
apply single_section=false and Whisper1.5.4 partially resolved the issue of incorrect timestamps, but did not fully resolve it.
If we continue the idea of deleting audio arrays through timestamps, then we need to wait for this PR.
ggerganov/whisper.cpp#1485

If there is a better way, please change it

@fire
Copy link
Member

fire commented Jan 19, 2024

Do you want me to try merging in that pr?

@aiaimimi0920
Copy link
Contributor Author

no, according to the author's self description, the PR has not been completed yet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants