Specify time resolution when returning speech coordinates in seconds #627

b3by · 2025-03-24T07:57:50Z

The get_speech_timestamps method can return the speech coordinates in seconds, currently with a fixed resolution of one decimal place. In this commit I added a parameter to the method, time_resolution, that specifies the number of decimal places used for the temporal coordinates. The parameter defaults to 1, which preserves the current behaviour.
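To illustrate what the parameter controls, here is a minimal sketch of rounding sample coordinates to seconds. The helper `samples_to_seconds` and the sample values are hypothetical and are not taken from the library's code; only the `time_resolution` name and its default of 1 come from the description above.

```python
def samples_to_seconds(sample_index: int,
                       sampling_rate: int = 16000,
                       time_resolution: int = 1) -> float:
    """Convert a sample index to seconds, rounded to `time_resolution` decimals.

    Hypothetical helper for illustration only; not the library's implementation.
    """
    return round(sample_index / sampling_rate, time_resolution)


# A made-up speech segment expressed in samples at 16 kHz.
segment = {"start": 4128, "end": 21984}

# Default resolution (one decimal place) versus three decimal places.
coarse = {k: samples_to_seconds(v) for k, v in segment.items()}
fine = {k: samples_to_seconds(v, time_resolution=3) for k, v in segment.items()}

print(coarse)  # {'start': 0.3, 'end': 1.4}
print(fine)    # {'start': 0.258, 'end': 1.374}
```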

adamnsandle · 2025-03-24T16:00:03Z

Thanks for your contribution!

b3by · 2025-03-25T07:55:06Z

No problem at all!

On a side note, I was also considering the possibility of automatically computing the time resolution in an auto mode based on the sampling rate, so that the time coordinates would be lossless.
As an example, a sampling rate of 16_000 would require a time resolution of 5 in order to have lossless time coordinates. This could be achieved by changing the type of time_resolution to Union[int, str] and accepting a string value of auto, so that the user wouldn't need to manually compute the resolution.
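A rough sketch of how such an auto mode could pick the resolution, assuming "lossless" means that the rounded seconds can be converted back to the exact sample index. Both function names are hypothetical, and nothing here is part of the merged change.

```python
def auto_time_resolution(sampling_rate: int) -> int:
    """Smallest number of decimals d with 10**d >= sampling_rate.

    With that many decimals the rounding error, expressed in samples,
    stays below half a sample, so the sample index can be recovered.
    Hypothetical helper; an assumption about what "auto" would compute.
    """
    decimals = 0
    while 10 ** decimals < sampling_rate:
        decimals += 1
    return decimals


def round_trips(sampling_rate: int, decimals: int) -> bool:
    """Check samples -> rounded seconds -> samples over two seconds of audio."""
    return all(
        round(round(n / sampling_rate, decimals) * sampling_rate) == n
        for n in range(2 * sampling_rate)
    )


print(auto_time_resolution(16000))  # 5
print(round_trips(16000, 5))        # True
print(round_trips(16000, 4))        # False
```

Under this definition, 4 decimals already fails at sample 1 for 16 kHz: 1 / 16000 = 0.0000625 rounds to 0.0001, which maps back to sample 2.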

Would that be something worth working on?

snakers4 · 2025-03-25T08:09:55Z

Ideally we want the VAD to be frame accurate, i.e. up to ~30 ms. In real life we can lose 1-2 frames even if the VAD works well. Which is fine, I guess.

So, to rephrase your question a bit, 1 decimal place in seconds is about 0.1 * 16,000 = 1,600 samples, which is about 3 frames. This is a bit cheeky on our part, but we have frames instead (I do not remember, but I believe one of these was added by the community).

But 2 decimal places is roughly 1/3 of a frame, and 3 decimal places is already negligible.
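The arithmetic above can be checked with a short sketch, assuming a frame of 512 samples at 16 kHz (about 32 ms, the figure mentioned elsewhere in this thread). The constant and function names are illustrative only.

```python
SAMPLING_RATE = 16000  # samples per second
FRAME_SAMPLES = 512    # assumed frame size in samples at 16 kHz


def rounding_step_in_frames(decimals: int) -> float:
    """Size of one rounding step (10**-decimals seconds) expressed in frames."""
    step_in_samples = SAMPLING_RATE / (10 ** decimals)
    return step_in_samples / FRAME_SAMPLES


for decimals in (1, 2, 3):
    print(decimals, rounding_step_in_frames(decimals))
# 1 3.125
# 2 0.3125
# 3 0.03125
```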

So I believe maybe just changing the default and testing that nothing breaks is already fine?

b3by · 2025-03-25T09:33:53Z

Alright, I see the point now. It makes sense then, considering that a frame would be roughly 512 samples with a rate of 16_000.

If this is the case, it would probably be ok to just leave the rounding to one decimal place, and let the user decide the precision they want when requiring time coordinates rather than sample coordinates. This would also reflect the original method behaviour - just in case there is code somewhere that relies on that.

At any rate, I will keep an eye on this.

Thank you!

time resolution can be specified when coordinates are returned in seconds

6622e56

adamnsandle merged commit feba8cd into snakers4:master Mar 24, 2025
