@b3by b3by commented Mar 24, 2025

The get_speech_timestamps method can return the speech coordinates in seconds, currently with a fixed resolution of one decimal place. In this commit I added a parameter to the method, time_resolution, that specifies the number of decimal places used for the temporal coordinates. The parameter defaults to 1, which preserves the current behaviour.
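To illustrate the effect of the new parameter, here is a minimal sketch of the rounding step only. The helper name samples_to_seconds and the example segment are hypothetical and are not taken from the library; the sketch assumes the conversion is a plain division by the sampling rate followed by rounding.

```python
def samples_to_seconds(timestamps, sampling_rate=16_000, time_resolution=1):
    """Convert sample-based segments to seconds.

    Hypothetical helper: mirrors the idea of the PR (round to
    `time_resolution` decimal places), not the library's actual code.
    """
    return [
        {
            "start": round(ts["start"] / sampling_rate, time_resolution),
            "end": round(ts["end"] / sampling_rate, time_resolution),
        }
        for ts in timestamps
    ]


# Made-up segment, expressed in samples at 16 kHz.
segments = [{"start": 5_152, "end": 21_984}]

coarse = samples_to_seconds(segments)                   # default: one decimal place
fine = samples_to_seconds(segments, time_resolution=3)  # millisecond resolution
print(coarse, fine)
```

With the default, the segment boundaries snap to the nearest tenth of a second; a higher time_resolution keeps more of the underlying sample position.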

@adamnsandle adamnsandle merged commit feba8cd into snakers4:master Mar 24, 2025
@adamnsandle
Collaborator

Thanks for your contribution!

@b3by
Contributor Author

b3by commented Mar 25, 2025

No problem at all!

On a side note, I was also considering the possibility of computing the time resolution automatically, in an auto mode based on the sampling rate, so that the time coordinates would be lossless.
As an example, a sampling rate of 16_000 would require a time resolution of 5 in order to have lossless time coordinates. This could be achieved by changing the type of time_resolution to Union[int, str] and accepting a string value of auto, so that the user wouldn't need to manually compute the resolution.
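A rough sketch of how such an auto mode could be computed, assuming "lossless" means that every sample index can be recovered from its rounded time coordinate. The function names below are hypothetical and this is not part of the merged change.

```python
from typing import Union


def lossless_time_resolution(sampling_rate: int) -> int:
    """Smallest number of decimals d such that 10**-d <= 1 / sampling_rate.

    With that step, two different sample indices never round to the
    same time coordinate.
    """
    resolution = 0
    while 10 ** resolution < sampling_rate:
        resolution += 1
    return resolution


def resolve_time_resolution(time_resolution: Union[int, str], sampling_rate: int) -> int:
    """Hypothetical resolver for the proposed Union[int, str] parameter."""
    if time_resolution == "auto":
        return lossless_time_resolution(sampling_rate)
    if isinstance(time_resolution, int):
        return time_resolution
    raise ValueError("time_resolution must be an int or 'auto'")


def is_lossless(sampling_rate: int, resolution: int, n_samples: int) -> bool:
    """Check that sample -> rounded seconds -> sample round-trips exactly."""
    return all(
        round(round(n / sampling_rate, resolution) * sampling_rate) == n
        for n in range(n_samples)
    )


auto_16k = resolve_time_resolution("auto", 16_000)
print(auto_16k, is_lossless(16_000, auto_16k, 32_001))
```

Under these assumptions, 16_000 resolves to 5 decimal places, while 4 decimal places already lose information for the very first samples.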

Would that be something worth working on?

@snakers4
Owner

snakers4 commented Mar 25, 2025

Ideally we want the VAD to be frame accurate, i.e. up to ~30 ms. In real life we can lose 1-2 frames even if the VAD works well. Which is fine, I guess.

So, to rephrase your question a bit, 1 decimal place in seconds is about 0.1 * 16,000 = 1,600 samples, which is about 3 frames. This is a bit cheeky on our part, but we have frames instead (I do not remember, but I believe one of these was added by the community).

But 2 decimal places is roughly 1/3 of a frame, and 3 decimal places is already negligible.
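The arithmetic above can be checked with a few lines, assuming a frame of 512 samples at a sampling rate of 16,000 (about 32 ms). The constant and function names are illustrative only.

```python
SAMPLING_RATE = 16_000  # samples per second
FRAME_SAMPLES = 512     # assumed frame size at 16 kHz


def step_in_samples(decimals: int) -> float:
    """Number of samples spanned by one rounding step of 10**-decimals seconds."""
    return SAMPLING_RATE / 10 ** decimals


def step_in_frames(decimals: int) -> float:
    """The same rounding step, expressed in frames."""
    return step_in_samples(decimals) / FRAME_SAMPLES


for decimals in (1, 2, 3):
    print(decimals, step_in_samples(decimals), step_in_frames(decimals))
```

One decimal place spans a little over 3 frames, two decimal places about a third of a frame, and three decimal places about 3% of a frame.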

So I believe maybe just changing the default and testing that nothing breaks is already fine?

@b3by
Contributor Author

b3by commented Mar 25, 2025

Alright, I see the point now. It makes sense then, considering that a frame would be roughly 512 samples with a rate of 16_000.

If this is the case, it would probably be ok to just leave the rounding at one decimal place, and let users choose the precision they want when requesting time coordinates rather than sample coordinates. This would also preserve the original method behaviour, just in case there is code somewhere that relies on it.

At any rate, I will keep an eye on this.

Thank you!
