December’s Google Pixel Feature Drop brought “speaker labels” to transcriptions made by its Recorder app.
Now a post on Google’s AI blog (via 9to5Google) takes a deeper dive into exactly how the machine learning tech works while running solely on the phone’s hardware.
The system mainly consists of three components: a speaker turn detection model that detects a change of speaker in the input speech, a speaker encoder model that extracts voice characteristics from each speaker turn, and a multi-stage clustering algorithm that annotates speaker labels to each speaker turn in a highly efficient way. All components run fully on the device.
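The pipeline Google describes (turn detection → per-turn voice embedding → clustering into speaker labels) can be sketched in miniature. The snippet below is an illustrative stand-in, not Google's actual multi-stage algorithm: it assumes turn embeddings have already been extracted by the encoder, and uses a simple greedy cosine-similarity clustering (the `threshold` value is a made-up assumption) to assign each turn a speaker label.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def label_turns(turn_embeddings, threshold=0.8):
    """Greedy clustering sketch: assign each speaker turn to the most
    similar existing speaker centroid, or start a new speaker when no
    centroid is similar enough. Returns one integer label per turn."""
    centroids = []  # running-mean embedding per discovered speaker
    counts = []     # number of turns folded into each centroid
    labels = []
    for emb in turn_embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(emb, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # No existing speaker is close enough: new speaker label.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Fold this turn into the matched speaker's running mean.
            counts[best] += 1
            n = counts[best]
            centroids[best] = [(c * (n - 1) + e) / n
                               for c, e in zip(centroids[best], emb)]
            labels.append(best)
    return labels

# Toy 2-D "embeddings": three turns from one voice, one from another.
print(label_turns([[1, 0], [0.95, 0.1], [0, 1], [1, 0.05]]))  # → [0, 0, 1, 0]
```

Because every step is a pass over small vectors, a scheme like this can plausibly run on-device in real time, which is the constraint the blog post emphasizes.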
The post also mentions features in development, such as shifting more processing from the CPU to the TPU to reduce power consumption, and expanding the feature to support more languages.