By Hanchao
Voiceprint retrieval, as its name implies, is the process in which you authenticate or recognize speakers by their voice. A critical step in voiceprint recognition is voice vectorization, which converts the voice of speakers into structured vectors. Alibaba Cloud AnalyticDB for MySQL and AnalyticDB for PostgreSQL have provided a solution for voiceprint authentication and retrieval. With a few simple SQL commands, you can build a set of high-accuracy voiceprint retrieval and authentication services in just three steps.
Figure 1. Voiceprint demo system
Figure 1 shows the demo interface of the voiceprint retrieval system in AnalyticDB Vector Edition. To facilitate your experience, we have converted the voice information of 340 people into vectors and stored these vectors in the system. The current demo system consists of two parts. In the first part, which is the retrieval function, you can either import a recorded sound file or record a sound file on site and upload it. Then, you can submit the sound file to the voiceprint database for matching and retrieval. In the second part, which is the registration function, you can register and upload your own voice to the current voiceprint database to facilitate later queries and authentication. We will describe each function separately in the following sections.
Figure 2. Voice query
As shown in figure 2, BAC009S0004W0486.wav, a test audio file that contains the voice of S0004, is uploaded to the voiceprint database for retrieval. S0004 ranks first and appears at the top of the result table.
Figure 3. Voice registration
Figure 3 shows the voiceprint registration system, in which you can register your own voice in the backend voiceprint database for easy retrieval. For example, the user Hanchao registers his voice (a clip only 7 seconds long) in the current system. At present, the system supports text-independent registration, so you can register by speaking any words.
Figure 4. Voice recording and retrieval
As shown in figure 4, users can record their voice on site and upload it to the system for retrieval. For example, Hanchao records a 5-second voice clip and retrieves it in the voiceprint system. Hanchao's voice, which was registered previously, ranks first in the result table.
The current voiceprint demo system returns the results of 1:N identification. With this method, you can identify the corresponding speaker in a conference room by voice. For 1:1 authentication, the demo currently lets you use a distance threshold of 550: a voice is accepted only if its vector distance to the registered voice is below that value.
Figure 5. Voiceprint retrieval database
Figure 5 shows the overall structure of the retrieval system in the Alibaba Cloud voiceprint database. AnalyticDB (the voiceprint database) is responsible for storing and querying all structured information (user registration ID, user name, and other user information) and unstructured information (vectors generated from voice) throughout the voiceprint retrieval application. During the query process, you can use voiceprint extraction models to convert voice into vectors and query them in AnalyticDB. The system returns the corresponding user information and the L2 vector distance [5]. We will explain how to train and test voice extraction models in the next article.
The current demo voiceprint system uses the GMM-UBM model to extract i-vectors for retrieval [3]. In addition, we have trained a more accurate deep learning model for voiceprint recognition (x-vector [4]). Furthermore, we can train voiceprint models for specific scenarios, such as phone calls, mobile apps, and noisy environments.
The accuracy of voiceprint recognition (1:N) on datasets that are commonly used in academia (the Aishell.v1 [1] and TIMIT [2] datasets) is more than 99.5%, as listed in table 1.
Table 1. Accuracy of the results that rank first
The first step is initialization.
The current system has implemented the voice-to-vector conversion function. After you send the voice obtained from the frontend to the Alibaba Cloud service system through a POST request and select the appropriate voiceprint model, the system converts the voice into a corresponding vector.
import requests
import json
import numpy as np

# sound: binary content of the sound file.
# model_id: ID of the voiceprint model.
def get_vector(sound, model_id='i-vector'):
    url = 'http://47.111.21.183:18089/demo/vdb/v1/retrieve'
    d = {'resource': sound,
         'model_id': model_id}
    r = requests.post(url, data=d)
    js = json.loads(r.text)
    # The service returns the voiceprint vector in the 'emb' field.
    return np.array(js['emb'])

# Read the user file in binary mode; the file is closed automatically.
file = 'xxx.wav'
with open(file, 'rb') as f:
    data = f.read()
print(get_vector(data))
During initialization, create a corresponding user voiceprint table. In addition, add a vector index to the vector column in the table to accelerate the query process. The current voiceprint model generates 400-dimensional vectors. Therefore, set the index parameter "dim" to 400.
-- Create a user voiceprint table
CREATE TABLE person_voiceprint_detection_table(
id serial primary key,
name varchar,
voiceprint_feature float4[]
);
-- Create a vector index
CREATE INDEX person_voiceprint_detection_table_idx
ON person_voiceprint_detection_table
USING ann(voiceprint_feature)
WITH(distancemeasure=L2,dim=400,pq_segments=40);
The second step is registering the user's voice.
During registration, register a user and insert a record into the current system.
-- Register the user "John" in the current system.
-- Use the HTTP service to convert the voiceprint into a corresponding vector.
INSERT INTO person_voiceprint_detection_table(name, voiceprint_feature)
SELECT 'John', array[-0.017,-0.032,...]::float4[];
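The vector literal in the INSERT statement above comes from the voice-to-vector service. As a minimal sketch, an application could render a NumPy embedding as that literal with a helper like the one below. The name `to_sql_array` and its `dim` argument are hypothetical and are not part of AnalyticDB or the demo service. In production code, passing the vector as a bound parameter through your database driver is generally safer than building SQL strings.

```python
import numpy as np


def to_sql_array(vec, dim=None):
    """Render a 1-D vector as a float4[] array literal (hypothetical helper)."""
    arr = np.asarray(vec, dtype=np.float32).ravel()
    # Optionally enforce the dimension declared on the vector index (400 in the article).
    if dim is not None and arr.size != dim:
        raise ValueError(f"expected {dim} dimensions, got {arr.size}")
    # Reject NaN/inf values, which cannot be compared meaningfully by distance.
    if not np.all(np.isfinite(arr)):
        raise ValueError("vector contains non-finite values")
    body = ",".join(f"{x:.6f}" for x in arr.tolist())
    return f"ARRAY[{body}]::float4[]"


print(to_sql_array([-0.017, -0.032]))  # ARRAY[-0.017000,-0.032000]::float4[]
```

For a real registration, you would pass the 400-dimensional vector returned by the service and set `dim=400` so that a malformed vector is rejected before it reaches the table.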
The third step is retrieving and authenticating the user's voice.
Voiceprint authentication for door locks (1:1): The authentication system obtains the user's identity information (user_id) and calculates the distance between the input voice vector and the user's voice vector in the voiceprint database. Generally, a distance threshold (threshold = 550) is set in the system. If the distance between the vectors is larger than the threshold, the authentication fails. If the distance is lower than the threshold, the voiceprint authentication is successful.
-- Voiceprint authentication for door locks (1:1)
SELECT id, -- User ID
name, -- User name
l2_distance(voiceprint_feature, ARRAY[-0.017,-0.032,...]::float4[]) AS distance -- Distance between the vectors
FROM person_voiceprint_detection_table -- User voice table
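WHERE l2_distance(voiceprint_feature, ARRAY[-0.017,-0.032,...]::float4[]) < 550 -- Distance threshold (generally 550); a column alias cannot be referenced in WHERE
AND id = 'user_id'; -- The user ID to authenticate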
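The 1:1 decision rule can also be sketched outside the database, which may help when you test thresholds offline. The sketch below assumes that the threshold of 550 applies to the plain Euclidean (L2) distance between the two vectors. The function names `l2_distance` and `verify` are illustrative and only mirror the SQL above; they are not AnalyticDB APIs.

```python
import numpy as np

THRESHOLD = 550.0  # distance threshold quoted in the article


def l2_distance(a, b):
    """Euclidean distance between two vectors of the same shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("vectors must have the same shape")
    return float(np.linalg.norm(a - b))


def verify(enrolled, probe, threshold=THRESHOLD):
    """Accept the probe only if it is closer to the enrolled vector than the threshold."""
    return l2_distance(enrolled, probe) < threshold


# Toy 2-dimensional vectors for illustration; real voiceprints have 400 dimensions.
print(verify([0.0, 0.0], [3.0, 4.0]))    # True  (distance 5.0)
print(verify([0.0, 0.0], [600.0, 0.0]))  # False (distance 600.0)
```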
Voiceprint retrieval for conference (1:N identification): The system identifies the voice of the current speaker and returns the information of the most relevant registered users. If the system returns no results, the current conference speaker is not registered in the voiceprint database.
-- Voiceprint recognition of a conference speaker (1:N identification)
SELECT id, -- User ID
name, -- User name
l2_distance(voiceprint_feature, ARRAY[-0.017,-0.032,...]::float4[]) AS distance -- Distance between the vectors
FROM person_voiceprint_detection_table -- User voice table
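WHERE l2_distance(voiceprint_feature, ARRAY[-0.017,-0.032,...]::float4[]) < 550 -- Distance threshold (generally 550); a column alias cannot be referenced in WHERE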
ORDER BY voiceprint_feature <-> ARRAY[-0.017,-0.032,...]::float4[] -- Use the vectors to sort
LIMIT 1; -- Return the most similar results
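To make the 1:N logic concrete, here is a brute-force sketch in NumPy of what the query computes: find the nearest registered vector and return it only if it is within the threshold. This is an illustration under assumptions, not how AnalyticDB executes the query; the database uses its vector index rather than a full scan, and the function name `identify` is hypothetical.

```python
import numpy as np


def identify(gallery, names, probe, threshold=550.0):
    """Return (name, distance) of the nearest registered voice, or None if no match."""
    gallery = np.asarray(gallery, dtype=np.float64)
    probe = np.asarray(probe, dtype=np.float64)
    if gallery.size == 0:
        return None  # nobody is registered yet
    # L2 distance from the probe to every registered vector (one row per user).
    dists = np.linalg.norm(gallery - probe, axis=1)
    best = int(np.argmin(dists))
    if dists[best] >= threshold:
        return None  # nearest speaker is too far away: treat as not registered
    return names[best], float(dists[best])


# Toy 2-dimensional gallery for illustration; real voiceprints have 400 dimensions.
gallery = [[0.0, 0.0], [100.0, 0.0], [1000.0, 1000.0]]
names = ["alice", "bob", "carol"]
print(identify(gallery, names, [90.0, 0.0]))       # ('bob', 10.0)
print(identify(gallery, names, [5000.0, 5000.0]))  # None
```

Returning `None` corresponds to the empty result set described above: the speaker is not registered in the voiceprint database.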
[1] Aishell Data set.
[2] TIMIT Data set.
[3] Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[4] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, "Deep Neural Network Embeddings for Text-Independent Speaker Verification," Interspeech, 2017, pp. 999-1003.
[5] Anton, Howard (1994), Elementary Linear Algebra (7th ed.), John Wiley & Sons, pp. 170-171, ISBN 978-0-471-58742-2