FIELD: computer technology.
SUBSTANCE: the invention relates to extracting audio features in a dialogue detector in response to an input audio signal. The technical result is improved performance of audio feature extraction through the use of several context windows, each containing a different number of frames, so that the current frame is represented in different contexts. The technical result is achieved by dividing the input audio signal into a plurality of frames; extracting frame audio features from each frame; determining a set of context windows, each context window containing a number of frames surrounding a current frame; deriving, for each context window, a corresponding contextual audio feature for the current frame based on the frame audio features of the frames in that context window; concatenating the contextual audio features to form a combined feature vector representing the current frame; and obtaining, using the combined feature vector, a speech confidence score representing the probability that dialogue is present in the current frame, wherein the number of frames in one or more of the context windows is determined adaptively based on the extracted frame audio features.
EFFECT: improved performance of audio feature extraction when using several context windows.
12 cl, 12 dwg
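The claimed sequence of steps can be illustrated with a minimal Python sketch. It is not the patented implementation, and every concrete choice in it is an assumption made for illustration: the frame features (log energy and zero-crossing rate), the base window lengths (5, 21, 81 frames), windows centred on the current frame, the mean/standard-deviation summary used as the contextual feature, the hypothetical adaptive_length rule that shortens a window when the local features change quickly, and the logistic scorer with externally supplied weights.

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Divide a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def frame_features(frames):
    """Per-frame features; log energy and zero-crossing rate are illustrative picks."""
    log_energy = np.log(np.mean(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([log_energy, zcr], axis=1)

def adaptive_length(feats, t, base_len, min_len=3):
    """Hypothetical adaptation rule: shrink the window when the log energy
    around frame t changes quickly from frame to frame."""
    half = base_len // 2
    local = feats[max(0, t - half):min(len(feats), t + half + 1), 0]
    change = np.mean(np.abs(np.diff(local))) if len(local) > 1 else 0.0
    return max(min_len, int(round(base_len / (1.0 + change))))

def contextual_feature(feats, t, length):
    """Summarise the frames around frame t by their mean and standard deviation."""
    half = length // 2
    ctx = feats[max(0, t - half):min(len(feats), t + half + 1)]
    return np.concatenate([ctx.mean(axis=0), ctx.std(axis=0)])

def combined_feature(feats, t, base_lengths=(5, 21, 81)):
    """Concatenate one contextual feature per context window."""
    parts = [contextual_feature(feats, t, adaptive_length(feats, t, n))
             for n in base_lengths]
    return np.concatenate(parts)

def speech_confidence(vec, weights, bias=0.0):
    """Logistic score in (0, 1); a stand-in for whatever classifier is trained."""
    return 1.0 / (1.0 + np.exp(-(vec @ weights + bias)))
```

With two frame features and three context windows, combined_feature returns a 12-element vector for each frame. The weights passed to speech_confidence would have to come from a classifier trained on labelled dialogue data, which is outside the scope of this sketch.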
Dates
2023-11-10—Published
2020-04-13—Filed