Examples of polyphonic music (5/18/2023)

We propose a method for the blind separation of sounds of musical instruments in audio signals. We describe the individual tones via a parametric model, training a dictionary to capture the relative amplitudes of the harmonics. The model parameters are predicted via a U-Net, which is a type of deep neural network. The network is trained without ground truth information, based on the difference between the model prediction and the individual time frames of the short-time Fourier transform. Since some of the model parameters do not yield a useful backpropagation gradient, we model them stochastically and employ the policy gradient instead. To provide phase information and account for inaccuracies in the dictionary-based representation, we also let the network output a direct prediction, which we then use to resynthesize the audio signals for the individual instruments. Due to the flexibility of the neural network, inharmonicity can be incorporated seamlessly and no preprocessing of the input spectra is required. Our algorithm yields high-quality separation results with particularly low interference on a variety of different audio samples, both acoustic and synthetic, provided that the sample contains enough data for the training and that the spectral characteristics of the musical instruments are sufficiently stable to be approximated by the dictionary.

Since the amplitudes (a_j) are supposed to be non-negative, we apply the absolute value function to the respective output components of the network. The widths (σ_j) are kept positive via softplus, and they are clipped such that the value does not get too close to 0. For the continuous frequency offsets (ν̃_j), a tanh function is used to keep them inside the interval (−5, 5). The positive parameters (α_j^Γ, β_j^Γ) are obtained after applying the exponential function, and the probabilities for the Bernoulli distribution for the sparsity parameters (u_j) are mapped into the interval (0, 1) via a sigmoid function. The joint categorical distribution for the discrete frequencies (ν_j) and instrument indices (η_j) is given in vectorial form as non-normalized log-probabilities, so we apply the softmax mapping in order to obtain a valid discrete distribution. In doing so, we have to make sure that each instrument can only play one tone at a time by excluding instruments that have already been assigned a tone from the sampling. For each parameter output by the network, we add a trainable scaling layer.

We split the time frames [...], k = 1, ⋯, n_len, into random batches of size 6. We then train on each batch with the AdaMax algorithm. For each epoch, new random batches are assigned. For the dictionary, we also use AdaMax, but with a reduced learning rate of 10⁻⁴. Moreover, analogously to [ref], for the denominator in AdaMax we consider the maximum over all the harmonics of the particular instrument when training D. The entire procedure is outlined in Algorithm 1 ("Training scheme for the network and the dictionary, based on AdaMax"). Upper bound regularization of D and batch summation (see Section 3.6 and Section 3.7) are not explicitly stated there. When training non-trivial neural networks, it is always, to a varying extent, a matter of chance whether the optimization process will converge to a good value.
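The output mappings described in the text (absolute value, softplus with clipping, scaled tanh, exponential, sigmoid, softmax) can be sketched in NumPy as follows. The dictionary keys, the clipping threshold `sigma_min`, and the function name `map_outputs` are illustrative assumptions, not names from the paper, and the trainable scaling layers are omitted:

```python
import numpy as np

def softplus(x):
    # numerically stable softplus: log(1 + exp(x))
    return np.logaddexp(0.0, x)

def softmax(x):
    # normalize non-normalized log-probabilities into a discrete distribution
    z = np.exp(x - np.max(x))
    return z / z.sum()

def map_outputs(raw, sigma_min=1e-3):
    # Map raw, unconstrained network outputs to valid parameter ranges.
    return {
        "a": np.abs(raw["a"]),                       # amplitudes a_j >= 0
        "sigma": np.maximum(softplus(raw["sigma"]),  # widths sigma_j > 0,
                            sigma_min),              # clipped away from 0
        "nu_tilde": 5.0 * np.tanh(raw["nu_tilde"]),  # frequency offsets in (-5, 5)
        "alpha": np.exp(raw["alpha"]),               # positive Gamma parameters
        "beta": np.exp(raw["beta"]),
        "u_prob": 1.0 / (1.0 + np.exp(-raw["u_prob"])),  # Bernoulli probs in (0, 1)
        "pi": softmax(raw["logits"]),                # joint categorical distribution
    }
```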
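The sequential sampling with exclusion can be sketched as follows: tones are drawn one at a time from the joint categorical distribution, and after an instrument is drawn, its log-probabilities are set to −∞ so that it cannot receive a second tone. The helper name, array shapes, and seeding are hypothetical, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tones(logits, n_tones):
    # logits: non-normalized log-probabilities of shape (n_instruments, n_freqs);
    # requires n_tones <= n_instruments.
    logits = logits.astype(float).copy()
    draws = []
    for _ in range(n_tones):
        # softmax over the remaining (instrument, frequency) cells
        flat = logits.ravel()
        p = np.exp(flat - flat.max())
        p /= p.sum()
        idx = rng.choice(flat.size, p=p)
        inst, freq = np.unravel_index(idx, logits.shape)
        draws.append((int(inst), int(freq)))
        # exclude this instrument from all subsequent draws
        logits[inst, :] = -np.inf
    return draws
```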
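A sketch of a single AdaMax step for the dictionary under the convention described in the text, where the infinity-norm accumulator in the denominator is taken as the maximum over all harmonics of each instrument. Hyperparameters follow standard AdaMax; the function and variable names, the dictionary shape, and the small `eps` guard are illustrative assumptions:

```python
import numpy as np

def adamax_step_dict(D, grad, m, u, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    # One AdaMax update for the dictionary D (n_instruments x n_harmonics),
    # using the reduced learning rate 10^-4 mentioned in the text.
    m = b1 * m + (1 - b1) * grad              # first-moment estimate
    u = np.maximum(b2 * u, np.abs(grad))      # exponentially weighted inf-norm
    u_inst = u.max(axis=1, keepdims=True)     # max over all harmonics per instrument
    D = D - (lr / (1 - b1 ** t)) * m / (u_inst + eps)
    return D, m, u
```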