Upsample audio with the Fast Fourier Transform
The term “upsample” refers to the process of increasing the sampling frequency of a digital signal, usually audio. Upsampling is sometimes also referred to as interpolation.
There are many reasons one may want to upsample audio or other signals. In some cases it is desirable to work with a “higher-resolution” version of a signal when using ill-conditioned signal processing techniques. For example, numerical differentiation is notoriously ill-conditioned, and decreasing the sampling interval (increasing the sampling frequency) while preserving the signal’s statistical properties is a great help.
In this post, I present a simple, well-known technique to upsample audio which does not degrade signal quality or introduce statistical artifacts. The technique is referred to as zero-padding in the frequency domain, or simply zero-padding. Zero-padding in the frequency domain is linearithmic in time, i.e. $O(n \log n)$, and so is suitable for real-time applications.
This technique is exactly what the name suggests, and is best understood with an example, so let’s begin with a demonstration. I’ll be using a recent version of the Julia programming language to process the audio signals, and the SoX (Sound eXchange) tool to convert between file formats.
Demonstration
Let’s use the free song The wanderer
by the talented artist Waterflame.
Download the MP3 file, and convert it to the WAV format using sox:
$ curl http://www.newgrounds.com/audio/download/508136 \
> Thewanderer.mp3
$ sox Thewanderer.mp3 Thewanderer.wav
(You could also use the VLC media player to do the conversion.)
Now begin an interactive session with Julia:
$ julia
The first step is to decode the audio signal into vectors (arrays) of sample values. Decode the WAV file using the sound.jl module:
julia> require("sound.jl")
julia> import Sound
julia> x, f, n, etc = Sound.wavread("Thewanderer.wav")
Now the variable x is a two-column matrix, each column representing an audio channel; the variable f is the sampling frequency; the variable n is the number of bits encoding each sample; and the variable etc is not needed.
Let’s only work with the first $2^{20}$ (about a million) samples, so that the Fourier transforms are fast:^{1}
julia> x = x[1:2^20,:]
Get column vectors representing each of the two (stereo) audio channels:
julia> x1 = x[:,1]
julia> x2 = x[:,2]
The next step involves padding the Fourier transform of the signal with zeros. Take the Fourier transform of each vector, then double each transform’s length by inserting zeros in the middle of the spectrum. The zeros must go in the middle rather than at the end, because the second half of an FFT vector holds the negative-frequency components; appending zeros after them would shift those components to the wrong frequencies:
julia> X1 = fft(x1)
julia> X2 = fft(x2)
julia> h = div(length(X1), 2)
julia> Y1 = [X1[1:h] ; zeros(length(X1)) ; X1[h+1:end]]
julia> Y2 = [X2[1:h] ; zeros(length(X2)) ; X2[h+1:end]]
Finally, take the inverse Fourier transform of each of the padded vectors to obtain the upsampled signal. Because ifft normalizes by the (now doubled) vector length, multiply by 2 to restore the original amplitude:
julia> y1 = 2*ifft(Y1)
julia> y2 = 2*ifft(Y2)
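The padding and inverse-transform steps above can be checked in miniature. The following is my own minimal sketch in pure Python, using a naive $O(n^2)$ DFT in place of an FFT so it needs no libraries; the function names (`dft`, `idft`, `upsample2`) are not from the original post:

```python
import cmath
import math

def dft(x):
    # Naive O(n^2) discrete Fourier transform (a stand-in for fft()).
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def idft(X):
    # Inverse DFT, normalized by 1/n like Julia's ifft().
    n = len(X)
    return [sum(X[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n)) / n
            for k in range(n)]

def upsample2(x):
    # Insert len(x) zeros at the middle of the spectrum (the Nyquist
    # position), then invert.  The factor of 2 compensates for the
    # inverse transform's 1/length normalization, which now divides
    # by twice as many samples.
    n = len(x)
    X = dft(x)
    Y = X[:n // 2] + [0.0] * n + X[n // 2:]
    return [2 * v.real for v in idft(Y)]

# One cycle of a sine wave, sampled 8 times.
x = [math.sin(2 * math.pi * k / 8) for k in range(8)]
y = upsample2(x)  # 16 samples at twice the sampling frequency
# Every second output sample reproduces an original sample exactly.
print(all(abs(y[2 * k] - x[k]) < 1e-9 for k in range(8)))  # True
```

The check at the end is the essential property of the technique: the upsampled signal passes through every original sample, with new band-limited samples interpolated in between.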
Save the result to another WAV file, with a sample frequency of twice that of the original file:
julia> y = [y1 y2]
julia> y = real(y)
julia> Sound.wavwrite(y, 2*f, n, "Thewanderer2.wav")
Now exit Julia and listen to the output:
$ sox Thewanderer2.wav -d
You will notice that this technique does not affect the quality of the audio.
As an exercise, try zero-padding the transformed sample vector to three times its original size, but save the resulting samples with twice the sampling frequency. This will produce a version of the song that is 50% slower (i.e. 150% the length of the original).
In general, to achieve a “time warp” or slowdown in the time domain by a factor of $\beta > 1$, such that the resulting signal $x'$ is given by $x'(\beta t) = x(t)$, you should pad the input signal in the frequency domain with $\lfloor N (\beta - 1) \rfloor$ zeros, where $N$ is the number of samples in the original signal.
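The general formula can be sketched the same way. This is again my own pure-Python illustration (naive $O(n^2)$ DFT, hypothetical function name `stretch`), shown here for the integer case $\beta = 3$, where the stretched signal should reproduce the original at every third sample:

```python
import cmath
import math

def dft(x, inverse=False):
    # Naive O(n^2) DFT; the inverse is normalized by 1/n.
    n = len(x)
    s = 1j if inverse else -1j
    out = [sum(x[k] * cmath.exp(2 * s * cmath.pi * j * k / n) for k in range(n))
           for j in range(n)]
    return [v / n for v in out] if inverse else out

def stretch(x, beta):
    # Slow x down by a factor beta > 1: insert floor(n*(beta - 1))
    # zeros at the middle of the spectrum, invert, and rescale by the
    # length ratio to undo the inverse transform's 1/length normalization.
    n = len(x)
    X = dft(x)
    Y = X[:n // 2] + [0.0] * int(n * (beta - 1)) + X[n // 2:]
    return [(len(Y) / n) * v.real for v in dft(Y, inverse=True)]

# One cycle of a sine wave, sampled 8 times, stretched by beta = 3.
x = [math.sin(2 * math.pi * k / 8) for k in range(8)]
y = stretch(x, 3)  # 24 samples; at the original sampling rate this
                   # plays three times slower
print(all(abs(y[3 * k] - x[k]) < 1e-9 for k in range(8)))  # True
```

Setting $\beta = 2$ and doubling the playback frequency recovers the upsampling demonstration above as a special case.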
Conclusions
Zero-padding in the frequency domain is a simple and accurate technique to interpolate sampled signals, which can be used in real time. The only caveat is that this technique is best applied to periodic signals, like music or speech.
If you work with signals of any kind (you probably do), zero-padding in the frequency domain should be a part of your toolkit. This technique is well-known in spectral analysis and signal processing, but because of its many applications, it can be useful in almost any field. For example, audio is a one-dimensional signal, whereas an image is a two-dimensional signal, and so the same technique applies with only slight modifications. “Upsampling” an image is left as an exercise to the reader.
I hope this demonstration has piqued the curiosity of anyone who was not familiar. Please leave a comment if you have questions, if you enjoyed reading, or if you’d like to read more.

Because human hearing is bandwidth-limited, you can iteratively process a fixed window of samples at a time and concatenate the upsampled output. For other applications, such as radar, this may not be an option, and an FPGA is probably required. To upsample more than a million or so samples on commodity hardware: process a power-of-two number of samples at a time, and use Amazon EC2 or another HPC service. ↩