Upsample audio with the Fast Fourier Transform

December 10, 2012

The term “upsample” refers to the process of increasing the sampling frequency of a digital signal, usually audio. Upsampling is sometimes also referred to as interpolation.

There are many reasons one may want to upsample audio or other signals. In some cases it is desirable to work with a “higher-resolution” version of a signal when using ill-conditioned signal processing techniques. For example, numerical differentiation is notoriously ill-conditioned, and decreasing the sampling interval (increasing the sampling frequency) while preserving the signal’s statistical properties is a great help.

In this post, I present a simple, well-known technique to upsample audio which does not degrade signal quality or introduce statistical artifacts. The technique is referred to as zero-padding in the frequency domain, or simply zero-padding. Zero-padding in the frequency domain is linearithmic in time, i.e. $O(n \log n)$, and so is suitable for real-time applications.

This technique is exactly what the name suggests, and is best understood with an example, so let’s begin with a demonstration. I’ll be using a recent version of the Julia programming language to process the audio signals, and the SoX (Sound eXchange) tool to convert between file formats.


Let’s use the free song The wanderer by the talented artist Waterflame. Download the MP3 file, and convert it to the WAV format using sox:

$ curl \
  > The-wanderer.mp3
$ sox The-wanderer.mp3 The-wanderer.wav

(You could also use the VLC media player to do the conversion.)

Now begin an interactive session with Julia:

$ julia

The first step is to decode the audio signal into vectors (arrays) of sample values. Decode the WAV file using the sound.jl module:

julia> require("sound.jl")
julia> import Sound
julia> x, f, n, etc = Sound.wavread("The-wanderer.wav")

Now the variable x is a two-column matrix, each column representing an audio channel; the variable f is the sampling frequency; the variable n is the number of bits encoding each sample; and the variable etc is not needed.

Let’s only work with the first $2^{20}$ (about a million) samples, so that the Fourier transforms are fast:1

julia> x = x[1:2^20,:]

Get column vectors representing each of the two (stereo) audio channels:

julia> x1 = x[:,1]
julia> x2 = x[:,2]

The next step involves padding the Fourier transform of the signal with zeros. Take the Fourier transform of each vector, and pad the transform vectors to double their lengths with zeros:

julia> X1 = fft(x1)
julia> X2 = fft(x2)
julia> Y1 = [X1 ; zeros(size(X1))]
julia> Y2 = [X2 ; zeros(size(X2))]

Finally, take the inverse Fourier transform of each of the padded vectors to obtain the upsampled signal:

julia> y1 = ifft(Y1)
julia> y2 = ifft(Y2)

Save the result to another WAV file, with a sample frequency of twice that of the original file:

julia> y = [y1 y2]
julia> y = real(y)
julia> Sound.wavwrite(y, 2*f, n, "The-wanderer-2.wav")

Now exit Julia and listen to the output:

$ sox The-wanderer-2.wav -d

You will notice that this technique does not affect the quality of the audio.

As an exercise, try zero-padding the transformed sample vector to three times its original size, but save the resulting samples with twice the sampling frequency. This will produce a version of the song that is 50% slower (i.e. 150% the length of the original).

In general, to achieve a “time warp” or slowdown in the time domain by a factor of $\beta > 1$, such that the resulting signal $x’$ is given by $x’(\beta t) = x(t)$, you should pad the input signal in the frequency domain with $\lfloor N (\beta - 1) \rfloor$ zeros, where $N$ is the number of samples in the original signal.


Zero-padding in the frequency domain is a simple and accurate technique to interpolate sampled signals, which can be used in real time. The only caveat is that this technique is best applied to periodic signals, like music or speech.

If you work with signals of any kind (you probably do), zero-padding in the frequency domain should be a part of your toolkit. This technique is well-known in spectral analysis and signals processing, but because of its many applications, it can be useful in almost any field. For example, audio is a one-dimensional signal, whereas an image is a two-dimensional signal, and so the same technique applies with only slight modifications. “Upsampling” an image is left as an exercise to the reader.

I hope this demonstration has piqued the curiosity of anyone who was not familiar. Please leave a comment if you have questions, if you enjoyed reading, or if you’d like to read more.

  1. Because human hearing is bandwidth limited, you can iteratively process a fixed window of samples at a time and concatenate the upsampled output. For other applications, such as radar, this may not be an option, and a FPGA is probably required. To upsample more than a million or so samples on commodity hardware: process power-of-two samples at a time, and use Amazon EC2 or another HPC service.

← Return to blog index