Example 1: Rendering a Spectrogram Image

Introduction

This example shows how to generate a greyscale constant-Q spectrogram image from an audio file using the Gaborator library.

Preamble

We start off with some boilerplate #includes.

#include <memory.h>
#include <iostream>
#include <fstream>
#include <sndfile.h>

The Gaborator is a header-only library — there are no C++ files to compile, only header files to include. The core spectrum analysis and resynthesis code is in gaborator/gaborator.h, and the code for rendering images from the spectrogram coefficients is in gaborator/render.h.

#include <gaborator/gaborator.h>
#include <gaborator/render.h>

The program takes the names of the input audio file and output spectrogram image file as command line arguments, so we check that they are present:

int main(int argc, char **argv) {
    if (argc < 3) {
        std::cerr << "usage: render input.wav output.pgm\n";
        exit(1);
    }

Reading the Audio

The audio file is read using the libsndfile library and stored in a std::vector<float>. Note that although libsndfile is used in this example, the Gaborator library itself does not depend on or use libsndfile.

    SF_INFO sfinfo;
    memset(&sfinfo, 0, sizeof(sfinfo));
    SNDFILE *sf_in = sf_open(argv[1], SFM_READ, &sfinfo);
    if (! sf_in) {
        std::cerr << "could not open input audio file: "
            << sf_strerror(sf_in) << "\n";
        exit(1);
    }
    double fs = sfinfo.samplerate;
    sf_count_t n_frames = sfinfo.frames;
    sf_count_t n_samples = sfinfo.frames * sfinfo.channels;
    std::vector<float> audio(n_samples);
    sf_count_t n_read = sf_readf_float(sf_in, audio.data(), n_frames);
    if (n_read != n_frames) {
        std::cerr << "read error\n";
        exit(1);
    }
    sf_close(sf_in);

In case the audio file is stereo or multi-channel, mix the channels down to mono and store the result in a new std::vector<float>:

    std::vector<float> mono(n_frames);
    for (size_t i = 0; i < (size_t)n_frames; i++) {
        float v = 0;
        for (size_t c = 0; c < (size_t)sfinfo.channels; c++)
            v += audio[i * sfinfo.channels + c];
        mono[i] = v;
    }

The Frequency Scale

To make a constant-Q spectrogram, we will use a logarithmic frequency scale, represented by a gaborator::log_fq_scale object. To create a log_fq_scale, we must choose some parameters: the frequency resolution, the frequency range, and optionally a reference frequency.

The frequency resolution is specified as a number of frequency bands per octave. A typical number for analyzing music signals is 48 bands per octave, or in other words, four bands per semitone in the 12-note equal tempered scale.

The frequency range is specified by giving a minimum frequency; this is the lowest frequency that will be included in the spectrogram display. For audio signals, a typical minimum frequency is 20 Hz, the lower limit of human hearing. In the Gaborator library, all frequencies are given in units of the sample rate rather than in Hz, so we need to divide the 20 Hz by the sample rate of the input audio file: 20.0 / fs.

Unlike the minimum frequency, the maximum frequency is not given explicitly — instead, the analysis always produces coefficients for frequencies all the way up to half the sample rate (a.k.a. the Nyquist frequency). If you don't need the coefficients for the highest frequencies, you can simply ignore them.

If desired, one of the frequency bands can be exactly aligned with a reference frequency. When analyzing music signals, this is typically 440 Hz, the standard tuning of the note A4. Like the minimum frequency, it is given in units of the sample rate, so we pass 440.0 / fs.
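
To make those numbers concrete, here is a brief illustration (not part of the example program; the variables ff_min and ff_ref exist only for this illustration):

    // Illustration only, not part of the example program.
    // With 48 bands per octave, adjacent band center frequencies differ by
    // a factor of 2^(1/48) (about 1.45%), so one equal-tempered semitone
    // (a factor of 2^(1/12)) spans exactly 48 / 12 = 4 bands.
    // The frequency parameters, expressed in units of the sample rate:
    double ff_min = 20.0 / fs;   // about 0.000454 at fs = 44100
    double ff_ref = 440.0 / fs;  // about 0.00998 at fs = 44100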

Let's construct a log_fq_scale:

    gaborator::log_fq_scale scale(48, 20.0 / fs, 440.0 / fs);

The Spectrum Analysis Parameters

The frequency scale forms a part of the spectrum analysis parameters, which are stored in an object of type gaborator::parameters. In this example, there is no need to specify parameters other than the frequency scale, so the parameters can be constructed simply by calling the gaborator::parameters constructor with the frequency scale as an argument:

    gaborator::parameters params(scale);

The Spectrum Analyzer

Next, we create an object of type gaborator::analyzer; this is the workhorse that performs the actual spectrum analysis (and/or resynthesis, but that's for a later example). It is a template class, parametrized by the floating point type to use for the calculations; this is typically float. Constructing the gaborator::analyzer involves allocating and precalculating all the filter coefficients and other auxiliary data needed for the analysis and resynthesis. This takes considerable time and memory, so when analyzing multiple pieces of audio with the same parameters, it is preferable to create a single gaborator::analyzer and reuse it rather than destroy and recreate it.

    gaborator::analyzer<float> analyzer(params);
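
For example, when analyzing several recordings that share the same sample rate and analysis parameters, a pattern along the lines of the following sketch could be used. This is not part of the example program, and analyze_one_recording is a hypothetical helper standing in for the per-recording work shown in the rest of this example:

// Sketch only, not part of this example. analyze_one_recording() is a
// hypothetical helper that would read one audio file, mix it down to
// mono, and analyze it as shown in the rest of this example.
void analyze_one_recording(gaborator::analyzer<float> &analyzer,
                           const char *filename);

// Construct the analyzer once and reuse it for every recording.
void analyze_recordings(const gaborator::parameters &params,
                        const char *const *filenames, int n_files) {
    gaborator::analyzer<float> shared_analyzer(params);
    for (int i = 0; i < n_files; i++)
        analyze_one_recording(shared_analyzer, filenames[i]);
}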

The Spectrogram Coefficients

The result of the spectrum analysis will be a set of spectrogram coefficients. To store them, we will use a gaborator::coefs object. Like the analyzer, this is a template class parametrized by the data type. Because the layout of the coefficients is determined by the spectrum analyzer, it must be passed as an argument to the constructor:

    gaborator::coefs<float> coefs(analyzer);

Running the Analysis

Now we are ready to do the actual spectrum analysis, by calling the analyze method of the spectrum analyzer object. The first argument to analyze is a float pointer pointing to the first element in the array of samples to analyze. The second and third arguments are of type int64_t and indicate the time range covered by the array, in units of samples. Since we are passing the whole file at once, the beginning of the range is sample number zero, and the end is sample number mono.size(). The fourth argument is a reference to the set of coefficients that the results of the spectrum analysis will be stored in (or to be precise, added to).

    analyzer.analyze(mono.data(), 0, mono.size(), coefs);
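
Because each call to analyze adds its contribution to the coefficient set, the signal does not have to be analyzed in a single call. The following alternative sketch (not used in this example; the block size of 65536 is an arbitrary choice) would produce the same coefficients by processing the signal block by block:

    // Alternative sketch, not used in this example: analyze the signal in
    // blocks of 65536 samples instead of a single call. Each call adds its
    // contribution to the same coefficient set, so the end result is the
    // same as passing the whole signal at once.
    const int64_t block_size = 65536;
    for (int64_t t = 0; t < (int64_t)mono.size(); t += block_size) {
        int64_t t_end = t + block_size;
        if (t_end > (int64_t)mono.size())
            t_end = (int64_t)mono.size();
        analyzer.analyze(mono.data() + t, t, t_end, coefs);
    }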

Rendering an Image

Now there is a set of spectrogram coefficients in coefs. To render them as an image, we will use the function gaborator::render_p2scale().

Rendering involves two different coordinate spaces: the time-frequency coordinates of the spectrogram coefficients, and the x-y coordinates of the image. The two spaces are related by an origin and a scale factor, each with an x and y component.

The origin specifies the point in time-frequency space that corresponds to the pixel coordinates (0, 0). Here, we will use an origin where the x (time) component is zero (the beginning of the signal), and the y (frequency) component is the band number of the first (highest frequency) band:

    int64_t x_origin = 0;
    int64_t y_origin = analyzer.bandpass_bands_begin();

render_p2scale() supports scaling the spectrogram in both the time (horizontal) and frequency (vertical) dimensions, but only by power-of-two scale factors. These scale factors are specified relative to a reference scale of one vertical pixel per frequency band and one horizontal pixel per signal sample.

Although a horizontal scale of one pixel per signal sample is a mathematically pleasing reference point, this reference scale is not used in practice because it would result in a spectrogram that is much too stretched out horizontally. A more typical scale factor might be 2^10 = 1024, yielding one pixel for every 1024 signal samples, which is about one pixel per 23 milliseconds of signal at a sample rate of 44.1 kHz.

    int x_scale_exp = 10;

To ensure that the spectrogram will fit on the screen even in the case of a long audio file, let's auto-scale it down further until it is no more than 1000 pixels wide:

    while ((n_frames >> x_scale_exp) > 1000)
        x_scale_exp++;

In the vertical direction, the reference scale factor of one pixel per frequency band is reasonable, so we will use it as-is. In other words, the vertical scale factor will be 2^0 = 1.

    int y_scale_exp = 0;
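
With the origin and scale factors chosen, the relationship between the two coordinate spaces can be made concrete. Apart from rounding, a sample time t and a band number b map to pixel coordinates as in the following illustration (not part of the example program):

    // Illustration only, not part of this example: the mapping from
    // time-frequency coordinates to pixel coordinates implied by the
    // origin and the power-of-two scale factors chosen above.
    auto time_to_pixel_x = [&](int64_t t) { return (t - x_origin) >> x_scale_exp; };
    auto band_to_pixel_y = [&](int64_t b) { return (b - y_origin) >> y_scale_exp; };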

Next, we need to define the rectangular region of the image coordinate space to render. Since we are rendering the entire spectrogram rather than a tile, the top left corner of the rectangle will be at pixel coordinates (0, 0).

    int64_t x0 = 0;
    int64_t y0 = 0;

The coordinates of the bottom right corner are determined by the length of the signal and the number of bands, respectively, taking the scale factors into account. The length of the signal in samples is n_frames, and we get the number of bands as the difference of the end points of the range of band numbers: analyzer.bandpass_bands_end() - analyzer.bandpass_bands_begin(). The scale factor is taken into account by right shifting by the scale exponent.

    int64_t x1 = n_frames >> x_scale_exp;
    int64_t y1 = (analyzer.bandpass_bands_end() - analyzer.bandpass_bands_begin()) >> y_scale_exp;

The right shift by y_scale_exp above doesn't actually do anything because y_scale_exp is zero, but it would be needed if, for example, you were to change y_scale_exp to 1 to get a spectrogram scaled to half the height. You could also make a double-height spectrogram by setting y_scale_exp to -1, but then you also need to change the >> y_scale_exp to << -y_scale_exp since you can't shift by a negative number.
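
Spelled out, that double-height variant would look something like the following sketch. It is shown commented out because it would replace, rather than add to, the y_scale_exp and y1 definitions above:

    // Sketch only, not used in this example: a double-height spectrogram.
    // int y_scale_exp = -1;
    // int64_t y1 = (analyzer.bandpass_bands_end() -
    //               analyzer.bandpass_bands_begin()) << -y_scale_exp;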

We are now ready to render the spectrogram, producing a vector of floating-point amplitude values, one per pixel. Although this is stored as a 1-dimensional vector of floats, its contents should be interpreted as a 2-dimensional rectangular array of (y1 - y0) rows of (x1 - x0) columns each, with the row indices increasing towards lower frequencies and column indices increasing towards later sampling times.

    std::vector<float> amplitudes((x1 - x0) * (y1 - y0));
    gaborator::render_p2scale(
        analyzer,
        coefs,
        x_origin, y_origin,
        x0, x1, x_scale_exp,
        y0, y1, y_scale_exp,
        amplitudes.data());
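
For example, the amplitude at pixel column x and row y could be read back as in this illustration (not part of the example program; x and y are hypothetical pixel coordinates within the ranges given above):

    // Illustration only, not part of this example: the amplitude of the
    // pixel at column x and row y, where row 0 is the highest-frequency
    // band and column 0 is the start of the signal.
    int64_t x = 0, y = 0;  // hypothetical pixel coordinates
    float a = amplitudes[y * (x1 - x0) + x];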

Writing the Image File

To keep the code simple and to avoid additional library dependencies, the image is stored in pgm (Portable GreyMap) format, which is simple enough to be generated with just a few lines of inline code. Each amplitude value in amplitudes is converted into an 8-bit gamma corrected pixel value by calling gaborator::float2pixel_8bit(). To control the brightness of the resulting image, each amplitude value is multiplied by a gain; this may have to be adjusted depending on the type of signal and the amount of headroom in the recording, but a gain of about 15 often works well for typical music signals.

    float gain = 15;
    std::ofstream f;
    f.open(argv[2], std::ios::out | std::ios::binary);
    f << "P5\n" << (x1 - x0) << ' ' << (y1 - y0) << "\n255\n";
    for (size_t i = 0; i < amplitudes.size(); i++)
        f.put(gaborator::float2pixel_8bit(amplitudes[i] * gain));
    f.close();

Postamble

To make the example code a complete program, we just need to finish main():

    return 0;
}

Compiling

If you are using macOS, Linux, NetBSD, or a similar system, you can build the example by running the following command in the examples subdirectory. You need to have libsndfile installed and known to pkg-config.

c++ -std=c++11 -I.. -O3 -ffast-math $(pkg-config --cflags sndfile) render.cc $(pkg-config --libs sndfile) -o render

Compiling for Speed

The above build command uses the Gaborator's built-in FFT implementation, which is simple and portable but rather slow. Performance can be significantly improved by using a faster FFT library. On macOS, you can use the FFT from Apple's vDSP library by defining GABORATOR_USE_VDSP and linking with the Accelerate framework:

c++ -std=c++11 -I.. -O3 -ffast-math -DGABORATOR_USE_VDSP $(pkg-config --cflags sndfile) render.cc $(pkg-config --libs sndfile) -framework Accelerate -o render

On Linux and NetBSD, you can use the PFFFT (Pretty Fast FFT) library. You can get the latest version from https://bitbucket.org/jpommier/pffft, or the exact version the example code was tested with from gaborator.com:

wget http://download.gaborator.com/mirror/pffft/7641e67977cf.zip
unzip 7641e67977cf.zip
mv jpommier-pffft-7641e67977cf pffft

Then, compile it:

cc -c -O3 -ffast-math pffft/pffft.c -o pffft/pffft.o

(If you are building for ARM, you will need to add -mfpu=neon to both the above compilation command and the ones below.)

PFFFT is single precision only, but it comes with a copy of FFTPACK which can be used for double-precision FFTs. Let's compile that, too:

cc -c -O3 -ffast-math -DFFTPACK_DOUBLE_PRECISION pffft/fftpack.c -o pffft/fftpack.o

Then build the example and link it with both PFFFT and FFTPACK:

c++ -std=c++11 -I.. -Ipffft -O3 -ffast-math -DGABORATOR_USE_PFFFT $(pkg-config --cflags sndfile) render.cc pffft/pffft.o pffft/fftpack.o $(pkg-config --libs sndfile) -o render

Running

Running the following shell commands will download a short example audio file (of picking each string on an acoustic guitar), generate a spectrogram from it as a .pgm image, and then convert the .pgm image into a JPEG image:

wget http://download.gaborator.com/audio/guitar.wav
./render guitar.wav guitar.pgm
cjpeg <guitar.pgm >guitar.jpg

Example Output

The JPEG file produced by the above will look like this:

[Image: greyscale constant-Q spectrogram of guitar.wav]