Let's process sound with Go

October 16, 2018

Disclaimer: I don’t consider any algorithms and API for working with sound and speech recognition. This article is about problems with audio processing and how to solve them with Go.

phono is an application framework for sound processing. It’s main purpose - to create a pipeline of different technologies to process sound in a way you need.

What does the pipeline do, using different technologies and why do we need another framework? We will figure that out now.

Whence the sound?

In 2018 sound became a standard way of interaction between human and technology. With the majority of IT-giants either already having voice-assistant, or are working on it right now. Voice control integrated into major operating systems and voice messages are a default feature of any modern messenger. There are around a thousand of start-ups working on natural language processing and around two hundred on speech-recognition.

Same story with music. It’s getting played from any device and sound recording is available for anyone who has a computer. Music software is developed by hundreds of companies and thousands of enthusiasts around the globe.

Common tasks

If you ever performed any sound-processing tasks, the following conditions should be familiar to you:

Audio should be received from file, device, network, etc.
Audio should be processed: add FX, encode, analyze, etc.
Audio should be sent into file, device, network, etc.
Data is transmitted in small buffers

It becomes a pipeline - there is a data stream that goes through several stages of processing.

Solutions

For clarity, let’s take a real life problem: we need to transform voice into text:

Record audio with a device
Remove noises
Equalize
Send signal to a voice-recognition API

As any other problem, this one has several solutions.

Brute force

Hardcore-developers ~~wheel-inventors~~ only. Record sound directly through a sound interface driver, write smart noise-suppressor and multi-track equalizer. This is very interesting, but you can forget about your original task for several months.

Time-consuming and very complex.

Normal way

The alternative is to use existing audio APIs. It’s possible to record audio with ASIO, CoreAudio, PortAudio, ALSA and others. Multiple standards of plugins are available for processing: AAX, VST2, VST3, AU.

A rich selection doesn’t mean that it’s possible to use everything. Usually the following limitations apply:

Operating system. Not all APIs are available on all OSs. For example, AU is OS X native technology and available only there.
Programming language. The majority of audio libraries are written in C or C++. In 1996 Steinberg released the first version of VST SDK, still the most popular plugin format. Today, you no longer have to write in C/C++: there are VST wrappers in Java, Python, C#, Rust and many others. Although the language remains a limitation, today people process sound even with Javascript.
Functionality. If the problem is simple, there is no need to write a new application. The FFmpeg has a ton of features.

In this case the complexity depends on your choices. The worst-case scenario - you’ll have to deal with multiple libraries. And if you’re really unlucky, with complex abstractions and absolutely different interfaces.

What in the end?

We have to choose between very complicated and complicated:

deal with several low-level APIs to invent our own wheels
deal with several APIs and try to make them friends

It doesn’t matter which way you’ve chosen, the task always comes down to the pipeline. The technologies may differ, but the essence is unchanged. Again, the problem is instead of solving the problem, we have to ~~invent a wheel~~ build a pipeline.

But there’s an option.

phono

phono is created to solve common tasks - “receive, process and send” the sound. It utilizes a pipeline as the most natural abstraction. There is a post in the official Go blog. It describes a pipeline-pattern. The core idea of the pipeline-pattern is that there are several stages of data processing working independently and exchanging data through channels. That’s what is needed.

But, why Go?

At first, a lot of audio software and libraries are written in C. Go is known as its successor. On top of that, there is cgo and a big variety of bindings for existing audio APIs that we can take and use.

Second, in my opinion, Go is a good language. I don’t want to dive deep, but I note its concurrency possibilities. Channels and goroutines make pipeline implementation significantly easier.

Abstractions

pipe.Pipe struct is the heart of phono - it implements the pipeline pattern. Just as in the blog example there are three stages defined:

pipe.Pump - receive sound, only output channels
pipe.Processor - process sound, input and output channels
pipe.Sink - send sound, only input channels

Data transfers in buffers within a pipe.Pipe. Rules which allow you to build a pipe:

Single pipe.Pump
Multiple pipe.Processor, placed sequentially
Single or multiple pipe.Sink, placed in parallel
All pipe.Pipe components should have the same:
- Buffer size
- Sample rate
- Number of channels

Minimal configuration is a Pump with single Sink, the rest is optional.

Let’s go through several examples.

Easy

Problem: play a wav file.

Let’s express a problem in “receive, process, send” form:

Receive audio from wav file
Send audio to portaudio device

Audio is read and immediately played back.

First we create all elements of a pipeline: wav.Pump and portaudio.Sink and use them in a constructor pipe.New. Function p.Do(pipe.actionFn) error starts a pipeline and awaits until it’s done.

More difficult

Problem: cut wav file into samples, put them into track, save and play the result at the same time.

Sample is a small piece of audio and track is a sequence of samples. In order to sample the audio, we have to put it into memory first. We can use asset.Asset from phono/asset package to serve this purpose. Express the problem in standard steps:

Receive audio from wav file
Send audio to memory

Now we can make samples, add them to the track and finalize the solution:

Receive audio from track
Send audio to:
- wav file
- portaudio device

Again, there is no processing stage, but we have two pipelines now!

Compared to the previous example, there are two pipe.Pipe. First one transfers data into memory, so we can do the sampling. Second one has two sinks in the final stage: wav.Sink and portaudio.Sink. With this configuration, the sound is simultaneously saved to a wav file and played.

Even more difficult

Problem: read two wav files, mix them, process with vst2 plugin and save into a new wav file.

There is a simple mixer mixer.Mixer at phono/mixer package. It allows to send signals from multiple sources and receive a single one mixed. To achieve this, it implements both pipe.Sink and pipe.Pump at the same time.

Again, the problem consists of two small ones. First one looks like this:

Receive audio from wav file
Send audio to mixer

Second:

Receive audio from mixer
Process audio with plugin
Send audio to wav file

Here we have three instances of pipe.Pipe, all connected with a mixer. Execution is started with p.Begin(pipe.actionFn) (pipe.State, error). In compare to p.Do(pipe.actionFn) error, it doesn’t block the call and just returns expected state. The state can be awaited with p.Wait(pipe.State) error.

What’s next?

I want phono to be a very convenient application framework. If there is a task with sound, you do not need to understand complex APIs and spend time studying the standards. All that you need is to build a pipeline with suitable elements and launch it.

In last six months the following packages built:

phono/wav - read/write wav files
phono/vst2 - not completed VST2 SDK bindings, just open plug-ins and call it’s methods, not all structures are mapped
phono/mixer - mixer, summarizes N signals, no balance and volume
phono/asset - sampling buffers
phono/track - sequential read of buffers
phono/portaudio - playback audio, experimental

In addition to this list, there is a constantly growing backlog of new ideas, among which:

Time measurement
Mutable on-the-flight pipeline
HTTP pump/sink
Parameters automation
Resampling-processor
Balance and volume for mixer
Real-time pump
Synchronized pump for multiple tracks
Full vst2

Topics for upcoming articles:

lifecycle of pipe.Pipe - it’s state is managed with finite state machine due to complex internal structure
how to write your own pipe stages

This is my first open-source project, so I would be happy to get any help and recommendations.

pipelined.dev