Updated: May 24, 2021
In previous editions of this blog, I’ve ranted and raved about poor audio quality when it comes to recording over programs like Zoom, Skype, or Google Hangouts. Recently, I experienced something that helped me put my frustrations into extremely simple terms, and I’m passing that psychological grace on to you.
A client was using a video-calling program (that will remain unnamed here) to record audio for their podcast. They’d invested in a great mic and soundproofed their recording space, but were still experiencing what they described as a hissing sound in their audio files when saying words with an “S” in them. They wanted to know what they’d done wrong with their equipment to create this strange effect. The problem was, everything they’d done was correct. Everything was working as intended, and I didn’t have the heart to tell them to look at my previous blog post for the extended explanation.
Nobody should be reduced to that *wink*.
The Problem with Video-Conferencing Audio
In short, their audio was being compressed by the program. It will always be compressed. They could use the biggest, most badass microphone on the planet, and as long as they were still recording audio over the internet, this hissing sound would always be present. That’s because it’s not really being added to their audio; the audio itself is being dithered.
Let’s imagine looking at a photo over a dial-up internet connection. Remember the days when photos used to take minutes to load on a webpage? As you can imagine, the smaller the image size, the faster that image would load. In the same way that the webpage delivers a photo to your screen, these video-calling programs are delivering audio to your computer. The only difference is that the video-calling programs need to deliver that audio to your computer to be recorded in real time. To do this, they need to reduce the size of the audio, in the same way that reducing the size of an image allows it to be delivered faster and more reliably.
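To put rough numbers on that size reduction, here is a small Python sketch comparing uncompressed, CD-style mono audio with a call stream. The 32 kbps call figure and the function name `pcm_bitrate_kbps` are my own assumptions for illustration, not measurements from any particular app.

```python
# Rough bitrate arithmetic: uncompressed audio vs. a hypothetical call stream.

def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000

# CD-style mono voice recording: 44.1 kHz, 16-bit, one channel.
uncompressed_kbps = pcm_bitrate_kbps(44_100, 16, 1)

# ASSUMPTION: an illustrative figure for a voice call, not any app's real number.
assumed_call_kbps = 32.0

reduction = uncompressed_kbps / assumed_call_kbps

print(f"Uncompressed: {uncompressed_kbps:.1f} kbps")
print(f"Assumed call: {assumed_call_kbps:.1f} kbps")
print(f"The call keeps roughly 1/{reduction:.0f} of the data")
```

Under that assumption, the call carries only a small fraction of what your microphone actually captured, and something has to be thrown away to make it fit.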
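The computer reduces the audio size by throwing out certain frequencies of the human voice that it deems unnecessary and mixing other frequencies together. In this post, I’m calling the process by which similar frequencies are reduced and then mixed “dithering.” (Audio engineers normally use that word for the low-level noise added when reducing bit depth; what the call software is doing is lossy compression, but the label is a useful shorthand here.)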
It’s like when a high-quality image is compressed until you can see the pixels. Similar colors are combined into single pixels to save space. This is what the computer does with audio, and that audio “pixelation” is heard as a “hissing” in the higher frequencies of a piece of audio.
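If you want to see the “throwing out frequencies” idea in miniature, here is a toy Python sketch. It is not how any real calling app or codec works: I’m using a simple moving average as a crude stand-in for the reduction step, and the sample rate, tone frequencies, and function names are all assumptions chosen for illustration. It also doesn’t reproduce the hiss itself, only the loss of high-frequency detail.

```python
import cmath
import math

SAMPLE_RATE = 16_000   # samples per second (assumed for this toy example)
N = 1_600              # 0.1 s of audio, so both test tones fit a whole number of cycles

def tone_amplitude(samples, freq_hz):
    """Amplitude of one frequency in the signal, via a single-bin DFT."""
    total = sum(
        s * cmath.exp(-2j * math.pi * freq_hz * n / SAMPLE_RATE)
        for n, s in enumerate(samples)
    )
    return 2 * abs(total) / len(samples)

def smooth(samples, width=4):
    """Circular moving average: a crude stand-in for discarding high frequencies."""
    count = len(samples)
    return [
        sum(samples[(n - k) % count] for k in range(width)) / width
        for n in range(count)
    ]

# A low "body of the voice" tone plus a high "S sound" tone, equally loud.
original = [
    math.sin(2 * math.pi * 300 * n / SAMPLE_RATE)
    + math.sin(2 * math.pi * 6_000 * n / SAMPLE_RATE)
    for n in range(N)
]
reduced = smooth(original)

for freq in (300, 6_000):
    before = tone_amplitude(original, freq)
    after = tone_amplitude(reduced, freq)
    print(f"{freq:>5} Hz: {before:.2f} -> {after:.2f}")
```

The low tone passes through almost untouched, while the high tone drops to roughly a quarter of its original level. The highs take the damage first, which is why the “S” sounds are where you notice it.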
Some of you may have experienced this in the early days of programs like Napster and LimeWire, when incredibly compressed MP3s were shared. If you have a program like Apple Music or Spotify, type in the name of a song that uses cymbals. Anything jazzy, really. Listen very closely to the complexity of the high tones in the audio. Then go on YouTube and try to find an upload of the same song, and turn the video quality to the lowest possible setting. That is hissing. That is dithering.
The Problem Visualized
Another way to visualize this, and the method that was therapeutic for me, is to think about the human voice itself. I’ve done so here by recording a short sample of my voice, then using a spectrum analyzer to identify the frequencies of the samples within that piece of audio. It looks like the image below.
The vertical axis represents how loud certain samples are and the horizontal axis represents how high or low in frequency that sample is. Each dot is one sample in the 2 seconds of audio I analyzed.
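For anyone curious how a loudness-versus-frequency plot like this gets its data, here is a minimal Python sketch. It is not the tool I used for the image: it runs a naive DFT on a synthetic three-tone clip, and the sample rate, tone values, and the name `spectrum_db` are assumptions made up for the example.

```python
import cmath
import math

SAMPLE_RATE = 8_000  # assumed sample rate for this synthetic clip
N = 256              # number of samples analyzed

# Synthetic stand-in for a voice: three tones at different loudness levels.
TONES = [(250, 1.0), (1_000, 0.5), (3_000, 0.1)]  # (frequency in Hz, amplitude)
clip = [
    sum(amp * math.sin(2 * math.pi * freq * n / SAMPLE_RATE) for freq, amp in TONES)
    for n in range(N)
]

def spectrum_db(samples, sample_rate):
    """Return (frequency_hz, loudness_db) pairs for the lower half of a naive DFT."""
    count = len(samples)
    points = []
    for k in range(count // 2):
        total = sum(
            s * cmath.exp(-2j * math.pi * k * n / count)
            for n, s in enumerate(samples)
        )
        amplitude = 2 * abs(total) / count
        # Floor at -120 dB so silent bins do not hit log10(0).
        loudness = 20 * math.log10(max(amplitude, 1e-6))
        points.append((k * sample_rate / count, loudness))
    return points

# Print only the bins that stand out, i.e. the peaks a plot would show.
for freq, db in spectrum_db(clip, SAMPLE_RATE):
    if db > -40:
        print(f"{freq:7.1f} Hz  {db:6.1f} dB")
```

Each printed pair is one point on the plot: frequency along the horizontal axis, loudness in decibels up the vertical axis. A real voice fills in far more of that range than three tidy tones do.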
What I want you to remember most about this audio is that, while some ranges of sounds are louder than others, the human voice contains all of these frequencies in every word it speaks. From high to low, every vocal cord that vibrates while speaking gives off several frequencies. This is a full-range sample of those tones, all recorded without quality loss into a computer program like Audacity or Audition. This is what I would consider “good audio.”
The next piece of audio is from a source that is undeniably bad. Do you remember voicemail? If you don’t, try to listen to one. The quality is, in a word, terrible. This is because the audio, sent through phone lines, needed to be reduced as much as possible to ensure that your phone was able to record it and that it didn’t take up too much bandwidth for other calls being sent across the lines. The phone companies have given us a great example of audio that has been reduced and dithered to the absolute tolerable limit. Anything less than this would be painful to listen to. Below is a spectrograph of that audio.
First, notice how it differs from the previous example. There are frequencies in the 4 kHz, 6.5 kHz, and 11 kHz ranges that have been drastically dithered. Plus, every sound higher than 16 kHz has been almost completely removed.
Now notice how similar the two images are. There is still a lot of information present even in the example of the bad audio. This should serve to illustrate how little information can be removed before the audio starts to sound bad. It really is a fine line and the next example will show that.
This final spectrograph is taken from one of the popular video-calling apps that purport to record in quality as good as studio audio. As you can see, we’ve determined that was a lie.
While the trained ear may not hear the difference, in the highest tones of the human voice, we see more reduction and more dithering. The mid-tone frequencies are not as damaged as in the voicemail audio, but all of the low-frequency tones have been drastically lowered in loudness. Frequencies above 20 kHz experience a significant drop-off.
Just like the cymbals in the background of a bad MP3, these high tones experience audio dithering, the equivalent of image pixelation. This is what gets delivered to your audience when you record audio over one of these programs. It’s not as bad as voicemail, but it’s far from perfectly good or studio-quality audio.
The Fix for Bad Calling Audio
There is a very simple fix for this! It’s one I briefly included in my last post and have laid out here in a simple 10-step infographic PDF.
This is a much easier alternative to mailing a recording device to your guest or using one yourself.
Simply record into a Digital Audio Workstation (in this case, the free program Audacity) and allow it to record audio from your microphone as you use a video-calling program to communicate with your guest.
When your show is finished, you will be able to export and sync your locally recorded audio with the audio from your video-calling program. If you can get your guest to record the same way on their end, it will sound like you’re in the same room with them, and I promise to never write a blog post about you again.
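Most people line the two tracks up by eye and ear in their editor, but if you’d like to see how the syncing step could be automated, here is a hedged Python sketch. It slides one track against the other and picks the offset where they match best (cross-correlation). The synthetic “recordings,” the 37-sample delay, and the function name `best_offset` are all invented for illustration; this is not a feature of Audacity or of any calling app.

```python
import random

def best_offset(local, call, max_lag):
    """Return the lag (in samples) at which `call` best lines up with `local`.

    A positive result means the call recording starts that many samples late.
    """
    def score(lag):
        # Correlate local[n] with call[n + lag] over the overlapping region.
        return sum(
            local[n] * call[n + lag]
            for n in range(len(local))
            if 0 <= n + lag < len(call)
        )
    return max(range(-max_lag, max_lag + 1), key=score)

# Synthetic stand-in for speech: seeded random samples so the run is repeatable.
rng = random.Random(0)
local_track = [rng.uniform(-1, 1) for _ in range(2_000)]

# Pretend the call recording is a quieter, noisier copy that starts 37 samples late.
TRUE_DELAY = 37
call_track = [0.0] * TRUE_DELAY + [
    0.6 * s + rng.uniform(-0.1, 0.1) for s in local_track
]

offset = best_offset(local_track, call_track, max_lag=100)
print(f"Estimated offset: {offset} samples")
```

With the offset in hand, you would trim that many samples from the start of the call recording so it lines up with your local track. Real recordings are much longer, so a practical version would search over a short excerpt rather than the whole file.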