Music

How Shazam Names a Song Without Ever Listening to It

June 18, 2026 6 min read

Illustration generated with Google Flow (Nano Banana Pro).

You're in a bar. The music is loud, people are shouting over it, glasses are clinking, and a song you half-recognize is bleeding through all of that noise. You hold up your phone, wait five seconds, and the screen calmly tells you the title, the artist, and the album. It feels like the phone listened to the melody — hummed it back, recognized the tune. It didn't. Shazam doesn't hear music at all in any human sense. It doesn't care about melody, key, or rhythm. What it does instead is one of the most elegant tricks in modern computing: it turns sound into a sparse map of stars and goes looking for that exact constellation in a haystack of millions.

A packed crowd holding phones up at a concert — the loud, chaotic environment Shazam was built to cut through — Credit: Aleksandr Popov / Unsplash

From sound to a picture of sound

The first thing Shazam does is stop thinking of the audio as a waveform and start thinking of it as a picture. It chops the clip into tiny overlapping slices of time and runs each one through a Fourier transform — the mathematical machine that takes a messy wiggle of sound pressure and tells you which pure frequencies it's made of, and how loud each one is.

Stack all those slices side by side and you get a spectrogram: time running left to right, frequency bottom to top, brightness showing how much energy sits at each pitch in each moment. A bass thump lights up the bottom; a hi-hat sparkles along the top. It's sound made visible — the same kind of image a sound engineer or a linguist would recognize on sight.

A real spectrogram: time on the horizontal axis, frequency on the vertical, brightness showing energy — the bright streaks are the loud peaks Shazam keeps — Credit: Aquegg / Wikimedia Commons (Public domain)

Keeping only the brightest stars

Here's the genius move, the one that makes the whole thing work in a noisy bar. A full spectrogram is far too much data, and most of it is mush — the smear of background chatter, the rumble of the air conditioner, the hiss of a crowd. So Shazam throws nearly all of it away. It keeps only the peaks: the points that are louder than everything around them, the frequencies that punched cleanly through the racket.

What's left is a scatter of dots that Shazam's creator, Avery Wang, called a constellation map — and the name is perfect. Like real constellations, these points are sparse, distinctive, and stubbornly survive bad conditions. A loud peak in the original recording tends to stay a loud peak even after it's been crushed through a phone speaker, soaked in chatter, and squeezed down a cheap microphone. The background noise is loud, but it's diffuse; it rarely beats the sharp musical peaks at their own game. By keeping only those peaks, Shazam quietly deletes most of the noise before it ever tries to match anything.

Pairs of stars become a fingerprint

A single dot isn't unique — plenty of songs have a loud note at 440 Hz. So Shazam doesn't store dots; it stores relationships. It takes an "anchor" peak and pairs it with several nearby peaks just ahead of it in time. Each pair becomes a tiny hash: the frequency of the first point, the frequency of the second, and the time gap between them, packed into one compact number.

That triplet is wonderfully distinctive. The odds that some other song has the same two frequencies separated by the same sliver of time, over and over in the same order, are vanishingly small. And because a hash is built from relationships between peaks — not absolute loudness — it survives volume changes, compression, and a fair amount of grime. Shazam precomputes these hashes for every track it knows and stuffs them into a giant database, each one tagged with which song it came from and when in that song it occurred.

Finding the song without ever hearing it

Now the magic. Your five-second clip gets the same treatment — spectrogram, peaks, pairs, hashes — and Shazam looks each hash up in its index. You'll get scattered hits: this hash matches that song, that hash matches three others. Coincidences are everywhere. So how does it tell a real match from random noise?

With a beautifully simple idea about time. If your clip genuinely comes from a song, then every matching hash should line up at a consistent offset. Say you recorded 47 seconds into the track: then hash A matches at 47s, hash B at 47.3s, hash C at 48s — every single one sitting exactly your-recording-time later than where it lives in the original. Shazam plots all the matches and looks for that telltale diagonal line: a thick stripe of hits that all share the same time difference. A wrong song produces a random spray of dots with no line. The right song produces an unmistakable streak. Find the diagonal, and you've found your track — no melody recognition, no AI listening, just geometry.

A vintage microphone glowing in bar light — the kind of room where a sparse constellation of peaks still cuts through the noise — Credit: Israel Palacio / Unsplash

The kicker

The blueprint for all of this was published openly by Avery Wang back in 2003, in a paper barely a few pages long — years before the iPhone existed, when "Shazam" meant calling a number — 2580 — holding your phone up to the music for about thirty seconds, and waiting for an SMS to tell you the song. The core algorithm has barely changed since, because it didn't need to. While the rest of the world chases ever-bigger neural networks to understand music, the thing in your pocket that names a song through a wall of noise is doing something almost stubbornly old-fashioned: it's not listening at all. It's just matching stars.

A project like this one?

I design and deploy products like this. Let's talk.

Let's talk