Introducing the BizCrush STT Benchmark

BizCrush
Growth
Comparing speech-to-text accuracy in real-world conditions
"This app transcribes more accurately."
That's usually how speech-to-text (STT) services get judged. But everyone uses them in different conditions, and since "it just feels better" varies from person to person, an objective comparison is hard to come by.
To solve that, BizCrush built an open STT benchmark that anyone can inspect and verify for themselves.
Instead of simply calling an API to measure performance, we run the actual services, play the same audio into them, and compare the results. We publish not just the scores, but the source audio, the ground-truth transcript, and each app's transcript output so anyone can review the numbers directly.

The BizCrush leaderboard
Why we built it
Every STT service claims to be accurate.
But how accurate each one really is, and where it is strong or weak, is hard to compare objectively. In a quiet office, almost every service performs well. The situations where users actually struggle are noisy ones: several people talking at once, or heavy background noise.
BizCrush built this benchmark to compare, out in the open, how STT services actually perform in those real-world conditions.
BizCrush is one of the services being compared. That's exactly why we publish not just the scores but also the ground-truth transcripts, the app outputs, the scoring method, and the limitations, so that anyone can verify the results.
What makes it different
Most STT benchmarks call the engine's API directly. But real users don't use APIs. And most apps add their own audio processing, noise reduction, and post-processing on top of the underlying recognition engine. As a result, API performance and the real user experience often differ.
The BizCrush benchmark tests the entire path a user actually experiences. We play the source audio, run the real STT app inside an Android emulator, feed the audio through the microphone input, and collect exactly what the app outputs. This measures performance in the way closest to how people actually use these apps.
How a test runs
We open the target app and start recording.
We feed the source audio into the microphone input.
We capture the live transcript the app generates.
When playback ends, we save the final result.
We compare it against the ground truth and compute the score.
After internal review, we publish it.
Every result is reviewed by a person before it goes public.

Test detail page
We separate live and post-processed transcripts
Many services rewrite the transcript after recording ends: adding punctuation, smoothing sentences, and sometimes even correcting misrecognized words. The result looks nicer, but it can hide the actual recognition performance.
So BizCrush measures two things separately.
Live transcript
What the user actually sees on screen during the meeting.
Post-processed transcript
The result after post-processing is applied once recording stops. For services that provide both, we publish each separately so you can see exactly what the post-processing changed. We're now adding post-processed results for note-style services such as Clova Note.
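To make "what the post-processing changed" concrete, here is a minimal sketch that diffs a live transcript against its post-processed version word by word. It is not BizCrush's tooling: the function name `word_changes` and the two sample transcripts are made up for illustration, and `difflib` produces a heuristic diff rather than the alignment used for scoring.

```python
from difflib import SequenceMatcher


def word_changes(live: str, post: str) -> list[tuple[str, str, str]]:
    """Return (operation, live_words, post_words) for every span that differs."""
    a, b = live.split(), post.split()
    # autojunk=False keeps difflib from ignoring very frequent words in long transcripts.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Hypothetical transcripts, not taken from the benchmark.
live = "the meeting is at four tomorrow"
post = "The meeting is at 4 tomorrow."

for tag, before, after in word_changes(live, post):
    print(f"{tag}: {before!r} -> {after!r}")
```

In this made-up example every change is cosmetic (case, digits, punctuation). A change that swaps one word for a different word would be the kind of correction that hides a recognition error in the live transcript.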
How scoring works
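Scores use WER (Word Error Rate) or CER (Character Error Rate): the number of substitutions, deletions, and insertions divided by the number of words (or characters) in the ground truth. The lower the number, the more accurate.
Each word in the comparison falls into one of four categories: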
Substitution: a word recognized as a different word
Deletion: a word that was missed
Insertion: a word added that wasn't actually spoken
Correct: recognized correctly
For example, if the ground truth is
The meeting is at three tomorrow afternoon
and the transcript says
The meeting is at four tomorrow afternoon
then "three" was misrecognized as "four", which counts as one substitution error.
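The example above can be recomputed with a short sketch. It assumes the standard definition of WER (substitutions + deletions + insertions, divided by the number of ground-truth words) and a plain edit-distance alignment. The names `Breakdown` and `align` are illustrative, and the benchmark's own scorer may break ties between equally good alignments differently.

```python
from dataclasses import dataclass


@dataclass
class Breakdown:
    correct: int = 0
    substitutions: int = 0
    deletions: int = 0
    insertions: int = 0

    @property
    def error_rate(self) -> float:
        # Reference length = every ground-truth token that was hit, replaced, or missed.
        ref_len = self.correct + self.substitutions + self.deletions
        errors = self.substitutions + self.deletions + self.insertions
        return errors / ref_len if ref_len else 0.0


def align(ref: list[str], hyp: list[str]) -> Breakdown:
    """Count C/S/D/I along one minimum-edit alignment of ref and hyp."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner to label each step.
    out = Breakdown()
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                out.correct += 1
            else:
                out.substitutions += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out.deletions += 1
            i -= 1
        else:
            out.insertions += 1
            j -= 1
    return out


truth = "The meeting is at three tomorrow afternoon".split()
output = "The meeting is at four tomorrow afternoon".split()
result = align(truth, output)
print(result)                      # Breakdown(correct=6, substitutions=1, deletions=0, insertions=0)
print(f"{result.error_rate:.1%}")  # 14.3%
```

Passing `list(text)` instead of `text.split()` gives a character-level count, which is one way to compute CER with the same function.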
How much do results differ in noisy conditions?
The hardest clip currently published is a dinner-party recording with several people talking at once. It was captured by a far-field microphone, and some stretches are hard to make out even for a human.
On this test:
BizCrush: 49.6% WER
Competing service: ~72% WER
In quiet conditions, most services land around 3–5% and look similar. But in the noisy conditions where users struggle most, a significant gap appears. What matters isn't the absolute number but which service makes fewer errors on the same audio.
We don't count simple notation differences as errors
To evaluate real recognition performance, we strip away notation differences as much as possible. For example:
letter case
punctuation
HTML entities
currency notation
percentage notation
If the meaning is the same, we treat them as identical. The same applies across languages. In Korean, for instance, "할 수 있다" and "할수있다" sound identical and differ only in spacing. A reviewer listens to such cases and, if they're acoustically the same, marks them as a match.
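A rough sketch of this kind of normalization is shown below. The exact rules the benchmark applies are not spelled out here, so the specific mappings ("$5" to "5 dollars", "50%" to "50 percent") and the function names `normalize` and `same_ignoring_spaces` are assumptions for illustration only. The spacing check merely flags candidates; in the process described above, a human reviewer makes the final call.

```python
import html
import re


def normalize(text: str) -> str:
    """Reduce a transcript to a notation-independent form before scoring."""
    text = html.unescape(text)  # HTML entities: "&amp;" -> "&"
    text = text.lower()         # letter case
    # Currency notation (assumed rule): "$5" -> "5 dollars"
    text = re.sub(r"\$\s?(\d+(?:[.,]\d+)*)", r"\1 dollars", text)
    # Percentage notation (assumed rule): "50%" -> "50 percent"
    text = re.sub(r"(\d)\s?%", r"\1 percent", text)
    # Thousands separators: "1,000" -> "1000"
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    # Remaining punctuation becomes whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def same_ignoring_spaces(a: str, b: str) -> bool:
    """Flag pairs that differ only in spacing, as candidates for human review."""
    return "".join(a.split()) == "".join(b.split())


print(normalize("It costs $5, up 50%!") == normalize("it costs 5 dollars up 50 percent"))
print(same_ignoring_spaces("할 수 있다", "할수있다"))
```

Applying the same function to both the ground truth and the app output keeps the comparison fair even where a rule is crude, since both sides are transformed identically.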
Test data currently published
For now, we use only publicly available audio with no licensing issues.

Test clips page
Dinner Party (Amazon DiPCo): English, high noise, 15:49
VOA Korea – Parkour: Korean, low noise, 5:37
VOA – Texas Korean Community: English, low noise, 9:03
Every clip is published together with its source.
On top of that, BizCrush also produces its own license-clean recordings, expanding the test set every week. The first series is the BizCrush Board Game Lunch (BGL), in which we curate real conversation settings together with their ground-truth transcripts. These in-house recordings carry their own benchmark-use license, stated in the page footer and in the License & Sources section of the Methodology page.
Results split by noise level
Looking at the published results, almost every service is highly accurate in quiet conditions. But the gap widens as noise increases.
That's why the BizCrush benchmark doesn't just show an average. It separates
low noise
high noise
so the results reflect real-world usage more accurately.
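The effect of splitting by noise level can be sketched in a few lines. The records below are placeholders invented for illustration, not benchmark results, and the field names `clip`, `noise`, and `wer` are assumptions about how a run might be stored.

```python
from collections import defaultdict
from statistics import mean


def wer_by_noise(runs: list[dict]) -> dict[str, float]:
    """Average WER per noise level instead of one overall figure."""
    groups = defaultdict(list)
    for run in runs:
        groups[run["noise"]].append(run["wer"])
    return {level: mean(values) for level, values in groups.items()}


# Placeholder values for a single hypothetical service; not real scores.
runs = [
    {"clip": "quiet-clip-1", "noise": "low", "wer": 0.03},
    {"clip": "quiet-clip-2", "noise": "low", "wer": 0.05},
    {"clip": "noisy-clip-1", "noise": "high", "wer": 0.50},
]

overall = mean(run["wer"] for run in runs)
print(f"overall: {overall:.1%}")
for level, value in wer_by_noise(runs).items():
    print(f"{level}: {value:.1%}")
```

With these placeholder numbers the single average sits between the two groups and describes neither of them, which is the problem the split is meant to avoid.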

Example of the runs page
Transparency first
Every published test comes with:
the source audio
the ground-truth transcript
the app's transcript output
an error breakdown
the WER/CER calculation
so anyone can check it and recompute it themselves. We also anonymize the compared services, except for major global ones, to avoid identifying individual services unnecessarily. The source audio can be reviewed and verified directly on the page.
What's next
This benchmark currently focuses on real, app-based testing. Going forward, we plan to add:
direct API testing
more languages
more noisy environments
post-processed-transcript evaluation for note-style services
automatic app-version tracking
We believe a good benchmark is built by steadily accumulating trustworthy data. BizCrush will keep building an STT evaluation standard that stays as close as possible to real-world use.

