Product

Introducing the BizCrush STT Benchmark

5 min

BizCrush

Growth

      Comparing speech-to-text accuracy in real-world conditions


      "This app transcribes more accurately."


      That's usually how speech-to-text (STT) services get judged. But everyone uses them in different conditions, and since "it just feels better" varies from person to person, an objective comparison is hard to come by.


      To solve that, BizCrush built an open STT benchmark that anyone can inspect and verify for themselves.

      Instead of simply calling an API to measure performance, we run the actual services, play the same audio into them, and compare the results. We publish not just the scores, but the source audio, the ground-truth transcript, and each app's transcript output so anyone can review the numbers directly.


      [Image: The BizCrush leaderboard]

      Why we built it

      Every STT service claims to be accurate.

      But how accurate each one really is, and where it is strong or weak, is hard to compare objectively. In a quiet office, almost every service performs well. Users actually struggle in noisy situations: several people talking at once, or heavy background noise.


      BizCrush built this benchmark to compare, out in the open, how STT services actually perform in those real-world conditions.

      BizCrush is one of the services being compared. That's exactly why we publish not just the scores but also the ground-truth transcripts, the app outputs, the scoring method, and the limitations, so that anyone can verify the results.

      What makes it different

      Most STT benchmarks call the engine's API directly. But real users don't use APIs. And most apps add their own audio processing, noise reduction, and post-processing on top of the underlying recognition engine. As a result, API performance and the real user experience often differ.


      The BizCrush benchmark tests the entire path a user actually experiences. We play the source audio, run the real STT app inside an Android emulator, feed the audio through the microphone input, and collect exactly what the app outputs. This measures performance in the way closest to how people actually use these apps.

      How a test runs

      1. We open the target app and start recording.

      2. We feed the source audio into the microphone input.

      3. We capture the live transcript the app generates.

      4. When playback ends, we save the final result.

      5. We compare it against the ground truth and compute the score.

      6. After internal review, we publish it.


      Every result is reviewed by a person before it goes public.


      [Image: Test detail page]

      We separate live and post-processed transcripts

      Many services rewrite the transcript after recording ends: they add punctuation, smooth sentences, and sometimes even correct misrecognized words. The result looks nicer, but it can hide the actual recognition performance.

      So BizCrush measures two things separately.

      Live transcript

      What the user actually sees on screen during the meeting.

      Post-processed transcript

      The result after post-processing is applied once recording stops. For services that provide both, we publish each separately so you can see exactly what the post-processing changed. We're now adding post-processed results for note-style services such as Clova Note.
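
To make "what the post-processing changed" concrete, here is a minimal Python sketch that lists the word-level edits between a live transcript and its post-processed version, using the standard library's `difflib`. The function name `postprocessing_changes` and the two example strings are invented for illustration; this is not the benchmark's own tooling.

```python
import difflib

def postprocessing_changes(live: str, post: str):
    """Return the word-level edits that turn the live transcript into the post-processed one."""
    a, b = live.split(), post.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / delete / insert spans
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes

# Invented example strings, not benchmark output.
live = "lets meet at for tomorrow"
post = "Let's meet at four tomorrow"

for change in postprocessing_changes(live, post):
    print(change)
```

In this made-up example the first edit only changes notation, while the second corrects a misrecognized word. That second kind of edit is why a post-processed score can look better than what the user saw live.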

      How scoring works

      Scores use WER (Word Error Rate) or CER (Character Error Rate). The lower the number, the more accurate.

      Each word is assigned to one of four categories, three of which are errors:

      • Substitution: a word recognized as a different word

      • Deletion: a word that was missed

      • Insertion: a word added that wasn't actually spoken

      • Correct: recognized correctly


      For example, if the ground truth is


      The meeting is at three tomorrow afternoon


      and the transcript says


      The meeting is at four tomorrow afternoon


      then "three" was misrecognized as "four", which counts as one substitution error.
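
As a worked illustration of the example above: the ground truth has seven words and the transcript contains one substitution, so with the usual definition WER = (S + D + I) / N, where N is the number of words in the ground truth, the score is 1 / 7 ≈ 14.3%. The Python sketch below counts the four categories with a standard Levenshtein (edit-distance) alignment. It is a minimal sketch and not the benchmark's actual scoring code; the names `align_counts` and `error_rate` are hypothetical, and the tie-breaking between equally cheap alignments is an arbitrary choice.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions, correct) for two token lists."""
    # Each cell holds (total_edits, substitutions, deletions, insertions, correct).
    prev = [(j, 0, 0, j, 0) for j in range(len(hyp) + 1)]  # empty ref: all insertions
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, i, 0, 0)]  # empty hyp: all deletions
        for j in range(1, len(hyp) + 1):
            e, s, d, n, c = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, n, c + 1)       # correct
            else:
                diag = (e + 1, s + 1, d, n, c)   # substitution
            e, s, d, n, c = prev[j]
            delete = (e + 1, s, d + 1, n, c)     # reference word missed
            e, s, d, n, c = curr[j - 1]
            insert = (e + 1, s, d, n + 1, c)     # extra word in transcript
            curr.append(min(diag, delete, insert, key=lambda cell: cell[0]))
        prev = curr
    return prev[-1][1:]

def error_rate(ref, hyp):
    """(S + D + I) / N, where N = S + D + C is the reference length."""
    s, d, n, c = align_counts(ref, hyp)
    if s + d + c == 0:
        raise ValueError("reference must not be empty")
    return (s + d + n) / (s + d + c)

reference = "The meeting is at three tomorrow afternoon".split()
transcript = "The meeting is at four tomorrow afternoon".split()
print(align_counts(reference, transcript))             # (1, 0, 0, 6)
print(f"WER = {error_rate(reference, transcript):.1%}")  # WER = 14.3%
```

Passing lists of characters instead of lists of words to the same functions gives a character-level rate in the spirit of CER.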

      How much do results differ in noisy conditions?

      The hardest clip currently published is a dinner-party recording with several people talking at once. It was captured by a far-field microphone, and some stretches are hard to make out even for a human.

      On this test:

      • BizCrush: 49.6% WER

      • Competing service: ~72% WER


      In quiet conditions, most services land around 3–5% and look similar. But in the noisy conditions where users struggle most, a significant gap appears. What matters isn't the absolute number but which service makes fewer errors on the same audio.
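
One way to read two scores on the same audio is as a relative difference rather than two absolute numbers. The short Python sketch below does that arithmetic for the figures quoted above. The function name `relative_error_reduction` is hypothetical, and because the competing score is only given as "~72%", the result is a rough figure, not a published one.

```python
def relative_error_reduction(wer_a: float, wer_b: float) -> float:
    """Fraction by which error rate A is lower than error rate B on the same audio."""
    if wer_b <= 0:
        raise ValueError("wer_b must be positive")
    return (wer_b - wer_a) / wer_b

# 0.496 is the figure quoted above; 0.72 stands in for the approximate "~72%".
reduction = relative_error_reduction(0.496, 0.72)
print(f"{reduction:.1%}")  # 31.1%
```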

      We don't count simple notation differences as errors

      To evaluate real recognition performance, we strip away notation differences as much as possible. For example:

      • letter case

      • punctuation

      • HTML entities

      • currency notation

      • percentage notation


      If the meaning is the same, we treat them as identical. The same applies across languages. In Korean, for instance, "할 수 있다" and "할수있다" sound identical and differ only in spacing. A reviewer listens to such cases and, if they're acoustically the same, marks them as a match.
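
A normalization pass of this kind might look like the Python sketch below. It is an illustration under stated assumptions, not the benchmark's actual rules: the function names, the choice to spell out "$" as "dollars" and "%" as "percent", and the decision to replace punctuation with spaces are all hypothetical. The spacing check only flags candidates; in the process described above, a human reviewer still makes the final call.

```python
import html
import re
import unicodedata

def normalize(text: str) -> str:
    """Reduce notation differences before scoring (assumed rules, for illustration)."""
    text = html.unescape(text)                                     # HTML entities: "&amp;" -> "&"
    text = text.lower()                                            # letter case
    text = re.sub(r"\$\s*(\d+(?:\.\d+)?)", r"\1 dollars", text)    # currency notation
    text = re.sub(r"(\d)\s*%", r"\1 percent", text)                # percentage notation
    # Replace every Unicode punctuation character with a space.
    text = "".join(" " if unicodedata.category(ch).startswith("P") else ch for ch in text)
    return " ".join(text.split())                                  # collapse whitespace

def spacing_only_difference(a: str, b: str) -> bool:
    """True when two strings differ, but only in where the spaces fall."""
    return a != b and "".join(a.split()) == "".join(b.split())

print(normalize("It costs $5, up 50%!"))                # it costs 5 dollars up 50 percent
print(normalize("Q&amp;A") == normalize("Q&A"))         # True
print(spacing_only_difference("할 수 있다", "할수있다"))  # True
```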

      Test data currently published

      For now, we use only publicly available audio with no licensing issues.


      [Image: Test clips page]

      Dinner Party (Amazon DiPCo)

      • English

      • High noise

      • 15:49

      VOA Korea – Parkour

      • Korean

      • Low noise

      • 5:37

      VOA – Texas Korean Community

      • English

      • Low noise

      • 9:03


      Every clip is published together with its source.

      On top of that, BizCrush also produces its own license-clean recordings, expanding the test set every week. The first series is the BizCrush Board Game Lunch (BGL), in which we curate real conversation settings together with their ground-truth transcripts. These in-house recordings carry their own benchmark-use license, stated in the page footer and in the License & Sources section of the Methodology page.

      Results split by noise level

      Looking at the published results, almost every service is highly accurate in quiet conditions. But the gap widens as noise increases.

      That's why the BizCrush benchmark doesn't just show an average. It separates

      • low noise

      • high noise


      so the results reflect real-world usage more accurately.
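
The Python sketch below shows why a single average can mislead and a per-noise-level split is more informative. The record layout, the function name `mean_wer_by_noise`, and all numbers are placeholders invented for illustration; they are not published benchmark results.

```python
from collections import defaultdict
from statistics import mean

def mean_wer_by_noise(runs):
    """Average WER per noise level instead of one overall figure."""
    groups = defaultdict(list)
    for run in runs:
        groups[run["noise"]].append(run["wer"])
    return {noise: mean(values) for noise, values in groups.items()}

# Placeholder values for illustration only; not published results.
runs = [
    {"clip": "clip-a", "noise": "low", "wer": 0.04},
    {"clip": "clip-b", "noise": "low", "wer": 0.05},
    {"clip": "clip-c", "noise": "high", "wer": 0.50},
]

overall = mean(run["wer"] for run in runs)
by_noise = mean_wer_by_noise(runs)
print(f"overall: {overall:.1%}")
for noise, value in by_noise.items():
    print(f"{noise}: {value:.1%}")
```

With these placeholder values the overall average sits between the two groups and describes neither of them well, which is the case the split is meant to avoid.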


      [Image: Example of the runs page]

      Transparency first

      Every published test comes with:

      • the source audio

      • the ground-truth transcript

      • the app's transcript output

      • an error breakdown

      • the WER/CER calculation


      so anyone can check it and recompute it themselves. We also anonymize the compared services (except major global services) so that individual services are not identified unnecessarily. The source audio can be reviewed and verified directly on the page.
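
Given an error breakdown, recomputing a score takes only a few lines. The Python sketch below assumes the breakdown lists substitution, deletion, insertion, and correct counts and that the published score is rounded; the function names, the tolerance, and the example numbers are hypothetical and do not describe the site's actual data format.

```python
def wer_from_breakdown(substitutions: int, deletions: int, insertions: int, correct: int) -> float:
    """Recompute WER from counts; the reference length is S + D + C."""
    reference_words = substitutions + deletions + correct
    if reference_words == 0:
        raise ValueError("breakdown describes an empty reference")
    return (substitutions + deletions + insertions) / reference_words

def matches_published(breakdown: dict, published_wer: float, tolerance: float = 0.0005) -> bool:
    """Check a published score against its breakdown, allowing for rounding."""
    return abs(wer_from_breakdown(**breakdown) - published_wer) <= tolerance

# Hypothetical breakdown: one substitution in a seven-word reference.
example = {"substitutions": 1, "deletions": 0, "insertions": 0, "correct": 6}
print(matches_published(example, 0.143))  # True
```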

      What's next

      This benchmark currently focuses on real, app-based testing. Going forward, we plan to add:

      • direct API testing

      • more languages

      • more noisy environments

      • post-processed-transcript evaluation for note-style services

      • automatic app-version tracking


      We believe a good benchmark is built by steadily accumulating trustworthy data. BizCrush will keep building an STT evaluation standard that stays as close as possible to real-world use.


      👉 Visit the BizCrush Benchmark page