
Artificial intelligence is all the rage nowadays, and Barton Gellman pointed out that whisper.cpp delivers fantastic accuracy.

So, I gave the app a run, and it is impressive. Unfortunately, directions for usage could be a bit better. Here are some helpful tips.

First, some directions for installing.

  1. Clone the git repository.
    $ git clone https://github.com/ggerganov/whisper.cpp
  2. Move into the newly created whisper.cpp directory.
    $ cd whisper.cpp
  3. Compile the software. [My systems are pretty vanilla, and there were no hitches with the compile. Kudos to those writing this software.]
    $ make
  4. Next, install a transcription engine by running the download script for one of the engines. There are five to choose from: tiny, base, small, medium, and large. Below, the base engine is installed.
    $ cd models
    $ ./download-ggml-model.sh base.en [downloads the base engine]
    $ cd .. [to return to the whisper.cpp directory]
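As a quick sanity check after step 4, a small shell helper can confirm that the engine file landed where the later commands expect it. This is only a sketch: model_path and check_model are hypothetical names of my own, not part of whisper.cpp, and the models/ggml-NAME.bin pattern is assumed from the file name used in the command further on.

```shell
# Hypothetical helper (not part of whisper.cpp): map an engine name to the
# file the download script is assumed to create (models/ggml-NAME.bin).
model_path() {
    printf 'models/ggml-%s.bin\n' "$1"
}

# Report whether the engine file is present.
# Run this from the whisper.cpp directory.
check_model() {
    path="$(model_path "$1")"
    if [ -f "$path" ]; then
        echo "found $path"
    else
        echo "missing $path"
        return 1
    fi
}
```

For example, check_model base.en reports whether models/ggml-base.en.bin exists.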

Whisper and the base engine are now installed and ready to go. The basic whisper command structure is:

usage: `./main [options] file0.wav file1.wav ...`

Useful/important options to consider using, in order of use, are:

  • -m MODEL [engine model to use]
  • -otxt [txt file output format]
  • -ocsv [csv file output format]
  • -of FILENAME [name of output file, without an extension]
  • -f WAV FILE [name of wav file to transcribe]
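To show how these options fit together, here is a minimal sketch that prints one whisper command per WAV file, naming each output file after its input. The function names (whisper_cmd, whisper_batch) and the WHISPER_BIN and WHISPER_MODEL variables are my own assumptions for illustration; only the flags listed above are used.

```shell
# Assumed defaults; override them through environment variables.
WHISPER_BIN="${WHISPER_BIN:-./main}"
WHISPER_MODEL="${WHISPER_MODEL:-models/ggml-base.en.bin}"

# Print the whisper command for one WAV file. The output name is the
# input name without its .wav extension (hearing.wav -> hearing.txt).
whisper_cmd() {
    in_wav="$1"
    out_name="${in_wav%.wav}"    # -of takes a name without an extension
    printf '%s -m %s -otxt -of %s -f %s\n' \
        "$WHISPER_BIN" "$WHISPER_MODEL" "$out_name" "$in_wav"
}

# Print one command per WAV file given as an argument.
whisper_batch() {
    for f in "$@"; do
        whisper_cmd "$f"
    done
}
```

Running whisper_batch *.wav only prints the commands, so they can be reviewed first; piping that output to sh would execute them. The sketch does not quote file names, so it assumes names without spaces.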

To see all available options, enter ./main -h. Here is the output from running the following command with a short file from one of my unemployment hearings (client name and phone number removed from the transcription).

$ ./main -m models/ggml-base.en.bin -otxt -of Client-test -f ClientSample.wav

whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem required  =  215.00 MB (+    6.00 MB per decoder)
whisper_model_load: kv self size  =    5.25 MB
whisper_model_load: kv cross size =   17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.60 MB
whisper_model_load: model size    =  140.54 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

main: processing 'ClientSample.wav' (4940975 samples, 308.8 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:06.120]   This is a continuation of the hearing we were having difficulties with the connection.
[00:00:06.120 --> 00:00:11.280]   So ultimately I just decided to disconnect and connect all the parties again.
[00:00:11.280 --> 00:00:15.000]   I'm going to call the attorney first.
[00:00:15.000 --> 00:00:25.200]   Hello, this is administrative law judge Barbara Gerber.
[00:00:25.200 --> 00:00:27.320]   Do we have a better connection?
[00:00:27.320 --> 00:00:28.320]   It is better.
[00:00:28.320 --> 00:00:35.320]   All right, let me try to connect Miss CLIENT again.
[00:00:35.320 --> 00:00:55.320]   It's styling.
[00:00:55.320 --> 00:01:13.320]   Please leave your message for ###-###-####.
[00:01:13.320 --> 00:01:21.240]   Miss CLIENT, this is administrative law judge Barbara Gerber calling regarding your unfinished
[00:01:21.240 --> 00:01:23.200]   unemployment appeal hearing.
[00:01:23.200 --> 00:01:27.440]   I'm going to wait a couple of minutes and then I'll give you another call and hopefully we
[00:01:27.440 --> 00:01:29.440]   can make a connection at that time.
[00:01:29.440 --> 00:01:30.440]   Thank you.
[00:01:30.440 --> 00:01:37.240]   So Mr. Forberger, I'm going to give her about five minutes and see if she can figure out
[00:01:37.240 --> 00:01:46.360]   either a different phone or location and get to some spot where we can finish the hearing.
[00:01:46.360 --> 00:01:50.160]   All right.
[00:01:50.160 --> 00:01:59.840]   Attorney for Berger.
[00:01:59.840 --> 00:02:03.240]   Thank you.
[00:02:03.240 --> 00:02:13.240]   [BLANK_AUDIO]
[00:02:13.240 --> 00:02:23.240]   [BLANK_AUDIO]
[00:02:23.240 --> 00:02:33.240]   [BLANK_AUDIO]
[00:02:33.240 --> 00:02:43.240]   [BLANK_AUDIO]
[00:02:43.240 --> 00:02:53.240]   [BLANK_AUDIO]
[00:02:53.240 --> 00:03:03.240]   [BLANK_AUDIO]
[00:03:03.240 --> 00:03:13.240]   [BLANK_AUDIO]
[00:03:13.240 --> 00:03:23.240]   [BLANK_AUDIO]
[00:03:23.240 --> 00:03:33.240]   [BLANK_AUDIO]
[00:03:33.240 --> 00:03:43.240]   [BLANK_AUDIO]
[00:03:43.240 --> 00:03:53.240]   [BLANK_AUDIO]
[00:03:53.240 --> 00:04:03.240]   [BLANK_AUDIO]
[00:04:03.240 --> 00:04:13.240]   [BLANK_AUDIO]
[00:04:13.240 --> 00:04:23.240]   [BLANK_AUDIO]
[00:04:23.240 --> 00:04:33.240]   [BLANK_AUDIO]
[00:04:33.240 --> 00:04:43.240]   [BLANK_AUDIO]
[00:04:43.240 --> 00:04:53.240]   [BLANK_AUDIO]
[00:04:53.240 --> 00:05:03.240]   [BLANK_AUDIO]
[00:05:03.240 --> 00:05:13.240]   [BLANK_AUDIO]

output_txt: saving output to 'Client-test.txt'

whisper_print_timings:     fallbacks =   4 p /   0 h
whisper_print_timings:     load time =   230.18 ms
whisper_print_timings:      mel time =  2945.69 ms
whisper_print_timings:   sample time =   511.61 ms /   564 runs (    0.91 ms per run)
whisper_print_timings:   encode time = 63995.05 ms /    26 runs ( 2461.35 ms per run)
whisper_print_timings:   decode time = 11700.60 ms /   548 runs (   21.35 ms per run)
whisper_print_timings:    total time = 79435.22 ms

As noted in this output, a txt file called Client-test.txt with this transcription was also produced. A test with the same WAV file using the medium engine produced this text (time stamps removed).

This is a continuation of the hearing.
We were having difficulties with the connection, so ultimately I just decided to disconnect
and connect all the parties again.
I'm going to call the attorney first.
Hello, this is Administrative Law Judge Barbara Gerber.
Do we have a better connection?
It is better.
All right.
So let me try to connect Ms. CLIENT again.
It's dialing.
Please leave your message for ###-###-####.
Ms. CLIENT, this is Administrative Law Judge Barbara Gerber calling regarding your unfinished
unemployment appeals hearing.
I'm going to wait a couple of minutes and then I'll give you another call and hopefully
we can make a connection at that time.
Thank you.
So Mr. Forberger, I'm going to give her about five minutes and see if she can figure out
either a different phone or location and get to some spot where we can finish the hearing.
All right?
Attorney Forberger?
Attorney Forberger?
Yes.
Okay.
Okay.
Okay.
Okay.
Okay.
Okay.
Okay.
Okay.
Okay.

This transcription is pretty good, but it is still a long way from replacing a court reporter.
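For anyone cleaning up console output like the first transcript, the time stamps, the [BLANK_AUDIO] lines, and runs of identical lines (such as the repeated "Okay.") can be handled with standard tools. The sketch below is one possible way, not necessarily how the text above was prepared; clean_transcript is a hypothetical name, and the time stamp pattern is assumed from the format shown in the output.

```shell
# Hypothetical cleanup filter: reads a transcript on stdin, writes to stdout.
#   1. strip a leading "[hh:mm:ss.mmm --> hh:mm:ss.mmm]" and the spaces after it
#   2. delete lines that contain only [BLANK_AUDIO]
#   3. collapse consecutive identical lines into one
clean_transcript() {
    sed -e 's/^\[[0-9:.]* --> [0-9:.]*] *//' \
        -e '/^\[BLANK_AUDIO]$/d' | uniq
}
```

Saved console output can be passed through it with clean_transcript < console-output.txt, where the file name is a placeholder. Be aware that step 3 also collapses genuine repetition, such as the judge asking "Attorney Forberger?" twice, so the result still needs a read-through against the audio.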
