
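OpenAI's Whisper model can perform speech recognition in many languages. In this simple guide, we'll learn how to run Whisper before looking at a performance analysis.

Yesterday, OpenAI released its Whisper speech recognition model. Whisper joins the other open-source speech-to-text models available today, such as Kaldi, Vosk, and wav2vec 2.0, and matches state-of-the-art results for speech recognition.

In this article, we'll learn how to install and run Whisper, and we'll also take a close look at Whisper's accuracy, inference time, and cost to run.

# How to Run OpenAI's Whisper

In this section, we'll learn how to install and use Whisper. If you already have Whisper up and running, you can skip ahead to the Whisper analysis or to the more involved Whisper advanced usage.

Step 1: Install Dependencies

Whisper requires Python 3.7+ and a recent version of PyTorch (we used PyTorch 1.12.1 without issue). If you don't have Python and PyTorch yet, install them now.

Whisper also requires FFmpeg, an audio-processing library. If FFmpeg is not already installed on your machine, use one of the commands below to install it.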

# Linux
sudo apt update && sudo apt install ffmpeg

# MacOS
brew install ffmpeg

# Windows
choco install ffmpeg

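Additional Details

The macOS installation command requires Homebrew, and the Windows installation command requires Chocolatey, so make sure to install whichever tool you need.

Finally, if you are using Windows, make sure that Developer Mode is enabled. In your system settings, navigate to Privacy & security > For developers and turn on the top toggle switch to enable Developer Mode if it isn't on already.

Step 2: Install Whisper

Now we're ready to install Whisper. Open a command line and execute the command below: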

pip install git+https://github.com/openai/whisper.git 

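Step 3: Run Whisper

Command Line

First, we'll use Whisper from the command line. Simply open a terminal and navigate to the directory that contains your audio file. We'll be using a file called audio.wav, which is the first line of the Gettysburg Address. To transcribe this file, we run the following command in the terminal: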

whisper audio.wav

The output is displayed in the terminal:

(venv) C:\Users> whisper audio.wav
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: english
[00:00.000 --> 00:10.000]  Four score and seven years ago, our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

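The transcription is also saved to audio.wav.txt, along with a file audio.wav.vtt that can be used for closed captioning.

Python

Transcribing with Whisper in Python is very easy. Simply import whisper, specify a model, and transcribe the audio.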

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.wav")

The transcription text can be accessed with result["text"]. The result object itself contains other useful information:

{
  "text": " Four score and seven years ago, our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 10.0,
      "text": " Four score and seven years ago, our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.",
      "tokens": [
        50364,
        7451,
        6175,
        ...,
        2681,
        13,
        50864
      ],
      "temperature": 0.0,
      "avg_logprob": -0.1833780391796215,
      "compression_ratio": 1.3858267716535433,
      "no_caption_prob": 0.05988641083240509
    }
  ],
  "language": "en"
}
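As a rough illustration of how the "segments" list can be used, the sketch below turns each segment into a timestamped line similar to the command-line output shown earlier. The helper names format_timestamp and segment_lines are made up for this example and are not part of the Whisper package; the dictionary is a trimmed stand-in for the result object above.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS.mmm."""
    milliseconds = round(seconds * 1000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1000)
    return f"{minutes:02d}:{secs:02d}.{milliseconds:03d}"


def segment_lines(result: Dict) -> List[str]:
    """Return one '[start --> end] text' line per segment of a transcribe() result."""
    lines = []
    for segment in result["segments"]:
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} --> {end}] {segment['text'].strip()}")
    return lines


# Trimmed stand-in for the result object shown above (hypothetical, shortened text)
result = {
    "text": " Four score and seven years ago",
    "segments": [
        {"id": 0, "start": 0.0, "end": 10.0, "text": " Four score and seven years ago"}
    ],
    "language": "en",
}

for line in segment_lines(result):
    print(line)
```

With a real result from model.transcribe, the same loop would print one line per segment, which can be handy when building captions or aligning text with audio.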

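OpenAI Whisper Analysis

The figure below, from the Whisper paper, compares Whisper's accuracy, measured by word error rate (WER), with current state-of-the-art speech recognition models. As you can see, Whisper reports state-of-the-art results, which is an exciting development for the speech recognition field, especially given that Whisper is an open-source model.

While these results are exciting, speech recognition remains an open problem, especially for non-English languages. The figure below reports Whisper's word error rate for each supported language. While Whisper achieves state-of-the-art results in several languages (a number of Romance languages, German, Japanese, and others), performance is relatively poor in other languages.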

Whisper word error rate as a function of language (source)

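Below we see the distribution of languages as a function of word error rate. Of the 82 languages in the figure above, 50 have a word error rate greater than 20%.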
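For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of words in the reference. The following is a minimal sketch of that calculation, assuming simple lowercasing and whitespace tokenization; published benchmarks usually normalize text more carefully, so this function (whose name is chosen for illustration) will not reproduce the reported numbers exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)


# One substitution ("toll" -> "tulbridge") and one deletion ("bridge")
# over a four-word reference
print(word_error_rate("raise the toll bridge", "raise the tulbridge"))  # 0.5
```

A WER above 20%, as reported for many languages in the figure, means that on average more than one word in five needs to be corrected.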

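Anecdotal Comparisons

At AssemblyAI, our API is powered by state-of-the-art Conformer-CTC models trained on roughly 100,000 hours of labeled data. To explore Whisper's accuracy, we decided to run some side-by-side comparisons of Whisper against our model.

First, we show a comparison on the Micro Machines example from the Whisper announcement post: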

AssemblyAI

This is a Micro Machine man presenting the most midget miniature motorcade of Micro Machine. This one has dramatic details, perfect turn, precision paint jobs, plus incredible Micro Machine pocket place that says a police station, fire station, restaurant service station, and more. Perfect pocket portable to take any place. And there are many miniature places to play with. Each one comes with its own special edition Micro Machine vehicle and fun, fantastic features that miraculously moved OOH. Raise the boltless at the airport marina, man the gun turret at the army base. Clean your car at the car wash. Raise the tulbridge. And these places fit together to form a Micro Machine world. Micro Machine pocket places that's tremendously tiny, so perfectly precise, so dazzlingly detailed, you'll want to pocket them all. Micro Machines are microschin pocket place that sold separately from glue. The smaller they are, the better they are.

Google Speech-to-Text

this is Michael presenting the most midget miniature motorcade of micro machine which one has dramatic details terrific current position paying jobs plus incredible Michael Schumacher place that's there's a police station Fire Station restaurant service station and more perfect bucket portable to take any place and there are many many other places to play with of each one comes with its own special edition Mike eruzione vehicle and fun fantastic features that miraculously move raise the boat looks at the airport Marina men the gun turret at the Army Base clean your car at the car wash raised the toll bridge and these play sets fit together to form a micro machine world like regime Parker Place that's so tremendously tiny so perfectly precise so dazzlingly detail Joanna pocket them all my questions are microscopic play set sold separately from glue the smaller they are the better they are

Whisper

This is the Micro Machine Man presenting the most midget miniature motorcade of Micro Machines. Each one has dramatic details, terrific trim, precision paint jobs, plus incredible Micro Machine Pocket Play Sets. There's a police station, fire station, restaurant, service station, and more. Perfect pocket portables to take any place. And there are many miniature play sets to play with, and each one comes with its own special edition Micro Machine vehicle and fun, fantastic features that miraculously move. Raise the boatlift at the airport marina. Man the gun turret at the army base. Clean your car at the car wash. Raise the toll bridge. And these play sets fit together to form a Micro Machine world. Micro Machine Pocket Play Sets, so tremendously tiny, so perfectly precise, so dazzlingly detailed, you'll want to pocket them all. Micro Machines are Micro Machine Pocket Play Sets sold separately from Galoob. The smaller they are, the better they are.

The second example is a clip from a podcast:

AssemblyAI

One of them is I made the claim I think most civilizations going from simple bacteria like things to space, colonizing civilizations, they spend only a very tiny fraction of their life being where we are, that I could be wrong about. The other one I could be wrong about is quite different statements that I think that actually I'm guessing that we are the only civilization in our observable universe from which light has reached us so far that's actually gotten far enough to invent telescopes. So let's talk about maybe both of them in turn, because they really are different. The first one, if you look at N, equals one the data for we have on this planet. So we spent four and a half billion years fussing around on this planet with life. And most of it was pretty lame stuff from an intelligence perspective. Bacteria and then the dinosaurs spent then the things greatly accelerated, and the dinosaurs spent over 100 million years stomping around here without even inventing smartphones. And then very recently, we've only spent 400 years going from Newton to us, right? Yeah. In terms of technology. And look what we've done even.

Google Speech-to-Text

one of them is I made the claim I think most civilizations going from simple bacteria are like things to space space colonizing civilization they spend only a very very tiny fraction of their other other life being where we are. I could be wrong about the other one I could be wrong about this quite different statements and I think that actually I'm guessing that we are the only civilization in the observable universe from which life has weeks or so far that's actually gotten far enough to men's telescopes but if you look at the antique was one of the date of when we have on this planet right so we spent four and a half billion years fucking around on this planet with life we got most of it was it was pretty lame stuff from an intelligence perspective he does bacteria and then the dinosaurs spent then the things right The Accelerated by then the dinosaurs spent over a hundred million a year is stomping around here without even inventing smartphone and and then very recently I only spent four hundred years going from Newton to us right now in terms of technology and look what we don't even

Whisper

One of them is, I made the claim, I think most civilizations, going from, I mean, simple bacteria like things to space colonizing civilizations, they spend only a very, very tiny fraction of their life being where we are. That I could be wrong about. The other one I could be wrong about is the quite different statement that I think that actually I'm guessing that we are the only civilization in our observable universe from which light has reached us so far that's actually gotten far enough to invent telescopes. So let's talk about maybe both of them in turn because they really are different. The first one, if you look at the N equals one, the date of one we have on this planet, right? So we spent four and a half billion years f**king around on this planet with life, right? We got, and most of it was pretty lame stuff from an intelligence perspective, you know, the dinosaur has spent, then the things were actually accelerated, right? Then the dinosaur has spent over a hundred million years stomping around here without even inventing smartphones. And then very recently, you know, it's only spent four hundred years going from Newton to us, right? In terms of technology, and we've looked at what we've done even.

The final audio file is a board meeting:

AssemblyAI

East Side charter. I'm sorry. Go now. Okay. I'd like to call to order a special joint meeting of the board of directors of Eastside Charter School and Charter School of Newcastle. It is 535. I'd like to call the role. And attending for East Side Charter School we have Ms. Stewart, Mr. Sawyer, Dr. Gordon, Mr. Hare, Ms. Sims, Mr. Veal, Ms. Fortunato, Ms. Tieno and Mr. Humphrey. And attending for Charter School in Newcastle. We have Dr. Bailey, Ms. Johnson, mr. Taylor, mr. McDowell, mr. Preston and Mr. Humphrey. I miss anybody, and I do not believe anybody is on the conference line. As there is no public items on our agenda. I would like a motion from a Charter School of Newcastle board meeting to move into executive discussion to talk about personnel matters. I'll make that motion. Thank you, Mr. Second, Mr. Preston. All Charter School and Newcastle board members in favor, please say aye. Aye. Any opposed? Motion unanimous. I would ask the same. Put the same question to East Side Charter School. Thank you, MSN. Is there a second? Thank you, Mr. Veal. All those in favor, please say aye. Any opposed? Okay, so we move from public session to executive session at 535. We're back in public session. You just read your message. Okay, we're now back in public session at 715. And there being no further business, I will entertain a motion from Charter School Newcastle to adjourn. Thank you. Is there a second? Thank you. All in favor please say aye. Opposed? Charter School. Adjourn EastWater Charter School for the same motion. Thank you. Thank you, Ms. Mitchell. All those in favor, please say aye. Opposed? Motion carries. Meeting adjourned. Thank you all very much.

Google Speech-to-Text

I'd like to call to order a special joint meeting of the board of directors of Eastside charter school is Charter School of New Castle it is 5:35 I'd like to call the roll and they're sending for eastside Charter School dr. Gordon sister here I miss them Mr Vilnius Fortunato misiano and Mr Humphrey attending for Charter School of New Castle we have dr. Bailey is Johnson mr. Taylor Miss McDowell mr. Preston and mr. Humphries is anybody and I do not believe anybody is on the conference line is there is no public items on our agenda I would like a motion from a charter school of New Castle board meeting to move into executive discussion to talk about personal matters call Turtle Newcastle board members in favor please say I charter school all those in favor please say I so we moved from public session to Executive session at 5:35 okay it is 750 + can you just leave it here at 7:15 and there being no further business I was in between the motion soundtrack to a New Castle to adjourn thank you is there s you all in favor please say I referred her to let her know I will be set at her school for the promotion of a second long does it take a PPI motion carry beating jiren thank you all very much.

Whisper

I'd like to call to order a special joint meeting of the board of directors of East Side Charter School in Charter School of Newcastle. It is 535. I'd like to call the role and attending for East Side Charter School. We have Mr. Stewart, Mr. Sawyer, Dr. Gordon, Mr. Hair, Ms. Thames, Mr. Veal, Ms. Portionato, Ms. Dienno, and Mr. Humphrey. And attending for Charter School of Newcastle, we have Dr. Bailey, Ms. Johnson, Mr. Taylor, Mr. McDowell, Mr. Preston, and Mr. Humphrey. I do not believe anybody is on the conference line. As there is no public items on our agenda, I would like a motion from a Charter School of Newcastle board meeting to move into executive discussion to talk about personnel matters. I'll make that motion. Thanks, Mr. Preston. All Charter School of Newcastle board members in favor, please say aye. Aye. Aye. Any opposed? Motion unanimous. I would ask the same question to East Side Charter School. Thank you, Mr. Thames. Is there a second? Thank you, Mr. Veal. All those in favor, please say aye. Aye. Any opposed? Okay. So we move from public session to executive session at 535. Okay, we're back. Okay. It is now 715. And we're back in public session. You just need to carry my phone. Okay. So we are now back in public session at 715. And they're being a further business. I will then be paying the motion from Charter School of Newcastle to adjourn. Thank you. Is there a second? Thank you. All in favor, please say aye. Aye. Any opposed? Charter School adjourned. I will ask East Side Charter School for the same motion as usual. Thank you. Mr. Thames, Mr. Mitchell, all those in favor, please say aye. Any opposed? Motion carries. Meeting adjourned. Thank you all very much.

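As we can see, Whisper performs very well and is a great addition to the state-of-the-art speech recognition options available today.

Whisper Inference Time

Whisper is available in five sizes: tiny, base (default), small, medium, and large, with accuracy increasing with size. The large model therefore has the best accuracy and is the model used in the benchmarks reported in the paper and in the figures above. Whisper can be used on both CPU and GPU; however, inference on CPU is prohibitively slow for the larger models, so it is advisable to run them only on GPU.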
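To get a feel for how inference time grows with model size on your own hardware, a small timing harness along the following lines can help. This is only a sketch: time_transcription and benchmark_whisper are names invented for this example, the measurement is simple wall-clock time for a single run, and model loading is deliberately excluded from the timing.

```python
import time
from typing import Callable, Dict, Iterable

MODEL_SIZES = ("tiny", "base", "small", "medium", "large")


def time_transcription(transcribe: Callable[[str], object], audio_path: str) -> float:
    """Return the wall-clock seconds taken by one transcribe(audio_path) call."""
    start = time.perf_counter()
    transcribe(audio_path)
    return time.perf_counter() - start


def benchmark_whisper(audio_path: str, sizes: Iterable[str] = MODEL_SIZES) -> Dict[str, float]:
    """Time each requested Whisper model size on a single audio file."""
    import whisper  # imported lazily so the timing helper works without the package

    timings = {}
    for size in sizes:
        model = whisper.load_model(size)  # downloads the weights on first use
        timings[size] = time_transcription(model.transcribe, audio_path)
    return timings
```

For instance, calling benchmark_whisper("audio.wav", sizes=("tiny", "base")) would return a dictionary mapping each of the two sizes to its transcription time in seconds. On CPU it is sensible to start with the smaller sizes only.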

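The Micro Machines example was transcribed with Whisper on both CPU and GPU for each model size, and the inference times are reported below. First, we see the results for CPU (i5-11300H):

Next, we have the results on GPU (high-RAM GPU Colab environment):

Here are the same results side by side: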

 

 


Implementation Detail

If you run into the RuntimeError "slow_conv2d_cpu" not implemented for 'Half' when using Whisper on CPU, you will have to use Whisper's low-level API in Python and replace options = whisper.DecodingOptions() with options = whisper.DecodingOptions(fp16=False).
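The note above can be sketched in code roughly as follows, assuming the low-level calls documented in the Whisper README (load_audio, pad_or_trim, log_mel_spectrogram, DecodingOptions, decode). The helper names decoding_kwargs and decode_first_30_seconds are hypothetical and exist only for this example.

```python
def decoding_kwargs(device: str) -> dict:
    """Request half precision only on CUDA devices; CPU decoding needs fp16=False."""
    return {"fp16": device.startswith("cuda")}


def decode_first_30_seconds(audio_path: str, model_name: str = "base") -> str:
    """Decode the first 30 seconds of a file with Whisper's low-level API."""
    import torch
    import whisper

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model(model_name, device=device)

    # Load the audio and pad or trim it to the 30-second window the model expects
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # fp16=False on CPU avoids the 'Half' RuntimeError described above
    options = whisper.DecodingOptions(**decoding_kwargs(device))
    result = whisper.decode(model, mel, options)
    return result.text
```

Choosing the precision from the device keeps the same script usable on both a laptop CPU and a GPU machine.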

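The Cost of Running Whisper

We report the cost of transcribing 1,000 hours of audio using Whisper in GCP (1x A100 40 GB) for each model size at different batch sizes, the values of which can be found in the legend.

The costs in this figure describe the raw compute cost of running Whisper once it has been deployed. They do not include the cost of building the surrounding infrastructure, fixing model bugs, or the staffing needed to improve and update the model. A dedicated research or research engineering team is needed to remain performance-competitive on an ongoing basis.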
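The raw compute cost described here comes down to simple arithmetic: the GPU hours needed to process the audio multiplied by the hourly price of the machine. The sketch below shows that calculation. The function name, the throughput figure, and the hourly price are placeholders chosen for illustration and are not values taken from the figure.

```python
def transcription_cost(audio_hours: float, realtime_factor: float, price_per_gpu_hour: float) -> float:
    """Raw compute cost of transcribing audio_hours of audio.

    realtime_factor is the number of audio hours processed per GPU hour.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    gpu_hours = audio_hours / realtime_factor
    return gpu_hours * price_per_gpu_hour


# Hypothetical inputs: 1,000 audio hours, 10x real-time throughput, $4.00 per GPU hour
print(transcription_cost(1000, 10, 4.0))  # 400.0
```

Larger batch sizes generally raise the throughput term, which is why cost depends on both the model size and the batch size.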

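Final Words

Our analysis above shows that Whisper achieves state-of-the-art results for speech recognition in several languages. Compared with other open-source options, Whisper will be a valuable tool for researchers and hackers, in terms of both accuracy and ease of use. Whisper's performance stems in part from its computational intensity, so applications that require the larger, more powerful versions of Whisper should make sure to run it on GPU, whether locally or in the cloud.

# Whisper Advanced Usage

We saw the basics of running Whisper in the "How to Run OpenAI's Whisper" section above. For a more involved example, we'll look at a modified version of the Multilingual ASR notebook. Execute the following commands to download the example code and install the necessary requirements: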

git clone https://github.com/AssemblyAI-Examples/whisper-multilingual.git
cd whisper-multilingual
pip install -r requirements.txt

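Next, simply run python main.py to transcribe several Korean audio files and translate them into English. Each data point takes about 3 minutes to process on CPU. We use 10 data points in total, so let the process run in the background while we examine the main.py code.

First, we perform all necessary imports and then define a class for downloading and storing the audio data. The details of this class are not relevant here, so they are omitted for brevity.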

import io
import os

import torch
import pandas as pd
import urllib
import tarfile
import whisper

from scipy.io import wavfile
from tqdm import tqdm

class Fleurs(torch.utils.data.Dataset):
    # Implementation omitted for brevity; see main.py in the cloned repository
    pass

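Next, we set some display options for pandas, set the device used for inference, and then set the variables that specify the audio language. The first is the Korean language code used for downloading the data, and the second is the Korean language identifier used by the Whisper model.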

# Display options for pandas dataset
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 1000

# Set inference device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Set language (korean)
language_google = "ko_kr"
language_whisper = "korean"

Now we create the dataset using the class defined above, selecting a subsample of 10 audio files to speed up processing.

# Create dataset object, selecting only 10 examples for brevity
dataset = Fleurs(language_google, subsample_rate=1, device=device)
dataset = torch.utils.data.random_split(dataset, [10, len(dataset)-10])[0]

Next, we load the Whisper model we will use, choosing the "tiny" model version for faster inference. We then set the transcription and translation options.

# Load tiny Whisper model
model = whisper.load_model("tiny")

# Set options
options = dict(language=language_whisper, beam_size=5, best_of=5)
transcribe_options = dict(task="transcribe", **options)
translate_options = dict(task="translate", **options)

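Finally, we iterate over the dataset, transcribing each audio file in Korean and also translating each one into English. Note that the translation is performed directly on the audio data; we are not translating the generated transcription into English. We save the transcriptions and translations to lists, along with the ground-truth references for comparison.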

# Run inference
references = []
transcriptions = []
translations = []

for audio, text in tqdm(dataset):
    transcription = model.transcribe(audio, **transcribe_options)["text"]
    translation = model.transcribe(audio, **translate_options)["text"]

    transcriptions.append(transcription)
    translations.append(translation)
    references.append(text)

Lastly, we create a pandas DataFrame to store the results, then print them and save them to CSV.

# Create dataframe from results and save the data
data = pd.DataFrame(dict(reference=references, transcription=transcriptions, translation=translations))
print(data)
data.to_csv("results.csv")

The results are shown below:

 
