What are the demos in Kaldi's egs folder? (Grace_yan's blog, CSDN)

The official Kaldi introduction to egs is here: http://kaldi-asr.org/doc/examples.html

Below are just brief notes. If you need a particular recipe, read the README inside its folder carefully.
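If you just want a quick survey of a local checkout, a small Python sketch can walk egs/ and pull the first line of each recipe's README (the `list_recipes` helper is my own illustration, not something shipped with Kaldi):

```python
from pathlib import Path

def list_recipes(egs_dir):
    """Return {recipe_name: first README line, or None} for each subfolder of egs/."""
    summaries = {}
    for recipe in sorted(Path(egs_dir).iterdir()):
        if not recipe.is_dir():
            continue
        # READMEs in egs/ are variously named README, README.txt, README.md, ...
        readme = next((p for p in sorted(recipe.iterdir())
                       if p.is_file() and p.name.lower().startswith("readme")), None)
        first_line = None
        if readme is not None:
            with open(readme, encoding="utf-8", errors="replace") as f:
                first_line = f.readline().strip()
        summaries[recipe.name] = first_line
    return summaries
```

Point it at your own checkout, e.g. `list_recipes("/path/to/kaldi/egs")`.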

aishell: contains two folders. s5 is a speech recognition demo; v1 is a speaker recognition demo. The data used is AISHELL-1.

aishell2: only s5, i.e., speech recognition.

ami: The AMI Meeting Corpus. Only s5 and s5b, both speech recognition; s5b is presumably an improvement on the s5 recipe.

an4: speech recognition only.

apiai_decode: speech recognition only. This directory contains scripts showing how to use a pre-trained chain English model and the Kaldi base code to recognize any number of wav files.

aspire: speech recognition. This recipe is JHU's submission to the ASpIRE challenge. It uses the Fisher English corpus for training the acoustic and language models, and impulse responses and noises from the RWCP, AIR and Reverb2014 databases to create multi-condition data.

aurora4: uses the Wall Street Journal corpus. Contains both clean speech and speech with artificially added noise.

babel: speech recognition. Four recipe variants, including multilingual demos.

babel_multilang: same as above; presumably a dedicated multilingual recipe.

bentham: image recognition, i.e., OCR. This directory contains example scripts for handwriting recognition on the Bentham dataset: http://www.transcriptorium.eu/~htrcontest/contestICFHR2014/public_html/ In the ICFHR 2014 contest, the best performing system in the unrestricted track obtained a WER of 8.6%.
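WER (word error rate) comes up in almost every recipe here. It is the word-level edit distance between reference and hypothesis divided by the reference length; a minimal sketch (the function name is mine, not Kaldi's actual `compute-wer` tool):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    computed via Levenshtein distance over word lists."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(r)][len(h)] / len(r)
```

So a WER of 8.6% means roughly one word in twelve is substituted, deleted or inserted relative to the reference transcript.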

bn_music_speech: music/speech discrimination on Broadcast News. The MUSAN corpus is required for system training. It is available at: http://www.openslr.org/17/ The test requires Broadcast News data. The LDC catalog numbers are: Speech LDC97S44, Transcripts LDC97T22.

callhome_diarization: This directory contains example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. The 2000 NIST SRE is required, and has an LDC catalog number LDC2001S97. Additional data sources (mostly past NIST SREs, Switchboard, etc) are required to train the systems in the subdirectories. See the corresponding README.txt files in the subdirectories for more details. The subdirectories "v1" and so on are different diarization recipes. The recipe in v1 demonstrates a standard approach using i-vectors, PLDA scoring and agglomerative hierarchical clustering.

callhome_egyptian: speech recognition (Egyptian Arabic).

chime1: speech recognition demo. This is a Kaldi setup for the 1st CHiME challenge. See http://spandh.dcs.shef.ac.uk/projects/chime/challenge.html for more detailed information.

chime2: speech recognition demo, same as above.

chime3: speech recognition demo, same as above.

chime4: speech recognition demo, same as above.

chime5: speech recognition demo, same as above.

cifar: image classification. This directory contains example scripts for image classification with the CIFAR-10 and CIFAR-100 datasets, which are available for free from https://www.cs.toronto.edu/~kriz/cifar.html. This demonstrates applying the nnet3 framework to image classification for fixed-size images.

commonvoice: speech recognition. This is a Kaldi recipe for the Mozilla Common Voice corpus v1. See https://voice.mozilla.org/data for additional details. The amount of training audio is approximately 240 hours.

csj: speech recognition, using a Japanese corpus: the Corpus of Spontaneous Japanese.

dihard_2018: This is a Kaldi recipe for the First DIHARD Speech Diarization Challenge. DIHARD is a new annual challenge focusing on "hard" diarization; that is, speech diarization for challenging corpora where there is an expectation that the current state of the art will fare poorly, including, but not limited to: clinical interviews, extended child language acquisition recordings, YouTube videos and "speech in the wild" (e.g., recordings in restaurants). See https://coml.lscp.ens.fr/dihard/index.html for details. The subdirectories "v1" and so on are different speaker diarization recipes. The recipe in v1 demonstrates a standard approach using a full-covariance GMM-UBM, i-vectors, PLDA scoring and agglomerative hierarchical clustering. The example in v2 demonstrates DNN speaker embeddings, PLDA scoring and agglomerative hierarchical clustering.

fame: The FAME! Speech Corpus (Frisian). Has both speech recognition and speaker recognition.

farsdat: speech recognition. FARSDAT is the Persian counterpart of TIMIT; a Persian corpus.

fisher_callhome_spanish: speech recognition, Spanish corpus.

fisher_english: speech recognition, the Fisher English corpus.

fisher_swbd: speech recognition. This one has no README; for an introduction to the Switchboard (SWBD) corpus see https://catalog.ldc.upenn.edu/LDC97S62

formosa: speech recognition. ### Welcome to the demo recipe of the Formosa Speech in the Wild (FSW) Project ### The language habits of Taiwanese people are different from those of other Mandarin speakers (both accents and cultures) [1]; in particular, Taiwanese use traditional Chinese characters (i.e., 繁體中文). To address this issue, a Taiwanese speech corpus collection project, "Formosa Speech in the Wild (FSW)", was initiated in 2017 to improve the development of Taiwanese-specific speech recognition techniques.

gale_arabic: speech recognition, Arabic corpus. GALE Phase 2 Arabic Broadcast Conversation Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 200 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program.

gale_mandarin: speech recognition, Mandarin Chinese corpus. This recipe is trained on LDC2013S08 (text transcripts from LDC2013T20), which is GALE Phase 2 Chinese Broadcast News speech: 126 hours of Mandarin Chinese broadcast news speech collected in 2006 and 2007 by LDC and HKUST.

gp: multilingual speech recognition. About the GlobalPhone corpus: this is a corpus of read sentences from newspapers in 19 different languages, recorded under varying degrees of "clean" conditions. There is roughly 15-20 hours of training data for each language, as well as DEV and EVAL sets of roughly 2 hours each.

heroico: Spanish speech recognition.

hkust: Mandarin Chinese speech recognition (the HKUST Mandarin Telephone Speech corpus, collected by the Hong Kong University of Science and Technology).

hub4_english: speech recognition, the English Broadcast News (HUB4) corpus. Pulls together ten LDC datasets.

hub4_spanish: Spanish speech recognition.

iam This directory contains example scripts for handwriting recognition on the IAM dataset: http://www.fki.inf.unibe.ch/databases/iam-handwriting-database

iban: speech recognition, Iban (a language spoken in Malaysia).

ifnenit This directory contains example scripts for handwriting recognition on the Arabic IFN/ENIT dataset: http://www.ifnenit.com You’ll need to register at their website to be able to download the dataset.

librispeech: speech recognition. The LibriSpeech corpus is a large (1000 hour) corpus of English read speech derived from audiobooks in the LibriVox project, sampled at 16kHz. The accents are various and not marked, but the majority are US English. It is available for download for free at http://www.openslr.org/12/. It was prepared as a speech recognition corpus by Vassil Panayotov.

lre: language identification.

lre07: language identification. This directory (lre07) contains example recipes for the 2007 NIST Language Recognition Evaluation. The subdirectory v1 demonstrates the standard LID system, an i-vector based recipe using a full-covariance GMM-UBM and a logistic regression model. The subdirectory v2 demonstrates an LID system using a time-delay deep neural network based UBM, which replaces the GMM-UBM of v1. The DNN is trained on about 1800 hours of the English portion of Fisher.

madcat_ar: MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Arabic Corpus is an LDC dataset (LDC2012T15, LDC2013T09, LDC2013T15) for handwriting recognition. The dataset contains abstracts from news-related passages and blogs. The xml file for each page provides line segmentation and word segmentation information, and also provides the writing condition (writing style, speed, carefulness) of the page. It is a large dataset with a total of 42k page images, 750k (600k training, 75k dev, 75k eval) line images and 305 writers. The majority of the text is in Arabic but it also contains English letters and numerals. The dataset contains about 95k unique words and 160 unique characters. The dataset has been used in the NIST 2010 and 2013 (OpenHaRT Arabic large vocabulary unconstrained handwritten text recognition competition) evaluations (maybe with different splits) for the line-level recognition task; 16.1% WER was obtained for line-level recognition in that competition. More info: https://catalog.ldc.upenn.edu/LDC2012T15, https://catalog.ldc.upenn.edu/LDC2013T09/, https://catalog.ldc.upenn.edu/LDC2013T15/.

madcat_zh This directory contains example scripts for handwriting recognition on the MADCAT Chinese HWR dataset (LDC2014T13). This dataset consists of handwritten Chinese documents, scanned at high resolution and annotated for each line and token. More info: https://catalog.ldc.upenn.edu/LDC2014T13

mini_librispeech: speech recognition; no README.

multi_en: speech recognition. This is a WIP English LVCSR recipe that trains on data from multiple corpora. LVCSR = Large Vocabulary Continuous Speech Recognition.

ptb: language modeling (not speech recognition), using the Penn Treebank corpus.

reverb: speech recognition with reverberation. About the REVERB challenge ASR task: this is a Kaldi recipe for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. The challenge assumes the scenario of capturing utterances spoken by a single stationary distant-talking speaker with 1-channel (1ch), 2-channel (2ch) or 8-channel (8ch) microphone arrays in reverberant meeting rooms. It features both real recordings and simulated data, a part of which simulates the real recordings. The ASR challenge task consists of improving the recognition accuracy of the same reverberant speech. The background noise is mostly stationary and the signal-to-noise ratio is modest. See http://reverb2014.dereverberation.com for more detail.

rimes Rimes is a French handwriting recognition database created by A2iA.

rm: speech recognition. This is the example Dan uses in his slides when walking through the speech recognition pipeline.

sitw: speaker recognition in real-world conditions. This directory (sitw) contains example scripts for the Speakers in the Wild (SITW) Speaker Recognition Challenge. The SITW corpus is required, and can be obtained by following the directions at http://www.speech.sri.com/projects/sitw/

sprakbanken: speech recognition, Danish. About the sprakbanken corpus: this is a free corpus originally collected by NST for ASR purposes and currently hosted by the Norwegian libraries. The corpus is multilingual and contains Swedish, Norwegian (Bokmål) and Danish. The current setup uses the Danish subcorpus. The vocabulary is large and there are approx. 350 hours of read-aloud speech with associated text scripts. Some months ago the corpus was republished here: http://www.nb.no/sprakbanken/#ticketsfrom?lang en

sprakbanken-swe: same as above; speech recognition, Swedish.

sre08: speaker recognition, the 2008 evaluation corpus. This directory (sre08) contains example scripts for speaker identification, not speech recognition.

sre10: speaker recognition. This directory (sre10) contains example scripts for the NIST SRE 2010 speaker recognition evaluation.

sre16: speaker recognition, same as above; the 2016 evaluation corpus.

svhn This directory contains example scripts for image classification with the SVHN (Street View House Numbers) dataset, which is available for free from http://ufldl.stanford.edu/housenumbers/. This demonstrates applying the nnet3 framework to image classification for fixed size images.

swahili: speech recognition, a Swahili speech corpus.

swbd: speech recognition, the Switchboard corpus and the Fisher corpus.

tedlium: speech recognition.

thchs30: Chinese speech recognition. THCHS30 is an open Chinese speech database published by the Center for Speech and Language Technology (CSLT) at Tsinghua University. The original recording was conducted in 2002 by Dong Wang, supervised by Prof. Xiaoyan Zhu, at the Key State Lab of Intelligence and System, Department of Computer Science, Tsinghua University, and the original name was 'TCMSD', standing for 'Tsinghua Continuous Mandarin Speech Database'. The publication after 13 years was initiated by Dr. Dong Wang and supported by Prof. Xiaoyan Zhu. We hope to provide a toy database for new researchers in the field of speech recognition. Therefore, the database is totally free to academic users. The database can be downloaded from openslr: http://www.openslr.org/18/ or from the CSLT server: http://data.cslt.org/thchs30/README.html

tidigits: English spoken-digit recognition. The TIDIGITS database consists of men, women, boys and girls reading digit strings of varying lengths; these are sampled at 20 kHz. It's available from the LDC as catalog number LDC93S10.

timit: Available as LDC corpus LDC93S1, TIMIT is one of the original clean speech databases. Description of the catalog from LDC (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1): "The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance."

Note: please do not use this TIMIT setup as a generic example of how to run Kaldi, as TIMIT has a very nonstandard structure. Any of the other setups would be better for this purpose: e.g. librispeech/s5 is quite nice, and is free; yesno is very tiny and fast to run and is also free; and wsj/s5 has an unusually complete set of example scripts, which may however be confusing.

s5: Monophone and triphone GMM/HMM systems trained with maximum likelihood, followed by SGMM and DNN recipes. Training is done on 48 phonemes (see Lee and Hon, "Speaker-Independent Phone Recognition Using Hidden Markov Models", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641-1648, November 1989). In scoring we map to 39 phonemes, as is usually done in conference papers.
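The 48-to-39 mapping mentioned above folds acoustically similar phones together before scoring. A partial, illustrative Python sketch (only a few of the standard Lee and Hon merges are shown; the full table ships with the recipe, in conf/phones.60-48-39.map):

```python
# Illustrative subset of the 48->39 phone folding from Lee and Hon (1989).
# Phones not listed map to themselves.
FOLD_48_TO_39 = {
    "ao": "aa",  # ao folded into aa
    "ax": "ah",  # ax folded into ah
    "ix": "ih",  # ix folded into ih
    "el": "l",   # syllabic l folded into l
    "en": "n",   # syllabic n folded into n
    "zh": "sh",  # zh folded into sh
}

def fold_phones(phones):
    """Map a 48-phone sequence to the 39-phone scoring set (identity otherwise)."""
    return [FOLD_48_TO_39.get(p, p) for p in phones]
```

Scoring then computes phone error rate on the folded sequences rather than the raw 48-phone output.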

tunisian_msa: speech recognition. A Kaldi recipe for Modern Standard Arabic using a Tunisian corpus.

uw3 This directory contains example scripts for optical character recognition (i.e. OCR) on the UW3 dataset (it’s a printed English OCR corpus): http://isis-data.science.uva.nl/events/dlia//datasets/uwash3.html

voxceleb: speaker recognition. This is a Kaldi recipe for speaker verification using the VoxCeleb1 and VoxCeleb2 corpora. See http://www.robots.ox.ac.uk/~vgg/data/voxceleb/ and http://www.robots.ox.ac.uk/~vgg/data/voxceleb2/ for additional details and information on how to obtain them. Note: this recipe requires ffmpeg to be installed and its location included in $PATH. The subdirectories "v1" and so on are different speaker recognition recipes. The recipe in v1 demonstrates a standard approach using a full-covariance GMM-UBM, i-vectors, and a PLDA backend. The example in v2 demonstrates DNN speaker embeddings with a PLDA backend.

voxforge: VoxForge is a site that collects speech; you can get the corpus for free. VoxForge is a free speech corpus and acoustic model repository for open source speech recognition engines. VoxForge was set up to collect transcribed speech to create a free GPL speech corpus for use with open source speech recognition engines. The speech audio files will be 'compiled' into acoustic models for use with open source speech recognition engines such as Julius, ISIP, Sphinx and HTK (note: HTK has distribution restrictions). VoxForge has used LibriVox as a source of audio data since 2007. This demo includes an online speech recognition example. In so-called offline recognition, both the start and the end of an utterance are already available when decoding begins; in online recognition there is only a start, with no end yet.
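The offline/online distinction described above can be sketched as a chunked decoding loop. This is a toy pure-Python illustration of the control flow, not Kaldi's actual online decoding API:

```python
def decode_offline(samples, recognize):
    """Offline: the whole utterance is available before decoding starts."""
    return recognize(samples)

def decode_online(chunks, recognize):
    """Online (streaming): feed fixed-size chunks as they arrive and
    emit a partial hypothesis after each chunk."""
    buffered = []
    partials = []
    for chunk in chunks:  # e.g. chunks arriving from a microphone
        buffered.extend(chunk)
        partials.append(recognize(buffered))  # partial result so far
    return partials
```

In a real online decoder the recognizer keeps internal state between chunks instead of re-decoding the growing buffer, but the interface shape is the same: audio in pieces, partial hypotheses out.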

vystadial_cz: speech recognition, Czech; also has an online recognition demo.

vystadial_en: English speech recognition; also has an online recognition demo. It seems to come from the same paper as the Czech one above, and there also seems to be a dialogue system involved; if you need it, read the README yourself. The data comprise over 41 hours of speech in English. The English recordings were collected from humans interacting via telephone calls with statistical dialogue systems designed to provide the user with information on a suitable dining venue in the town. The data collection process is described in detail in the article "Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license" published for LREC 2014 (to appear). WE USE COMMON KALDI DECODERS IN THE SCRIPTS (gmm-latgen-faster through steps/decode.sh). However, the main purpose of providing the data and scripts is training acoustic models for the real-time speech recognition unit of the dialogue system ALEX, which uses a modified real-time Kaldi OnlineLatgenRecogniser. The modified Kaldi decoders are NOT required for running the scripts!

wsj: speech recognition, the Wall Street Journal corpus. A recommended small example for newcomers starting to learn speech recognition with Kaldi. About the Wall Street Journal corpus: this is a corpus of read sentences from the Wall Street Journal, recorded under clean conditions. The vocabulary is quite large. About 80 hours of training data. Available from the LDC as either [catalog numbers LDC93S6A (WSJ0) and LDC94S13A (WSJ1)] or [catalog numbers LDC93S6B (WSJ0) and LDC94S13B (WSJ1)]. The latter option is cheaper and includes only the Sennheiser microphone data (which is all we use in the example scripts).

yesno: speech recognition; this one has the most material online and is recommended for beginners. The "yesno" corpus is a very small dataset of recordings of one individual saying yes or no multiple times per recording, in Hebrew. It is available from http://www.openslr.org/1. It is mainly included here as an easy way to test out the Kaldi scripts. The test set is perfectly recognized at the monophone stage, so the dataset is not exactly challenging.

yomdle_fa This directory contains example scripts for OCR on the Yomdle and Slam datasets. Training is done on the Yomdle dataset and testing is done on Slam. LM rescoring is also done with extra corpus data obtained from various newswires (e.g. Hamshahri)

yomdle_korean: same as above.

yomdle_russian: same as above.

yomdle_tamil: same as above.

yomdle_zh: same as above; presumably the Chinese one.

zeroth_korean: Korean speech recognition. The Zeroth-Korean Kaldi example is from the Zeroth Project, which introduces a free Korean speech corpus and aims to make Korean speech recognition more broadly accessible to everyone. This project was developed in collaboration between Lucas Jo (Atlas Guide Inc.) and Wonkyum Lee (Gridspace Inc.). In this example, 51.6 hours of transcribed Korean audio are used for training data (22,263 utterances, 105 people, 3000 sentences) and 1.2 hours of transcribed Korean audio for testing data (457 utterances, 10 people). Besides audio and transcription, a pre-trained/designed language model, lexicon and morpheme-based segmenter (morfessor) are provided. The database can also be downloaded from openslr: http://www.openslr.org/40 The database is licensed under Attribution 4.0 International (CC BY 4.0). This folder contains a speech recognition recipe based on the WSJ/Librispeech examples. For more details about the Zeroth project, please visit: https://github.com/goodatlas/zeroth

Finally done. Dan really is a heavyweight; he has worked on so many projects, and this is surely only part of them. Seniority really does come from projects and time.

A quick summary. Besides the big English speech recognition recipes, there are many smaller speech-related projects. For example, speaker recognition: aishell, fame, sitw, sre08, sre10, sre16, voxceleb.

Image recognition / OCR / multilingual handwriting recognition: bentham, cifar, iam, ifnenit, madcat_ar, madcat_zh, rimes, svhn, uw3, yomdle_fa, yomdle_korean, yomdle_russian, yomdle_tamil, yomdle_zh.

Speaker diarization (DeLiang Wang is a big name in this area): callhome_diarization, callhome_egyptian, dihard_2018.

Multilingual speech recognition: csj (Japanese), fame (Frisian), farsdat (Persian), fisher_callhome_spanish (Spanish), formosa (Taiwanese Mandarin), gale_arabic (Arabic), gale_mandarin (Mandarin Chinese), gp (multilingual), heroico (Spanish), hkust (Mandarin telephone speech), hub4_spanish (Spanish), iban (Iban, Malaysia), lre, lre07, madcat_ar, sprakbanken (Danish), sprakbanken-swe (Swedish), swahili (Swahili), thchs30 (Chinese), tunisian_msa (Tunisian Arabic), vystadial_cz (Czech), zeroth_korean (Korean).

Speech recognition with reverberation: reverb.

Online speech recognition: voxforge, vystadial_cz (Czech), vystadial_en.

Some of the competitions involved: CHiME, Speech Processing in Everyday Environments (CHiME 2018), 7 September, Microsoft, Hyderabad (a satellite event of Interspeech 2018) http://spandh.dcs.shef.ac.uk/chime_workshop/chime2018/

DIHARD The Second DIHARD Speech Diarization Challenge https://coml.lscp.ens.fr/dihard/index.html

LRE: NIST 2017 Language Recognition Evaluation (a language identification challenge) https://www.nist.gov/itl/iad/mig/nist-2017-language-recognition-evaluation Tsinghua in China also runs a language recognition challenge.

REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge https://reverb2014.dereverberation.com/

SRE: NIST 2019 Speaker Recognition Evaluation https://www.nist.gov/itl/iad/mig/nist-2019-speaker-recognition-evaluation Results and papers from the NIST speaker recognition evaluations are generally published at the following year's Interspeech. In China, AISHELL also runs speaker recognition challenges.

SITW The Speakers in the Wild (SITW) Speaker Recognition Challenge http://www.speech.sri.com/projects/sitw/

Abbreviations: LVCSR = Large Vocabulary Continuous Speech Recognition.

------------ Added 2019-10-28 ------------------------------------

The following is from another source; use it as additional reference.

1. babel: the IARPA Babel program. The corpora come from the Babel project, mainly speech recognition and keyword search examples for low-resource languages, including Pashto, Persian, Turkish, Vietnamese, etc. According to the literature the results are not great; WERs exceed 50.

2. sre08: Speaker Recognition Evaluation; speaker recognition.

3. aurora4: homepage http://aurora.hsnr.de/ Studies various kinds of noise; noisy-speech, i.e. robust, speech recognition, covering speaker separation, music separation and noise separation.

4. hkust: Mandarin speech recognition from the Hong Kong University of Science and Technology.

5. callhome_egyptian: Egyptian Arabic speech recognition.

6. chime_wsj0: data from the CHiME challenge, which targets recognition of telephone, meeting and distant-microphone data.

7. fisher_english: English two-channel telephone speech.

8. gale_arabic: Arabic data from the GALE (Global Autonomous Language Exploitation) program.

9. gp: the GlobalPhone project; speech in 19 different languages, 15-20 hours per language.

10. lre: covers speaker recognition and language identification.

11. wsj: the Wall Street Journal corpus; it seems all the scripts started from this one.

12. swbd: the Switchboard corpus.

13. tidigits: recognition training on digit strings spoken by men, women and children.

14. voxforge: an open-source speech collection project.

15. timit: American English pronunciations and lexical annotations across genders and accents. The name combines Texas Instruments (TI) and the Massachusetts Institute of Technology (MIT), hence TIMIT.

16. tedlium: data at http://www.openslr.org/resources/7/TEDLIUM_release1.tar.gz English speech data from TED talks, created by the Laboratoire d'Informatique de l'Université du Maine (LIUM).

17. vystadial_cz: a dataset of telephone conversations in Czech.

18. vystadial_en: a dataset of telephone conversations in English.

19. yesno: recognition of the two words "yes" and "no"; call it command-word recognition.

20. rm: the DARPA Resource Management Continuous Speech Corpora.

\"\\\"\\\"\" \"\\\"\\\"\" \"\\\"\\\"\" 点赞 2\"\\\"\\\"\" \"\\\"\\\"\" 评论