Gated Residual DFSMN Acoustic Models for Large Vocabulary Speech Recognition

doi:10.19784/j.cnki.issn1672-0172.2022.05.002

Abstract

The deep feedforward sequential memory network (DFSMN) is an acoustic model with high recognition accuracy. It alleviates the vanishing gradient problem by introducing skip connections between the memory blocks of adjacent layers. Training a deeply stacked DFSMN nevertheless remains challenging, and simply stacking more layers does not improve model performance. Residual learning is an effective way to help very deep neural networks converge more easily and quickly. This paper proposes a new network architecture, the gated residual DFSMN (GR-DFSMN), which adds gate-controlled shortcut paths from lower DFSMN blocks so that networks with very deep DFSMN structures can be trained effectively. Experimental results show that GR-DFSMN outperforms the original DFSMN when training very deep models. On the 1000-hour English Librispeech task, with the network depth at 40 layers, the GR-DFSMN Mono-Phone CTC model reduces the average word error rate over four test sets by 0.7% absolute compared with the original DFSMN Mono-Phone CTC model.

Keywords: speech recognition, DFSMN, gated residual, CTC
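The abstract describes the gated shortcut only in words, so the following is a minimal sketch of the general idea rather than the authors' formulation. It assumes an FSMN-style memory block (each frame plus tap-weighted past and future frames) and a sigmoid gate that scales a shortcut from a lower block before adding it to the current block's output. The function names, the element-wise gate form, and the tap layout are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def memory_block(p, left_taps, right_taps):
    """FSMN-style memory (assumed form): frame t plus weighted neighbours.

    p is a list of frames, each a list of floats. left_taps[i] weights
    frame t-(i+1) and right_taps[j] weights frame t+(j+1), per dimension.
    Neighbours that fall outside the utterance are skipped.
    """
    T, D = len(p), len(p[0])
    out = []
    for t in range(T):
        m = list(p[t])
        for i, a in enumerate(left_taps):
            s = t - (i + 1)
            if s >= 0:
                for d in range(D):
                    m[d] += a[d] * p[s][d]
        for j, c in enumerate(right_taps):
            s = t + (j + 1)
            if s < T:
                for d in range(D):
                    m[d] += c[d] * p[s][d]
        out.append(m)
    return out


def gated_residual(block_out, lower, gate_w, gate_b):
    """Gated shortcut (hypothetical form): out = block_out + g * lower,
    where g = sigmoid(gate_w * lower + gate_b) is computed per dimension."""
    return [
        [y + sigmoid(w * x + b) * x
         for y, x, w, b in zip(y_t, x_t, gate_w, gate_b)]
        for y_t, x_t in zip(block_out, lower)
    ]


# Three one-dimensional frames, one look-back tap and one look-ahead tap.
frames = [[1.0], [2.0], [3.0]]
memory = memory_block(frames, left_taps=[[0.5]], right_taps=[[0.25]])
print(memory)   # [[1.5], [3.25], [4.0]]

# With zero gate weight and bias the gate is 0.5 for every frame.
print(gated_residual(memory, frames, gate_w=[0.0], gate_b=[0.0]))
# [[2.0], [4.25], [5.5]]
```

A strongly negative gate bias drives the gate toward zero, so the block falls back to the ungated output; this is one reason a learned gate can be gentler on optimization than a fixed identity shortcut.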

CLC number: TN912.34

HUO Weiming, XU Hao. Gated residual DFSMN acoustic models for large vocabulary speech recognition[J]. Journal of Appliance Science & Technology, 2022(5): 22-25.
