Resources

Links to speech translation (ST) corpora, toolkits, and communities.

Data

| Dataset | Paper | Languages and Duration | Domain |
| --- | --- | --- | --- |
| Fisher–CALLHOME | Post et al. 2013 | Es→En, 160 hrs | phone conversations |
| STC | Shimizu et al. 2014 | En↔Jp, 22 hrs | simult. interpret. |
| How2 | Sanabria et al. 2018 | En→Pt, 300 hrs | instructional videos |
| IWSLT 2018 | Niehues et al. 2018 | En→De, 273 hrs | TED talks |
| LIBRI-TRANS | Kocabiyikoglu et al. 2018 | En→Fr, 236 hrs | read audiobooks |
| MuST-C | Cattoni et al. 2021 | En→14 lang. (237-504 hrs) | TED talks |
| CoVoST | Wang et al. 2020 | En→15 lang. (929 hrs), 21 lang.→En (30-311 hrs) | read, Common Voice |
| Europarl-ST | Iranzo-Sanchez et al. 2020 | 9 lang. (72 dir., 10-90 hrs) | EP proceedings |
| LibriVoxDeEn | Beilharz et al. 2020 | De→En, 100 hrs | read audiobooks |
| MaSS | Boito et al. 2020 | 8 lang. (56 dir.), 20 hrs | Bible readings |
| BSTC | Baidu, 2020 | Zh→En, 50 hrs | simult. interpret. |
| Multilingual TEDx | Salesky et al. 2021 | 8 lang.→6 lang., 11-69 hrs | TED talks |

Tools

Communities