Resources
Links to corpora, toolkits, and communities working on speech translation (ST).
Data
Dataset | Paper | Languages and Duration | Domain |
---|---|---|---|
Fisher–CALLHOME | (Post et al. 2013) | Es→En 160hrs | phone conversations |
STC | (Shimizu et al. 2014) | En↔Jp 22hrs | simult. interpret. |
How2 | (Sanabria et al. 2018) | En→Pt 300hrs | instructional videos |
IWSLT 2018 | (Niehues et al. 2018) | En→De 273hrs | TED talks |
LIBRI-TRANS | (Kocabiyikoglu et al. 2018) | En→Fr 236hrs | read audiobooks |
MuST-C | (Cattoni et al. 2021) | En→14 lang. (237-504hrs) | TED talks |
CoVoST | (Wang et al. 2020) | En→15 lang. (929hrs), 21 lang.→En (30-311hrs) | read, Common Voice |
Europarl-ST | (Iranzo-Sanchez et al. 2020) | 9 lang. (72 dir., 10-90hrs) | EP proceedings |
LibriVoxDeEn | (Beilharz et al. 2020) | De→En 100hrs | read audiobooks |
MaSS | (Boito et al. 2020) | 8 lang. (56 dir.) 20hrs | Bible readings |
BSTC | (Baidu, 2020) | Zh→En 50hrs | simult. interpret. |
Multilingual TEDx | (Salesky et al. 2021) | 8 lang.→6 lang. 11-69hrs | TED talks |