Resources
Links to corpora, toolkits, and communities working on ST.
Data
| Dataset | Paper | Languages and Duration | Domain |
|---|---|---|---|
| Fisher–CALLHOME | (Post et al. 2013) | Es→En 160hrs | phone conversations |
| STC | (Shimizu et al. 2014) | En↔Ja 22hrs | simult. interpret. |
| How2 | (Sanabria et al. 2018) | En→Pt 300hrs | instructional videos |
| IWSLT 2018 | (Niehues et al. 2018) | En→De 273hrs | TED talks |
| LIBRI-TRANS | (Kocabiyikoglu et al. 2018) | En→Fr 236hrs | read audiobooks |
| MuST-C | (Cattoni et al. 2021) | En→14 lang. (237-504hrs) | TED talks |
| CoVoST | (Wang et al. 2020) | En→15 lang. (929hrs), 21 lang.→En (30-311hrs) | read, Common Voice |
| Europarl-ST | (Iranzo-Sanchez et al. 2020) | 9 lang. (72 dir., 10-90hrs) | EP proceedings |
| LibriVoxDeEn | (Beilharz et al. 2020) | De→En 100hrs | read audiobooks |
| MaSS | (Boito et al. 2020) | 8 lang. (56 dir.) 20hrs | Bible readings |
| BSTC | (Baidu, 2020) | Zh→En 50hrs | simult. interpret. |
| Multilingual TEDx | (Salesky et al. 2021) | 8 lang.→6 lang. 11-69hrs | TED talks |