TEP: Tehran English-Persian Parallel Corpus
> First free Eng-Per corpus
> 4 million tokens on each side
> Sentence Aligned

To obtain a copy of this corpus, contact us at: t.pilevar {at} ece.ut.ac.ir
- Extracted from movie subtitles
- Text domain: informal/conversational
- Total aligned movie subtitles: 1600
- Total number of bilingual sentences: 612086
- Average sentence length: 7.8 words
- Corpus size (ignoring punctuation): about four million words
- Unique words on Persian side: 114275 (17605 with freq. > 10)
- Unique words on English side: 73002 (12716 with freq. > 10)
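
Statistics like those listed above (sentence count, average sentence length, vocabulary size, and vocabulary above a frequency threshold) can be recomputed for one side of a sentence-aligned corpus. The following is only a minimal sketch: the function name `corpus_stats`, the whitespace tokenization, and the toy sentences are assumptions made for illustration and are not part of the TEP distribution, whose file layout and tokenization are not described here.

```python
from collections import Counter


def corpus_stats(sentences, min_freq=10):
    """Summarize one side of a sentence-aligned corpus.

    `sentences` is a list of strings, one sentence per entry.
    Tokens are split on whitespace, which is an assumption of this
    sketch and may differ from how the published figures were counted.
    """
    counts = Counter()
    total_tokens = 0
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(tokens)
        total_tokens += len(tokens)

    n = len(sentences)
    return {
        "sentences": n,
        "tokens": total_tokens,
        "avg_len": total_tokens / n if n else 0.0,
        "unique": len(counts),
        # Words occurring strictly more than min_freq times.
        "unique_above_min_freq": sum(1 for c in counts.values() if c > min_freq),
    }


# Toy data standing in for the English side (hypothetical, not from TEP).
english = ["how are you", "i am fine", "how is he"]
stats = corpus_stats(english, min_freq=1)
print(stats)
```

With the real corpus, the same function would be called once per language side, with `min_freq=10` to mirror the "freq. > 10" counts.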
Use of this package for research or other non-commercial purposes is conditional on citing the following paper:
M. T. Pilevar, H. Faili, and A. H. Pilevar, “TEP: Tehran English-Persian Parallel Corpus”, in Proceedings of the 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2011).

TMC: Tehran Monolingual Corpus
> Largest freely available monolingual corpus for the Persian language
> Tokenized
> Suitable for Language Modeling
> More than 250M words in total, ~300K unique words of freq. > 1

To obtain a copy of this corpus, contact us at: t.pilevar {at} ece.ut.ac.ir or nlp {at} ece.ut.ac.ir