Natural Language and Text Processing Laboratory
University of Tehran

TEP: Tehran English-Persian Parallel Corpus
> First free English-Persian corpus
> 4 million tokens on each side
> Sentence Aligned


To obtain a copy of this corpus, contact us at: t.pilevar {at} ece.ut.ac.ir
  • Extracted from movie subtitles

  • Text domain: informal/conversational

  • Total aligned movie subtitles: 1600

  • Total number of bilingual sentences: 612086

  • Average sentence length: 7.8 words

  • Corpus size (ignoring punctuation): About four million words

  • Unique words on Persian side: 114275 (17605 with freq. > 10)

  • Unique words on English side: 73002 (12716 with freq. > 10)
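
Figures like those listed above (sentence count, average sentence length, vocabulary size, and vocabulary with frequency above 10) can be recomputed for one side of a sentence-aligned corpus with a short script. The following is only a minimal sketch in Python. It assumes whitespace-tokenized text with one sentence per line and skips tokens made up entirely of punctuation; the function name and the file name in the usage note are hypothetical, and the actual distribution format of TEP is not specified here, so the numbers it produces may differ from the published ones.

```python
import unicodedata
from collections import Counter


def is_punctuation(token):
    # True when every character of the token is in a Unicode punctuation category.
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def corpus_stats(sentences, min_freq=10):
    """Compute simple statistics for one side of a parallel corpus.

    `sentences` is any iterable of strings, one sentence per item.
    Tokens are split on whitespace (an assumption, not the authors' method).
    """
    counts = Counter()
    n_sentences = 0
    for sentence in sentences:
        tokens = [t for t in sentence.split() if not is_punctuation(t)]
        counts.update(tokens)
        n_sentences += 1

    n_tokens = sum(counts.values())
    return {
        "sentences": n_sentences,
        "tokens": n_tokens,
        "avg_sentence_length": n_tokens / n_sentences if n_sentences else 0.0,
        "unique_words": len(counts),
        # Words whose frequency is strictly greater than min_freq.
        "frequent_words": sum(1 for c in counts.values() if c > min_freq),
    }
```

To apply it to a file, pass an open file handle, for example `corpus_stats(open("tep.en.txt", encoding="utf-8"))`, where `tep.en.txt` is a placeholder name for the English side. Running it on both sides and comparing the sentence counts is also a quick check that the two files are still aligned line by line.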

Use of this package for research or other non-commercial purposes requires that you cite the following paper:

M. T. Pilevar, H. Faili, and A. H. Pilevar, “TEP: Tehran English-Persian Parallel Corpus”, in Proceedings of the 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2011).




TMC: Tehran Monolingual Corpus
> Largest freely available monolingual corpus for the Persian language
> Tokenized
> Suitable for Language Modeling
> More than 250M words in total, ~300K unique words of freq. > 1


To obtain a copy of this corpus, contact us at: t.pilevar {at} ece.ut.ac.ir or nlp {at} ece.ut.ac.ir