
 

Tehran NLP Lab Resources

 

Note: Use of these resources for any research or non-commercial purpose requires citing the papers mentioned below.

 

 

  • TEP: Tehran English-Persian Parallel Corpus

 

  • The first freely available English-Persian parallel corpus
  • About four million tokens on each side
  • Sentence-aligned (a minimal loading sketch follows this list)
  • Extracted from movie subtitles
  • Text domain: informal/conversational
  • Total aligned movie subtitles: 1,600
  • Total number of bilingual sentences: 612,086
  • Average sentence length: 7.8 words
  • Corpus size (ignoring punctuation): about four million words
  • Unique words on the Persian side: 114,275 (17,605 with frequency > 10)
  • Unique words on the English side: 73,002 (12,716 with frequency > 10)
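
A minimal loading sketch, assuming the corpus is distributed as two line-aligned plain-text files (one English, one Persian); the file names tep.en and tep.fa are placeholders, and the actual release may use a different layout:

    # Sketch only: assumes two line-aligned plain-text files; adjust the paths
    # and layout to the actual release.
    def read_parallel(english_path="tep.en", persian_path="tep.fa"):
        """Yield (English sentence, Persian sentence) pairs, one per aligned line."""
        with open(english_path, encoding="utf-8") as en_file, \
             open(persian_path, encoding="utf-8") as fa_file:
            for en_line, fa_line in zip(en_file, fa_file):
                yield en_line.strip(), fa_line.strip()

    # Example: count the sentence pairs (should be on the order of 612,086).
    print(sum(1 for _ in read_parallel()))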

 

 

Download

 

Please refer to:

M. T. Pilevar, H. Faili, and A. H. Pilevar, “TEP: Tehran English-Persian Parallel Corpus”, in Proceedings of the 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2011).

 

 

  • TMC: Tehran Monolingual Corpus

 

 

Download

 

 

 

  • Mutual Information

We calculated pair-wise mutual information between English words using 2 gigabytes of Wikipedia documents. The package contains two files, ‘WordID.txt’ and ‘MI.txt’. For each English word, ‘MI.txt’ lists the English words that have the highest mutual information with it. Each line of this file has the format (English word ID: English word ID, MI; English word ID, MI; …), where the English word IDs come from ‘WordID.txt’. Pair-wise mutual information between Persian words has also been calculated, using the Hamshahri and IRNA text corpora, and follows the same format as the English data. A minimal parsing sketch is given below.
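
A minimal parsing sketch, assuming ‘:’ separates the head word ID from its neighbour list, ‘;’ separates neighbours, ‘,’ separates an ID from its MI value, and ‘WordID.txt’ has one “ID word” pair per line; these delimiter details are an interpretation of the format description above, so adjust them to the actual files:

    # Sketch only: the exact whitespace and column order are assumptions.
    def load_word_ids(path="WordID.txt"):
        """Return a {word_id: word} dictionary read from WordID.txt."""
        id_to_word = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 2:
                    id_to_word[parts[0]] = parts[1]
        return id_to_word

    def load_mi(path="MI.txt"):
        """Return {head_id: [(neighbour_id, mi), ...]} parsed from MI.txt."""
        mi = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if ":" not in line:
                    continue
                head, rest = line.split(":", 1)
                pairs = []
                for entry in rest.split(";"):
                    if "," not in entry:
                        continue
                    word_id, value = entry.split(",", 1)
                    pairs.append((word_id.strip(), float(value)))
                mi[head.strip()] = pairs
        return mi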

 

Download

 

  • PSD

This is a mapping set used for preposition sense disambiguation (PSD) in Persian. TPP (The Preposition Project) is used as the sense inventory for the 34 most common English prepositions, and we provide a mapping from each English preposition sense to its Persian translation. An illustrative lookup sketch follows below.
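
The distributed file format is not described here, so the sketch below only illustrates one way such a mapping could be represented and queried; the sense identifier and translation in the example entry are hypothetical placeholders, not entries from the actual resource:

    # Illustrative only: (preposition, TPP sense ID) -> Persian translations.
    from typing import Dict, List, Tuple

    PSDMapping = Dict[Tuple[str, str], List[str]]

    def persian_translations(mapping: PSDMapping, preposition: str, sense_id: str) -> List[str]:
        """Look up the Persian translations recorded for one preposition sense."""
        return mapping.get((preposition, sense_id), [])

    example: PSDMapping = {("in", "1(1)"): ["در"]}  # hypothetical placeholder entry
    print(persian_translations(example, "in", "1(1)"))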

 

Download

 

 

 

  • Grammar and context sensitive spell checker

Here is a real-world test set of grammatical errors and context-sensitive spelling errors for the Persian language. The test set contains 1,100 context-sensitive errors and was collected from Persian blogs.

 

Download: Test set for grammatical errors and context-sensitive spelling errors of the Persian language

Download: Test set for context-sensitive spelling errors of the Persian language

 

Please refer to:

B. Mirzababaei, H. Faili, and N. Ehsan, "Discourse-aware Statistical Machine Translation as a Context-Sensitive Spell Checker", in Proceedings of Recent Advances in Natural Language Processing (RANLP), pp. 475–482, 2013.

 

 

  • Spell checker

Test set of spelling errors for the Persian language.

Download

 

Please refer to:

H. Faili, N. Ehsan, M. Montazery, and M. T. Pilehvar, “Vafa Spell-Checker for Detecting Spelling, Grammatical and Real-word Errors of Persian Language”, Digital Scholarship in the Humanities 31(1), 95–117, 2016.

 

  • Two datasets for predicting the popularity of online content in online news agencies

Here are the Tabnak and Alef datasets, crawled from two of the most popular online news agencies in Iran. Each dataset includes the content, title, date, category, and number of comments for every news item. These websites were chosen because of their popularity, the wide range of news categories they cover, and their multilevel commenting structure. A sketch of one record is shown below.
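
A minimal sketch of one record, using the fields listed above (content, title, date, category, number of comments); the class and field names are illustrative, and the actual schema of the distributed files may differ:

    # Sketch only: field names follow the description above, not the actual schema.
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class NewsItem:
        title: str
        content: str
        published: date
        category: str
        comment_count: int  # the popularity signal to be predicted

    # Popularity prediction: estimate comment_count from the other fields.
    item = NewsItem("sample title", "sample body", date(2013, 1, 1), "politics", 42)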

 

 

Download

 

Please refer to:

A. Balali, A. Rajabi, S. Ghasemi, M. Asadpour, and H. Faili, “Content Diffusion Prediction in Social Networks”, 5th International Conference on Information and Knowledge Technology, Iran, 2013 (accepted but not yet published).

 

 

 

  • Five datasets for predicting the hierarchical structure of conversation threads

Here are five datasets, crawled from five websites: Thestandard, Alef, ENENews, Russianblog, and Courantblogs (XML format). They were selected for several reasons (a parsing sketch follows the list):

  1. Each news item has a section where users can write comments or reply to other comments;
  2. They support a multilevel reply structure, allowing deeply nested replies;
  3. Users are very active, and news items usually receive many comments;
  4. Comments carry author, content, and posting-time information.
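
A parsing sketch, assuming one nested <comment> element per reply with author and time attributes; these element and attribute names are assumptions for illustration, and the tag names in the distributed XML files may differ:

    # Sketch only: tag and attribute names are assumptions.
    import xml.etree.ElementTree as ET

    def print_thread(comment, depth=0):
        """Recursively print a comment and its replies, indenting by reply depth."""
        print("  " * depth + f"{comment.get('author')} @ {comment.get('time')}")
        for reply in comment.findall("comment"):
            print_thread(reply, depth + 1)

    sample = ET.fromstring(
        "<comment author='a' time='t1'><comment author='b' time='t2'/></comment>"
    )
    print_thread(sample)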

 

Download

 

 

 

 

  • Hand-aligned Parallel Corpus for Machine Translation Systems

Here is a hand-aligned parallel corpus that can be used for training and evaluating machine translation systems. A sketch of reading word alignments is given below.
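
The alignment file format is not specified above; the sketch below assumes the common Pharaoh/Moses convention of space-separated "sourceIndex-targetIndex" pairs, one line per sentence pair, which may differ from the actual release:

    # Sketch only: assumes one line of "i-j" word-alignment pairs per sentence pair.
    def read_alignments(path):
        """Yield one list of (source_index, target_index) pairs per sentence pair."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                pairs = []
                for token in line.split():
                    src, tgt = token.split("-")
                    pairs.append((int(src), int(tgt)))
                yield pairs

    # Example (hypothetical file name): count links in the first sentence pair.
    # first = next(read_alignments("alignments.txt")); print(len(first))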

 

 

Download

 

 

 

  • XTAG Treebank

A hybrid supertagging method was applied to a subset of the Wall Street Journal (WSJ) corpus in order to annotate it with the linguistically motivated elementary structures of the English XTAG grammar. The treebank is released at three annotation-accuracy levels.

 

Download: XTAG Treebank with accuracy 0.7

Download: XTAG Treebank with accuracy 0.6

Download: XTAG Treebank with accuracy 0.5

 

 

Please refer to:

Zarei, F., Basirat, A., Faili, H., & Mirain, M. (2015). A bootstrapping method for development of Treebank. Journal of Experimental & Theoretical Artificial Intelligence (ahead-of-print), 1–24.

 

 

 

  • Parallel Gold Data from Wikipedia

This dataset contains parallel sentences tagged in 33 Wikipedia pages.

 

Download

 

For the reference, please see the "readme" file.

 

  • HPSG Treebank

    • Here is a dataset containing about 27,000 sentences parsed in the HPSG format.
      • Download
      • For the password, please contact: hfaili {at} ut.ac.ir
    • For the online parser, click here.
       
  • Automatic Construction of WordNet Using Graph-based WSD

 

We developed an automatic method for constructing a wordnet for low-resource languages. The only required resources are a dictionary from the target language to English and a monolingual corpus in the target language. A simplified sketch of the general idea is given below.
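
A highly simplified sketch of the general idea only, not the algorithm from the paper: each target-language word is mapped to candidate WordNet synsets through its English translations, and synsets supported by several translations are preferred. The use of NLTK's WordNet and the dictionary format are assumptions:

    # Simplified illustration only; the actual method additionally uses a
    # monolingual corpus of the target language and graph-based WSD.
    from collections import Counter
    from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

    def candidate_synsets(english_translations):
        """Score candidate synsets by how many distinct translations point to them."""
        votes = Counter()
        for word in english_translations:
            for synset in set(wn.synsets(word)):
                votes[synset] += 1
        return votes.most_common()

    # Hypothetical dictionary entry: a target-language word translated as "house"/"home".
    print(candidate_synsets(["house", "home"])[:3])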

 

If you use this code, please cite the following paper:

N. Taghizadeh and H. Faili, "Automatic Wordnet Development for Low-resource Languages using Cross-lingual WSD", Journal of Artificial Intelligence Research, Vol. 56.

Download source code