nlp
Tuesday, 31 Farvardin 1400

Tehran NLP Lab Resources

Note: Use of these resources for research or any other non-commercial purpose requires that you cite the papers listed with each resource.

 

TEP: Tehran English-Persian Parallel Corpus

  • First free English-Persian corpus

  • 4 million tokens on each side

  • Sentence Aligned

  • Extracted from movie subtitles

  • Text domain: informal/conversational

  • Total aligned movie subtitles: 1600

  • Total number of bilingual sentences: 612086

  • Average sentence length: 7.8 words

  • Corpus size (ignoring punctuation): about four million words

  • Unique words on Persian side: 114275 (17605 with freq. > 10)

  • Unique words on English side: 73002 (12716 with freq. > 10)

Download

Please refer to:

M. T. Pilevar, H. Faili, and A. H. Pilevar, "TEP: Tehran English-Persian Parallel Corpus", in Proceedings of the 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2011).


 

Mutual Information

We calculated pairwise mutual information between English words using 2 gigabytes of Wikipedia documents. This package contains two files, 'WordID.txt' and 'MI.txt'. For each English word, 'MI.txt' lists the English words that have the highest mutual information with it. Each line of this file has the format (English word Id:English word Id, mi;English word Id, mi; …), where the English word Ids come from 'WordID.txt'. Pairwise mutual information between Persian words was also calculated, using the Hamshahri and IRNA text corpora, and is provided in the same format as the English data.
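The line format described above can be read with a few lines of Python. This is only a sketch: the function name `parse_mi_line` is hypothetical, and it assumes that the enclosing parentheses are optional, that whitespace around separators is insignificant, and that word Ids can be kept as strings. Check these assumptions against the actual 'MI.txt' file before relying on it.

```python
from typing import List, Tuple


def parse_mi_line(line: str) -> Tuple[str, List[Tuple[str, float]]]:
    """Parse one 'head:id, mi;id, mi;...' record into (head_id, [(id, mi), ...])."""
    record = line.strip()
    # Assumption: the parentheses shown in the format description may or may not
    # be present in the file, so strip them only if both are there.
    if record.startswith("(") and record.endswith(")"):
        record = record[1:-1]
    head, _, rest = record.partition(":")
    neighbours = []
    for entry in rest.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece left by a trailing ';'
        word_id, _, score = entry.partition(",")
        neighbours.append((word_id.strip(), float(score)))
    return head.strip(), neighbours
```

The Ids returned here would still need to be mapped back to words through 'WordID.txt', whose layout is not described on this page.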

Download

 

Transliteration

This parallel corpus was extracted from a Persian novel that is written in both Arabic script and Dabire. We manually aligned the sentences of the book using the first and last words of each sentence. To detect and eliminate errors, we then compared the lengths of the sentences on both sides. The resulting parallel corpus contains 13,933 sentence pairs, with 155,623 words in the Persian text and 170,702 Dabire words.
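A length comparison of this kind could look like the sketch below. The function name `flag_suspect_pairs` and the `max_ratio` threshold are illustrative assumptions, not the procedure or the values actually used to build the corpus.

```python
from typing import List, Tuple


def flag_suspect_pairs(pairs: List[Tuple[str, str]], max_ratio: float = 1.5) -> List[int]:
    """Return indices of sentence pairs whose word counts differ by more than max_ratio."""
    suspects = []
    for i, (persian, dabire) in enumerate(pairs):
        n_fa, n_db = len(persian.split()), len(dabire.split())
        shorter, longer = min(n_fa, n_db), max(n_fa, n_db)
        # An empty side, or a large length ratio, suggests a misalignment.
        if shorter == 0 or longer / shorter > max_ratio:
            suspects.append(i)
    return suspects
```

Going by the totals above, the Dabire side has roughly 10% more words than the Persian side overall (170,702 versus 155,623), so a threshold well above 1.1 is needed to avoid flagging correctly aligned pairs.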

Download

 

PSD

This is a mapping set used for preposition sense disambiguation (PSD) in Persian. The Preposition Project (TPP) is used as the sense inventory for the 34 most common English prepositions. We provide a mapping to a Persian translation for each English preposition sense.

Download

 

Persian WordNet

This Persian WordNet contains more than 68,000 links from Persian words to English synsets. Each word-synset link also carries a value giving the probability of assigning that word to that synset. This probabilistic version of Persian WordNet is useful for conceptual text processing applications, especially cross-language tasks.

Download

 

Wordnet Construction using Supervised Learning

This wordnet was produced by the Natural Language Processing Lab of the University of Tehran.
It was constructed with a supervised method that exploits the pre-existing Persian wordnet, FarsNet. The wordnet consists of more than 38,000 links from Persian words to Princeton WordNet synsets, with a precision of 91.18%. A similar approach was applied to extend this wordnet with more comprehensive and accurate verbal entries. The second version includes more than 25,000 words, 26,000 PWN synsets and 56,000 word-sense pairs.
Because it is linked to Princeton WordNet, it can be used efficiently in multilingual semantic tasks.
Each line of the wordnet contains a link from a Persian word to a Princeton WordNet synset, which is identified by its part-of-speech tag and synset offset.
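As a rough illustration of reading such links, the sketch below parses one line into a word, a part-of-speech tag and a synset offset. The field order, the tab separator, and the names `WordSynsetLink`, `parse_link` and `synset_key` are all assumptions made for this example; the actual file layout should be checked in the download. The offset used in the usage note is made up.

```python
from typing import NamedTuple


class WordSynsetLink(NamedTuple):
    word: str    # Persian word
    pos: str     # part-of-speech tag of the synset, e.g. "n" or "v"
    offset: int  # Princeton WordNet synset offset


def parse_link(line: str, sep: str = "\t") -> WordSynsetLink:
    """Parse an assumed 'word<sep>pos<sep>offset' line into a WordSynsetLink."""
    word, pos, offset = line.rstrip("\n").split(sep)
    return WordSynsetLink(word, pos, int(offset))


def synset_key(link: WordSynsetLink) -> str:
    """Build a compact 'offset-pos' key, zero-padding the offset to 8 digits."""
    return f"{link.offset:08d}-{link.pos}"
```

For example, `synset_key(parse_link("کتاب\tn\t123456"))` yields `"00123456-n"`. Keying on both the tag and the offset matters because an offset alone does not say which part-of-speech file it refers to.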

Download

 

Grammar and context-sensitive spell checker

This is a real-world test set of grammatical errors and context-sensitive spelling errors for the Persian language. It contains 1,100 context-sensitive errors and was collected from Persian blogs.

Download: Test set for grammatical errors and context sensitive spelling errors of Persian language

Download: Test set for context sensitive spelling errors of Persian language

Please refer to:

B. Mirzababaei, H. Faili and N. Ehsan, "Discourse-aware Statistical Machine Translation as a Context-Sensitive Spell Checker", in Proceedings of Recent Advances in Natural Language Processing, pp. 475-482, 2013.

 

 Spell checker

A test set of spelling errors for the Persian language.

Download

Please refer to:

H. Faili, N. Ehsan, M. Montazery, M. T. Pilehvar, "Vafa Spell-Checker for Detecting Spelling, Grammatical and Real-word Errors of Persian Language", Digital Scholarship in the Humanities 31 (1), 95-117, 2016.

 

 

Two datasets for predicting the popularity of online content in online news agencies

These are the Tabnak and Alef datasets, drawn from two of the most famous online news agencies in Iran. Each dataset includes the content, title, date, category and number of comments for each news item. Besides their popularity, these websites were chosen for the wide range of news categories they cover and for their multilevel commenting structure.

Download

Please refer to:

A. Balali, A. Rajabi, S. Ghasemi, M. Asadpour, H. Faili, "Content Diffusion Prediction in Social Networks", 5th International Conference on Information and Knowledge Technology, Iran, 2013 (accepted but not yet published).

 

Five Datasets to predict the hierarchical structure of conversation threads

These five datasets (in XML format) were crawled from five websites: Thestandard, Alef, ENENews, Russianblog and Courantblogs. They were selected for several reasons:

1.    Each has a section where users can comment on a news item or reply to other comments;

2.    They support a multilevel reply structure that allows more reply levels;

3.    Users are very active, and news items usually have many comments;

4.    Comments include author, content and posting time information.
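
To make the idea of a hierarchical thread structure concrete, the sketch below rebuilds a reply tree from a flat list of comments and measures how many reply levels it has. The dictionary keys `id`, `parent` and `time`, and both function names, are hypothetical; they do not reflect the XML schema of the datasets, which is not described on this page.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_reply_tree(comments: List[dict]) -> Dict[Optional[str], List[str]]:
    """Map each parent id (None for the news item itself) to its replies, oldest first."""
    children = defaultdict(list)
    for comment in sorted(comments, key=lambda c: c["time"]):
        children[comment.get("parent")].append(comment["id"])
    return dict(children)


def thread_depth(children: Dict[Optional[str], List[str]], root: Optional[str] = None) -> int:
    """Number of reply levels below root (0 if it has no replies)."""
    return max((1 + thread_depth(children, kid) for kid in children.get(root, [])), default=0)
```

In the prediction task, the `parent` field is the unknown: a model would have to infer it from the author, content and posting time of each comment.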

Download

 

Hand-aligned Parallel Corpus for Machine Translation Systems

This is a hand-aligned parallel corpus that can be used in machine translation systems:

  • All four Persian references are concatenated on the Persian side, and the English side of the corpus is repeated four times.
  • It was built using both PCTS (English-Persian, Persian-English).
  • To extract the hand-aligned data, we started from the alignments that Giza++ produced for each sentence pair.
  • All of the Giza++ alignments were edited by two people, and all alignments were then examined for final confirmation.
  • In addition to the word alignments in each sentence, phrase alignments were extracted (grow-diag-final-and) and placed in the 'Phrases' folder.
  • This dataset is applicable to many areas of natural language processing.
  • Number of words on English-side: 19,763
  • Number of words on Persian-side
  • Number of aligned words: 24,200
  • Number of aligned phrases: 43,671

Download

Please refer to:

Tavakoli, L., and Faili, H. (2014). Phrase Alignments in Parallel Corpus Using Bootstrapping Approach. The International Journal of Information and Communication Technology Research, 6(3), 63-76.

 

XTAG Treebank

A hybrid supertagging method is applied to a subset of the Wall Street Journal (WSJ) corpus to annotate it with linguistically motivated elementary structures of the English XTAG grammar. The treebank is provided at three levels of annotation accuracy.

Download: XTAG Treebank with accuracy 0.7

Download: XTAG Treebank with accuracy 0.6

Download: XTAG Treebank with accuracy 0.5

Please refer to:

Zarei, F., Basirat, A., Faili, H., & Mirain, M. (2015). A bootstrapping method for development of Treebank. Journal of Experimental & Theoretical Artificial Intelligence, (ahead-of-print), 1-24.

 

Parallel Gold Data from Wikipedia

This dataset contains parallel sentences tagged in 33 Wikipedia pages.

Download

For the reference to cite, please see the "readme" file.

 

Dorsa Treebank

This dataset contains about 30,000 sentences parsed in HPSG format.

treebank.ut.ac.ir

Please refer to:

Dehghan, Mohammad Hossein, Mohammad Molla-Abbasi, and Heshaam Faili. "Toward a multi-representation Persian treebank." In 2018 9th International Symposium on Telecommunications (IST), pp. 581-586. IEEE, 2018.

 

Automatic Construction of WordNet Using Graph-based WSD

We developed an automatic method for constructing a wordnet for low-resource languages. The only required resources are a dictionary from the target language to English and a monolingual corpus in the target language.

To use this code please refer to:

Nasrin Taghizadeh and Hesham Faili. "Automatic Wordnet Development for Low-resource Languages using Cross-lingual WSD". Journal of Artificial Intelligence Research (to be published).

Download source code

 

CPG: Corpus of Persian Grammatical Errors

This is a fully annotated corpus of grammatical errors collected from 700 essays written by learners of Persian at the Dehkhoda Lexicon Institute & International Centre for Persian Studies and at Imam Khomeini International University. The corpus contains 4,700 error tags. It is useful for developing and evaluating Persian grammatical error detection and correction systems.

Download

 

All rights to this website belong to the University of Tehran.