

Welcome to the website of the Bijankhan corpus

The Bijankhan corpus is a tagged corpus suitable for natural language processing research on the Persian (Farsi) language. The collection was gathered from daily news and common texts. All documents in the collection are categorized by subject, such as political, cultural and so on; in total there are 4300 different subjects. The Bijankhan collection contains about 2.6 million manually tagged words, with a tag set of 40 Persian POS tags. The collection is prepared and distributed by the Database Research Group at the University of Tehran. We are indebted to Prof. M. Bijankhan of the Faculty of Literature & Human Science at the University of Tehran for his invaluable work on the original version of the corpus, and we named the corpus after him.
We also recommend visiting the website of the Hamshahri corpus, which is more suitable for information retrieval research.

Copyright

The Bijankhan corpus was created in the DBRG Lab of the ECE department at the University of Tehran. All rights to this corpus and the tools included in this package are reserved by the University of Tehran - Database Research Group. This package may be used free of charge for any research or non-commercial purpose, on the condition that you cite the related papers below.

Package components

  1. Bijankhan processed corpus (149 MB)
  2. Bijankhan original corpus (50.3 MB)
  3. Distinct words of Bijankhan corpus (76707 words in Unicode text format)
  4. Five random training and test sets (85% training, 15% test) of the corpus, used in the following papers.
  5. Source code of the POS taggers that we used.
  6. Published papers and presentations.
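
The five random 85%/15% training and test pairs in item 4 can be illustrated with a short Python sketch. This is not the procedure used in the papers: the function name `make_splits`, the fixed seed, and the choice to shuffle a flat list of (word, tag) pairs are all assumptions made for illustration, and the toy data stands in for the real corpus.

```python
import random

def make_splits(items, n_splits=5, train_fraction=0.85, seed=0):
    """Return n_splits (train, test) pairs, each built from an independent shuffle.

    Hypothetical helper: the unit that is shuffled (tokens here) is an assumption.
    """
    rng = random.Random(seed)                   # seeded so the splits are reproducible
    cut = round(len(items) * train_fraction)    # size of each training part
    splits = []
    for _ in range(n_splits):
        shuffled = list(items)                  # copy so the caller's data is untouched
        rng.shuffle(shuffled)
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits

# Toy stand-in for the corpus: 100 placeholder (word, tag) pairs.
toy = [(f"w{i}", "TAG") for i in range(100)]
splits = make_splits(toy)
```

Each pair partitions the same data, so a tagger can be trained and evaluated five times and the accuracies averaged.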

Downloads

Files and descriptions:

  1. Processed corpus (11.1 MB). File: 9.77 MB. A compressed version of the whole corpus in Unicode text format. It contains a version of the Bijankhan corpus processed to be more suitable for NLP tasks, as described in [1], with nearly 2.6 million tagged words. A sample of the corpus and a description of its tag set are also available.
  2. Original corpus (3.7 MB). Files: 3.7 MB, 15.4 KB. A compressed version of the whole corpus in LBL text format. It contains the original Bijankhan corpus, unchanged, as manually tagged and prepared at the Research Center of Intelligent Signal Processing (RCISP). Its tag set contains 550 tags, and it has 4300 subject categories in total.
  3. Distinct words of the corpus (256 KB). File: 256 KB. A compressed Unicode text file containing the 76707 distinct words of the Bijankhan corpus.
  4. Training and test sets (will be added soon). A compressed file containing five different pairs of training and test sets created randomly from the Bijankhan corpus. Each training part consists of 85% of the corpus and each test part of 15%. For more information, please refer to [1].
  5. MLE tagger (53.4 KB). Files: 15.4 KB, 15.4 KB. C# source code of the Maximum Likelihood Estimation (MLE) tagger that we implemented and used in our studies, together with a demo showing how to use the program.
  6. TnT tagger. To obtain the TnT tagger, please refer to the website of TnT: Statistical Part-of-Speech Tagging.
  7. MBT tagger. An open source version of the Memory Based POS Tagger (MBT) is available from its website.
  8. Corpus words (574 KB). File: 15.4 KB. All words of the corpus and their frequencies.
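
The MLE tagger in the package is distributed as C# source. As a rough illustration of the general idea only, the Python sketch below assigns each word the tag it carried most often in the training data. It is not a port of the packaged tagger: the function names, the fallback to the overall most frequent tag for unseen words, and the placeholder tags (which are not Bijankhan tags) are all assumptions.

```python
from collections import Counter, defaultdict

def train_mle(tagged_words):
    """Build a word -> most frequent tag table from (word, tag) pairs."""
    per_word = defaultdict(Counter)   # tag counts for each word
    overall = Counter()               # tag counts over the whole training set
    for word, tag in tagged_words:
        per_word[word][tag] += 1
        overall[tag] += 1
    model = {w: counts.most_common(1)[0][0] for w, counts in per_word.items()}
    # Assumption: unseen words receive the most frequent tag overall.
    default_tag = overall.most_common(1)[0][0]
    return model, default_tag

def tag_words(words, model, default_tag):
    """Tag each word with its most frequent training tag, or the default."""
    return [(w, model.get(w, default_tag)) for w in words]

# Toy training data with placeholder tags.
train = [("the", "DET"), ("dog", "N"), ("runs", "V"), ("the", "DET"),
         ("run", "N"), ("run", "V"), ("run", "V")]
model, default_tag = train_mle(train)
tagged = tag_words(["the", "run", "cat"], model, default_tag)
```

Here "run" is tagged "V" because that tag outnumbers "N" in the toy data, and the unseen word "cat" falls back to the default tag.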

Published Papers:

[1]
Hadi Amiri, Hossein Hojjat, Farhad Oroumchian. Investigation on a Feasible Corpus for Persian POS Tagging. 12th International CSI Computer Conference, Iran, 2007.
This paper reports the creation of a test corpus for automatic part-of-speech tagging, based on the Persian tagged corpus of Prof. Bijankhan. It covers preprocessing, statistical analysis, and experiments on this corpus with a simple statistical POS tagging method, MLE.
[2]
Farhad Oroumchian, Samira Tasharofi, Hadi Amiri, Hossein Hojjat, Fahimeh Raja. Creating a Feasible Corpus for Persian POS Tagging. Technical Report, no. TR3/06, University of Wollongong in Dubai, 2006.
This technical report contains a very thorough analysis and account of the creation of the Bijankhan corpus.
[3]
Samira Tasharofi, Fahimeh Raja, Farhad Oroumchian, Masoud Rahgozar. Evaluation of Statistical Part of Speech Tagging of Persian Text. International Symposium on Signal Processing and its Applications, Sharjah (U.A.E.), 2007.
This paper studies the performance of one of the popular POS taggers, the TnT tagger, on the Bijankhan corpus. The TnT tagger has been shown to have high accuracy in English and some other languages; this paper shows that it provides high accuracy in Persian too.
[4]
Fahimeh Raja, Hadi Amiri, Samira Tasharofi, Hossein Hojjat, Farhad Oroumchian. Evaluation of Part of Speech Tagging on Persian Text. The Second Workshop on Computational Approaches to Arabic Script-based Languages, Linguistic Institute, Stanford University, 2007.
This paper compares the accuracy of three different POS taggers, MLE, MBT and TnT, on the Bijankhan corpus and demonstrates the value of simple heuristics and post-processing in improving the accuracy of these methods.
[5]
Abolfazl Aleahmad, Yoosef Ramezani, Farhad Oroumchian. Using OWA for Persian Part of Speech Tagging. November 2006.
In this study we used the OWA method to fuse the results of three different POS tagging systems, namely MLE (Maximum Likelihood Estimation), the TnT tagger and PTT (Persian Tree Tagger).
[6]
Hadi Amiri. Persian (Farsi) POS tagging. Presented in the NLP course on 7 November 2006.
[7]
Mostafa Keikha. Persian (Farsi) POS tagging. Presented in the NLP course on 7 November 2006.
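
Reference [5] fuses the output of three taggers with OWA. Taking OWA to mean the ordered weighted averaging operator, the Python sketch below shows the operator itself and one possible way to apply it to tagger output. How [5] derives per-tag scores and chooses the weights is not described on this page, so the `fuse` function, the score dictionaries, the weights, and the placeholder tags are all hypothetical.

```python
def owa(values, weights):
    """Ordered weighted average: weights apply to values sorted in descending order."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

def fuse(scores_per_tagger, weights):
    """Pick the tag with the highest OWA-aggregated score.

    scores_per_tagger: one dict per tagger mapping tag -> score (hypothetical input).
    """
    tags = set().union(*scores_per_tagger)   # every tag proposed by any tagger
    fused = {t: owa([s.get(t, 0.0) for s in scores_per_tagger], weights)
             for t in tags}
    return max(fused, key=fused.get)

# Hypothetical scores from three taggers for one word, with placeholder tags.
scores = [{"N": 0.6, "V": 0.4}, {"N": 0.3, "V": 0.7}, {"N": 0.8, "V": 0.2}]
weights = [0.5, 0.3, 0.2]                    # illustrative weights summing to 1
best = fuse(scores, weights)
```

Because the weights attach to rank positions rather than to particular taggers, the choice of weights moves the aggregate between the maximum and the minimum of the three scores.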

Contact Information:

Please feel free to contact us if you have any questions:

Name and subject:

  1. Hadi Amiri: the corpus, its statistics and POS taggers
  3. Abolfazl AleAhmad: the corpus, its statistics and POS taggers

 
 
  © Copyright 2007-2008 University of Tehran - Database Research Group. All Rights Reserved. Design by a.aleahmad - Last update 2010 Aug. 17