
Hamshahri Collection

Introduction

Text collections are essential for research in fields such as information retrieval, computational linguistics, and natural language processing. The Hamshahri collection is a standard, reliable Persian text collection that was used at the Cross-Language Evaluation Forum (CLEF) in 2008 and 2009 to evaluate Persian information retrieval systems. For more information, please visit Persian@CLEF2008 and Persian@CLEF2009.


We have prepared this collection to meet the following specifications:

  • It is large enough to be a reliable representative sample of Persian text and to be used in a variety of experiments.
  • It contains queries and their relevance judgments for evaluation of different systems.
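
As a rough illustration of how queries and relevance judgments are used to evaluate a retrieval system, the sketch below computes average precision per topic and the mean over all topics. The in-memory layout (dictionaries keyed by topic ID) and the topic and document IDs are assumptions made for this example only; they are not the file format distributed with the collection.

```python
from typing import Dict, List, Set


def average_precision(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Average precision of one ranked list against one topic's judgments.

    Assumes ranked_ids contains no duplicate document IDs.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, accumulated only at relevant documents.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(run: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-topic average precision over all judged topics."""
    if not qrels:
        return 0.0
    scores = [average_precision(run.get(topic, []), relevant)
              for topic, relevant in qrels.items()]
    return sum(scores) / len(scores)


# Hypothetical judgments and system output, for illustration only.
example_qrels = {"t1": {"d1", "d2", "d9"}, "t2": {"d4"}}
example_run = {"t1": ["d3", "d1", "d7", "d2"], "t2": ["d4", "d8"]}
print(mean_average_precision(example_run, example_qrels))
```

Topics for which the system returns nothing contribute a score of zero, so a system is not rewarded for skipping hard queries.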

 

Versions

The documents of the collection were prepared by crawling, preprocessing, and tagging the website of the Hamshahri newspaper. Version 1 of the collection was used at CLEF 2008 and 2009 for the evaluation of ad hoc information retrieval systems. Version 2 is about twice the size of version 1.

  • Version 1 of the collection contains two sets of queries and their judgments, created during the Persian track of CLEF 2008 and CLEF 2009. Each topic set was created by 25 different users (in 2008 and 2009, respectively) using the DIRECT system, which is used for topic creation and evaluation at CLEF.
  • Version 2 of the collection contains a topic set of 50 queries and their judgments, created by 25 different users during summer 2009. The topics of this version were created using the University of Tehran Information Retrieval Evaluation system (UTIRE).

 

Specifications

The following table summarizes the specifications of the two versions of the Hamshahri collection:

 

Criteria                          Version 1    Version 2
Size (Unicode CLEF XML Format)    700 MB       1400 MB
Number of Documents               160,000      318,000
Documents Time Span: From         1996/4/23    1996/4/23
Documents Time Span: To           2003/2/11    2007/5/13
Documents Category                Yes          Yes
Link to Images                    No           Yes
Link to Original Webpages         No           Yes
Query + Relevance Judgments       Yes          Yes

 

Comparison of Versions 1&2

  • Version 2 is more structured.
  • Version 2 is about twice as large as version 1, both in size and in number of documents.
  • Version 2 contains links to the original webpages (ORIGINALFILE XML tag). This lets you download the original webpages and process them however you like.
  • Version 2 contains the images that were used in the original webpages. All 148,639 images can be downloaded as a 1.9 GB package named 'Ham2-IMG'. This makes version 2 of the collection suitable for other tasks, such as image retrieval.

Please note that only version 1 of the collection was used at the CLEF campaign.
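
A minimal sketch of how the ORIGINALFILE links of version 2 might be collected is given below. Only the tag name ORIGINALFILE comes from the description above. The surrounding element names (DOC, DOCID), the sample URL, and the assumption that the link is stored as the element's text are hypothetical, so check them against the actual files before relying on this.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical document fragment: only the ORIGINALFILE tag name is taken
# from the collection description; DOC, DOCID and the URL are made up.
SAMPLE_XML = """<DOC>
  <DOCID>example-0001</DOCID>
  <ORIGINALFILE>http://example.org/news/0001.html</ORIGINALFILE>
</DOC>"""


def original_urls(xml_text: str) -> List[str]:
    """Return the text of every ORIGINALFILE element found in xml_text."""
    root = ET.fromstring(xml_text)
    urls = []
    # iter() visits the root element and all of its descendants.
    for element in root.iter("ORIGINALFILE"):
        if element.text and element.text.strip():
            urls.append(element.text.strip())
    return urls


print(original_urls(SAMPLE_XML))
```

The resulting list can then be handed to whatever download tool you prefer.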

 

Applications

The Hamshahri collection can be used for different purposes, such as:

  • Studying different features of information retrieval algorithms like indexers and retrieval models.

  • Analyzing the Persian language and its features.

  • Persian clustering and classification: every document in the collection has a 'Cat' tag that specifies its category (9 main categories and 36 subcategories in total, for example 'Economic.Bourse').

  • Other algorithms such as Persian stemming: algorithms of this type are important because they are used in fields such as information retrieval, spell checking, and machine translation. The documents of the collection are taken from articles of the Hamshahri newspaper and contain no spelling errors, which makes the collection suitable for applications such as stemming and analysis of the Persian language. For example, one can use it to build a statistical Persian stemmer.
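
For the classification use case, the 'Cat' values can be split into a main category and a subcategory. The sketch below assumes the dotted form shown in the example 'Economic.Bourse', with the main category before the first dot. Apart from that value, the category names are hypothetical placeholders.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def split_category(cat: str) -> Tuple[str, Optional[str]]:
    """Split a 'Cat' value such as 'Economic.Bourse' into (main, sub).

    A value without a dot is treated as a main category with no subcategory.
    """
    main, _, sub = cat.strip().partition(".")
    return main, (sub or None)


def count_main_categories(cats: Iterable[str]) -> Counter:
    """Count documents per main category, e.g. to inspect class balance."""
    return Counter(split_category(cat)[0] for cat in cats)


# 'Economic.Bourse' comes from the description above; the rest are made up.
example_cats = ["Economic.Bourse", "Economic", "Example.Sub", "Economic.Bourse"]
print(split_category("Economic.Bourse"))
print(count_main_categories(example_cats))
```

Counting per main category first is a quick way to see how unbalanced the classes are before training a classifier.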

 

Copyright

The Hamshahri corpus was created at the DBRG Lab of the University of Tehran, ECE department. All rights to the news in the corpus are reserved by the Hamshahri newspaper. All rights to the corpus data and the tools included on this website are reserved by the University of Tehran, Database Research Group. Use of this package for any research or non-commercial purpose is free, provided that you cite paper number [1] of the Publications section.

© Copyright 2009 University of Tehran, Database Research Group. All Rights Reserved.
Designed by Farzad Mahdikhani - Last update: 2010 Aug. 17