Hamshahri Collection

Download

Version 1
Version 2

Note:

Two different versions of the Hamshahri collection are provided:

  • Version 1 of Hamshahri has two subversions:
    • The CLEF version was prepared in 2008-2009, and many systems were evaluated on it in the CLEF 2008 and 2009 campaigns. It contains 100 topics with relevance judgments and 160,000+ documents, all in XML format. If you want to use version 1 of Hamshahri, we recommend this version. It can be downloaded here.
    • The DBRG version was prepared in 2007. Its documents are exactly the same as those of the CLEF version but are stored in TREC format. It contains 65 topics with relevance judgments, and these topics differ from the CLEF version topics. This version is not available online; if you need it, please contact us.
  • Version 2 of the Hamshahri collection was prepared in 2009 using the UTIRE system at the Database Research Group of the University of Tehran, following CLEF standards. It contains 320,000+ documents and 50 topics with their relevance judgments. Version 2 can also be downloaded here.

Hamshahri Version 1 - CLEF 2008 & 2009 version

Note:

To use Hamshahri version 1, you need passwords. Fill out this copyright form and mail it to a.aleahmad(at)ece.ut.ac.ir; we will then provide you with the passwords. We will also keep you updated about the latest changes to this version.

 

Item (Size): Description

  • Documents in XML + DTD (160 MB): This file contains all Hamshahri 1 documents in XML format. Each XML document is tagged with a category and contains news articles of one day from the Hamshahri newspaper. File names follow the pattern "HAM1-YYMMDD", where YYMMDD gives the year, month and day of the documents in the file. The Jalali date of the documents is tagged in the XML file itself.
  • Sample (34 KB): A sample of the Hamshahri 1 documents.
  • Categories (15.2 KB): This file contains all categories that the collection documents are tagged with.
  • Topic Set1 (176 KB): This file contains 50 topics created for Hamshahri 1 at CLEF 2008.
  • Topic Set2 (156 KB): This file contains 50 topics created for Hamshahri 1 at CLEF 2009.
  • Persian Stopwords (9.6 KB): A list of 800+ stopwords extracted from the Hamshahri 1 corpus.
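The file-name pattern described above ("HAM1-YYMMDD", and "HAM2-YYMMDD" for version 2) can be split into its parts with a short script. The following is a minimal Python sketch, not part of the collection's tooling: the function name, the optional file extension, and the sample names are assumptions made for illustration. It returns the two-digit year as-is and makes no claim about which calendar the file name uses.

```python
import re
from typing import NamedTuple, Optional

# Matches names such as "HAM1-960622" or "HAM2-070105.xml".
# The optional extension is an assumption; the page only gives "HAMn-YYMMDD".
_NAME_RE = re.compile(
    r"^HAM(?P<version>[12])-(?P<yy>\d{2})(?P<mm>\d{2})(?P<dd>\d{2})(?:\.\w+)?$"
)


class HamshahriFileName(NamedTuple):
    version: int  # 1 or 2
    yy: int       # two-digit year, exactly as written in the file name
    mm: int       # month, 1-12
    dd: int       # day, 1-31


def parse_file_name(name: str) -> Optional[HamshahriFileName]:
    """Split a Hamshahri file name into its parts, or return None if it does not fit."""
    m = _NAME_RE.match(name)
    if m is None:
        return None
    parsed = HamshahriFileName(
        int(m["version"]), int(m["yy"]), int(m["mm"]), int(m["dd"])
    )
    # Reject values that cannot be a month or a day in any calendar.
    if not (1 <= parsed.mm <= 12 and 1 <= parsed.dd <= 31):
        return None
    return parsed


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the parser.
    for sample in ("HAM1-960622", "HAM2-070105.xml", "readme.txt"):
        print(sample, "->", parse_file_name(sample))
```

Returning None for names that do not fit makes it easy to skip unrelated files (for example the DTD) when walking the extracted archive.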

 

Hamshahri Version 2

Note:

To use Hamshahri version 2, you need passwords. Fill out this copyright form and mail it to a.aleahmad(at)ece.ut.ac.ir; we will then provide you with the passwords.

 

Item (Size): Description

  • Documents in XML+DTD (399 MB): Contains the documents of Hamshahri 2 in CLEF format. Each XML document is tagged with a category and contains news articles of one day from the Hamshahri newspaper. File names follow the pattern "HAM2-YYMMDD", where YYMMDD gives the year, month and day of the documents in the file. The Jalali date of the documents is tagged in the XML file itself.
  • HAM2-IMG Package (1.93 GB): This package contains the images used within the collection documents. Image paths are tagged in the collection documents, so if you need the images, download the HAM2-IMG package.
  • Help (221 KB): User manual of the collection. It describes the XML tags used and the document categories.
  • A Sample Document (139 KB): A sample document of the collection in XML format. The DTD of the collection can also be viewed here.
  • Document Categories (15.2 KB): This file contains all categories that Hamshahri 2 documents are tagged with.
  • Topics (7.71 KB): Contains 50 queries that were created using UTIRE. All topics are presented in both English and Persian.
  • Relevance Judgment (485 KB): Contains the relevance judgments for the 50 topics above.
  • Persian Words (1.43 MB): A list of all Persian words in the collection with their frequencies.
  • Documents in Text Format (295 MB): Contains Hamshahri 2 documents in plain text format, stored in Windows-1256 encoding. If you need any other encoding, please use the codepage converter software, which is also downloadable here.
  • XML to Text Converter (3.53 KB): This software converts Hamshahri 2 XML documents to plain text encoded in Unicode, Windows-1256 or UTF-8. In addition to the Hamshahri 2 collection, you need .NET Framework 3.5 or higher installed to run the program.
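For the text-format documents, which are described above as stored in Windows-1256, re-encoding to UTF-8 can also be sketched in a few lines of Python. This is not the codepage converter offered on this page, only an illustrative alternative. The function names and the file paths in the usage note are hypothetical, and the sketch assumes each input file is entirely Windows-1256 text.

```python
from pathlib import Path


def recode_cp1256_to_utf8(data: bytes) -> bytes:
    """Decode Windows-1256 bytes and re-encode them as UTF-8."""
    # "cp1256" is Python's codec name for Windows-1256.
    return data.decode("cp1256").encode("utf-8")


def convert_file(src: Path, dst: Path) -> None:
    """Write a UTF-8 copy of a Windows-1256 text file.

    Bytes are read and written directly so that line endings are preserved.
    """
    dst.write_bytes(recode_cp1256_to_utf8(src.read_bytes()))
```

A call such as `convert_file(Path("doc-cp1256.txt"), Path("doc-utf8.txt"))` would then produce the UTF-8 copy; both paths here are placeholders.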

 

© Copyright 2009 University of Tehran, Database Research Group. All Rights Reserved.
Designed by Farzad Mahdikhani - Last update: 2010 Aug. 17