Web site harvesting


Every day, Web sites are created, evolve or disappear. To keep track of this mass of often ephemeral information, BAnQ has integrated Web site harvesting into its activities for building the collection of Québec's published heritage materials.

To ensure that BAnQ's resources are used to their fullest potential and that sites are harvested for the value of their content, the Web site harvesting programme defines site selection criteria and sets a framework for the harvesting process and activities.

Site selection 

Responsibility for the selection
The professionals of the Grande Bibliothèque Branch, the National Library Branch and the National Archives Branch will be responsible for selection, under the coordination of the Legal Deposit and Heritage Collections Preservation Branch.
A committee of representatives from these administrative units will validate the proposals and ensure that they respect the parameters stated in this document.

Selection criteria
The sites selected must first satisfy the objectives of the harvesting programme as stated in Item 1. The site producers must be located in Québec, as confirmed by the “Contact us” mailing address, and the corpus as a whole must be regionally representative.

The selection must also take into account more specific criteria such as:

  • the relevance of the site for patrons and citizens;
  • the renown of the site producer, the citation of the site in recognized sources;
  • the currency of the subject matter, the importance of the event, the permanent and historic value of the content;
  • the originality of the subject matter and its complementarity with other BAnQ collections;
  • the accessibility of the information;
  • the quality of the language, the presentation of the site, its technical qualities, its user-friendliness;
  • the risk that the site may disappear.

The sites created by private archive donors may be harvested to supplement the archival fonds. 

In general, the sites selected will be in French or English, or multilingual provided that a French version exists.

All types of content are subject to harvesting (video, audio, image, text, etc.), unless the web crawler is unable to collect it.

Certain types of sites or parts of sites are excluded from the outset because of their subject matter, the manner in which it is treated, or foreseeable legal constraints:

  • social media;
  • intranets, extranets, emails;
  • sites that violate the laws in force;
  • advertising and transactional sites;
  • databases;
  • paid content.

BAnQ reserves the right to refuse a producer's request to have its site harvested if the site does not satisfy the selection criteria.

How are Web sites harvested?

  • An image (snapshot) of the site is recorded using the Heritrix software.
  • Heritrix can collect videos and PDF documents.
  • This software cannot bypass passwords or access an intranet site or a database.
  • Patrons can view harvested Web sites here. The harvested sites are made accessible under rights obtained through licences. A banner on each page indicates that it is a copy of the site.
  • The harvesting does not involve any intervention on the part of the site administrator. It will have no impact on the performance of the site.


Collaboration with other organizations

BAnQ is a member of the International Internet Preservation Consortium (IIPC), whose mission is to design Web harvesting tools, standards and practices, to promote access to Web archives and to further their outreach.

In order to promote the sharing of content, tasks and expertise, BAnQ works closely with the various stakeholders in the field. 

Any questions?

If you have any questions or would like to recommend a site that fits our selection criteria, write to us!
Email: archivageweb@banq.qc.ca
