Website harvesting

 

Thousands of websites are launched, redesigned or removed every day. To keep track of this mass of often ephemeral content, BAnQ has made website harvesting (also known as collecting or capturing) part of its mission, which includes assembling Québec's published heritage materials.

 In order to maximize the use of BAnQ resources and ensure that valuable Web content is collected, the Website Harvesting Program has three distinct goals: defining the purpose and extent of the program, establishing site selection criteria, and overseeing the methods and the collection itself.

Purpose and extent of the harvesting program 

BAnQ chooses the websites it collects based on their content: taken together, the selected sites should cover a wide range of topics representative of Québec society at a given moment in time. A number of factors make it difficult, if not impossible, to harvest the Québec Web exhaustively. These factors include:

  • the size of the body of materials to be collected, given BAnQ's limited resources;
  • legal constraints, i.e., the requirement to obtain a license granting permission from the Web Producer or other copyright owners to make their site accessible;
  • context, because Québec does not have its own domain name. 

 Each year, BAnQ harvests as many of the 500 chosen sites as it is able to, given the resources it can allocate to the program.

 There will be significant turnover in the collection, as Web content and the collected sites change and as the interests of BAnQ's Web portal users evolve.

 Capturing websites enables BAnQ to acquire and preserve content that can be accessed for free on the sites. However, for now, this process does not relieve the publisher of the obligation to carry out a legal deposit.

Types of websites collected 

The types of websites considered for inclusion in the program are as follows:

  • government department and agency sites that are subject to the provisions of the Québec Archives Act; these consist of roughly 150 websites created by centralized public sector bodies, as defined in Section 15 and the Appendix to the Québec Archives Act; collecting these sites relieves organizations that are subject to this Act of the obligation to upload them;
  • topical sites dealing with a specific subject or area of knowledge;
  • short-lived sites created specifically for special events such as elections and celebrations.

BAnQ is responsible for establishing the selection criteria and other program elements such as collection frequency, which can vary depending on the type of site.

Site selection 

Responsibility for the selection
Professionals at the Grande Bibliothèque, the National Library and the National Archives branches are responsible for choosing the sites to be collected, under the coordination of the Legal Deposit and Preservation of Heritage Collections Department. The sites are then approved by a committee of representatives from these administrative bodies, who check that they meet the criteria below.

Selection criteria

Sites must be consistent with the program’s goals, as outlined in the Purpose and extent of the harvesting program section of this page. The Producers must reside in Québec, as confirmed by the Contact Us mailing address, and the entire collection must focus on regional representation.

The selection must also take into account more specific criteria such as:

  • the site’s relevance for users and local residents;
  • the renown of the site Producer and whether the site is mentioned in reputable sources;
  • the topicality of the subject matter, the importance of the event, the content’s permanent and historical value;
  • the originality of the subject matter and its alignment with other BAnQ collections and with other collected sites;
  • the accessible nature of the information;
  • language quality, site layout, technical qualities, user-friendliness;
  • potential removal of the site.

BAnQ does not select government department or agency websites as its goal is to collect all of them.

Some sites created by private archive donors may be collected to complement archival fonds. 

Most of the selected sites are in French or English. However, BAnQ also chooses multilingual sites as long as there is a French version.

Video and audio files, images, text – basically, any type of content the bot can “spider” during a “crawl” is collected.

Exclusions

Some types of sites or parts of sites are automatically excluded due to their subject matter, the way they are handled, or possible legal issues. These include:

  • social media;
  • intranets, extranets and emails;
  • sites that breach applicable laws;
  • advertising and transactional sites;
  • databases;
  • paid content.

BAnQ reserves the right to refuse a Producer's request to have its site harvested if the site does not satisfy the selection criteria.

Harvesting methods

Harvesting versus physical storage

Rather than storing websites on a physical device, BAnQ collects them using a Web crawler. This allows the sites to be made accessible in their original medium and preserves the interaction between different site components.

Tools

BAnQ collects the selected sites using the Heritrix Web crawler and makes them accessible using the OpenWayback playback software. BAnQ does not customize these free tools, as doing so would make applying software updates and fixes more cumbersome.

Licenses

While BAnQ is authorized to collect Québec sites without a license, it cannot make them accessible without prior permission to do so. The license states that the collected sites can be made accessible on BAnQ's premises, or on its Web portal.

The rules pertaining to government department and agency websites are slightly different: BAnQ can collect the sites without having to obtain prior permission, nor does it need prior permission to make them accessible on BAnQ premises.

The Web Producer retains copyright of the collected site.

For technical reasons, collected sites cannot be deleted once they have been indexed. However, the Web Producer can request that access to the site be restricted.

Frequency

The types of sites listed below are collected at different intervals.

Government department and agency websites:

  • When the structure or mission of a department or agency changes.
  • When a new site is uploaded.
  • After significant changes to the site, such as a major overhaul of the content, corporate design or architecture.
  • Just before the site is removed.

Topical sites:

  • Daily, surface-level collection of the main Québec media sites.
  • Yearly, or at intervals set by the time it takes the Web crawler to completely harvest all such sites.
  • Some sites may be collected at shorter intervals, depending on how often their content is updated.

 Short-lived sites:

  • During major political events, such as elections or referendums, selected sites are collected daily or weekly.
  • During major unforeseen events, such as environmental emergencies or social movements like the 2012 student protests, event-specific sites are collected daily.
  • During major foreseeable events, such as Montréal’s 375th anniversary, event-specific sites are collected at varying intervals, depending on the event’s newsworthiness and how quickly the situation develops.
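
The intervals above could be expressed as a small scheduling table. The following is only a minimal sketch: the category keys, the `Schedule` class and the `next_capture` function are hypothetical names chosen for illustration, and the fixed day counts simplify rules that, in practice, depend on the event and on how long a full crawl takes.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Schedule:
    # Number of days between captures; None means the capture is
    # triggered by an event rather than by the calendar.
    interval_days: Optional[int]
    note: str


# Hypothetical category names; the intervals paraphrase the lists above.
SCHEDULES = {
    "government": Schedule(None, "on restructuring, launch, overhaul or removal"),
    "media_daily": Schedule(1, "surface-level capture of main media sites"),
    "topical": Schedule(365, "yearly, or as long as a full crawl takes"),
    "political_event": Schedule(1, "daily or weekly during the event"),
    "unforeseen_event": Schedule(1, "daily during the event"),
}


def next_capture(site_type: str, last_capture: date) -> Optional[date]:
    """Return the next planned capture date, or None if event-driven."""
    schedule = SCHEDULES[site_type]
    if schedule.interval_days is None:
        return None
    return last_capture + timedelta(days=schedule.interval_days)
```

For example, `next_capture("media_daily", date(2024, 1, 1))` gives the following day, while government sites return `None` because their captures are triggered by changes to the site or organization.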

Levels

  • Government sites: if the Web crawler can handle it, the entire website is collected.
  • Topical and short-lived sites (except for the daily media collection): unless it takes too long or takes up too much space, the entire website is collected.

Honouring robots.txt exclusions

BAnQ deems that having a license conferring prior permission from the Web Producer means it can ignore robots.txt exclusions.

BAnQ's Web crawler identifies itself by including the organization’s name in the user-agent field of its requests.
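
To illustrate what ignoring a robots.txt exclusion means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not how Heritrix is configured: the `USER_AGENT` string and the `may_capture` function are hypothetical, and the sketch assumes exclusions are honoured whenever no license is held.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent string that names the organization.
USER_AGENT = "ExampleArchiveBot (BAnQ)"


def may_capture(url: str, robots_txt: str, has_license: bool) -> bool:
    """Decide whether a URL may be captured.

    With a license from the Web Producer, robots.txt exclusions are
    ignored. Without one, this sketch assumes they are honoured.
    """
    if has_license:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


# Example robots.txt content that excludes one directory for all crawlers.
EXAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
```

With `EXAMPLE_ROBOTS`, a page under `/private/` is refused when `has_license` is false and allowed when it is true; pages outside that directory are allowed in both cases.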

Quality of the captured content

In spite of regular quality controls, missing style sheets and other factors may mean the captured content does not fully replicate the original user experience. However, if the website content is by and large preserved, the site is made accessible despite its limitations.

 

Making the harvest accessible

The archived websites are presented as lists on a specific display interface that is searchable by topic and URL.  

The display interface can be viewed on the Web portal, but because viewing the actual sites is regulated by the license conferring permission from the Web Producer, some sites can only be viewed on BAnQ premises.

To avoid confusion, a banner at the top of the captured sites’ pages states that these are archived sites.

Preserving collected websites

The captured websites will be permanently archived by BAnQ.

Content collected with Heritrix 3.0 or later versions is stored as WARC files whereas content collected with earlier versions is stored as ARC files.
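
The two container formats can be told apart from their first bytes: a WARC record starts with a version line such as `WARC/1.0`, while an ARC file starts with a `filedesc://` record, and both are commonly gzip-compressed. The sketch below shows one way to check this; the function names are hypothetical and this is not a description of BAnQ's own tooling.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def detect_container(head: bytes) -> str:
    """Classify the leading bytes of an archive file as WARC, ARC or unknown."""
    if head.startswith(GZIP_MAGIC):
        # wbits = MAX_WBITS | 16 selects gzip framing; only the first
        # 64 decompressed bytes are needed, so truncated input is fine.
        decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        head = decompressor.decompress(head, 64)
    if head.startswith(b"WARC/"):
        return "WARC"
    if head.startswith(b"filedesc://"):
        return "ARC"
    return "unknown"


def sniff_file(path: str, size: int = 512) -> str:
    """Read the first bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return detect_container(handle.read(size))
```

Checking the content rather than the file extension avoids relying on naming conventions, which may differ between collections.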

Collaboration with other organizations

BAnQ is a member of the International Internet Preservation Consortium (IIPC), whose mission is to design Web harvesting tools, standards and practices, to promote access to Web archives and to further their outreach.

In order to promote the sharing of content, tasks and expertise, BAnQ works closely with the various stakeholders in the field. 

Any questions?

If you have any questions, would like to recommend a site that meets our selection criteria, or if you want to tell us that your website is about to undergo a major overhaul, email us at: archivageweb@banq.qc.ca.
