If you are including a website as part of your research project to disseminate, gather or process data, you need to consider how you will archive that site at the end of the project so it remains available after the hosting funds run out. As an output, and one that contains data, your funder will expect the site to be available for the long term.
Once the hosting funds run out, your site will no longer be available to people following links from publications, from records in the Kent Academic Repository (KAR) or from social media. The site will not be able to contribute to the UKRI Research Excellence Framework (REF) or serve as evidence in support of further funding. Even if your project website is hosted by the University of Kent, it will only be available for 10 years. The solution is to use a web archiving service that will preserve your site as-is and provide you with a link, so that links pointing to your original site can be redirected to the archived copy. As soon as your website has finished being developed, gathering data and hosting processed data, it can be archived for posterity.
How can you do this?
The easiest way to make sure your site is archived is to use one of the established online applications.
UK Web Archive
The UK Web Archive is supported by all of the UK legal deposit libraries. Their aim is to collect all UK websites at least once per year, but you can also ask them to specifically save your research website using the Save a UK Website nomination form. The UK Web Archive also has collections of archived sites centred around topics and themes. One of these is the British Stand-up Comedy Archive, which complements the physical special collections held at the University of Kent Templeman Library.
WebCite
WebCite is designed to support authors and editors citing websites, so it may be most useful for researchers ensuring links in their publications remain usable. It provides links to an archived local copy of a website so that readers are not met with a “404” error when they click on a link. It does mean that the link will point to a snapshot of the site as it was when the work was written, rather than the evolved site at the time the reader accesses the publication.
MirrorWeb
MirrorWeb is a commercial service that enables organisations to archive their own websites. It uses Heritrix to harvest web pages and offers indexing and hosting services. It is used by the BBC, various banks, the National Archives and some UK universities.
How do web archives work?
There are several steps to archiving web sites:
- Selection and planning – depending on the mission of the archive, this could be universal, like the Internet Archive; limited to a geographic region, like the UK Web Archive; or limited by subject. Archives can also serve a specific institution, commercial company or national government. Within these remits, web archives also have policies limiting the types of web pages and materials they will collect.
- Permissions – ensuring they have permission to collect the sites. The UK Web Archive is supported by the legal deposit libraries and has an extension of that role to collect digital materials in the UK.
- Harvesting – web archives use web crawlers linked to software that harvests or collects websites and their contents. Once they know the URL that they should start from, known as the seed, they work through the links, saving each page and its files. Some crawlers are free and some are part of commercial services. Free crawlers include:
- Heritrix – a crawler that creates WARC files, developed by the Internet Archive and available under a free software licence.
- HTTrack – creates a snapshot of the web page which you can then open in your web browser (Chrome, Safari, etc.) and browse from link to link.
- Screaming Frog – the basic web crawler is free and allows up to 500 URLs, or seeds, to be crawled. Buying a licence gives access to advanced features and removes the limit.
- Conifer – previously known as Webrecorder. Create a free account and save any URL and content you use. The result has relatively high fidelity and is good for small-scale websites.
- Wget – a computer programme that retrieves content from web servers. A non-interactive command-line tool, it is more flexible but possibly less friendly for non-technical users.
The files harvested from the URLs are saved as WARC files. WARC (Web ARChive) is the open ISO-standard format for the long-term preservation of web pages and the material that supports or accompanies them.
- Description – once gathered the files need to be saved with metadata that accurately describes the characteristics and provenance of the files.
- Long-term preservation – the web archives store their files in accredited secure environments with multiple duplicate copies and processes for self-checking and repair.
- Access – once harvested and described, the files need to be made available to requestors in the form originally intended; for this, software that can read and present the WARC files as web pages is required. The Internet Archive uses the Wayback Machine, which gives access to the billions of websites it has collected.
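To give a feel for what that access software does under the hood, here is a minimal sketch of listing the pages captured in a WARC file. It assumes Python with the third-party warcio package (pip install warcio) and a hypothetical file name:

```python
from warcio.archiveiterator import ArchiveIterator

# Open a harvested WARC file and list the captured pages.
# 'example.warc.gz' is a hypothetical file name - use one of your own.
with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # 'response' records hold the pages and files returned by the web server
        if record.rec_type == 'response':
            url = record.rec_headers.get_header('WARC-Target-URI')
            content_type = record.http_headers.get_header('Content-Type')
            print(url, content_type)
```

Playback tools like the Wayback Machine build on the same idea, indexing these records so that each captured URL can be served back as a web page.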
Doing it yourself
If you want to archive your own sites without using any of the web archive services, you can do it yourself. As the materials are your own and relate to a specific project, the first two steps take care of themselves. Next, like the web archives, you can use one of the web crawlers mentioned above to create WARC files of your own sites. You can then archive them in a research repository like the Kent Data Repository.
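If your site is small, you can even create a WARC file directly from Python. The following is only a rough sketch, assuming the third-party warcio and requests packages (pip install warcio requests) and placeholder URLs; a dedicated crawler like Heritrix, HTTrack or Wget will do a much more thorough job on a real site:

```python
from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http so its traffic is recorded

# Fetch a handful of pages and write them into a WARC file.
# The URLs below are placeholders - substitute the pages of your own project site.
pages = [
    'https://example.org/',
    'https://example.org/about',
    'https://example.org/data',
]

with capture_http('my-project-site.warc.gz'):
    for page in pages:
        requests.get(page)
```

The resulting my-project-site.warc.gz file is the kind of thing you would deposit in a repository alongside the rest of your project data.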
Kent Data Repository (KDR)
You can archive all the data and files that make up your website on the Kent Data Repository (KDR). You will be able to create descriptive metadata for your website files, and KDR does provide long-term preservation. It is a good idea to do this as a backup to using one of the services above. Moreover, your record on KAR can have a DOI and will link to the fully archived website and any other publications or outputs related to your project.
However, KDR cannot present your website in its finished and functional form; it only preserves the files so the website can be recreated. So, once you have your website preserved as WARC files, you will need software to play them back. Some of the tools above allow playback in ordinary web browsers, or you could use a tool like Webrecorder Player to view the archived website offline.
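For a quick spot check before you deposit, you can pull a single captured page back out of a WARC file and open it in a browser. This is a minimal sketch, again assuming the warcio package and hypothetical file and URL names; images and links in the extracted page will still point at the original URLs, which is why a playback tool such as Webrecorder Player is needed for a faithful offline view:

```python
from warcio.archiveiterator import ArchiveIterator

# Hypothetical names - replace with your own WARC file and a URL it contains.
warc_file = 'my-project-site.warc.gz'
wanted_url = 'https://example.org/'

with open(warc_file, 'rb') as stream:
    for record in ArchiveIterator(stream):
        if (record.rec_type == 'response'
                and record.rec_headers.get_header('WARC-Target-URI') == wanted_url):
            # Write the archived HTML out so it can be opened in a browser.
            with open('archived-page.html', 'wb') as out:
                out.write(record.content_stream().read())
            break
```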
How to make your site easy to archive
Whether you use an established web archive or you choose to do it yourself, you can take steps while you are creating your site to ensure it is archived fully and accurately. Not all functions on websites can be recorded by web crawlers, but there are ways to be sure your website can be accurately captured.
What cannot be harvested
Web crawlers are also used by search engines like Google and by other software with less benign intent. Controls like robots.txt files and software designed to limit or block web crawlers will also exclude archival web crawlers (a short sketch of how a crawler checks robots.txt follows this list). Archival web crawlers also struggle with:
- streaming media, animation and embedded social media,
- database-driven features – crawlers only capture a snapshot of the site, they do not copy the underlying database,
- password-protected content that needs a user to log in,
- dynamically generated content like built-in search and search results, drop-down menus and tick-boxes, anything with a ‘Submit’ button,
- POST functionality often used with web forms,
- complex JavaScript,
- content on external websites, unless specifically directed there.
Crawlers can also get caught in crawler traps; for example, where a calendar does not have an end date, the crawler can get stuck in an endless loop.
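The sketch below shows, in outline, how a well-behaved crawler consults robots.txt before fetching a page. It uses only Python's standard library; the site and user-agent names are placeholders. If your robots.txt blocks crawlers outright, it will block archival crawlers too:

```python
from urllib.robotparser import RobotFileParser

# Placeholder site and user-agent - a real archival crawler would use its own name.
robots = RobotFileParser()
robots.set_url('https://example.org/robots.txt')
robots.read()

page = 'https://example.org/research/data.html'
if robots.can_fetch('ExampleArchiveBot', page):
    print('This page can be harvested.')
else:
    print('robots.txt blocks this page - it will be missing from the archive.')
```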
What can be harvested
Web crawlers can pick up:
- any code or information in a web programming language like HTML,
- formatting in Cascading Style Sheets (CSS),
- text and images,
- video and audio where the file is embedded (using progressive download) and not streamed,
- documents and other files in open or standard file formats like TXT, XML, PDF and CSV.
What to do…
The way you design your website from the start will impact its ability to be harvested by web archive crawlers. Here are some tips:
- Follow established guidelines and make your site more accessible. Making your site easy to archive will also make it more open to users of assistive technologies like screen readers and to internet search engines, improving its visibility and dissemination. Practical advice and resources are available:
- W3C WAI resources include practical tools and tips and a validation tool to make your site accessible and usable for everyone including web archive crawlers
- Google developer tools are designed to help the Google crawler but will also help web archiving crawlers
- ArchiveReady is a free website archivability evaluation tool that includes an icon so you can revalidate at any time.
- When creating your content follow University guidelines for Plain English:
- Make sure your language is concise and straightforward.
- Use short paragraphs and bullet points
- Use headings and don’t use all capitals
- Avoid phrases with 4 or more nouns in a row, e.g. Research Data Management Plan Guidance
- Don’t be cheeky or cute, or make assumptions about your audience – avoid jargon
- Follow the guidance for FAIR data with documents and other resources that you upload to your site.
- Use durable, open or standard formats to ensure your website is robust and can be reliably used by a wide variety of devices including crawlers and screen readers.
- Use media (MIME) types and file extensions correctly so that crawlers and browsers know which application to open, e.g. .docx for Word, .pdf for Acrobat, .csv for Excel (see the sketch after this list).
- Where you must use proprietary content, dynamically generated content or JavaScript, provide a text or plain HTML alternative. When Adobe stopped supporting Flash at the end of 2020, many online resources were rendered completely inaccessible.
- Maintain stable URLs and apply a policy of using persistent URLs to avoid “link rot”:
- If you retire a resource or page leave a “tombstone” record with an explanation and redirect to point users to the new location or back to the home page.
- Use absolute URLs rather than relative ones in links (the full URL rather than the abbreviated form prefixed by … to indicate it is under the same root) in case the relationships between files and pages change on your site.
- Use meaningful URLs with words that indicate the content and do not use code or numbers
- Use static links and one link for each resource or document
- Be transparent:
- Include a site map
- Allow users to browse a complete map of your website rather than just searching
- Use a single root URL for all the pages and keep navigation simple
- Do not use too many layers, as some web crawlers are limited in the number of layers they will go down. Six is a maximum.
- Include a first publication date and update dates. Always use server- not client-side generated dates.
- Include a licence making it clear how the content can be used. Make it clear whether one licence applies to the whole site and its content or if different licences are in place for different content. Creative Commons offers a useful standard for licences; see our Copyright guidance for more help.
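As a sketch of the MIME-type point above, the following uses only Python's standard library to walk a local copy of your site and flag files whose type a browser or crawler may not recognise. The folder name is a placeholder:

```python
import mimetypes
from pathlib import Path

# 'public_html' is a placeholder - point this at the folder holding your site's files.
site_root = Path('public_html')

for path in sorted(p for p in site_root.rglob('*') if p.is_file()):
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None:
        print(f'{path}: no standard MIME type - consider a more standard format')
    else:
        print(f'{path}: {mime_type}')
```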
Keeping your web design simple and using standard formats and language reduces the likelihood of problems for your users, for search engine indexing and for web archiving:
“Designing for archivability will also tend to optimize your website for accessibility, (other) crawlers, boost website performance, enhance viability for contemporary users, and improve the likelihood that you’ll be able to refer to and/or recover historical versions of your own web content” (Stanford Libraries)
Finally, when you archive your website leave a reference to the archived version on the live version and ask for a tombstone page to point to the archive when the site is retired.
Other Links
- Where to find historical websites no longer online https://www.waybackmachinedownloader.com/
- Stanford Libraries https://library.stanford.edu/projects/web-archiving/archivability
- National Archives https://www.nationalarchives.gov.uk/documents/web-archiving-technical-guidance.pdf
- Smithsonian Institution Archives Blog https://siarchives.si.edu/blog/
Contact us at ResearchSupport@kent.ac.uk for more information