From 6757692796d34e04d782af2083c14dce237605d8 Mon Sep 17 00:00:00 2001
From: Brandon Rozek
Date: Sat, 24 Dec 2022 11:23:42 -0500
Subject: [PATCH] New post

---
 content/blog/personal-simple-web-archive.md | 75 +++++++++++++++++++++
 1 file changed, 75 insertions(+)
 create mode 100644 content/blog/personal-simple-web-archive.md

diff --git a/content/blog/personal-simple-web-archive.md b/content/blog/personal-simple-web-archive.md
new file mode 100644
index 0000000..f8a92e5
--- /dev/null
+++ b/content/blog/personal-simple-web-archive.md
@@ -0,0 +1,75 @@
---
title: "Personal Web Archive and How I Archive Single Web Pages"
date: 2022-12-24T10:01:16-04:00
draft: false
tags: [ "Archive" ]
---

The [Internet Archive](https://web.archive.org/) is great for providing a centralized database of snapshots of websites over time. What happens, though, when you want your own offline copy for times of little to no internet access? [Archivebox](https://archivebox.io/) is one solution to this problem. It behaves similarly to the Internet Archive and also allows importing RSS feeds to save local copies of blog posts. To install it, you can use `pipx`.

```bash
pipx install archivebox
```

For the rest of this post, however, I want to talk about a simpler tool: a combination of `wget` and `python -m http.server`. In the past, I've used `wget` to [mirror entire websites](/blog/archivingsites/). We can adjust the command slightly so that it doesn't follow links and instead only downloads a single web page.

```bash
wget --convert-links \
     --adjust-extension \
     --no-clobber \
     --page-requisites \
     INSERT_URL_HERE
```

Here, `--page-requisites` pulls in the images, stylesheets, and scripts the page needs, `--convert-links` rewrites links so they work for local viewing, `--adjust-extension` saves HTML pages with an `.html` extension when they lack one, and `--no-clobber` avoids re-downloading files that already exist.

Now assume we have a folder full of downloaded websites. To view them, we can use any HTTP server. One of the easiest to set up temporarily is Python's built-in one.

To serve all the files in the current directory:

```bash
python -m http.server OPTIONAL_PORT_NUM
```

If you leave out the port number, it defaults to 8000 and prints:

```
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
```

A few nice side effects of using `wget` and `python`:

- Python's default web server shows a listing of the files in each directory. This makes it easier to browse around the web archive.
- The `wget` flags make it so that if you archive `https://brandonrozek.com/blog/personal-simple-web-archive/`, then all you need to access is `http://localhost:8000/brandonrozek.com/blog/personal-simple-web-archive`. In other words, it preserves URLs.

This approach isn't perfect: if a web page makes heavy use of JavaScript or server-side features, the saved copy will be incomplete. Still, for the majority of the wiki pages and blog posts I want to save for future reference, it works well.

My full script is below:

```bash
#!/bin/bash
# Note: pipefail is a bash feature, so use bash rather than plain sh

set -o errexit
set -o nounset
set -o pipefail

ARCHIVE_FOLDER="$HOME/webarchive"

show_usage() {
    echo "Usage: archivesite [URL]"
    exit 1
}

# Check argument count
if [ "$#" -ne 1 ]; then
    show_usage
fi

# Create the archive folder if it doesn't exist yet
mkdir -p "$ARCHIVE_FOLDER"
cd "$ARCHIVE_FOLDER"

wget --convert-links \
     --adjust-extension \
     --page-requisites \
     --no-verbose \
     "$1"

# Keep track of requested URLs
echo "$1" >> saved_urls.txt
```
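
To tie the two halves together, an end-to-end run might look something like the sketch below. This assumes the script above is saved as an executable named `archivesite` somewhere on your `PATH`, that you're on Python 3.7 or newer (for the `--directory` flag), and the URL is just a placeholder.

```bash
# Archive a single page (placeholder URL); the script writes into ~/webarchive
archivesite https://example.com/some/article/

# Serve the whole archive folder (--directory requires Python 3.7+;
# alternatively, cd ~/webarchive first and omit the flag)
python -m http.server 8000 --directory "$HOME/webarchive"

# The saved copy is then available at:
#   http://localhost:8000/example.com/some/article/
```

Because the local path mirrors the original host and URL path, browsing the archive feels much like browsing the live site.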