mirror of
https://github.com/Brandon-Rozek/website.git
synced 2024-11-09 10:40:34 -05:00
New post
This commit is contained in:
parent
76e3c51d8a
commit
6757692796
1 changed files with 75 additions and 0 deletions
75
content/blog/personal-simple-web-archive.md
Normal file
75
content/blog/personal-simple-web-archive.md
Normal file
|
@ -0,0 +1,75 @@
|
|||
---
|
||||
title: "Personal Web Archive and How I Archive Single Web pages"
|
||||
date: 2022-12-24T10:01:16-04:00
|
||||
draft: false
|
||||
tags: [ "Archive" ]
|
||||
---
|
||||
|
||||
The [Internet Archive](https://web.archive.org/) is great for providing a centralized database of snapshots of websites throughout time. What happens though when you want to have
|
||||
your own offline copy during times of little to no internet
|
||||
access? [Archivebox](https://archivebox.io/) is one solution
|
||||
to such problem. It behaves similarly to the Internet
|
||||
archive and also allows importing of RSS feeds to save local copies of blog posts. To install it, you can use `pipx`.
|
||||
```bash
|
||||
pipx install archivebox
|
||||
```
|
||||
|
||||
For the rest of this post, however, I want to talk about a simpler tool. A combination of `wget` and `python -m http.server`. In the past, I've used `wget` to [mirror entire websites](/blog/archivingsites/). We can adjust the command slightly so that it doesn't follow links and instead only looks at a single webpage.
|
||||
|
||||
```bash
|
||||
wget --convert-links \
|
||||
--adjust-extension \
|
||||
--no-clobber \
|
||||
--page-requisites \
|
||||
INSERT_URL_HERE
|
||||
```
|
||||
|
||||
Now assume we have a folder full of downloaded websites. To view them, we can use any HTTP server. One of the easiest to temporarily setup currently is Python's built in one.
|
||||
|
||||
To serve all the files in the current directory
|
||||
```bash
|
||||
python -m http.server OPTIONAL_PORT_NUM
|
||||
```
|
||||
If you leave the port number field empty, then this returns
|
||||
```
|
||||
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
|
||||
```
|
||||
|
||||
A few nice side-effects of using `wget` and `python`.
|
||||
- Python's default webserver shows a list of files in the directory. This can make it easier to browse around the web archive.
|
||||
- The `wget` flags make it so that if you want to archive `https://brandonrozek.com/blog/personal-simple-web-archive/` then all you need to access is `http://localhost:8000/brandonrozek.com/blog/personal-simple-web-archive`. In other words, it preserves URLs.
|
||||
|
||||
|
||||
Now this approach isn't perfect, if a webpage makes heavy use of javascript or server side features then it'll be incomplete. Though for the majority of the wiki pages or blog posts I want to save for future reference, this approach works well.
|
||||
|
||||
My full script is below:
|
||||
```bash
|
||||
#!/bin/sh
|
||||
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
ARCHIVE_FOLDER="$HOME/webarchive"
|
||||
|
||||
show_usage() {
|
||||
echo "Usage: archivesite [URL]"
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Check argument count
|
||||
if [ "$#" -ne 1 ]; then
|
||||
show_usage
|
||||
fi
|
||||
|
||||
cd "$ARCHIVE_FOLDER"
|
||||
|
||||
wget --convert-links \
|
||||
--adjust-extension \
|
||||
--page-requisites \
|
||||
--no-verbose \
|
||||
"$1"
|
||||
|
||||
# Keep track of requested URLs
|
||||
echo "$1" >> saved_urls.txt
|
||||
```
|
Loading…
Reference in a new issue