I saw this post and I was curious what was out there.
https://neuromatch.social/@jonny/113444325077647843
I'd like to put my lab servers to work archiving US federal data that's likely to get pulled - climate and biomed data seem most likely. The most obvious strategy to me seems like setting up mirror torrents on academictorrents. Anyone compiling a list of at-risk data yet?
I have a script that archives to:
- Internet Archive: Digital Library of Free & Borrowable Texts, Movies, Music & Wayback Machine
- Webpage archive
- Ghostarchive, a website archive
- Self-hosted https://archivebox.io/
I used to depend solely on archive.org, but after the recent attacks I expanded my options.
Script: https://gist.github.com/YasserKa/9a02bc50e75e7239f6f0c8f04fe4cfb1
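For readers who only want the general idea, here is a minimal sketch of submitting one page to the Wayback Machine. It is not the linked script. It assumes the public "Save Page Now" URL form (`https://web.archive.org/save/<url>`); the function names and the User-Agent string are made up for illustration. The other services in the list are left out because I'm not certain of their submission endpoints.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: the Wayback Machine's public "Save Page Now" URL form.
WAYBACK_SAVE = "https://web.archive.org/save/"


def wayback_save_url(page_url: str) -> str:
    """Build the Save Page Now URL for page_url (hypothetical helper)."""
    # Keep URL punctuation intact and only escape unsafe characters such as spaces.
    return WAYBACK_SAVE + quote(page_url, safe=":/?&=%")


def submit_to_wayback(page_url: str, timeout: float = 60.0) -> int:
    """Request a snapshot and return the HTTP status code (hypothetical helper)."""
    request = Request(
        wayback_save_url(page_url),
        headers={"User-Agent": "archive-sketch/0.1"},  # illustrative value
    )
    with urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Performs a real network request when run directly.
    print(submit_to_wayback("https://example.com/"))
```

Looping `submit_to_wayback` over a list of URLs, with a pause between requests, would probably be enough for small batches.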
EDIT: Added script. Note that the script doesn’t include archiving to ArchiveBox, since its API isn’t available in a stable version yet. You can add a function depending on your setup. Personally, I rely on Caddy and Docker, so I use a Caddy module [1] to execute commands, with this in my Caddyfile:

    route /add {
        @params query url=*
        exec docker exec --user=archivebox archivebox archivebox add {http.request.uri.query.url} {
            timeout 0
        }
    }
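A route like this could be called from a script roughly as follows. This is only a sketch: the base address `http://localhost:8080` and the function names are assumptions for illustration, and the only thing taken from the Caddyfile is that the page to archive is passed in the `url` query parameter of `/add`.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def archivebox_add_url(base: str, page_url: str) -> str:
    """Build the /add request URL for the Caddy route (hypothetical helper)."""
    # urlencode percent-encodes the page URL so it survives as one query value.
    return f"{base.rstrip('/')}/add?{urlencode({'url': page_url})}"


def add_to_archivebox(base: str, page_url: str, timeout: float = 30.0) -> int:
    """Call the route and return the HTTP status code (hypothetical helper)."""
    with urlopen(archivebox_add_url(base, page_url), timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Assumed address of the Caddy server; adjust to your own setup.
    print(add_to_archivebox("http://localhost:8080", "https://example.com/"))
```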
One option that I’ve heard of in the past
ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view websites offline.