Sunday, December 29, 2024

Install and use ArchiveBox self-hosted internet archiving

ArchiveBox is a powerful, self-hosted internet archiving solution written in Python. It lets you collect, save, and view sites you want to keep offline. ArchiveBox can be set up as a command-line tool, a desktop app, or a web app. It is cross-platform and runs on Linux, macOS, and Windows.

Some of ArchiveBox's key features:

  • You can feed it URLs one at a time, or schedule regular imports from your browser's bookmarks, history, feeds, etc.
  • It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, etc.

In this guide, we will walk through how to install, configure, and use the ArchiveBox self-hosted internet archiving solution.

Install ArchiveBox self-hosted internet archiving solution

There are several ways to install ArchiveBox. This guide covers two of them:

  • Using PIP3
  • Using Docker

#1. Install ArchiveBox using Pip3

For this method, ensure that you have Python 3.7 or later and Node.js 12 or later installed on your system. Then install PIP on your system.

##On Debian/Ubuntu
sudo apt install python3-pip

##On RHEL/CentOS/Rocky Linux 8
sudo yum install epel-release 
sudo yum install python3-pip

##On openSUSE
sudo zypper install python3-pip

##On Arch Linux
sudo pacman -S python-pip
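
Before going further, it can help to confirm that the Python and Node versions mentioned above are actually in place. The following is a minimal sketch, not part of ArchiveBox itself: the helper names version_ge and check_tool are made up for this example, and it assumes a sort command that supports the -V (version sort) option.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Assumes a `sort` that supports -V (version sort), e.g. GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Prints a one-line verdict for a tool: name, minimum version, detected version.
check_tool() {
    if [ -z "$3" ]; then
        echo "$1: not found (need >= $2)"
    elif version_ge "$3" "$2"; then
        echo "$1: $3 OK"
    else
        echo "$1: $3 is too old (need >= $2)"
    fi
}

# `python3 --version` prints "Python 3.x.y"; `node --version` prints "vN.x.y".
py_version="$(python3 --version 2>/dev/null | awk '{print $2}')"
node_version="$(node --version 2>/dev/null | sed 's/^v//')"

check_tool "Python" "3.7" "$py_version"
check_tool "Node" "12" "$node_version"
```

If either line reports "not found" or "too old", install or upgrade that tool before installing ArchiveBox.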

With PIP3 installed, you can install ArchiveBox as below.

sudo pip3 install archivebox

Initialize ArchiveBox as below.

mkdir ~/archivebox && cd ~/archivebox
archivebox init --setup
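
As far as we can tell, archivebox init expects either an empty directory or one that already holds a collection, so it is worth checking the target directory first. Here is a rough sketch of such a check. The function names dir_is_empty and check_data_dir are hypothetical, and treating the presence of index.sqlite3 as the sign of an existing collection is an assumption made for this example.

```shell
# Succeeds when the path is a directory with no entries (dotfiles included).
dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Prints whether a directory looks safe to initialize as a collection.
check_data_dir() {
    if [ ! -e "$1" ]; then
        echo "missing"      # create it first with mkdir
    elif dir_is_empty "$1"; then
        echo "empty"        # fine for a fresh collection
    elif [ -f "$1/index.sqlite3" ]; then
        echo "existing"     # looks like an existing collection
    else
        echo "unrelated"    # has other files; choose another directory
    fi
}

check_data_dir "$HOME/archivebox"
```

A result of "empty" or "existing" suggests the directory is fine to use. With "unrelated", picking a different directory is the safer choice.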

Start the ArchiveBox webserver.

archivebox server 0.0.0.0:8000

This method often runs into dependency problems, so it is not the recommended way to install ArchiveBox.

#2. Install ArchiveBox using Docker-Compose (Recommended)

Begin by installing Docker on your Linux system, following the official Docker installation guide for your distribution.

Start and enable docker

sudo systemctl enable docker
sudo systemctl start docker

Install docker-compose.

curl -s https://api.github.com/repos/docker/compose/releases/latest | grep browser_download_url  | grep docker-compose-linux-x86_64 | cut -d '"' -f 4 | wget -qi -
chmod +x docker-compose-linux-x86_64
sudo mv docker-compose-linux-x86_64 /usr/local/bin/docker-compose
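
The download pipeline above works by pulling the browser_download_url lines out of the GitHub API response and keeping the one for the Linux x86_64 binary. The sketch below shows that extraction step on its own, using a made-up response so it runs offline. The function name extract_asset_url and the example.invalid URLs are hypothetical, and it assumes the API returns one JSON key per line. Unlike a plain name match, it anchors the asset name at the end of the URL, so companion files such as checksums are not picked up.

```shell
# Reads GitHub release JSON on stdin and prints the download URL of one asset.
# Anchoring the name at the end of the URL skips companion files such as
# "<asset>.sha256".
extract_asset_url() {
    grep '"browser_download_url"' | cut -d '"' -f 4 | grep "/$1\$"
}

# Hypothetical, heavily trimmed API response used only for illustration.
sample_json='{
  "assets": [
    { "browser_download_url": "https://example.invalid/download/docker-compose-linux-x86_64" },
    { "browser_download_url": "https://example.invalid/download/docker-compose-linux-x86_64.sha256" }
  ]
}'

printf '%s\n' "$sample_json" | extract_asset_url docker-compose-linux-x86_64
```

Against the real API, the same function could sit between the curl and wget steps in place of the two grep calls and the cut.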

Add your user to the docker group.

sudo usermod -aG docker $USER
newgrp docker

Download the docker-compose YAML file

curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml'

Initialize the ArchiveBox collection.

docker-compose run archivebox init --setup

When prompted, create the admin user for the web UI as below.

[√] Done. A new ArchiveBox collection was initialized (0 links).

[+] Creating new admin user for the Web UI...
Username (leave blank to use 'archivebox'): admin 
Email address: [email protected]
Password: Enter your Password
Password (again): Enter the Password again

Start the container.

$ docker-compose up

The server is now up and running.

[+] Running 1/1
 ⠿ Container thor-archivebox-1  Created                                    0.3s
Attaching to thor-archivebox-1
thor-archivebox-1  | [i] [2021-12-20 09:32:05] ArchiveBox v0.6.2: archivebox server --quick-init 0.0.0.0:8000
thor-archivebox-1  |     > /data
thor-archivebox-1  | 
thor-archivebox-1  | [^] Verifying and updating existing ArchiveBox collection to v0.6.2...
.......

Access the web UI in your browser at http://localhost:8000 (or http://IP_Address:8000 from another machine).

Use ArchiveBox self-hosted internet archiving solution

Once installed, you can start using ArchiveBox to save offline copies of the sites you want to keep.

You can add a URL to save as below.

$ archivebox add 'https://example.com'                                    

Using docker-compose.

$ docker-compose run archivebox add 'https://example.com'

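
To add several URLs in one go, one option is to keep them in a text file and feed the list to archivebox add on standard input. The following is a minimal sketch under a few assumptions: the helper clean_url_list and the temporary list file are invented for this example, and the script is meant to be run from inside the collection directory. It skips blank lines and comments and drops duplicate URLs before handing the list over.

```shell
# Prints one URL per line from a list file, skipping blank lines and lines
# starting with "#", and dropping duplicates while keeping the original order.
clean_url_list() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | awk '!seen[$0]++'
}

# Hypothetical list file used only for this illustration.
url_file="$(mktemp)"
cat > "$url_file" <<'EOF'
# sites to archive
https://example.com

https://example.com/about
https://example.com
EOF

if command -v archivebox >/dev/null 2>&1; then
    # ArchiveBox reads URLs from standard input when no URL argument is given.
    clean_url_list "$url_file" | archivebox add
else
    # Without ArchiveBox installed, just show what would be submitted.
    clean_url_list "$url_file"
fi
```

In practice you would point clean_url_list at your own file instead of the temporary one created here.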

To schedule URLs to be imported automatically, use the command:

$ archivebox schedule --every=day --depth=1 https://example.com/rss.xml 

On Docker-compose:

$ docker-compose run archivebox schedule --every=day --depth=1 https://example.com/rss.xml 

View Archived pages.

On ArchiveBox, you can view the saved pages using the CLI or the web as below.

Using the CLI, view archived pages:

$ archivebox list 'https://example.com'

Accessing and Using ArchiveBox Web UI

From the web page, view the archived pages using the URL http://IP_Address:8000

Add more pages and manage ArchiveBox by clicking on the + icon. Provide your login credentials to proceed.

On this ArchiveBox admin dashboard, you can manage users, accounts, snapshots, etc.

Add a URL by clicking on Add + as shown above. Provide the list of URLs to archive.

Scroll to the bottom of the page and add the URLs. The URLs will be added as below.

View the list of added URLs by navigating to the home page as shown.

You can view what is archived by clicking on the snapshot.

That is it!

I hope you enjoyed this guide on how to install and use ArchiveBox self-hosted internet archiving solution.
