CRIU (Checkpoint/Restore In Userspace) on Linux lets users checkpoint and restore a live user-space process. The process state is frozen in time and stored as a set of image files, and these images can later be used to restore the process.
Some interesting use cases that CRIU can support are:
- Process persistence across server reboots: even after you have rebooted the server, the image files can be used to restore the process.
- vMotion-like live migration for processes: the image files can be copied to another server and the process can be restored there.
In this post, I will explore using CRIU to checkpoint and restore a simple webserver process. We will also explore migrating the process across servers.
Let's start by installing the CRIU package. I am using Ubuntu Vivid, and the package is available in the Ubuntu repository. We can use the following commands to check for and install CRIU:
apt-cache search criu
apt-get install criu
The webserver process
For this experiment, I wanted a simple server process with an open network port and some internal state information to verify if CRIU can successfully restore the network connectivity as well as the state of the process.
I will start a simple webserver using this Python script. The webserver keeps a count of the requests from clients and maintains an in-memory list of all the previous requests it has served. Create a new directory and change to it, then use wget to download the script, as shown below.
chandan@chandan-VirtualBox:~$ mkdir criu
chandan@chandan-VirtualBox:~$ cd criu
chandan@chandan-VirtualBox:~/criu$ wget https://bitbucket.org/api/2.0/snippets/xchandan/jge8x/9464e8e341c4c845aebf3a21e9d20e472baa4c5e/files/server.py
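If the download is not available, a server with the same two properties (an open port and in-memory state) can be approximated as below. This is a minimal Python 3 sketch and not the author's server.py; the names CountingHandler and render, and the plain-text page layout, are assumptions made for illustration.

```python
# Hypothetical stand-in for server.py: counts requests and remembers them
# in memory only, so the state survives solely through a checkpoint.
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer


def render(count, history):
    """Format the current request number followed by every request served."""
    lines = ["Request No: %d" % count]
    lines += ["%d %s" % (number, path) for number, path in history]
    return "\n".join(lines) + "\n"


class CountingHandler(BaseHTTPRequestHandler):
    # Class attributes are shared by all requests handled in this process.
    request_count = 0
    history = []

    def do_GET(self):
        cls = type(self)
        cls.request_count += 1
        cls.history.append((cls.request_count, self.path))
        body = render(cls.request_count, cls.history).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# Start serving only when a port is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    HTTPServer(("", int(sys.argv[1])), CountingHandler).serve_forever()
```

Because the counter and the history live only in process memory, a restored process that continues counting is evidence that CRIU captured the state rather than the script reloading it from disk.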
Now start the webserver from this directory by executing the following command:
chandan@chandan-VirtualBox:~/criu$ python server.py 8181
Verify that the webserver is running by pointing your browser to http://localhost:8181, then refresh the page a few times to build up the application's internal state. Every refresh should increase the request number.
You should see output similar to this.
Keep this process running and open another terminal. Use the ps command to find the process ID (PID) of the webserver:
chandan@chandan-VirtualBox:~/criu/dump_dir$ ps aux|grep server.py
chandan 32601 0.0 0.1 40696 12760 pts/18 S+ 20:40 0:00 python server.py 8181
chandan 32717 0.0 0.0 9492 2252 pts/1 S+ 20:47 0:00 grep --color=auto server.py
We now have the PID of the webserver, which is 32601. The second line is the grep command itself and can be ignored.
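When scripting this step, the PID can be picked out of the ps aux output instead of being read by eye. The following is a small sketch; the function name find_pids is hypothetical, and it assumes the usual 11-column ps aux layout seen in the output above.

```python
def find_pids(ps_output, needle):
    """Return the PIDs of `ps aux` lines whose command contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; the last one is the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or malformed line
        command = fields[10]
        # Skip the grep process that was used to filter the listing.
        if needle in command and not command.startswith("grep "):
            pids.append(int(fields[1]))
    return pids


sample = """\
chandan 32601 0.0 0.1 40696 12760 pts/18 S+ 20:40 0:00 python server.py 8181
chandan 32717 0.0 0.0 9492 2252 pts/1 S+ 20:47 0:00 grep --color=auto server.py
"""
print(find_pids(sample, "server.py"))  # prints [32601]
```

On most Linux systems, pgrep -f server.py gives the same answer directly from the shell.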
Checkpoint the webserver process
Checkpointing the webserver will freeze its process state and dump this state information into a directory. Make a new directory (dump_dir in this post) and change into it.
Now execute the “criu dump -t <process id> --shell-job” command to checkpoint the process. The “--shell-job” flag is required if you want to use CRIU with processes started directly from a shell.
chandan@chandan-VirtualBox:~/criu/dump_dir$ sudo criu dump -t 32601 --shell-job
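If the checkpoint is driven from a script, it can help to build the command line in one place. The sketch below only assembles the argument list and does not run CRIU. The helper name criu_dump_cmd is hypothetical, and the -D (images directory) and --leave-running options are CRIU options that this post does not otherwise use, so treat them as additions to check against criu --help on your system.

```python
import shlex


def criu_dump_cmd(pid, images_dir=None, shell_job=True, leave_running=False):
    """Build the argv for `criu dump`, mirroring the command used in the post."""
    cmd = ["sudo", "criu", "dump", "-t", str(pid)]
    if images_dir:
        # Not used in the post: write images here instead of the current directory.
        cmd += ["-D", images_dir]
    if shell_job:
        cmd.append("--shell-job")
    if leave_running:
        # Not used in the post: keep the process alive after the dump.
        cmd.append("--leave-running")
    return cmd


cmd = criu_dump_cmd(32601)
print(" ".join(shlex.quote(part) for part in cmd))
# prints: sudo criu dump -t 32601 --shell-job
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where CRIU is installed.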
When the command exits, the directory will contain many new files, which store the state of the webserver process in the form of image files.
By default, the dump command also kills the webserver process; you can verify this with the ps and grep commands. You can also verify it by trying to open the webserver address in your browser, which should fail.
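The browser check can also be done from a script by testing whether anything still accepts connections on the webserver port. This is a minimal sketch; the function name port_is_open is hypothetical, and it only tests TCP reachability, not the page content.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Expected to be False right after the dump and True after a restore.
    print(port_is_open("localhost", 8181))
```

The same check is useful after a restore, where the port should accept connections again.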
Restoring the process
To restore the process, go to the directory where the image files for the process are stored and execute the following command:
chandan@chandan-VirtualBox:~/criu/dump_dir$ sudo criu restore --shell-job
This command will not return, as the terminal now hosts the restored webserver process. Keep this process running and verify that you can open the webserver URL.
You should see output similar to this. The request count should continue from where it was when the checkpoint was made. In this case we continue from Request No: 15, and all the state information is successfully restored, as shown in the screenshot.
Restoring after machine reboot
You can now reboot your machine and try to restore the webserver process again. You should be able to restore the process, and it should again continue from the checkpointed request number.
Migrating webserver to another machine
To migrate the webserver process, the target machine needs an exact match of the process's runtime environment. This means that the working directory and any resources the process uses, such as files and ports, should be available on the target system. This is why process migration with CRIU makes more sense in a container-based environment, where the environment of the process can be closely controlled.
To start the migration, first copy the image files to the target machine:
chandan@chandan-VirtualBox:~/criu$ scp -r dump_dir/ email@example.com:
Make sure that the environment for the process is present on the target machine. In my case, I had to create the current working directory for the webserver after CRIU failed with an error message.
chandan@chandan-ubuntu15:~/dump_dir$ sudo criu restore --shell-job
32601: Error (files-reg.c:1024): Can't open file home/chandan/criu on restore: No such file or directory
32601: Error (files-reg.c:967): Can't open file home/chandan/criu: No such file or directory
32601: Error (files.c:1070): Can't open cwd
Error (cr-restore.c:1185): 32601 exited, status=1
Error (cr-restore.c:1838): Restoring FAILED.
chandan@chandan-ubuntu15:~/dump_dir$
chandan@chandan-ubuntu15:~/dump_dir$ mkdir ~/criu
chandan@chandan-ubuntu15:~/dump_dir$ sudo criu restore --shell-job
192.168.90.2 - - [10/Aug/2015 01:26:47] "GET / HTTP/1.1" 200 -
192.168.90.2 - - [10/Aug/2015 01:26:47] code 404, message File not found
192.168.90.2 - - [10/Aug/2015 01:26:47] "GET /favicon.ico HTTP/1.1" 404 -
192.168.90.2 - - [10/Aug/2015 01:26:47] code 404, message File not found
192.168.90.2 - - [10/Aug/2015 01:26:47] "GET /favicon.ico HTTP/1.1" 404 -
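When a restore fails like this, the missing paths can be pulled out of the error output so they can be created before retrying. The sketch below is based only on the message format seen in the log above, which may differ between CRIU versions. The name missing_paths is hypothetical, and prefixing "/" assumes CRIU printed the path relative to the filesystem root, as it did here.

```python
import re

# Matches both variants of the message seen in the restore log above.
MISSING = re.compile(
    r"Can't open file (\S+?)(?: on restore)?: No such file or directory"
)


def missing_paths(log_text):
    """Return the unique missing paths reported in CRIU restore errors."""
    found = []
    for match in MISSING.finditer(log_text):
        # Assumption: the logged path is relative to "/", as in the post.
        path = "/" + match.group(1).lstrip("/")
        if path not in found:
            found.append(path)
    return found


sample_log = """\
32601: Error (files-reg.c:1024): Can't open file home/chandan/criu on restore: No such file or directory
32601: Error (files-reg.c:967): Can't open file home/chandan/criu: No such file or directory
32601: Error (files.c:1070): Can't open cwd
"""
print(missing_paths(sample_log))  # prints ['/home/chandan/criu']
```

Whether a reported path should be a directory or a file still has to be decided by hand; in this post it was the webserver's working directory.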
Here is a screenshot of the webserver restored on a remote machine, side by side with the one on the local machine. Looking at the state info for Request No 15, you can see that both processes start with the same internal state and then continue on different paths.
In this post we saw how to checkpoint and restore a Linux application. We verified that the application could be restarted on a different server with its internal state restored.
In a future post I will explore using CRIU with containers to provide container migration.
I found this interesting post describing live migration of LXD/LXC containers, and a demo video of the live migration of a container running the game Doom. Here is one more post about the live migration of a Docker container running Quake.