Git for smaller deployment artifacts - Hypothesis - Knowledge Base - Mark Eschbach (Software Developer && System Analyst)

Hypothesis: Using Git for deployment will consume less bandwidth

For my static site I generate an archive of artifacts, including static pages, images, and other resources. This artifact is then uploaded and unpacked on target machines within a cluster. Often the changes between the previous version and the current version are insignificant, perhaps a few kilobytes of text changes. Other times entire parts of the target directory structure may be rewritten. The archive is produced by running tar against the generated artifacts, then piping the tarred contents through gzip. This produces an artifact in a reasonable amount of time, and it includes a full copy of the website. The entire archive is approximately 472 kilobytes. For clarity I will call this the Colonization method, named because it moves into the remote server, displacing the previous inhabitants.
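The archive step described above can be sketched in Python with the standard `tarfile` module, which produces a gzip-compressed tar file much like piping tar through gzip. This is only an illustration: the function name `build_archive` and both paths are hypothetical, and the original build uses the tar and gzip command line tools rather than this code.

```python
import tarfile
from pathlib import Path


def build_archive(site_dir: str, archive_path: str) -> int:
    """Pack the contents of site_dir into a gzip-compressed tar archive.

    Returns the archive size in bytes, the figure that dominates the
    upload cost of the Colonization method.
    """
    with tarfile.open(archive_path, "w:gz") as archive:
        # Add each top-level entry under its own name so the archive
        # unpacks directly into the target directory.
        for entry in sorted(Path(site_dir).iterdir()):
            archive.add(str(entry), arcname=entry.name)
    return Path(archive_path).stat().st_size
```

The archive should be written outside `site_dir`, otherwise it would try to include itself.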

The deployment methodology I plan to use with Git is slightly different. I plan to translate the website from the template format into the static production files on the build server. For each successful build the resulting static files will be committed to a deployment repository. For each stage we wish to deploy to (alpha, beta, production), the build system will log into the remote service and push into the remote repository. To roll back we can log into the remote servers within the cluster and issue a manual rollback. I hypothesize this will reduce the overall bandwidth per deployment due to the compressed nature of the diffs sent by Git. I will call this the Federation method.
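A minimal sketch of one Federation deployment, driven from Python through the Git command line. Several details here are assumptions that the text does not specify: each stage is assumed to be configured as a Git remote with the same name (`alpha`, `beta`, `production`), the remote branch is assumed to be `master`, and the function names are hypothetical.

```python
import subprocess


def federation_commands(stage: str, message: str, branch: str = "master") -> list:
    """Return the Git commands for one deployment, as argument lists.

    `stage` is assumed to be the name of a configured Git remote.
    """
    return [
        ["git", "add", "--all"],                 # stage the freshly built files
        ["git", "commit", "--message", message],  # record this build
        ["git", "push", stage, f"HEAD:refs/heads/{branch}"],  # send only new objects
    ]


def deploy(repo_dir: str, stage: str, message: str) -> None:
    """Run the deployment commands inside the deployment repository."""
    for command in federation_commands(stage, message, branch="master"):
        # check=True stops at the first failing step, for example when
        # the build produced no changes and there is nothing to commit.
        subprocess.run(command, cwd=repo_dir, check=True)
```

Keeping the command list separate from the runner makes it easy to log or inspect what a deployment would do before anything is pushed.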

Colonization Method Data

The deployment scripts at the time of writing use three SSH connections. The first uploads a compressed archive of the static content. The second uploads a script that deploys the files within the archive to the production location. The third executes the deployment script. To gather the data I will measure the bandwidth across the SSH connections. The result should be multiplied by the number of hosts this exchange occurs with.

To capture the bandwidth usage I will use Network Calipers, a tool which I built with Node.js for this purpose. It counts the bytes for each connection, along with a total number of bytes for the lifetime of the application. SSH will be reconfigured to connect to localhost on an alternative port, which will be reverse proxied to the actual host.
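The measuring approach, a local reverse proxy that counts the bytes it relays, can be sketched with Python's `asyncio`. This is not the Network Calipers implementation, which is a Node.js tool whose source is not shown here; the names `ByteCounter`, `relay`, and `start_counting_proxy` are hypothetical, and the sketch keeps only lifetime totals rather than per-connection counts.

```python
import asyncio


class ByteCounter:
    """Lifetime totals of bytes relayed in each direction."""

    def __init__(self) -> None:
        self.sent = 0      # client -> target host
        self.received = 0  # target host -> client


async def relay(reader, writer, on_bytes) -> None:
    """Copy data from reader to writer until EOF, reporting chunk sizes."""
    try:
        while True:
            chunk = await reader.read(65536)
            if not chunk:
                break
            on_bytes(len(chunk))
            writer.write(chunk)
            await writer.drain()
    finally:
        writer.close()


async def start_counting_proxy(listen_port, target_host, target_port, counter):
    """Listen on localhost and forward every connection to the target."""

    async def handle(client_reader, client_writer):
        target_reader, target_writer = await asyncio.open_connection(
            target_host, target_port
        )

        def add_sent(size):
            counter.sent += size

        def add_received(size):
            counter.received += size

        # Relay both directions at once; each side stops at EOF.
        await asyncio.gather(
            relay(client_reader, target_writer, add_sent),
            relay(target_reader, client_writer, add_received),
            return_exceptions=True,
        )

    return await asyncio.start_server(handle, "127.0.0.1", listen_port)
```

With such a proxy listening on, say, port 2222 and forwarding to port 22 of the real host, an SSH client pointed at localhost port 2222 would have its traffic tallied in `counter`. The counts cover the bytes carried inside the TCP streams, not the TCP/IP packet headers.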

All values are in bytes.
Deployment #	Content Upload Sent	Content Upload Received	Script Upload Sent	Script Upload Received	Script Execution Sent	Script Execution Received	Total Sent	Total Received
1	488815	2263	2815	2071	2479	7735	494109	12069
2	489215	2263	2815	2071	2479	6519	494509	10853
3	489199	2263	2815	2071	2479	7815	494493	12149

Federation Method Data Gathering

The SSH connections to the remote hosts for script control will be monitored, along with the connection from the remote host pulling the code. Git will automatically update the working copy of the remote code base for us, so no additional connections are required. According to my hypothesis, the bandwidth consumed should be proportional to the size of the changes. The size on disk could possibly be several times greater than the source.

All values are in bytes. Deployment 1 is split into steps 1.a through 1.c: the `git init`, the `post-receive` copy, and the initial `git push`.
Deployment #	Git Sent	Git Received
1.a: `git init`	2591	2119
1.b: `post-receive`	2847	2071
1.c: `git push`	477423	5175
1: Total	482861	9365
2	4191	2295
3	4031	2663
4	3599	3271

Conclusion

The Federation method is more efficient when there are small, incremental changes. The primary overhead in the Federation method is the establishment of an SSH connection, which appears to consume approximately 2 kilobytes of bandwidth. After the initial push, each Federation deployment sent roughly 4 kilobytes, while every Colonization deployment sent roughly 494 kilobytes regardless of how little had changed, so the savings are considerable.
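The comparison can be made concrete with a short calculation over the totals from the two tables. The numbers are copied from the measurements above; the variable and function names are only for illustration.

```python
# Per-deployment totals in bytes, copied from the two data tables.
colonization_sent = [494109, 494509, 494493]
colonization_received = [12069, 10853, 12149]

# Federation deployment 1 is the initial setup and push; deployments
# 2 through 4 are the incremental pushes.
federation_sent = [482861, 4191, 4031, 3599]
federation_received = [9365, 2295, 2663, 3271]


def mean_total(sent, received):
    """Average bytes per deployment, counting both directions."""
    totals = [s + r for s, r in zip(sent, received)]
    return sum(totals) / len(totals)


colonization_mean = mean_total(colonization_sent, colonization_received)
incremental_mean = mean_total(federation_sent[1:], federation_received[1:])
initial_total = federation_sent[0] + federation_received[0]

print(f"Colonization, mean per deployment: {colonization_mean:.0f} bytes")
print(f"Federation, initial deployment:    {initial_total} bytes")
print(f"Federation, mean incremental:      {incremental_mean:.0f} bytes")
print(f"Colonization / incremental ratio:  {colonization_mean / incremental_mean:.1f}")
```

The initial Federation deployment costs about as much as a Colonization deployment, so the savings only appear from the second deployment onward.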

Federation Method Application

This method may be applicable to any source-based production system, including static HTML, PHP, JavaScript, etc.

Deployment technique for removing the second connection