Containerd meets IPFS. Peer-to-peer distribution of content blobs.
Converting a manifest from DockerHub to a p2p manifest:
# Term 1: Start an IPFS daemon
$ make ipfs

# Term 2: Start a rootless containerd backed by ipcs.
$ make containerd

# Term 3: Convert alpine to a p2p manifest
$ make convert
2019/06/04 13:54:40 Resolved "docker.io/library/alpine:latest" as "docker.io/library/alpine:latest@sha256:769fddc7cc2f0a1c35abb2f91432e8beecf83916c421420e6a6da9f8975464b6"
2019/06/04 13:54:40 Original Manifest  sha256:769fddc7cc2f0a1c35abb2f91432e8beecf83916c421420e6a6da9f8975464b6:
// ...
2019/06/04 13:54:41 Converted Manifest  sha256:9181f3c247af3cea545adb1b769639ddb391595cce22089824702fa22a7e8cbb:
// ...
2019/06/04 13:54:41 Successfully pulled image "localhost:5000/library/alpine:p2p"
Converting two manifests from DockerHub to p2p manifests, and then comparing the number of shared IPLD nodes (layers chunked into 256 KiB blocks):
# Term 1: Start an IPFS daemon
$ make ipfs

# Term 2: Start a rootless containerd backed by ipcs.
$ make containerd

# Term 3: Convert ubuntu:bionic and ubuntu:xenial into p2p manifests, then bucket IPLD nodes into nodes unique to each image, and nodes in the intersection.
$ make compare
// ...
2019/06/04 13:51:33 Comparing manifest blocks for "docker.io/library/ubuntu:xenial" ("sha256:8d382cbbe5aea68d0ed47e18a81d9711ab884bcb6e54de680dc82aaa1b6577b8")
2019/06/04 13:51:34 Comparing manifest blocks for "docker.io/titusoss/ubuntu:latest" ("sha256:cfdf8c2f3d5a16dc4c4bbac4c01ee5050298db30cea31088f052798d02114958")
2019/06/04 13:51:34 Found 322 blocks
docker.io/library/ubuntu:xenial: 4503
docker.io/library/ubuntu:xenial ∩ docker.io/titusoss/ubuntu:latest: 87550251
docker.io/titusoss/ubuntu:latest: 76117824
// 87550251 shared bytes in IPLD nodes
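The bucketing reported by make compare can be pictured as a set comparison over block identifiers. The following is a minimal sketch, assuming each image has already been reduced to a map from block ID to block size in bytes. The Buckets type, the bucket function, and the sample IDs are hypothetical and are not taken from the ipcs source.

```go
package main

import "fmt"

// Buckets holds byte totals for blocks unique to each image and for blocks shared by both.
type Buckets struct {
	OnlyA, OnlyB, Shared int64
}

// bucket compares two images, each given as a map from block ID to block size in bytes.
// Assumption: a block ID uniquely identifies its content, as a content hash would.
func bucket(a, b map[string]int64) Buckets {
	var out Buckets
	for id, size := range a {
		if _, ok := b[id]; ok {
			out.Shared += size
		} else {
			out.OnlyA += size
		}
	}
	for id, size := range b {
		if _, ok := a[id]; !ok {
			out.OnlyB += size
		}
	}
	return out
}

func main() {
	a := map[string]int64{"blk1": 100, "blk2": 200}
	b := map[string]int64{"blk2": 200, "blk3": 50}
	fmt.Printf("%+v\n", bucket(a, b)) // {OnlyA:100 OnlyB:50 Shared:200}
}
```

Summing bytes rather than counting blocks matches the shape of the output above, where each bucket is reported as a byte total.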
IPFS-backed container image distribution is not new. Here is a non-exhaustive list of in-the-wild implementations:
P2P container image distribution is also implemented with different P2P networks:
The previous IPFS implementations all use the Docker Registry HTTP API V2 for distribution. However, the connection between the containerd instance pulling the image and the registry is not peer-to-peer, and if the registry were run as a sidecar, the content would be stored twice on the local system. Instead, I chose to implement it as a containerd content plugin for the following reasons:
- Containerd natively uses IPFS as a content.Store, no duplication.
- Allow p2p and non-p2p manifests to live together.
- Potentially do file-granularity chunking by introducing new layer mediatype.
- Fulfilling the content.Store interface will allow using ipcs to also back the buildkit cache.
IPFS imposes a 4 MiB limit on blocks because it may run on a public network with adversarial peers. Since it cannot verify a block's hash until all of that block's content has arrived, an attacker could send gibberish to flood connections and consume bandwidth. Chunking data into smaller blocks also aids deduplication.
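To illustrate why smaller blocks help deduplication, here is a minimal sketch that splits two blobs into fixed-size chunks and counts the chunks they share by SHA-256 hash. The function names and the tiny chunk sizes are chosen for illustration only. This is not how IPFS builds or addresses its blocks, and it assumes a positive chunk size.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// chunkHashes splits data into fixed-size chunks and returns the SHA-256 hash of each chunk.
// The size argument must be positive.
func chunkHashes(data []byte, size int) [][32]byte {
	var hashes [][32]byte
	for start := 0; start < len(data); start += size {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		hashes = append(hashes, sha256.Sum256(data[start:end]))
	}
	return hashes
}

// sharedChunks counts the chunks of b whose hash also appears among the chunks of a.
func sharedChunks(a, b []byte, size int) int {
	seen := make(map[[32]byte]bool)
	for _, h := range chunkHashes(a, size) {
		seen[h] = true
	}
	n := 0
	for _, h := range chunkHashes(b, size) {
		if seen[h] {
			n++
		}
	}
	return n
}

func main() {
	// Two 16-byte blobs that differ only in their last 4 bytes.
	a := []byte("AAAABBBBCCCCDDDD")
	b := []byte("AAAABBBBCCCCEEEE")
	fmt.Println(sharedChunks(a, b, 16)) // one chunk per blob: 0 shared
	fmt.Println(sharedChunks(a, b, 4))  // four chunks per blob: 3 shared
}
```

Fixed-size chunking only finds shared content that happens to sit at the same chunk boundaries, so an insertion near the start of a blob shifts every later chunk and removes the overlap.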
IPCS implements containerd's content.Store interface and can be built as a Go plugin to override containerd's default local store. A converter implementation is also provided that converts a regular OCI image manifest into a manifest where every descriptor is replaced with the descriptor of the root DAG node added to IPFS. The root node is the Merkle root of the layer's 256 KiB chunks.
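The descriptor replacement can be sketched as follows. This is a simplified illustration under stated assumptions: Descriptor is a reduced stand-in for an OCI descriptor, merkleRoot builds a two-level hash tree with SHA-256, and convert keeps the media type and size unchanged. Real IPFS DAG nodes and identifiers use a different encoding, and none of these names come from the ipcs source.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Descriptor is a simplified stand-in for an OCI content descriptor.
type Descriptor struct {
	MediaType string
	Digest    string
	Size      int64
}

// merkleRoot hashes each fixed-size chunk, then hashes the concatenated chunk hashes.
// This is a two-level tree for illustration, not the IPFS DAG format.
func merkleRoot(blob []byte, chunkSize int) string {
	root := sha256.New()
	for start := 0; start < len(blob); start += chunkSize {
		end := start + chunkSize
		if end > len(blob) {
			end = len(blob)
		}
		leaf := sha256.Sum256(blob[start:end])
		root.Write(leaf[:])
	}
	return "sha256:" + hex.EncodeToString(root.Sum(nil))
}

// convert returns a copy of desc whose digest points at the root of the chunk tree.
func convert(desc Descriptor, blob []byte, chunkSize int) Descriptor {
	desc.Digest = merkleRoot(blob, chunkSize)
	return desc
}

func main() {
	blob := []byte("layer bytes would go here")
	sum := sha256.Sum256(blob)
	orig := Descriptor{
		MediaType: "application/vnd.oci.image.layer.v1.tar+gzip",
		Digest:    "sha256:" + hex.EncodeToString(sum[:]),
		Size:      int64(len(blob)),
	}
	p2p := convert(orig, blob, 8)
	fmt.Println("original: ", orig.Digest)
	fmt.Println("converted:", p2p.Digest)
}
```

Because the converted digest depends on the chunk size as well as the content, both sides of a transfer would need to agree on the chunking parameters to arrive at the same root.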
Although the IPFS daemon or its network may already have the bytes for all of an image's P2P content, containerd has a boltdb metadata store wrapping the underlying
An image pull, starting from the client side, goes through the following layers:
- content.ContentClient (gRPC client)
- content.NewService (gRPC server: plugin.GRPCPlugin "content")
- content.newContentStore (content.Store: plugin.ServicePlugin, services.ContentService)
- metadata.NewDB (bolt *metadata.DB: plugin.MetadataPlugin "bolt")
- ipcs.NewContentStore (content.Store: plugin.ContentPlugin, "ipcs")
So in the case of this project, ipcs, a pull simply flushes through its content.Store layers to register the image in containerd's metadata stores. Note that the majority of the blocks don't need to be downloaded into IPFS's local storage in order to complete a pull; downloading them can be deferred until the layers are unpacked into snapshots.
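The layering can be pictured with a small sketch in which a metadata layer wraps a backing store and only exposes content that has been registered. The Store interface here is a hypothetical, heavily reduced stand-in and not containerd's actual content.Store. The backing map plays the role of content that the IPFS network can already resolve.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is returned when a digest is unknown to a layer.
var ErrNotFound = errors.New("not found")

// Store is a hypothetical, reduced stand-in for a content store.
type Store interface {
	Size(digest string) (int64, error)
}

// backing pretends to be a store that can resolve any digest known to the network.
type backing map[string]int64

func (b backing) Size(digest string) (int64, error) {
	if n, ok := b[digest]; ok {
		return n, nil
	}
	return 0, ErrNotFound
}

// metadataStore wraps another Store and hides content that was never registered.
type metadataStore struct {
	inner      Store
	registered map[string]bool
}

func (m *metadataStore) Size(digest string) (int64, error) {
	if !m.registered[digest] {
		return 0, ErrNotFound
	}
	return m.inner.Size(digest)
}

// Pull registers each digest once the inner store confirms it can resolve it.
// No content bytes are copied in this sketch.
func (m *metadataStore) Pull(digests ...string) error {
	for _, d := range digests {
		if _, err := m.inner.Size(d); err != nil {
			return fmt.Errorf("pull %s: %w", d, err)
		}
		m.registered[d] = true
	}
	return nil
}

func main() {
	md := &metadataStore{inner: backing{"sha256:aaa": 42}, registered: map[string]bool{}}
	_, err := md.Size("sha256:aaa")
	fmt.Println(errors.Is(err, ErrNotFound)) // true: resolvable below, but not registered yet
	fmt.Println(md.Pull("sha256:aaa"))       // <nil>
	fmt.Println(md.Size("sha256:aaa"))       // 42 <nil>
}
```

The point of the sketch is that the pull only changes what the metadata layer knows, which is why the content itself can be fetched later.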
Collected data on:
- m5.large x 3
- 8.0 GiB Memory
- 2 vCPUs
- Up to 10 Gigabit (Throttled by AWS network credits)
- Linux kernel 4.4.0
- Ubuntu 16.04.6 LTS
- Containerd v1.2.6
- IPFS v0.4.21
- Switch libp2p mux from
- Set flatfs
- Enable experimental
- Pull from DockerHub / Private docker registries
- Shard content chunks evenly to 3 nodes such that each node has roughly 33% of IPFS blocks.
| Image | Total size (bytes) | IPFS blocks | DockerHub pull (secs) | IPFS pull (secs) | Diff (Hub/IPFS) |
IPFS's performance seems to degrade as the number of nodes (and with it the total image size) goes up. There was a recent regression in go-ipfs v0.4.21 that was fixed in this commit on
As seen from make compare, chunking into IPFS blocks also doesn't seem to improve deduplication compared to OCI layers:
$ GO111MODULE=on IPFS_PATH=./tmp/ipfs go run ./cmd/compare docker.io/library/alpine:latest docker.io/library/ubuntu:latest docker.io/library/golang:latest docker.io/ipfs/go-ipfs:latest
// ...
2019/06/04 13:39:55 Found 1381 blocks
docker.io/ipfs/go-ipfs:latest: 46891351
docker.io/library/alpine:latest: 5516903
docker.io/library/golang:latest: 828096081
docker.io/library/ubuntu:latest: 57723854
// Zero block intersection, they are very different images though.
Serious usage of p2p container image distribution should consider Dragonfly and Kraken, because IPFS suffers from performance issues:
- Bitswap transfer is slow ipfs/go-ipfs#5723
- IPFS uses excessive bandwidth ipfs/go-ipfs#3429
- Slow pins when there are many pins ipfs/go-ipfs#5221
Explore deduplication by adding each layer's uncompressed, untarred files into IPFS to get chunked-file-granular deduplication. IPFS's UnixFS (UNIX/POSIX fs features implemented via IPFS) needs the following:
- Support for tar file metadata (uid, gid, modtime, xattrs, executable bit, etc):
- Support for hard links, character/block devices, fifo:
- Implementation of diff.Applier to apply a custom IPFS layer mediatype to containerd's tmpmount.
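To make the metadata requirements above concrete, here is a minimal sketch using Go's standard archive/tar package. It builds a tiny in-memory layer containing a regular file, a hard link, and a FIFO, then lists the per-entry fields that a file-level store would have to preserve. The file names and values are made up for illustration, and xattrs are left out for brevity.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"time"
)

// buildLayer writes a tiny in-memory tar with a regular file, a hard link, and a FIFO.
func buildLayer() (*bytes.Buffer, error) {
	buf := new(bytes.Buffer)
	tw := tar.NewWriter(buf)
	body := []byte("#!/bin/sh\n")
	headers := []*tar.Header{
		{Name: "bin/run", Typeflag: tar.TypeReg, Mode: 0755, Uid: 1000, Gid: 1000,
			Size: int64(len(body)), ModTime: time.Unix(1559656480, 0)},
		{Name: "bin/run2", Typeflag: tar.TypeLink, Linkname: "bin/run", Mode: 0755},
		{Name: "run/pipe", Typeflag: tar.TypeFifo, Mode: 0600},
	}
	for _, h := range headers {
		if err := tw.WriteHeader(h); err != nil {
			return nil, err
		}
		// Only the regular file carries a body.
		if h.Typeflag == tar.TypeReg {
			if _, err := tw.Write(body); err != nil {
				return nil, err
			}
		}
	}
	return buf, tw.Close()
}

// describe lists the per-entry metadata found in the tar stream.
func describe(r io.Reader) ([]string, error) {
	var out []string
	tr := tar.NewReader(r)
	for {
		h, err := tr.Next()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		out = append(out, fmt.Sprintf("%s type=%c mode=%o uid=%d gid=%d link=%q",
			h.Name, h.Typeflag, h.Mode, h.Uid, h.Gid, h.Linkname))
	}
}

func main() {
	layer, err := buildLayer()
	if err != nil {
		panic(err)
	}
	entries, err := describe(layer)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Println(e)
	}
}
```

Each printed field is something a content-addressed file store would need to carry alongside the file bytes in order to reproduce the layer faithfully.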
Explore IPFS-FUSE mounted layers for lazy container rootfs:
- Same requirements as above.
- Possible snapshotter interface changes:
Explore IPFS tuning to improve performance:
- Tune goroutine/parallelism in various IPFS components.
- Tune datastore (use experimental go-ds-badger?). badgerds is hard to shard for benchmarking purposes, because GC doesn't remove data on disk: https://github.com/ipfs/go-ds-badger/issues/54
- Profile / trace performance issues and identify hotspots.