Use Glusterfs as Persistent Storage in Kubernetes

Following my previous post of mounting Glusterfs inside Super Privileged Container, I am pleased to announce that Glusterfs can be also used as a persistent storage for Kubernetes.

My recent Kubernetes pull request makes Glusterfs a new Kubernetes volume plugin. As explained in the example POD,  there are a number of advantages of using Glusterfs.

First, mount storm can be alleviated. Since Glusterfs is a scale-out filesystem, mount can be dispatched to any replica. This is especially helpful in scaling considering you may have containers started simultaneously on hundreds and thousands of nodes and each host mounts the remote filesystem at the same time. Such mount storm leads to latency or even service unavailability. In this Glusterfs volume plugin, however, mounts are balanced on different Gluster hosts and mount storm is thus alleviated.

Second, HA is built into this Glusterfs HA. As seen in the example POD, an array of Glusterfs hosts can be provided, Kubelet node pick one randomly and mount from there. If that host is unresponsive, kubelet goes to the next and so on until a successful mount is observed. This mechanism thus requires no other 3rd party solution (e.g. DNS round robin, etc).

The last feature of using this Glusterfs volume is that there is a support for using Super Privileged Container to enable Kubernets host to mount. This is illustrated in the helper utility in the example POD.

Vault 2015 Notes: Second Day Morning

Maxim’s FUSE improvement talk.  The writeback cache was the first slide when I arrived. The writeback cache reduced write latency and parallel writeback processing.  It accumulated page cache and kernel writeback would kick off the actual I/O. I vaguely heard “tripled”.  The performance comparison showed both baseline and improvement parity (~30% better) and commodity vs. Dell EQL SAN (mixed). The future improvement included eliminate global lock, variable message size, multi-queue, NUMA affinity. FUSE daemon might be able to talk to multiple queues in /dev/fuse and thus avoided contention. Oracle was said to submit patches to just do those things. The patches were said to improve performance quite a bit. Ben England from Red Hat asked about zero copy inside FUSE. Maxim pondered on kernel bypass for a second but hesitated to come a conclusion. Jeff Darcy asked about if FUSE API change needed to take advantage of these features, answers seemed to be not much. Second and following questions on invalidation writeback cache while one client still held them, answers seemed to be “depend” (expect “stale” data). Writeback cache could be disabled but on a volume level.

Anand’s talk on Glusterfs and NFS Ganesha. Ganesha became much better than last time I worked on it. Stackable FSALs, RDMA (libmooshika), dynamic exports. His focus was on CMAL (cluster manager abstraction layer), i.e. making active/active NFS heads possible. And you don’t need to have a clustered filesystem to use the CMAL framework. CMAL is able to migrate service IP. The clustered Ganesha with Glusterfs used VIP and Pacemaker/Corosync (could it scale?). Each Ganesha node is notified by DBUS message to initiate migration. The active/active tricks seemed to be embeded in the protocol NLM protocol (for v3 via SM_NOTIFY) and STALE_CLIENTID/STALE_STATEID (for v4). Jeff Layton didn’t object such architecture. Anand’s next topic was pNFS with Glusterfs, File Layout of course, anonymous FD was mentioned.  This appeared a more economic and scalable solution alternative. Questions on Ganesha vs. in-kernel NFS server performance parity, cluster scalability.

Venky’s Glusterfs compliance topic started with a low key tone. But think about it, there are many opportunities in his framework.: BitRot detection, tiering, dedupe, compression were quickly talked. It is easy to double that list and point to a use case. The new Glusterfs journal features callback mechanism, supports richer format. The “log mining” is on individual bricks, it could require some programming to get the (especially distributed) volume level picture. The metadata journals contain enough information, so say if you like to run forensics utilities, they could be very helpful to plot the data lifecycle.

Vault 2015 Notes: First Day Morning

After surviving the morning commute, I found myself 10 minutes late for the first talk.

The first talk was a joint topic on different aspects of the future and current storage system: Persistent Memory, Multiqueue (mentioned new IO scheduler), SMR, SCSI queue tree (better maintenance), LIO/SCST merger, iSCSI performance reconciling multiqueue and multi-connection conflicts by proposing new IETF iSCSI extension for Linux, kernel Rescan.

Second topic from SanDisk is about Data Center architectures. I came into a revelation that the Data Centers were consolidated into different resource poolings and scaling granularity.   As I reckoned the recent industry consolidation: Avago’s big acquistions making it relevant as a fabrics provider, SanDisk’s ascend into Enterprise storage was also leapfrogging, and multiple storage vendors had acquired some sorts of data management outfits (Pentaho/HDS for instance). This topic reviewed heterogeneous replication (one on SSD, more on HDD), erasure coding on Flash. SanDisk’s contributions/patches to Ceph and NoSQL improved performance by several X’s, future reducing price/performance gap.

Next session in Brfs was interesting, though I lost most part of it due to limited seating in the room. I vaguely remembered Chris was excited about CRC verification, improved scrub code, upcoming inline dedup, sub-volume quota, and new tests that made critical issues consistently reproducible, less write amplification using RocksDB, etc. I also had a good time learning how Facebook used and improved Glusterfs.

The pNFS talk was most about the basics but Christopher did attract my attention when he mentioned using SCSI3 reservation for fencing during error handling, and mentioned the projects/products I worked on before.  His then went to explain how his new pNFS server was structured and coded. The server used XFS and heavily reused the existing code base (like direct IO, no separate layout modules, etc). The performance was said to be linearly scaled.  And yes, he did mentioned omission of small files through pNFS protocol. The source code is kernel 4.0

Tachyon 0.6.0 Coming to Apache Bigtop

BIGTOP-1722 is now resolved. Tachyon 0.6.0 is now released in Apache Bigtop.

Tachyon released the long anticipated 0.6.0 version. This release is loaded with many new features, including hierarchical storage layers, Vagrant deploy (AWS EC2, OpenStack, Docker, VirtualBox), Netty server, and many bug fixes (including using Glusterfs as under filesystem).

I will write a few tutorials on deploying Tachyon 0.6.0 and run MapReduce tasks after I come back from Vault and Spark Summit East.

How to Mount Glusterfs on Docker Host?


A Docker host (such as CoreOS and RedHat Atomic Host) usually is a minimal OS without Gluster client package. If you want to mount a Gluster filesystem, it is quite hard to do it on the host.


I just worked out a solution to create a Super Privileged Container and run mount in the SPC’s namespace but create the mount in host’s namespace. The idea is to inject my own mount before mount(2) is called, so we can reset the namespace, thank Colin for the mount patch idea. But since I don’t want to patch any existing util, I followed Sage Weil’s suggestion and used ld.preload instead. This idea can thus be applied to gluster, nfs, cephfs, and so on, once we update the switch here The code is at my repo. Docker image is hchen/install-glusterfs-on-fc21

How it works

First pull my Docker image

# docker pull hchen/install-glusterfs-on-fc21

Then run the image in Super Privileged Container mode

#  docker run  --privileged -d  --net=host -e sysimage=/host -v /:/host -v /dev:/dev -v /proc:/proc -v /var:/var -v /run:/run hchen/install-glusterfs-on-fc21

Get the the container’s PID:

# docker inspect --format  {{.State.Pid}}  <your_container_id>

My PID is 865, I use this process’s namespace to run the mount, note the /mnt is in host’s name space

# nsenter --mount=/proc/865/ns/mnt mount -t glusterfs <your_gluster_brick>:<your_gluster_volueme>  /mnt

Alas, you can check on your Docker host to see this gluster fs mount at /mnt.