Use Glusterfs as Persistent Storage in Kubernetes

Following my previous post of mounting Glusterfs inside Super Privileged Container, I am pleased to announce that Glusterfs can be also used as a persistent storage for Kubernetes.

My recent Kubernetes pull request makes Glusterfs a new Kubernetes volume plugin. As explained in the example POD,  there are a number of advantages of using Glusterfs.

First, mount storm can be alleviated. Since Glusterfs is a scale-out filesystem, mount can be dispatched to any replica. This is especially helpful in scaling considering you may have containers started simultaneously on hundreds and thousands of nodes and each host mounts the remote filesystem at the same time. Such mount storm leads to latency or even service unavailability. In this Glusterfs volume plugin, however, mounts are balanced on different Gluster hosts and mount storm is thus alleviated.

Second, HA is built into this Glusterfs HA. As seen in the example POD, an array of Glusterfs hosts can be provided, Kubelet node pick one randomly and mount from there. If that host is unresponsive, kubelet goes to the next and so on until a successful mount is observed. This mechanism thus requires no other 3rd party solution (e.g. DNS round robin, etc).

The last feature of using this Glusterfs volume is that there is a support for using Super Privileged Container to enable Kubernets host to mount. This is illustrated in the helper utility in the example POD.

Vault 2015 Notes: Second Day Afternoon

Afternoon session in Ted Ts’o’s talk on lazytime mount option.  On surface, his talk shared many with the paper I wrote years back. He emphasized on tail latency than average (and I agree).  After spending time on showing where latency irregularity coming from, he pointed out that mtime update was problematic. Not flushing mtime was scary but Ted said other information (i_size, etc) could help you find if files were modified. And if no i_size change (like database), application usually didn’t care about mtime. He added dirty flags to inode as hint for fdata_sync (no mtime change) or fsync (mtime change). ext4 is the current new lazytime compliant filesystem. He ended with a ftrace demo. His multithreaded random write fio benchmark on RAM Disk showed double bandwidth, lockstat showed locking contention on journal went away. He also mentioned of DIO read lock removal after eliminating the chance of reading stale data on the write path. That dioread_nolock (?) option enabled ext4 to DIO read in parallel at raw high speed Flash speed.

Next topic was loaded with all buzz: Multipath, PCI-e, NVM. He showed a chart pointing to software was the last one to reduce latency in NVM era.

Vault 2015 Notes: Second Day Morning

Maxim’s FUSE improvement talk.  The writeback cache was the first slide when I arrived. The writeback cache reduced write latency and parallel writeback processing.  It accumulated page cache and kernel writeback would kick off the actual I/O. I vaguely heard “tripled”.  The performance comparison showed both baseline and improvement parity (~30% better) and commodity vs. Dell EQL SAN (mixed). The future improvement included eliminate global lock, variable message size, multi-queue, NUMA affinity. FUSE daemon might be able to talk to multiple queues in /dev/fuse and thus avoided contention. Oracle was said to submit patches to just do those things. The patches were said to improve performance quite a bit. Ben England from Red Hat asked about zero copy inside FUSE. Maxim pondered on kernel bypass for a second but hesitated to come a conclusion. Jeff Darcy asked about if FUSE API change needed to take advantage of these features, answers seemed to be not much. Second and following questions on invalidation writeback cache while one client still held them, answers seemed to be “depend” (expect “stale” data). Writeback cache could be disabled but on a volume level.

Anand’s talk on Glusterfs and NFS Ganesha. Ganesha became much better than last time I worked on it. Stackable FSALs, RDMA (libmooshika), dynamic exports. His focus was on CMAL (cluster manager abstraction layer), i.e. making active/active NFS heads possible. And you don’t need to have a clustered filesystem to use the CMAL framework. CMAL is able to migrate service IP. The clustered Ganesha with Glusterfs used VIP and Pacemaker/Corosync (could it scale?). Each Ganesha node is notified by DBUS message to initiate migration. The active/active tricks seemed to be embeded in the protocol NLM protocol (for v3 via SM_NOTIFY) and STALE_CLIENTID/STALE_STATEID (for v4). Jeff Layton didn’t object such architecture. Anand’s next topic was pNFS with Glusterfs, File Layout of course, anonymous FD was mentioned.  This appeared a more economic and scalable solution alternative. Questions on Ganesha vs. in-kernel NFS server performance parity, cluster scalability.

Venky’s Glusterfs compliance topic started with a low key tone. But think about it, there are many opportunities in his framework.: BitRot detection, tiering, dedupe, compression were quickly talked. It is easy to double that list and point to a use case. The new Glusterfs journal features callback mechanism, supports richer format. The “log mining” is on individual bricks, it could require some programming to get the (especially distributed) volume level picture. The metadata journals contain enough information, so say if you like to run forensics utilities, they could be very helpful to plot the data lifecycle.

Vault 2015 Notes: First Day Afternoon

Afternoon talks were brain tests. There were many good and interesting topics. I started from Sage’s librados talk.  In addition to RGW, RBD, and CephFS, Ceph’s librados is also open to developers/users. Sage’s talk was to promote librados to app developers. In fact, it was the building block for RBD, RGW, and CephFS. He started with simple Hello Word type of snippets, then in a more complicated atomic compound and conditional models, following by models on K/V values (random access, structure data). The new RADOS methods run inside I/O path (a .so file) on a per object basis. This is very interesting, you can implement any plugins to add values to your data, e.g. checksum, archive, replication, encryption, etc. The watch/notify mechanism was extensively reviewed, this could implement cache invalidation on this.  He mentioned dynamic object in LUA from Noah Wakins that used LUN clent wrapper for librados and made programming RADOS classes easy, VAULTAIRE (preserving all data points no MRTG, a data vault for metrics), ZLOG – CORFU (a high performancing distributed shared log for flash ???), radosfs (hey, not my RadosFS), glados (gluster fs xlator on RADOS), iRODS, Synnefo, and dropbox like app, libradosstriper. He concluded the talk with a list of others in the CAP space: Gluster, Swift, Riak, Cassandra.

Next talk on NFSv4.2 and beyond. Interesting to see the NFSv4 timeline, 12 years into production since working group created. But labeled NFS was much accelerated. Security labels were into RHEL 7 supporting SELinux enforced by server. Sparse file in kernel 3.18 but not in RHEL, it reduced network traffic by not sending holes, good for virtualization. Space reservation (fallocate) in 3.19, not in RHEL yet.  Server side copy (no glibc support yet?),  IO hint (io_advice). If you have an idea, supply a patch and RFC.

Last talk on Ceph today (4 in a row!) was from SanDisk. The 512TB InfiniFlash was mentioned. He explained a collection of patches to Ceph OSD to make all-flash OSD high performing 6~7x on read. Code in Hammer. He siad TCmalloc increased contetio in sharded thread pool. This was not in JEmalloc. My poor eye sight spotted a ~350K IOPS read with queue depth 100, and they were said to saturate the box (which was 780K IOPS and 7Gb/s).

Also during breakout, I peeked into Facebook’s storage box. A 30-bay 1U server, fan-only cooling (and still able to run without A/C!), no visible vibration reducer.

Vault 2015 Notes: First Day Morning

After surviving the morning commute, I found myself 10 minutes late for the first talk.

The first talk was a joint topic on different aspects of the future and current storage system: Persistent Memory, Multiqueue (mentioned new IO scheduler), SMR, SCSI queue tree (better maintenance), LIO/SCST merger, iSCSI performance reconciling multiqueue and multi-connection conflicts by proposing new IETF iSCSI extension for Linux, kernel Rescan.

Second topic from SanDisk is about Data Center architectures. I came into a revelation that the Data Centers were consolidated into different resource poolings and scaling granularity.   As I reckoned the recent industry consolidation: Avago’s big acquistions making it relevant as a fabrics provider, SanDisk’s ascend into Enterprise storage was also leapfrogging, and multiple storage vendors had acquired some sorts of data management outfits (Pentaho/HDS for instance). This topic reviewed heterogeneous replication (one on SSD, more on HDD), erasure coding on Flash. SanDisk’s contributions/patches to Ceph and NoSQL improved performance by several X’s, future reducing price/performance gap.

Next session in Brfs was interesting, though I lost most part of it due to limited seating in the room. I vaguely remembered Chris was excited about CRC verification, improved scrub code, upcoming inline dedup, sub-volume quota, and new tests that made critical issues consistently reproducible, less write amplification using RocksDB, etc. I also had a good time learning how Facebook used and improved Glusterfs.

The pNFS talk was most about the basics but Christopher did attract my attention when he mentioned using SCSI3 reservation for fencing during error handling, and mentioned the projects/products I worked on before.  His then went to explain how his new pNFS server was structured and coded. The server used XFS and heavily reused the existing code base (like direct IO, no separate layout modules, etc). The performance was said to be linearly scaled.  And yes, he did mentioned omission of small files through pNFS protocol. The source code is kernel 4.0

NAS Server in the Cloud?

What does it mean, really?

Cloud evangelists forecast the future of data center are in the Cloud, yet I am not convinced that leads to the demise of storage servers. I believe storage vendors will find a new home for their products: Cloud.

Actually, NetApp, a storage vendor who sells NAS boxes, already transforms itself to an AWS server image provider. The server image just provides the same function as the NAS boxes do.

Why people still need storage servers, even in Cloud?


Cloud storage like S3 and Swift are object store, while most enterprise applications still work with file and block based storage.
Shifting data center into the Cloud must first deal with this API level differences.


Cloud storage technologies may vary from one vendor to another, with absence of industry wide protocols. This poses a migration risk for
Cloud hoppers. In contrast, NFS/CIFS/iSCSI/FC protocols are found in all storage servers, as long as in-cloud storage server exists,
such migration risk is much diminished.


It is undeniable that storage vendors like EMC, NetApp, HP, Hitachi, and IBM pride themselves on technologies (and patents) that Cloud storage don’t yet have. Their value proposition won’t evaporate any time soon.

How does it look like?

My very rough component level comparison is illustrated here.


Tachyon 0.6.0 Coming to Apache Bigtop

BIGTOP-1722 is now resolved. Tachyon 0.6.0 is now released in Apache Bigtop.

Tachyon released the long anticipated 0.6.0 version. This release is loaded with many new features, including hierarchical storage layers, Vagrant deploy (AWS EC2, OpenStack, Docker, VirtualBox), Netty server, and many bug fixes (including using Glusterfs as under filesystem).

I will write a few tutorials on deploying Tachyon 0.6.0 and run MapReduce tasks after I come back from Vault and Spark Summit East.

How to Mount Glusterfs on Docker Host?


A Docker host (such as CoreOS and RedHat Atomic Host) usually is a minimal OS without Gluster client package. If you want to mount a Gluster filesystem, it is quite hard to do it on the host.


I just worked out a solution to create a Super Privileged Container and run mount in the SPC’s namespace but create the mount in host’s namespace. The idea is to inject my own mount before mount(2) is called, so we can reset the namespace, thank Colin for the mount patch idea. But since I don’t want to patch any existing util, I followed Sage Weil’s suggestion and used ld.preload instead. This idea can thus be applied to gluster, nfs, cephfs, and so on, once we update the switch here The code is at my repo. Docker image is hchen/install-glusterfs-on-fc21

How it works

First pull my Docker image

# docker pull hchen/install-glusterfs-on-fc21

Then run the image in Super Privileged Container mode

#  docker run  --privileged -d  --net=host -e sysimage=/host -v /:/host -v /dev:/dev -v /proc:/proc -v /var:/var -v /run:/run hchen/install-glusterfs-on-fc21

Get the the container’s PID:

# docker inspect --format  {{.State.Pid}}  <your_container_id>

My PID is 865, I use this process’s namespace to run the mount, note the /mnt is in host’s name space

# nsenter --mount=/proc/865/ns/mnt mount -t glusterfs <your_gluster_brick>:<your_gluster_volueme>  /mnt

Alas, you can check on your Docker host to see this gluster fs mount at /mnt.

iSCSI as on-premise Persistent Storage for Kubernetes and Docker Container

Why iSCSI Storage?

iSCSI has been widely adopted in data centers. It is the default implementation for OpenStack Cinder. Cinder defines a common block storage interface so storage vendors can supply their own plugins to present their storage products to Nova compute. As it happens, most of the vendor supplied plugins use iSCSI.

Containers: How to Persist Data to iSCSI Storage?

Persisting data inside a container can be done in two ways.

Container sets up iSCSI session

The iSCSI session is initiated inside the container, iSCSI traffic goes through Docker NAT to external iSCSI target. This approach doesn’t require host’s support and is thus portable. However, the Container is likely to suffer from suboptimal network performance, because Docker NAT doesn’t deliver good performance, as reseachers at IBM found. Since iSCSI is highly senstive to network performance, delay or jitters will cause iSCSI connection timeout and retries. This approach is thus not preferred for mission-critical services.

Host sets up iSCSI session

Host initiates the iSCSI session, attaches iSCSI disk, mounts the filesystem on the disk to a local directory, and shares the filesystem with Container. This approach doesn’t need Docker NAT and is conceivably higher performing than the first approach. This approach is implemented in the iSCSI persistent storage for Kubernetes, discussed in the following.

What is Kubernetes?

Kubernetes is an open source Linux Container orchestrator developed by Google, Red Hat, etc. Kubernetes creates, schedules, minotors, and deletes containers across a cluster of Linux hosts. Kubernetes defines Containers as “pod”, which is declared in a set of json files.

How Containers Persist Data in Kubernetes?

A Container running MySQL wants persistent storage so the database can survive. The persistent storage can either be on local host or ideally a shared storage that the host clusters can all access so that when the container is migrated, it can find the persisted data on the new host. Currently Kubernetes provides three storage volume types: empty_dir, host_dir, and GCE Persistent Disk.

  • empty_dir. empty_dir is not meant to be long lasting. When the pod is deleted, the data on empty_dir is lost.
  • host_dir. host_dir presents a directory on the host to the container. Container sees this directory through a local mountpoint. Steve Watts has written an excellent blog on provisioning NFS to containers by way of host_dir.
  • GCE Persistent Disk. You can also use the persistent storage service available at Google Compute Engine. Kubernetes allows containers to access data residing on GCE Persisent Disk.

iSCSI Disk: a New Persistent Storage for Kubernetes

Since on-premise enterprise data centers and OpenStack providers have already invested in iSCSI storage. When they deploy Kubernetes, it is logical that they want Containers access data living on iSCSI storage. It is thus desirable for Kubernetes to support iSCSI disk based persistent volume


My Kubernetes pull request provides a solution to this end. As seen in this high level architecture _config.yml When kubelete creates the pod on the node(previously known as minion), it logins into iSCSI target, and mounts the specified disks to the container’s volumes. Containers can then access the data on the persistent storage. Once the container is deleted and iSCSI disks are not used, kubelet logs out of the target. A Kubernetes pod can use iSCSI disk as persistent storage for read and write. As exhibited in this pod example, this pod declares two containers: both uses iSCSI LUNs. Container iscsipd-ro mounts the read-only ext4 filesystem backed by iSCSI LUN 0 to /mnt/iscsipd, and Container iscsipd-ro mounts the read-write xfs filesystem backed by iSCSI LUN 1 to /mnt/iscsipd.

How to Use it?

Here is my setup to setup Kubernetes with iSCSI persistent storage. I use Fedora 21 on Kubernetes node. First get my github repo

# git clone -b iscsi-pd-merge

then build and install on the Kubernetes master and node. Install iSCSI initiator on the node:

# yum -y install iscsi-initiator-utils

then edit /etc/iscsi/initiatorname.iscsi and /etc/iscsi/iscsid.conf to match your iSCSI target configuration. I mostly follow these instructions to setup iSCSI initiator and these instructions to setup iSCSI target. Once you have installed iSCSI initiator and new Kubernetes, you can create a pod based on my example. In the pod JSON, you need to provide portal (the iSCSI target’s IP address and port if not the default port 3260), target’s iqn, lun, and the type of the filesystem that has been created on the lun, and readOnly boolean. Once your pod is created, run it on the Kubernetes master:

#cluster/ create -f your_new_pod.json

Here is my command and output:

    # cluster/ create -f examples/iscsi-pd/iscsi-pd.json 
    current-context: ""
    Running: cluster/../cluster/gce/../../_output/local/bin/linux/amd64/kubectl create -f examples/iscsi-pd/iscsi-pd.json
    # cluster/ get pods
    current-context: ""
    Running: cluster/../cluster/gce/../../_output/local/bin/linux/amd64/kubectl get pods
    POD                                    IP                  CONTAINER(S)        IMAGE(S)                 HOST                      LABELS              STATUS
    iscsipd                                iscsipd-ro          kubernetes/pause         fed-minion/   <none>              Running
                                                           iscsipd-rw          kubernetes/pause                                                    

On the Kubernetes node, I got these in mount output

    #mount |grep kub
    /dev/sdb on /var/lib/kubelet/plugins/ type ext4 (ro,relatime,stripe=1024,data=ordered)
    /dev/sdb on /var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/volumes/ type ext4 (ro,relatime,stripe=1024,data=ordered)
    /dev/sdc on /var/lib/kubelet/plugins/ type xfs (rw,relatime,attr2,inode64,noquota)
    /dev/sdc on /var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/volumes/ type xfs (rw,relatime,attr2,inode64,noquota)

Run docker inspect and I found the Containers mounted the host directory into the their /mnt/iscsipd directory.

    # docker ps
    CONTAINER ID        IMAGE                     COMMAND                CREATED             STATUS              PORTS                    NAMES
    cc9bd22d9e9d        kubernetes/pause:latest   "/pause"               3 minutes ago       Up 3 minutes                                 k8s_iscsipd-rw.12d8f0c5_iscsipd.default.etcd_4ab78fdc-b927-11e4-ade6-d4bed9b39058_e3f49dcc                               
    a4225a2148e3        kubernetes/pause:latest   "/pause"               3 minutes ago       Up 3 minutes                                 k8s_iscsipd-ro.f3c9f0b5_iscsipd.default.etcd_4ab78fdc-b927-11e4-ade6-d4bed9b39058_3cc9946f                               
    4d926d8989b3        kubernetes/pause:latest   "/pause"               3 minutes ago       Up 3 minutes                                 k8s_POD.8149c85a_iscsipd.default.etcd_4ab78fdc-b927-11e4-ade6-d4bed9b39058_c7b55d86                                      
    #docker inspect --format   {{.Volumes}}  cc9bd22d9e9d
    map[/mnt/iscsipd:/var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/volumes/ /dev/termination-log:/var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/containers/iscsipd-rw/cc9bd22d9e9db3c88a150cadfdccd86e36c463629035b48bdcfc8ec534be8615]
    #docker inspect --format  {{.Volumes}}  a4225a2148e3
    map[/dev/termination-log:/var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/containers/iscsipd-ro/a4225a2148e38afc1a50a540ea9fe2e747886f1011ac5b3be4badee938f2fc5f /mnt/iscsipd:/var/lib/kubelet/pods/4ab78fdc-b927-11e4-ade6-d4bed9b39058/volumes/]