[Alpha] GlusterFS CSI (Container Storage Interface) Driver for Container Orchestrators!

Every container or cloud storage vendor wants a standard interface so that it can develop an unbiased solution without maintaining a non-trivial testing matrix.

“Container Storage Interface” (CSI) is a proposed new industry standard for cluster-wide volume plugins. CSI will enable storage providers (SP) to develop a plugin once and have it work across a number of container orchestration (CO) systems.

The latest Kubernetes release, 1.9, has rolled out an alpha implementation of the Container Storage Interface (CSI), which makes installing new volume plugins as easy as deploying a pod. It also enables third-party storage providers to develop solutions without the need to add to the core Kubernetes codebase.

This blog is about the GlusterFS CSI driver, which can create and delete volumes dynamically and mount or unmount them whenever there is a request. I will explain the deployment later. For now, I have compiled the driver (github.com/humblec/drivers/commit/452e76c623c96b7222599ea94bb7e809f03b156c) and prepared the Kubernetes deployment for the GlusterFS CSI driver in my setup.

What I have:


*) Kubernetes cluster with required feature gates enabled
*) Running CSI helpers for kubernetes.

To demonstrate how the GlusterFS CSI driver works, let us follow the usual workflow for dynamically provisioned PVs, starting with the creation of a storage class. Please note the provisioner parameter in the storage class file below. It points to the CSI GlusterFS plugin.

[root@localhost cluster]# cat csi-sc.yaml 

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfscsi 
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: csi-glusterfsplugin
[root@localhost cluster]#

Once the storage class is created, let us create a claim. This claim points to the storage class called glusterfscsi.

[root@localhost cluster]# cat glusterfs-pvc-claim12_fast.yaml 
{
   "kind": "PersistentVolumeClaim",
   "apiVersion": "v1",
   "metadata": {
     "name": "claim12",
     "annotations": {
     "volume.beta.kubernetes.io/storage-class": "glusterfscsi"
     }
   },
   "spec": {
     "accessModes": [
       "ReadWriteMany"
     ],
     "resources": {
       "requests": {
         "storage": "4Gi"
       }
     }
   }
}
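If you need several claims like this one, the manifest can be generated instead of hand-edited. The following is only a sketch: the helper name build_pvc is hypothetical, and it simply reproduces the fields of the claim file above (including the beta storage-class annotation used in this post).

```python
import json


def build_pvc(name, storage_class, size, access_mode="ReadWriteMany"):
    """Return a PersistentVolumeClaim manifest as a plain dict.

    Hypothetical helper: it mirrors the fields of the claim file shown
    above and does not talk to the cluster.
    """
    return {
        "kind": "PersistentVolumeClaim",
        "apiVersion": "v1",
        "metadata": {
            "name": name,
            "annotations": {
                # Same annotation key as in the claim file above.
                "volume.beta.kubernetes.io/storage-class": storage_class,
            },
        },
        "spec": {
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Print a manifest equivalent to the claim12 file above.
    print(json.dumps(build_pvc("claim12", "glusterfscsi", "4Gi"), indent=3))
```

The printed JSON can be saved to a file and passed to kubectl create -f in the same way as the hand-written claim.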

As soon as the request to create the claim was made, the Gluster CSI driver received it and created a PV object, as you can see here:

[root@localhost cluster]# ./kubectl.sh get pvc
NAME      STATUS    VOLUME                                                        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
claim12   Bound     kubernetes-dynamic-pvc-ad8014ec-febd-11e7-bf55-c85b7636c232   4Gi        RWX            glusterfscsi   35m
[root@localhost cluster]#

As an excited user/admin, examine the details of the PVC and PV as shown below:

[root@localhost kubernetes]# kubectl describe pvc
Name:          claim12
Namespace:     default
StorageClass:  glusterfscsi
Status:        Bound
Volume:        kubernetes-dynamic-pvc-79eb02cd-fd17-11e7-ac3c-c85b7636c232
Labels:        
Annotations:   control-plane.alpha.kubernetes.io/leader={"holderIdentity":"79e518d1-fd17-11e7-ac3c-c85b7636c232","leaseDurationSeconds":15,"acquireTime":"2018-01-19T12:51:32Z","renewTime":"2018-01-19T12:51:34Z","lea...
               pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
               volume.beta.kubernetes.io/storage-class=glusterfscsi
               volume.beta.kubernetes.io/storage-provisioner=csi-glusterfsplugin
Finalizers:    []
Capacity:      4Gi
Access Modes:  RWX
Events:
  Type    Reason                 Age              From                                                                            Message
  ----    ------                 ----             ----                                                                            -------
  Normal  ExternalProvisioning   5m (x7 over 6m)  persistentvolume-controller                                                     waiting for a volume to be created, either by external provisioner "csi-glusterfsplugin" or manually created by system administrator
  Normal  Provisioning           5m               csi-glusterfsplugin localhost.localdomain 79e518d1-fd17-11e7-ac3c-c85b7636c232  External provisioner is provisioning volume for claim "default/claim12"
  Normal  ProvisioningSucceeded  5m               csi-glusterfsplugin localhost.localdomain 79e518d1-fd17-11e7-ac3c-c85b7636c232  Successfully provisioned volume kubernetes-dynamic-pvc-79eb02cd-fd17-11e7-ac3c-c85b7636c232

PV:

[root@localhost cluster]# ./kubectl.sh describe pv
Name:            kubernetes-dynamic-pvc-ad8014ec-febd-11e7-bf55-c85b7636c232
Labels:          
Annotations:     csi.volume.kubernetes.io/volume-attributes={"glusterserver":"172.18.0.3","glustervol":"vol_64d3ac458bc17bec44a919336656fbfb"}
                 csiProvisionerIdentity=1516547610828-8081-csi-glusterfsplugin
                 pv.kubernetes.io/provisioned-by=csi-glusterfsplugin
StorageClass:    glusterfscsi
Status:          Bound
Claim:           default/claim12
Reclaim Policy:  Delete
Access Modes:    RWX
Capacity:        4Gi
Message:         
Source:
    Type:          CSI (a Container Storage Interface (CSI) volume source)
    Driver:        csi-glusterfsplugin
    VolumeHandle:  ad817b9d-febd-11e7-96d9-c85b7636c232
    ReadOnly:      false
Events:            
[root@localhost cluster]#

The output above shows that the PV object was created by the Gluster CSI driver!

Let us create a pod with this claim and see whether the mount works.

[root@localhost cluster]# cat ../demo/fedora-pod.json 
{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "gluster",
        "labels": {
            "name": "gluster"
        }
    },
    "spec": {
        "containers": [{
            "name": "gluster",
            "image": "fedora",
            "imagePullPolicy": "IfNotPresent",
            "volumeMounts": [{
                "mountPath": "/mnt/gluster",
                "name": "gluster"
            }]
        }],
       "volumes": [{
            "name": "gluster",
            "persistentVolumeClaim": {
                "claimName": "claim12"
            }
        }]
    }
}

Create the pod and check the mount:

[root@localhost cluster]# kubectl create -f demo/fedora-pod.json

...

[root@localhost cluster]# mount |grep gluster
172.18.0.3:vol_64d3ac458bc17bec44a919336656fbfb on /var/lib/kubelet/pods/e6476013-febd-11e7-bde6-c85b7636c232/volumes/kubernetes.io~csi/kubernetes-dynamic-pvc-ad8014ec-febd-11e7-bf55-c85b7636c232/mount type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
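The mount source in the output above matches the two volume attributes recorded in the PV annotation (glusterserver and glustervol). As a small illustration, and assuming the source is simply "server:volume" as it appears here, the expected mount source can be derived from the annotation value. The function name is hypothetical.

```python
import json


def gluster_mount_source(volume_attributes_json):
    """Build the "server:volume" string from the PV's volume attributes.

    Assumes the annotation value is a JSON object with the keys
    "glusterserver" and "glustervol", as in the describe output above.
    """
    attrs = json.loads(volume_attributes_json)
    return "{}:{}".format(attrs["glusterserver"], attrs["glustervol"])


if __name__ == "__main__":
    annotation = (
        '{"glusterserver":"172.18.0.3",'
        '"glustervol":"vol_64d3ac458bc17bec44a919336656fbfb"}'
    )
    print(gluster_mount_source(annotation))
```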

Cool, isn't it? When you delete the pod, the volume is unmounted as expected.

[root@localhost cluster]# ./kubectl.sh delete pod gluster
pod "gluster" deleted
[root@localhost cluster]# mount |grep glusterfs
[root@localhost cluster]#
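The same before/after check can be scripted. This is a rough sketch that parses text in the format printed by the mount command; the function name is hypothetical, and the parsing is naive (it would be confused by paths containing " on " or " type ").

```python
def glusterfs_mounts(mount_output):
    """Return (source, target) pairs for fuse.glusterfs entries.

    Expects lines shaped like the output of the mount command:
    "<source> on <target> type <fstype> (<options>)".
    """
    mounts = []
    for line in mount_output.splitlines():
        head, sep, tail = line.partition(" type ")
        if not sep:
            continue  # not a mount line
        fields = tail.split()
        source, on_sep, target = head.partition(" on ")
        if on_sep and fields and fields[0] == "fuse.glusterfs":
            mounts.append((source, target))
    return mounts


if __name__ == "__main__":
    sample = (
        "proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)\n"
        "172.18.0.3:vol_64d3ac458bc17bec44a919336656fbfb on /mnt/gluster "
        "type fuse.glusterfs (rw,relatime,user_id=0,group_id=0)\n"
    )
    print(glusterfs_mounts(sample))
```

An empty list after the pod is deleted corresponds to the empty grep result above.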

PS: I will write about the deployment and other details in the next blog post. Happy to receive your feedback, if any.

22 Jan 2018

Building Gluster with Address Sanitizer

We occasionally find leaks in Gluster via bugs filed by users and customers. We would definitely benefit from checking for memory leaks and memory corruption ourselves. The usual way has been to run Gluster under valgrind. With ASAN (Address Sanitizer), the difference is that we can compile the binary with ASAN, and then anyone can run their tests on top of this binary; it should crash if it comes across a memory leak or memory corruption. We’ve fixed at least one bug with the traceback from ASAN.

Here’s how you run Gluster under ASAN.

./autogen.sh
./configure --enable-gnfs --enable-debug --silent --sanitize=address

You need to make sure you have libasan installed, or else this might error out. Once this is done, compile and install like you would normally. Now run tests and see how it works. There are problems with this approach, though. If there’s a leak in the CLI, it’s going to complain about it all the time. That noise doesn’t mean fixing the leak is important. The Gluster CLI is going away soon. Additionally, the CLI isn’t a long-running daemon. It’s started, does its job, and dies immediately.

The tricky part, though, is that ASAN only catches memory you’ve forgotten to free. It does not catch memory that you’ve allocated unnecessarily. In the near future, I want to create RPMs which you can download and run tests against.
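One way to cope with the noise is to tally the reports by kind before reading them in full. The sketch below is not part of the Gluster tooling; it assumes the usual report header lines of the form "==PID==ERROR: AddressSanitizer: ..." and "==PID==ERROR: LeakSanitizer: ...", and the function name is hypothetical.

```python
import re
from collections import Counter

# Header line of a sanitizer report, for example:
#   ==1234==ERROR: AddressSanitizer: heap-use-after-free on address 0x...
#   ==1234==ERROR: LeakSanitizer: detected memory leaks
REPORT_RE = re.compile(
    r"^==\d+==ERROR: (AddressSanitizer|LeakSanitizer): (.+?)(?: on address|$)"
)


def summarize_reports(log_text):
    """Count sanitizer reports in a log, keyed by (tool, kind)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REPORT_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    sample_log = (
        "==101==ERROR: AddressSanitizer: heap-use-after-free on address "
        "0x602000000010 at pc 0x000000400000\n"
        "    #0 0x400000 in main\n"
        "==102==ERROR: LeakSanitizer: detected memory leaks\n"
        "==103==ERROR: LeakSanitizer: detected memory leaks\n"
    )
    for (tool, kind), count in sorted(summarize_reports(sample_log).items()):
        print(tool, kind, count)
```

A tally like this makes it easier to see whether a test run produced anything other than the already-known CLI leak reports.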

11 Dec 2017

Want to Install Ceph, but afraid of Ansible?



There is no doubt that Ansible is a pretty cool automation engine for provisioning and configuration management. ceph-ansible builds on this versatility to deliver what is probably the most flexible Ceph deployment tool out there. However, some of you may not want to get to grips with Ansible before you install Ceph...weird right?

No, not really.


If you're short on time, or just want a cluster to try Ceph for the first time, a more guided installation approach may help. So I started a project called ceph-ansible-copilot.

The idea is simple enough; wrap the ceph-ansible playbook with a text GUI. Very 1990's, I know, but now instead of copying and editing various files you simply start the copilot tool, enter the details and click 'deploy'. The playbook runs in the background within the GUI and any errors are shown there and then...no more drowning in an ocean of scary ansible output :)

The features and workflows of the UI are described in the project page's README file.

Enough rambling, let's look at how you test this stuff out. The process is fairly straightforward:
  1. configure some hosts for Ceph
  2. create the Ansible environment
  3. run copilot
The process below describes each of these steps using CentOS7 as the deployment target for Ansible and the Ceph cluster nodes.
    1. Configure Some Hosts for Ceph
    Call me lazy, but I'm not going to tell you how to build vm's or physical servers. To follow along, the bare minimum you need are a few virtual machines - as long as they have some disks on them for Ceph, you're all set!

    2. Create the Ansible environment
    Typically for a Ceph cluster you'll want to designate a host as the deployment or admin host. The admin host is just a deployment manager, so it can be a virtual machine, a container or even a real (gasp!) server. All that really matters is that your admin host has network connectivity to the hosts you'll be deploying ceph to.

    On the admin host, perform these tasks (copilot needs ansible 2.4 or above)
    > yum install git ansible python-urwid -y
    Install ceph-ansible (full installation steps can be found here)
    > cd /usr/share
    > git clone https://github.com/ceph/ceph-ansible.git
    > cd ceph-ansible
    > git checkout master
    Set up passwordless ssh between the admin host and the candidate ceph hosts
    > ssh-keygen
    > ssh-copy-id root@<ceph_node>
    On the admin host install copilot
    > cd ~
    > git clone https://github.com/pcuzner/ceph-ansible-copilot.git
    > cd ceph-ansible-copilot
    > python setup.py install 
    3. Run copilot
    The main playbook for ceph-ansible is in /usr/share/ceph-ansible - this is where you need to run copilot from (it will complain if you try to run it in some other place!)
    > cd /usr/share/ceph-ansible
    > copilot
    Then follow the UI..
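Before starting copilot, it can be worth confirming that the passwordless ssh step above really worked for every candidate host. The following is a minimal sketch (Python 3), not how copilot itself performs its check: it relies on the ssh client's BatchMode option, which makes ssh fail instead of prompting for a password. The function names and the host names in the example are hypothetical.

```python
import subprocess


def ssh_check_command(host, user="root", timeout=5):
    """Build an ssh command that fails rather than prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                       # never ask for a password
        "-o", "ConnectTimeout={}".format(timeout),   # give up quickly
        "{}@{}".format(user, host),
        "true",                                      # no-op on the remote side
    ]


def has_passwordless_ssh(host, user="root"):
    """Return True when key-based login to the host succeeds."""
    result = subprocess.run(
        ssh_check_command(host, user),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Placeholder host names; replace with your candidate ceph hosts.
    for node in ["ceph-node1", "ceph-node2", "ceph-node3"]:
        print(node, "ok" if has_passwordless_ssh(node) else "needs ssh-copy-id")
```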

    Example Run
    Here's a screen capture showing the whole process, so you can see what you get before you hit the command line.



    The video shows the deployment of a small 3 node ceph cluster, 6 OSDs, a radosgw (for S3), and an MDS for cephfs testing. It covers the configuration of the admin host, the copilot UI and finally a quick look at the resulting ceph cluster. The video is 9mins in length, but for those of us with short attention spans, here's the timeline so you can jump to the areas that interest you.

    00:00 Pre-requisite rpm installs on the admin host
    01:12 Installing ceph-ansible from github
    01:52 Installing copilot
    02:58 Setting up passwordless ssh from the admin host to the candidate ceph hosts
    04:04 Ceph hosts before deployment
    05:04 Starting copilot
    08:10 Copilot complete, review the Ceph hosts



    What's next?
    More testing...on more and varied hardware...

    So far I've only tested 'simple' deployments using the packages from ceph.com (community deployments) against a CentOS target. So like I said, more testing is needed, a lot more...but for now there's enough of the core code there for me to claim a victory and write a blog post!

    Aside from the testing, these are the kinds of things that I'd like to see copilot handle
    • collocation rules (which daemons can safely run together)
    • resource warnings (if you have 10 HDD's but not enough RAM, or CPU...issue a warning)
    • handle the passwordless ssh setup. copilot already checks for passwordless ssh, so instead of leaving it to the admin to resolve any issues, just add another page to the UI.
    That's my wishlist - what would you like copilot to do? Leave a comment, or drop by the project on github.

    Demo'd Versions
    • copilot 0.9.1
    • ceph-ansible MASTER as at December 11th 2017
    • ansible 2.4.1 on CentOS




    4 Dec 2017

    Static Analysis for Gluster

    Static analysis programs are quite useful, but also prone to false positives. It’s really hard to keep track of static analysis failures on a fairly large project. We’ve looked at several approaches in the past. The one that we used to do was to publish a report every day which people could look at if they wished. This guaranteed that nobody looked at it. Despite knowing where to look for it, even I barely looked at it.

    The second approach was to run them twice, once before your patch is merged and once after. If the count goes up with your patch, the test fails. The problem is that this doesn’t account for false positives. An argument could be made that you could go fix another static analysis failure in your patch. But that means your patch now does two things, which isn’t fun when you want to do a backport, for instance. Or even for history purposes. That’s landing two unrelated changes in one patch.
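To make the weakness of the count-based gate concrete, here is a small sketch. It is not the job we actually ran; the function names and the findings are made up for illustration. Findings are modelled as (file, checker) pairs.

```python
def count_gate(before, after):
    """Pass when the patch does not increase the number of findings."""
    return len(after) <= len(before)


def new_findings(before, after):
    """Findings that are present after the patch but were not there before."""
    return sorted(set(after) - set(before))


if __name__ == "__main__":
    # Hypothetical findings before the patch.
    before = [("a.c", "null-deref"), ("b.c", "leak")]

    # Patch 1 adds a single finding that is a false positive: the gate fails.
    with_false_positive = before + [("c.c", "dead-store")]
    print(count_gate(before, with_false_positive))

    # Patch 2 fixes an unrelated finding and adds a new one: the gate passes,
    # even though the patch introduced a finding and now does two things.
    swapped = [("a.c", "null-deref"), ("c.c", "use-after-free")]
    print(count_gate(before, swapped))
    print(new_findings(before, swapped))
```

Comparing only totals cannot tell these two cases apart from a patch that is genuinely clean or genuinely worse, which is why a bare count was a poor merge gate.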

    The approach that we’ve now gone with is to have them run on a nightly basis with Jenkins. Deepshika did almost all the work for this and wrote about it on her blog. It has more details on the actual implementation. This puts all the results in one place for everyone to take a look at. Jenkins also gives us a visual view of what changed over the course of time, which wasn’t as easy in the past.

    She’s working on further improving the visual look by uniting all the jobs that are tied to static analysis. That way, we’ll have a nightly pipeline run for each branch that will put all the tests we care about for a particular branch in one place.

    1 Dec 2017

    Gluster Summit 2017

    Right after Open Source Europe, we had Gluster Summit. It was a 2-day event with talks and BoFs. I had two key things to do at the Gluster Summit. One was to build out the minnowboard setup to demo Tendrl. This didn’t work out. I had also volunteered to help with the video work. According to my plans, the setup for the minnowboards would take about 1h and then I’d be free to help with camera work. I also had a talk scheduled for the second day of the event. I’d have expected one of these to go wrong. I didn’t expect all of them to go wrong :)

    The venue had a balcony, which made for great photos

    On the first day, Amar and I arrived early and did the camera setup. The venue staff were helpful. They gave us a line out from their audio setup for the camera. Our original plan was that speakers would have a lapel mic for the camera. That was prone to errors from speakers and would also need us to check batteries every few hours. When we first tried to work with the line in, we had interference. The camera power supply wasn’t grounded (there wasn’t even a ground out). The venue staff switched out the boxes they used for line out and it worked like a charm after that.

    We did not have a good start for the demo. Jim had pre-setup the networking on the boards from home and brought them to Prague. But whatever we did, we couldn’t connect to its network the night before the event. That was the day we had kept free to do this. That night we gave up, because we needed a monitor, an HDMI cable, and a keyboard to debug it. At the venue, we borrowed a keyboard and hooked up the board to the monitor. There was no user for dnsmasq, so it wasn’t assigning IPs, and that’s why the networking didn’t work. Once we got past that point, it was about getting the network to work with my laptop. That took a while. We decided to go with a server in the cloud as the Tendrl server. By evening, we got the playbook to run and had everything installed and configured. But I’d made a mistake. I used IPs instead of FQDNs, so the dashboard wouldn’t work. This meant re-installing the whole setup. That’s the point where I gave up on it.

    We even took the group picture from the balcony

    My original content for my talk was to look at our releases, especially to list out what we committed to at the start of the release and what we finished with. There is definitely a gap. This is common for software projects and how people estimate work. This topic was more or less covered on the first day. I instead focused on how we fail. How we fail our users, developers, and community. I followed the theme of my original talk a bit, pointing out that we can solve large problems in smaller chunks.

    We’re running a marathon, not a sprint.

    29 Nov 2017

    16 Nov 2017

    Upgrading the Gluster Jenkins Server

    I’ve been wanting to work on upgrading build.gluster.org setup for ages. There’s a lot about that setup that isn’t ideal in how people use Jenkins anymore.

    We used the unix user accounts for access to Jenkins. This means Jenkins needs to read /etc/passwd and everyone has SSH access via passwords by default. Very often, the username wasn’t tied to an actual email address. I had to guess the account owner based on their usernames elsewhere. This was also open to brute force attacks. The only way to change passwords was to log in to the server and run the passwd command. We fixed this problem a few months ago by switching our auth to GitHub. Now access control is handled through a GitHub group, and only membership in that group gives you more permissions. Merely logging in will not give you any more permissions than not logging in.

    Our todo list during the Jenkins upgrade

    The Jenkins community now recommends not running jobs on the master node at all. But our old setup depended on certain jobs always running on master. One by one, I’ve eliminated them so that they can now run on any node. The last job left is our release job. We make the tar from every release available on an FTP-like server. In our old setup, this server and Jenkins were the same machine. The job ran on master and depended on them both being the same machine. We decided to split up the systems so we could take down Jenkins without any issue. We intend to fix this with an SCP command at the end of the release job to copy artifacts to the FTP-like server.

    One of the Red Hat buildings in Brno

    Now, we have a Jenkins setup that I’m happy with. At this point, we’ve fixed a vast majority of the annoying CI-related infra issues. In a few years, we’ll rip them all out and re-do them. For now, spending a week with my colleague in Brno working on an Infra sprint has been well worth our time and energy.

    5 Nov 2017

    Catching up with Infrastructure Debt

    If you run an infrastructure, there’s a good chance you have some debt tucked in your system somewhere. There’s also a good chance that you’re not getting enough time to fix those debts. There will most likely be a good reason why something is done the way it is. This is just how things are in general. Since I joined Gluster, I’ve worked with my fellow sysadmin to tackle our large infrastructure technical debt over the course of time. It goes like this:

    • We run a pretty old version of Gerrit on CentOS 5.
    • We run a pretty old version of Jenkins on CentOS 6.
    • We run CentOS 6 for all our regressions machines.
    • We run CentOS 6 for all our build machines.
    • We run NetBSD on Rackspace in a setup that is not easy to automate nor is it currently part of our automation.
    • We have a bunch of physical machines in a DC, but we haven’t had time to move our VMs over and use Rackspace as burstable capacity.

    That is in no way an exhaustive list. But we’ve managed to tackle 2.5 items from the list. Here’s what we did in order:

    • Upgraded Gerrit to the then latest version.
    • Set up Gerrit staging to test newer versions regularly for scheduling migration.
    • Created new CentOS 7 VMs on our hardware and moved the builds in there.
    • Moved Gerrit over to a new CentOS 7 host.
    • Wrote ansible scripts to manage most of Gerrit, but deployed currently only to staging.
    • Upgraded Jenkins to the latest LTS.
    • Moved Jenkins to a CentOS 7 host (Done last week, more details coming up!)

    If I look at it, it almost looks like I’ve failed. But again, as with most infrastructure debt, you touch one thing and you realize it’s broken in some way and someone depended on that breakage. I’ve had to pick and prioritize what I would spend my time on. At the end of the day, I have to justify my time in terms of moving the project forward. Fixing the infrastructure debt for Gerrit was a great example. I could actually focus on it with everyone’s support. Fixing Jenkins was a priority since we wanted to use some of the newer features, and again I had backing to do that. Moving things to our hardware is where things get tricky. There are some financial goals we can hit if we make the move, but outside of that, we have no reason to move. Long-term, though, we want to be mostly on our own hardware, since we spent money on it. This is, understandably, going slowly. There’s a subtle capacity difference, and the noisy neighbor problem affects us quite strongly when we try to do anything in this regard.

    14 Oct 2017

    4 Oct 2017

    Containers aren’t just for applications

    Containers have grabbed so much attention because they demonstrated a way to solve the software packaging problem that the IT industry had been poking and prodding at for a very long time. Linux package management, application virtualization (in all its myriad forms), and virtual machines had all taken cuts at making it easier to bundle and install software along with its dependencies. But it was the container image format and runtime that is now standardized under the Open Container Initiative (OCI) that made real headway toward making applications portable across different systems and environments.

    Containers have also both benefited from and helped reinforce the shift toward cloud-native application patterns such as microservices. However, because the most purist approaches to cloud-native architectures de-emphasized stateful applications, the benefits of containers for storage portability haven’t received as much attention. That’s an oversight. Because it turns out that the ability to persist storage and make it portable matters, especially in the hybrid cloud environments spanning public clouds, private clouds, and traditional IT that are increasingly the norm.

    Data gravity

    One important reason that data portability matters is “data gravity,” a term coined by Dave McCrory. He’s since fleshed it out in more detail, but the basic concept is pretty simple. Because of network bandwidth limits, latency, costs, and other considerations, data “wants” to be near the applications analyzing, transforming, or otherwise working on it. This is a familiar idea in computer science. Non-Uniform Memory Access (NUMA) architectures — which describes pretty much all computer systems today to a greater or lesser degree — have similarly had to manage the physical locality of memory relative to the processors accessing that memory.

    Likewise, especially for applications that need fast access to data or that need to operate on large data sets, you need to think about where the data is sitting relative to the application using that data. And, if you decide to move an application from on-premise to a public cloud for rapid scalability or other reasons, you may find you need to move the data as well.

    Software-defined storage

    But moving data runs into some roadblocks. Networking limits and costs were and are one limitation; they’re a big part of data gravity in the first place. However, traditional proprietary data storage imposed its own restrictions. You can’t just fire up a storage array at an arbitrary public cloud provider to match the one in your own datacenter.

    Enter software-defined storage.

    As the name implies, software-defined storage decouples storage software from hardware. It lets you abstract and pool storage capacity across on-premise and cloud environments to scale independently of specific hardware components. Fundamentally, traditional storage was built for applications developed in the 1970s and 1980s. Software-defined storage is geared to support the applications of today and tomorrow, applications that look and behave nothing like the applications of the past. Among their requirements is rapid scalability, especially for high-volume unstructured data that may need to expand rapidly.

    However, with respect to data portability specifically, one of the biggest benefits of software-defined storage like Gluster is that the storage software itself runs on generic industry standard hardware and virtualized infrastructure. This means that you can spin up storage wherever it makes the most sense for reasons of cost, performance, or flexibility.

    Containerizing the storage

    What remains is to simplify the deployment of persistent software-defined storage. It turns out that containers are the answer to this as well. In effect, storage can be treated just like a containerized application within a Kubernetes cluster — Kubernetes being the orchestration tool that groups containerized application components into a complete application.
    With this approach, storage containers are deployed alongside other containers within the Kubernetes nodes. Rather than simply accessing ephemeral storage from within the container, this model deploys storage in its own containers, alongside the containerized application. For example, storage containers can implement a Red Hat Gluster Storage brick to create a highly-available GlusterFS volume that handles the storage resources present on each server node.

    Depending on system configuration, some nodes might only run storage containers, some might only run containerized applications, and some nodes might run a mixture of both. Using Kubernetes with its support for persistent storage as the overall coordination tool, additional storage containers could be easily started to accommodate storage demand, or to recover from a failed node. For instance, Kubernetes might start additional containerized web servers in response to demand or load, but might restart both application and storage containers in the event of a hardware failure.

    Kubernetes manages this through Persistent Volumes (PV). A PV is a resource in the cluster, just like a node is a cluster resource. PVs are volume plugins like Volumes, but have a lifecycle independent of any individual pod that uses the PV. This allows for clustered storage that doesn’t depend on the availability or health of any specific application container.

    Modular apps plus data

    The emerging model for cloud-native application designs is one in which components communicate through well-documented interfaces. Whether or not a given project adopts a pure “microservices” approach, applications are generally becoming more modular and service-oriented. Dependencies are explicitly declared and isolated. Scaling is horizontal.