8 Sep 2017

Sneak peek into Gluster’s native subdir mount feature

Gluster’s recently announced glusterfs-3.12 release brings a sub-directory mount option to the native FUSE mount. In this post, I would like to give snippets of how the functionality works!

Commit message

Below is the snippet from commit message:

glusterfsd: allow subdir mount
Changes:
1. Take subdir mount option in client (mount.gluster / glusterfsd)
2. Pass the subdir mount to server-handshake (from client-handshake)
3. Handle subdir-mount dir’s lookup in server-first-lookup and handle all fops resolution accordingly with proper gfid of subdir
4. Change the auth/addr module to handle the multiple subdir entries in option, and valid parsing.
How to use the feature:
# mount -t glusterfs $hostname:/$volname/$subdir /$mount_point 
Or
# mount -t glusterfs $hostname:/$volname -osubdir_mount=$subdir /$mount_point
Options can be set like:
# gluster volume set <volname> auth.allow "/(192.168.10.*|192.168.11.*),/subdir1(192.168.1.*),/subdir2(192.168.8.*)"

I am a sys-admin, why do I need this feature?

This feature provides namespace isolation for separate clients: a single Gluster volume can be shared with many different clients, each of which mounts only a subset of the volume namespace.

This is similar to NFS's subdirectory export feature, where one can export a subdirectory of an already exported volume. If you have a use case where you need to restrict full access to the volume (or to other users' data), this feature can be used.

All the features of Gluster work with subdir mounts; snapshots, however, work at the volume level, and hence we can't take a snapshot of just a single directory. Other than this, one can continue to use the feature as usual.

More things to know before using the feature

For any user starting fresh with glusterfs-3.12 or later, this feature is available by default, and with the default authentication being "*" (i.e., allow everyone) for the volume, any given subdirectory in the volume can be mounted by default. If the admin sets the auth.allow option to control access, then only the directories present in the auth.allow string will be allowed to be mounted.

If you have already set the auth.allow option, make sure to change its format to match the one described in the snippet above.
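
As an illustration, here is a minimal sketch of the workflow, assuming a hypothetical volume named shared served from host1 and example client subnets (the subdirectory has to exist inside the volume before a client can mount it):

# Restrict access: full volume for one subnet, /projects only for another
gluster volume set shared auth.allow "/(192.168.10.*),/projects(192.168.20.*)"

# On a trusted client, mount the whole volume once and create the subdirectory
mount -t glusterfs host1:/shared /mnt/shared
mkdir -p /mnt/shared/projects

# On a client in the 192.168.20.* network, mount only the subdirectory
mount -t glusterfs host1:/shared/projects /mnt/projects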

Try out the option, and write to us at gluster-users@gluster.org

6 Sep 2017

Where to buy bitcoin in Australia

This past week the Bitcoin drop hit the headlines quite a few times. China's ICO regulation announcement caused quite a stir, opening the opportunity for many speculators to jump in.

However, purchasing bitcoin in Australia is not as simple as that. Over the past week I went through three of the most popular options, jumped through the hoops of identity verification, and played the waiting game hoping the prices wouldn't shoot up again while my bank deposit took place.

BTCMarkets

BTCMarkets is an exchange where you place buy and sell orders for the amount of bitcoin you want to buy. They currently have a number of pairings like Bitcoin, Ethereum, Litecoin and Monero.

They charge a 0.85% sliding commission, which means the more you trade the lower the commission. Depositing funds is done through Poli, where they charge you a $3.30 fee and cap deposits at $2,000.

BTCMarkets seems to be the most popular with multiple trades happening every minute so you are not left there waiting for your buy order to be fulfilled (assuming you place a reasonable request).

I saw a large volume of orders ranging from $200 all the way up to $100k+ so it's definitely not a small time exchange.

However, their support is an absolute letdown. Prepare to wait days for an unhelpful response. My Poli deposit has taken almost a whole week and is still pending. There are lots of complaints on their Facebook page about slow payments too; one might guess they're just holding money during these dips.

If you wish to get in quickly, BTCMarkets is not the option, as their verification process can take up to 10 days since they opt to send you a letter in the mail for ID verification. Once you're in though, the trade volume definitely looks good.

Cointree

Unlike the other two options, Cointree only allows you to buy at the "market" price. You must put faith in their business model where they promise to find you the best possible price.

They charge a 3% commission which can be quite high if you plan on purchasing a large amount.

Cointree's payment model is not very comforting as you will be required to place an order without knowing exactly how much you are getting in return.

Cointree has three payment methods:

  • Bank transfer payments (with a limit of $500?!) can take up to 2-3 days, so you must wait and pray that the price does not fluctuate too much.
  • Their Poli payment option is "instant" so you are able to purchase at their current rate, however your Bitcoin is locked in their account for 1-2 days until the funds clear.
  • Their over the counter deposit at NAB bank allows you to deposit up to $5,000 and they claim it takes about 30 minutes for the confirmation.

If you simply want to buy Bitcoin without worrying about the price, then Cointree will be the best option for you.

If you understand the bitcoin market then their business model may not be the best fit for you. For example, I put through one bank transfer order during the dip, however by the time my funds cleared I ended up purchasing at a higher rate. My Poli transfer was "instant" but locked in the account for 2 days, not allowing me to withdraw the funds and trade on the exchanges. (This was depressing.) This is unlike Independent Reserve, who let you buy and withdraw instantly.

Cointree support seems to be very responsive; live chat is always online, however I found them to be rather rude and unhelpful.

If you do decide to end up using Cointree please use my referral link https://www.cointree.com.au/?r=3300

Independent Reserve

Independent Reserve is another exchange like BTCMarkets. They allow trading in multiple AUD pairings as well as USD and NZD.

Their website is rather confusing initially as it does not portray the same kind of feel as the other two websites.

The best thing about Independent Reserve is their verification and purchase time: I was able to verify and place an instant order with Poli in less than 1 hour. Unlike Cointree, Poli payments here allow you to withdraw the Bitcoin you purchase immediately.

This was my biggest savior as it allowed me to withdraw my funds straight away into exchanges like Bittrex or Binance and get back to trading.

Unfortunately their market seems to be a lot smaller, perhaps because of their diversity across multiple currency pairings. There were not as many trades happening there at the time I was looking to purchase (compared to BTCMarkets).

I cannot comment on their support as I did not use it, but the fact I did not have to use it is a good thing.

If you want to buy in NOW then these guys are the best choice considering the speed of ID verification and payment. Kudos to them.

If you decide to use the Independent Reserve, please checkout with my referral code https://www.independentreserve.com?invite=WJPMJN


In summary, if you want the best price and just want to rely on someone else, use the Cointree Poli instant payment. That way you can lock in a price which is generally slightly lower than other vendors'. Don't bother with their bank transfer or you will risk a price hike like I did.

Use Independent Reserve if you want quick bitcoin to take to the exchanges, and, well, BTCMarkets if you want to wait a long time... I'm still waiting.

As always, no FOMO and just HODL :)

21 Aug 2017

Clang Analyze for Gluster

Deepshika recently worked on getting a clang analyze job for Gluster set up with Jenkins. This job worked on both our laptops, but not on our build machines, which run CentOS. It appears the problem was that clang on CentOS is 3.4 vs 4.0 on Fedora 26. The job fails because one of our dependencies needs -fno-stack-protector, which wasn’t in clang until 3.8 or so. It’s been on my list of things to fix. I realized that the right way would be to use a newer version of clang. I could have just compiled clang or built 4.0 packages, but I didn’t want to end up having to maintain the package for our specific install. I decided to reduce complexity by doing the compilation inside a Fedora 26 chroot. This sounded like the option least likely to add maintenance burden. When I looked for documentation on how to go about this, I couldn’t find much. The mock man page, however, is very well written, and that’s all I needed. This is the script I used, with comments about each step.

#!/bin/bash
# Create a new chroot
sudo mock -r fedora-26-x86_64 --init

# Install the build dependencies
sudo mock -r fedora-26-x86_64 --install langpacks-en glibc-langpack-en automake autoconf libtool flex bison openssl-devel libxml2-devel python-devel libaio-devel libibverbs-devel librdmacm-devel readline-devel lvm2-devel glib2-devel userspace-rcu-devel libcmocka-devel libacl-devel sqlite-devel fuse-devel redhat-rpm-config clang clang-analyzer git

# Copy the Gluster source code inside the chroot at /src
sudo mock -r fedora-26-x86_64 --copyin $WORKSPACE /src

# Execute commands in the chroot to build with clang
sudo mock -r fedora-26-x86_64 --chroot "cd /src && ./autogen.sh"
sudo mock -r fedora-26-x86_64 --chroot "cd /src && ./configure CC=clang --enable-gnfs --enable-debug"
sudo mock -r fedora-26-x86_64 --chroot "cd /src && scan-build -o /src/clangScanBuildReports -v -v --use-cc clang --use-analyzer=/usr/bin/clang make"

# Copy the output back into the working directory
sudo mock -r fedora-26-x86_64 --copyout /src/clangScanBuildReports $WORKSPACE/clangScanBuildReports

# Clean up the chroot
sudo mock -r fedora-26-x86_64 --clean

16 Aug 2017

GlusterFS 3.8.15 is available, likely the last 3.8 update

The next Long-Term-Maintenance release for Gluster is around the corner. Once GlusterFS-3.12 is available, the oldest maintained version (3.8) will be retired and no maintenance updates are planned. With this last update to GlusterFS-3.8 a few more bugs have been fixed.

Packages for this release will become available for the different distributions and their versions listed on the community packages page.

Release notes for Gluster 3.8.15

This is a bugfix release. The Release Notes for 3.8.0, 3.8.1, 3.8.2, 3.8.3, 3.8.4, 3.8.5, 3.8.6, 3.8.7, 3.8.8, 3.8.9, 3.8.10, 3.8.11, 3.8.12, 3.8.13 and 3.8.14 contain a listing of all the new features that were added and bugs fixed in the GlusterFS 3.8 stable release.

End Of Life Notice

This is most likely the last bugfix release for the GlusterFS 3.8 Long-Term-Support version. GlusterFS 3.12 is planned to be released at the end of August 2017 and will be the next Long-Term-Support version. It is highly recommended to upgrade any Gluster 3.8 environment to either the 3.10 or 3.12 release. More details about the different Long-Term-Support versions can be found on the release schedule.

Bugs addressed

A total of 4 patches have been merged, addressing 4 bugs:
  • #1470495: gluster volume status --xml fails when there are 100 volumes
  • #1471613: metadata heal not happening despite having an active sink
  • #1480193: Running sysbench on vm disk from plain distribute gluster volume causes disk corruption
  • #1481398: libgfapi: memory leak in glfs_h_acl_get

29 Jul 2017

Hyper-converged GlusterFS + heketi on Kubernetes

gluster-kubernetes is a project to provide Kubernetes administrators a mechanism to easily deploy a hyper-converged GlusterFS cluster along with heketi onto an existing Kubernetes cluster. This is a convenient way to unlock the power of dynamically provisioned, persistent GlusterFS volumes in Kubernetes.

Link: https://github.com/gluster/gluster-kubernetes

Component Projects

  • Kubernetes, the container management system.
  • GlusterFS, the scale-out storage system.
  • heketi, the RESTful volume management interface for GlusterFS.

Presentations

You can find slides and videos of community presentations here.

>>> Video demo of the technology! <<<

Documentation

Quickstart

You can start with your own Kubernetes installation ready to go, or you can use the vagrant setup in the vagrant/ directory to spin up a Kubernetes VM cluster for you. To run the vagrant setup, you'll need to have the following installed:

  • ansible
  • vagrant
  • libvirt or VirtualBox

To spin up the cluster, simply run ./up.sh in the vagrant/ directory.

Next, copy the deploy/ directory to the master node of the cluster.

You will have to provide your own topology file. A sample topology file is included in the deploy/ directory (the default location that gk-deploy expects), which can be used as the topology for the vagrant libvirt setup. When creating your own topology file:

  • Make sure the topology file only lists block devices intended for heketi’s use. heketi needs access to whole block devices (e.g. /dev/sdb, /dev/vdb) which it will partition and format.
  • The hostnames array is a bit misleading. manage should be a list of hostnames for the node, but storage should be a list of IP addresses on the node for backend storage communications, as in the sketch below.
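
For reference, a minimal single-node sketch of such a topology file might look like this (the hostname, IP address, and device name are made-up placeholders; the bundled topology.json.sample shows the full multi-node structure):

{
  "clusters": [
    {
      "nodes": [
        {
          "node": {
            "hostnames": {
              "manage": ["node0.example.com"],
              "storage": ["10.0.0.10"]
            },
            "zone": 1
          },
          "devices": ["/dev/vdb"]
        }
      ]
    }
  ]
}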

If you used the provided vagrant libvirt setup, you can run:

$ vagrant ssh-config > ssh-config
$ scp -rF ssh-config ../deploy master:
$ vagrant ssh master
[vagrant@master]$ cd deploy
[vagrant@master]$ mv topology.json.sample topology.json

The following commands are meant to be run with administrative privileges (e.g. sudo su beforehand).

At this point, verify the Kubernetes installation by making sure all nodes are Ready:

$ kubectl get nodes
NAME      STATUS    AGE
master    Ready     22h
node0     Ready     22h
node1     Ready     22h
node2     Ready     22h

NOTE: To see the version of Kubernetes (which will change based on latest official releases) simply do kubectl version. This will help in troubleshooting.

Next, to deploy heketi and GlusterFS, run the following:

$ ./gk-deploy -g

If you already have a pre-existing GlusterFS cluster, you do not need the -g option.

After this completes, GlusterFS and heketi should now be installed and ready to go. You can set the HEKETI_CLI_SERVER environment variable as follows so that it can be read directly by heketi-cli or sent to something like curl:

$ export HEKETI_CLI_SERVER=$(kubectl get svc/heketi --template 'http://{{.spec.clusterIP}}:{{(index .spec.ports 0).port}}')
$ echo $HEKETI_CLI_SERVER
http://10.42.0.0:8080
$ curl $HEKETI_CLI_SERVER/hello
Hello from Heketi

Your Kubernetes cluster should look something like this:

$ kubectl get nodes,pods
NAME      STATUS    AGE
master    Ready     22h
node0     Ready     22h
node1     Ready     22h
node2     Ready     22h
NAME                               READY     STATUS    RESTARTS   AGE
glusterfs-node0-2509304327-vpce1   1/1       Running   0          1d
glusterfs-node1-3290690057-hhq92   1/1       Running   0          1d
glusterfs-node2-4072075787-okzjv   1/1       Running   0          1d
heketi-3017632314-yyngh            1/1       Running   0          1d

You should now also be able to use heketi-cli or any other client of the heketi REST API (like the GlusterFS volume plugin) to create/manage volumes and then mount those volumes to verify they're working. To see an example of how to use this with a Kubernetes application, see the following:

Hello World application using GlusterFS Dynamic Provisioning
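
Alternatively, as a quick command-line sanity check, you could create and mount a small throwaway volume. This is only a sketch with placeholder names; it assumes heketi-cli is installed where you run it and that HEKETI_CLI_SERVER is exported as shown above:

$ heketi-cli volume create --size=1
$ heketi-cli volume list
$ mount -t glusterfs <ip-from-the-mount-line>:<volume-name> /mnt
$ touch /mnt/hello && umount /mnt
$ heketi-cli volume delete <volume-id>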

14 Jul 2017

GlusterFS 3.8.14 is here, 3.8 even closer to End-Of-Life

The 10th of the month has passed again, that means a 3.8.x update can't be far out. So, here it is, we're announcing the availability of glusterfs-3.8.14. Note that this is one of the last updates in the 3.8 Long-Term-Maintenance release stream. This schedule on the website shows what options you have for upgrading your environment. Remember that many distributions have packages included in their standard repositories, and other versions might be available from external locations. All the details about what packages to find where are on the Community Package page in the docs.

Release notes for Gluster 3.8.14

This is a bugfix release. The Release Notes for 3.8.0, 3.8.1, 3.8.2, 3.8.3, 3.8.4, 3.8.5, 3.8.6, 3.8.7, 3.8.8, 3.8.9, 3.8.10, 3.8.11, 3.8.12 and 3.8.13 contain a listing of all the new features that were added and bugs fixed in the GlusterFS 3.8 stable release.

Bugs addressed

A total of 3 patches have been merged, addressing 2 bugs:
  • #1462447: brick maintenance - no client reconnect
  • #1467272: Heal info shows incorrect status

11 Jul 2017

Storage for RHV and OCP: Two Glusters on one platform

Architecture is an interesting discipline. There are whitepapers and best practices and reference architectures to offer pristine views of what your perfect deployment should look like. And then there are budgets and timelines and business requirements to derail all of that. It’s what makes this job so interesting and challenging—hacking together the best pieces of disparate and often seemingly unrelated systems to meet goals driven by six leaders whose bonuses are tied to completely different metrics.

A recent project has involved combining OpenShift Container Platform (OCP), Red Hat Virtualization (RHV), and Red Hat Gluster Storage (Gluster) into a unified system with common lifecycle operations, minimized management points, and the lowest overall footprint in terms of both capital cost and TCO. The primary storage challenge here is in creating a Gluster environment to support both RHV and its VMs as well as OCP container persistent volume requirements.

Our architectural goals include:

  • Purchase a single flexible hardware platform to serve all the storage needs
  • Segregate Gluster for RHV and Gluster for OCP into separate pools for resource allocation and to avoid possible administration snafus (such as we experienced in early testing)
  • Maintain a single-point and single-method of management—one Heketi server to rule them all
  • Containerize as much as possible to keep lifecycle maintenance atomic

Our early version of the architecture had Gluster running as container-native storage (CNS) for OCP on top of RHV while also serving storage to RHV, but this proved to introduce a chicken-and-egg problem where a single failure (such as an etcd crash) could cause a cascading outage. So our redesign involved splitting Gluster off from OCP as a stand-alone system while still being a unified storage provider and leveraging container atomicity.

The approach we wanted involved containerized Gluster running on bare-metal container hosts. Fundamentally, this is actually pretty straightforward today with pre-built Gluster containers available from the Red Hat registry. What complicated this was our desire to run two separate containerized Gluster pools on the same hardware nodes.

Disclaimer

There’s a pretty good chance that this architecture is not explicitly supported by Red Hat. While all the components we use here are definitely supported, this particular combination is untested by our engineering, QE, and performance teams. Don’t consider anything here a recommendation for how you should run your environment, only an academic study of a possible approach to solving an interesting challenge. If you have any questions, please reach out to your Red Hat sales and support teams.

The platform

We initially wanted to build this on top of Red Hat Enterprise Linux Atomic Host, but our lab environment wasn’t set up to provision this build on our systems, so we had to go forward with RHEL plus the docker packages. For a production build, we would return to using Atomic.

Networking

Gluster containers are usually configured with host networking because they need to communicate freely with each other and need to serve storage out to other systems and containers. However, with host networking, the Gluster ports are bound to all interfaces, so it is not possible to run two Gluster containers in this mode due to port conflicts. To solve this, the networks for each Gluster pool had to be segregated.

First, a VLAN sub-interface with VLAN ID 199 was created on the storage network interface of each Gluster node. There are ifcfg files to make these persistent. So each node includes a 192.168.99.0/24 IP on the primary interface and a 192.168.199.0/24 IP on the VLAN sub-interface. The switch ports for the storage network interfaces have been configured for the tagged VLAN ID 199. The 802.1q kernel module (for VLANs) was set to load at boot time on each node with a /etc/modules-load.d/8021q.conf file.
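
For illustration, the persistent ifcfg file for the tagged sub-interface could look roughly like this (the interface name matches this setup, while the node address shown is a placeholder):

# /etc/sysconfig/network-scripts/ifcfg-eth1.199
DEVICE=eth1.199
VLAN=yes
BOOTPROTO=none
IPADDR=192.168.199.11
PREFIX=24
ONBOOT=yes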

Containerized Gluster

Networks

Each Gluster container needs to exist on its own interface and subnet. So, leveraging the system-level network configuration done above, the two interfaces were each attached to a docker macvlan network on each node.

docker network create -d macvlan --subnet=192.168.99.0/24 \
    -o parent=eth1 gluster-rhv-net
docker network create -d macvlan --subnet=192.168.199.0/24 \
    -o parent=eth1.199 gluster-ocp-net

Containers

The containers were pulled down from the Red Hat registry.

docker pull registry.access.redhat.com/rhgs3/rhgs-server-rhel7
docker pull registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7

The Gluster containers need to be privileged in order to access the /dev/sdX block devices. They also need a number of local persistent volume stores in order to ensure they start up properly each time.

The container's fstab file needs to live on a persistent mount, so first we should touch these files; otherwise the gluster-startup command in the container will fail.

touch /var/lib/heketi-{rhv,ocp}/fstab

Then we can run the containers.

docker run -d --privileged=true --net=gluster-rhv-net \
    --ip=192.168.99.28 --name=gluster-rhv-1 -v /run \
    -v /home/gluster-rhv-1-root:/root:z \
    -v /etc/glusterfs-rhv:/etc/glusterfs:z \
    -v /var/lib/glusterd-rhv:/var/lib/glusterd:z \
    -v /var/log/glusterfs-rhv:/var/log/glusterfs:z \
    -v /var/lib/heketi-rhv:/var/lib/heketi:z \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    -v /dev:/dev rhgs3/rhgs-server-rhel7

docker run -d --privileged=true --net=gluster-ocp-net \
    --ip=192.168.199.28 --name=gluster-ocp-1 -v /run \
    -v /home/gluster-ocp-1-root:/root:z \
    -v /etc/glusterfs-ocp:/etc/glusterfs:z \
    -v /var/lib/glusterd-ocp:/var/lib/glusterd:z \
    -v /var/log/glusterfs-ocp:/var/log/glusterfs:z \
    -v /var/lib/heketi-ocp:/var/lib/heketi:z \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    -v /dev:/dev rhgs3/rhgs-server-rhel7

Block device assignments

Running the containers in privileged mode allows them to access all system block devices. For our particular architectural needs, we intend to use only one SSD from each node for the gluster-rhv pool and the remaining five SSDs for the gluster-ocp pool.

Gluster Pool    Block Devices
gluster-rhv     sdb
gluster-ocp     sdc, sdd, sde, sdf, sdg

Heketi

Config

The persistent Heketi config is being stored in the /etc/heketi directory on one of the nodes (we’ll call it node1). First, an ssh keypair is created and placed there.

ssh-keygen -f /etc/heketi/heketi_key -t rsa -N ''

Next, the heketi.json file is created. Right now, no auth is being used — obviously don’t do this in production. Note the ssh port is 2222, which is what the Gluster containers are configured to listen on.

{
  "_port_comment": "Heketi Server Port Number",
  "port": "8080",

  "_use_auth": "Enable JWT authorization. Please enable for deployment",
  "use_auth": false,

  "_jwt": "Private keys for access",
  "jwt": {
    "_admin": "Admin has access to all APIs",
    "admin": {
      "key": "My Secret"
    },
    "_user": "User only has access to /volumes endpoint",
    "user": {
      "key": "My Secret"
    }
  },

  "_glusterfs_comment": "GlusterFS Configuration",
  "glusterfs": {
    "_executor_comment": [
      "Execute plugin. Possible choices: mock, ssh",
      "mock: This setting is used for testing and development.",
      "      It will not send commands to any node.",
      "ssh:  This setting will notify Heketi to ssh to the nodes.",
      "      It will need the values in sshexec to be configured.",
      "kubernetes: Communicate with GlusterFS containers over",
      "            Kubernetes exec api."
    ],
    "executor": "ssh",

    "_sshexec_comment": "SSH username and private key file information",
    "sshexec": {
      "keyfile": "/etc/heketi/heketi_key",
      "user": "root",
      "port": "2222"
    },

    "_db_comment": "Database file name",
    "db": "/var/lib/heketi/heketi.db",

    "_loglevel_comment": [
      "Set log level. Choices are:",
      "  none, critical, error, warning, info, debug",
      "Default is warning"
    ],
    "loglevel" : "debug"
  }
}

SSH access

The Heketi server needs passwordless SSH access to all Gluster containers on port 2222. The public key generated above needs to be added to the authorized_keys for all of the Gluster containers. Note that we have a local persistent volume (PV) for each Gluster container’s /root directory, so this authorized_key entry was simply added to each one of those.

cat /etc/heketi/heketi_key.pub >> \
    /home/gluster-rhv-1-root/.ssh/authorized_keys

NOTE: This needs to be done for each of the root home directories for each Gluster container
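
For example, on node1 that repetition could be scripted roughly as follows (the directory names follow the -v /home/gluster-*-root:/root bind mounts used above; on the other nodes the public key has to be copied over first, e.g. with scp):

for pool in rhv ocp; do
    # Append the Heketi public key to the container's bind-mounted root home
    mkdir -p /home/gluster-${pool}-1-root/.ssh
    cat /etc/heketi/heketi_key.pub >> /home/gluster-${pool}-1-root/.ssh/authorized_keys
    # sshd inside the container expects strict permissions
    chmod 700 /home/gluster-${pool}-1-root/.ssh
    chmod 600 /home/gluster-${pool}-1-root/.ssh/authorized_keys
done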

Container

The single Heketi container will run on node1. It needs access to both of the subnets, so the best thing to do is run the container in host networking mode. It also needs a few persistent volumes.

docker run -d --net=host --name=gluster-heketi \
    -v /etc/heketi:/etc/heketi:z -v /var/lib/heketi:/var/lib/heketi:z \
    rhgs3/rhgs-volmanager-rhel7

Network

Since we are running heketi-cli on the same node that we are running the Heketi container, there is a security issue we have to work through. By default, the container host cannot directly access the local container via the IP assigned to its macvlan network interface. So on the container host node1 we need to create local macvlan interfaces for each of the subnets. These are set up at runtime with the commands below, which are also added to the /etc/rc.d/rc.local file:

/usr/sbin/ip link add macvlan0 link eth1 type macvlan mode bridge
/usr/sbin/ip addr add 192.168.99.228/24 dev macvlan0
/usr/sbin/ifconfig macvlan0 up

/usr/sbin/ip link add macvlan1 link eth1.199 type macvlan mode bridge
/usr/sbin/ip addr add 192.168.199.228/24 dev macvlan1
/usr/sbin/ifconfig macvlan1 up

The rc.local file in RHEL is for legacy support, so it has to be made executable and its systemd service has to be enabled.

chmod 755 /etc/rc.d/rc.local
systemctl enable rc-local.service

Heketi CLI

The heketi-cli needs to run $somewhere. For simplicity, the RPM is installed on node1. With the container running with networking in host mode, heketi is listening on localhost port 8080. Export the environment variable in order to be able to run heketi-cli commands.

export HEKETI_CLI_SERVER=http://localhost:8080

Setting up the Heketi clusters

A JSON file is populated at /root/heketi-rhv-plus-ocp-topology.json on node1. This file defines two separate Heketi clusters with their respective Gluster nodes (containers) and block devices.

{
    "clusters": [
        {
            "nodes": [
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.99.28"
                            ],
                            "storage": [
                                "192.168.99.28"
                            ]
                        },
                        "zone": 1
                    },
                    "devices": [
                        "/dev/sdb"
                    ]
                },
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.99.29"
                            ],
                            "storage": [
                                "192.168.99.29"
                            ]
                        },
                        "zone": 2
                    },
                    "devices": [
                        "/dev/sdb"
                    ]
                },
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.99.30"
                            ],
                            "storage": [
                                "192.168.99.30"
                            ]
                        },
                        "zone": 3
                    },
                    "devices": [
                        "/dev/sdb"
                    ]
                }
            ]
        },

        {
            "nodes": [
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.199.28"
                            ],
                            "storage": [
                                "192.168.199.28"
                            ]
                        },
                        "zone": 1
                    },
                    "devices": [
                        "/dev/sdc",
                        "/dev/sdd",
                        "/dev/sde",
                        "/dev/sdf",
                        "/dev/sdg"
                    ]
                },
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.199.29"
                            ],
                            "storage": [
                                "192.168.199.29"
                            ]
                        },
                        "zone": 2
                    },
                    "devices": [
                        "/dev/sdc",
                        "/dev/sdd",
                        "/dev/sde",
                        "/dev/sdf",
                        "/dev/sdg"
                    ]
                },
                {
                    "node": {
                        "hostnames": {
                            "manage": [
                                "192.168.199.30"
                            ],
                            "storage": [
                                "192.168.199.30"
                            ]
                        },
                        "zone": 3
                    },
                    "devices": [
                        "/dev/sdc",
                        "/dev/sdd",
                        "/dev/sde",
                        "/dev/sdf",
                        "/dev/sdg"
                    ]
                }
            ]
        }
    ]
}

This file is passed (once) to Heketi to set up the two clusters.

heketi-cli topology load --json=heketi-rhv-plus-ocp-topology.json

It’s important to note the two different clusters. It’s not (AFAIK) possible to “name” the clusters, so we have to reference them by their UUIDs. The Gluster volumes for RHV will be created on one cluster, and those orchestrated for OCP PVs will be created on a different cluster.
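
Once the topology has been loaded, the cluster UUIDs can be listed and inspected with heketi-cli (the IDs shown here are the ones from this setup; your output will differ):

heketi-cli cluster list
heketi-cli cluster info ae2a309d02781816adfed567693221a9
heketi-cli cluster info 74edade536c80f14486edfbabd204151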

RHV Gluster volumes

For the purposes of RHV, two volumes were requested—one for the Hosted Engine and one for the VM storage. These were created via heketi-cli. Note the cluster ID passed to the commands.

heketi-cli volume create --size 100 --name rhv-hosted-engine \
    --clusters ae2a309d02781816adfed567693221a9
heketi-cli volume create --size 1024 --name rhv-virtual-machines \
    --clusters ae2a309d02781816adfed567693221a9

These can be mounted to the RHV nodes via the 192.168.99.0/24 subnet using the Gluster native client. Example fstab entries:

192.168.99.28:rhv-hosted-engine      /100g   glusterfs       backupvolfile-server=192.168.99.29:192.168.99.30 0 0
192.168.99.28:rhv-virtual-machines      /1t   glusterfs       backupvolfile-server=192.168.99.29:192.168.99.30 0 0

OCP PV Gluster volumes

Our OCP pods are attached to the 192.168.199.0/24 subnet to communicate with the storage. First on node1 the Heketi API port (8080) needs to be opened in the firewall.

firewall-cmd --add-port 8080/tcp
firewall-cmd --add-port 8080/tcp --permanent

Then the storage class for OCP is defined with the below YAML. Note that we aren’t currently doing any authentication (but obviously we should). You see here that we explicitly define the Heketi cluster ID for this class in order to ensure that all volumes for PVCs are created only on the Gluster pool we have identified for OCP use.

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
 name: gluster-dyn
provisioner: kubernetes.io/glusterfs
parameters:
 resturl: "http://192.168.199.128:8080"
 restauthenabled: "false"
 clusterid: "74edade536c80f14486edfbabd204151"

Then the storage class is added to OCP on the master.

oc create -f glusterfs-storageclass.yaml

From this point, PVCs (persistent volume claims) made against this storage class will interface with Heketi to dynamically provision Gluster volumes to match the claim.
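
For example, a claim against this storage class might look like the following sketch, saved to a file and created with oc create -f on the master (the claim name and size are arbitrary; newer OpenShift/Kubernetes versions use spec.storageClassName instead of the beta annotation):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: gluster-dyn-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: gluster-dyn
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi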

Miscellaneous

Auto-start containers

Docker container systemd init scripts are tricky. I’ve found that every example on the internet is either wrong, outdated, or uses an approach I don’t like.

Below is an example systemd service file for the Heketi container, which is simple and works the way we expect it to with the docker run command in the ExecStart (/etc/systemd/system/docker-container-gluster-heketi.service). NOTE: Do not daemonize (-d) the docker run command in the init script. Also, the SuccessExitStatus is important here.

[Unit]
Description=Gluster Heketi Container
Requires=docker.service
After=docker.service

[Service]
TimeoutStartSec=60
Restart=on-abnormal
SuccessExitStatus=0 137
ExecStartPre=-/usr/bin/docker stop gluster-heketi
ExecStartPre=-/usr/bin/docker rm gluster-heketi
ExecStart=/usr/bin/docker run --net=host --name=gluster-heketi -v /etc/heketi:/etc/heketi:z -v /var/lib/heketi:/var/lib/heketi:z rhgs3/rhgs-volmanager-rhel7
ExecStop=/usr/bin/docker stop gluster-heketi

[Install]
WantedBy=multi-user.target

Reload the systemd daemon:

systemctl daemon-reload

Enable and start the service:

systemctl enable docker-container-gluster-heketi

systemctl start docker-container-gluster-heketi

Known issues and TODOs

  • Security needs to be taken into account. We’ll set up appropriate key-based authentication and JWT for Heketi. We’d also like to use role-based auth. Hopefully we’ll cover this in a future blog post.
  • Likely $other_things I haven’t realized yet, or better ways of approaching this. I’d love to hear your comments.

28 Jun 2017

GlusterFS 3.8.13 update available, and 3.8 nearing End-Of-Life

The Gluster releases follow a 3-month cycle, with alternating Short-Term-Maintenance and Long-Term-Maintenance versions. GlusterFS 3.8 is currently the oldest Long-Term-Maintenance release, and will become End-Of-Life with the GlusterFS 3.12 version. If all goes according to plan, 3.12 will get released in August and is the last 3.x version before Gluster 4.0 hits the disks.

There will be a few more releases in the GlusterFS 3.8 line, but users should start to plan an upgrade to a version that receives regular bugfix updates after August.

Release notes for Gluster 3.8.13

This is a bugfix release. The Release Notes for 3.8.0, 3.8.1, 3.8.2, 3.8.3, 3.8.4, 3.8.5, 3.8.6, 3.8.7, 3.8.8, 3.8.9, 3.8.10, 3.8.11 and 3.8.12 contain a listing of all the new features that were added and bugs fixed in the GlusterFS 3.8 stable release.

Bugs addressed

A total of 13 patches have been merged, addressing 8 bugs:
  • #1447523: Glusterd segmentation fault in ' _Unwind_Backtrace' while running peer probe
  • #1449782: quota: limit-usage command failed with error " Failed to start aux mount"
  • #1449941: When either killing or restarting a brick with performance.stat-prefetch on, stat sometimes returns a bad st_size value.
  • #1450055: [GANESHA] Adding a node to existing cluster failed to start pacemaker service on new node
  • #1450380: GNFS crashed while taking lock on a file from 2 different clients having same volume mounted from 2 different servers
  • #1450937: [New] - Replacing an arbiter brick while I/O happens causes vm pause
  • #1460650: posix-acl: Whitelist virtual ACL xattrs
  • #1460661: "split-brain observed [Input/output error]" error messages in samba logs during parallel rm -rf

30 May 2017

Library of Ceph and Gluster reference architectures – Simplicity on the other side of complexity

The Storage Solution Architectures team at Red Hat develops reference architectures, performance and sizing guides, and test drives for Gluster- and Ceph-based solutions. We’re a group of architects who perform lab validation, tuning, and interoperability development for composable storage services with target workloads on optimized server and network configurations. We seek simplicity on the other side of complexity.

At the end of this blog entry is a full library of our current publications and test drives.

In our modern era, a top company asset is pivotability. Pivotability based on external market changes. Pivotability after unknowns become known. Pivotability after golden ideas become dark alleys. For most enterprises, pivotability requires a composable technology infrastructure for shifting resources to meet changing needs. Composable storage services, such as those provided by Ceph and Gluster, are part of many companies’ composable infrastructures.

Composable technology infrastructures are most frequently described by the following attributes:

  • Open source v. closed development.
  • On-demand architectures v. fixed architectures.
  • Commodity hardware v. proprietary appliances.
  • Cross-industry collaboration v. isolated single-vendor silos.

As noted in the following figure, a few companies with large staffs of in-house experts can create composable infrastructures from raw technologies. Their large investments in in-house expertise allow them to convert raw technologies into solutions with limited pre-integration by technology suppliers. AWS, Google, and Azure are all examples of DIY businesses. A larger number of other companies, also needing composable infrastructures, rely on technology suppliers and the community for solution pre-integration and guidance to reduce their in-house expertise costs. We’ll label them “Assisted DIY.” Finally, the majority of global enterprises lack the in-house expertise for deploying these composable infrastructures. They rely on public cloud providers and pre-packaged solutions for their infrastructure needs. We’ll call them “Pre-packaged.”

The reference architectures, performance and sizing guides, and test drives produced by our team are primarily focused on the “Assisted DIY” segment of companies. Additionally, we strive to make Gluster and Ceph composable storage services available to the “Pre-packaged” segment of companies by using what we learn to produce pre-packaged combinations of Red Hat software with partner hardware targeting specific workload use cases.

We enjoy our roles at Red Hat because of the many of you with whom we collaborate to produce value.  We hope you find these guides useful.

Team-produced with partner collaboration:

Partner-produced with team collaboration:

Pre-packaged solutions:

Hands-on test drives:

22 May 2017

Enjoy more bugfixes with GlusterFS 3.8.12

Like every month, there is an update for the GlusterFS 3.8 stable version. A few more bugfixes have been included in this release. Packages are already available for many distributions, some distributions might still need to promote the update from their testing repository to release, so hold tight if there is no update for your favourite OS yet.

Release notes for Gluster 3.8.12

This is a bugfix release. The Release Notes for 3.8.0, 3.8.1, 3.8.2, 3.8.3, 3.8.4, 3.8.5, 3.8.6, 3.8.7, 3.8.8, 3.8.9, 3.8.10 and 3.8.11 contain a listing of all the new features that were added and bugs fixed in the GlusterFS 3.8 stable release.

Bugs addressed

A total of 13 patches have been merged, addressing 11 bugs:
  • #1440228: NFS Sub-directory mount not working on solaris10 client
  • #1440635: Application VMs with their disk images on sharded-replica 3 volume are unable to boot after performing rebalance
  • #1440810: Update rfc.sh to check Change-Id consistency for backports
  • #1441574: [geo-rep]: rsync should not try to sync internal xattrs
  • #1441930: [geo-rep]: Worker crashes with [Errno 16] Device or resource busy: '.gfid/00000000-0000-0000-0000-000000000001/dir.166 while renaming directories
  • #1441933: [Geo-rep] If for some reason MKDIR failed to sync, it should not proceed further.
  • #1442933: Segmentation fault when creating a qcow2 with qemu-img
  • #1443012: snapshot: snapshots appear to be failing with respect to secure geo-rep slave
  • #1443319: Don't wind post-op on a brick where the fop phase failed.
  • #1445213: Unable to take snapshot on a geo-replicated volume, even after stopping the session
  • #1449314: [whql][virtio-block+glusterfs]"Disk Stress" and "Disk Verification" job always failed on win7-32/win2012/win2k8R2 guest

2 May 2017

Struggling to containerize stateful applications in the cloud?

Struggling to containerize stateful applications in the cloud? Here’s how with Red Hat Gluster Storage.

The newest release of Red Hat’s Reference Architecture “OpenShift Container Platform 3.5 on Amazon Web Services” now incorporates container-native storage, a unique approach based on Red Hat Gluster Storage to avoid lock-in, enable stateful applications, and simplify those applications’ high availability.

In the beginning, everything was so simple. Instead of going through the bureaucracy and compliance-driven process of requesting compute, storage, and networking resources, I would pull out my corporate credit card and register at the cloud provider of my choice. Instead of spending weeks forecasting the resource needs and costs of my newest project, I would get started in less than 1 hour. Much lower risk, virtually no capital expenditure for my newest pet project. And seemingly endless capacity — well, as long as my credit card was covered. If my project didn’t turn out to be a thing, I didn’t end up with excess infrastructure, either.

Until I found out that basically what I was doing was building my newest piece of software against a cloud mainframe. Not directly, of course. I was still operating on top of my operating system with the libraries and tools of my choice, but essentially I spent significant effort getting to that point with regard to orchestration and application architecture. And these are not easily ported to another cloud provider.

I realize that cloud providers are vertically integrated stacks, just as mainframes were. Much more modern and scalable with an entirely different cost structure — but, still, eventually and ultimately, lock-in.

Avoid provider lock-in with OpenShift Container Platform

This is where OpenShift comes in. I take orchestration and development cycles to a whole new level when I stop worrying about operating system instances, storage capacity, network overlays, NAT gateways, firewalls — all the things I need to make my application accessible and provide value.

Instead, I deal with application instances, persistent volumes, services, replication controllers, and build configurations — things that make much more sense to me as an application developer as they are closer to what I am really interested in: deploying new functionality into production. Thus, OpenShift offers abstraction on top of classic IT infrastructure and instead provides application infrastructure. The key here is massive automation on top of the concept of immutable infrastructure, thereby greatly enhancing the capability to bring new code into production.

The benefit is clear: Once I have OpenShift in place, I don’t need to worry about any of the underlying infrastructure — I don’t need to be aware of whether I am actually running on OpenStack, VMware, Azure, Google Cloud, or Amazon Web Services (AWS). My new common denominator is the interface of OpenShift powered by Kubernetes, and I can forget about what’s underneath.

Well, not quite. While OpenShift provides a lot of drivers for various underlying infrastructure, for instance storage, they are all somewhat different. Their availability, performance, and feature sets are tied to the underlying provider, for instance Elastic Block Storage (EBS) on AWS. I need to make sure that critical aspects of the infrastructure below OpenShift are reflected in the OpenShift topology. A good example is AWS availability zones (AZs): they are failure domains within a region, across which an application's instances should be distributed to avoid downtime in the event a single AZ is lost. So OpenShift nodes need to be deployed in multiple AZs.

This is where another caveat comes in: EBS volumes are present only inside a single AZ. Therefore, my application must replicate the data across other AZs if it uses EBS to store it.

So there are still dependencies and limitations a developer or operator must be aware of, even if OpenShift has drivers on board for EBS and will take care about provisioning.

Introducing container-native storage

With container-native storage (CNS), we now have a robust, scalable, and elastic storage service out-of-the-box for OpenShift Container Platform — based on Red Hat Gluster Storage. The trick: GlusterFS runs containerized on OpenShift itself. Thus, it runs on any platform that OpenShift is supported on — which is basically everything, from bare metal, to virtual, to private and public cloud.

With CNS, OpenShift gains a consistent storage feature set across, and independent of, all supported cloud providers. It’s deployed with native OpenShift / Kubernetes resources, and GlusterFS ends up running in pods as part of a DaemonSet:

[ec2-user@ip-10-20-4-55 ~]$ oc get pods
NAME              READY     STATUS    RESTARTS   AGE
glusterfs-0bkgr   1/1       Running   9          7d
glusterfs-4fmsm   1/1       Running   9          7d
glusterfs-bg0ls   1/1       Running   9          7d
glusterfs-j58vz   1/1       Running   9          7d
glusterfs-qpdf0   1/1       Running   9          7d
glusterfs-rkhpt   1/1       Running   9          7d
heketi-1-kml8v    1/1       Running   8          7d

The pods are running in privileged mode to access the nodes’ block devices directly. Furthermore, for optimal performance, the pods are using host-networking mode. This way, OpenShift nodes are running a distributed, software-defined, scale-out file storage service, just like any other distributed micro-service application.

There is an additional pod deployed that runs heketi — a RESTful API front end for GlusterFS. OpenShift natively integrates via a dynamic storage provisioner plug-in with this service to request and delete storage volumes on behalf of the user. In turn, heketi controls one or more GlusterFS Trusted Storage Pools.

Container-native storage on Amazon Web Services

The EBS provisioner has been available for OpenShift for some time. To understand what changes with CNS on AWS, a closer look at how EBS is accessible to OpenShift is in order.

Dynamic provisioning
EBS volumes are dynamically created and deleted as part of storage provisioning requests (PersistentVolumeClaims) in OpenShift.

Local block storage
EBS appears to the EC2 instances as a local block device. Once provisioned, it is attached to the EC2 instance, and a PCI interrupt is triggered to inform the operating system.

NAME                              MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda                              202:0    0   15G  0 disk
├─xvda1                           202:1    0    1M  0 part
└─xvda2                           202:2    0   15G  0 part /
xvdb                              202:16   0   25G  0 disk
└─xvdb1                           202:17   0   25G  0 part
  ├─docker_vol-docker--pool_tmeta 253:0    0   28M  0 lvm
  │ └─...                         253:2    0 23.8G  0 lvm
  │   ├─...                       253:8    0    3G  0 dm
  │   └─...                       253:9    0    3G  0 dm
  └─docker_vol-docker--pool_tdata 253:1    0 23.8G  0 lvm
    └─docker_vol-docker--pool     253:2    0 23.8G  0 lvm
      ├─...                       253:8    0    3G  0 dm
      └─...                       253:9    0    3G  0 dm
xvdc                              202:32   0   50G  0 disk
xvdd                              202:48   0  100G  0 disk

OpenShift on AWS also uses EBS to back local docker storage. EBS storage is formatted with a local filesystem like XFS.

Not shared storage
EBS volumes cannot be attached to more than one EC2 instance. Thus, all pods mounting an EBS-based PersistentVolume in OpenShift must run on the same node. The local filesystem on top of the EBS block device does not support clustering either.

AZ-local storage
EBS volumes cannot cross AZs. Thus, OpenShift cannot fail over pods mounting EBS storage into different AZs. Basically, an EBS volume is a failure domain.

Performance characteristics
The type of EBS storage, as well as capacity, must be selected up front. Specifically, for fast storage a certain minimum capacity must be requested to have a minimum performance level in terms of IOPS.

This is the lay of the land. While these characteristics may be acceptable for stateless applications that only need to have local storage, they become an obstacle for stateful applications.

People want to containerize databases, as well. Following a micro-service architecture where every service maintains its own state and data model, this request will become more common. The nature of these databases differs from the classic, often relational, database management system IT organizations have spent millions on: They are way smaller and store less data than their big brother from the monolithic world. Still, with the limitations of EBS, I would need to architect replication and database failover around those just to deal with a simple storage failure.

Here is what changes with CNS:

Dynamic provisioning
The user experience actually doesn’t change. CNS is represented like any storage provider in OpenShift, by a StorageClass. PersistentVolumeClaims (PVCs) are issued against it, and the dynamic provisioner for GlusterFS creates the volume and returns it as a PersistentVolume (PV). When the PVC is deleted, the GlusterFS volume is deleted, as well.

Distributed file storage on top of EBS
CNS volumes are basically GlusterFS volumes, managed by heketi. The volumes are built out of local block devices of the OpenShift nodes backed by EBS. These volumes provide shared storage and are mounted on the OpenShift nodes with the GlusterFS FUSE client.

[ec2-user@ip-10-20-5-132 ~]$ mount
...
10.20.4.115:vol_0b801c15b2965eb1e5e4973231d0c831 on /var/lib/origin/openshift.local.volumes/pods/80e27364-2c60-11e7-80ec-0ad6adc2a87f/volumes/kubernetes.io~glusterfs/pvc-71472efe-2a06-11e7-bab8-02e062d20f83 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
...

Container-shared storage
Multiple pods can mount and write to the same volume. The corresponding access mode is known as “RWX” — read-write many. The containers can run on different OpenShift nodes, and the dynamic provisioner will mount the GlusterFS volume on the right nodes accordingly. Then, this local mount directory is bind-mounted to the container.

Cross-availability zone
CNS is deployed across AWS AZs. The integrated, synchronous replication of GlusterFS will mirror every write 3 times. GlusterFS is deployed across OpenShift nodes in at least different AZs, and thus the storage is available in all zones. The failure of a single GlusterFS pod, an OpenShift node running the pod, or a block device accessed by the pod will have no impact. Once the failed resources come back, the storage is automatically re-replicated. CNS is actually aware of the failure zones as part of the cluster topology and will schedule new volumes, as well as recovery, so that there is no single point of failure.

Predictable performance
CNS storage performance is not tied to the size of storage request by the user in OpenShift. It’s the same performance whether 1GB or 100GB PVs are requested.

Storage performance tiers
CNS allows for multiple GlusterFS Trusted Storage Pools to be managed at once. Each pool consists of at least 3 OpenShift nodes running GlusterFS pods. While the OpenShift nodes belong to a single OpenShift cluster, the various GlusterFS pods form their own Trusted Storage Pools. An administrator can use this to equip the nodes with different kinds of storage and offer their pools with CNS as distinct storage tiers in OpenShift, each via its own StorageClass. An administrator might, for example, run CNS on 3 OpenShift nodes with SSD (e.g., EBS gp2) storage and call it “fast,” whereas another set of OpenShift nodes with magnetic storage (e.g., EBS st1) runs a separate set of GlusterFS pods as an independent Trusted Storage Pool, represented with a StorageClass called “capacity.”
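
As a sketch, such tiers would simply be two StorageClass objects pointing at the same heketi endpoint but at different Trusted Storage Pool (cluster) IDs; the URL and UUIDs below are placeholders:

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: fast
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://heketi-storage.example.com:8080"
  restauthenabled: "false"
  clusterid: "<uuid-of-the-gp2-backed-pool>"
---
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: capacity
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://heketi-storage.example.com:8080"
  restauthenabled: "false"
  clusterid: "<uuid-of-the-st1-backed-pool>"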

This is a significant step toward simplifying and abstracting provider infrastructure. For example, a MySQL database service running on top of OpenShift is now able to survive the failure of an AWS AZ, without needing to set up MySQL Master-Slave replication or change the micro-service to replicate data on its own.

Storage provided by CNS is efficiently allocated and provides performance with the first Gigabyte provisioned, thereby enabling storage consolidation. For example, consider six MySQL database instances, each in need of 25 GiB of storage capacity and up to 1500 IOPS at peak load. With EBS, I would create six EBS volumes, each with at least 500 GiB capacity out of the gp2 (General Purpose SSD) EBS tier, in order to get 1500 IOPS guaranteed. Guaranteed performance is tied to provisioned capacity with EBS.
With CNS, I can achieve the same using only 3 EBS volumes at 500 GiB capacity from the gp2 tier and run these with GlusterFS. I would create six 25 GiB volumes and provide storage to my databases with high IOPS performance, provided they don’t peak all at the same time.

Doing that, I would halve my EBS cost and still have capacity to spare for other services. My read IOPS performance is likely even higher because in CNS with 3-way replication I would read from data distributed across 3×1500 IOPS gp2 EBS volumes.

Finally, the setup for CNS is very simple and can run on any OpenShift installation based on version 3.4 or newer.

This way, no matter where I plan to run OpenShift (i.e., whichever cloud provider currently offers the lowest prices), I can rely on the same storage features and performance. Furthermore, the storage service grows with the OpenShift cluster but still provides elasticity. Only a subset of the OpenShift nodes must run CNS: at least 3, ideally spread across 3 AZs.

Deploying container-native storage on AWS

Installing OpenShift on AWS is dramatically simplified based on the OpenShift on Amazon Web Services Reference Architecture. A set of Ansible playbooks augments the existing openshift-ansible installation routine and creates all the required AWS infrastructure automatically.

A simple python script provides a convenient wrapper to the playbooks found in the openshift-ansible-contrib repository on GitHub for deploying on AWS.

All the heavy lifting of setting up Red Hat OpenShift Container Platform on AWS is automated with best practices incorporated.

The deployment finishes with an OpenShift Cluster with 3 master nodes, 3 infrastructure nodes, and 2 application nodes deployed in a highly available fashion across AWS AZs. The external and internal traffic is load balanced, and all required network, firewall, and NAT resources are stood up.

Since version 3.5, the reference architecture playbooks now ship with additional automation to make deployment of CNS just as easy. Through additional AWS CloudFormation templates and Ansible playbook tasks, the additional, required infrastructure is stood up. This mainly concerns provisioning of additional OpenShift nodes with an amended firewall configuration, additional EBS volumes, and then joining them to the existing OpenShift cluster.

In addition, compared to previous releases, the CloudFormation templates now emit more information as part of their output. The playbooks pick this up to further reduce the information needed from the administrator; they simply retrieve the proper integration points from the existing CloudFormation stack.

The result is AWS infrastructure ready for the administrator to deploy CNS. Most of the manual steps of this process can therefore be avoided. Three additional app nodes are deployed with configurable instance type and EBS volume type. Availability zones of the selected AWS region are taken into account.

Subsequent calls allow for provisioning of additional CNS pools. The reference architecture makes reasonable choices for the EBS volume type and the EC2 instance with a balance between running costs and initial performance. The only thing left for the administrator to do is to run the cns-deploy utility and create a StorageClass object to make the new storage service accessible to users.

At this point, the administrator can choose between labeling the nodes as regular application nodes or provide a storage-related label that would initially exclude them from the OpenShift scheduler for regular application pods.

Container-ready storage

The reference architecture also incorporates the concept of Container-Ready Storage (CRS). In this deployment flavor, GlusterFS runs on dedicated EC2 instances with a heketi-instance deployed separately, both running without containers as ordinary system services. The difference is that these instances are not part of the OpenShift cluster. The storage service is, however, made available to, and used by, OpenShift in the same way. If the user, for performance or cost reasons, wants the GlusterFS storage layer outside of OpenShift, this is made possible with CRS. For this purpose, the reference architecture ships add-crs-storage.py to automate the deployment in the same way as for CNS.

Verdict

CNS is a further step toward OpenShift Container Platform becoming an equalizer for application development. Consistent storage services, performance, and management are provided independently of the underlying provider platform. Deployment of data-driven applications is further simplified with CNS as the backend. This way, not only stateless but also stateful applications become easy to manage.

For developers, nothing changes: The details of provisioning and the lifecycle of storage capacity for containerized applications are transparent to them, thanks to CNS’s integration with native OpenShift facilities.

For administrators, achieving cross-provider, hybrid-cloud deployments just became even easier with the recent release of the OpenShift Container Platform 3.5 on Amazon Web Services Reference Architecture. With just two basic commands, an elastic and fault-tolerant foundation for applications can be deployed. Once set up, growth becomes a matter of adding nodes.

It is now possible to choose the most suitable cloud provider platform without worrying about various tradeoffs between different storage feature sets or becoming too close to one provider’s implementation, thereby avoiding lock-in long term.

The reference architecture details the deployment and resulting topology. Access the document here.

Originally published at redhatstorage.redhat.com on May 2, 2017 by Daniel Messer.


18 Apr 2017

Bugfix release GlusterFS 3.8.11 has landed

Another month has passed, and more bugs have been squashed in the 3.8 release. Packages should be available, or arrive soon, in the usual repositories. The next 3.8 update is expected to be made available just after the 10th of May.

Release notes for Gluster 3.8.11

This is a bugfix release. The Release Notes for 3.8.0, 3.8.1, 3.8.2, 3.8.3, 3.8.4, 3.8.5, 3.8.6, 3.8.7, 3.8.8, 3.8.9 and 3.8.10 contain a listing of all the new features that were added and bugs fixed in the GlusterFS 3.8 stable release.

Bugs addressed

A total of 15 patches have been merged, addressing 13 bugs:
  • #1422788: [Replicate] "RPC call decoding failed" leading to IO hang & mount inaccessible
  • #1427390: systemic testing: seeing lot of ping time outs which would lead to splitbrains
  • #1430845: build/packaging: Debian and Ubuntu don't have /usr/libexec/; results in bad packages
  • #1431592: memory leak in features/locks xlator
  • #1434298: [Disperse] Metadata version is not healing when a brick is down
  • #1434302: Move spit-brain msg in read txn to debug
  • #1435645: Disperse: Provide description of disperse.eager-lock option.
  • #1436231: Undo pending xattrs only on the up bricks
  • #1436412: Unrecognized filesystems (i.e. btrfs, zfs) log many errors about "getinode size"
  • #1437330: Sharding: Fix a performance bug
  • #1438424: [Ganesha + EC] : Input/Output Error while creating LOTS of smallfiles
  • #1439112: File-level WORM allows ftruncate() on read-only files
  • #1440635: Application VMs with their disk images on sharded-replica 3 volume are unable to boot after performing rebalance

11 Apr 2017

Script for creating EBS persistent volumes in OpenShift/Kubernetes

If you aren't using automated dynamic volume provisioning (which you should!), here is a short bash script to help you automatically create both the EBS volume and the Kubernetes persistent volume:

#!/bin/bash

# Create COUNT encrypted gp2 EBS volumes of SIZE GiB each and register a
# matching PersistentVolume for every one of them.
if [ $# -ne 2 ]; then
    echo "Usage: sh create-volumes.sh SIZE COUNT"
    exit 1
fi

for i in `seq 1 $2`; do
  size=$1
  # Create the EBS volume and capture its volume ID from the CLI output.
  vol=$(ec2-create-volume --size $size --region ap-southeast-2 --availability-zone ap-southeast-2a --type gp2 --encrypted | awk '{print $2}')
  size+="Gi"

  # Register a PersistentVolume that points at the freshly created EBS volume.
  echo "
  apiVersion: v1
  kind: PersistentVolume
  metadata:
    labels:
      failure-domain.beta.kubernetes.io/region: ap-southeast-2
      failure-domain.beta.kubernetes.io/zone: ap-southeast-2a
    name: pv-$vol
  spec:
    capacity:
      storage: $size
    accessModes:
      - ReadWriteOnce
    awsElasticBlockStore:
      fsType: ext4
      volumeID: aws://ap-southeast-2a/$vol
    persistentVolumeReclaimPolicy: Delete" | oc create -f -
done
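
For example, to create three 10 GiB volumes and their matching PersistentVolume objects and then confirm they were registered (this assumes the EC2 API tools and the oc client are installed and configured for your cluster):

$ sh create-volumes.sh 10 3
$ oc get pv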

3 Apr 2017

Docker 4th B’day Celebration – Bangalore

In Bangalore we celebrated Docker’s 4th Birthday at the Microsoft office on 25th March ’17. Over 300 participants signed up and around 100 turned up for the event. We had around 15 mentors. We started the event at 9:30 AM. After a quick round of introductions with the mentors, participants started doing the Docker Birthday labs. The Docker community and team have done a great job in creating self-explanatory labs with the Play with Docker environment.
Docker Mentors
Most of the participants followed the instructions on their own, and wherever needed, mentors helped them out. Around 11:30 AM our host Sudhir gave a quick demo of Azure Container Service, and then we did the cake cutting. This time only one girl attended the meetup, so we requested her to cut the cake.
After that we had light snacks and spent time networking. It was a great event, and I am sure the participants learnt something new.
Docker 4th Birthday, Bangalore
Thanks to Usha and Sudhir from Microsoft for hosting the event. For the next meetup we are collaborating with 7 other meetup groups in Bangalore to do an event on Microservices and Serverless.

Join us for the next community event on Microservices and Serverless

If you have been following the updates here, you would know that in Feb ’17 the AWS, DevOps, and Docker meetup groups of Bangalore did a combined event in the proposed community-driven event format.


Instead of INR 200 we charged INR 100 and did not look for sponsors. With all the collected money we gave gifts and prizes to the speakers and participants. We received very good feedback about the event. At that event we decided to do the next one around Microservices and Serverless. This time even more meetup groups are coming together; the following 8 meetup groups will be joining hands:

 

This is going to be one great event, as we have already received a good number of talk proposals and are looking for more until 7th April. If you are interested, please submit your talk proposal here. And if you would like to attend the event, go to the respective meetup group and purchase the INR 100 ticket.

2 Apr 2017

WordPress editor missing when using CloudFront

We often put CloudFront in front of our WordPress sites to significantly improve the sites’ load times.

CloudFront and WordPress have a few quirks, the main one being the rich post/page editor that suddenly goes missing from your wp-admin.

The issue comes down to the user-agent sniffing that WordPress does: unless you whitelist the User-Agent header, CloudFront does not forward the browser’s value to the origin, so WordPress does not recognize a capable browser and disables the rich editor.

Adding this to your theme’s functions.php is a good quick fix:

/**
* Ignore UA Sniffing and override the user_can_richedit function
* and just check the user preferences
*
* @return bool
*/
function user_can_richedit_override() {  
    global $wp_rich_edit;

    if (get_user_option('rich_editing') == 'true' || !is_user_logged_in()) {
        $wp_rich_edit = true;
        return true;
    }

    $wp_rich_edit = false;
    return false;
}

add_filter('user_can_richedit', 'user_can_richedit_override');  

17 Mar 2017

GlusterFS 3.8.10 is available

The 10th update for GlusterFS 3.8 is available for users of the 3.8 Long-Term-Maintenance version. Packages for this minor update are in many of the repositories for different distributions already. It is recommended to update any 3.8 installation to this latest release.
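
For example, pulling in the update on an RPM-based distribution and confirming the installed version might look like this; package and repository names vary per distribution, so treat the commands as a sketch:

# Assumes a repository carrying the GlusterFS 3.8 packages is already enabled.
$ yum update 'glusterfs*'
$ gluster --version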

Release notes for Gluster 3.8.10

This is a bugfix release. The Release Notes for 3.8.0, 3.8.1, 3.8.2, 3.8.3, 3.8.4, 3.8.5, 3.8.6, 3.8.7, 3.8.8 and 3.8.9 contain a listing of all the new features that were added and bugs fixed in the GlusterFS 3.8 stable release.

Improved configuration with additional 'virt' options

This release includes 5 more options for the virt group (tuned for VM workloads) for optimal performance.
Updating to the glusterfs version containing this patch won't automatically set these newer options on already existing volumes that have the virt group configured. The changes take effect only when, post-upgrade,
# gluster volume set <VOL> group virt
is performed.
For already existing volumes, users may execute the following commands, if not already set:
# gluster volume set <VOL> performance.low-prio-threads 32
# gluster volume set <VOL> cluster.locking-scheme granular
# gluster volume set <VOL> features.shard on
# gluster volume set <VOL> cluster.shd-max-threads 8
# gluster volume set <VOL> cluster.shd-wait-qlength 10000
# gluster volume set <VOL> user.cifs off
It is most likely that features.shard would already have been set on the volume even before the upgrade, in which case the third volume set command above may be skipped.

Bugs addressed

A total of 18 patches have been merged, addressing 16 bugs:
  • #1387878: Rebalance after add bricks corrupts files
  • #1412994: Memory leak on mount/fuse when setxattr fails
  • #1420993: Modified volume options not synced once offline nodes comes up.
  • #1422352: glustershd process crashed on systemic setup
  • #1422394: Gluster NFS server crashing in __mnt3svc_umountall
  • #1422811: [Geo-rep] Recreating geo-rep session with same slave after deleting with reset-sync-time fails to sync
  • #1424915: dht_setxattr returns EINVAL when a file is deleted during the FOP
  • #1424934: Include few more options in virt file
  • #1424974: remove-brick status shows 0 rebalanced files
  • #1425112: [Ganesha] : Unable to bring up a Ganesha HA cluster on RHEL 6.9.
  • #1425307: Fix statvfs for FreeBSD in Python
  • #1427390: systemic testing: seeing lot of ping time outs which would lead to splitbrains
  • #1427419: Warning messages throwing when EC volume offline brick comes up are difficult to understand for end user.
  • #1428743: Fix crash in dht resulting from tests/features/nuke.t
  • #1429312: Prevent reverse heal from happening
  • #1429405: Restore atime/mtime for symlinks and other non-regular files.

7 Mar 2017

Access a Gluster volume as object storage (via S3)

Building gluster-object in a Docker container:

Background:

This post is about accessing a Gluster volume through an object interface. The object interface is provided by gluster-swift (2). Here, gluster-swift runs inside a Docker container (1) and accesses a Gluster volume that is mounted on the host. The same Gluster volume is bind-mounted inside the Docker container, so it can be accessed with S3 GET/PUT requests.

Steps to build the gluster-swift container:

Clone docker-gluster-swift, which contains the Dockerfile:

$ git clone https://github.com/prashanthpai/docker-gluster-swift.git
$ cd docker-gluster-swift

Start the Docker service:
$ sudo systemctl start docker.service

Build a new image using the Dockerfile:
$ docker build --rm --tag prashanthpai/gluster-swift:dev .


Sending build context to Docker daemon 187.4 kB
Sending build context to Docker daemon
Step 0 : FROM centos:7
 ---> 97cad5e16cb6
Step 1 : MAINTAINER Prashanth Pai <ppai@redhat.com>
 ---> Using cache
 ---> ec6511e6ae93
Step 2 : RUN yum --setopt=tsflags=nodocs -y update &&     yum --setopt=tsflags=nodocs -y install         centos-release-openstack-kilo         epel-release &&     yum --setopt=tsflags=nodocs -y install         openstack-swift openstack-swift-{proxy,account,container,object,plugin-swift3}         supervisor         git memcached python-prettytable &&     yum -y clean all
 ---> Using cache
 ---> ea7faccc4ae9
Step 3 : RUN git clone git://review.gluster.org/gluster-swift /tmp/gluster-swift &&     cd /tmp/gluster-swift &&     python setup.py install &&     cd -
 ---> Using cache
 ---> 32f4d0e75b14
Step 4 : VOLUME /mnt/gluster-object
 ---> Using cache
 ---> a42bbdd3df9f
Step 5 : RUN mkdir -p /etc/supervisor /var/log/supervisor
 ---> Using cache
 ---> cf5c1c5ee364
Step 6 : COPY supervisord.conf /etc/supervisor/supervisord.conf
 ---> Using cache
 ---> 537fdf7d9c6f
Step 7 : COPY supervisor_suicide.py /usr/local/bin/supervisor_suicide.py
 ---> Using cache
 ---> b5a82aaf177c
Step 8 : RUN chmod +x /usr/local/bin/supervisor_suicide.py
 ---> Using cache
 ---> 5c9971b033e4
Step 9 : COPY swift-start.sh /usr/local/bin/swift-start.sh
 ---> Using cache
 ---> 014ed9a6ae03
Step 10 : RUN chmod +x /usr/local/bin/swift-start.sh
 ---> Using cache
 ---> 00d3ffb6ccb2
Step 11 : COPY etc/swift/* /etc/swift/
 ---> Using cache
 ---> ca3be2138fa0
Step 12 : EXPOSE 8080
 ---> Using cache
 ---> 677fe3fd2fb5
Step 13 : CMD /usr/local/bin/swift-start.sh
 ---> Using cache
 ---> 3014617977e0
Successfully built 3014617977e0
$
-------------------------------

Set up the Gluster volume:

Start the glusterd service, then create and mount the volume.

$  su
root@node1 docker-gluster-swift$ service glusterd start


Starting glusterd (via systemctl):                         [  OK  ]
root@node1 docker-gluster-swift$
root@node1 docker-gluster-swift$

Create gluster volume:

There are three nodes with CentOS 7.0 installed.

Ensure the glusterd service is started on all three nodes (node1, node2, node3):
# systemctl start glusterd


root@node1 docker-gluster-swift$ sudo gluster volume create tv1  node1:/opt/volume_test/tv_1/b1 node2:/opt/volume_test/tv_1/b2  node3:/opt/volume_test/tv_1/b3 force


volume create: tv1: success: please start the volume to access data
Here:

- node1, node2, node3 are the hostnames
- /opt/volume_test/tv_1/b1, /opt/volume_test/tv_1/b2 and /opt/volume_test/tv_1/b3 are the bricks
- tv1 is the volume name

root@node1 docker-gluster-swift$
root@node1docker-gluster-swift$

Start gluster volume:
root@node1 docker-gluster-swift$ gluster vol start tv1


volume start: tv1: success
root@node1docker-gluster-swift$

root@node1docker-gluster-swift$ gluster vol status

Status of volume: tv1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick node1:/opt/volume_test/tv_1/b1         49152     0          Y       5951
Brick node2:/opt/volume_test/tv_1/b2         49153     0          Y       5980
Brick node3:/opt/volume_test/tv_1/b3         49153     0          Y       5980

Task Status of Volume tv1
------------------------------------------------------------------------------
There are no active volume tasks
root@node1 docker-gluster-swift$

Create a directory to mount the volume:
root@node1 docker-gluster-swift$ mkdir -p /mnt/gluster-object/tv1


The path /mnt/gluster-object/ will be used when running the Docker container.

Mount the volume:

root@node1 docker-gluster-swift$ mount -t glusterfs node1:/tv1 /mnt/gluster-object/tv1

root@node1 docker-gluster-swift$

Verify mount:
sarumuga@node1 test$ mount | grep mnt

node1:/tv1 on /mnt/gluster-object/tv1 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

============================

Run the container with the Gluster volume mount path:

root@node1 test$ docker run -d -p 8080:8080 -v /mnt/gluster-object:/mnt/gluster-object -e GLUSTER_VOLUMES="tv1" prashanthpai/gluster-swift:dev


feb8867e1fd9c240bb3fc3aef592b4162d56895e0015a6c9cab7777e11c79e06

Here:

-p 8080:8080
    Publish the container port to the host (format: hostport:containerport).

-v /mnt/gluster-object:/mnt/gluster-object
    Bind mount: the path on the left is where all Gluster volumes are mounted on the host; the path on the right is where they appear inside the container.

-e GLUSTER_VOLUMES="tv1"
    Pass the tv1 volume name to the container as an environment variable.


Verify the container:
sarumuga@node1 test$ docker ps
CONTAINER ID        IMAGE                            COMMAND                CREATED             STATUS              PORTS                    NAMES
feb8867e1fd9        prashanthpai/gluster-swift:dev   "/bin/sh -c /usr/loc   29 seconds ago      Up 28 seconds       0.0.0.0:8080->8080/tcp   sick_heisenberg

Inspect container and get the IP address:
sarumuga@node1 test$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'  feb8867e1fd9
172.17.0.1

============================

Verifying S3 access:

Now, verify S3 access requests to the Gluster volume.

We are going to make use of s3curl (3) to verify object access.
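
Before these requests will work, s3curl needs the account and key registered on the client side; a minimal sketch, reusing the illustrative 'tv1'/'test' values from the commands below:

# Register the id/key pair that the requests below reference via --id/--key.
$ cat > ~/.s3curl <<'EOF'
%awsSecretAccessKeys = (
    tv1 => { id => 'tv1', key => 'test' },
);
EOF
$ chmod 600 ~/.s3curl

# You may also need to add the container's address (here 172.17.0.1) to the
# @endpoints list near the top of s3curl.pl so requests are signed for that host.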

Create bucket:
# ./s3curl.pl --debug --id 'tv1' --key 'test' --put /dev/null  -- -k -v  http://172.17.0.1:8080/bucket7

Put object
# ./s3curl.pl --debug --id 'tv1' --key 'test' --put  ./README -- -k -v -s http://172.17.0.1:8080/bucket7/a/b/c

Get object
# ./s3curl.pl --debug --id 'tv1' --key 'test'   -- -k -v -s http://172.17.0.1:8080/bucket7/a/b/c

List objects in a bucket request
# ./s3curl.pl --debug --id 'tv1' --key 'test'   -- -k -v -s http://172.17.0.1:8080/bucket7/

List all buckets
# ./s3curl.pl --debug --id 'tv1' --key 'test'   -- -k -v -s http://172.17.0.1:8080/

Delete object
# ./s3curl.pl --debug --id 'tv1' --key 'test'   --del -- -k -v -s http://172.17.0.1:8080/bucket7/a/b/c

Delete Bucket
# ./s3curl.pl --debug --id 'tv1' --key 'test'   --del -- -k -v -s http://172.17.0.1:8080/bucket7

============================

References:
(1) prashanthpai/docker-gluster-swift — run gluster-swift inside a Docker container (GitHub)
(2) quick_start_guide.md in the gluster/gluster-swift repository (GitHub)
(3) s3curl — Amazon S3 Authentication Tool for Curl (Amazon Web Services sample code)