10 Apr 2018

5 Apr 2018

Monitoring VDO Volumes

My previous post showed you how to get deduplication working on Linux with VDO. In some ways, that's the post that could cause trouble - if you start using vdo across a number of hosts, how can you easily establish monitoring or even alerting?

So that's the problem we're going to focus on in this post.


There are a heap of different ways to monitor systems, but the rising star currently is Prometheus. Historically, I've used monitoring systems that require clients to push data to a central server, but Prometheus turns this around. With Prometheus, data collection is initiated by the Prometheus server itself - it's called a 'scrape' job. This approach simplifies client configuration and management, which is a huge bonus for large installations.

To make vdo data available, we need an exporter. The exporter provides an HTTP endpoint that the Prometheus server will scrape metrics from. There are a heap of exporters available for Prometheus covering a plethora of different subsystems, but since vdo is new there isn't something you can just pick up and run with. Well, that was the case...

vdo_exporter Project

The scrape job simply issues a GET request to the "/metrics" HTTP API endpoint on a host. Developing an API endpoint for this in python is fairly straightforward, and given the metrics themselves are all nicely grouped together under sysfs, it seemed a bit of a no-brainer to develop an exporter. My exporter can be found here. The project's repo contains the python code, a systemd unit file and what I hope is a sensible README file documenting how to install the exporter (if you have a firewall active, remember to open port 9286!)
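To give a feel for how little is involved, here's a minimal sketch of the idea. It is not the vdo_exporter code itself. It assumes the per-volume statistics live under /sys/kvdo/<vol_name>/statistics, one value per file; the "vdo_" metric prefix, the "vol_name" label and the function names are made up for illustration.

```python
"""Minimal sketch of a Prometheus-style exporter for vdo statistics."""
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

SYSFS_ROOT = "/sys/kvdo"  # assumed layout: <root>/<vol_name>/statistics/<stat>
PORT = 9286               # the port mentioned for the exporter in this post


def collect_metrics(root=SYSFS_ROOT):
    """Return every numeric statistic as a text exposition line."""
    lines = []
    if not os.path.isdir(root):
        return lines
    for volume in sorted(os.listdir(root)):
        stats_dir = os.path.join(root, volume, "statistics")
        if not os.path.isdir(stats_dir):
            continue
        for stat in sorted(os.listdir(stats_dir)):
            try:
                with open(os.path.join(stats_dir, stat)) as handle:
                    value = float(handle.read().strip())
            except (OSError, ValueError):
                continue  # skip unreadable or non-numeric entries
            # metric prefix and label name are illustrative, not the real ones
            lines.append('vdo_%s{vol_name="%s"} %s' % (stat, volume, value))
    return lines


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the current statistics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = ("\n".join(collect_metrics()) + "\n").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=PORT):
    """Listen on all interfaces until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() on a vdo host and pointing a browser or curl at port 9286 should then show one line per statistic. The real exporter does more than this, so treat the sketch as a picture of the scrape model rather than a replacement.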

I'm leaving the installation of the exporter as an exercise for the reader, and will use the rest of this article to show you how to quickly stand up prometheus and grafana to collect and visualise the vdo statistics. For this example, I'm again using Fedora so for other distributions you may have to tweak 'stuff'.

Containers to the Rescue!

The prometheus and grafana projects both provide docker images on docker hub, so assuming you already have docker installed on your machine you can grab the images with the following;

docker pull docker.io/prom/prometheus
docker pull docker.io/grafana/grafana:4.6.3

Containers are inherently stateless, but for monitoring and dashboards we need to make sure that these containers use either different docker volumes, or persist data to the host's filesystem. For this exercise, I'll be exposing some directories on the host's filesystem (change these to suit!)

mkdir -p /opt/docker/grafana-prom/{etc,data}
chown 104 /opt/docker/grafana-prom/{etc,data}
chgrp 107 /opt/docker/grafana-prom/{etc,data}
mkdir -p /opt/docker/grafana-prom/prom-{etc,data}
chown 65534 /opt/docker/grafana-prom/prom-{etc,data}
chgrp 65534 /opt/docker/grafana-prom/prom-{etc,data}

To launch the containers and manage them as a unit, I'm using "docker-compose" - so if you don't have that installed, talk nicely to your package manager :)

Assuming you have docker-compose available, you just need a compose file (docker-compose.yml) to bring the containers together;

version: '2'

services:
  grafana:
    image: docker.io/grafana/grafana:4.6.3
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - /opt/docker/grafana-prom/etc:/etc/grafana:Z
      - /opt/docker/grafana-prom/data:/var/lib/grafana:Z
    depends_on:
      - prometheus
  prometheus:
    image: docker.io/prom/prometheus
    container_name: prometheus
    network_mode: "host"
    ports:
      - "9090:9090"
    volumes:
      - /opt/docker/grafana-prom/prom-etc:/etc/prometheus:Z
      - /opt/docker/grafana-prom/prom-data:/prometheus:Z

With the directories in place for the persistent data within the containers, and the compose file ready, you just need to start the containers. Run the docker-compose command from the directory that holds your docker-compose.yml file.

[root@myhost grafana_prom]# docker-compose up -d
Creating network "grafanaprom_default" with the default driver
Creating prometheus ...
Creating prometheus ... done
Creating grafana ...
Creating grafana ... done

Configuring Prometheus

You should already have the vdo_exporter service running on your hosts that are using vdo, so the next step is to create a scrape job in prometheus to tell it to go and fetch the data. This is done by editing the prometheus.yml file - in my case this is in /opt/docker/grafana-prom/prom-etc. Under the scrape_configs section add something like this to collect data from your vdo host(s), listing each host's address and exporter port (9286) as a target;

# VDO Information
- job_name: "vdo_stats"
  static_configs:
    - targets: [ '']

Now reload Prometheus to start the data collection;

[root@myhost grafana_prom]# docker exec -it prometheus sh
/prometheus $ kill -SIGHUP 1
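
If the job doesn't show any data, it can help to issue the same request the scrape job makes and look at what comes back. The following is a rough sketch, assuming the exporter answers on port 9286 and returns plain "name{labels} value" lines without timestamps; the function names are made up for illustration.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse simple 'name{labels} value' lines into a dict of floats."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # not a sample line, ignore it
    return samples


def scrape(host, port=9286, timeout=5):
    """Issue GET /metrics against a host, as a scrape job would."""
    url = "http://%s:%d/metrics" % (host, port)
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Running scrape() against one of your vdo hosts should return a dict of metric names and values. An exception or an empty dict points at the exporter or the firewall rather than at Prometheus.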

Configuring Grafana

To visualize the vdo statistics that Prometheus is collecting, Grafana needs two things; the data source definition pointing to the prometheus container, and a dashboard that presents the data.

  1. Log in to your grafana instance (http://localhost:3000), using the default credentials (admin/admin)
  2. Click on the Grafana icon in the top left, and select Data Sources
  3. Click the "Add data source" button
  4. Enter the prometheus details (and ensure you set the data source as the default)
  5. The grafana directory in the vdo_exporter project holds a file called VDO_Information.json. This json file is the dashboard definition, so we need to import it.
    • Click on the grafana icon again, highlight the Dashboards entry, then select the import option from the pop-up menu.
    • Click on the "Upload .json File" button, and pick the VDO_Information.json file to upload.
  6. Now select the dashboard icon (to the right of the Grafana logo), and select "VDO Information". You should then see the dashboard populated with your vdo statistics.
  7. As you add more hosts that are vdo enabled, just add the host's ip to the prometheus scrape configuration and reload prometheus. Simples..

Grafana provides a notifications feature which enables you to define threshold based alerting. You could define a trigger for low "physical space" conditions, or alert based on recovery being active - I leave that up to you! Grafana supports a number of different notification endpoints including PagerDuty, Sensu and even email! So take some time and review the docs to see how Grafana could best integrate into your environment.

And Remember...

VDO is not the proverbial "silver bullet". The savings from any compression and deduplication technology depend on the data you're storing, and vdo is no different. Also, each vdo volume requires additional RAM, so if you want to move vdo out of the test environment into production you'll need to plan for additional CPU and RAM to "make the magic happen"™.

Shrinking Your Storage Requirements with VDO

Whether you're using proprietary storage arrays or software defined storage, the actual cost of capacity can sometimes provoke responses like, "why do you need all that space?" or "OK, but that's all the storage you're going to get, so make it last".

The problem is that storage is a commodity resource - it's like toner or ink in a printer. When you run out, things will stop and lots of people tend to lose their sense of humor. Controlling storage growth has been going on for over 10 years in the proprietary storage space, with one of the most successful companies being NetApp, who introduced data deduplication with their A-SIS (Advanced Single Instance Storage) feature back in 2007. The message was that if you wanted to reduce storage consumption, you basically had to buy the more expensive "stuff" in the first place.

This was the "status quo" until Red Hat acquired Permabit in mid 2017...now compression and deduplication features are heading towards a Linux server near you!

That's the history lesson, now let's look at how you can kick the tyres on open source based compression and deduplication. For the remainder of this article, I'll walk through the steps you need to quickly get "dedupe" up and running with Fedora.


Since we're just testing, create a vm and install Fedora 27. Use libvirt, parallels, virtualbox...whatever takes your fancy - or maybe just use a cloud image in AWS. The choice is yours! Just try to ensure the vm has something like; 2 vcpus, 4GB RAM, an OS disk (20GB) and a data disk for vdo testing.

Once installed you'll need to enable an additional repository to pick up the vdo deduplication modules (kvdo - kernel virtual data optimizer)

dnf copr enable rhawalsh/dm-vdo
dnf install vdo kmod-kvdo


In my test environment, I'm using a 20g vdisk for my vdo testing.
[root@f27-vdo ~]# lsblk
vda    252:0    0   4G  0 disk 
└─vda1 252:1    0   4G  0 part /
vdb    252:16   0  20G  0 disk 

Now with the kvdo module in place, let's create a vdo volume of 100G using the 20G /dev/vdb device

[root@f27-vdo ~]# vdo create --name=vdo0 --device=/dev/vdb --vdoLogicalSize=100G
Creating VDO vdo0
Starting VDO vdo0
Starting compression on VDO vdo0
VDO instance 0 volume is ready at /dev/mapper/vdo0

Not exactly complicated :) Couple of things worth noting though;
  • by default new volumes are created with compression and deduplication enabled. If you don't like that you can play with the --compression or --deduplication flags.
  • a vdo volume is actually a device mapper device, in this case /dev/mapper/vdo0. It's this 'dm' device that you'll use from here on in.


Now you have a vdo volume, the next step is to get it deployed and understand how to report on space savings. The first thing is filesystem formatting. Make sure you use the -K switch to avoid issuing discards at format time - remember, a vdo volume is in effect a thin provisioned volume.

[root@f27-vdo ~]# mkfs.xfs -K /dev/mapper/vdo0

With the filesystem in place, the next step would normally be updating fstab...right? Well not this time. For vdo volumes, the boot time startup sequence between fstab and the vdo service is a problem - so we need to use a mount service to ensure vdo volumes are mounted correctly. 
The vdo rpm provides a sample mount service definition (/usr/share/doc/vdo/examples/systemd/VDO.mount.example). For this example, I'm going to mount the vdo volume at /mnt/vdo0

mkdir /mnt/vdo0
cp /usr/share/doc/vdo/examples/systemd/VDO.mount.example /etc/systemd/system/mnt-vdo0.mount

Then update the mount unit to look like this;

[Unit]
Description = Mount filesystem that lives on VDO0
name = mnt-vdo0.mount
Requires = vdo.service systemd-remount-fs.service
After = multi-user.target
Conflicts = umount.target

[Mount]
What = /dev/mapper/vdo0
Where = /mnt/vdo0
Type = xfs
Options = discard

[Install]
WantedBy = multi-user.target

Reminder: mount services are named to reflect their intended mount location within the filesystem.
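
The naming rule is easy to get wrong for deeper paths, so here's a simplified sketch of it. It only handles "/" separators and literal "-" characters; the systemd-escape command (with --path --suffix=mount) is the authoritative way to get the name, and mount_unit_name is just a name I've made up for the example.

```python
def mount_unit_name(mount_point):
    """Return the mount unit name systemd expects for an absolute path.

    Simplified: only "/" separators and literal "-" characters are handled.
    """
    if not mount_point.startswith("/"):
        raise ValueError("mount point must be an absolute path")
    parts = [part for part in mount_point.split("/") if part]
    if not parts:
        return "-.mount"  # the root filesystem
    # a literal "-" inside a path component is escaped as \x2d
    escaped = [part.replace("-", "\\x2d") for part in parts]
    return "-".join(escaped) + ".mount"


print(mount_unit_name("/mnt/vdo0"))  # prints mnt-vdo0.mount
```

So a volume mounted at /mnt/vdo0 needs a unit file called mnt-vdo0.mount, which is the name used in this example.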

Now reload systemd, enable the mount and start it
systemctl daemon-reload
systemctl enable mnt-vdo0.mount
systemctl start mnt-vdo0.mount
[root@f27-vdo ~]# df -h /mnt/vdo0
Filesystem         Size Used Avail Use% Mounted on
/dev/mapper/vdo0   100G 135M 100G    1% /mnt/vdo0

At this point you've used the vdo command to create the volume, but there is also a command to look at the volume's statistics called vdostats. To give us something to look at I copied the same 200MB disk image to the volume 20 times, which will also help to explain vdo overheads.

[root@f27-vdo ~]# df -h /mnt/vdo0
Filesystem        Size  Used Avail Use% Mounted on
/dev/mapper/vdo0  100G  4.5G   96G   5% /mnt/vdo0

[root@f27-vdo ~]# vdostats --hu vdo0
Device               Size   Used   Available   Use% Space saving%
vdo0                20.0G   4.2G       15.8G    21%           95%

Wait a minute...at a logical layer, the filesystem says that it's 4.5G used, but at the physical vdo layer it's saying practically the same thing AND that there's a 95% saving! So which is right? The answer is both :) The vdo subsystem persists metadata on the volume (lookup maps etc), which accounts for a chunk of the physical space used, and the savings value is derived purely from the logical blocks "in" and the physical, unique blocks written. If you need to understand more you can dive into the sysfs filesystem. 
Each vdo volume stores and maintains statistics under /sys/kvdo/<vol_name>/statistics (which is where vdostats gets its information from!)

The most useful stats I've found to understand how space is consumed are;

  • overhead_blocks_used: metadata for the volume. The overhead is proportional to the physical size of the volume; for example, on an 8TB device, the overhead was around 9GB
  • data_blocks_used: this is the count of the physical blocks consumed by user data
  • logical_blocks_used: the count of blocks consumed at the filesystem level

In my case, the "overhead_blocks_used" was 4GB, and the "data_blocks_used" around 200MB. The savings% value only applies to actual user data written to the volume, so it is derived from 1 - (data_blocks_used / logical_blocks_used). With around 200MB of physical data backing roughly 4GB of logical data, that equates to a saving of around 95%. Now it makes sense!
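
To make the arithmetic concrete, here's a small sketch of that calculation using the rough figures from my test (20 copies of a 200MB image). The 4KB block size and the function name are assumptions for the example; on a real volume you would read the two block counts from the sysfs statistics directory.

```python
BLOCK_SIZE = 4096  # bytes per block; a 4KB block size is assumed here


def savings_percent(data_blocks_used, logical_blocks_used):
    """Percentage of logical blocks that needed no physical block of their own."""
    if logical_blocks_used == 0:
        return 0.0
    saved = logical_blocks_used - data_blocks_used
    return 100.0 * saved / logical_blocks_used


# Roughly the figures from this post: one 200MB image written 20 times
data_blocks = 200 * 1024 * 1024 // BLOCK_SIZE
logical_blocks = 20 * data_blocks
print("saving: %.0f%%" % savings_percent(data_blocks, logical_blocks))  # saving: 95%
```

Note that the overhead blocks don't appear anywhere in the calculation, which is why the physical usage and the savings figure can look contradictory at first glance.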

Final Words

Deduplication is a complex beast, but hopefully the above will at least get you up and running with this new Linux feature.

If you decide to use vdo across a number of servers, running vdostats isn't really a viable option. For that it would be more useful to leave the command line behind and look at solutions like prometheus and grafana to track capacity usage and generate alerts. Spoiler alert!...that's the subject of my next post :)

26 Feb 2018


The Gluster-4.0 release is here, one of the most important releases for the Gluster community in quite some time. The bump in the major version is being brought about by a few new changes, namely a change in the on-wire protocol, and the new management framework, GlusterD2 (GD2 for short). GD2 has been under development for a very long time. We are excited to finally get it in the hands of users.

22 Jan 2018

Building Gluster with Address Sanitizer

We occasionally find leaks in Gluster via bugs filed by users and customers. We would definitely benefit from checking for memory leaks and address corruption ourselves. The usual way has been to run it under valgrind. With ASAN, the difference is we can compile the binary with ASAN and then anyone can run their tests on top of this binary and it should crash in case it comes across a memory leak or memory corruption. We’ve fixed at least one bug with the traceback from ASAN.

Here’s how you run Gluster under ASAN.

./configure --enable-gnfs --enable-debug --silent
make install CFLAGS="-g -O0 -fsanitize-recover=address -fsanitize=address" -j 4

You need to make sure you have libasan installed or else this might error out. Once this is done, compile and install like you would normally. Now run tests and see how it works. There are problems with this approach though. If there’s a leak in cli, it’s going to complain about it all the time. The noise doesn’t imply that fixing that is important. The Gluster CLI is going away soon. Additionally, the CLI isn’t a long running daemon. It’s started, does its job, and dies immediately.

The tricky part, though, is that ASAN catches memory you’ve forgotten to free; it does not catch memory that you’ve allocated unnecessarily. In the near future, I want to create RPMs which you can download and run tests against.

The configuration I’ve set up lets you continue to run the program after the first memory corruption by setting the environment variable ASAN_OPTIONS=halt_on_error=0. If you find an existing leak you are not interested in fixing, you can suppress it. More information is on the wiki page.
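
If you drive your tests from a script, the variable can be set per run instead of globally. Here is a small sketch of one way to do that. The helper name asan_env is made up, and it assumes options already present in ASAN_OPTIONS are separated by colons.

```python
import os


def asan_env(base_env=None, **options):
    """Return a copy of the environment with ASAN_OPTIONS extended."""
    env = dict(os.environ if base_env is None else base_env)
    merged = {}
    # keep any options that are already set, assuming colon separators
    for item in env.get("ASAN_OPTIONS", "").split(":"):
        if "=" in item:
            key, value = item.split("=", 1)
            merged[key] = value
    merged.update({key: str(value) for key, value in options.items()})
    env["ASAN_OPTIONS"] = ":".join("%s=%s" % pair for pair in merged.items())
    return env


print(asan_env({}, halt_on_error=0)["ASAN_OPTIONS"])  # prints halt_on_error=0
```

The returned dict can be passed as the env argument of subprocess.run() when launching whichever test command you use against the ASAN build.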

11 Dec 2017

Want to Install Ceph, but afraid of Ansible?

There is no doubt that Ansible is a pretty cool automation engine for provisioning and configuration management. ceph-ansible builds on this versatility to deliver what is probably the most flexible Ceph deployment tool out there. However, some of you may not want to get to grips with Ansible before you install Ceph...weird right?

No, not really.

If you're short on time, or just want a cluster to try ceph for the first time, a more guided installation approach may help. So I started a project called ceph-ansible-copilot

The idea is simple enough; wrap the ceph-ansible playbook with a text GUI. Very 1990's, I know, but now instead of copying and editing various files you simply start the copilot tool, enter the details and click 'deploy'. The playbook runs in the background within the GUI and any errors are shown there and then...no more drowning in an ocean of scary ansible output :)

The features and workflows of the UI are described in the project page's README file.

Enough rambling, let's look at how you test this stuff out. The process is fairly straightforward;
  1. configure some hosts for Ceph
  2. create the Ansible environment
  3. run copilot
The process below describes each of these steps using CentOS7 as the deployment target for Ansible and the Ceph cluster nodes.
    1. Configure Some Hosts for Ceph
    Call me lazy, but I'm not going to tell you how to build vm's or physical servers. To follow along, the bare minimum you need are a few virtual machines - as long as they have some disks on them for Ceph, you're all set!

    2. Create the Ansible environment
    Typically for a Ceph cluster you'll want to designate a host as the deployment or admin host. The admin host is just a deployment manager, so it can be a virtual machine, a container or even a real (gasp!) server. All that really matters is that your admin host has network connectivity to the hosts you'll be deploying ceph to.

    On the admin host, perform these tasks (copilot needs ansible 2.4 or above)
    > yum install git ansible python-urwid -y
    Install ceph-ansible (full installation steps can be found here)
    > cd /usr/share
    > git clone https://github.com/ceph/ceph-ansible.git
    > cd ceph-ansible
    > git checkout master
    Set up passwordless ssh between the admin host and the candidate ceph hosts
    > ssh-keygen
    > ssh-copy-id root@<ceph_node>
    On the admin host install copilot
    > cd ~
    > git clone https://github.com/pcuzner/ceph-ansible-copilot.git
    > cd ceph-ansible-copilot
    > python setup.py install 
    3. Run copilot
    The main playbook for ceph-ansible is in /usr/share/ceph-ansible - this is where you need to run copilot from (it will complain if you try to run it in some other place!)
    > cd /usr/share/ceph-ansible
    > copilot
    Then follow the UI..
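
Before starting copilot it can be worth confirming that the passwordless ssh setup from step 2 really works for every host. The sketch below is one way to check it and is not how copilot performs its own check. It assumes the root user from the ssh-copy-id step, and the function names are made up for illustration.

```python
import subprocess


def ssh_check_command(host, user="root", timeout=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never fall back to interactive prompts
        "-o", "ConnectTimeout=%d" % timeout,
        "%s@%s" % (user, host),
        "true",  # trivial remote command
    ]


def has_passwordless_ssh(host, user="root", timeout=5):
    """Return True if key-based login to the host works without a prompt."""
    result = subprocess.run(
        ssh_check_command(host, user, timeout),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Bear in mind that a host whose key has not yet been accepted into known_hosts will also fail this check, so a False result is not always a key problem.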

    Example Run
    Here's a screen capture showing the whole process, so you can see what you get before you hit the command line.

    The video shows the deployment of a small 3 node ceph cluster, 6 OSDs, a radosgw (for S3), and an MDS for cephfs testing. It covers the configuration of the admin host, the copilot UI and finally a quick look at the resulting ceph cluster. The video is 9mins in length, but for those of us with short attention spans, here's the timeline so you can jump to the areas that interest you.

    00:00 Pre-requisite rpm installs on the admin host
    01:12 Installing ceph-ansible from github
    01:52 Installing copilot
    02:58 Setting up passwordless ssh from the admin host to the candidate ceph hosts
    04:04 Ceph hosts before deployment
    05:04 Starting copilot
    08:10 Copilot complete, review the Ceph hosts

    What's next?
    More testing...on more and varied hardware...

    So far I've only tested 'simple' deployments using the packages from ceph.com (community deployments) against a CentOS target. So like I said, more testing is needed, a lot more...but for now there's enough of the core code there for me to claim a victory and write a blog post!

    Aside from the testing, these are the kinds of things that I'd like to see copilot handle
    • collocation rules (which daemons can safely run together)
    • resource warnings (if you have 10 HDD's but not enough RAM, or CPU...issue a warning)
    • handle the passwordless ssh setup. copilot already checks for passwordless ssh, so instead of leaving it to the admin to resolve any issues, just add another page to the UI.
    That's my wishlist - what would you like copilot to do? Leave a comment, or drop by the project on github.

    Demo'd Versions
    • copilot 0.9.1
    • ceph-ansible MASTER as at December 11th 2017
    • ansible 2.4.1 on CentOS

    4 Dec 2017

    Static Analysis for Gluster

    Static analysis programs are quite useful, but also prone to false positives. It’s really hard to keep track of static analysis failures on a fairly large project. We’ve looked at several approaches in the past. The one that we used to do was to publish a report every day which people could look at if they wished. This guaranteed that nobody looked at it. Despite knowing where to look for it, even I barely looked at it.

    The second approach was to run them twice, before your patch is merged and after your patch is merged in. If the count goes up with your patch, the test fails. This has a problem that it doesn’t account for false positives. An argument could be made that you could go fix another static analysis failure in your patch. But that means your patch now does two things, which isn’t fun when you want to do a backport, for instance. Or even for history purposes. That’s landing two unrelated changes in one patch.

    The approach that we’ve now gone with is to have them run on a nightly basis with Jenkins. Deepshika did almost all the work for this and wrote about it on her blog. It has more details on the actual implementation. This puts all the results in one place for everyone to take a look at. Jenkins also gives us a visual view of what changed over the course of time, which wasn’t as easy in the past.

    She’s working on further improving the visual look by uniting all the jobs that are tied to static analysis. That way, we’ll have a nightly pipeline run for each branch that will put all the tests we care about for a particular branch in one place.

    1 Dec 2017

    Gluster Summit 2017

    Right after Open Source Europe, we had Gluster Summit. It was a 2-day event with talks and BoFs. I had two key things to do at the Gluster Summit. One was to build out the minnowboard setup to demo Tendrl. This didn’t work out. I had volunteered to help with the video work as well. According to my plans, the setup for the minnowboards would take about 1h and then I’d be free to help with camera work. I also had a talk scheduled for the second day of the event. I’d have expected one of these to go wrong. I didn’t expect all to go wrong :)

    The venue had a balcony, which made for great photos

    On the first day, Amar and I arrived early and did the camera setup. The venue staff were helpful. They gave us a line out from their audio setup for the camera. Our original plan was that speakers would have a lapel mic for the camera. That was prone to errors from speakers and also would need us to check batteries every few hours. When we first tried to work with the line in, we had interference. The camera power supply wasn’t grounded (there wasn’t even a ground out). The venue staff switched out the boxes they used for line out and it worked like a charm after that.

    We did not have a good start for the demo. Jim had pre-setup the networking on the boards from home and brought them to Prague. But whatever we did, we couldn’t connect to its network the night before the event. That was the day we kept free to do this. That night we gave up, because we needed a monitor, an HDMI cable, and a keyboard to debug it. At the venue, we borrowed a keyboard and hooked up the board to the monitor. There was no user for dnsmasq, so it wasn’t assigning IPs and that’s why the networking didn’t work. Once we got past that point, it was about getting the network to work with my laptop. That took a while. We decided to go with a server in the cloud as the Tendrl server. By evening, we got the playbook to run and had everything installed and configured. But I’d made a mistake. I used IPs instead of FQDNs, so the dashboard wouldn’t work. This meant re-installing the whole setup. That’s the point where I gave up on it.

    We even took the group picture from the balcony

    My original content for my talk was to look at our releases, especially to list out what we committed to at the start of the release and what we finished with. There is definitely a gap. This is common for software projects and how people estimate work. This topic was more or less covered on the first day. I instead focused on how we fail. How we fail our users, developers, and community. I followed the theme of my original talk a bit, pointing out that we can solve large problems in smaller chunks.

    We’re running a marathon, not a sprint.

    29 Nov 2017

    16 Nov 2017

    Upgrading the Gluster Jenkins Server

    I’ve been wanting to work on upgrading the build.gluster.org setup for ages. There’s a lot about that setup that isn’t ideal given how people use Jenkins these days.

    We used the unix user accounts for access to Jenkins. This means Jenkins needs to read /etc/passwd and everyone has SSH access via passwords by default. Very often, the username wasn’t tied to an actual email address. I had to guess the account owner based on their usernames elsewhere. This was also open to brute force attacks. The only way to change passwords was to login to the server and run the passwd command. We fixed this problem a few months ago by switching our auth to GitHub. Now access control is a GitHub group: membership of the group gives you more permissions, while simply logging in will not give you any more permissions than not logging in.

    Our todo list during the Jenkins upgrade

    The Jenkins community now recommends not running jobs on the master node at all. But our old setup depended on certain jobs always running on master. One by one, I’ve eliminated them so that they can now run on any node agent. The last job left is our release job. We make the tar from every release available on an FTP-like server. In our old setup, this server and Jenkins were the same machine. The job ran on master and depended on them both being the same machine. We decided to split up the systems so we could take down Jenkins without any issue. We intend to fix this with an SCP command at the end of the release job to copy artifacts to the FTP-like server.

    One of the Red Hat buildings in Brno

    Now, we have a Jenkins setup that I’m happy with. At this point, we’ve fixed a vast majority of the annoying CI-related infra issues. In a few years, we’ll rip them all out and re-do them. For now, spending a week with my colleague in Brno working on an Infra sprint has been well worth our time and energy.