18 Sep 2016

Gluster Client Containers or containers capable of mounting Gluster Volumes.

I have been receiving lots of queries about whether we can mount GlusterFS from within a container, or whether Gluster client containers are available.

Yes, we can. Today I refreshed this image to Fedora 24, hence this blog post.
The process of using the Gluster client container is simple, as shown below:


# docker pull humble/gluster-client

Then run the container as


[root@localhost gluster-client]# docker run -d -ti --privileged humble/gluster-client bash
7d8dfbf8e4dbb841b240cc196682c77aad7c30cc4511d906cab275cd326b4755
[root@localhost gluster-client]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7d8dfbf8e4db humble/gluster-client "bash" 6 seconds ago Up 2 seconds high_cray
[root@localhost gluster-client]# docker exec -ti 7d8dfbf8e4db bash
[root@7d8dfbf8e4db /]# glusterd --version
glusterfs 3.8.4 built on Sep 10 2016 16:42:36
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2013 Red Hat, Inc.
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.
[root@7d8dfbf8e4db /]#cat /etc/redhat-release
Fedora release 24 (Twenty Four)
[root@7d8dfbf8e4db /]#

Make sure the FUSE device exists in the container, as shown below.


[root@localhost gluster-client]# ll /dev/fuse
crw-rw-rw-. 1 root root 10, 229 Sep 18 03:03 /dev/fuse
[root@localhost gluster-client]#
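
As a side note, it may not be necessary to run the container fully privileged: passing just the FUSE device and the SYS_ADMIN capability is often enough for FUSE mounts. A hedged sketch (not the command used above):

# docker run -d -ti --device /dev/fuse --cap-add SYS_ADMIN humble/gluster-client bash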

I have a gluster volume exported from another server and would like to mount it inside this container.


[root@7d8dfbf8e4db /]# mount -t glusterfs 192.168.43.149:/myVol1 /mnt
WARNING: getfattr not found, certain checks will be skipped..
[root@7d8dfbf8e4db mnt]# mount |grep gluster
192.168.43.149:/myVol1 on /mnt type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@7d8dfbf8e4db mnt]#
[root@7d8dfbf8e4db /]# cd /mnt
[root@7d8dfbf8e4db mnt]# ls
[root@7d8dfbf8e4db mnt]# touch Hi
[root@7d8dfbf8e4db mnt]#

That's it.

I will automate the build in docker hub and move this image to gluster official account soon.

15 Sep 2016

GlusterFS 3.8.4 is available, Gluster users are advised to update

Even though the last 3.8 release was just two weeks ago, we're sticking to the release schedule and have 3.8.4 ready for all our current and future users. As with all updates, we advise users of previous versions to upgrade to the latest and greatest. Several bugs have been fixed, and upgrading is one way to prevent hitting known problems in the future.

Release notes for Gluster 3.8.4

This is a bugfix release. The Release Notes for 3.8.0, 3.8.1, 3.8.2 and 3.8.3 contain a listing of all the new features that were added and bugs fixed in the GlusterFS 3.8 stable release.

Bugs addressed

A total of 23 patches have been merged, addressing 22 bugs:
  • #1332424: geo-rep: address potential leak of memory
  • #1357760: Geo-rep silently ignores config parser errors
  • #1366496: 1 mkdir generates tons of log messages from dht xlator
  • #1366746: EINVAL errors while aggregating the directory size by quotad
  • #1368841: Applications not calling glfs_h_poll_upcall() have upcall events cached for no use
  • #1368918: tests/bugs/cli/bug-1320388.t: Infrequent failures
  • #1368927: Error: quota context not set inode (gfid:nnn) [Invalid argument]
  • #1369042: thread CPU saturation limiting throughput on write workloads
  • #1369187: fix bug in protocol/client lookup callback
  • #1369328: [RFE] Add a count of snapshots associated with a volume to the output of the vol info command
  • #1369372: gluster snap status xml output shows incorrect details when the snapshots are in deactivated state
  • #1369517: rotated FUSE mount log is using to populate the information after log rotate.
  • #1369748: Memory leak with a replica 3 arbiter 1 configuration
  • #1370172: protocol/server: readlink rsp xdr failed while readlink got an error
  • #1370390: Locks xlators is leaking fdctx in pl_release()
  • #1371194: segment fault while join thread reaper_thr in fini()
  • #1371650: [Open SSL] : Unable to mount an SSL enabled volume via SMB v3/Ganesha v4
  • #1371912: gluster system:: uuid get hangs
  • #1372728: Node remains in stopped state in pcs status with "/usr/lib/ocf/resource.d/heartbeat/ganesha_mon: line 137: [: too many arguments ]" messages in logs.
  • #1373530: Minor improvements and cleanup for the build system
  • #1374290: "gluster vol status all clients --xml" doesn't generate xml if there is a failure in between
  • #1374565: [Bitrot]: Recovery fails of a corrupted hardlink (and the corresponding parent file) in a disperse volume

1 Sep 2016

GlusterFS 3.7.15

GlusterFS 3.7.15 released

GlusterFS 3.7.15 has been released. This is a regular scheduled release for GlusterFS-3.7 and includes 38 bug fixes since 3.7.14. The release-notes can be read here.

Downloads

The tarball can be downloaded from download.gluster.org.

Packages

Binary packages have been built and are in the process of being made available as updates.

The CentOS Storage SIG packages have been built and will become available in the centos-gluster37-test repository (from the centos-release-gluster37 package) shortly. These will be made available in the release repository after some more testing.
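
For example, on CentOS 7 the packages can be pulled in roughly like this once they reach the repositories (a sketch; explicitly enabling the centos-gluster37-test repository is only needed while the packages are still in testing, and is an assumption about the interim setup):

# yum install centos-release-gluster37
# yum install --enablerepo=centos-gluster37-test glusterfs-server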

Packages for Fedora 23 are queued for testing in Fedora Koji/Bodhi. They will appear first via dnf in the Updates-Testing repo, then in the Updates repo.

Packages for Fedora 24, 25, 26; Debian wheezy, jessie, and stretch, are available now on download.gluster.org.

Packages for Ubuntu Trusty, Wily, and Xenial are available now in Launchpad.

Packages for SuSE available now in the SuSE build system.

See the READMEs in the respective subdirs at download.gluster.org for more details on how to obtain them.

Next release

GlusterFS-3.7.16 will be the next release for GlusterFS-3.7, and is currently targeted for release on 30th September 2016. The tracker bug for GlusterFS-3.7.16 has been created. Bugs that need to be included in 3.7.16 need to be marked as dependencies of this bug.

23 Aug 2016

The out-of-order GlusterFS 3.8.3 release addresses a usability regression

On occasion the Gluster project deems an out-of-order release the best approach to address a problem that got introduced with the last update. The 3.8.3 version is such a release, and we advise all users to upgrade to it, if possible skipping the 3.8.2 release. See the included release notes for more details. We're sorry for any inconvenience caused.

Release notes for Gluster 3.8.3

This is a bugfix release. The Release Notes for 3.8.0, 3.8.1 and 3.8.2 contain a listing of all the new features that were added and bugs fixed in the GlusterFS 3.8 stable release.

Out of Order release to address a severe usability regression

Due to a major regression that was not caught and reported by any of the testing that has been performed, this release is done outside of the normal schedule.
The main reason to release 3.8.3 earlier than planned is to fix bug 1366813:
On restarting GlusterD or rebooting a GlusterFS server, only the bricks of the first volume get started. The bricks of the remaining volumes are not started. This is a regression caused by a change in GlusterFS-3.8.2.
This regression breaks automatic start of volumes on rebooting servers, and leaves the volumes inoperable. GlusterFS volumes could be left in an inoperable state after upgrading to 3.8.2, as upgrading involves restarting GlusterD.
Users can forcefully start the remaining volumes by running the gluster volume start <name> force command.
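For example, a quick way to force-start every volume after such a restart is a small shell loop (a sketch; verify the result with gluster volume status afterwards):

# for vol in $(gluster volume list); do gluster volume start "$vol" force; done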

Bugs addressed

A total of 24 patches have been merged, addressing 21 bugs:
  • #1357767: Wrong XML output for Volume Options
  • #1362540: glfs_fini() crashes with SIGSEGV
  • #1364382: RFE:nfs-ganesha:prompt the nfs-ganesha disable cli to let user provide "yes or no" option
  • #1365734: Mem leak in meta_default_readv in meta xlators
  • #1365742: inode leak in brick process
  • #1365756: [SSL] : gluster v set help does not show ssl options
  • #1365821: IO ERROR when multiple graph switches
  • #1365864: gfapi: use const qualifier for glfs_*timens()
  • #1365879: [libgfchangelog]: If changelogs are not available for the requested time range, no proper error message
  • #1366281: glfs_truncate missing
  • #1366440: [AFR]: Files not available in the mount point after converting Distributed volume type to Replicated one.
  • #1366482: SAMBA-DHT : Crash seen while rename operations in cifs mount and windows access of share mount
  • #1366489: "heal info --xml" not showing the brick name of offline bricks.
  • #1366813: Second gluster volume is offline after daemon restart or server reboot
  • #1367272: [HC]: After bringing down and up of the bricks VM's are getting paused
  • #1367297: Error and warning messages related to xlator/features/snapview-client.so adding up to the client log on performing IO operations
  • #1367363: Log EEXIST errors at DEBUG level
  • #1368053: [geo-rep] Stopped geo-rep session gets started automatically once all the master nodes are upgraded
  • #1368423: core: use <sys/sysmacros.h> for makedev(3), major(3), minor(3)
  • #1368738: gfapi-trunc test shouldn't be .t

12 Aug 2016

The GlusterFS 3.8.2 bugfix release is available

Pretty much according to the release schedule, GlusterFS 3.8.2 has been released this week. Packages are available in the standard repositories, and moving from testing-status in different distributions to normal updates.

Release notes for Gluster 3.8.2

This is a bugfix release. The Release Notes for 3.8.0 and 3.8.1 contain a listing of all the new features that were added and bugs fixed in the GlusterFS 3.8 stable release.

Bugs addressed

A total of 54 patches have been merged, addressing 50 bugs:
  • #1339928: Misleading error message on rebalance start when one of the glusterd instance is down
  • #1346133: tiering : Multiple brick processes crashed on tiered volume while taking snapshots
  • #1351878: client ID should logged when SSL connection fails
  • #1352771: [DHT]: Rebalance info for remove brick operation is not showing after glusterd restart
  • #1352926: gluster volume status client" isn't showing any information when one of the nodes in a 3-way Distributed-Replicate volume is shut down
  • #1353814: Bricks are starting when server quorum not met.
  • #1354250: Gluster fuse client crashed generating core dump
  • #1354395: rpc-transport: compiler warning format string
  • #1354405: process glusterd set TCP_USER_TIMEOUT failed
  • #1354429: [Bitrot] Need a way to set scrub interval to a minute, for ease of testing
  • #1354499: service file is executable
  • #1355609: [granular entry sh] - Clean up (stale) directory indices in the event of an rm -rf and also in the normal flow while a brick is down
  • #1355610: Fix timing issue in tests/bugs/glusterd/bug-963541.t
  • #1355639: [Bitrot]: Scrub status- Certain fields continue to show previous run's details, even if the current run is in progress
  • #1356439: Upgrade from 3.7.8 to 3.8.1 doesn't regenerate the volfiles
  • #1357257: observing " Too many levels of symbolic links" after adding bricks and then issuing a replace brick
  • #1357773: [georep]: If a georep session is recreated the existing files which are deleted from slave doesn't get sync again from master
  • #1357834: Gluster/NFS does not accept dashes in hostnames in exports/netgroups files
  • #1357975: [Bitrot+Sharding] Scrub status shows incorrect values for 'files scrubbed' and 'files skipped'
  • #1358262: Trash translator fails to create 'internal_op' directory under already existing trash directory
  • #1358591: Fix spurious failure of tests/bugs/glusterd/bug-1111041.t
  • #1359020: [Bitrot]: Sticky bit files considered and skipped by the scrubber, instead of getting ignored.
  • #1359364: changelog/rpc: Memory leak- rpc_clnt_t object is never freed
  • #1359625: remove hardcoding in get_aux function
  • #1359654: Polling failure errors getting when volume is started&stopped with SSL enabled setup.
  • #1360122: Tiering related core observed with "uuid_is_null () message".
  • #1360138: [Stress/Scale] : I/O errors out from gNFS mount points during high load on an erasure coded volume,Logs flooded with Error messages.
  • #1360174: IO error seen with Rolling or non-disruptive upgrade of an distribute-disperse(EC) volume from 3.7.5 to 3.7.9
  • #1360556: afr coverity fixes
  • #1360573: Fix spurious failures in split-brain-favorite-child-policy.t
  • #1360574: multiple failures of tests/bugs/disperse/bug-1236065.t
  • #1360575: Fix spurious failures in ec.t
  • #1360576: [Disperse volume]: IO hang seen on mount with file ops
  • #1360579: tests: ./tests/bitrot/br-stub.t fails intermittently
  • #1360985: [SNAPSHOT]: The PID for snapd is displayed even after snapd process is killed.
  • #1361449: Direct io to sharded files fails when on zfs backend
  • #1361483: posix: leverage FALLOC_FL_ZERO_RANGE in zerofill fop
  • #1361665: Memory leak observed with upcall polling
  • #1362025: Add output option --xml to man page of gluster
  • #1362065: tests: ./tests/bitrot/bug-1244613.t fails intermittently
  • #1362069: [GSS] Rebalance crashed
  • #1362198: [tiering]: Files of size greater than that of high watermark level should not be promoted
  • #1363598: File not found errors during rpmbuild: /var/lib/glusterd/hooks/1/delete/post/S57glusterfind-delete-post.py{c,o}
  • #1364326: Spurious failure in tests/bugs/glusterd/bug-1089668.t
  • #1364329: Glusterd crashes upon receiving SIGUSR1
  • #1364365: Bricks doesn't come online after reboot [ Brick Full ]
  • #1364497: posix: honour fsync flags in posix_do_zerofill
  • #1365265: Glusterd not operational due to snapshot conflicting with nfs-ganesha export file in "/var/lib/glusterd/snaps"
  • #1365742: inode leak in brick process
  • #1365743: GlusterFS - Memory Leak - High Memory Utilization

2 Aug 2016

GlusterFS 3.7.14

GlusterFS 3.7.14 released

GlusterFS 3.7.14 has been released. This is a regular scheduled release for GlusterFS-3.7 and includes 26 bug fixes since 3.7.13. The release-notes can be read here.

Downloads

The tarball can be downloaded from download.gluster.org.

Packages

Binary packages have been built and are in the process of being made available as updates.

The CentOS Storage SIG packages have been built and will become available in the centos-gluster37-test repository (from the centos-release-gluster37 package) shortly. These will be made available in the release repository after some more testing.

Packages for Fedora 23 are queued for testing in Fedora Koji/Bodhi. They will appear first via dnf in the Updates-Testing repo, then in the Updates repo.

Packages for Fedora 24, 25, 26; epel 5, 6, 7; debian wheezy, jessie, and stretch, are available now on download.gluster.org.

Packages for Ubuntu Trusty, Wily, and Xenial are available now in Launchpad.

Packages for SuSE SLES-12, OpenSuSE 13.1, and Leap42.1 are available now in the SuSE build system.

See the READMEs in the respective subdirs at download.gluster.org for more details on how to obtain them.

Next release

GlusterFS-3.7.15 will be the next release for GlusterFS-3.7, and is currently targeted for release on 30th August 2016. The tracker bug for GlusterFS-3.7.15 has been created. Bugs that need to be included in 3.7.15 need to be marked as dependencies of this bug.

19 Jul 2016

First stable update for 3.8 is available, GlusterFS 3.8.1 fixes several bugs

The initial release of Gluster 3.8 was the start of a new Long-Term-Maintenance version with monthly updates. These updates include bugfixes and stability improvements only, making it a version that can safely be installed in production environments. The Long-Term-Maintenance versions are planned to receive updates for a year. With minor releases happening every three months, the upcoming 3.9 version will be a Short-Term-Maintenance release, with updates until the next version is released three months later.
GlusterFS 3.8.1 was released a week ago, and in the meantime packages for many distributions have been made available. We recommend that all 3.8.0 users upgrade to 3.8.1. Environments that run on 3.6.x should consider an upgrade path in the coming months; 3.6 will be End-Of-Life when 3.9 is released.

Release notes for Gluster 3.8.1

This is a bugfix release. The Release Notes for 3.8.0 contain a listing of all the new features that were added and bugs fixed in the GlusterFS 3.8 stable release.

Bugs addressed

A total of 35 patches have been sent, addressing 32 bugs:
  • #1345883: [geo-rep]: Worker died with [Errno 2] No such file or directory
  • #1346134: quota : rectify quota-deem-statfs default value in gluster v set help command
  • #1346158: Possible crash due to a timer cancellation race
  • #1346750: Unsafe access to inode->fd_list
  • #1347207: Old documentation link in log during Geo-rep MISCONFIGURATION
  • #1347355: glusterd: SuSE build system error for incorrect strcat, strncat usage
  • #1347489: IO ERROR when multiple graph switches
  • #1347509: Data Tiering:tier volume status shows as in-progress on all nodes of a cluster even if the node is not part of volume
  • #1347524: NFS+attach tier:IOs hang while attach tier is issued
  • #1347529: rm -rf to a dir gives directory not empty(ENOTEMPTY) error
  • #1347553: O_DIRECT support for sharding
  • #1347590: Ganesha+Tiering: Continuous "0-glfs_h_poll_cache_invalidation: invalid argument" messages getting logged in ganesha-gfapi logs.
  • #1348055: cli core dumped while providing/not wrong values during arbiter replica volume
  • #1348060: Worker dies with [Errno 5] Input/output error upon creation of entries at slave
  • #1348086: [geo-rep]: Worker crashed with "KeyError: "
  • #1349274: [geo-rep]: If the data is copied from .snaps directory to the master, it doesn't get sync to slave [First Copy]
  • #1349711: [Granular entry sh] - Implement renaming of indices in index translator
  • #1349879: AFR winds a few reads of a file in metadata split-brain.
  • #1350326: Protocol client not mounting volumes running on older versions.
  • #1350785: Add relative path validation for gluster copy file utility
  • #1350787: gfapi: in case of handle based APIs, close glfd after successful create
  • #1350789: Buffer overflow when attempting to create filesystem using libgfapi as driver on OpenStack
  • #1351025: Implement API to get page aligned iobufs in iobuf.c
  • #1351151: ganesha.enable remains on in volume info file even after we disable nfs-ganesha on the cluster.
  • #1351154: nfs-ganesha disable doesn't delete nfs-ganesha folder from /var/run/gluster/shared_storage
  • #1351711: build: remove absolute paths from glusterfs spec file
  • #1352281: Issues reported by Coverity static analysis tool
  • #1352393: [FEAT] DHT - rebalance - rebalance status o/p should be different for 'fix-layout' option, it should not show 'Rebalanced-files' , 'Size', 'Scanned' etc as it is not migrating any files.
  • #1352632: qemu libgfapi clients hang when doing I/O
  • #1352817: [scale]: Bricks not started after node reboot.
  • #1352880: gluster volume info --xml returns 0 for nonexistent volume
  • #1353426: glusterd: glusterd provides stale port information when a volume is recreated with same brick path

15 Jul 2016

delete, info, config : GlusterFS Snapshots CLI Part 2

Now that we know how to create GlusterFS snapshots, it will be handy to know how to delete them as well. Right now I have a cluster with two volumes at my disposal. As can be seen below, each volume has 1 brick.
# gluster volume info

Volume Name: test_vol
Type: Distribute
Volume ID: 74e21265-7060-48c5-9f32-faadaf986d85
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: VM1:/brick/brick-dirs1/brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

Volume Name: test_vol1
Type: Distribute
Volume ID: b6698e0f-748f-4667-8956-ec66dd91bd84
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: VM2:/brick/brick-dirs/brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
We are going to take a bunch of snapshots for both these volumes using the create command.
# gluster snapshot create snap1 test_vol no-timestamp
snapshot create: success: Snap snap1 created successfully
# gluster snapshot create snap2 test_vol no-timestamp
snapshot create: success: Snap snap2 created successfully
# gluster snapshot create snap3 test_vol no-timestamp
snapshot create: success: Snap snap3 created successfully
# gluster snapshot create snap4 test_vol1 no-timestamp
snapshot create: success: Snap snap4 created successfully
# gluster snapshot create snap5 test_vol1 no-timestamp
snapshot create: success: Snap snap5 created successfully
# gluster snapshot create snap6 test_vol1 no-timestamp
snapshot create: success: Snap snap6 created successfully
# gluster snapshot list
snap1
snap2
snap3
snap4
snap5
snap6
#
Now we have 3 snapshots for each volume. To delete a snapshot we have to use the delete command along with the snap name.
# gluster snapshot delete snap1
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: snap1: snap removed successfully
# gluster snapshot list
snap2
snap3
snap4
snap5
snap6
#
We can also choose to delete all snapshots that belong to a particular volume. Before doing that, let's see what snapshots are present for volume "test_vol". Apart from snapshot list, there is also the snapshot info command, which provides more elaborate details of snapshots. Like snapshot list, snapshot info can also take a volume name as an option to show information about the snapshots belonging to only that volume.
# gluster snapshot list test_vol
snap2
snap3
# gluster snapshot info volume test_vol
Volume Name               : test_vol
Snaps Taken               : 2
Snaps Available           : 254
    Snapshot                  : snap2
    Snap UUID                 : d17fbfac-1cb1-4276-9b96-0b73b90fb545
    Created                   : 2016-07-15 09:32:07
    Status                    : Stopped

    Snapshot                  : snap3
    Snap UUID                 : 0f319761-eca2-491e-b678-75b56790f3a0
    Created                   : 2016-07-15 09:32:12
    Status                    : Stopped
 #
As we can see from both the list and info commands, test_vol has 2 snapshots: snap2 and snap3. Instead of deleting these snapshots individually one by one, we can choose to delete all snapshots that belong to a particular volume, in this case test_vol.
# gluster snapshot delete volume test_vol
Volume (test_vol) contains 2 snapshot(s).
Do you still want to continue and delete them?  (y/n) y
snapshot delete: snap2: snap removed successfully
snapshot delete: snap3: snap removed successfully
#
# gluster snapshot list
snap4
snap5
snap6
# gluster snapshot list test_vol
No snapshots present
# gluster snapshot info volume test_vol
Volume Name               : test_vol
Snaps Taken               : 0
Snaps Available           : 256
#
With the above volume option we successfully deleted both snapshots of test_vol with a single command. Now only 3 snapshots remain, all of which belong to volume "test_vol1". Before proceeding further, let's create one more snapshot for volume "test_vol".
# gluster snapshot create snap7 test_vol no-timestamp
snapshot create: success: Snap snap7 created successfully
# gluster snapshot list
snap4
snap5
snap6
snap7
#
With this, we have four snapshots, three of which belong to test_vol1 and one to test_vol. Now, with the 'delete all' command, we will be able to delete all snapshots present, irrespective of which volumes they belong to.
 # gluster snapshot delete all
System contains 4 snapshot(s).
Do you still want to continue and delete them?  (y/n) y
snapshot delete: snap4: snap removed successfully
snapshot delete: snap5: snap removed successfully
snapshot delete: snap6: snap removed successfully
snapshot delete: snap7: snap removed successfully
# gluster snapshot list
No snapshots present
#
So that is how you delete GlusterFS snapshots. There are some configurable options for Gluster snapshots, which can be viewed and modified using the snapshot config option.
# gluster snapshot config

Snapshot System Configuration:
snap-max-hard-limit : 256
snap-max-soft-limit : 90%
auto-delete : disable
activate-on-create : disable

Snapshot Volume Configuration:

Volume : test_vol
snap-max-hard-limit : 256
Effective snap-max-hard-limit : 256
Effective snap-max-soft-limit : 230 (90%)

Volume : test_vol1
snap-max-hard-limit : 256
Effective snap-max-hard-limit : 256
Effective snap-max-soft-limit : 230 (90%)
#
Just running the config option, as shown above, displays the current configuration of the system. What we are looking at are the default configuration values. There are four different configurable parameters. Let's go through them one by one.

  • snap-max-hard-limit: Set by default to 256, snap-max-hard-limit is the maximum number of snapshots that can be present in the system. Once a volume reaches this limit in terms of the number of snapshots it has, we are not allowed to create any more snapshots unless we either delete a snapshot or increase this limit.
    # gluster snapshot config test_vol snap-max-hard-limit 2
    Changing snapshot-max-hard-limit will limit the creation of new snapshots if they exceed the new limit.
    Do you want to continue? (y/n) y
    snapshot config: snap-max-hard-limit for test_vol set successfully
    # gluster snapshot config

    Snapshot System Configuration:
    snap-max-hard-limit : 256
    snap-max-soft-limit : 90%
    auto-delete : disable
    activate-on-create : disable

    Snapshot Volume Configuration:

    Volume : test_vol
    snap-max-hard-limit : 2
    Effective snap-max-hard-limit : 2
    Effective snap-max-soft-limit : 1 (90%)

    Volume : test_vol1
    snap-max-hard-limit : 256
    Effective snap-max-hard-limit : 256
    Effective snap-max-soft-limit : 230 (90%)
    #
    #
    # gluster snapshot info volume test_vol
    Volume Name               : test_vol
    Snaps Taken               : 0
    Snaps Available           : 2
    #
    As can be seen with the config option, I have modified the snap-max-hard-limit for the volume test_vol to 2. This means that after taking 2 snapshots it will not allow me to take any more, till I either delete one of them or increase this value. See how the snapshot info for the volume test_vol shows 'Snaps Available' as 2.
    # gluster snapshot create snap1 test_vol no-timestamp
    snapshot create: success: Snap snap1 created successfully
    # gluster snapshot create snap2 test_vol no-timestamp
    snapshot create: success: Snap snap2 created successfully
    Warning: Soft-limit of volume (test_vol) is reached. Snapshot creation is not possible once hard-limit is reached.
    #
    #
    # gluster snapshot info volume test_vol
    Volume Name               : test_vol
    Snaps Taken               : 2
    Snaps Available           : 0
        Snapshot                  : snap1
        Snap UUID                 : 2ee5f237-d4d2-47a6-8a0c-53a887b33b26
        Created                   : 2016-07-15 10:12:55
        Status                    : Stopped

        Snapshot                  : snap2
        Snap UUID                 : 2c74925e-4c75-4824-b39e-7e1e22f3b758
        Created                   : 2016-07-15 10:13:02
        Status                    : Stopped

    #
    # gluster snapshot create snap3 test_vol no-timestamp
    snapshot create: failed: The number of existing snaps has reached the effective maximum limit of 2, for the volume (test_vol). Please delete few snapshots before taking further snapshots.
    Snapshot command failed
    #
    What we have done above is create 2 snapshots for the volume test_vol, reaching its snap-max-hard-limit. Notice two things here: first, when we created the second snapshot it gave us a warning that the soft-limit has been reached for this volume (we will come to the soft-limit in a while); second, the 'Snaps Available' count in snapshot info has now become 0. As explained, when we try to take the third snapshot it fails, explaining that we have reached the maximum limit and asking us to delete a few snapshots.
    # gluster snapshot delete snap1
    Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
    snapshot delete: snap1: snap removed successfully
    # gluster snapshot create snap3 test_vol no-timestamp
    snapshot create: success: Snap snap3 created successfully
    Warning: Soft-limit of volume (test_vol) is reached. Snapshot creation is not possible once hard-limit is reached.
    #
    # gluster snapshot config test_vol snap-max-hard-limit 3
    Changing snapshot-max-hard-limit will limit the creation of new snapshots if they exceed the new limit.
    Do you want to continue? (y/n) y
    snapshot config: snap-max-hard-limit for test_vol set successfully
    # gluster snapshot info volume test_vol
    Volume Name               : test_vol
    Snaps Taken               : 2
    Snaps Available           : 1
        Snapshot                  : snap2
        Snap UUID                 : 2c74925e-4c75-4824-b39e-7e1e22f3b758
        Created                   : 2016-07-15 10:13:02
        Status                    : Stopped

        Snapshot                  : snap3
        Snap UUID                 : bfd080f3-848e-490a-83ed-066858bd96fc
        Created                   : 2016-07-15 10:19:17
        Status                    : Stopped

    # gluster snapshot create snap4 test_vol no-timestamp
    snapshot create: success: Snap snap4 created successfully
    Warning: Soft-limit of volume (test_vol) is reached. Snapshot creation is not possible once hard-limit is reached.
    #
    As seen above, once we delete a snapshot the system allows us to create another one. It also allows us to do so when we increase the snap-max-hard-limit. I am curious to see what happens when we have hit the snap-max-hard-limit and I go ahead and decrease the limit further. Does the system delete snapshots to bring their number down to the new limit?
    # gluster snapshot config test_vol snap-max-hard-limit 1
    Changing snapshot-max-hard-limit will limit the creation of new snapshots if they exceed the new limit.
    Do you want to continue? (y/n) y
    snapshot config: snap-max-hard-limit for test_vol set successfully
    # gluster snapshot config

    Snapshot System Configuration:
    snap-max-hard-limit : 256
    snap-max-soft-limit : 90%
    auto-delete : disable
    activate-on-create : disable

    Snapshot Volume Configuration:

    Volume : test_vol
    snap-max-hard-limit : 1
    Effective snap-max-hard-limit : 1
    Effective snap-max-soft-limit : 0 (90%)

    Volume : test_vol1
    snap-max-hard-limit : 256
    Effective snap-max-hard-limit : 256
    Effective snap-max-soft-limit : 230 (90%)
    # gluster snapshot info volume test_vol
    Volume Name               : test_vol
    Snaps Taken               : 3
    Snaps Available           : 0
        Snapshot                  : snap2
        Snap UUID                 : 2c74925e-4c75-4824-b39e-7e1e22f3b758
        Created                   : 2016-07-15 10:13:02
        Status                    : Stopped

        Snapshot                  : snap3
        Snap UUID                 : bfd080f3-848e-490a-83ed-066858bd96fc
        Created                   : 2016-07-15 10:19:17
        Status                    : Stopped

        Snapshot                  : snap4
        Snap UUID                 : bd9a5297-0eb5-47d1-b250-9b57f4e57427
        Created                   : 2016-07-15 10:20:08
        Status                    : Stopped

    #
    # gluster snapshot create snap5 test_vol no-timestamp
    snapshot create: failed: The number of existing snaps has reached the effective maximum limit of 1, for the volume (test_vol). Please delete few snapshots before taking further snapshots.
    Snapshot command failed
    #
    So the answer to that question is a big NO. We don't explicitly delete snapshots when you decrease the snap-max-hard-limit to a number below the current number of snapshots. The reason for not doing so is that it would become very easy to lose important snapshots. What we do instead is not allow you to create snapshots till you... (yeah, you guessed it right) either delete a snapshot or increase the snap-max-hard-limit.

    snap-max-hard-limit is both a system config and a volume config. What this means is that we can set this value for individual volumes, and we can also set a system-wide value.
    # gluster snapshot config snap-max-hard-limit 10
    Changing snapshot-max-hard-limit will limit the creation of new snapshots if they exceed the new limit.
    Do you want to continue? (y/n) y
    snapshot config: snap-max-hard-limit for System set successfully
    # gluster snapshot config

    Snapshot System Configuration:
    snap-max-hard-limit : 10
    snap-max-soft-limit : 90%
    auto-delete : disable
    activate-on-create : disable

    Snapshot Volume Configuration:

    Volume : test_vol
    snap-max-hard-limit : 1
    Effective snap-max-hard-limit : 1
    Effective snap-max-soft-limit : 0 (90%)

    Volume : test_vol1
    snap-max-hard-limit : 256
    Effective snap-max-hard-limit : 10
    Effective snap-max-soft-limit : 9 (90%)
    #
    Notice how not mentioning a volume name for a snapshot config sets that particular config for the whole system instead of a particular volume. The same is clearly visible in the 'Snapshot System Configuration' section of the snapshot config output. Look at this system option as an umbrella limit for the entire cluster. You are still allowed to configure an individual volume's snap-max-hard-limit. If the individual volume's limit is lower than the system's limit, it will be honored; otherwise the system limit will be honored.

    For example, we can see that the system snap-max-hard-limit is set to 10. Now, in the case of the volume test_vol, the snap-max-hard-limit for the volume is set to 1, which is lower than the system's limit and is hence honored, making the effective snap-max-hard-limit 1. This effective snap-max-hard-limit is the limit that is taken into consideration during snapshot create, and is displayed as 'Snaps Available' in snapshot info. Similarly, for volume test_vol1, the snap-max-hard-limit is 256, which is higher than the system's limit and is hence not honored, making the effective snap-max-hard-limit of that volume 10, which is the system's snap-max-hard-limit. Pretty intuitive, huh!
  • snap-max-soft-limit: This option is set as a percentage (of snap-max-hard-limit), and as we have seen in the examples above, on crossing this limit a warning is shown saying the soft-limit has been reached. It serves as a reminder to the user that they are nearing the hard-limit and should do something about it in order to be able to keep taking snapshots. By default the snap-max-soft-limit is set to 90%, and it can be modified using the snapshot config option.
    # gluster snapshot config test_vol snap-max-soft-limit 50
    Soft limit cannot be set to individual volumes.
    Usage: snapshot config [volname] ([snap-max-hard-limit <count>] [snap-max-soft-limit <percent>]) | ([auto-delete <enable|disable>])| ([activate-on-create <enable|disable>])
    #
    So what do we have here... Yes, the snap-max-soft-limit is a system option only and cannot be set for individual volumes. When the snap-max-soft-limit option is set for the system, it is applied to the effective snap-max-hard-limit of each individual volume to derive the effective snap-max-soft-limit of that volume.
    # gluster snapshot config snap-max-soft-limit 50
    If Auto-delete is enabled, snap-max-soft-limit will trigger deletion of oldest snapshot, on the creation of new snapshot, when the snap-max-soft-limit is reached.
    Do you want to change the snap-max-soft-limit? (y/n) y
    snapshot config: snap-max-soft-limit for System set successfully
    # gluster snapshot config

    Snapshot System Configuration:
    snap-max-hard-limit : 10
    snap-max-soft-limit : 50%
    auto-delete : disable
    activate-on-create : disable

    Snapshot Volume Configuration:

    Volume : test_vol
    snap-max-hard-limit : 1
    Effective snap-max-hard-limit : 1
    Effective snap-max-soft-limit : 0 (50%)

    Volume : test_vol1
    snap-max-hard-limit : 256
    Effective snap-max-hard-limit : 10
    Effective snap-max-soft-limit : 5 (50%)
    #
    As we can see above, setting the option for the system applies it to each individual volume's effective snap-max-hard-limit (see test_vol1) to derive that particular volume's effective snap-max-soft-limit.

    I am sure the keen-eyed observer in you has noticed the auto-delete warning in the output above, and it's just as well, because it is our third configurable parameter.
  • auto-delete: This option is tightly tied to the snap-max-soft-limit, or rather the effective snap-max-soft-limit of individual volumes. It is however a system option and cannot be set for individual volumes. On enabling this option, once we exceed the effective snap-max-soft-limit of a particular volume, we automatically delete the oldest snapshot of that volume, making sure the total number of snapshots does not exceed the effective snap-max-soft-limit and never reaches the effective snap-max-hard-limit, enabling you to keep taking snapshots without hassle.

    NOTE: Extreme Caution Should Be Exercised When Enabling This Option, As It Automatically Deletes The Oldest Snapshot Of A Volume, When The Number Of Snapshots For That Volume Exceeds The Effective snap-max-soft-limit Of That Volume.
    # gluster snapshot config auto-delete enable
    snapshot config: auto-delete successfully set
    # gluster snapshot config

    Snapshot System Configuration:
    snap-max-hard-limit : 10
    snap-max-soft-limit : 50%
    auto-delete : enable
    activate-on-create : disable

    Snapshot Volume Configuration:

    Volume : test_vol
    snap-max-hard-limit : 1
    Effective snap-max-hard-limit : 1
    Effective snap-max-soft-limit : 0 (50%)

    Volume : test_vol1
    snap-max-hard-limit : 256
    Effective snap-max-hard-limit : 10
    Effective snap-max-soft-limit : 5 (50%)
    #
    # gluster snapshot list
    snap2
    snap3
    snap4
    # gluster snapshot delete all
    System contains 3 snapshot(s).
    Do you still want to continue and delete them?  (y/n) y
    snapshot delete: snap2: snap removed successfully
    snapshot delete: snap3: snap removed successfully
    snapshot delete: snap4: snap removed successfully
    # gluster snapshot create snap1 test_vol1 no-timestamp
    snapshot create: success: Snap snap1 created successfully
    # gluster snapshot create snap2 test_vol1 no-timestamp
    snapshot create: success: Snap snap2 created successfully
    # gluster snapshot create snap3 test_vol1 no-timestamp
    snapshot create: success: Snap snap3 created successfully
    # gluster snapshot create snap4 test_vol1 no-timestamp
    snapshot create: success: Snap snap4 created successfully
    # gluster snapshot create snap5 test_vol1 no-timestamp
    snapshot create: success: Snap snap5 created successfully
    In the above example, we first set the auto-delete option in snapshot config,  followed by deleting all the snapshots currently in the system. Then we create 5 snapshots for test_vol1, whose effective snap-max-soft-limit is 5. On creating one more snapshot, we will exceed the limit, and the oldest snapshot will be deleted.
    # gluster snapshot create snap6 test_vol1 no-timestamp
    snapshot create: success: Snap snap6 created successfully
    #
    # gluster snapshot list volume test_vol1
    snap2
    snap3
    snap4
    snap5
    snap6
    #
    As soon as we create snap6, the total number of snapshots becomes 6, thus exceeding the effective snap-max-soft-limit for test_vol1. The oldest snapshot of test_vol1 (which is snap1) is then deleted in the background, bringing the total number of snapshots back to 5.
  • activate-on-create: As we discussed during snapshot creation, a snapshot is in the deactivated state by default when created, and needs to be activated to be used. On enabling this option in snapshot config, every snapshot created thereafter will be activated by default. This too is a system option, and cannot be set for individual volumes.
    # gluster snapshot status snap6

    Snap Name : snap6
    Snap UUID : 7fc0a0e7-950d-4c1b-913d-caea6037e633

        Brick Path        :   VM2:/var/run/gluster/snaps/db383315d5a448d6973f71ae3e45573e/brick1/brick
        Volume Group      :   snap_lvgrp
        Brick Running     :   No
        Brick PID         :   N/A
        Data Percentage   :   1.80
        LV Size           :   616.00m

    #
    # gluster snapshot config activate-on-create enable
    snapshot config: activate-on-create successfully set
    # gluster snapshot config

    Snapshot System Configuration:
    snap-max-hard-limit : 10
    snap-max-soft-limit : 50%
    auto-delete : enable
    activate-on-create : enable

    Snapshot Volume Configuration:

    Volume : test_vol
    snap-max-hard-limit : 1
    Effective snap-max-hard-limit : 1
    Effective snap-max-soft-limit : 0 (50%)

    Volume : test_vol1
    snap-max-hard-limit : 256
    Effective snap-max-hard-limit : 10
    Effective snap-max-soft-limit : 5 (50%)
    # gluster snapshot create snap7 test_vol1 no-timestamp
    snapshot create: success: Snap snap7 created successfully
    # gluster snapshot status snap7

    Snap Name : snap7
    Snap UUID : b1864a86-1fa4-4d42-b20a-3d95c2f9e277

        Brick Path        :   VM2:/var/run/gluster/snaps/38b1d9a2f3d24b0eb224f142ae5d33ca/brick1/brick
        Volume Group      :   snap_lvgrp
        Brick Running     :   Yes
        Brick PID         :   6731
        Data Percentage   :   1.80
        LV Size           :   616.00m

    #
    As can be seen, while this option was disabled snap6 wasn't activated by default; after enabling the option, snap7 was in the activated state on creation. In the next post we will discuss snapshot restore and snapshot clone.

12 Jul 2016

Bangalore Docker Meetup #21 at Infosys

We had our last Docker meetup at Infosys on 9th July '16. Though some 300 people RSVPed, only 80 attended. It is good to see a company like Infosys coming forward and hosting such a meetup. Thanks to Ganesh from Infosys for taking the lead here and making it happen.
He started the meetup with an introduction to Infosys and then introduced the speakers. After that I talked about my DockerCon experience and briefed everyone on the announcements made at DockerCon '16, such as Docker 1.12, the new Swarm mode, the AWS and Azure betas, and the Windows and Mac betas.

Ajeet then talked about the Docker 1.12 release and Docker Swarm. He talked about scaling, the routing mesh, constraints, etc., and then gave a demo of Docker Swarm with Docker 1.12.

After that, Aditya Patawari talked about Docker networking. He covered the kinds of networks available by default when we start Docker, and gave demos of the bridge, host, and overlay networks.

Continuing on the networking topic, Suraj talked about macvlan and ipvlan and gave excellent demos of both. With macvlan and ipvlan, a container can get an IP address from the local LAN.

We then moved to Docker security, and as usual Srinivas Makam gave an excellent session on it. I always learn something from him. He talked about namespaces, cgroups, capabilities, seccomp, SELinux, image scanning, secure access to the Docker Engine, best practices, etc.

Lala then talked about atomic scan, which is a container vulnerability detection tool. He also talked about SCAP and SCAP bench, and gave a demo as well.

Raj Kiran from Infosys then talked about Docker for Developers and gave a demo with Netbeans.

The Infosys team has recorded the sessions and will hopefully share them in the public domain soon. During the meetup we also announced the Docker 1.12 hackathon, and we are planning a hands-on session on 16th July with the people who registered for it.

We will be doing the next Docker meetup sometime in the third week of August.

6 Jul 2016

WORM/Retention an experimental feature in GlusterFS v3.8

Introduction:

This feature is about having a WORM-based compliance/archiving solution in GlusterFS. It mainly focuses on the following:
  • Compliance: laws and regulations on how intellectual property and confidential information is accessed and stored.
  • WORM/Retention: storing data in a tamper-proof and secure way, along with data accessibility policies.
  • Archive: storing data effectively and efficiently, as part of a disaster-recovery solution.

WORM/Retention empowers GlusterFS users to safeguard their data in a tamper-proof manner. It further enables users to maintain and track the state of a file as it transforms over time (writable, read-only, and undeletable), thereby nullifying any effort to change the contents, location, or properties of a static file on a brick.


Existing implementation:

The existing feature is implemented at the volume level (a volume being a collection of multiple storage units, often referred to as bricks), which implies that while the volume option stays enabled, the files in that volume remain in a read-only state.
New files can be created in that volume, but once a file is closed it becomes read-only. This does not even allow users to delete files that are no longer required or of no use. It was rigid and inconvenient, providing no options or controls to GlusterFS users. To avoid this, the more flexible, file-level WORM/Retention feature has been implemented.


Feature details:

This feature enhances the existing WORM translator in Gluster and introduces file-level WORM/Retention semantics with autocommit (automatic WORM/Retention transition). With file-level WORM/Retention, each file gets its own WORM/Retention properties, and the autocommit feature presents the user with valuable options/controls to pick the optimum settings for their requirements, thus making the Gluster compliance/archival solution more relevant to the archival market. The life cycle of a WORM/Retained file is shown in the figure. A normal file becomes WORM-Retained, either manually or by auto-commit, until its retention period expires. After the retention period the file transitions to the WORM state. A file in the WORM state can transition back to the WORM-Retained state if necessary, using the manual transition procedure. The WORM-Held (Legal-Hold) state is currently not implemented in 3.8; it will be a future enhancement to the feature. A WORM file can be deleted, which was not possible with the previous implementation.




In 3.8, file-level WORM/Retention ships as an experimental feature.
  1. We will have file-level WORM/Retention semantics, i.e.:
    1. Each file will have its own WORM/Retention properties
      1. Retention Period
      2. WORM/Retention state
    2. There will be only 2 modes of WORM/Retention that will be supported
      1. Relaxed : Retention period of the file can be increased or decreased (but not below the modification time)
      2. Enterprise : Retention period of the file can only be increased and not decreased
    3. Volume Level Retention profile:
      1. Default Retention Period : Time till which a file should be undeletable
      2. Autocommit Period : Time period at/after which the namespace scan has to take place (automatic/lazy auto-commit) to do the state transition
      3. WORM/Retention Mode : Relaxed/Enterprise
    4. Posix commands for WORM/Retention Operations
      1. “touch -a” or “touch -t” command to increase or decrease retention period
      2. “chmod -w” or equivalent command to make a file read-only on demand
    5. WORM/Retention Transition :
      1. Manual using posix command
      2. Automatic transition : Dormant files will be converted into WORM files based on the auto-commit period. In 3.8 this is a lazy, IO-triggered mechanism using timeouts for untouched files; the next IO will cause the transition.


How to test:

Enabling the feature:
Turn off the features.read-only and features.worm volume options if active. Turn on the features.worm-file-level option. This will enable the file-level WORM feature. Set the features.retention-mode option to manage the retention period of a WORM-Retained file later. Set the features.default-retention-period and features.auto-commit-period options as required. Time periods are specified in seconds.
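
Putting the above together, a minimal sketch of enabling the feature on a volume could look like this (assuming a volume named test_vol; the retention-mode, retention-period, and auto-commit values are illustrative):

# gluster volume set test_vol features.read-only off
# gluster volume set test_vol features.worm off
# gluster volume set test_vol features.worm-file-level on
# gluster volume set test_vol features.retention-mode relax
# gluster volume set test_vol features.default-retention-period 120
# gluster volume set test_vol features.auto-commit-period 180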

The FOPs will do the state transition, or the necessary actions, only on those files which are created while these configurations are set and the volume options remain in the same state.

 Screenshot from 2016-06-02 17-05-18.png

Manual transition:
This is done by using the posix command chmod.

chmod -w <filename>
chmod 0444 <filename>
chmod u=r,o=r,g=r <filename>

or any other equivalent command that removes the write bits for all three classes of users. The code that checks for this is shown below:

       if (stbuf->ia_prot.owner.write == 0 &&
           stbuf->ia_prot.group.write == 0 &&
           stbuf->ia_prot.other.write == 0)
               ret = _gf_true;

If the condition is satisfied, the file makes the state transition from the Normal/WORM state to the WORM-Retained state. The access time of the file will then point to the time till which the file will be retained. During this time the file is immutable and undeletable.
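
To observe this, the retention time can be read back from the file's access time, for example with stat (an illustrative check; <filename> is a placeholder):

# stat --format='atime: %x  mtime: %y' <filename>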

Screenshot from 2016-06-02 17-10-29.png

In this figure the access time of the file was previously 17:09:21. After the state transition the access time points to the time till which the file will be undeletable. In this case it is 17:10:31, i.e., the time of the state transition plus the default-retention-period.

Autocommit:
A lazy autocommit style of state transition is implemented in the current version. It is performed when the next IO (link, unlink, rename, or truncate) is triggered. It looks for dormant files whose auto-commit-period has expired, i.e., where the difference between the current time and the start_time (creation time) of the file is greater than the auto-commit-period. If this condition is satisfied and the file has not been accessed within the auto-commit-period, it is transitioned to the WORM-Retained state, and its access time points to the time till which the file will be retained.

In the figure below, the “rm -f file2” (unlink) command triggers the state transition, since the timeout has occurred for file2. It displays a “Read-only file system” error and blocks the FOP. After the transition, the access time points to the retention time of the file.

Screenshot from 2016-06-02 17-54-10.png

Updating the retention time:
For a WORM-Retained file, we can change the retention time. The access time points to the previously set retention time of the file, and we can change it using the “touch -a” or “touch -t” commands. Whether the new time is accepted depends on the retention mode set on the file. If the retention mode is “relax”, the command will succeed as long as the time specified is not less than the modification time of the file. If the mode is “enterprise”, we can only increase the time that is set.
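
For illustration, such an update could be done as follows (the timestamp is hypothetical; touch -a -t changes only the access time):

# touch -a -t 201607151900 <filename>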

In the figure below, the first “touch -t” fails with a “Read-only file system” error, since the retention mode is “relax” and we are trying to set the access time of the file to less than its modification time. The second attempt succeeds and sets the access time (retention time) to a value higher than the modification time.

Screenshot from 2016-06-02 18-15-05.png


Performing IO on WORM/Retained files:
The link, unlink, rename, write, and truncate FOPs fail on a WORM-Retained file. While performing a link, unlink, rename, or truncate FOP, if the file’s retention period is over, the file transitions to the WORM state. The access time of the file will then point to a value that is the access time minus the default-retention-period of that file. If the file’s retention time was not updated after it moved to the WORM-Retained state, the access time will point to the actual access time of the file before the state transition; if it was updated afterwards, it may not.

If the FOP performed after the timeout is unlink, the file transitions to the WORM state and the FOP is allowed, so the file will no longer be available. If you want to keep a file for some more time, increase its retention period before it gets into the WORM state.

In the figure below, “file3” is manually transitioned to the WORM-Retained state. The unlink, rename, and write FOPs are blocked since it is in the WORM-Retained state. Truncate and link FOPs will also fail for “file3”.

Screenshot from 2016-06-03 01-30-59.png

Performing IO on WORM files:
For a WORM file, the link, rename, truncate, and write FOPs fail. The unlink FOP passes and deletes the file, since the retention time of the file has expired. The user can either keep the file or delete it, since the file's timeout has happened.

If the user wants to keep the file, they can move it back to the WORM-Retained state using the same posix chmod command used for manual transition. This will again put the file under the retention policies.

The figure below shows the state transition of file3 from the WORM-Retained state to the WORM state. The access time of the file after the transition points to the access time it had before the WORM-Retention transition. The unlink FOP succeeds since the file is no longer retained.

Screenshot from 2016-06-03 01-43-21.png


User improvements:
    1. Users can still use the older volume-level WORM feature.
    2. Users can play around with the new file-level WORM feature.


Limitations (Plans for next releases):
    1. No data validation of read-only data, i.e., integration with BitRot is not done.
    2. Internal operations like tiering, rebalancing, and self-healing will fail on WORMed files.
    3. Since Gluster is a userland filesystem, there is no control over ctime; we need to implement this.
    4. WORM/Retention-based tiering.


Owners:
  • Joseph Fernandes <josephaug26@gmail.com>
  • Karthik Subrahmanya <ksubrahm@redhat.com>


Reference:
http://www.gluster.org/community/documentation/index.php/Features/gluster_compliance_archive

De-mystifying gluster shards

Recently I've been working on converging glusterfs with oVirt - hyperconverged, open source style. oVirt has supported glusterfs storage domains for a while, but in the past a virtual disk was stored as a single file on a gluster volume. This helps some workloads, but file distribution and functions like self heal and rebalance have more work to do. The larger the virtual disk, the more work gluster has to do in one go.

Enter sharding

The shard translator was introduced with version 3.7, and enables large files to be split into smaller chunks (shards) of a user-defined size. This addresses a number of legacy issues when using glusterfs for virtual machine storage - but it does introduce an additional level of complexity. For example, how do you now relate a file to its shards, or vice-versa?

The great thing is that even though a file is split into shards, the implementation still allows you to relate files to shards with a few simple commands.
  
Firstly, let's look at how to relate a file to its shards.
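As a rough sketch of the idea (the mount point, brick path, file name, and GFID below are hypothetical): every file has a GFID, and all shards after the first block are stored in a hidden .shard directory at the brick root, named <GFID>.<index>. So reading the GFID from the mount and grepping for it under .shard links the file to its shards:

# getfattr -n glusterfs.gfid.string /mnt/myvol/images/vmdisk.img
# ls /bricks/brick1/.shard | grep f21eb7ed-1a5a-4b3c-9d2e-0123456789ab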


And now, let's go the other way. We start with a shard, and end with the parent file.
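Again only a sketch with hypothetical paths: the GFID portion of the shard's name (everything before the trailing .<index>) maps to a hardlink under the brick's .glusterfs directory (the first two and next two hex characters of the GFID form the subdirectories), and that hardlink shares an inode with the parent file, so find -samefile reveals it:

# ls -i /bricks/brick1/.glusterfs/f2/1e/f21eb7ed-1a5a-4b3c-9d2e-0123456789ab
# find /bricks/brick1 -samefile /bricks/brick1/.glusterfs/f2/1e/f21eb7ed-1a5a-4b3c-9d2e-0123456789ab -not -path '*/.glusterfs/*'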


Hopefully this helps others getting to grips with glusterfs sharding (and maybe even oVirt!)

26 Jun 2016

create, help, list, status, activate, deactivate : GlusterFS Snapshots CLI Part 1

After discussing what GlusterFS snapshots are, what their prerequisites are, and what goes on behind the creation of a snapshot, it's time we actually created one and familiarized ourselves with it.

To begin with let's create a volume called test_vol.
# gluster volume create test_vol replica 3 VM1:/brick/brick-dirs/brick VM2:/brick/brick-dirs/brick VM3:/brick/brick-dirs/brick
volume create: test_vol: success: please start the volume to access data
#
# gluster volume start test_vol
volume start: test_vol: success
#
# gluster volume info test_vol

Volume Name: test_vol
Type: Replicate
Volume ID: 09e773c9-e846-4568-a12d-6efb1cecf8cf
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: VM1:/brick/brick-dirs/brick
Brick2: VM2:/brick/brick-dirs/brick
Brick3: VM3:/brick/brick-dirs/brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
#
As you can see, we created a 1x3 replica volume and started it. We are now primed to take our snapshot of this volume. But before we do so, let's add some data to the volume.
# mount -t glusterfs VM1:/test_vol /mnt/test-vol-mnt/
#
# cd /mnt/test-vol-mnt
#
# ls -lrt
total 0
# touch file1
# ls -lrt
total 0
-rw-r--r-- 1 root root 0 Jun 24 13:39 file1
#
So we have successfully mounted our volume and created (touched) a file called file1. Now we will take a snapshot of 'test_vol' and call it 'snap1'.
# gluster snapshot create snap1 test_vol
snapshot create: success: Snap snap1_GMT-2016.06.24-08.12.42 created successfully
#
That's weird, isn't it? I asked it to create a snapshot called snap1, and it created a snapshot called snap1_GMT-2016.06.24-08.12.42. What happened is that it actually created a snapshot called snap1 and appended the snap's name with the timestamp of its creation. This is the default naming convention of GlusterFS snapshots, and like everything else, it is so for a couple of reasons.
  • This naming format is essential to support Volume Shadow Copy Service Support in GlusterFS volumes.
  • The reason for keeping it as default naming convention is that it is more informative than just a name. Scrolling through a list of snapshots not only gives you the thoughtful name you have chosen for it, but also the time the snapshot was created, which makes it so much more relatable to you, and gives you more clarity to decide what to do with the said snapshot.
But if it still looks icky to you, as it does to a lot of people, you can choose to not have the timestamp appended by adding the no-timestamp option to the create command.
# gluster snapshot create snap1 test_vol no-timestamp
snapshot create: success: Snap snap1 created successfully
#
So there you go. Congratulations on creating your first GlusterFS snapshot. Now what do you do with it, or rather, what all can you do with it? Let's ask for some help.
# gluster snapshot help
snapshot activate <snapname> [force] - Activate snapshot volume.
snapshot clone <clonename> <snapname> - Snapshot Clone.
snapshot config [volname] ([snap-max-hard-limit <count>] [snap-max-soft-limit <percent>]) | ([auto-delete <enable|disable>])| ([activate-on-create <enable|disable>]) - Snapshot Config.
snapshot create <snapname> <volname> [no-timestamp] [description <description>] [force] - Snapshot Create.
snapshot deactivate <snapname> - Deactivate snapshot volume.
snapshot delete (all | snapname | volume <volname>) - Snapshot Delete.
snapshot help - display help for snapshot commands
snapshot info [(snapname | volume <volname>)] - Snapshot Info.
snapshot list [volname] - Snapshot List.
snapshot restore <snapname> - Snapshot Restore.
snapshot status [(snapname | volume <volname>)] - Snapshot Status.
#
Quite the buffet, isn't it? So let's first see what snapshots we have here. gluster snapshot list will do the trick for us.
# gluster snapshot list
snap1_GMT-2016.06.24-08.12.42
snap1
#

# gluster snapshot list test_vol
snap1_GMT-2016.06.24-08.12.42
snap1
#
The list command will display all the snapshots in the trusted pool. Adding a volume's name along with the list command will list all snapshots of that particular volume only. As we have only one volume now, it shows the same result for both. It helps provide more clarity when you have a couple of volumes, and each volume has a number of snapshots.

We have previously discussed that a GlusterFS snapshot is like a GlusterFS volume. Just like a regular volume you can mount it, delete it, and even see its status. So let's see the status of our snapshots.
# gluster snapshot status

Snap Name : snap1_GMT-2016.06.24-08.12.42
Snap UUID : 26d1455d-1d58-4c39-9efa-822d9397088a

    Brick Path        :   VM1:/var/run/gluster/snaps/f4b2ae1fbf414c8383c3b198dd42e7d7/brick1/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   95.81
    LV Size           :   616.00m


    Brick Path        :   VM2:/var/run/gluster/snaps/f4b2ae1fbf414c8383c3b198dd42e7d7/brick2/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   3.45
    LV Size           :   616.00m


    Brick Path        :   VM3:/var/run/gluster/snaps/f4b2ae1fbf414c8383c3b198dd42e7d7/brick3/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   3.43
    LV Size           :   616.00m


Snap Name : snap1
Snap UUID : 73489d9b-c370-4687-8be9-fc094ee78d0a

    Brick Path        :   VM1:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick1/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   95.81
    LV Size           :   616.00m


    Brick Path        :   VM2:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick2/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   3.45
    LV Size           :   616.00m


    Brick Path        :   VM3:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick3/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   3.43
    LV Size           :   616.00m
As with the volume status command, the snapshot status command also shows the status of all the snapshot bricks of all snapshots. Adding the snapname in the status command displays the status of only that particular snapshot.
# gluster snapshot status snap1

Snap Name : snap1
Snap UUID : 73489d9b-c370-4687-8be9-fc094ee78d0a

    Brick Path        :   VM1:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick1/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   95.81
    LV Size           :   616.00m


    Brick Path        :   VM2:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick2/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   3.45
    LV Size           :   616.00m


    Brick Path        :   VM3:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick3/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   3.43
    LV Size           :   616.00m
Similar to the snapshot list command, adding the volname instead of the snapname to the status command displays the status of all snapshots of that particular volume.
The status itself gives us a wealth of information about each snapshot brick, like the volume group, the data percentage, and the LV size. It also tells us whether the brick is running or not, and if it is, what the PID of the brick is. Interestingly, we see that none of the bricks are running. This is the default behaviour of GlusterFS snapshots: a newly created snapshot is in a deactivated state (analogous to the Created/Stopped state of a GlusterFS volume), where none of its bricks are running. In order to start the snap brick processes we will have to activate the snapshot.
# gluster snapshot activate snap1
Snapshot activate: snap1: Snap activated successfully
#
# gluster snapshot status snap1

Snap Name : snap1
Snap UUID : 73489d9b-c370-4687-8be9-fc094ee78d0a

    Brick Path        :   VM1:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick1/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   Yes
    Brick PID         :   29250
    Data Percentage   :   95.81
    LV Size           :   616.00m


    Brick Path        :   VM2:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick2/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   Yes
    Brick PID         :   12616
    Data Percentage   :   3.45
    LV Size           :   616.00m


    Brick Path        :   VM3:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick3/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   Yes
    Brick PID         :   3058
    Data Percentage   :   3.43
    LV Size           :   616.00m
After the snapshot is activated, we can see that the bricks are running, along with their respective PIDs. The snapshot can also be deactivated again by using the deactivate command.
# gluster snapshot deactivate snap1
Deactivating snap will make its data inaccessible. Do you want to continue? (y/n) y
Snapshot deactivate: snap1: Snap deactivated successfully
#
# gluster snapshot status snap1

Snap Name : snap1
Snap UUID : 73489d9b-c370-4687-8be9-fc094ee78d0a

    Brick Path        :   VM1:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick1/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   95.81
    LV Size           :   616.00m


    Brick Path        :   VM2:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick2/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   3.45
    LV Size           :   616.00m


    Brick Path        :   VM3:/var/run/gluster/snaps/d5171e51e1ef407292ee4e24677385cb/brick3/brick
    Volume Group      :   snap_lvgrp
    Brick Running     :   No
    Brick PID         :   N/A
    Data Percentage   :   3.43
    LV Size           :   616.00m
Up till now we have barely grazed the surface. There's delete, restore, config, and a whole lot more. We will be covering these in future posts.

24 Jun 2016

Mandatory locks support with GlusterFS v3.8

Latest version 3.8 from GlusterFS community comes out with the support for Mandatory locks. Please refer the blog post announcing the release to get an overview of all new features delivered with 3.8. This article will be a background cum architectural analysis on mandatory locks feature for GlusterFS and its further possibilities when working under […]

17 Jun 2016

Gluster Tiering : CTR/Libgfdb present & future

In the previous post (or re-post) we discussed the operations of the Gluster Tiering feature. In this post we will discuss how we capture the heat (special metadata) of the files/data-objects in Gluster, so that data maintenance activities (which are expensive on CPU, storage and network), like tier migrations (promotions/demotions), can be done intelligently. We will also discuss possible improvements to the existing system so that heat can be captured efficiently and accurately. But before diving into this, let's look at the existing system.

The current architecture of Gluster tiering is as follows. We have a client-side translator called the tier xlator (for more details on Gluster translators, or xlators, click here). This translator is responsible for multiplexing IO to the hot or cold tier. As a thin application over this translator, we have the tier migration logic which functions as the Gluster tier migrator and does promotions and demotions. (NOTE: In Gluster, various data maintenance daemons, like the DHT rebalancer, AFR self-heal and the tiering migrator, are modified clients which are loaded with the necessary xlator graph.) The tier migrator, which runs on each file-serving node, queries each local Gluster brick's heat store via libgfdb and promotes or demotes files appropriately. Each file-serving process (Gluster brick) has a new translator called Change Time Recorder (CTR) that intercepts IO and records the required extra metadata (in our case heat) into the heat store of the respective brick, via libgfdb.

[Figure: Gluster tiering architecture - tier xlator and tier migrator on the client side, CTR and libgfdb on each brick]

We are not going to discuss the tier xlator or the tier migrator in detail in this post. Rather, we will discuss the CTR xlator, libgfdb and the heat data store.

Data tiering is a kind of data maintenance task and thus takes a toll on CPU, storage and network performance, so it should be done intelligently. Precision in selecting the right candidates for migration (promotion/demotion) makes data tiering more effective and efficient, as you will not be migrating undeserving candidates (files) to the wrong tier. For the very same reason, data migration is an activity that is done conservatively: either less frequently with a large number of files/bytes, or frequently with a small number of files/bytes.

Also, while capturing the heat of a file we need to take advantage of the distributed nature of the file system (Gluster), so that there is no single point of failure and load is balanced properly. We don't want the admin or user of the file system to have to be aware of such a heat data store, so that there is no maintenance (disaster recovery, service management etc.) overhead.

So, taking these factors (and some more) into consideration, the requirements for the heat store are as follows:

  • Should be well distributed: no single point of failure, with proper load balancing
  • No separate process/service for the admin to manage
  • Should store various heat metadata of the files, i.e. allow versatile heat configurations
  • Should provide good querying capabilities, giving precise query results
  • Should record the heat metadata quickly, without a performance hit on IO

When we explored options for such a store, the one that fit the bill was SQLite3, because:

  • It's a library and not a database service, so fewer hassles for the admin
  • Being a library, it can be embedded in the Gluster bricks, which provides distribution
  • Has good querying capabilities with SQL
  • Gives precise query results
  • Has decent write/record optimization
  • Can store various kinds of file heat metadata

Even though SQLite3 looked attractive, we wanted the flexibility to load and try other data stores, so we introduced a library called libgfdb. The main purpose of libgfdb is to provide a layer of abstraction, so that the library user is agnostic to the data store used below and uses standard library APIs to insert, update and query heat metadata of files from the data store.

Therefore the overall architecture of the solution looks like this. We have a new translator called Change Time Recorder (CTR) that is introduced in the brick stack. The sole purpose of this translator is to do inserts/updates to the heat data store, i.e. the sqlite3 DB, via libgfdb. The tier migration daemon that runs on every file-serving node queries the local brick's heat data store via libgfdb and does migrations accordingly.

[Figure: CTR translator in the brick stack writing to the per-brick sqlite3 heat store via libgfdb]
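As a rough illustration of how this is wired up from the CLI, here are the related volume options (option names as documented for the 3.7/3.8 tiering feature; the volume name and values are illustrative, and attach-tier normally enables CTR for you):

# record heat on the bricks via CTR
gluster volume set tiervol features.ctr-enabled on
# thresholds and cadence the tier migrator uses when it queries via libgfdb
gluster volume set tiervol cluster.write-freq-threshold 2
gluster volume set tiervol cluster.read-freq-threshold 2
gluster volume set tiervol cluster.tier-demote-frequency 3600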

Now let's look at the write performance optimizations that sqlite3 provides (a sketch of these settings as SQL follows the list):

PRAGMA page_size: align the page size to the OS page size
PRAGMA cache_size: increase the cache size
PRAGMA journal_mode: change to write-ahead logging (WAL)
PRAGMA wal_autocheckpoint: checkpoint less often
PRAGMA synchronous: set to OFF, i.e. no synchronous writes
PRAGMA auto_vacuum: set to NONE (garbage collection then needs to be done periodically)
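For illustration only, this is roughly what those settings look like when expressed as SQL against a brick's heat database; the DB path and the specific values are illustrative, since libgfdb applies its own values when it creates the database:

sqlite3 /bricks/brick1/.glusterfs/brick1.db <<'EOF'
PRAGMA page_size = 4096;            -- align to the OS page size
PRAGMA cache_size = -12500;         -- bigger page cache (negative value = KiB)
PRAGMA journal_mode = WAL;          -- write-ahead logging
PRAGMA wal_autocheckpoint = 25000;  -- checkpoint less often
PRAGMA synchronous = OFF;           -- no synchronous writes
PRAGMA auto_vacuum = NONE;          -- garbage collection must then be done periodically
EOF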

For more information on sqlite3 settings click here.

This is how inserts/updates go into the sqlite3 database.

[Figure: insert/update path into the sqlite3 heat store]

Looking at these settings, you may wonder whether we should be worrying about crash consistency. Actually no, because we have introduced a CTR lookup-heal feature which heals the heat data store during lookups and IO. Thus the heat data store will be eventually consistent, and for the tiering case eventual consistency is just fine!

Even though we have optimized sqlite3, our heat data store, we still see performance issues due to the following:

  • Indexing in WAL/DB: Even though writes to sqlite3 go in a sequential manner, there is a consistency check on the index/primary keys. So for a large database (even though this database only holds a part of the files in the whole Gluster namespace), the indexing takes a toll on performance.
  • Synchronous recording of heat: The heat of a file is recorded synchronously with the IO, slowing down the IO performance of bricks due to latency in the heat data store. This should be abstracted out.
  • SQL creates are expensive: Creates in sqlite3 are transactions, i.e. a create needs to be ACID in nature. This takes a toll on performance. A huge number of creates shows better performance when they are batched together in a transaction; hence the need for batch processing of creates, which is absent today.
  • Read-modify-update is expensive: Due to WAL (write-ahead logging), the read performance of sqlite3 is not great. This is fine for queries that are executed periodically, as the time taken to query is much smaller than the time taken to migrate files (i.e. the time taken for data maintenance on the query result is huge compared to the query time). But it has a very bad impact when we have read-modify-updates while capturing heat, for example counting the number of writes/IOPS on a file in a specified period of time. Such updates should be made less often, i.e. instead of 10 incremental updates we can consolidate them into one as +10.

All of the above raises the need for a caching solution in libgfdb which:

  • Removes the indexing part of the data store from the IO heat recording path
  • Batch process inserts/updates
  • Consolidate incremental updates.
  • Asynchronously record heat

The solution would be IMeTaL i.e In Memory Transaction Logs.

In Memory Transaction Logs, or simply imetal, is an in-memory circular log buffer. The idea is to record inserts/updates in imetal and, when logical partitions of imetal called segments fill up, batch-process those inserts/updates asynchronously (in parallel with incoming inserts/updates to other segments) and write them to the database. The only lock contention the IO threads have is the fine-grained lock on where to write in imetal, rather than on the whole index of the data store. In this way we are able to achieve the above requirements. Here is what imetal looks like.

[Figure: imetal - in-memory circular log buffer with segments, locator, flush queue and worker thread pool]

Now let's look at the detailed design of imetal by going through its components:

  • Log Record: each record that is inserted into imetal.
  • Segment: a segment is a logical partition of the log. Each segment holds a fixed number of log records.
  • Locator: the locator tells an IO thread where to insert a log record in imetal, or more precisely, which location in which segment.
  • Flush Queue: full or expired segments go to the flush queue.
  • Worker Thread Pool: a pool of worker threads that works on the segments placed in the flush queue.
  • Flush Function: the function that is called when a segment, which is nothing but a batch of log records, needs to be flushed to the database.
  • Flush Data Structure: before the flush we need to sort the log records to remove redundant entries and keep only one entry per GFID. For this we need a data structure like a balanced binary search tree, e.g. a red-black tree (insertion and search in O(log n)), that helps us sort the log records. Once the log records are sorted in the data structure without duplication, we traverse the flush data structure and start flushing the log records into the database.

Now, imetal is not specific to libgfdb; it can be designed and implemented in a generic way, i.e. consumers of imetal should be given the liberty to define their own log record, flush function and flush data structure, with imetal only providing the infrastructure for batch processing. Also, imetal should be implemented in libgfdb, above the different data-store plugins; that way, no matter what the data store, users can benefit from imetal.

Configurations in imetal could be,

  • Size of a segment: how many log records in a segment
  • Size of imetal: how many segments in imetal
  • Thread pool size: how many worker threads
  • On-demand flush: a mechanism to flush segments even when they are not full

Potential issues with imetal could be,

  • Loss of inserts/updates : If the imetal is full then the inserts/updates are lost. This may lead to loss of precision. But with the help of CTR lookup heal we heal the heat-data-store.
  • Freshness of the query: Since we are not inserting/updating the datastore directly, we need to flush all segments on demand, before the query is fired.

imetal is not yet implemented, and this could be a good place for someone from the community to pick it up as a mini project 😉

There are a few other performance/scale issues with sqlite3:

  • sqlite3 doesn't scale well: Since sqlite3 stores data in a single file, it doesn't scale well when we have a huge number of records. The obvious solution is sharding of the database, i.e. having multiple sqlite3 database files for a single brick and distributing records among them based on GFID.
  • Garbage collection: We have turned auto garbage collection off to improve performance, but this can backfire as the database grows, since there will be a lot of sparse space in the databases. Periodic garbage collection is needed.

To summarize, tiering is a very tricky feature and it takes time to perfect and fine-tune it. With CTR/libgfdb we have tried to provide a good number of tunables and plugins, so that new things can be tried and we arrive at an optimal solution.

Signing off for now!🙂


13 Jun 2016

GlusterFS Snapshots And Their Prerequisites

Long time, no see huh!!! This post has been pending on my part for a while now, partly because I was busy and partly because I am that lazy. But it's a fairly important post as it talks about snapshotting the GlusterFS volumes. So what are these snapshots and why are they so darn important. Let's find out...

Wikipedia says, 'a snapshot is the state of a system at a particular point in time'. In filesystems specifically, a snapshot is a 'backup' (a read-only copy of the data set frozen at a point in time). Obviously, it's not a full backup of the entire dataset, but it's a backup nonetheless, which makes it pretty important. Now moving on to GlusterFS snapshots. GlusterFS snapshots are point-in-time, read-only, crash-consistent copies of GlusterFS volumes. They are also online snapshots, and hence the volume and its data continue to be available to the clients while the snapshot is being taken.

GlusterFS snapshots are thinly provisioned LVM-based snapshots, and hence they have certain pre-requisites. A quick look at the product documentation tells us what those pre-requisites are. For a GlusterFS volume to be able to support snapshots, it needs to meet the following pre-requisites (a sketch of setting up such a brick follows the list):
  • Each brick of the GlusterFS volume should be on an independent, thinly-provisioned LVM.
  • A brick's LVM should not contain any data other than the brick's.
  • None of the bricks should be on a thick LVM.
  • The gluster version should be 3.6 or above (duh!!).
  • The volume should be started.
  • All brick processes must be up and running.
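To make the first three points concrete, here is a hedged sketch of carving out such a brick; the device, VG/LV names and sizes are all illustrative, so consult your platform's install guide for recommended values:

# a dedicated thin pool and a thin LV for the brick
pvcreate /dev/sdb
vgcreate gluster_vg /dev/sdb
lvcreate -L 50G --thinpool brick_thinpool gluster_vg
lvcreate -V 40G --thin -n brick1_lv gluster_vg/brick_thinpool

# format and mount it; the brick directory holds nothing but brick data
mkfs.xfs -i size=512 /dev/gluster_vg/brick1_lv
mkdir -p /bricks/brick1
mount /dev/gluster_vg/brick1_lv /bricks/brick1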

Now that I have laid out the rules above, let me give you their origin story as well: that is, how GlusterFS snapshots internally enable you to take a crash-consistent backup using thinly-provisioned LVM in a space-efficient manner. We start by having a look at a GlusterFS volume whose bricks are on independent, thinly-provisioned LVMs.

[Figure: volume test_vol with Brick1 and Brick2, each on its own thinly-provisioned LVM]
In the above diagram, we can see that the GlusterFS volume test_vol comprises two bricks, Brick1 and Brick2. Both bricks are mounted on independent, thinly-provisioned LVMs. When the volume is mounted, the client process maintains a connection to both bricks. That is as much of a summary of GlusterFS volumes as is needed for this post. A GlusterFS snapshot is also internally a GlusterFS volume, with the exception that it is a read-only volume, and it is treated differently from a regular volume in certain aspects.

When we take a snapshot (say snap1) of the GlusterFS volume test_vol, the following set of things happens in the background:
  • It is checked whether the volume is in the started state, and if so, whether all the brick processes are up and running.
  • At this point in time, we barrier certain fops in order to make the snapshot crash-consistent. What this means is that, even though it is an online snapshot, certain write fops will be barriered for the duration of the snapshot. Fops that are on the fly when the barrier is initiated will be allowed to complete, but the acknowledgement to the client will be pending until the snapshot creation is complete. The barriering has a default time-out window of 2 minutes, within which, if the snapshot is not complete, the fops are unbarriered and we fail that particular snapshot.
  • After successfully barriering fops on all brick processes, we proceed to take individual copy-on-write LVM snapshots of each brick. A copy-on-write LVM snapshot ensures a fast, space-efficient backup of the data currently on the brick. These LVM snapshots reside in the same LVM thinpool as the GlusterFS brick LVMs (see the lvs sketch after this list).
  • Once this snapshot is taken, we carve bricks out of these LVM snapshots and create a snapshot volume out of those bricks.
  • Once the snapshot creation is complete, we unbarrier the GlusterFS volume.
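Purely as an illustration of that last point about the thinpool (LV and VG names reuse the hypothetical brick setup sketched earlier, and the snapshot LV's actual name is an internal detail), lvs on a brick node would then show a thin snapshot sharing the brick's pool:

lvs gluster_vg
#  LV                                 VG         Attr       LSize  Pool           Origin
#  brick_thinpool                     gluster_vg twi-aotz-- 50.00g
#  brick1_lv                          gluster_vg Vwi-aotz-- 40.00g brick_thinpool
#  d5171e51e1ef407292ee4e24677385cb_0 gluster_vg Vwi-aotz-- 40.00g brick_thinpool brick1_lv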
[Figure: snapshot snap1 with bricks carved out of the per-brick LVM snapshots in the same thinpool]
As can be seen in the above diagram, the snapshot creation process has created an LVM snapshot for each LVM, and these snapshots lie in the same thinpool as the LVM. Then we carve bricks (Brick1" and Brick2") out of these snapshots, and create a snapshot volume called snap1.

This snapshot, snap1, is a read-only snapshot volume which can be:
  • Restored to the original volume test_vol.
  • Mounted as a read-only volume and accessed (see the sketch after this list).
  • Cloned to create a writeable snapshot.
  • Accessed via User-Serviceable Snapshots.
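As a quick taste of the second item, an activated snapshot can be mounted read-only via its snaps path; a minimal sketch, where the host, snapshot and volume names follow the examples above and the mount point is illustrative:

# mount the activated snapshot on a client, read-only
mount -t glusterfs VM1:/snaps/snap1/test_vol /mnt/snap1-mnt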
All these functionalities will be discussed in future posts, starting with the command line tools to create, delete and restore GlusterFS snapshots.

6 Jun 2016

Collecting my thoughts about Torus

The other day, CoreOS announced a new distributed storage system called Torus. Not too surprisingly, a lot of people have asked for my opinion about it, so I might as well collect some of my thoughts here.

First off, let me say that I like the CoreOS team and I welcome new projects in this space - especially when they're open source. When I wrote the first C bindings for etcd, it gave me occasion to interact a bit with Brandon Phillips. He seems like an awesome fellow, and as far as I can tell others on that team are good as well. I think it's great that they're turning their attention to storage. I don't want them to go away, or fail. I want to see them succeed, and teach us all something new.

If I seem negative it's not toward or because of the developers. Like many engineers, I have a strong distaste for excessive marketing, and that's what I find objectionable about the announcement. The claims are not only far beyond anything that has actually been achieved, which is fine for a new project, but also far in excess of anything that experience tells us is likely to be achieved within any relevant period of time. Willingness to tackle unknown problems is great, but these are for the most part not unknown problems. The difficulties are quite well known, and represent hard distributed-system problems. If you want to claim that solutions are imminent, it really helps to demonstrate a thorough understanding of those problems. Instead, we're presented with claims that are vague or misleading, claims that illustrate significant gaps in knowledge, and at least one claim that's blatantly false. Quoting from the announcement:

These distributed storage systems were mostly designed for a regime of small clusters of large machines, rather than the GIFEE approach that focuses on large clusters of inexpensive, “small” machines.

It's not true for Gluster. It's not true for Ceph. It's not true for Lustre, OrangeFS, and so on. It's not even true for Sheepdog, which Torus very strongly resembles. None of these systems were designed for small clusters. It's true that some of them might have more trouble than they should scaling up to hundreds of machines, but those are implementation issues and the work that remains to be done is still less than building a whole new system from scratch.

The same paragraph then continues by talking about the specific problems with high-margin proprietary systems, implying that they're the most relevant alternative. They're not. Already, I've seen many people comparing Torus to open-source solutions, and nobody comparing them to proprietary ones. The omission of other open-source projects from their portrayal stands out as deliberate avoidance of hard questions. So does the lack of any explanation of what makes Torus any better than anything else for containers. Being written in Go doesn't make something container-specific. Neither does using etcd. There's nothing in the announcement about any actual container-oriented features, like multi-tenancy or built-in support for efficient overlays. It's just a vanilla block store using basic algorithms, marketed as good for containers. There's nothing wrong with that, in fact it's quite useful, but it's hardly ground-breaking. Anyone who attended my FAST tutorials on this subject during the three years I gave them could have built something similar in the same six months.

The other part of the announcement that bothers me is this.

Torus includes support for consistent hashing, replication, garbage collection, and pool rebalancing through the internal peer-to-peer API. The design includes the ability to support both encryption and efficient Reed-Solomon error correction in the near future, providing greater assurance of data validity and confidentiality throughout the system.

"Includes support" via an API? Does that mean it's already there, or planned, or just hypothetically possible? The first two seem to be there already. I wouldn't be so sure about any sufficiently transparent and non-disruptive form of rebalancing. Encryption and Reed-Solomon are supposedly in the "near future" but I doubt that future is really so near. The implication is that these will be easy to add, but I think the people who have worked on these for Gluster or Ceph or HDFS or Swift would all disagree. Similarly, there's this from Hacker News:

early versions had POSIX access, though it was terribly messy. We know the architecture can support it, it's just a matter of learning from the mistakes and building something worth supporting.

"Just a matter" eh? It was "just a matter" for CephFS to be implemented on top of RADOS too, but it took multiple genius-level people multiple years to get that where it is today. Saying this is "just" anything sets an unrealistic expectation. I'd expect anyone who actually understands the problem domain to warn people that getting from block-storage simplicity to filesystem complexity is a big step. Such a transition might take a while, or not happen at all. Then there's this.

Some good benchmarks to run:

Linear write speed

dd if=/dev/zero of=/mnt/torus/testfile bs=1K count=4000000

Traditional benchmark

bonnie++ -d /mnt/torus/test -s 8G -u core

Single-threaded sequential 1KB writes for a total of 4GB, without even oflag=sync? Bonnie++? Sorry, but these are not "good benchmarks to run" at all. They're garbage. People who know storage would never suggest these. We're all sick of complaints that these are slow, or slower on distributed systems than on local disks, as though that's avoidable somehow. Anybody who would suggest these is not a storage professional, and should not be making any claims about how long it might take to implement filesystem semantics on top of what Torus already has.

So, again, this is not about the project itself but the messaging around it. For the project itself and the engineers working on it: welcome. Best of luck to you. Feel free to ping me if you want to brainstorm or compare notes. BTW, I'll be in San Francisco at the end of this month. For the marketing folks: get real. You're setting your own engineers up for failure, disappointment, and recriminations. I know you want to paint the best picture you can, but that's no excuse for presenting fiction as fact. That's a perfectly good horse you have there. Maybe that horse will even be good enough to win a race or two some day. Stop trying to tell people it's a unicorn just because you have some ideas about how to graft a horn onto its head.

30 May 2016

Making gluster play nicely with others

These days hyperconverged strategies are everywhere. But when you think about it, sharing the finite resources within a physical host requires an effective means of prioritisation and enforcement. Luckily, the Linux kernel already provides an infrastructure for this in the shape of cgroups, and the interface to these controls is now simplified with systemd integration.

So let's look at how you could use these capabilities to make Gluster a better neighbour in a collocated or hyperconverged model.

First, some common systemd terms we should be familiar with;
slice : a slice is a concept that systemd uses to group together resources into a hierarchy. Resource constraints can then be applied to the slice, which defines 
  • how different slices may compete with each other for resources (e.g. weighting)
  • how resources within a slice are controlled (e.g. cpu capping)
unit : a systemd unit is a resource definition for controlling a specific system service
NB. More information about control groups with systemd can be found here

In this article, I'm keeping things simple by implementing a cpu cap on glusterfs processes. Hopefully, the two terms above are big clues, but conceptually it breaks down into two main steps;
  1. define a slice which implements a CPU limit
  2. ensure gluster's systemd unit(s) start within the correct slice.
So let's look at how this is done.

Defining a slice

Slice definitions can be found under /lib/systemd/system, but systemd provides a neat feature where /etc/systemd/system can be used to provide local "tweaks". This override directory is where we'll place a slice definition. Create a file called glusterfs.slice, containing;

[Slice]
CPUQuota=200%

CPUQuota is our means of applying a cpu limit on all resources running within the slice. A value of 200% defines a 2 cores/execution threads limit.

Updating glusterd


The next step is to give gluster a nudge so that it shows up in the right slice. If you're using RHEL7 or CentOS7, cpu accounting may be off by default (you can check in /etc/systemd/system.conf). This is OK, it just means we have an extra parameter to define. Follow these steps to change the way glusterd is managed by systemd

# cd /etc/systemd/system
# mkdir glusterd.service.d
# echo -e "[Service]\nCPUAccounting=true\nSlice=glusterfs.slice" > glusterd.service.d/override.conf
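For clarity, the override file written by that echo (glusterd.service.d/override.conf) ends up containing;

[Service]
CPUAccounting=true
Slice=glusterfs.slice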

glusterd is responsible for starting the brick and self-heal processes, so by ensuring glusterd starts in our cpu-limited slice, we capture all of glusterd's child processes too. Now the potentially bad news...this 'nudge' requires a stop/start of gluster services. If you're doing this on a live system you'll need to consider quorum, self heal etc. However, with the settings above in place, you can get gluster into the right slice by;

# systemctl daemon-reload
# systemctl stop glusterd
# killall glusterfsd && killall glusterfs
# systemctl daemon-reload
# systemctl start glusterd


You can see where gluster is within the control group hierarchy by looking at its runtime settings

# systemctl show glusterd | grep slice
Slice=glusterfs.slice
ControlGroup=/glusterfs.slice/glusterd.service
Wants=glusterfs.slice
After=rpcbind.service glusterfs.slice systemd-journald.socket network.target basic.target

or use the systemd-cgls command to see the whole control group hierarchy

├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 19
├─glusterfs.slice
│ └─glusterd.service
│   ├─ 867 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
│   ├─1231 /usr/sbin/glusterfsd -s server-1 --volfile-id repl.server-1.bricks-brick-repl -p /var/lib/glusterd/vols/repl/run/server-1-bricks-brick-repl.pid 
│   └─1305 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log
├─user.slice
│ └─user-0.slice
│   └─session-1.scope
│     ├─2075 sshd: root@pts/0  
│     ├─2078 -bash
│     ├─2146 systemd-cgls
│     └─2147 less
└─system.slice

At this point gluster is exactly where we want it! 

Time for some more systemd coolness ;) The resource constraints that are applied by the slice are dynamic, so if you need more cpu, you're one command away from getting it;

# systemctl set-property glusterfs.slice CPUQuota=350%

Try the 'systemd-cgtop' command to show the cpu usage across the complete control group hierarchy.

Now if jumping straight into applying resource constraints to gluster is a little daunting, why not test this approach with a tool like 'stress'. Stress is designed to simply consume components of the system - cpu, memory, disk. Here's an example .service file which uses stress to consume 4 cores

[Unit]
Description=CPU soak task

[Service]
Type=simple
CPUAccounting=true
ExecStart=/usr/bin/stress -c 4
Slice=glusterfs.slice

[Install]
WantedBy=multi-user.target

Now you can tweak the service, and the slice with different thresholds before you move on to bigger things! Use stress to avoid stress :)

And now the obligatory warning. Introducing any form of resource constraint may result in unexpected outcomes, especially in hyperconverged/collocated systems - so adequate testing is key.

With that said...happy hacking :)