Recently I started to revisit the idea of using zfs and linux (zfsonlinux) as the basis for a server that will eventually be the foundation of our gluster storage infrastructure. At this point we are using the OpenSolaris version of zfs and an older (but stable) version of gluster (3.0.5).
The problem with staying with OpenSolaris (besides the fact that it is no longer actively supported) is that we would be unable to upgrade gluster, and thus unable to take advantage of the new and upcoming features in later versions (such as geo-replication, snapshots and active-active geo-replication), along with various bugfixes and performance enhancements.
Hardware:
Here are the specs for the current hardware I am using to test:
- CPU: 2 x Intel Xeon E5410 @ 2.33GHz
- RAM: 32 GB DDR2 DIMMs
- Hard drives: 48 x 2TB Western Digital SATA II
- RAID controller: 2 x 3ware 9650SE-24M8 PCIe
- Ubuntu 11.10
- Glusterfs version 3.2.5
- 1 Gbps interconnects (LAN)
ZFS installation:
I decided to use Ubuntu 11.10 for this round of testing. Currently the daily ppa has a lot of bugfixes and performance improvements that do not exist in the latest stable release (0.6.0-rc6), so the daily ppa is the version that should be used until either v0.6.0-rc7 or v0.6.0 final is released.
Here is what you will need to get zfs installed and running:
# apt-add-repository ppa:zfs-native/daily
# apt-get update
# apt-get install debootstrap ubuntu-zfs
At this point we can create our first zpool. Here is the syntax used to create a 6 disk raidz2 vdev:
# zpool create -f tank raidz2 sdc sdd sde sdf sdg sdh
Now let’s check the status of the zpool:
# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0

errors: No known data errors
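As a rough sanity check on what this layout yields, a raidz2 vdev gives up two disks' worth of space to parity. The snippet below is only a back-of-the-envelope sketch: the function name raidz_usable_tb is hypothetical (it is not a zfs command), and the estimate ignores metadata overhead, padding and the TB versus TiB difference.

```shell
#!/bin/sh
# Hypothetical helper: approximate usable capacity of a single raidz vdev.
# usable = (number of disks - parity disks) * size of one disk
raidz_usable_tb() {
    disks=$1
    parity=$2
    size_tb=$3
    echo $(( (disks - parity) * size_tb ))
}

# The 6 disk raidz2 vdev created above, built from 2TB drives:
raidz_usable_tb 6 2 2
```

For the pool above this prints 8, i.e. roughly 8TB usable out of 12TB raw. The figure zfs itself reports will be somewhat lower.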
ZFS Benchmarks:
Before adding gluster on top, I ran a few tests to see what kind of performance I could expect out of zfs by itself. That way I would have a better idea of where the bottleneck (if any) was.
linux 3.3-rc5 kernel untar:
single ext4 disk: 3.277s
zfs 2 disk mirror: 19.338s
zfs 6 disk raidz2: 8.256s
dd using block size of 4096:
single ext4 disk: 204 MB/s
zfs 2 disk mirror: 7.5 MB/s
zfs 6 disk raidz2: 174 MB/s
dd using block size of 1M:
single ext4 disk: 153.0 MB/s
zfs 2 disk mirror: 99.7 MB/s
zfs 6 disk raidz2: 381.2 MB/s
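The exact dd invocations are not listed here, so the following is only a sketch of how a sequential write test of this kind could be run. The helper name dd_write_test, the /tank mount point, the 1 GiB test size, /dev/zero as the data source and the conv=fsync flag are all assumptions rather than the commands used for the numbers above. Writing zeros only gives meaningful results when compression is turned off on the dataset.

```shell
#!/bin/sh
# Hypothetical helper (not from the original tests): write COUNT blocks of
# size BS to a scratch file in DIR, print dd's summary line, then clean up.
dd_write_test() {
    dir=$1
    bs=$2
    count=$3
    scratch="$dir/ddtest.$$"
    # conv=fsync makes dd flush the file to disk before it reports, so the
    # figure is less likely to reflect in-memory caching alone.
    dd if=/dev/zero of="$scratch" bs="$bs" count="$count" conv=fsync 2>&1 | tail -n 1
    rm -f "$scratch"
}

# Example invocations (assumed mount point /tank, 1 GiB written per run):
#   dd_write_test /tank 4096 262144
#   dd_write_test /tank 1M 1024
```

Keeping the total amount written the same for both block sizes makes the two runs easier to compare.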
Gluster + ZFS Benchmarks
Next I added gluster (version 3.2.5) to the mix to see how they performed together:
linux 3.3-rc5 kernel untar:
zfs 6 disk raidz2 + gluster (replication): 4m10.093s
zfs 6 disk raidz2 + gluster (geo replication): 1m12.054s
dd using block size of 4096:
zfs 6 disk raidz2 + gluster (replication): 53.6 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 53.7 MB/s
dd using block size of 1M:
zfs 6 disk raidz2 + gluster (replication): 45.7 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 155 MB/s
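To put the untar timings in perspective, the overhead that gluster adds can be expressed as a ratio against the local 6 disk raidz2 result (8.256s). This is just arithmetic on the numbers listed above; the helper name slowdown is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: ratio of two timings in seconds, to one decimal place.
slowdown() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# 4m10.093s = 250.093s and 1m12.054s = 72.054s
slowdown 250.093 8.256   # replication vs. local raidz2
slowdown 72.054 8.256    # geo replication vs. local raidz2
```

This prints 30.3 and 8.7, so for a small-file workload like a kernel untar, replication was roughly 30 times slower than the local pool and geo replication roughly 9 times slower.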
Conclusion
So far so good: I have been running the zfsonlinux port for two weeks now without any real issues. From what I understand there is still a decent amount of work left to do around dedup and compression (neither of which I necessarily require for this particular setup).
The good news is that the zfsonlinux developers have not even really started looking into improving performance at this point, since their main focus thus far has been overall stability.
A good deal of development is also taking place to allow linux to boot from a zfs ‘/boot’ partition. This is currently an option on several distros, including Ubuntu and Gentoo, but the setup requires a fair amount of effort to get going, so it will be nice when this style of setup is supported out of the box.
In terms of Gluster specifically, it performs quite well using geo-replication with larger file sizes. I am really looking forward to the active-active geo-replication feature, currently planned for v3.4, becoming fully implemented and available. Our production setup (currently two-node replication) has a T3 (WAN) interconnect, so having the option to use geo-replication in the future should really speed up our write throughput, which is hampered by the throughput of the T3 itself.
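To see why the T3 is the limiting factor, it helps to convert line rates into MB/s. The sketch below assumes the nominal T3 (DS3) rate of 44.736 Mbit/s and ignores all protocol overhead, so real-world throughput will be lower; the helper name mbit_to_mbyte is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: convert a line rate in Mbit/s to MB/s
# (8 bits per byte, no protocol overhead).
mbit_to_mbyte() {
    awk -v r="$1" 'BEGIN { printf "%.1f\n", r / 8 }'
}

mbit_to_mbyte 44.736   # T3 (DS3) nominal line rate
mbit_to_mbyte 1000     # 1 Gbps LAN used in the tests above
```

This prints 5.6 and 125.0. With synchronous two-node replication every write has to cross the WAN before it completes, so sustained writes cannot exceed roughly 5.6 MB/s no matter how fast the local zpool is.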