I’ve wanted to test out NVMe over TCP as a datastore from an ESXi host for some time. You may say, “But Kenyon, the Linux NVMe target implementation doesn’t support fused commands.” And you’d be right. At one point I went through the configuration of an NVMe target via the nvmetcli tools. While this works for Linux hosts, the kernel target modules don’t support the fused commands that ESXi requires for file locking: the ESXi host issues a compare-and-write (test and set) as a single fused command. I’ve read the Linux kernel mailing list threads about this, and the developers don’t want to add support for a number of reasons. Well, I received two Intel Optane 905 PCIe devices from the vExpert hardware testing program and thought I would give it a shot on my equipment. After some digging I ran into SPDK (https://github.com/spdk/spdk). This is a user-space development kit for all kinds of things, but it includes an NVMe-oF target that supports TCP and RDMA with fused command support. Super cool.
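As a side note, you can check whether a controller you are connected to actually advertises fused command support: the Identify Controller data has a Fused Operation Support (FUSES) field, and nvme-cli prints it. This is just a quick check from a Linux initiator, and the device name below is only an example:
# Dump Identify Controller data and look at the FUSES field.
# Bit 0 set means the compare-and-write fused pair is supported.
sudo nvme id-ctrl /dev/nvme0 | grep -i fuses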
Configuring the target is straightforward. From the SPDK docs, the install goes like this:
git clone https://github.com/spdk/spdk --recursive
cd spdk
sudo scripts/pkgdep.sh --all
./configure --with-rdma
make
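If you want to sanity check the build before moving on, SPDK ships a unit test runner; the path below is where it lives in the checkouts I’ve looked at, so adjust if your release differs:
# Optional: run SPDK's unit tests against the freshly built tree.
sudo ./test/unit/unittest.sh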
The build completed with no issues on my Ubuntu VM. Then, with some more doc reading, I came to these configuration steps:
sudo scripts/setup.sh
This allocates hugepages, unbinds the NVMe devices from the kernel driver, and prints each device it claims:
0000:0b:00.0 (8086 2700): nvme -> uio_pci_generic
The PCI address (0000:0b:00.0 above) is what we will use to attach the device to the target later.
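If you want to see what setup.sh did, or hand the devices back to the kernel nvme driver later, the same script has status and reset modes:
# Show device bindings and hugepage reservations.
sudo scripts/setup.sh status
# Rebind devices to their kernel drivers and free the hugepages.
sudo scripts/setup.sh reset
With the device bound, the target itself is configured over SPDK’s JSON-RPC interface: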
build/bin/nvmf_tgt -m [1,2,3] &
scripts/rpc.py nvmf_create_transport -t TCP -u 16384 -m 8 -c 8192
scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -a 0000:0b:00.0 -t pcie
scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000000000001 -d SPDK_Controller1
scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 nvme0n1 -n 2 -u 483099c6-ac37-4bca-bef3-679a5aff2a6c
scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t tcp -a 192.168.10.163 -s 8009
The [1,2,3] on the nvmf_tgt command specifies which CPU cores to use. I had 4 in my VM, so I left one for the OS and used the other three for nvmf_tgt. The rest comes straight from the documentation, except the -u when adding the namespace: if you do not specify a UUID for the namespace, ESXi will not recognize it, and vmkwarning.log will fill with entries complaining that the namespace ID is not supported. The listener IP address should be the IP of the VM you are using as the target.
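Keep in mind that everything configured through rpc.py lives only in the running nvmf_tgt process. If you want the target to come back the same way after a restart, you can dump the live configuration to JSON and feed it back in on the next start. The file name here is just an example, and the flag for loading JSON config has moved around between SPDK releases, so check rpc.py and nvmf_tgt --help on your version:
# Dump the current running configuration (transport, bdevs, subsystems,
# listeners) to a JSON file.
scripts/rpc.py save_config > nvmf_tgt_config.json
# Next time, start the target with that configuration preloaded.
build/bin/nvmf_tgt -m [1,2,3] --json nvmf_tgt_config.json &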
Once this is completed, you can follow the ESXi documentation to configure the NVMe over TCP adapter and discover the controller.
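For reference, the ESXi side can also be driven from esxcli. Something along these lines is what I would expect, but I’m writing it from memory of the vSphere docs, so treat the adapter name (vmhba65), the vmnic, and the exact flags as assumptions and check esxcli nvme fabrics --help on your build:
# Create the software NVMe over TCP adapter on an uplink (vmnic is an example).
esxcli nvme fabrics enable --protocol TCP --device vmnic1
# Ask the SPDK target's discovery service what subsystems it exposes.
esxcli nvme fabrics discover --adapter vmhba65 --ip-address 192.168.10.163 --port-number 8009
# Connect to the subsystem created above.
esxcli nvme fabrics connect --adapter vmhba65 --ip-address 192.168.10.163 --port-number 8009 --subsystem-nqn nqn.2016-06.io.spdk:cnode1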
Some very quick and dirty fio read tests show pretty good throughput performance:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=./file --size=5G --bs=16M --iodepth=32 --readwrite=read --time_based --runtime=60 --eta-newline 2
test: (g=0): rw=read, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [R(1)][6.7%][r=2306MiB/s][r=144 IOPS][eta 00m:56s]
Jobs: 1 (f=1): [R(1)][11.7%][r=2162MiB/s][r=135 IOPS][eta 00m:53s]
Jobs: 1 (f=1): [R(1)][18.0%][r=2224MiB/s][r=139 IOPS][eta 00m:50s]
Jobs: 1 (f=1): [R(1)][23.3%][r=2208MiB/s][r=138 IOPS][eta 00m:46s]
Jobs: 1 (f=1): [R(1)][26.7%][r=2162MiB/s][r=135 IOPS][eta 00m:44s]
Jobs: 1 (f=1): [R(1)][31.7%][r=2194MiB/s][r=137 IOPS][eta 00m:41s]
Jobs: 1 (f=1): [R(1)][36.7%][r=2130MiB/s][r=133 IOPS][eta 00m:38s]
Jobs: 1 (f=1): [R(1)][41.7%][r=2082MiB/s][r=130 IOPS][eta 00m:35s]
Jobs: 1 (f=1): [R(1)][46.7%][r=2194MiB/s][r=137 IOPS][eta 00m:32s]
Jobs: 1 (f=1): [R(1)][52.5%][r=2192MiB/s][r=137 IOPS][eta 00m:28s]
Jobs: 1 (f=1): [R(1)][56.7%][r=2240MiB/s][r=140 IOPS][eta 00m:26s]
Jobs: 1 (f=1): [R(1)][62.7%][r=2208MiB/s][r=138 IOPS][eta 00m:22s]
Jobs: 1 (f=1): [R(1)][66.7%][r=2224MiB/s][r=139 IOPS][eta 00m:20s]
Jobs: 1 (f=1): [R(1)][71.7%][r=2144MiB/s][r=134 IOPS][eta 00m:17s]
Jobs: 1 (f=1): [R(1)][76.7%][r=1938MiB/s][r=121 IOPS][eta 00m:14s]
Jobs: 1 (f=1): [R(1)][81.7%][r=2050MiB/s][r=128 IOPS][eta 00m:11s]
Jobs: 1 (f=1): [R(1)][88.1%][r=2144MiB/s][r=134 IOPS][eta 00m:07s]
Jobs: 1 (f=1): [R(1)][91.7%][r=2144MiB/s][r=134 IOPS][eta 00m:05s]
Jobs: 1 (f=1): [R(1)][96.7%][r=2098MiB/s][r=131 IOPS][eta 00m:02s]
Jobs: 1 (f=1): [R(1)][100.0%][r=2194MiB/s][r=137 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1379: Wed Jan 25 15:06:23 2023
read: IOPS=135, BW=2163MiB/s (2268MB/s)(128GiB/60573msec)
bw ( MiB/s): min= 1376, max= 2496, per=99.74%, avg=2157.67, stdev=127.88, samples=121
iops : min= 86, max= 156, avg=134.84, stdev= 7.98, samples=121
cpu : usr=0.22%, sys=7.98%, ctx=17462, majf=0, minf=131081
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.6%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=8190,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=2163MiB/s (2268MB/s), 2163MiB/s-2163MiB/s (2268MB/s-2268MB/s), io=128GiB (137GB), run=60573-60573msec
Disk stats (read/write):
dm-0: ios=8820/107, merge=0/0, ticks=1293016/4652, in_queue=1297668, util=89.11%, aggrios=107109/76, aggrmerge=0/31, aggrticks=15069665/2383, aggrin_queue=14857744, aggrutil=99.78%
sda: ios=107109/76, merge=0/31, ticks=15069665/2383, in_queue=14857744, util=99.78%
IOPS performance is not so good. This could be due to any number of things. I need to rebuild this test with some larger VMs and see what numbers come out:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=./file --size=5G --bs=4k --iodepth=128 --readwrite=read --time_based --runtime=60 --eta-newline 2
test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [R(1)][6.7%][r=104MiB/s][r=26.7k IOPS][eta 00m:56s]
Jobs: 1 (f=1): [R(1)][11.7%][r=274MiB/s][r=70.1k IOPS][eta 00m:53s]
Jobs: 1 (f=1): [R(1)][16.7%][r=106MiB/s][r=27.2k IOPS][eta 00m:50s]
Jobs: 1 (f=1): [R(1)][22.0%][r=280MiB/s][r=71.7k IOPS][eta 00m:46s]
Jobs: 1 (f=1): [R(1)][26.7%][r=120MiB/s][r=30.7k IOPS][eta 00m:44s]
Jobs: 1 (f=1): [R(1)][31.7%][r=271MiB/s][r=69.4k IOPS][eta 00m:41s]
Jobs: 1 (f=1): [R(1)][36.7%][r=122MiB/s][r=31.2k IOPS][eta 00m:38s]
Jobs: 1 (f=1): [R(1)][41.7%][r=286MiB/s][r=73.1k IOPS][eta 00m:35s]
Jobs: 1 (f=1): [R(1)][46.7%][r=128MiB/s][r=32.8k IOPS][eta 00m:32s]
Jobs: 1 (f=1): [R(1)][52.5%][r=288MiB/s][r=73.7k IOPS][eta 00m:28s]
Jobs: 1 (f=1): [R(1)][56.7%][r=136MiB/s][r=34.8k IOPS][eta 00m:26s]
Jobs: 1 (f=1): [R(1)][62.7%][r=247MiB/s][r=63.2k IOPS][eta 00m:22s]
Jobs: 1 (f=1): [R(1)][66.7%][r=144MiB/s][r=36.8k IOPS][eta 00m:20s]
Jobs: 1 (f=1): [R(1)][71.7%][r=292MiB/s][r=74.9k IOPS][eta 00m:17s]
Jobs: 1 (f=1): [R(1)][76.7%][r=144MiB/s][r=36.9k IOPS][eta 00m:14s]
Jobs: 1 (f=1): [R(1)][81.7%][r=268MiB/s][r=68.6k IOPS][eta 00m:11s]
Jobs: 1 (f=1): [R(1)][88.1%][r=133MiB/s][r=33.0k IOPS][eta 00m:07s]
Jobs: 1 (f=1): [R(1)][91.7%][r=205MiB/s][r=52.6k IOPS][eta 00m:05s]
Jobs: 1 (f=1): [R(1)][96.7%][r=131MiB/s][r=33.6k IOPS][eta 00m:02s]
Jobs: 1 (f=1): [R(1)][100.0%][r=130MiB/s][r=33.2k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1780: Wed Jan 25 15:20:39 2023
read: IOPS=51.4k, BW=201MiB/s (211MB/s)(11.8GiB/60018msec)
bw ( KiB/s): min=46488, max=456656, per=100.00%, avg=205791.50, stdev=123388.27, samples=120
iops : min=11622, max=114164, avg=51447.89, stdev=30847.12, samples=120
cpu : usr=8.91%, sys=44.86%, ctx=49651, majf=0, minf=139
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=3087230,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: bw=201MiB/s (211MB/s), 201MiB/s-201MiB/s (211MB/s-211MB/s), io=11.8GiB (12.6GB), run=60018-60018msec
Disk stats (read/write):
dm-0: ios=3085930/117, merge=0/0, ticks=4142716/304, in_queue=4143020, util=99.91%, aggrios=2956147/91, aggrmerge=131083/26, aggrticks=2482465/176, aggrin_queue=1351104, aggrutil=99.88%
sda: ios=2956147/91, merge=131083/26, ticks=2482465/176, in_queue=1351104, util=99.88%
#intel and #vmware