NVMe over TCP

I’ve wanted to test out NVMe over TCP as a datastore from an ESXi host for some time. You may say, “But Kenyon, the Linux NVMe implementation doesn’t support fused commands.” And you’d be right. At one point I went through the configuration of an NVMe target via the nvmet-cli tools, and while that works for Linux hosts, the kernel target modules don’t support the fused commands that ESXi requires for file locking: the ESXi host issues an atomic test-and-set (the NVMe compare-and-write pair) as a single fused operation. I’ve read the Linux kernel mailing list threads about this, and the developers don’t want to add support for a number of reasons. Well, I received two Intel Optane 905P PCIe devices from the vExpert hardware testing program and thought I would give it a shot on my own equipment. After some digging I ran into SPDK (https://github.com/spdk/spdk). This is a user-space development kit for all kinds of things, but it includes an NVMe-oF target that supports TCP and RDMA with fused command support. Super cool.
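
As an aside, if you want to confirm that a target really advertises fused compare-and-write support, one way (assuming a spare Linux box with nvme-cli, and using the address and NQN configured later in this post as placeholders) is to connect to it and check the FUSES field in the controller identify data:

# Connect a Linux initiator to the target (address, port, and NQN are placeholders from my setup)
sudo nvme connect -t tcp -a 192.168.10.163 -s 8009 -n nqn.2016-06.io.spdk:cnode1

# Bit 0 of "fuses" set means the Compare and Write fused pair is supported (device name will vary)
sudo nvme id-ctrl /dev/nvme1 | grep -i fuses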

Configuring the target is straightforward. From the SPDK docs, the install goes like this:

git clone https://github.com/spdk/spdk --recursive
cd spdk
sudo scripts/pkgdep.sh --all
./configure --with-rdma
make

This completed with no issues on my Ubuntu VM. After some more doc reading, these are the configuration steps I ended up with:

sudo scripts/setup.sh 

This will output the NVMe devices as they are unbound from the kernel driver:

0000:0b:00.0 (8086 2700): nvme -> uio_pci_generic
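
Two setup.sh behaviors worth knowing about from the SPDK docs: the script also reserves hugepages for SPDK (the HUGEMEM variable, in MB, controls how much), and a reset mode hands the devices back to the kernel NVMe driver. The 4096 below is just an example value:

# Unbind the NVMe devices and reserve ~4 GB of hugepages (HUGEMEM is in MB)
sudo HUGEMEM=4096 scripts/setup.sh

# Return the devices to the kernel nvme driver when finished
sudo scripts/setup.sh reset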

The PCI ID is what we will use to attach the device to the target later.

build/bin/nvmf_tgt -m [1,2,3] &

scripts/rpc.py nvmf_create_transport -t TCP -u 16384 -m 8 -c 8192
scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -a 0000:0b:00.0 -t pcie

scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000000000001 -d SPDK_Controller1

scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 nvme0n1 -n 2 -u 483099c6-ac37-4bca-bef3-679a5aff2a6c

scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t tcp -a 192.168.10.163 -s 8009

The [1,2,3] on the nvmf_tgt command specifies which CPU cores to use. I had 4 in my VM, so I left 1 for the OS and used the other 3 for nvmf_tgt. The rest comes straight from the documentation, except the -u when adding the namespace. If you do not specify a UUID for the namespace, ESXi will not recognize it, and vmkwarning.log will fill with entries complaining that the namespace ID is not supported. The listener IP address should be the IP of the VM you are using as the target.
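
uuidgen is an easy way to generate the UUID, and once the RPCs above have run you can sanity-check the configuration over the same RPC interface. These are standard SPDK RPC calls, shown here without their output:

# Generate a UUID to pass with -u when adding the namespace
uuidgen

# Verify the attached bdev and the subsystem, namespace, and listener configuration
scripts/rpc.py bdev_get_bdevs
scripts/rpc.py nvmf_get_subsystems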

Once this is complete you can follow the ESXi documentation to configure the NVMe over TCP adapter and discover the controller.
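
For reference, the rough shape of the ESXi side from the CLI looks something like the following. I’m going from memory here, so the vmnic/vmhba names are placeholders and the exact flags may differ between ESXi builds; the vSphere client accomplishes the same thing:

# Create a software NVMe over TCP adapter bound to a VMkernel-backed NIC (vmnic1 is a placeholder)
esxcli nvme fabrics enable --protocol TCP --device vmnic1

# Discover the SPDK target through the new adapter (vmhba65 is a placeholder)
esxcli nvme fabrics discover --adapter vmhba65 --ip-address 192.168.10.163 --port-number 8009

# Connect to the subsystem that discovery returned
esxcli nvme fabrics connect --adapter vmhba65 --ip-address 192.168.10.163 --port-number 8009 --subsystem-nqn nqn.2016-06.io.spdk:cnode1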

Some very quick and dirty fio read tests show pretty good throughput performance:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=./file -size=5G --bs=16M --iodepth=32  --readwrite=read --time_based --runtime=60 --eta-newline 2
test: (g=0): rw=read, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [R(1)][6.7%][r=2306MiB/s][r=144 IOPS][eta 00m:56s]
Jobs: 1 (f=1): [R(1)][11.7%][r=2162MiB/s][r=135 IOPS][eta 00m:53s]
Jobs: 1 (f=1): [R(1)][18.0%][r=2224MiB/s][r=139 IOPS][eta 00m:50s]
Jobs: 1 (f=1): [R(1)][23.3%][r=2208MiB/s][r=138 IOPS][eta 00m:46s]
Jobs: 1 (f=1): [R(1)][26.7%][r=2162MiB/s][r=135 IOPS][eta 00m:44s]
Jobs: 1 (f=1): [R(1)][31.7%][r=2194MiB/s][r=137 IOPS][eta 00m:41s]
Jobs: 1 (f=1): [R(1)][36.7%][r=2130MiB/s][r=133 IOPS][eta 00m:38s]
Jobs: 1 (f=1): [R(1)][41.7%][r=2082MiB/s][r=130 IOPS][eta 00m:35s]
Jobs: 1 (f=1): [R(1)][46.7%][r=2194MiB/s][r=137 IOPS][eta 00m:32s]
Jobs: 1 (f=1): [R(1)][52.5%][r=2192MiB/s][r=137 IOPS][eta 00m:28s]
Jobs: 1 (f=1): [R(1)][56.7%][r=2240MiB/s][r=140 IOPS][eta 00m:26s]
Jobs: 1 (f=1): [R(1)][62.7%][r=2208MiB/s][r=138 IOPS][eta 00m:22s]
Jobs: 1 (f=1): [R(1)][66.7%][r=2224MiB/s][r=139 IOPS][eta 00m:20s]
Jobs: 1 (f=1): [R(1)][71.7%][r=2144MiB/s][r=134 IOPS][eta 00m:17s]
Jobs: 1 (f=1): [R(1)][76.7%][r=1938MiB/s][r=121 IOPS][eta 00m:14s]
Jobs: 1 (f=1): [R(1)][81.7%][r=2050MiB/s][r=128 IOPS][eta 00m:11s]
Jobs: 1 (f=1): [R(1)][88.1%][r=2144MiB/s][r=134 IOPS][eta 00m:07s]
Jobs: 1 (f=1): [R(1)][91.7%][r=2144MiB/s][r=134 IOPS][eta 00m:05s]
Jobs: 1 (f=1): [R(1)][96.7%][r=2098MiB/s][r=131 IOPS][eta 00m:02s]
Jobs: 1 (f=1): [R(1)][100.0%][r=2194MiB/s][r=137 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1379: Wed Jan 25 15:06:23 2023
  read: IOPS=135, BW=2163MiB/s (2268MB/s)(128GiB/60573msec)
   bw (  MiB/s): min= 1376, max= 2496, per=99.74%, avg=2157.67, stdev=127.88, samples=121
   iops        : min=   86, max=  156, avg=134.84, stdev= 7.98, samples=121
  cpu          : usr=0.22%, sys=7.98%, ctx=17462, majf=0, minf=131081
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.6%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=8190,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=2163MiB/s (2268MB/s), 2163MiB/s-2163MiB/s (2268MB/s-2268MB/s), io=128GiB (137GB), run=60573-60573msec

Disk stats (read/write):
    dm-0: ios=8820/107, merge=0/0, ticks=1293016/4652, in_queue=1297668, util=89.11%, aggrios=107109/76, aggrmerge=0/31, aggrticks=15069665/2383, aggrin_queue=14857744, aggrutil=99.78%
  sda: ios=107109/76, merge=0/31, ticks=15069665/2383, in_queue=14857744, util=99.78%

Small-block IOPS performance is not so good. This could be due to any number of things. I need to rebuild this test with some larger VMs and a multi-job random read (sketched after these results) and see what numbers come out:

 fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=./file -size=5G --bs=4k --iodepth=128  --readwrite=read --time_based --runtime=60 --eta-newline 2
test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [R(1)][6.7%][r=104MiB/s][r=26.7k IOPS][eta 00m:56s]
Jobs: 1 (f=1): [R(1)][11.7%][r=274MiB/s][r=70.1k IOPS][eta 00m:53s]
Jobs: 1 (f=1): [R(1)][16.7%][r=106MiB/s][r=27.2k IOPS][eta 00m:50s]
Jobs: 1 (f=1): [R(1)][22.0%][r=280MiB/s][r=71.7k IOPS][eta 00m:46s]
Jobs: 1 (f=1): [R(1)][26.7%][r=120MiB/s][r=30.7k IOPS][eta 00m:44s]
Jobs: 1 (f=1): [R(1)][31.7%][r=271MiB/s][r=69.4k IOPS][eta 00m:41s]
Jobs: 1 (f=1): [R(1)][36.7%][r=122MiB/s][r=31.2k IOPS][eta 00m:38s]
Jobs: 1 (f=1): [R(1)][41.7%][r=286MiB/s][r=73.1k IOPS][eta 00m:35s]
Jobs: 1 (f=1): [R(1)][46.7%][r=128MiB/s][r=32.8k IOPS][eta 00m:32s]
Jobs: 1 (f=1): [R(1)][52.5%][r=288MiB/s][r=73.7k IOPS][eta 00m:28s]
Jobs: 1 (f=1): [R(1)][56.7%][r=136MiB/s][r=34.8k IOPS][eta 00m:26s]
Jobs: 1 (f=1): [R(1)][62.7%][r=247MiB/s][r=63.2k IOPS][eta 00m:22s]
Jobs: 1 (f=1): [R(1)][66.7%][r=144MiB/s][r=36.8k IOPS][eta 00m:20s]
Jobs: 1 (f=1): [R(1)][71.7%][r=292MiB/s][r=74.9k IOPS][eta 00m:17s]
Jobs: 1 (f=1): [R(1)][76.7%][r=144MiB/s][r=36.9k IOPS][eta 00m:14s]
Jobs: 1 (f=1): [R(1)][81.7%][r=268MiB/s][r=68.6k IOPS][eta 00m:11s]
Jobs: 1 (f=1): [R(1)][88.1%][r=133MiB/s][r=33.0k IOPS][eta 00m:07s]
Jobs: 1 (f=1): [R(1)][91.7%][r=205MiB/s][r=52.6k IOPS][eta 00m:05s]
Jobs: 1 (f=1): [R(1)][96.7%][r=131MiB/s][r=33.6k IOPS][eta 00m:02s]
Jobs: 1 (f=1): [R(1)][100.0%][r=130MiB/s][r=33.2k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1780: Wed Jan 25 15:20:39 2023
  read: IOPS=51.4k, BW=201MiB/s (211MB/s)(11.8GiB/60018msec)
   bw (  KiB/s): min=46488, max=456656, per=100.00%, avg=205791.50, stdev=123388.27, samples=120
   iops        : min=11622, max=114164, avg=51447.89, stdev=30847.12, samples=120
  cpu          : usr=8.91%, sys=44.86%, ctx=49651, majf=0, minf=139
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=3087230,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=201MiB/s (211MB/s), 201MiB/s-201MiB/s (211MB/s-211MB/s), io=11.8GiB (12.6GB), run=60018-60018msec

Disk stats (read/write):
    dm-0: ios=3085930/117, merge=0/0, ticks=4142716/304, in_queue=4143020, util=99.91%, aggrios=2956147/91, aggrmerge=131083/26, aggrticks=2482465/176, aggrin_queue=1351104, aggrutil=99.88%
  sda: ios=2956147/91, merge=131083/26, ticks=2482465/176, in_queue=1351104, util=99.88%
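
For the rebuild mentioned above, the next run I have in mind is a small-block random read spread across several jobs, so a single-queue bottleneck is easier to rule out. Nothing below has been run yet; it is just the fio invocation I plan to use:

# Untested follow-up: 4k random reads across 4 jobs to push the effective queue depth higher
fio --name=randread --ioengine=libaio --direct=1 --gtod_reduce=1 --filename=./file --size=5G --bs=4k --iodepth=32 --numjobs=4 --group_reporting --readwrite=randread --time_based --runtime=60 --eta-newline=2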
