NVMe over TCP

I’ve wanted to test out NVMe over TCP as a datastore from an ESXi host for some time. You may say, “But Kenyon, the Linux NVMe implementation doesn’t support fused commands.” And you’d be right. At one point I went through the configuration of an NVMe target via the nvmet-cli tools. While this works for Linux hosts, the kernel target modules don’t support the fused commands that ESXi requires for file locking; the ESXi host issues its atomic test-and-set as a single fused command. I’ve read the Linux kernel mailing list threads about this, and the developers don’t want to add support for a number of reasons. Well, I received 2 Intel Optane 905P PCIe devices from the vExpert hardware testing program and thought I would give it a shot on my equipment. After some digging I ran into SPDK (https://github.com/spdk/spdk). This is a user-space development kit for all kinds of things, but it includes an NVMe-oF target that supports TCP and RDMA with fused command support. Super cool.
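As an aside, if you want to check whether a given target actually advertises fused command support, the Identify Controller data has a fused operation support field; bit 0 is the fused Compare and Write that VMFS locking relies on. A minimal check from a Linux initiator might look like this (nvme-cli installed and the target already connected as /dev/nvme0 are my assumptions here, not part of the setup below):

# /dev/nvme0 is a placeholder for the connected controller
nvme id-ctrl /dev/nvme0 | grep -i fuses
# fuses : 0x1  -> fused Compare and Write supported
# fuses : 0    -> no fused command support (the kernel nvmet case)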

Configuring the target is straightforward. From the SPDK docs, the install goes like this:

git clone https://github.com/spdk/spdk --recursive
cd spdk
sudo scripts/pkgdep.sh --all
./configure --with-rdma
make

This completed with no issues on my Ubuntu VM. Then, with some more doc reading, I came up with the following configuration steps:

sudo scripts/setup.sh 

This will output the NVMe devices:

0000:0b:00.0 (8086 2700): nvme -> uio_pci_generic

The PCI ID (0000:0b:00.0 above) is what we will use to configure the device later.
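If you want to confirm what setup.sh did, or undo it, the script has a couple of other modes; the HUGEMEM value below is just an example I picked, not something from the SPDK docs:

sudo scripts/setup.sh status          # show hugepage allocation and current driver bindings
sudo scripts/setup.sh reset           # rebind the devices back to the kernel nvme driver
sudo HUGEMEM=4096 scripts/setup.sh    # allocate 4 GiB of hugepages instead of the default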

build/bin/nvmf_tgt -m [1,2,3] &

scripts/rpc.py nvmf_create_transport -t TCP -u 16384 -m 8 -c 8192
scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -a 0000:0b:00.0 -t pcie

scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000000000001 -d SPDK_Controller1

scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 nvme0n1 -n 2 -u 483099c6-ac37-4bca-bef3-679a5aff2a6c

scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t tcp -a 192.168.10.163 -s 8009

The [1,2,3] on the nvmf_tgt command specifies which CPU cores to use. I had 4 in my VM, so I left 1 for the OS and used the other 3 for nvmf_tgt. The rest comes from the documentation, except the -u when adding the namespace. If you do not specify a UUID for the namespace, ESXi will not recognize it, and there will be entries in vmkwarning.log complaining about the namespace ID not being supported. The listener IP address should be the IP of the VM you are using for the target.
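A couple of extra commands I found handy along the way; uuidgen is just one way to produce a value for -u, and the two rpc.py calls only read configuration back, they don't change anything:

uuidgen                               # generate a UUID to pass to -u
scripts/rpc.py bdev_get_bdevs         # confirm nvme0n1 was created from the PCIe device
scripts/rpc.py nvmf_get_subsystems    # confirm the subsystem, namespace and listener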

Once this is completed you can follow the ESXi documentation to configure the NVMe over TCP adapter and discover the controller.
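For reference, the ESXi side boils down to adding a software NVMe over TCP adapter (the vSphere Client can do this) and then discovering and connecting with esxcli. Treat the following as a sketch: vmhba65 is a placeholder adapter name, and the exact flags can vary between ESXi versions:

esxcli nvme adapter list              # find the NVMe over TCP adapter name (vmhba65 here)
esxcli nvme fabrics discover -a vmhba65 -i 192.168.10.163 -p 8009
esxcli nvme fabrics connect -a vmhba65 -i 192.168.10.163 -p 8009 -s nqn.2016-06.io.spdk:cnode1
esxcli nvme namespace list            # the SPDK namespace should show up here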

Some very quick and dirty fio read tests show pretty good throughput performance:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=./file -size=5G --bs=16M --iodepth=32  --readwrite=read --time_based --runtime=60 --eta-newline 2
test: (g=0): rw=read, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [R(1)][6.7%][r=2306MiB/s][r=144 IOPS][eta 00m:56s]
Jobs: 1 (f=1): [R(1)][11.7%][r=2162MiB/s][r=135 IOPS][eta 00m:53s]
Jobs: 1 (f=1): [R(1)][18.0%][r=2224MiB/s][r=139 IOPS][eta 00m:50s]
Jobs: 1 (f=1): [R(1)][23.3%][r=2208MiB/s][r=138 IOPS][eta 00m:46s]
Jobs: 1 (f=1): [R(1)][26.7%][r=2162MiB/s][r=135 IOPS][eta 00m:44s]
Jobs: 1 (f=1): [R(1)][31.7%][r=2194MiB/s][r=137 IOPS][eta 00m:41s]
Jobs: 1 (f=1): [R(1)][36.7%][r=2130MiB/s][r=133 IOPS][eta 00m:38s]
Jobs: 1 (f=1): [R(1)][41.7%][r=2082MiB/s][r=130 IOPS][eta 00m:35s]
Jobs: 1 (f=1): [R(1)][46.7%][r=2194MiB/s][r=137 IOPS][eta 00m:32s]
Jobs: 1 (f=1): [R(1)][52.5%][r=2192MiB/s][r=137 IOPS][eta 00m:28s]
Jobs: 1 (f=1): [R(1)][56.7%][r=2240MiB/s][r=140 IOPS][eta 00m:26s]
Jobs: 1 (f=1): [R(1)][62.7%][r=2208MiB/s][r=138 IOPS][eta 00m:22s]
Jobs: 1 (f=1): [R(1)][66.7%][r=2224MiB/s][r=139 IOPS][eta 00m:20s]
Jobs: 1 (f=1): [R(1)][71.7%][r=2144MiB/s][r=134 IOPS][eta 00m:17s]
Jobs: 1 (f=1): [R(1)][76.7%][r=1938MiB/s][r=121 IOPS][eta 00m:14s]
Jobs: 1 (f=1): [R(1)][81.7%][r=2050MiB/s][r=128 IOPS][eta 00m:11s]
Jobs: 1 (f=1): [R(1)][88.1%][r=2144MiB/s][r=134 IOPS][eta 00m:07s]
Jobs: 1 (f=1): [R(1)][91.7%][r=2144MiB/s][r=134 IOPS][eta 00m:05s]
Jobs: 1 (f=1): [R(1)][96.7%][r=2098MiB/s][r=131 IOPS][eta 00m:02s]
Jobs: 1 (f=1): [R(1)][100.0%][r=2194MiB/s][r=137 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1379: Wed Jan 25 15:06:23 2023
  read: IOPS=135, BW=2163MiB/s (2268MB/s)(128GiB/60573msec)
   bw (  MiB/s): min= 1376, max= 2496, per=99.74%, avg=2157.67, stdev=127.88, samples=121
   iops        : min=   86, max=  156, avg=134.84, stdev= 7.98, samples=121
  cpu          : usr=0.22%, sys=7.98%, ctx=17462, majf=0, minf=131081
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.6%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=8190,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=2163MiB/s (2268MB/s), 2163MiB/s-2163MiB/s (2268MB/s-2268MB/s), io=128GiB (137GB), run=60573-60573msec

Disk stats (read/write):
    dm-0: ios=8820/107, merge=0/0, ticks=1293016/4652, in_queue=1297668, util=89.11%, aggrios=107109/76, aggrmerge=0/31, aggrticks=15069665/2383, aggrin_queue=14857744, aggrutil=99.78%
  sda: ios=107109/76, merge=0/31, ticks=15069665/2383, in_queue=14857744, util=99.78%

IOPS performance is not so good; at a 16M block size the ~135 IOPS above is just the 2163 MiB/s throughput divided by the block size, so the 4k run below is the more telling number. This could be due to any number of things. I need to rebuild this test with some larger VMs and see what numbers come out:

 fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=./file -size=5G --bs=4k --iodepth=128  --readwrite=read --time_based --runtime=60 --eta-newline 2
test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [R(1)][6.7%][r=104MiB/s][r=26.7k IOPS][eta 00m:56s]
Jobs: 1 (f=1): [R(1)][11.7%][r=274MiB/s][r=70.1k IOPS][eta 00m:53s]
Jobs: 1 (f=1): [R(1)][16.7%][r=106MiB/s][r=27.2k IOPS][eta 00m:50s]
Jobs: 1 (f=1): [R(1)][22.0%][r=280MiB/s][r=71.7k IOPS][eta 00m:46s]
Jobs: 1 (f=1): [R(1)][26.7%][r=120MiB/s][r=30.7k IOPS][eta 00m:44s]
Jobs: 1 (f=1): [R(1)][31.7%][r=271MiB/s][r=69.4k IOPS][eta 00m:41s]
Jobs: 1 (f=1): [R(1)][36.7%][r=122MiB/s][r=31.2k IOPS][eta 00m:38s]
Jobs: 1 (f=1): [R(1)][41.7%][r=286MiB/s][r=73.1k IOPS][eta 00m:35s]
Jobs: 1 (f=1): [R(1)][46.7%][r=128MiB/s][r=32.8k IOPS][eta 00m:32s]
Jobs: 1 (f=1): [R(1)][52.5%][r=288MiB/s][r=73.7k IOPS][eta 00m:28s]
Jobs: 1 (f=1): [R(1)][56.7%][r=136MiB/s][r=34.8k IOPS][eta 00m:26s]
Jobs: 1 (f=1): [R(1)][62.7%][r=247MiB/s][r=63.2k IOPS][eta 00m:22s]
Jobs: 1 (f=1): [R(1)][66.7%][r=144MiB/s][r=36.8k IOPS][eta 00m:20s]
Jobs: 1 (f=1): [R(1)][71.7%][r=292MiB/s][r=74.9k IOPS][eta 00m:17s]
Jobs: 1 (f=1): [R(1)][76.7%][r=144MiB/s][r=36.9k IOPS][eta 00m:14s]
Jobs: 1 (f=1): [R(1)][81.7%][r=268MiB/s][r=68.6k IOPS][eta 00m:11s]
Jobs: 1 (f=1): [R(1)][88.1%][r=133MiB/s][r=33.0k IOPS][eta 00m:07s]
Jobs: 1 (f=1): [R(1)][91.7%][r=205MiB/s][r=52.6k IOPS][eta 00m:05s]
Jobs: 1 (f=1): [R(1)][96.7%][r=131MiB/s][r=33.6k IOPS][eta 00m:02s]
Jobs: 1 (f=1): [R(1)][100.0%][r=130MiB/s][r=33.2k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1780: Wed Jan 25 15:20:39 2023
  read: IOPS=51.4k, BW=201MiB/s (211MB/s)(11.8GiB/60018msec)
   bw (  KiB/s): min=46488, max=456656, per=100.00%, avg=205791.50, stdev=123388.27, samples=120
   iops        : min=11622, max=114164, avg=51447.89, stdev=30847.12, samples=120
  cpu          : usr=8.91%, sys=44.86%, ctx=49651, majf=0, minf=139
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=3087230,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=201MiB/s (211MB/s), 201MiB/s-201MiB/s (211MB/s-211MB/s), io=11.8GiB (12.6GB), run=60018-60018msec

Disk stats (read/write):
    dm-0: ios=3085930/117, merge=0/0, ticks=4142716/304, in_queue=4143020, util=99.91%, aggrios=2956147/91, aggrmerge=131083/26, aggrticks=2482465/176, aggrin_queue=1351104, aggrutil=99.88%
  sda: ios=2956147/91, merge=131083/26, ticks=2482465/176, in_queue=1351104, util=99.88%
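When I do rebuild this test, something along these lines is probably a fairer small-block workload; the randread pattern, numjobs and iodepth values here are just a starting point I'm assuming, not parameters from the runs above:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=randread --filename=./file --size=5G --bs=4k --iodepth=32 --numjobs=4 --group_reporting --readwrite=randread --time_based --runtime=60 --eta-newline 2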

#intel and #vmware

2 thoughts on “NVMe over TCP”

  1. Hey Kenyon.
    I found this article after spending some time implementing an alternative configuration utility for the kernel nvme target because I didn’t like the state of `nvmet-cli`.

    One of my goals was to provide storage to ESXi down the line, but seeing your comments regarding it makes me upset that I may have wasted my time.

    Can you link me something that details ESXi’s fused command (for ATS?) requirement?
    Is it only for shared storage or storage in general? Or plainly put, is it entirely unusable?

    I’d love to see the mailing list threads as well – I have a hard time finding anything.
    The current code does state that it’s not supported, but obviously gives no detailed reasoning.

    On the mailing list, there are patches submitted in September 2023 that try to implement atomic writes in the block layer, so there is hope! 🙂

    Kind regards,
    vifino.

    1. You can see the requirement here: https://kb.vmware.com/s/article/91135. This applies to both shared and local devices. Here is one such mailing list thread: https://lore.kernel.org/all/263971f2-dc9e-fc53-06e9-9c3c80ddb8e3@grimberg.me/. Here is another thread on GitHub: https://github.com/linux-nvme/nvme-cli/issues/318.

      To put it plainly, it was entirely unusable at the time I tried this. The method outlined in the post does work. However, I didn’t take the time to research how to make the storage redundant (some sort of RAID or something).
