- Dec 08, 2021
-
-
Sagi Grimberg authored
nvmet_tcp_handle_req_failure needs to understand weather to prepare for incoming data or the next pdu. However if we misidentify this, we will wait for 0-length data, and queue the response although nvmet_req_init already did that. The particular command was namespace management command with no data, which was incorrectly categorized as a command with incapsule data. Also, add a code comment of what we are trying to do here. Signed-off-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
Keith Busch <kbusch@kernel.org> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- Dec 07, 2021
-
-
Ruozhu Li authored
A crash happens when trying to disconnect a reconnecting ctrl: 1) The network was cut off when the connection was just established, scan work hang there waiting for some IOs complete. Those I/Os were retried because we return BLK_STS_RESOURCE to blk in reconnecting. 2) After a while, I tried to disconnect this connection. This procedure also hangs because it tried to obtain ctrl->scan_lock. It should be noted that now we have switched the controller state to NVME_CTRL_DELETING. 3) In nvme_check_ready(), we always return true when ctrl->state is NVME_CTRL_DELETING, so those retrying I/Os were issued to the bottom device which was already freed. To fix this, when ctrl->state is NVME_CTRL_DELETING, issue cmd to bottom device only when queue state is live. If not, return host path error to the block layer Signed-off-by:
Ruozhu Li <liruozhu@huawei.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Hou Tao authored
Set ana_log_size to 0 when ana_log_buf is freed to make sure nvme_mpath_init_identify will do the right thing when retrying after an earlier failure. Signed-off-by:
Hou Tao <houtao1@huawei.com> Reviewed-by:
Keith Busch <kbusch@kernel.org> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- Dec 06, 2021
-
-
Niklas Cassel authored
The write pointer in NVMe ZNS is invalid for a zone in zone state full. The same also holds true for ZAC/ZBC. The current behavior for NVMe is to simply propagate the wp reported by the drive, even for full zones. Since the wp is invalid for a full zone, the wp reported by the drive may be any value. The way that the sd_zbc driver handles a full zone is to always report the wp as zone start + zone len, regardless of what the drive reported. null_blk also follows this convention. Do the same for NVMe, so that a BLKREPORTZONE ioctl reports the write pointer for a full zone in a consistent way, regardless of the interface of the underlying zoned block device. blkzone report before patch: start: 0x000040000, len 0x040000, cap 0x03e000, wptr 0xfffffffffffbfff8 reset:0 non-seq:0, zcond:14(fu) [type: 2(SEQ_WRITE_REQUIRED)] blkzone report after patch: start: 0x000040000, len 0x040000, cap 0x03e000, wptr 0x040000 reset:0 non-seq:0, zcond:14(fu) [type: 2(SEQ_WRITE_REQUIRED)] Signed-off-by:
Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Keith Busch authored
The only fabrics target that supports metadata handling through the separate integrity buffer is RDMA. It is currently usable only if the size is 8B per block and formatted for protection information. If an rdma target were to export a namespace with a different format (ex: 4k+64B), the driver will not be able to submit valid read/write commands for that namespace. Suppress setting the metadata feature in the namespace so that the gendisk capacity will be set to 0. This will prevent read/write access through the block stack, but will continue to allow ioctl passthrough commands. Cc: Max Gurtovoy <mgurtovoy@nvidia.com> Cc: Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Keith Busch <kbusch@kernel.org> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Keith Busch authored
The driver assigned nvme handle isn't persistent across reboots, so is not enough information to match up where the collisions are occuring. Add the subsys nqn string to the output so that it can more easily be identified later. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215099 Signed-off-by:
Keith Busch <kbusch@kernel.org> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- Nov 25, 2021
-
-
Maurizio Lombardi authored
Submit I/O requests with the IOCB_NOWAIT flag set only if the underlying filesystem supports it. Fixes: 50a909db ("nvmet: use IOCB_NOWAIT for file-ns buffered I/O") Signed-off-by:
Maurizio Lombardi <mlombard@redhat.com> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- Nov 23, 2021
-
-
Klaus Jensen authored
Write Zeroes sets PRACT when block integrity is enabled (as it should), but neglects to also set the reftag which is expected by reads. This causes protection errors on reads. Fix this by setting the reftag for type 1 and 2 (for type 3, reads will not check the reftag). Signed-off-by:
Klaus Jensen <k.jensen@samsung.com> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Maurizio Lombardi authored
Valid fast_io_fail_tmo values are integers >= 0 or -1 (disabled). Prevent userspace from setting arbitrary negative values. Signed-off-by:
Maurizio Lombardi <mlombard@redhat.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Enzo Matsumiya authored
This particular Kioxia device times out and aborts I/O during any load, but it's more easily observable with discards (fstrim). The device gets to a state that is also not possible to use "nvme set-feature" to disable APST. Booting with nvme_core.default_ps_max_latency=0 solves the issue. We had a dozen or so of these devices behaving this same way in customer environments. Signed-off-by:
Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Maurizio Lombardi authored
Release the page frag cache when tearing down the io queues Signed-off-by:
Maurizio Lombardi <mlombard@redhat.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
John Meneghini <jmeneghi@redhat.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Varun Prakash authored
If maxh2cdata < r2t_length then driver will form multiple H2CData PDUs, validate R2T PDU in nvme_tcp_handle_r2t() to reuse nvme_tcp_setup_h2c_data_pdu(). Also set req->state to NVME_TCP_SEND_H2C_PDU in nvme_tcp_setup_h2c_data_pdu(). Signed-off-by:
Varun Prakash <varun@chelsio.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Varun Prakash authored
Current nvmet_try_send_ddgst() code does not check whether all data digest bytes are transmitted, fix this by returning -EAGAIN if all data digest bytes are not transmitted. Fixes: 872d26a3 ("nvmet-tcp: add NVMe over TCP target driver") Signed-off-by:
Varun Prakash <varun@chelsio.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Maurizio Lombardi authored
If a reset controller is executed while the initiator is performing some I/O the driver may leak the memory allocated for the commands' iovec. Make sure that nvmet_tcp_uninit_data_in_cmds() releases all the memory. Signed-off-by:
Maurizio Lombardi <mlombard@redhat.com> Reviewed-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
John Meneghini <jmeneghi@redhat.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Maurizio Lombardi authored
Makes the code easier to read and to debug. Sets the freed pointers to NULL, it will be useful when destroying the queues to understand if the commands' buffers have been released already or not. Signed-off-by:
Maurizio Lombardi <mlombard@redhat.com> Reviewed-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
John Meneghini <jmeneghi@redhat.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Maurizio Lombardi authored
If the initiator executes a reset controller operation while performing I/O, the target kernel will crash because of a race condition between release_queue and io_work; nvmet_tcp_uninit_data_in_cmds() may be executed while io_work is running, calling flush_work() was not sufficient to prevent this because io_work could requeue itself. Fix this bug by using cancel_work_sync() to prevent io_work from requeuing itself and set rcv_state to NVMET_TCP_RECV_ERR to make sure we don't receive any more data from the socket. Signed-off-by:
Maurizio Lombardi <mlombard@redhat.com> Reviewed-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
John Meneghini <jmeneghi@redhat.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- Nov 09, 2021
-
-
Ming Lei authored
NVMe uses one atomic flag to check if quiesce is needed. If quiesce is started, the helper returns immediately. This way is wrong, since we have to wait until quiesce is done. Fixes: e70feb8b ("blk-mq: support concurrent queue quiesce/unquiesce") Reviewed-by:
Keith Busch <kbusch@kernel.org> Signed-off-by:
Ming Lei <ming.lei@redhat.com> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20211109071144.181581-5-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Oct 27, 2021
-
-
Amit Engel authored
Pass the correct length to nvmet_tcp_verify_hdgst, which is the pdu header length. This fixes a wrong behaviour where header digest verification passes although the digest is wrong. Signed-off-by:
Amit Engel <amit.engel@dell.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Len Baker authored
In an effort to avoid open-coded arithmetic in the kernel [1], use the flex_array_size() and struct_size() helpers instead of an open-coded calculation. [1] https://github.com/KSPP/linux/issues/160 Signed-off-by:
Len Baker <len.baker@gmx.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
Register the discovery subsystem as the 'current' discovery subsystem, and add a new discovery log page entry for it. Signed-off-by:
Hannes Reinecke <hare@suse.de> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
Invert the check for discovery subsystem type to allow for additional discovery subsystem types. Signed-off-by:
Hannes Reinecke <hare@suse.de> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Varun Prakash authored
exp_ddgst is of type __le32, &cmd->exp_ddgst + cmd->offset increases &cmd->exp_ddgst by 4 * cmd->offset, fix this by type casting &cmd->exp_ddgst to u8 *. Fixes: 872d26a3 ("nvmet-tcp: add NVMe over TCP target driver") Signed-off-by:
Varun Prakash <varun@chelsio.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Varun Prakash authored
ddgst is of type __le32, &req->ddgst + req->offset increases &req->ddgst by 4 * req->offset, fix this by type casting &req->ddgst to u8 *. Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by:
Varun Prakash <varun@chelsio.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Varun Prakash authored
With commit db5ad6b7 ("nvme-tcp: try to send request in queue_rq context") r2t and response PDU can get processed while send function is executing. Current data digest send code uses req->offset after kernel_sendmsg(), this creates a race condition where req->offset gets reset before it is used in send function. This can happen in two cases - 1. Target sends r2t PDU which resets req->offset. 2. Target send response PDU which completes the req and then req is used for a new command, nvme_tcp_setup_cmd_pdu() resets req->offset. Fix this by storing req->offset in a local variable and using this local variable after kernel_sendmsg(). Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context") Signed-off-by:
Varun Prakash <varun@chelsio.com> Reviewed-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- Oct 26, 2021
-
-
Sagi Grimberg authored
We should not access request members after the last send, even to determine if indeed it was the last data payload send. The reason is that a completion could have arrived and trigger a new execution of the request which overridden these members. This was fixed by commit 825619b0 ("nvme-tcp: fix possible use-after-completion"). Commit e371af03 broke that assumption again to address cases where multiple r2t pdus are sent per request. To fix it, we need to record the request data_sent and data_len and after the payload network send we reference these counters to determine weather we should advance the request iterator. Fixes: e371af03 ("nvme-tcp: fix incorrect h2cdata pdu offset accounting") Reported-by:
Keith Busch <kbusch@kernel.org> Cc: stable@vger.kernel.org # 5.10+ Signed-off-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
Keith Busch <kbusch@kernel.org> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Maurizio Lombardi authored
page_frag_free() won't completely release the memory allocated for the commands, the cache page must be explicitly freed by calling __page_frag_cache_drain(). This bug can be easily reproduced by repeatedly executing the following command on the initiator: $echo 1 > /sys/devices/virtual/nvme-fabrics/ctl/nvme0/reset_controller Signed-off-by:
Maurizio Lombardi <mlombard@redhat.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
John Meneghini <jmeneghi@redhat.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- Oct 25, 2021
-
-
Jens Axboe authored
The second argument was only used by the USB gadget code, yet everyone pays the overhead of passing a zero to be passed into aio, where it ends up being part of the aio res2 value. Now that everybody is passing in zero, kill off the extra argument. Reviewed-by:
Darrick J. Wong <djwong@kernel.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Oct 20, 2021
-
-
Len Baker authored
As noted in the "Deprecated Interfaces, Language Features, Attributes, and Conventions" documentation [1], size calculations (especially multiplication) should not be performed in memory allocator (or similar) function arguments due to the risk of them overflowing. This could lead to values wrapping around and a smaller allocation being made than the caller was expecting. Using those allocations could lead to linear overflows of heap memory and other misbehaviors. In this case this is not actually dynamic size: all the operands involved in the calculation are constant values. However it is better to refactor this anyway, just to keep the open-coded math idiom out of code. So, use the struct_size() helper to do the arithmetic instead of the argument "size + count * size" in the kmalloc() function. This code was detected with the help of Coccinelle and audited and fixed manually. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments Signed-off-by:
Len Baker <len.baker@gmx.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
When reading the partition table on initial scan hits an I/O error the I/O will hang with the scan_mutex held: [<0>] do_read_cache_page+0x49b/0x790 [<0>] read_part_sector+0x39/0xe0 [<0>] read_lba+0xf9/0x1d0 [<0>] efi_partition+0xf1/0x7f0 [<0>] bdev_disk_changed+0x1ee/0x550 [<0>] blkdev_get_whole+0x81/0x90 [<0>] blkdev_get_by_dev+0x128/0x2e0 [<0>] device_add_disk+0x377/0x3c0 [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core] [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core] [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core] [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core] [<0>] nvme_scan_work+0x168/0x310 [nvme_core] [<0>] process_one_work+0x231/0x420 and trying to delete the controller will deadlock as it tries to grab the scan mutex: [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core] [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core] [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core] As we're now properly ordering the namespace list there is no need to hold the scan_mutex in nvme_mpath_clear_ctrl_paths() anymore. And we always need to kick the requeue list as the path will be marked as unusable and I/O will be requeued _without_ a current path. Signed-off-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Keith Busch authored
The host memory doorbell and event buffers need to be initialized on each reset so the driver doesn't observe stale values from the previous instantiation. Signed-off-by:
Keith Busch <kbusch@kernel.org> Tested-by:
John Levon <john.levon@nutanix.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Max Gurtovoy authored
In case that icdoff is not zero or mandatory keyed sgls are not supported by the NVMe/RDMA target, we'll go to error flow but we'll return 0 to the caller. Fix it by returning an appropriate error code. Fixes: c66e2998 ("nvme-rdma: centralize controller setup sequence") Signed-off-by:
Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Luis Chamberlain authored
We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. Since we now can tell for sure when a disk was added, move setting the bit NVME_NSHEAD_DISK_LIVE only when we did add the disk successfully. Nothing to do here as the cleanup is done elsewhere. We take care and use test_and_set_bit() because it is protects against two nvme paths simultaneously calling device_add_disk() on the same namespace head. Signed-off-by:
Luis Chamberlain <mcgrof@kernel.org> Reviewed-by:
Keith Busch <kbusch@kernel.org> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Max Gurtovoy authored
This makes the code more readable. Signed-off-by:
Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Max Gurtovoy authored
This makes the code more readable. Signed-off-by:
Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by:
Keith Busch <kbusch@kernel.org> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
With discovery controllers supporting unique subsystem NQNs the actual subsystem NQN might be different from that one passed in via the connect args. So add a helper to display the resulting subsystem NQN. Signed-off-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
Add a connect option 'discovery' to specify that the connection should be made to a discovery controller, not a normal I/O controller. With discovery controllers supporting unique subsystem NQNs we cannot easily distinguish by the subsystem NQN if this should be a discovery connection, but we need this information to blank out options not supported by discovery controllers. Signed-off-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
With unique discovery controller NQNs we cannot distinguish the subsystem type by the NQN alone, but need to check the subsystem type, too. So expose the subsystem type in a new sysfs attribute 'subsystype'. Signed-off-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
Set the correct 'CNTRLTYPE' field in the identify controller data. Signed-off-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
Add a helper function to determine if a given subsystem is a discovery subsystem. Signed-off-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by:
Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
Hannes Reinecke authored
TPAR8013 allows for unique discovery NQNs, so make the discovery controller NQN configurable by exposing a subsys attribute 'discovery_nqn'. Signed-off-by:
Hannes Reinecke <hare@suse.de> Reviewed-by:
Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by:
Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-