debcrafters-packages team mailing list archive

[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation

 

I got a practically identical client hang and stack trace to the one in comment #35:

INFO: task apache2:2566 blocked for more than 122 seconds.
      Not tainted 6.14.0-33-generic #33~24.04.1-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:apache2         state:D stack:0     pid:2566  tgid:2566  ppid:2271   task_flags:0x400140 flags:0x00004002
Call Trace:
 <TASK>
 __schedule+0x2cf/0x640
 schedule+0x29/0xd0
 io_schedule+0x4c/0x80
 folio_wait_bit_common+0x138/0x310
 ? __pfx_wake_page_function+0x10/0x10
 folio_wait_private_2+0x2c/0x60
 nfs_invalidate_folio+0x84/0x110 [nfs]
 truncate_cleanup_folio+0xaa/0xd0
 truncate_inode_pages_range+0x140/0x560
 ? __call_rcu_nocb_wake+0x17d/0x270
 truncate_pagecache+0x48/0x70
 nfs_setattr_update_inode+0x30e/0x3d0 [nfs]
 nfs3_proc_setattr+0x108/0x150 [nfsv3]
 nfs_setattr+0x197/0x380 [nfs]
 notify_change+0x2fa/0x4f0
 do_truncate+0x98/0xf0
 ? do_truncate+0x98/0xf0
 do_open+0x2f0/0x430
 path_openat+0x134/0x2d0
 do_filp_open+0xd4/0x1a0
 do_sys_openat2+0xb3/0xe0
 ? post_alloc_hook+0xc9/0x140
 __x64_sys_openat+0x55/0xa0
 x64_sys_call+0x1c49/0x2650
 do_syscall_64+0x7e/0x170
 ? __alloc_frozen_pages_noprof+0x164/0x330
 ? try_charge_memcg+0x8e/0x5a0
 ? __mod_memcg_lruvec_state+0xf4/0x250
 ? __lruvec_stat_mod_folio+0x8b/0xf0
 ? set_ptes.isra.0+0x3b/0x90
 ? do_anonymous_page+0x132/0x470
 ? handle_pte_fault+0x1e1/0x200
 ? __handle_mm_fault+0x62c/0x770
 ? __count_memcg_events+0xd3/0x1a0
 ? count_memcg_events.constprop.0+0x2a/0x50
 ? handle_mm_fault+0x1df/0x2d0
 ? do_user_addr_fault+0x5d5/0x870
 ? arch_exit_to_user_mode_prepare.isra.0+0x22/0xd0
 ? irqentry_exit_to_user_mode+0x2d/0x1d0
 ? irqentry_exit+0x43/0x50
 ? exc_page_fault+0x96/0x1e0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7be9b631b215
RSP: 002b:00007ffd47a5a600 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000241 RCX: 00007be9b631b215
RDX: 0000000000000241 RSI: 00007ffd47a5a6b0 RDI: 00000000ffffff9c
RBP: 00007ffd47a5a670 R08: 0000000000000000 R09: 000000000000006e
R10: 00000000000001b6 R11: 0000000000000293 R12: 00007ffd47a5a6b0
R13: 00007ffd47a5a6b0 R14: 00007ffd47a5b7c5 R15: 0000000000000000
 </TASK>
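
To make it easier to compare a trace like the one above with others in this
bug, here is a minimal sketch that pulls out only the frames not prefixed
with "?" (the kernel marks frames it is unsure about that way). The names
FRAME_RE and reliable_frames are made up for this illustration and are not
part of any existing tool; the pattern only covers the frame format seen in
this report.

```python
import re

# Matches one call-trace frame, e.g.
#   " nfs_invalidate_folio+0x84/0x110 [nfs]"
#   "[11183.290629]  nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]"
FRAME_RE = re.compile(
    r"^\s*(?:\[\s*\d+\.\d+\]\s*)?"       # optional dmesg timestamp
    r"(?P<unreliable>\?\s+)?"             # "?" marks a frame the unwinder is unsure about
    r"(?P<func>[A-Za-z_][\w.]*)"          # function name, may contain dots (e.g. foo.isra.0)
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"          # offset/size
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"    # optional module name in brackets
)


def reliable_frames(trace_text):
    """Return (function, module) pairs for frames without a leading '?'."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line)
        if match and not match.group("unreliable"):
            frames.append((match.group("func"), match.group("module")))
    return frames


# A few lines copied from the trace above.
sample = """\
 io_schedule+0x4c/0x80
 folio_wait_bit_common+0x138/0x310
 ? __pfx_wake_page_function+0x10/0x10
 folio_wait_private_2+0x2c/0x60
 nfs_invalidate_folio+0x84/0x110 [nfs]
"""

for func, module in reliable_frames(sample):
    print(func, f"[{module}]" if module else "")
```

Running the same function over two traces and comparing the resulting lists
is a quick way to check whether they block in the same place.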

This happened while an Apache / mod_php process tried to access a file on
an NFS mount. The mount used the following flags:

nofail,nfsvers=3,fsc,tcp,intr,soft,retrans=3,timeo=10,retry=1,ac,lookupcache=positive,acregmin=1,acdirmin=1,noexec,nosuid,noatime

but the hang still lasted for over 10 hours, until the client was
hard-rebooted with "echo b > /proc/sysrq-trigger".
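
For anyone decoding the option string above: a minimal sketch that splits it
into individual options and converts timeo to seconds. The only outside fact
used is that nfs(5) documents timeo in tenths of a second; the helper name
parse_mount_options is made up for this illustration. It does not try to
model the client's actual retry behaviour.

```python
def parse_mount_options(option_string):
    """Split a comma-separated mount option string into a dict.

    Bare flags such as 'soft' map to True; 'key=value' pairs keep the
    value as a string.
    """
    options = {}
    for item in option_string.split(","):
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


# The option string quoted in this report.
opts = parse_mount_options(
    "nofail,nfsvers=3,fsc,tcp,intr,soft,retrans=3,timeo=10,retry=1,ac,"
    "lookupcache=positive,acregmin=1,acdirmin=1,noexec,nosuid,noatime"
)

# nfs(5): timeo is given in deciseconds (tenths of a second).
timeo_seconds = int(opts["timeo"]) / 10

print("soft mount:", opts.get("soft", False))
print("fscache (fsc):", opts.get("fsc", False))
print("timeo:", timeo_seconds, "s, retrans:", int(opts["retrans"]))
```

With these values the per-request timeout is measured in seconds, which is
why a hang lasting over 10 hours on a soft mount is surprising.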

The mount went over a redundant 10 Gbps fiber link, and it is still
unclear how much traffic was flowing when the hang started.

The system was using the following kernel packages at the time of the
hang:

linux-lowlatency-hwe-24.04
linux-image-generic-hwe-24.04
linux-image-6.14.0-33-generic
linux-modules-6.14.0-33-generic
linux-modules-extra-6.14.0-33-generic

and the exact version of all of the above was 6.14.0-33.33~24.04.1.

Other clients with an identical connection to the same NFS server seemed
to work fine at the same time, so this is probably some kind of race
condition in the kernel. Judging by this discussion, I'd guess that
kernels newer than the 5.x series all have this as-yet-unknown failure
mode. The system previously ran 5.15.x kernels for a long time without
any issues.

-- 
You received this bug notification because you are a member of
Debcrafters packages, which is subscribed to nfs-utils in Ubuntu.
https://bugs.launchpad.net/bugs/2062568

Title:
  nfsd gets unresponsive after some hours of operation

Status in linux package in Ubuntu:
  In Progress
Status in nfs-utils package in Ubuntu:
  Incomplete
Status in linux source package in Noble:
  In Progress
Status in nfs-utils source package in Noble:
  Incomplete

Bug description:
  I installed the 24.04 Beta on two test machines that were running
  22.04 without issues before. One of them exports two volumes that are
  mounted by the other machine, which primarily uses them as a secondary
  storage for ccache.

  After being up for a couple of hours (happened twice since yesterday
  evening) it seems that nfsd on the machine exporting the volumes hangs
  on something.

  From dmesg on the server (repeated a few times):

  [11183.290548] INFO: task nfsd:1419 blocked for more than 1228 seconds.
  [11183.290558]       Not tainted 6.8.0-22-generic #22-Ubuntu
  [11183.290563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [11183.290582] task:nfsd            state:D stack:0     pid:1419  tgid:1419  ppid:2      flags:0x00004000
  [11183.290587] Call Trace:
  [11183.290602]  <TASK>
  [11183.290606]  __schedule+0x27c/0x6b0
  [11183.290612]  schedule+0x33/0x110
  [11183.290615]  schedule_timeout+0x157/0x170
  [11183.290619]  wait_for_completion+0x88/0x150
  [11183.290623]  __flush_workqueue+0x140/0x3e0
  [11183.290629]  nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
  [11183.290689]  nfsd4_destroy_session+0x186/0x260 [nfsd]
  [11183.290744]  nfsd4_proc_compound+0x3af/0x770 [nfsd]
  [11183.290798]  nfsd_dispatch+0xd4/0x220 [nfsd]
  [11183.290851]  svc_process_common+0x44d/0x710 [sunrpc]
  [11183.290924]  ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
  [11183.290976]  svc_process+0x132/0x1b0 [sunrpc]
  [11183.291041]  svc_handle_xprt+0x4d3/0x5d0 [sunrpc]
  [11183.291105]  svc_recv+0x18b/0x2e0 [sunrpc]
  [11183.291168]  ? __pfx_nfsd+0x10/0x10 [nfsd]
  [11183.291220]  nfsd+0x8b/0xe0 [nfsd]
  [11183.291270]  kthread+0xef/0x120
  [11183.291274]  ? __pfx_kthread+0x10/0x10
  [11183.291276]  ret_from_fork+0x44/0x70
  [11183.291279]  ? __pfx_kthread+0x10/0x10
  [11183.291281]  ret_from_fork_asm+0x1b/0x30
  [11183.291286]  </TASK>

  From dmesg on the client (repeated a number of times):
  [ 6596.911785] RPC: Could not send backchannel reply error: -110
  [ 6596.972490] RPC: Could not send backchannel reply error: -110
  [ 6837.281307] RPC: Could not send backchannel reply error: -110

  ProblemType: Bug
  DistroRelease: Ubuntu 24.04
  Package: nfs-kernel-server 1:2.6.4-3ubuntu5
  ProcVersionSignature: Ubuntu 6.8.0-22.22-generic 6.8.1
  Uname: Linux 6.8.0-22-generic x86_64
  .etc.request-key.d.id_resolver.conf: create	id_resolver	*	*	/usr/sbin/nfsidmap -t 600 %k %d
  ApportVersion: 2.28.1-0ubuntu1
  Architecture: amd64
  CasperMD5CheckResult: pass
  Date: Fri Apr 19 14:10:25 2024
  InstallationDate: Installed on 2024-04-16 (3 days ago)
  InstallationMedia: Ubuntu-Server 24.04 LTS "Noble Numbat" - Beta amd64 (20240410.1)
  NFSMounts:

  NFSv4Mounts:

  ProcEnviron:
   LANG=en_US.UTF-8
   PATH=(custom, no user)
   SHELL=/bin/bash
   TERM=xterm-256color
   XDG_RUNTIME_DIR=<set>
  SourcePackage: nfs-utils
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2062568/+subscriptions