Summary
Testnet MNs/SNs sometimes go unresponsive (similar to issue #289 (closed) and #302 (closed)) while processing mnbudget stuff.
Steps to reproduce
Run a bunch of testnet nodes, from time to time one is bound to end up in this apparent deadlocked state.
Expected behavior
Nodes should run without problems.
Problematic behavior
Testnet MNs/SNs sometimes go unresponsive (similar to issue #289 (closed) and #302 (closed)). It may be happening in mainnet too but with a much larger node population operated by many more users the chances of it being recognised are much smaller.
Crown-core environment info
Ubuntu 16.04
Crown-core application info
Crown version v0.13.9.0-d5fd2fe3 (2019-06-21 15:57:45 +0300)
Relevant logs, dumps and/or screenshots
Node is running fine, starts a budget sync
2020-01-06 03:02:28 CheckProofOfStake : OutPoint: 0f92351c6b2c84efcc3e4002a4bb0650603e06132e306d56af6eb139db5504c8:1 Modifier=7c811346bbc0f5a9370c9a09b84b4fad4265eea78756bd7918e4f8122ed4ae8b timeblockfrom=1578239758 time=1578279723 amount=10000
2020-01-06 03:02:28 UpdateTip: new best=85d48b78a0c3d18c628493ccdf6b974f621d25687aea76c90b11e8b2d2dbd313 height=339258 log2_work=62.365663 tx=552539 date=2020-01-06 03:02:03 progress=1.000000 cache=0.1MiB(109tx)
2020-01-06 03:02:28 CBudgetManager::NewBlock - incremental sync started
2020-01-06 03:02:28 CBudgetManager::Sync - sent 2 items
2020-01-06 03:02:28 CBudgetManager::Sync - sent 0 items
and hits a problem
2020-01-06 03:02:28 CBudgetManager::Sync - sent 0 items
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - mapBudgetDrafts cleanup - size: 342
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - pbudgetDraft->IsValid - strError: Older than current blockHeight
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - pbudgetDraft->IsValid - strError: Cannot find masternode : CTxIn(COutPoint(f5343bb0dd09ef628032119b403dd6879724574d7f6c42cc1e1922ec998706a0, 0), scriptSig=)
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - pbudgetDraft->IsValid - strError: Older than current blockHeight
then starts getting connection refused messages (possibly just noise)
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - pbudgetDraft->IsValid - strError: Older than current blockHeight
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - mapProposals cleanup - size: 5
2020-01-06 03:02:55 connect() to 196.53.105.236:19340 failed after select(): Connection refused (111)
2020-01-06 03:03:01 connect() to 206.189.184.90:19340 failed after select(): Connection refused (111)
then starts doing nothing but relay MN/SN pings
2020-01-06 03:05:21 CActiveSystemnode::SendSystemnodePing() - Relay Systemnode Ping vin = CTxIn(COutPoint(7982077278f9448773441c5d3550a82d8bb5dc97c6e02a67c4fce831f0acaf74, 0), scriptSig=)
2020-01-06 03:05:34 connect() to 217.61.109.5:19340 failed after select(): Connection refused (111)
for 20 minutes and then timesout remaining connections
2020-01-06 03:20:21 CActiveSystemnode::SendSystemnodePing() - Relay Systemnode Ping vin = CTxIn(COutPoint(7982077278f9448773441c5d3550a82d8bb5dc97c6e02a67c4fce831f0acaf74, 0), scriptSig=)
2020-01-06 03:22:29 socket sending timeout: 1201s
2020-01-06 03:22:29 socket sending timeout: 1201s
2020-01-06 03:22:29 socket sending timeout: 1201s
2020-01-06 03:22:29 socket sending timeout: 1201s
2020-01-06 03:22:29 socket sending timeout: 1201s
2020-01-06 03:25:21 CActiveSystemnode::SendSystemnodePing() - Relay Systemnode Ping vin = CTxIn(COutPoint(7982077278f9448773441c5d3550a82d8bb5dc97c6e02a67c4fce831f0acaf74, 0), scriptSig=)
and does NOTHING but relay pings every 5 minutes thereafter. Doesn't respond to crown-cli RPC requests. Only option is to kill the daemon and restart.
Before killing the most recent occurrence I attached the debugger and confirmed it's deadlocked.
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7f2c886ce740 (LWP 18833) "crownd" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
2 Thread 0x7f2c86f37700 (LWP 18834) "crownd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
3 Thread 0x7f2c86736700 (LWP 18835) "crownd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
4 Thread 0x7f2c85f35700 (LWP 18836) "crownd" 0x00007f2c87317a13 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
5 Thread 0x7f2c85734700 (LWP 18837) "crownd" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
6 Thread 0x7f2c84f33700 (LWP 18838) "crownd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
7 Thread 0x7f2c83bc5700 (LWP 18839) "crownd" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
8 Thread 0x7f2c79706700 (LWP 18842) "crown-legacysig" pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
9 Thread 0x7f2c73fff700 (LWP 18844) "crown-net" 0x00007f2c8730d5d3 in select () at ../sysdeps/unix/syscall-template.S:84
10 Thread 0x7f2c737fe700 (LWP 18845) "crown-addcon" pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
---Type <return> to continue, or q <return> to quit---
11 Thread 0x7f2c72ffd700 (LWP 18846) "crown-opencon" pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
12 Thread 0x7f2c727fc700 (LWP 18847) "crown-msghand" __lll_lock_wait ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
13 Thread 0x7f2c71ffb700 (LWP 18848) "crown-dumpaddr" pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
14 Thread 0x7f2c717fa700 (LWP 18849) "crown-miner" __lll_lock_wait ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
15 Thread 0x7f2c70ff9700 (LWP 18850) "crown-wallet" pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
(gdb) thread 12
[Switching to thread 12 (Thread 0x7f2c727fc700 (LWP 18847))]
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f2c875e3e42 in __GI___pthread_mutex_lock (mutex=0xedb620 <cs_main>)
at ../nptl/pthread_mutex_lock.c:115
#2 0x0000000000454845 in CMutexLock<AnnotatedMixin<boost::recursive_mutex> >::CMutexLock(AnnotatedMixin<boost::recursive_mutex>&, char const*, char const*, int, bool) ()
#3 0x000000000049d039 in GetTransaction(uint256 const&, CTransaction&, uint256&, bool) ()
#4 0x000000000071b4e0 in IsBudgetCollateralValid(uint256, uint256, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, long&, int&) ()
#5 0x000000000071bde2 in CBudgetProposal::IsValid(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, bool) const ()
#6 0x0000000000722509 in CBudgetManager::CheckAndRemove() ()
#7 0x000000000072283e in CBudgetManager::NewBlock() ()
#8 0x00000000004b4790 in ProcessNewBlock(CValidationState&, CNode*, CBlock*, CDiskBlockPos*) ()
#9 0x00000000004bbb01 in ProcessMessage(CNode*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, CDataStream&, long) ()
#10 0x00000000004bedbf in ProcessMessages(CNode*) ()
#11 0x000000000050d86c in ThreadMessageHandler() ()
#12 0x000000000051b408 in void TraceThread<void (*)()>(char const*, void (*)())
()
#13 0x0000000000809bf2 in thread_proxy ()
#14 0x00007f2c875e16ba in start_thread (arg=0x7f2c727fc700)
at pthread_create.c:333
#15 0x00007f2c8731741d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) thread 14
[Switching to thread 14 (Thread 0x7f2c717fa700 (LWP 18849))]
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 in ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) bt
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f2c875e3e42 in __GI___pthread_mutex_lock (
mutex=0xf31c10 <budget+48>) at ../nptl/pthread_mutex_lock.c:115
#2 0x0000000000454845 in CMutexLock<AnnotatedMixin<boost::recursive_mutex> >::CMutexLock(AnnotatedMixin<boost::recursive_mutex>&, char const*, char const*, int, bool) ()
#3 0x0000000000714dee in CBudgetManager::IsBudgetPaymentBlock(int) const ()
#4 0x0000000000731304 in IsBlockPayeeValid(long const&, CTransaction const&, int, unsigned int const&, unsigned int const&) ()
#5 0x000000000048ec09 in CheckBlock(CBlock const&, CValidationState&, bool, bool) ()
#6 0x00000000004b43da in ProcessNewBlock(CValidationState&, CNode*, CBlock*, CDiskBlockPos*) ()
#7 0x00000000004f57e7 in ProcessBlockFound(CBlock*, CWallet&, CReserveKey&) ()
#8 0x00000000004f8e96 in BitcoinMiner(CWallet*, bool) ()
#9 0x0000000000505057 in ThreadStakeMiner() ()
#10 0x000000000051b408 in void TraceThread<void (*)()>(char const*, void (*)())
()
#11 0x0000000000809bf2 in thread_proxy ()
#12 0x00007f2c875e16ba in start_thread (arg=0x7f2c717fa700)
at pthread_create.c:333
#13 0x00007f2c8731741d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Drop full logs & dumps here: https://nextcloud.crownplatform.com/index.php/s/Q6H8enXNmJsQYCD
Possible fixes
(Any comments on what you think might be responsible for the problem, if you have particular insight.)
/cc @artem