Skip to content

GitLab

  • Menu
Projects Groups Snippets
    • Loading...
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
  • Sign in / Register
  • crown-core crown-core
  • Project information
    • Project information
    • Activity
    • Labels
    • Members
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
    • Locked Files
  • Issues 75
    • Issues 75
    • List
    • Boards
    • Service Desk
    • Milestones
    • Iterations
    • Requirements
  • Merge requests 1
    • Merge requests 1
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
    • Test Cases
  • Deployments
    • Deployments
    • Environments
    • Releases
  • Monitor
    • Monitor
    • Incidents
  • Packages & Registries
    • Packages & Registries
    • Container Registry
  • Analytics
    • Analytics
    • Value stream
    • CI/CD
    • Code review
    • Insights
    • Issue
    • Repository
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
Collapse sidebar
  • Crown
  • crown-corecrown-core
  • Issues
  • #341

Closed
Open
Created Jan 06, 2020 by Mark Brooker@walkjiveflyMaintainer

(testnet) nodes sometimes go unresponsive (similar to issue #289 and #302)

Summary

Testnet MNs/SNs sometimes go unresponsive (similar to issue #289 (closed) and #302 (closed)) while processing mnbudget stuff.

Steps to reproduce

Run a bunch of testnet nodes, from time to time one is bound to end up in this apparent deadlocked state.

Expected behavior

Nodes should run without problems.

Problematic behavior

Testnet MNs/SNs sometimes go unresponsive (similar to issue #289 (closed) and #302 (closed)). It may be happening in mainnet too but with a much larger node population operated by many more users the chances of it being recognised are much smaller.

Crown-core environment info

Ubuntu 16.04

Crown-core application info

Crown version v0.13.9.0-d5fd2fe3 (2019-06-21 15:57:45 +0300)

Relevant logs, dumps and/or screenshots

Node is running fine, starts a budget sync

2020-01-06 03:02:28 CheckProofOfStake : OutPoint: 0f92351c6b2c84efcc3e4002a4bb0650603e06132e306d56af6eb139db5504c8:1 Modifier=7c811346bbc0f5a9370c9a09b84b4fad4265eea78756bd7918e4f8122ed4ae8b timeblockfrom=1578239758 time=1578279723 amount=10000
2020-01-06 03:02:28 UpdateTip: new best=85d48b78a0c3d18c628493ccdf6b974f621d25687aea76c90b11e8b2d2dbd313  height=339258  log2_work=62.365663  tx=552539  date=2020-01-06 03:02:03 progress=1.000000  cache=0.1MiB(109tx)
2020-01-06 03:02:28 CBudgetManager::NewBlock - incremental sync started
2020-01-06 03:02:28 CBudgetManager::Sync - sent 2 items
2020-01-06 03:02:28 CBudgetManager::Sync - sent 0 items

and hits a problem

2020-01-06 03:02:28 CBudgetManager::Sync - sent 0 items
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - mapBudgetDrafts cleanup - size: 342
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - pbudgetDraft->IsValid - strError: Older than current blockHeight
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - pbudgetDraft->IsValid - strError: Cannot find masternode : CTxIn(COutPoint(f5343bb0dd09ef628032119b403dd6879724574d7f6c42cc1e1922ec998706a0, 0), scriptSig=)
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - pbudgetDraft->IsValid - strError: Older than current blockHeight

then starts getting connection refused messages (possibly just noise)

2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - pbudgetDraft->IsValid - strError: Older than current blockHeight
2020-01-06 03:02:28 CBudgetManager::CheckAndRemove - mapProposals cleanup - size: 5
2020-01-06 03:02:55 connect() to 196.53.105.236:19340 failed after select(): Connection refused (111)
2020-01-06 03:03:01 connect() to 206.189.184.90:19340 failed after select(): Connection refused (111)

then starts doing nothing but relay MN/SN pings

2020-01-06 03:05:21 CActiveSystemnode::SendSystemnodePing() - Relay Systemnode Ping vin = CTxIn(COutPoint(7982077278f9448773441c5d3550a82d8bb5dc97c6e02a67c4fce831f0acaf74, 0), scriptSig=)
2020-01-06 03:05:34 connect() to 217.61.109.5:19340 failed after select(): Connection refused (111)

for 20 minutes and then timesout remaining connections

2020-01-06 03:20:21 CActiveSystemnode::SendSystemnodePing() - Relay Systemnode Ping vin = CTxIn(COutPoint(7982077278f9448773441c5d3550a82d8bb5dc97c6e02a67c4fce831f0acaf74, 0), scriptSig=)
2020-01-06 03:22:29 socket sending timeout: 1201s
2020-01-06 03:22:29 socket sending timeout: 1201s
2020-01-06 03:22:29 socket sending timeout: 1201s
2020-01-06 03:22:29 socket sending timeout: 1201s
2020-01-06 03:22:29 socket sending timeout: 1201s
2020-01-06 03:25:21 CActiveSystemnode::SendSystemnodePing() - Relay Systemnode Ping vin = CTxIn(COutPoint(7982077278f9448773441c5d3550a82d8bb5dc97c6e02a67c4fce831f0acaf74, 0), scriptSig=)

and does NOTHING but relay pings every 5 minutes thereafter. Doesn't respond to crown-cli RPC requests. Only option is to kill the daemon and restart.

Before killing the most recent occurrence I attached the debugger and confirmed it's deadlocked.

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f2c886ce740 (LWP 18833) "crownd" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  2    Thread 0x7f2c86f37700 (LWP 18834) "crownd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
  3    Thread 0x7f2c86736700 (LWP 18835) "crownd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
  4    Thread 0x7f2c85f35700 (LWP 18836) "crownd" 0x00007f2c87317a13 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
  5    Thread 0x7f2c85734700 (LWP 18837) "crownd" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  6    Thread 0x7f2c84f33700 (LWP 18838) "crownd" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
  7    Thread 0x7f2c83bc5700 (LWP 18839) "crownd" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  8    Thread 0x7f2c79706700 (LWP 18842) "crown-legacysig" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
  9    Thread 0x7f2c73fff700 (LWP 18844) "crown-net" 0x00007f2c8730d5d3 in select () at ../sysdeps/unix/syscall-template.S:84
  10   Thread 0x7f2c737fe700 (LWP 18845) "crown-addcon" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
---Type <return> to continue, or q <return> to quit---
  11   Thread 0x7f2c72ffd700 (LWP 18846) "crown-opencon" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
  12   Thread 0x7f2c727fc700 (LWP 18847) "crown-msghand" __lll_lock_wait ()
    at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  13   Thread 0x7f2c71ffb700 (LWP 18848) "crown-dumpaddr" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
  14   Thread 0x7f2c717fa700 (LWP 18849) "crown-miner" __lll_lock_wait ()
    at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  15   Thread 0x7f2c70ff9700 (LWP 18850) "crown-wallet" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225

(gdb) thread 12
[Switching to thread 12 (Thread 0x7f2c727fc700 (LWP 18847))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135	../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f2c875e3e42 in __GI___pthread_mutex_lock (mutex=0xedb620 <cs_main>)
    at ../nptl/pthread_mutex_lock.c:115
#2  0x0000000000454845 in CMutexLock<AnnotatedMixin<boost::recursive_mutex> >::CMutexLock(AnnotatedMixin<boost::recursive_mutex>&, char const*, char const*, int, bool) ()
#3  0x000000000049d039 in GetTransaction(uint256 const&, CTransaction&, uint256&, bool) ()
#4  0x000000000071b4e0 in IsBudgetCollateralValid(uint256, uint256, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, long&, int&) ()
#5  0x000000000071bde2 in CBudgetProposal::IsValid(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, bool) const ()
#6  0x0000000000722509 in CBudgetManager::CheckAndRemove() ()
#7  0x000000000072283e in CBudgetManager::NewBlock() ()
#8  0x00000000004b4790 in ProcessNewBlock(CValidationState&, CNode*, CBlock*, CDiskBlockPos*) ()
#9  0x00000000004bbb01 in ProcessMessage(CNode*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, CDataStream&, long) ()
#10 0x00000000004bedbf in ProcessMessages(CNode*) ()
#11 0x000000000050d86c in ThreadMessageHandler() ()
#12 0x000000000051b408 in void TraceThread<void (*)()>(char const*, void (*)())
    ()
#13 0x0000000000809bf2 in thread_proxy ()
#14 0x00007f2c875e16ba in start_thread (arg=0x7f2c727fc700)
    at pthread_create.c:333
#15 0x00007f2c8731741d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

(gdb) thread 14
[Switching to thread 14 (Thread 0x7f2c717fa700 (LWP 18849))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135	in ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f2c875e3e42 in __GI___pthread_mutex_lock (
    mutex=0xf31c10 <budget+48>) at ../nptl/pthread_mutex_lock.c:115
#2  0x0000000000454845 in CMutexLock<AnnotatedMixin<boost::recursive_mutex> >::CMutexLock(AnnotatedMixin<boost::recursive_mutex>&, char const*, char const*, int, bool) ()
#3  0x0000000000714dee in CBudgetManager::IsBudgetPaymentBlock(int) const ()
#4  0x0000000000731304 in IsBlockPayeeValid(long const&, CTransaction const&, int, unsigned int const&, unsigned int const&) ()
#5  0x000000000048ec09 in CheckBlock(CBlock const&, CValidationState&, bool, bool) ()
#6  0x00000000004b43da in ProcessNewBlock(CValidationState&, CNode*, CBlock*, CDiskBlockPos*) ()
#7  0x00000000004f57e7 in ProcessBlockFound(CBlock*, CWallet&, CReserveKey&) ()
#8  0x00000000004f8e96 in BitcoinMiner(CWallet*, bool) ()
#9  0x0000000000505057 in ThreadStakeMiner() ()
#10 0x000000000051b408 in void TraceThread<void (*)()>(char const*, void (*)())
    ()
#11 0x0000000000809bf2 in thread_proxy ()
#12 0x00007f2c875e16ba in start_thread (arg=0x7f2c717fa700)
    at pthread_create.c:333
#13 0x00007f2c8731741d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Drop full logs & dumps here: https://nextcloud.crownplatform.com/index.php/s/Q6H8enXNmJsQYCD

Possible fixes

(Any comments on what you think might be responsible for the problem, if you have particular insight.)

/cc @artem

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Assignee
Assign to
Time tracking