FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors) – Error: starting log segment failed for required journal

In the NameNode log:
2018-05-13 04:48:53,131 INFO  namenode.FSEditLog (FSEditLog.java:startLogSegment(1294)) - Starting log segment at 608846305
2018-05-13 04:48:59,132 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:00,133 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 7003 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:01,134 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 8004 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:02,136 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 9005 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:03,136 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 10006 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:04,138 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 11007 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:05,139 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:06,139 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:07,141 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:08,141 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:09,142 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:10,144 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:11,144 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 18014 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:12,146 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 19015 ms (timeout=20000 ms) for a response for startLogSegment(608846305). Succeeded so far: [192.168.3.166:8485]
2018-05-13 04:49:13,131 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: starting log segment 608846305 failed for required journal (JournalAndStream(mgr=QJM to [192.168.3.166:8485, 192.168.3.164:8485, 192.168.3.167:8485], stream=null))
2018-05-13 04:49:13,135 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2018-05-13 04:49:13,143 INFO  namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:

In the JournalNode logs:
2018-05-13 04:49:24,666 INFO  server.Journal (Journal.java:startLogSegment(546)) - Updating lastWriterEpoch from 46 to 47 for client /192.168.3.167
2018-05-13 04:50:37,846 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction range 608849399-608849399 took 13418ms
2018-05-13 04:50:50,015 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction range 608850103-608850105 took 7125ms
2018-05-13 04:51:49,897 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction range 608852085-608852088 took 11132ms
2018-05-13 04:52:26,540 WARN  server.Journal (Journal.java:journal(398)) - Sync of transaction range 608853148-608853150 took 13877ms
2018-05-13 04:52:38,541 WARN  ipc.Server (Server.java:processResponse(1273)) - IPC Server handler 0 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from 192.168.3.167:49156 Call#9522213 Retry#0: output error
2018-05-13 04:52:38,541 INFO  ipc.Server (Server.java:run(2402)) - IPC Server handler 0 on 8485 caught an exception
These log messages indicate that the NameNode's calls to the JournalNodes timed out. The NameNode must receive successful responses from a quorum of JournalNodes, at least 2 out of 3, before it can start a new edit log segment. Here only one JournalNode (192.168.3.166) responded within the 20000 ms timeout, so the call timed out to at least 2 out of 3 of them and the quorum was not reached. Losing the required journal quorum is a fatal condition for the NameNode, so by design it aborts.
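
A first step is simply to confirm that each JournalNode daemon is up and responsive. One rough check, assuming the JournalNodes listen on the default HTTP port 8480, is to poll each node's /jmx servlet from the NameNode host:

# Poll each JournalNode's built-in /jmx servlet; a hung or down daemon will not
# answer. Port 8480 is the default dfs.journalnode.http-address port; adjust it
# if your cluster overrides that setting.
for jn in 192.168.3.166 192.168.3.164 192.168.3.167; do
  echo "== $jn =="
  curl -s --max-time 5 "http://$jn:8480/jmx" | head -n 5
done
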
There are multiple potential reasons for this timeout condition, and reviewing logs from the NameNodes and JournalNodes will usually reveal more detail. Several common causes are worth checking:
A long stop-the-world garbage collection pause in either the NameNode or a JournalNode can exceed the timeout for the call. GC logging will show what kind of collection activity the process was doing at the time, and you may also see warnings from the "JvmPauseMonitor" in the daemon logs. Consider reviewing the article NameNode Garbage Collection Configuration: Best Practices and Rationale to make sure your cluster's heap and garbage collection settings match best practices.
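
As a quick check, a search of the daemon logs around the failure window can show whether the JVM was pausing; the log paths below are only examples and will vary by distribution and configuration:

# Pause warnings from the JvmPauseMonitor look like:
#   "Detected pause in JVM or host machine (eg GC): pause of approximately Nms"
# Adjust the log locations for your installation.
grep "JvmPauseMonitor" /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log
grep "JvmPauseMonitor" /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-*.log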

In environments that integrate with LDAP for resolution of users' group memberships, load problems on the LDAP infrastructure can cause delays. In extreme cases, we have seen such delays cause edit logging calls at the JournalNodes to time out, which in turn triggers a NameNode abort and an HA failover. See Hadoop and LDAP: Usage, Load Patterns and Tuning for a more detailed description and potential mitigation steps.
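
One rough way to gauge group-resolution latency is to time a group lookup from the NameNode and JournalNode hosts; the account name below is just an example:

# "hdfs groups" resolves group membership through the NameNode's group-mapping
# path; "id" on a JournalNode host exercises the local OS/SSSD/LDAP path.
# Consistently slow responses point at the group-resolution infrastructure.
time hdfs groups hdfs
time id hdfs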

There may also be a failure in network connectivity between the NameNode and the JournalNodes. This tends to be rare, because NameNodes and JournalNodes are usually colocated on the same hosts or placed close to one another in the network topology. Still, it is worth verifying that basic network connectivity between all NameNode hosts and all JournalNode hosts is working.
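
A basic reachability check from the active NameNode host to each JournalNode's RPC port (8485 in the logs above) can rule out the simplest failures; the addresses are the ones from this cluster's logs:

# From the NameNode host, confirm each JournalNode RPC port is reachable.
for jn in 192.168.3.166 192.168.3.164 192.168.3.167; do
  nc -vz -w 5 "$jn" 8485
done
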
As a workaround in this cluster, the following custom properties were added to hdfs-site.xml, raising the quorum journal timeouts well above the 20000 ms value seen in the log above:

dfs.qjournal.start-segment.timeout.ms = 300000
dfs.qjournal.select-input-streams.timeout.ms = 300000
dfs.qjournal.write-txns.timeout.ms = 300000

The following property was also added to core-site.xml:

ipc.client.connect.timeout = 300000
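
The NameNodes must be restarted for the new timeouts to take effect. That the values are actually being picked up from the configuration can be sanity-checked with hdfs getconf; the 300000 ms values are the ones chosen above, not recommended defaults:

# Confirm that the raised timeouts are visible in the effective configuration.
hdfs getconf -confKey dfs.qjournal.start-segment.timeout.ms
hdfs getconf -confKey dfs.qjournal.write-txns.timeout.ms
hdfs getconf -confKey ipc.client.connect.timeout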