메뉴 건너뛰기

Cloudera, BigData, Semantic IoT, Hadoop, NoSQL

Cloudera CDH/CDP 및 Hadoop EcoSystem, Semantic IoT등의 개발/운영 기술을 정리합니다. gooper@gooper.com로 문의 주세요.


1. 상황

 가. zookeeper의 zookeeper_trace.log파일이 너무커서 강제로 cat /dev/null > zookeeper_trace.log로 0로 만들고 zkServer.sh로 재기동 했더니 hadoop cluster전체가 이상하게 행동하고 journalnode데몬이 다운되고 이어서 namenode가 다운되는 등의 문제가 발생함.

 나. 많은 WARN발생후에 journalnode데몬이 기동되더라고 namenode가 비 정상적으로 작동하다가 다운되는 문제가 발생함.

 다. 이에 따른 영향으로 hadoop cluster나 zookeeper를 사용하는 각 데몬들이 문제가 발생하면서 다운되는 문제가 발생함.


2. 원인

journal node가 사용하는 editLog가 노드가 sync되지 않고 다른 경우에 발생하는 문제임


3. 조치 

아래와 같은 오류가 발생하면 문제없이 journalnode데몬이 수행되는 곳의 edit log(edits_로 시작되는 파일만)들을 journal node로 사용하는 모든 서버에 복사하고 다시 기동시켜준다.(특히 edits_inprogress_0000000000005681277.empty파일이 포함되어야 한다.)

(예, scp -P 10022 /data/hadoop/journal/data/mycluster/current/edits_* root@gsda2:/data/hadoop/journal/data/mycluster/current)



-------------nodename기동시 오류(INFO)내용---------------

2017-09-14 15:10:33,450 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 73044 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:34,452 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 74046 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:35,453 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 75047 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:36,454 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 76048 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:37,455 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 77049 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:38,456 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 78050 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:39,457 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 79051 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:40,458 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 80052 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:41,459 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 81053 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:42,460 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 82054 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:43,462 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 83056 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:44,463 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 84057 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:45,464 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 85058 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:46,465 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 86059 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:47,466 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 87060 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:48,467 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 88061 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:49,468 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 89062 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:50,469 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 90063 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:51,470 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 91064 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:52,471 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 92065 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:53,472 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 93066 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:54,473 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 94067 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:55,474 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 95068 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:56,475 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 96070 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]



--------------journalnode 데몬 기동시 WARNING내용(editLog파일 각각 유사한 WARN이 발생함)---------

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1147)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:329)

        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:548)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:195)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:155)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:93)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:102)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:190)

        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)

        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:422)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1147)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:329)

        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:548)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:195)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:155)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:93)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:102)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:190)

        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)

        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:422)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

번호 제목 날짜 조회 수
27 [2.7.2] distribute-exclude.sh사용할때 ssh 포트변경에 따른 오류발생시 조치사항 2018.01.02 578
26 hadoop 어플리케이션을 사용하는 사용자 변경시 바꿔줘야 하는 부분 2016.09.23 525
25 서버중 slave,worker,regionserver만 재기동해야 할때 필요한 기동스크립트및 사용방법 2017.02.03 523
24 Windows7 64bit 환경에서 Apache Hadoop 2.7.1설치하기 2017.07.26 517
23 Hadoop - 클러스터 세팅및 기동 2015.04.28 513
22 java.lang.IllegalArgumentException: Does not contain a valid host:port authority: master 오류해결방법 2015.05.06 503
21 namenode오류 복구시 사용하는 명령 2016.04.01 502
20 AIX 7.1에 Hadoop설치(정리중) 2016.09.12 459
19 Hadoop 완벽 가이드 정리된 링크 2016.04.19 457
18 Mountable HDFS on CentOS 6.x(hadoop 2.7.2의 nfs기능을 이용) 2016.11.24 456
17 hadoop nfs gateway설정 (Cloudera 6.3.4, CentOS 7.4 환경에서) 2022.01.07 449
16 HDFS상의 /tmp폴더에 Permission denied오류가 발생시 조치사항 2017.01.25 438
15 hadoop클러스터를 구성하던 서버중 HA를 담당하는 서버의 hostname등이 변경되어 문제가 발생했을때 조치사항 2016.07.29 409
14 Cleaning up the staging area file시 'cannot access' 혹은 'Directory is not writable' 발생시 조치사항 2017.05.02 396
» editLog의 문제로 발생하는 journalnode 기동 오류 발생시 조치사항 2017.09.14 391
12 missing block및 관련 파일명 찾는 명령어 2021.02.20 336
11 AIX 7.1에 Hadoop설치(정리중#2) 2016.09.20 318
10 format된 namenode를 다른 서버에서 다시 format했을때 오류내용 2016.09.22 265
9 hadoop cluster구성된 노드를 확인시 Capacity를 보면 색이 붉은색으로 표시되어 있는 경우나 Unhealthy인 경우 처리방법 2017.08.30 243
8 ./hadoop-daemon.sh start namenode로 namenode기동시 EditLog의 custerId, namespaceId가 달라서 발생하는 오류 해결방법 2016.09.24 228
위로