메뉴 건너뛰기

Bigdata, Semantic IoT, Hadoop, NoSQL

Bigdata, Hadoop ecosystem, Semantic IoT등의 프로젝트를 진행중에 습득한 내용을 정리하는 곳입니다.
필요한 분을 위해서 공개하고 있습니다. 문의사항은 gooper@gooper.com로 메일을 보내주세요.


1. 상황

 가. zookeeper의 zookeeper_trace.log파일이 너무커서 강제로 cat /dev/null > zookeeper_trace.log로 0로 만들고 zkServer.sh로 재기동 했더니 hadoop cluster전체가 이상하게 행동하고 journalnode데몬이 다운되고 이어서 namenode가 다운되는 등의 문제가 발생함.

 나. 많은 WARN발생후에 journalnode데몬이 기동되더라고 namenode가 비 정상적으로 작동하다가 다운되는 문제가 발생함.

 다. 이에 따른 영향으로 hadoop cluster나 zookeeper를 사용하는 각 데몬들이 문제가 발생하면서 다운되는 문제가 발생함.


2. 원인

journal node가 사용하는 editLog가 노드가 sync되지 않고 다른 경우에 발생하는 문제임


3. 조치 

아래와 같은 오류가 발생하면 문제없이 journalnode데몬이 수행되는 곳의 edit log(edits_로 시작되는 파일만)들을 journal node로 사용하는 모든 서버에 복사하고 다시 기동시켜준다.(특히 edits_inprogress_0000000000005681277.empty파일이 포함되어야 한다.)

(예, scp -P 10022 /data/hadoop/journal/data/mycluster/current/edits_* root@gsda2:/data/hadoop/journal/data/mycluster/current)



-------------nodename기동시 오류(INFO)내용---------------

2017-09-14 15:10:33,450 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 73044 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:34,452 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 74046 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:35,453 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 75047 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:36,454 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 76048 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:37,455 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 77049 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:38,456 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 78050 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:39,457 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 79051 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:40,458 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 80052 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:41,459 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 81053 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:42,460 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 82054 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:43,462 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 83056 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:44,463 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 84057 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:45,464 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 85058 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:46,465 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 86059 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:47,466 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 87060 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:48,467 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 88061 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:49,468 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 89062 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:50,469 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 90063 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:51,470 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 91064 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:52,471 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 92065 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:53,472 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 93066 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:54,473 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 94067 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:55,474 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 95068 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:56,475 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 96070 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]



--------------journalnode 데몬 기동시 WARNING내용(editLog파일 각각 유사한 WARN이 발생함)---------

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1147)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:329)

        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:548)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:195)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:155)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:93)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:102)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:190)

        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)

        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:422)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1147)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:329)

        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:548)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:195)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:155)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:93)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:102)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:190)

        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)

        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:422)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

번호 제목 글쓴이 날짜 조회 수
461 서버중 slave,worker,regionserver만 재기동해야 할때 필요한 기동스크립트및 사용방법 총관리자 2017.02.03 143
460 LUBM 데이타 생성구문 총관리자 2017.07.24 143
459 ?a는 모두 표시하면서 ?b와 비교하여 ?a=?b는 ""로 하고 ?a!=?b 인경우는 해당값을 가지는 결과 집합을 구하는 경우 file 총관리자 2016.01.29 144
458 mongodb에서 큰데이타 sort시 오류발생에 대한 해결방법 총관리자 2015.12.22 145
457 update(update와 delete->insert)사용시 주의/참고사항 총관리자 2016.01.06 146
456 [jsoup]Jsoup Tutorial 총관리자 2017.04.11 146
455 Apache Toree설치(Jupyter에서 Scala, PySpark, SparkR, SQL을 사용할 수 있도록 하는 Kernel) 총관리자 2018.04.17 146
454 Cloudera의 API를 이용하여 impala의 실행되었던 쿼리 확인하는 예시 총관리자 2018.05.03 148
453 magento2 2.1.3을 수동으로 설치하는 방법 총관리자 2017.02.01 149
452 이미지 관리 오픈소스 목록 총관리자 2018.03.11 149
451 천문학적, 기후학적, 기상학적, 생물학적, 농사계절 구분 총관리자 2015.12.16 150
450 [HIVESERVER2]프로세스의 thread및 stack trace를 덤프하는 방법(pstack, jstack) 총관리자 2022.05.11 150
449 No broker partitions consumed by consumer thread오류 발생시 확인/조치할 사항 총관리자 2016.09.02 151
448 [PHP7.0]로그파일 위치 총관리자 2017.05.07 151
447 windows7에서 lagom의 hello world를 빌드하여 실행하는 경우의 로그(mvn lagom:runAll -Dscala.binary.version=2.11) 총관리자 2017.12.22 152
446 test2 총관리자 2017.05.01 153
445 schema.xml vs managed-schema 지정 사용하기 - 두개를 동시에 사용할 수는 없음 총관리자 2017.07.09 153
444 spark 시동중 applicationHistory 로그 디렉토리가 없다고 하면서 기동되지 않는 경우 총관리자 2018.06.01 153
443 [CentOS] 네트워크 설정 총관리자 2018.03.26 154
442 format된 namenode를 다른 서버에서 다시 format했을때 오류내용 총관리자 2016.09.22 155

A personal place to organize information learned during the development of such Hadoop, Hive, Hbase, Semantic IoT, etc.
We are open to the required minutes. Please send inquiries to gooper@gooper.com.

위로