메뉴 건너뛰기

Bigdata, Semantic IoT, Hadoop, NoSQL

Bigdata, Hadoop ecosystem, Semantic IoT등의 프로젝트를 진행중에 습득한 내용을 정리하는 곳입니다.
필요한 분을 위해서 공개하고 있습니다. 문의사항은 gooper@gooper.com로 메일을 보내주세요.


1. 상황

 가. zookeeper의 zookeeper_trace.log파일이 너무커서 강제로 cat /dev/null > zookeeper_trace.log로 0로 만들고 zkServer.sh로 재기동 했더니 hadoop cluster전체가 이상하게 행동하고 journalnode데몬이 다운되고 이어서 namenode가 다운되는 등의 문제가 발생함.

 나. 많은 WARN발생후에 journalnode데몬이 기동되더라고 namenode가 비 정상적으로 작동하다가 다운되는 문제가 발생함.

 다. 이에 따른 영향으로 hadoop cluster나 zookeeper를 사용하는 각 데몬들이 문제가 발생하면서 다운되는 문제가 발생함.


2. 원인

journal node가 사용하는 editLog가 노드가 sync되지 않고 다른 경우에 발생하는 문제임


3. 조치 

아래와 같은 오류가 발생하면 문제없이 journalnode데몬이 수행되는 곳의 edit log(edits_로 시작되는 파일만)들을 journal node로 사용하는 모든 서버에 복사하고 다시 기동시켜준다.(특히 edits_inprogress_0000000000005681277.empty파일이 포함되어야 한다.)

(예, scp -P 10022 /data/hadoop/journal/data/mycluster/current/edits_* root@gsda2:/data/hadoop/journal/data/mycluster/current)



-------------nodename기동시 오류(INFO)내용---------------

2017-09-14 15:10:33,450 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 73044 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:34,452 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 74046 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:35,453 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 75047 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:36,454 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 76048 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:37,455 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 77049 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:38,456 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 78050 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:39,457 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 79051 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:40,458 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 80052 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:41,459 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 81053 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:42,460 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 82054 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:43,462 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 83056 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:44,463 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 84057 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:45,464 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 85058 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:46,465 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 86059 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:47,466 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 87060 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:48,467 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 88061 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:49,468 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 89062 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:50,469 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 90063 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:51,470 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 91064 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:52,471 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 92065 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:53,472 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 93066 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:54,473 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 94067 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:55,474 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 95068 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]

2017-09-14 15:10:56,475 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 96070 ms (timeout=120000 ms) for a response for getJournalState(). Succeeded so far: [XXX.XXX.XXX.XXX:8485]



--------------journalnode 데몬 기동시 WARNING내용(editLog파일 각각 유사한 WARN이 발생함)---------

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1147)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:329)

        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:548)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:195)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:155)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:93)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:102)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:190)

        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)

        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:422)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1147)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:329)

        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:548)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:195)

        at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:155)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:93)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:102)

        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:190)

        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)

        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)

        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)

        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:422)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)

        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2455)

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1036288

2017-09-14 17:02:32,882 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/hadoop/journal/data/mycluster/current/edits_inprogress_0000000000004738264 while determining its valid length. Position was 1036288

java.io.IOException: Can't scan a pre-transactional edit log.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4924)

        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)

번호 제목 글쓴이 날짜 조회 수
461 다중 모듈 프로젝트 설정에 대한 설명 총관리자 2016.09.21 80
460 journalnode노드 기동시 "should be an absolute path"가 발생하고 기동되지 않을 경우 확인사항 총관리자 2016.09.22 93
459 format된 namenode를 다른 서버에서 다시 format했을때 오류내용 총관리자 2016.09.22 155
458 hadoop 어플리케이션을 사용하는 사용자 변경시 바꿔줘야 하는 부분 총관리자 2016.09.23 68
457 ./hadoop-daemon.sh start namenode로 namenode기동시 EditLog의 custerId, namespaceId가 달라서 발생하는 오류 해결방법 총관리자 2016.09.24 119
456 AIX 7.1에 MariaDB 10.2 소스 설치 총관리자 2016.09.24 2372
455 파일끝에 붙는 ^M 일괄 지우기(linux, unix(AIX)) 혹은 파일내에 있는 ^M지우기 총관리자 2016.09.24 78
454 schema설정없이 hive를 최초에 실행했을때 발생하는 오류메세지및 처리방법 총관리자 2016.09.25 1226
453 hive기동시 Caused by: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D 오류 발생시 조치사항 총관리자 2016.09.25 496
452 AIX 7.1에서 hive실행시 "hive: line 86: readlink: command not found" 오류가 발생시 임시 조치사항 총관리자 2016.09.25 235
451 DBCP Datasource(org.apache.commons.dbcp.BasicDataSource) 설정 및 속성 설명 총관리자 2016.09.26 74
450 모두를 위한 머신러닝과 딥러닝의 강의 file 총관리자 2016.09.27 192
449 프로그래밍 언어별 딥러닝 라이브러리 정리 file 총관리자 2016.10.05 215
448 AIX 7.1에 Python 2.7.11설치하기 총관리자 2016.10.06 652
447 동시에 많은 요청이 endpoint로 몰려서java.net.NoRouteToHostException가 발생하는 경우의 처리방법 총관리자 2016.10.17 510
446 java.lang.OutOfMemoryError: unable to create new native thread오류 발생지 조치사항 총관리자 2016.10.17 470
445 producer / consumer구현시 설정 옵션 설명 총관리자 2016.10.19 121
444 운영중인 상태에서 kafka topic삭제하고 재생성하여 처리되지 않은 메세지 모두 삭제하기 총관리자 2016.10.24 156
443 VisualVM 1.3.9을 이용한 JVM 모니터링 file 총관리자 2016.10.27 333
442 VisualVM 1.3.9을 이용한 spark-submit JVM 모니터링을 위한 설정및 spark-submit실행 옵션 총관리자 2016.10.28 1891

A personal place to organize information learned during the development of such Hadoop, Hive, Hbase, Semantic IoT, etc.
We are open to the required minutes. Please send inquiries to gooper@gooper.com.

위로