Cloudera CDH/CDP 및 Hadoop EcoSystem, Semantic IoT등의 개발/운영 기술을 정리합니다. gooper@gooper.com로 문의 주세요.

hive upsert구현방법(년-월-일 파티션을 기준으로) 및 테스트 script

총관리자 2018.07.03 22:45 조회 수 : 5368

출처 : http://letsexplorehadoop.blogspot.com/2016/05/upsert-in-hive-3-step-process.html

아래설명을 기준으로 hive에서 실행해본 hive script

--------------------------------------------------------------------------------------------------

-- create table if not exists site_view_hist(

-- brower_name string,

-- clicks_count int,

-- impressions_count int)

-- partitioned by (hit_date date)

-- row format delimited

-- fields terminated by ',';

-- gooper@gsda1:/var/log$ hdfs dfs -cat /user/hive/warehouse/site_view_hist/hit_date=2016-01-01/000000_0

--iexplorer,123,456

SET hive.support.concurrency = true;

SET hive.enforce.bucketing = true;

SET hive.exec.dynamic.partition.mode = nonstrict;

SET hive.txn.manager =org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

SET hive.compactor.initiator.on = true;

SET hive.compactor.worker.threads = 1;

truncate table site_view_hist;

truncate table site_view_raw;

insert into table site_view_hist partition(hit_date='2016-01-01') values('iexplorer', 123, 456);

insert into table site_view_hist partition(hit_date='2016-01-01') values('firefox', 123, 456);

insert into table site_view_hist partition(hit_date='2016-01-01') values('chrome', 123, 456);

insert into table site_view_hist partition(hit_date='2016-01-02') values('firefox', 111, 431);

insert into table site_view_hist partition(hit_date='2016-01-03') values('chrome', 234, 567);

insert into table site_view_hist partition(hit_date='2016-01-03') values('iexplorer', 234, 567);

insert into table site_view_hist partition(hit_date='2016-01-03') values('firefox', 987, 654);

insert into table site_view_hist partition(hit_date='2016-01-04') values('chrome', 529, 912);

insert into table site_view_hist partition(hit_date='2016-01-05') values('firefox', 911, 888);

insert into table site_view_hist partition(hit_date='2016-01-06') values('iexplorer', 900, 833);

select * from site_view_hist;

-- create table if not exists site_view_raw(

-- brower_name string,

-- clicks_count int,

-- impressions_count int)

-- partitioned by (hit_date date)

-- row format delimited

-- fields terminated by ',';

insert into table site_view_raw partition(hit_date='2016-01-01') values('chrome', 246, 789);

insert into table site_view_raw partition(hit_date='2016-01-01') values('firefox', 999, 200);

insert into table site_view_raw partition(hit_date='2016-01-31') values('iexplorer', 144, 999);

select * from site_view_raw;

select h.* from site_view_hist h where h.hit_date in (select distinct hit_date from site_view_raw r);

drop table site_view_temp1;

--아래 설명에서 subquery부분에 brower_name is not null을 추가하여 파티션만 있고 데이타 없는 경우는 포함되지 않도록함

create table site_view_temp1

as select h.* from site_view_hist h where h.hit_date in (select distinct hit_date from site_view_raw r where r.brower_name is not null);

select * from site_view_temp1;

create table site_view_temp2 as select t1.* from site_view_temp1 t1

where not exists

(select 1 from site_view_raw r

where t1.brower_name=r.brower_name

and t1.hit_date=r.hit_date);

select * from site_view_temp2;

insert into table site_view_temp2

select *

from site_view_raw;

select * from site_view_temp2;

insert overwrite table site_view_hist

partition(hit_date)

select * from site_view_temp2;

select * from site_view_hist;

--------------------------------------------------------------------------------------------------

UPSERT in Hive(3 Step Process)

In this post I am providing a 3 step process for performing UPSERT in hive on a large size table containing entire history.
Just for the audience not aware of UPSERT - It is a combination of UPDATE and INSERT. If on a table containing history data, we receive new data which needs to be inserted as well as some data which is an UPDATE to the existing data, then we have to perform an UPSERT operation to achieve this.

Prerequisite – The table containing history being very large in size should be partitioned, which is also a best practice for efficient data storage, when working with large data in hive.

Business scenario – Lets take a scenario of a website table containing website metrics as gathered from different browsers of visitors who visited the website. The site_view_hist table contains the clicks and page impressions counts from different browsers and the table is partitioned on hit_date(the date on which the visitor visited the website).

Clicks – number of clicks(Eg on adds displayed) done by visitor on website page.
Impressions – number of times the website pages or different sections were viewed by the visitor.

Problem statement - If we receive correction in the number of clicks and impressions as recorded by browser, we need to update them in the history table and also insert any new records we received.

Lets dive into it:
In the history table we have browser_name and hit_date as a composite key which will remain constant and we receive updates in the values of clicks_count and impressions_count columns.

DDL of history table

Data:

Now suppose we receive records for date 2016-01-01(marked in blue) for firefox and chrome browsers, with an updated value of clicks and impressions, and we also received a new record(iexplorer) for 2016-01-31. Let us store these new and updated records in the following raw table:

DDL of Raw table

Data

Now we need an UPSERT solution, which updates the records of site_view_hist table for hit_date 2016-01-01 and insert the new record for 2016-01-31.

SOLUTION (3 STEP):
To achieve this in an efficient way, we will use the following 3 step process:
Prep Step - We should first get those partitions from the history table which needs to be updated. So we create a temp table site_view_temp1 which contains the rows from history with hit_date equal to the hit_date of raw table.
This will bring us all the hit_date partitions from history table for which atleast one updated record exists in the raw table.
Note - Instead of table we can also create a view for efficient processing and saving storage space.

Data of site_view_temp1 table:

Step 1 – From these fetched partitions we will separate the old unchanged rows. These are the rows in which there is no change in the clicks and impressions count. For this we will create a temp table site_view_temp2 as follows:

Data of site_view_temp2 table:

Step2 – Now we will insert into this new temp table, all the rows from the raw table. This step will bring in the updated rows as well as any new rows. And since site_view_temp2 already contained the old rows, so it will now have all the rows including new, updated, and unchanged old rows. Following query does this:

New Data of site_view_temp2 table

Step3 – Now simply insert overwriting the site_view_hist table with site_view_temp2 table, will provide us the required output rows including two updated rows for 2016-01-01 and one new inserted row for 2016-01-31.
Catch – Since the history table is partitioned on the hit_date, the respective partitions will only be overwritten as follows:

Final history table with updated and inserted rows:

Benefits of this approach:

In the prep step itself since we are fetching just the partitions we have to update, so we are not scanning the whole history table. This makes our processing faster.
In the final step as we are insert overwriting the history with the temp table, we are touching just the partition we want to update along with a new partition created for the new record.This gives a high performance gain, as I gained for my production process on a 6.7 TB history table with over 5 billion records. But since my 3 step process(included in one hive script) just touched few partitions of few thousand rows, the process completed in just minutes.

이 게시물을

이 글의 추천인 목록 목록

번호	제목	날짜	조회 수
650	[보안/인증]javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target발생 원인/조치내용	2023.10.24	4903
649	[HA구성 이슈]oozie 2대를 L4로 HA구성했을때 발생하는 이슈	2023.01.17	4903
648	Cacti로 Hadoop 모니터링 하기	2013.03.12	4902
647	oozie에서 share lib설정시 action type별로 구분하여 넣을것	2014.04.18	4899
646	[KUDU] kudu tablet server여러가지 원인에 의해서 corrupted상태가 된 경우 복구방법	2023.03.28	4883
645	[JSON 파싱]mongodb의 document를 GSON을 이용하여 parsing할때 ObjectId값에서 오류 발생시 조치방법	2017.01.18	4864
644	banana pi에 hive 0.13.1+mysql(metastore)설치	2014.09.09	4858
643	json으로 존재하는 데이터 parsing하기	2019.03.25	4849
642	hadoop설치시 참고사항	2013.03.08	4849
641	spark에서 hive table을 읽어 출력하는 예제 소스	2017.03.09	4848
640	[Impala 3.2버젼]compute incremental stats db명.테이블명 수행시 ERROR: AnalysisException: Incremental stats size estimate exceeds 2000.00MB. 오류 발생원인및 조치방안	2022.11.30	4845
639	Impala Admission Control 설정시 쿼리가 사용하는 메모리 사용량 판단 방법	2023.05.19	4844
638	임시 테이블에서 데이터를 읽어서 partitioned table에 입력하는 impala SQL문 예시	2023.11.10	4839
637	Scala에서 countByWindow를 이용하기(예제)	2018.03.08	4838
636	[ftgo_application]Unable to infer base url오류 발생시 조치방법	2023.02.20	4826
635	[CDP7.1.7]Oozie job에서 ERROR: Kudu error(s) reported, first error: Timed out: Failed to write batch of 774 ops to tablet 8003f9a064bf4be5890a178439b2ba91가 발생하면서 쿼리가 실패하는 경우	2024.01.05	4818
634	[application수행 로그]Failed to read the application application_123456789012_123456시 조치 방법	2022.03.21	4818
633	hadoop 클러스터 실행 스크립트 정리	2018.03.20	4815
632	mysql에서 외부 디비를 커넥션할 경우 접속 속도가 느려질때	2017.06.30	4813
631	hive의 메타정보 테이블을 MariaDB로 사용하는 경우 table comment나 column comment에 한글 입력시 깨지는 경우 utf8로 바꾸는 방법.	2023.03.10	4811

쓰기 태그

첫 페이지 1 2 3 4 5 6 7 8 9 10 끝 페이지

Cloudera, BigData, Semantic IoT, Hadoop, NoSQL

Cloudera CDH/CDP 및 Hadoop EcoSystem, Semantic IoT등의 개발/운영 기술을 정리합니다. gooper@gooper.com로 문의 주세요.

hive upsert구현방법(년-월-일 파티션을 기준으로) 및 테스트 script

UPSERT in Hive(3 Step Process)

댓글 0

Cloudera, BigData, Semantic IoT, Hadoop, NoSQL

Cloudera CDH/CDP 및 Hadoop EcoSystem, Semantic IoT등의 개발/운영 기술을 정리합니다. gooper@gooper.com로 문의 주세요.

hive upsert구현방법(년-월-일 파티션을 기준으로) 및 테스트 script

UPSERT in Hive(3 Step Process)

댓글 0

LOGIN