spark) DataFrame values changing between calls in spark shell

2023. 12. 28. 14:31

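While writing test code in the spark shell,

I ran into a DataFrame, loaded from a Hive table, whose values changed every time I called df.show().

A likely cause is that Spark evaluates lazily, so each df.show() re-runs the query, and a limit without an ORDER BY does not guarantee which rows come back.

To keep the data resident in memory once it had been loaded,

I used df.cache().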

Example code:

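from pyspark.sql import SparkSession

# Build (or reuse) a Hive-enabled session
spark = (
    SparkSession.builder
    .appName("test")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    # Full config key; the bare "partitionoverwritemode" is not a Spark config
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .enableHiveSupport()
    .getOrCreate()
)

# LIMIT without ORDER BY does not guarantee which rows are returned
df = spark.sql("select * from test.sample limit 10")

# cache() is lazy: the first action (show) materializes the result
df.cache()
df.show()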


웅이 IT 저장소
