Apache Spark, Parallel Distributed Processing

Description
Summary of Spark & Flink for Real-time Big Data Processing - part2
Last Updated
Last updated July 29, 2023
Tags
spark
DE
Data Engineering
💡
This article is a summary of "Spark & Flink Online for Real-Time Big Data Processing" from Fast Campus. It covers the overall concepts only; for the programming-related content, taking the lecture itself is recommended.

빅데이터 처리λ₯Ό μœ„ν•œ μ•„νŒŒμΉ˜ 슀파크 (Apache Spark)

μ•„νŒŒμΉ˜ μŠ€νŒŒν¬λŠ” μ•„λ§ˆμ‘΄, 에어비앀비 같은 κΈ°μ—…μ—μ„œ λΉ…λ°μ΄ν„°λΌλŠ” κ³΅ν†΅μ˜ 문제λ₯Ό ν•΄κ²°ν•˜κΈ° μœ„ν•΄ μ‚¬μš©ν•˜λŠ” μ˜€ν”ˆμ†ŒμŠ€ 고속 λΆ„μ‚° 처리 μ‹œμŠ€ν…œμž…λ‹ˆλ‹€. 빅데이터 λ¬Έμ œλŠ” 크기, 속도, λ‹€μ–‘μ„±μ˜ μ„Έ 가지 λ²”μ£Όλ‘œ λ‚˜λ‰©λ‹ˆλ‹€. ꡬ글은 HDFS, 맡리듬, μ—°μ˜ μ„Έ 가지 μ£Όμš” λΆ€λΆ„μœΌλ‘œ κ΅¬μ„±λœ 단일 μ—μ½”μ‹œμŠ€ν…œμ„ κ°œλ°œν•˜μ—¬ 이 문제λ₯Ό ν•΄κ²°ν–ˆμŠ΅λ‹ˆλ‹€. μ•„νŒŒμΉ˜ μŠ€νŒŒν¬λŠ” ν•˜λ‘‘μ˜ 계산 엔진을 λŒ€μ²΄ν•˜λŠ” ν”„λ‘œμ νŠΈλ‘œ, λ©”λͺ¨λ¦¬ κ°œλ°©ν˜• ꡬ쑰둜 인해 속도가 λΉ λ¦…λ‹ˆλ‹€.
μŠ€νŒŒν¬λŠ” 데이터λ₯Ό μ—¬λŸ¬ λ…Έλ“œμ— 걸쳐 λ©”λͺ¨λ¦¬μ— μ €μž₯ν•˜κ³  λ™μ‹œμ— μ²˜λ¦¬ν•˜λŠ” λ°©μ‹μœΌλ‘œ 데이터λ₯Ό μ²˜λ¦¬ν•  수 μžˆμŠ΅λ‹ˆλ‹€. μŠ€νŒŒν¬λŠ” 슀파크 μ»¨ν…μŠ€νŠΈκ°€ μžˆλŠ” λ“œλΌμ΄λ²„ ν”„λ‘œκ·Έλž¨κ³Ό μŠ€ν¬λŸ¬λ²„ μ£Ό μž‘μ—… λ…Έλ“œλ‘œ λ‚˜λˆŒ 수 μžˆλŠ” ν΄λŸ¬μŠ€ν„°λ₯Ό ν˜•μ„±ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. λ˜ν•œ μŠ€νŒŒν¬λŠ” λŒ€κΈ° 쀑에도 κ³Όμž₯된 계산을 μˆ˜ν–‰ν•  수 μžˆμ–΄ ν”„λ‘œμ„ΈμŠ€ 속도λ₯Ό 높일 수 μžˆμŠ΅λ‹ˆλ‹€. 슀파크의 핡심 데이터 λͺ¨λΈμ€ μ—¬λŸ¬ νŒŒν‹°μ…˜μœΌλ‘œ λΆ„λ¦¬λ˜μ–΄μ•Ό ν•˜λŠ” λΆˆλ³€μ˜ 탄λ ₯적인 λΆ„μ‚° 데이터 μΊμ‹œμΈ RDD라고 λΆˆλ¦½λ‹ˆλ‹€.
μŠ€νŒŒν¬λŠ” νŒλ‹€λ³΄λ‹€ 속도가 λŠλ¦¬μ§€λ§Œ μˆ˜ν‰ ν™•μž₯이 κ°€λŠ₯ν•˜κΈ° λ•Œλ¬Έμ— 빅데이터 처리 속도가 λΉ λ₯΄κ³  νš¨μœ¨μ μž…λ‹ˆλ‹€. Hadoop Memorit보닀 λ©”λͺ¨λ¦¬μ—μ„œ 100λ°°, λ””μŠ€ν¬μ—μ„œ 10λ°° 더 λΉ λ¦…λ‹ˆλ‹€.
μš”μ•½ν•˜μžλ©΄, μ•„νŒŒμΉ˜ μŠ€νŒŒν¬λŠ” 빅데이터 μ²˜λ¦¬μ— ν•„μˆ˜μ μΈ λ„κ΅¬λ‘œ λŒ€μš©λŸ‰ 데이터 μ²˜λ¦¬λΌλŠ” κ³Όμ œμ— λΉ λ₯΄κ³  효율적인 μ†”λ£¨μ…˜μ„ μ œκ³΅ν•©λ‹ˆλ‹€.
Β 

Apache Spark의 κ°„λž΅ν•œ 역사

이 λΈ”λ‘œκ·Έ κ²Œμ‹œλ¬Όμ—μ„œλŠ” 버전 1, 2, 3에 걸친 κ°œλ°œμ— μ΄ˆμ μ„ 맞좰 Apache Spark의 역사λ₯Ό κ°„λž΅ν•˜κ²Œ μ‚΄νŽ΄λ΄…λ‹ˆλ‹€. 핡심 사항은 λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€:

μ•„νŒŒμΉ˜ 슀파크 버전 1

  • 2014λ…„ 곡식 λ°œν‘œ
  • RDDλ₯Ό μ΄μš©ν•œ 2λ©”λͺ¨λ¦¬ 처리 방식
  • 버전 1.3μ—μ„œ "데이터 높이"λΌλŠ” 데이터 ν”„λ ˆμž„ μΆ”κ°€
  • ν”„λ‘œμ νŠΈ ν……μŠ€ν…μ„ ν†΅ν•œ 엔진 μ—…κ·Έλ ˆμ΄λ“œλ‘œ λ©”λͺ¨λ¦¬ 및 CPU νš¨μœ¨μ„± μ΅œμ ν™”

μ•„νŒŒμΉ˜ 슀파크 버전 2

  • 2016년에 λ°œν‘œ
  • λ”μš± λ‹¨μˆœν™”λ˜κ³  μ„±λŠ₯이 ν–₯μƒλ˜μ—ˆμŠ΅λ‹ˆλ‹€.
  • Spark SQL μœ„μ— ν”„λ‘œμ νŠΈ κ΅¬μ‘°ν™”λœ 슀트리밍으둜 슀트리밍 처리 κ°€λŠ₯
  • "데이터 μΊμ‹œ"라고 λΆˆλ¦¬λŠ” μΌμ •λŸ‰μ˜ 데이터 ν”„λ ˆμž„μ„ ν†΅ν•œ 데이터 캐싱
  • 파이썬, μžλ°”, 슀칼라 λ“± λ‹€μ–‘ν•œ μ–Έμ–΄ 지원

μ•„νŒŒμΉ˜ 슀파크 버전 3

  • 2020년에 μΆœμ‹œ
  • MLlib 및 Spark SQL κΈ°λŠ₯ μΆ”κ°€
  • λŒ€ν­μ μΈ μ„±λŠ₯ ν–₯상 - Spark 2.4보닀 μ•½ 200% 빨라짐
  • μ£Όλͺ©ν•  λ§Œν•œ PySpark μ‚¬μš©
  • Project GraphXλ₯Ό 톡해 GPU 기반 μ›Œν¬λ‘œλ“œ 및 λΆ„μ‚° κ·Έλž˜ν”„ 연산에 λŒ€ν•œ 지원 κ°•ν™”
  • 파이썬 2 지원 쀑단, μΏ λ²„λ„€ν‹°μŠ€ 지원 κ°•ν™”

슀파크 ꡬ성

  • SparkλŠ” 5가지 μœ ν˜•μœΌλ‘œ λ‚˜λ‰©λ‹ˆλ‹€: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphXμž…λ‹ˆλ‹€.
  • 슀파크 코어와 같은 기본적인 뢀뢄은 컬러 κΈ°λŠ₯이 μΆ”κ°€λ˜λ”λΌλ„ λ³€ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€.
결둠적으둜 μ•„νŒŒμΉ˜ μŠ€νŒŒν¬λŠ” κ³„μ†ν•΄μ„œ κΈ°λŠ₯을 μΆ”κ°€ν•˜κ³  μ„±λŠ₯을 κ°œμ„ ν•˜κ³  μžˆμ§€λ§Œ, 기본적인 뢀뢄은 λ³€ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€. λ”°λΌμ„œ 슀파크 코어와 같은 기초적인 λΆ€λΆ„λΆ€ν„° μ°¨κ·Όμ°¨κ·Ό λ°°μ›Œλ‚˜κ°€λŠ” 것이 κ°€μž₯ μ’‹μŠ΅λ‹ˆλ‹€.

Apache Spark의 RDD μ†Œκ°œ

이 λΈ”λ‘œκ·Έ κ²Œμ‹œλ¬Όμ—μ„œλŠ” Apache Spark의 RDD(탄λ ₯적 λΆ„μ‚° 데이터 μ„ΈνŠΈ)에 λŒ€ν•΄ μ•Œμ•„λ΄…λ‹ˆλ‹€. 핡심은 λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€:
  • RDDλŠ” "탄λ ₯적 λΆ„μ‚° 데이터 μ„ΈνŠΈ" λ˜λŠ” "ν˜„λͺ…ν•œ μ—”λ“œ 배포 λ·° λ‘œλ“œ 데이터 μ…‹"의 μ•½μžμž…λ‹ˆλ‹€. μš°λ¦¬λ§λ‘œλŠ” "탄λ ₯적 λΆ„μ‚° 데이터"λΌλŠ” λœ»μž…λ‹ˆλ‹€.
  • SparkContext의 "textFile" ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•˜μ—¬ νŒŒμΌμ—μ„œ 데이터λ₯Ό λ‘œλ“œν•˜μ—¬ RDDλ₯Ό 생성할 수 μžˆμŠ΅λ‹ˆλ‹€.
  • RDDμ—λŠ” λ‹€μ„― 가지 νŠΉμ§•μ΄ μžˆμŠ΅λ‹ˆλ‹€:
      1. 데이터 좔상화: 데이터가 μ—¬λŸ¬ λ…Έλ“œμ— 흩어져 μžˆμ§€λ§Œ ν•˜λ‚˜μ˜ 객체둜 μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
      1. 탄λ ₯적 & Immutable : 데이터λ₯Ό λΆˆλ³€μœΌλ‘œ μœ μ§€ν•˜κ³  ν•„μš”ν•  λ•Œ μ•„μΉ΄μ΄λΈŒν•˜μ—¬ λ„€νŠΈμ›Œν¬, ν•˜λ“œμ›¨μ–΄, λ©”λͺ¨λ¦¬μ˜ μž₯μ• λ₯Ό μ²˜λ¦¬ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
      1. μœ ν˜• μ•ˆμ „: 컴파일 μœ ν˜•μ„ νŒŒμ•…ν•˜μ—¬ 문제λ₯Ό 쑰기에 감지할 수 μžˆμŠ΅λ‹ˆλ‹€.
      1. μ •ν˜• 및 λΉ„μ •ν˜• 데이터λ₯Ό λͺ¨λ‘ μ²˜λ¦¬ν•©λ‹ˆλ‹€: ν…μŠ€νŠΈ 데이터와 ν‘œ ν˜•μ‹μ˜ 데이터 집합을 λͺ¨λ‘ μ²˜λ¦¬ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
      1. Lazy evaluation: κ²°κ³Όκ°€ ν•„μš”ν•  λ•ŒκΉŒμ§€ 계산을 μˆ˜ν–‰ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€.
  • RDD μž‘μ—…μ€ λ³€ν™˜κ³Ό μž‘μ—…μ˜ 두 λΆ€λΆ„μœΌλ‘œ λ‚˜λˆŒ 수 μžˆμŠ΅λ‹ˆλ‹€.
  • RDD μ‚¬μš©μ˜ μž₯μ μœΌλ‘œλŠ” μœ μ—°μ„±, κ°„κ²°ν•œ μ½”λ“œ, 데이터 μ‘°μž‘ μ‹œ 더 신쀑할 수 μžˆλ‹€λŠ” 점 등이 μžˆμŠ΅λ‹ˆλ‹€.
μ „λ°˜μ μœΌλ‘œ RDDλŠ” Apache Sparkμ—μ„œ λŒ€κ·œλͺ¨ λΆ„μ‚° 데이터 처리λ₯Ό μ²˜λ¦¬ν•˜λŠ” μœ μ—°ν•˜κ³  탄λ ₯적이며 효율적인 방법을 μ œκ³΅ν•©λ‹ˆλ‹€.

병렬 μ²˜λ¦¬μ™€ λΆ„μ‚° 처리

이 κ°•μ˜μ—μ„œλŠ” 병렬 μ²˜λ¦¬μ™€ λΆ„μ‚° 처리의 차이점에 λŒ€ν•΄ λ°°μ› μŠ΅λ‹ˆλ‹€:

병렬 처리

  • μ—¬λŸ¬ λ…Έλ“œμ—μ„œ λ™μ‹œμ— μž‘μ—…μ„ μ‹€ν–‰ν•˜λŠ” 것을 ν¬ν•¨ν•©λ‹ˆλ‹€.
  • 데이터λ₯Ό μ—¬λŸ¬ 쑰각으둜 λΆ„ν• ν•˜κ³  각 μž‘μ—…μ„ μ—¬λŸ¬ μŠ€λ ˆλ“œμ— μ μš©ν•©λ‹ˆλ‹€.
  • μ—¬λŸ¬ μŠ€λ ˆλ“œμ—μ„œ λ™μ‹œμ— μž‘λ™ν•˜μ—¬ 더 λΉ λ₯Έ κ²°κ³Όλ₯Ό 얻을 수 μžˆμŠ΅λ‹ˆλ‹€.
  • 각 μŠ€λ ˆλ“œμ˜ κ²°κ³Όλ₯Ό κ²°ν•©ν•΄μ•Ό ν•  μˆ˜λ„ 있음

λΆ„μ‚° 처리

  • "데이터 병렬 λΆ„μ‚°"을 μˆ˜ν–‰ν•©λ‹ˆλ‹€.
  • 데이터λ₯Ό μ—¬λŸ¬ 쑰각으둜 λΆ„ν• ν•˜μ—¬ μ—¬λŸ¬ λ…Έλ“œλ‘œ μ „μ†‘ν•©λ‹ˆλ‹€.
  • μ—¬λŸ¬ λ…Έλ“œμ—μ„œ λ…λ¦½μ μœΌλ‘œ μž‘μ—… 적용
  • 각 λ…Έλ“œμ˜ κ²°κ³Όλ₯Ό ν•©μ‚°ν•΄μ•Ό ν•  μˆ˜λ„ 있음
  • λ…Έλ“œ κ°„ 톡신에 λŒ€ν•œ κ³ λ € ν•„μš”
μŠ€νŒŒν¬λŠ” RDDλ₯Ό μ΄μš©ν•΄ 데이터 병렬 λͺ¨λΈμ„ κ΅¬ν˜„ν•˜κ³  μΆ”μƒν™”ν•¨μœΌλ‘œμ¨ λΆ„μ‚° ν™˜κ²½μ—μ„œλ„ 병렬 μ²˜λ¦¬ν•˜λ“― μ½”λ“œλ₯Ό μž‘μ„±ν•  수 μžˆμŠ΅λ‹ˆλ‹€. ν•˜μ§€λ§Œ 졜적의 μ„±λŠ₯을 λ‚΄κΈ° μœ„ν•΄μ„œλŠ” 톡신 속도에 신경을 써야 ν•©λ‹ˆλ‹€.
λ‹€μŒ κ°•μ˜μ—μ„œλŠ” λΆ„μ‚° ν™˜κ²½μ—μ„œμ˜ λ…Έλ“œ κ°„ 톡신에 λŒ€ν•΄ μžμ„Ένžˆ μ•Œμ•„λ³΄κ² μŠ΅λ‹ˆλ‹€.

λΆ„μ‚° μ²˜λ¦¬μ™€ 지연 μ‹œκ°„ 문제

이번 λΈ”λ‘œκ·Έ ν¬μŠ€νŒ…μ—μ„œλŠ” Spark의 λΆ„μ‚° 처리 및 지연 λ¬Έμ œμ™€ κ΄€λ ¨λœ μ΄μŠˆμ— λŒ€ν•΄ μ•Œμ•„λ³΄κ² μŠ΅λ‹ˆλ‹€:
1. λΆ€λΆ„ μž₯μ•  문제 - Spark의 RDDλŠ” 일뢀 λ…Έλ“œμ— μž₯μ• κ°€ λ°œμƒν•  경우 데이터λ₯Ό λ‹€μ‹œ 계산할 수 μžˆλ„λ‘ ν•¨μœΌλ‘œμ¨ λΆ€λΆ„ μž₯μ•  문제λ₯Ό μ²˜λ¦¬ν•˜λŠ” 데 도움을 μ€λ‹ˆλ‹€. 이 ν”„λ‘œμ„ΈμŠ€λŠ” μžλ™ν™”λ˜μ–΄ 있으며 Sparkμ—μ„œ μ²˜λ¦¬ν•©λ‹ˆλ‹€.
2. ν•„μš”μ•…μœΌλ‘œμ„œμ˜ 속도 - μ—¬λŸ¬ λ…Έλ“œμ— 걸쳐 톡신이 ν•„μš”ν•œ μž‘μ—…μ€ 톡신 μ§€μ—°μœΌλ‘œ 인해 속도가 느렀질 수 μžˆμŠ΅λ‹ˆλ‹€. μ„±λŠ₯을 μ΅œμ ν™”ν•˜λ €λ©΄ μ‹ μ€‘ν•œ 코딩이 ν•„μš”ν•©λ‹ˆλ‹€.
3. 단좕킀 ν•¨μˆ˜μ˜ 영ν–₯ - 단좕킀 ν•¨μˆ˜μ— μ˜ν•΄ 슀파크 μ½”λ“œμ˜ μ„±λŠ₯이 영ν–₯을 받을 수 μžˆμŠ΅λ‹ˆλ‹€. 이 ν•¨μˆ˜λŠ” 톡신이 ν•„μš”ν•˜λ©° μ½”λ“œ μ„±λŠ₯이 느렀질 수 μžˆμŠ΅λ‹ˆλ‹€. κ°μ†Œλœ λ°μ΄ν„°μ—μ„œ reducebykeyλ₯Ό ν˜ΈμΆœν•˜λ©΄ μ„±λŠ₯이 ν–₯상될 수 μžˆμŠ΅λ‹ˆλ‹€.
4. 톡신 속도 - λ„€νŠΈμ›Œν¬ 톡신은 μŠ€νŒŒν¬μ—μ„œ κ°€μž₯ 느린 μ—°μ‚°μœΌλ‘œ λ©”λͺ¨λ¦¬ 연산보닀 지연 μ‹œκ°„μ΄ μ•½ 100만 λ°° λŠλ¦½λ‹ˆλ‹€.
λ”°λΌμ„œ Spark μ½”λ“œλ₯Ό κ°œλ°œν•  λ•Œ λ„€νŠΈμ›Œν¬ 톡신이 μ„±λŠ₯에 λ―ΈμΉ˜λŠ” 영ν–₯을 κ³ λ €ν•˜κ³  그에 따라 μ΅œμ ν™”ν•˜λŠ” 것이 μ€‘μš”ν•©λ‹ˆλ‹€.

Key Value RDD (paired RDD)

이번 κ°•μ˜μ—μ„œλŠ” 킀와 κ°’ 쌍이 μžˆλŠ” RDD인 ν‚€κ°’ RDD에 λŒ€ν•΄ λ°°μ› μŠ΅λ‹ˆλ‹€. ν‚€λ₯Ό 기반으둜 고차원적인 연산을 ν•  수 있으며, 주둜 값을 기반으둜 톡계λ₯Ό κ³„μ‚°ν•˜λŠ” 데 μ‚¬μš©λ©λ‹ˆλ‹€. NoSQL 처럼 μ“Έμˆ˜ μžˆλŠ”λ“―
μ£Όμš” λ‚΄μš©μ€ λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€:
  • ν‚€ κ°’ RDDλŠ” 킀와 κ°’ 쌍이 μžˆλŠ” RDDμž…λ‹ˆλ‹€.
  • ν‚€λ₯Ό 기반으둜 ν•˜λŠ” 고차원 연산에 μ‚¬μš©λ©λ‹ˆλ‹€.
  • ν‚€ κ°’ RDDλŠ” 값을 기반으둜 톡계λ₯Ό κ³„μ‚°ν•˜λŠ” 데 μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
  • Reduction: ν‚€ 값을 기반으둜 데이터λ₯Ό κ·Έλ£Ήν™”ν•˜μ—¬ μ²˜λ¦¬ν•˜λŠ” κΈ°λŠ₯으둜, 데이터λ₯Ό μ€„μ΄λŠ” 데 μ‚¬μš©λ©λ‹ˆλ‹€.
    • notion image
  • Join: ν‚€λ₯Ό 기반으둜 μ—¬λŸ¬ 개의 RDDλ₯Ό κ²°ν•©ν•˜λŠ” μ˜€λ²„λ‘œλ“œμž…λ‹ˆλ‹€.
  • ν‚€ κ°’ RDDλ₯Ό μ‚¬μš©ν•  λ•ŒλŠ” 효율적인 νŒŒν‹°μ…˜ μ‚¬μš©μ„ μœ„ν•΄ map() ν•¨μˆ˜ λŒ€μ‹  mapValues(), flatMapValues() 만 μ‚¬μš©ν•˜λŠ” 것이 μ€‘μš”ν•©λ‹ˆλ‹€. β†’Value만 λ‹€λ£¨λŠ” μ—°μ‚°λ“€μ΄μ§€λ§Œ RDDμ—μ„œ keyλŠ” μœ μ§€λ¨
  • ν‚€ κ°’ RDDλŠ” ν‘œμ€€ RDD와 크게 λ‹€λ₯΄μ§€ μ•ŠμŠ΅λ‹ˆλ‹€.
결둠적으둜 ν‚€κ°’ RDDλŠ” 킀와 값을 기반으둜 데이터λ₯Ό 효율적으둜 μ‘°μž‘ν•˜κ³  톡계 뢄석을 ν•  수 μžˆλŠ” κ°•λ ₯ν•œ λ„κ΅¬μž…λ‹ˆλ‹€. 빅데이터와 Spark둜 μž‘μ—…ν•˜λŠ” λͺ¨λ“  μ‚¬λžŒμ—κ²Œ ν•„μˆ˜μ μΈ κ°œλ…μž…λ‹ˆλ‹€.
Β 

Understanding Transformations and Actions in Spark

Understanding the Concepts of Transformations and Actions

  • Basic concepts of transformations and actions
  • Transformation: an operation that creates a new RDD from an existing RDD
  • Action: an operation that outputs or saves a result
  • Transformations are lazy: they are only recorded, and run when an action is reached
  • Actions are eager: they execute immediately
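
The lazy/eager split described above can be sketched in plain Python, without a Spark cluster. This is only a toy model: `LazyDataset`, `square`, and `calls` are hypothetical names invented for the illustration, not part of the Spark API, and real RDDs are distributed across nodes rather than held in a local list.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # pending transformations, in order

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: triggers the recorded pipeline and returns a list.
        return list(self._run())

    def count(self):
        # Action: triggers the pipeline and counts the results.
        return sum(1 for _ in self._run())


calls = []  # records every time square() really executes


def square(x):
    calls.append(x)
    return x * x


squares = LazyDataset(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
calls_before_action = len(calls)  # nothing has executed at this point
result = squares.collect()        # the action runs the whole pipeline
```

Because the pipeline is only a recipe, calling a second action such as `squares.count()` runs `square` again for every element, which is the cost that caching is meant to avoid.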

Types of Transformations

  • Map transformation: a one-to-one conversion that turns each element into another element
  • Filter transformation: selects only the elements that satisfy a given condition and creates a new RDD
  • Other transformations: a variety of functions such as reduceByKey, groupBy, and flatMap

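Using Actions and Points of Caution

  • Collect action: fetches all the data in an RDD and returns it as a list
  • Count action: counts the number of elements in an RDD
  • Distinct: removes duplicates and keeps only unique values (strictly speaking, distinct() is a transformation that returns a new RDD, so its result still has to be materialized with an action such as collect)
  • Caution when using actions: collect brings all the data back to the driver, so watch out for wasted resources on large RDDs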

Understanding the Execution Order of Transformations and Actions

  • Transformation: when the program reaches a transformation, it remembers the operation and executes it only when an action is reached
  • Action: a function that starts execution of the program and immediately outputs or saves the result

Understanding Transformations and Actions Through Practice

  • Practice transformations such as map and filter through examples
  • Practice inspecting the data in an RDD and counting its elements with collect and count
  • Use the distinct function to remove duplicates

Performance Optimization of Transformations and Actions

  • Understanding the performance difference between transformations and actions
  • Understanding the resources consumed by wide transformations and shuffling
  • Introduction to tricks and methods for optimization

Union Transformation

  • Concept and use of union
  • Combines two RDDs into a new RDD
  • Keeps every element, including duplicates
  • Practice union through an example and check the result

Intersection Transformation

  • Concept and use of intersection
  • Extracts only the elements common to two RDDs and creates a new RDD
  • Duplicate elements are included only once
  • Practice intersection through an example and check the result

Subtract Transformation

  • Concept and use of subtract
  • Creates a new RDD from the elements of the first RDD that do not appear in the second RDD
  • Practice subtract through an example and check the result
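
The three set-style transformations (union, intersection, subtract) can be compared side by side with a small plain-Python sketch. The helper functions below are hypothetical stand-ins that only mimic the semantics described above on local lists; they are not the Spark implementation, and Spark itself gives no ordering guarantee for the results (the `sorted` call is here only to make the output predictable).

```python
def union(a, b):
    # Like RDD.union: plain concatenation, duplicates are kept.
    return list(a) + list(b)


def intersection(a, b):
    # Like RDD.intersection: common elements, each reported once.
    return sorted(set(a) & set(b))


def subtract(a, b):
    # Like RDD.subtract: elements of a whose value does not occur in b.
    exclude = set(b)
    return [x for x in a if x not in exclude]


a = [1, 2, 2, 3, 4]
b = [3, 4, 4, 5]

u = union(a, b)         # [1, 2, 2, 3, 4, 3, 4, 4, 5]
i = intersection(a, b)  # [3, 4]
s = subtract(a, b)      # [1, 2, 2]
```

Note the asymmetry: union keeps the repeated `2` and `4`, while intersection collapses duplicates to a single occurrence.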

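Sampling Function

  • Concept and parameters of the sampling function
  • Randomly extracts some of the elements of an RDD
  • First parameter: in PySpark's RDD.sample(withReplacement, fraction, seed), this sets whether sampling is done with replacement, and it has no default value
  • Second parameter: the fraction of elements to sample, which is an expected proportion rather than an exact count
  • Third parameter: the random seed (default None)
  • Practice the sampling function through an example and check the result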
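
To make the sampling parameters concrete, here is a rough plain-Python model. The `sample` function below is a hypothetical helper that imitates the parameter order of `RDD.sample(withReplacement, fraction, seed)`; Spark's real samplers differ in detail, so treat this only as a sketch of two ideas: `fraction` is a proportion rather than an exact count, and a fixed seed makes the result reproducible.

```python
import random


def sample(items, with_replacement, fraction, seed=None):
    """Toy model of sampling by fraction; not Spark's actual algorithm."""
    rng = random.Random(seed)
    items = list(items)
    if with_replacement:
        # Simplification: draw a fixed number of items, repeats allowed.
        return rng.choices(items, k=round(len(items) * fraction))
    # Without replacement: keep each element independently with
    # probability `fraction`, so the sample size varies from run to run.
    return [x for x in items if rng.random() < fraction]


data = list(range(100))
first = sample(data, False, 0.1, seed=42)
second = sample(data, False, 0.1, seed=42)   # same seed, same sample
with_repeats = sample(data, True, 0.5, seed=1)
```

In this model, sampling without replacement never returns the same element twice, while sampling with replacement may.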

Wide Transformations and the groupBy Transformation

  • Understanding the concept and characteristics of wide transformations
  • Using the groupBy transformation
  • Groups the data in an RDD by a given criterion and creates a new RDD
  • Practice the groupBy transformation through an example and check the result
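
Grouping by a criterion can be sketched locally as follows. `group_by` is a hypothetical helper that mirrors what groupBy does conceptually (a key function decides which group each element joins); on a real cluster the same step forces elements with the same key onto the same node, which is what makes it a wide transformation.

```python
from collections import defaultdict


def group_by(items, key_fn):
    # Collect every element under the key computed by key_fn.
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)


words = ["spark", "scala", "flink", "sql", "flume"]
by_initial = group_by(words, lambda w: w[0])
# {'s': ['spark', 'scala', 'sql'], 'f': ['flink', 'flume']}
```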

Wide Transformations and Key-Value RDDs

  • Understanding the concept and use of key-value RDDs
  • How to apply wide transformations to a key-value RDD
  • Working with key-value RDDs through basic wide transformation examples

Advanced Use of Wide Transformations

  • Walkthrough of complex calculation examples that use wide transformations
  • How to group and aggregate with a key-value RDD
  • Understanding advanced use of wide transformations through practical examples
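
A common grouping-and-aggregation pattern on key-value data is a per-key average, computed by carrying a (sum, count) pair for each key. The sketch below models it in plain Python; `reduce_by_key` is a hypothetical local helper named after the Spark operation it imitates, and the ratings data is made up for the example.

```python
def reduce_by_key(pairs, fn):
    # Merge all values that share a key using the binary function fn.
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc


ratings = [("spark", 5), ("flink", 4), ("spark", 3), ("flink", 2), ("spark", 4)]

# Step 1: attach a count to every value, as mapValues(lambda v: (v, 1)) would.
with_counts = [(k, (v, 1)) for k, v in ratings]

# Step 2: add sums and counts per key.
totals = reduce_by_key(with_counts, lambda a, b: (a[0] + b[0], a[1] + b[1]))

# Step 3: divide the sum by the count for each key.
averages = {k: s / n for k, (s, n) in totals.items()}
```

Carrying the count alongside the sum matters because averages cannot be merged directly, whereas sums and counts can.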

Example Data Processing Project Using Transformations and Actions

  • Examples that combine transformations such as union, intersection, and subtract with actions to process data
  • How to develop a project while taking the execution order of transformations and actions into account

Cache() and Persist(), Used to Optimize Spark Performance

  • Transformations and actions: the two kinds of operations in Spark
    • Transformation: an operation that creates a new RDD and is executed lazily
    • Action: an operation that outputs or saves a result and is executed immediately
  • Lazy evaluation and memory optimization
    • Lazy evaluation helps minimize disk and network operations and maximize the use of memory
    • It can improve performance especially for repeated work
  • Cache and persist
    • Cache and Persist keep data in memory so that it can be reused
    • Cache: stores data in memory only
    • Persist: stores data at a storage level you choose, for example in memory and on disk
  • Example: machine learning model training and linear regression
    • When machine learning needs to process the same data repeatedly, cache and persist can improve performance
    • In the linear regression example, the repeated gradient computation is optimized by caching
  • Storage Level
    • Data is stored by choosing a storage level such as MEMORY_ONLY, MEMORY_AND_DISK, the serialized (SER) variants, or DISK_ONLY
    • Cache uses the default storage level, while persist lets the user specify the level they want
In this part we learned how to use cache and persist to speed up data processing jobs in Spark, and how to improve performance by adjusting the storage level. Next time, we will look more closely at the internal structure and design of Spark in a distributed environment.

Apache Spark for Big Data Processing

Apache Spark is an open-source, high-speed distributed processing system that companies such as Amazon and Airbnb use to tackle the shared problem of big data. The big data problem is usually described along three dimensions: volume, velocity, and variety. Hadoop addressed it with a single ecosystem made up of three major parts: HDFS, MapReduce, and YARN. Apache Spark is a project that replaces Hadoop's computation engine (MapReduce), and it is fast because of its in-memory architecture.
Spark processes data by holding it in memory across multiple nodes and working on it simultaneously. A Spark cluster is made up of a driver program that holds the SparkContext, a cluster manager, and worker nodes. Spark also uses lazy evaluation, deferring computation until a result is actually needed, which helps speed up processing. Spark's core data model is the RDD (Resilient Distributed Dataset), an immutable, fault-tolerant distributed collection that is split into multiple partitions.
On a single machine Spark is slower than Pandas, but because it scales horizontally it processes big data faster and more efficiently. It is commonly cited as up to 100 times faster in memory and 10 times faster on disk than Hadoop MapReduce.
In summary, Apache Spark is an essential tool for big data processing, providing a fast and efficient solution to the challenge of processing large amounts of data.

A Brief History of Apache Spark

In this blog post, we take a brief look at the history of Apache Spark, focusing on its development over versions 1, 2, and 3. Here are the key points:

Apache Spark version 1

  • Officially announced in 2014
  • In-memory processing using RDDs
  • DataFrame API added in version 1.3
  • Memory and CPU efficiency optimized due to engine upgrade via Project Tungsten

Apache Spark version 2

  • Announced in 2016
  • Further simplification and improved performance
  • Stream processing became possible through Structured Streaming, built on top of Spark SQL
  • Dataset API, a typed form of the DataFrame
  • Support for multiple languages, including Python, Java, Scala, etc.

Apache Spark version 3

  • Released in 2020
  • New MLlib and Spark SQL features
  • Significant performance improvement: about twice as fast as Spark 2.4
  • Notable growth in PySpark usage
  • Stronger support for GPU-based workloads, plus distributed graph computation through GraphX
  • Python 2 support dropped; Kubernetes support strengthened

Spark Components

  • Spark consists of five components: Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX
  • The fundamental parts, such as Spark Core, remain unchanged even as new features are added
In conclusion, while Apache Spark continues to add features and improve performance, the fundamental parts remain unchanged. It is best to start with the fundamentals, such as Spark Core, and work your way up.

Introduction to RDD in Apache Spark

In this blog post, we learn about the RDD (Resilient Distributed Dataset) in Apache Spark. Here are the key points:
  • RDD stands for "Resilient Distributed Dataset"; the Korean rendering translates literally as "elastic distributed data".
  • An RDD can be created by loading data from a file with the textFile function of SparkContext.
  • An RDD has five characteristics:
    • Data abstraction: the data is scattered across multiple nodes but can be used as a single object.
    • Resilient and immutable: because the data is immutable, it can be restored when needed, which lets Spark handle failures in the network, hardware, and memory.
    • Type safe: types are known at compile time, so problems can be detected early.
    • Handles both structured and unstructured data: it can process both text data and tabular datasets.
    • Lazy evaluation: no computation is performed until the result is needed.
  • RDD operations fall into two groups: transformations and actions.
  • Advantages of using RDDs include flexibility, concise code, and the way they encourage more careful thought about how data is manipulated.
Overall, RDDs provide a flexible, resilient, and efficient way to handle large-scale distributed data processing in Apache Spark.
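
One way to picture why immutability makes an RDD resilient is to keep, for each partition, the unchanging input plus the list of functions applied to it; a lost result can then be rebuilt by replaying that list. The sketch below is a deliberately simplified local model with hypothetical names (`Partition`, `lineage`); Spark's real recovery mechanism is more involved.

```python
class Partition:
    """Toy partition: immutable input plus the steps used to derive its data."""

    def __init__(self, source_slice, lineage):
        self.source_slice = source_slice  # immutable input for this partition
        self.lineage = lineage            # functions to re-apply, in order
        self.data = None                  # computed result, may be lost

    def compute(self):
        rows = list(self.source_slice)
        for fn in self.lineage:
            rows = [fn(r) for r in rows]
        self.data = rows
        return rows


source = tuple(range(8))
lineage = [lambda x: x + 1, lambda x: x * 10]

# Split the source into two partitions of four elements each.
partitions = [Partition(source[i:i + 4], lineage) for i in (0, 4)]
for p in partitions:
    p.compute()

partitions[1].data = None             # simulate losing one worker's result
recovered = partitions[1].compute()   # rebuild only that partition
```

Only the lost partition is recomputed; the other partition's result is left untouched.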

Parallel Processing and Distributed Processing

In this lecture, we learned about the differences between parallel processing and distributed processing:

Parallel Processing

  • Runs a task on multiple threads at the same time, within a single node
  • Splits the data into multiple pieces and applies the task to each piece in its own thread
  • Produces results faster because the threads work simultaneously
  • May need to combine the results of each thread

Distributed Processing

  • Performs distributed data-parallel processing
  • Splits the data into multiple pieces and sends them to multiple nodes
  • Applies the task independently on each node
  • May need to combine the results of each node
  • Requires consideration of communication between nodes
By implementing and abstracting the data-parallel model with RDDs, Spark lets us write code for a distributed environment as if it were ordinary parallel processing. However, to get the best performance it is important to pay attention to communication speed.
In the next lecture, we will learn more about communication between nodes in a distributed environment.
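
The split, apply, and combine steps of the data-parallel model can be sketched on a single machine with Python's standard thread pool. The function names and the sum-of-squares task are assumptions chosen for the example; in a distributed setting the chunks would be shipped to different nodes instead of threads, which is where communication cost enters.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    # Cut the data into roughly equal consecutive chunks.
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(chunk):
    # The task applied independently to each chunk.
    return sum(x * x for x in chunk)


data = list(range(1, 11))
chunks = split(data, 3)

with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map returns results in the same order as the input chunks.
    partials = list(pool.map(partial_sum_of_squares, chunks))

total = sum(partials)  # combine the per-chunk results
```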

Distributed Processing and Latency Issues

In this blog post, we will learn about the challenges associated with distributed processing and latency in Spark:
  1. The partial failure problem - Spark's RDDs help handle partial failures by allowing data to be recomputed if some nodes fail. This process is automated and taken care of by Spark.
  2. Speed as a necessary evil - Tasks that require communication across multiple nodes are slowed down by communication latency. Careful coding is needed to optimize performance.
  3. Impact of the reduceByKey function - The performance of Spark code can be affected by reduceByKey, because it requires communication between nodes (a shuffle). Shrinking the data first, for example by filtering before calling reduceByKey, can improve performance.
  4. Communication speed - Network communication is the slowest operation in Spark, with latency on the order of a million times that of memory operations.
Therefore, when developing Spark code, it is important to consider the impact of network communication on performance and optimize accordingly.
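
Why the amount of data sent over the network matters can be shown by counting records. The sketch below compares two hypothetical strategies on made-up partitions: sending every (key, value) record across the network before summing, versus summing inside each partition first and sending only one record per key per partition. The function names are invented for the illustration and nothing here measures real network time.

```python
from collections import defaultdict


def shuffle_then_sum(partitions):
    # Every (key, value) record crosses the network before being summed.
    shuffled = [pair for part in partitions for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return len(shuffled), dict(totals)


def combine_then_shuffle(partitions):
    # Sum inside each partition first, then send one record per key.
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return len(shuffled), dict(totals)


partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]

naive_records, naive_totals = shuffle_then_sum(partitions)
combined_records, combined_totals = combine_then_shuffle(partitions)
```

Both strategies give the same totals, but the second moves 5 records instead of 8; on real data with many repeated keys the gap is far larger.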

Key Value RDD (Pair RDD)

In this lecture, we learned about the key-value RDD, an RDD whose elements are key and value pairs. It allows high-level operations based on keys and is mainly used to calculate statistics over the values. It also looks usable in much the same way as a NoSQL store.
Here are the key takeaways:
  • A key-value RDD is an RDD whose elements are key and value pairs.
  • It is used for high-level operations based on keys.
  • A key-value RDD can be used to calculate statistics based on values.
  • Reduction: groups and processes data by key, and is used to reduce the data.
  • Join: an operation that combines multiple RDDs based on a key.
  • When working with a key-value RDD, it is important to use mapValues() and flatMapValues() instead of map() when only the values change, so that partitions are used efficiently. These operations touch only the values, and the keys in the RDD are kept as they are.
  • A key-value RDD is not very different from a standard RDD.
In conclusion, the key-value RDD is a powerful tool that allows efficient data manipulation and statistical analysis based on keys and values. It is an essential concept for anyone working with big data and Spark.
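
The value-only operations and the key-based join mentioned above can be modeled on local lists of pairs. The three helpers below are hypothetical stand-ins named after the Spark operations they imitate, and the fruit data is made up. The point of the first two is that the function never sees the key, so the key cannot change.

```python
def map_values(pairs, fn):
    # Like mapValues: only the value changes, the key is untouched.
    return [(k, fn(v)) for k, v in pairs]


def flat_map_values(pairs, fn):
    # Like flatMapValues: each value expands to several, all under the same key.
    return [(k, out) for k, v in pairs for out in fn(v)]


def join(left, right):
    # Like an inner join on key: one output per matching (left, right) pair.
    return [(k, (lv, rv)) for k, lv in left for rk, rv in right if k == rk]


prices = [("apple", 2), ("pear", 3)]
stock = [("apple", 10), ("kiwi", 7)]

doubled = map_values(prices, lambda v: v * 2)
tags = flat_map_values([("apple", "red,green")], lambda v: v.split(","))
joined = join(prices, stock)
```

Because keys are guaranteed to stay the same, Spark can keep the existing partitioning after mapValues, whereas a general map might change keys and lose it.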