Apache Spark, Parallel Distributed Processing

Description
Summary of Spark & Flink for Real-time Big Data Processing - part2
Last Updated
Last updated July 29, 2023
Tags
spark
DE
Data Engineering
💡 This article summarizes the Fast Campus course "Spark & Flink Online for Real-Time Big Data Processing." It covers the overall concepts; for the programming-related content, taking the lecture itself is recommended.

Understanding Transformations and Actions in Spark

Understanding the Concepts of Transformations and Actions

  • The basic concepts of transformations and actions
  • Transformation: an operation that creates a new RDD from an existing RDD
  • Action: an operation that outputs or saves a result
  • Transformations are executed lazily; they run only when an action is encountered
  • Actions are executed immediately

Types of Transformations

  • Map transformation: a one-to-one conversion that turns each element into another element
  • Filter transformation: creates a new RDD containing only the elements that satisfy a given condition
  • Other transformations: a variety of functions such as reduceByKey, groupBy, and flatMap
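
To make the difference between these transformations concrete, here is a minimal plain-Python sketch of their element-level semantics. It is not Spark code: the helper names rdd_map, rdd_filter, and rdd_flat_map are made up for illustration, and a real RDD would apply the same functions partition by partition across the cluster.

```python
from itertools import chain

def rdd_map(func, data):
    """One-to-one: exactly one output element per input element."""
    return [func(x) for x in data]

def rdd_filter(pred, data):
    """Keeps only the elements for which pred(x) is true."""
    return [x for x in data if pred(x)]

def rdd_flat_map(func, data):
    """One-to-many: func returns an iterable per element; results are flattened."""
    return list(chain.from_iterable(func(x) for x in data))

lines = ["spark is fast", "flink"]
lengths = rdd_map(len, lines)                        # one length per line
long_lines = rdd_filter(lambda s: len(s) > 5, lines) # only lines longer than 5 chars
words = rdd_flat_map(str.split, lines)               # every word of every line
```

Note that map keeps the number of elements unchanged, filter can only shrink it, and flatMap can grow it.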

Using Actions, and What to Watch Out For

  • Collect action: fetches all of the RDD's data and returns it as a list
  • Count action: counts the number of elements in the RDD
  • Distinct: removes duplicates and keeps only the unique values (strictly speaking, distinct() returns a new RDD, so it is a transformation rather than an action)
  • Caution when using actions: an action such as collect brings all of the data back to the driver, so watch out for wasted resources
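
The caution about fetching everything can be illustrated with a small plain-Python sketch. It only models the idea: big_dataset, collect, and take are hypothetical stand-ins written for this note, with a generator playing the role of a large distributed dataset.

```python
from itertools import islice

def big_dataset(n):
    """Stand-in for a large dataset: yields records one at a time."""
    for i in range(n):
        yield {"id": i}

def collect(records):
    """Models a collect-style action: materializes every record in memory."""
    return list(records)

def take(records, n):
    """Models a take-style action: stops after the first n records."""
    return list(islice(records, n))

preview = take(big_dataset(10_000_000), 3)   # touches only 3 records
everything = collect(big_dataset(1_000))     # holds all 1,000 records at once
```

When the goal is only to inspect the data, pulling a few records is usually enough and avoids loading the whole dataset into one process.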

Understanding the Execution Order of Transformations and Actions

  • Transformation: when the program encounters a transformation, it only remembers the operation and runs it once an action is encountered
  • Action: a function that starts the actual execution and immediately outputs or saves the result
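
The execution order described above can be sketched with a toy class. This is an assumption-laden model, not how Spark is implemented: LazyRDD is a hypothetical name, it runs on a single machine, and it supports only map, filter, collect, and count.

```python
class LazyRDD:
    """Toy stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    def map(self, func):
        return LazyRDD(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyRDD(self._source, self._steps + (("filter", pred),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):   # a filter step rejected the item
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        """Action: only now is the recorded pipeline executed."""
        return list(self._run())

    def count(self):
        """Action: runs the pipeline and counts the surviving elements."""
        return sum(1 for _ in self._run())

calls = []

def square(x):
    calls.append(x)          # records when the function really runs
    return x * x

pipeline = LazyRDD(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)     # nothing has executed yet
result = pipeline.collect()          # the action triggers execution
```

Building the pipeline calls square zero times; the work happens only inside collect().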

Understanding Transformations and Actions Through Practice

  • Practicing transformations such as map and filter through examples
  • Using collect and count to inspect an RDD's data and count its elements
  • Using the distinct function to remove duplicates

Optimizing the Performance of Transformations and Actions

  • Understanding the performance difference between transformations and actions
  • Understanding the resources consumed by wide transformations and shuffling
  • Tricks and methods for optimization

Union Transformation

  • The concept and use of union
  • Combines two RDDs to create a new RDD
  • Merges all elements, including duplicates
  • Practicing union through examples and checking the results

Intersection Transformation

  • The concept and use of intersection
  • Creates a new RDD containing only the elements common to two RDDs
  • Duplicate elements are included only once
  • Practicing intersection through examples and checking the results

Subtract Transformation

  • The concept and use of subtract
  • Creates a new RDD from the elements of the first RDD, excluding those that appear in the second RDD
  • Practicing subtract through examples and checking the results
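
The behaviour of union, intersection, and subtract described in these sections can be modelled on plain Python lists. This is only a sketch of the semantics under the rules stated above; the three functions are illustrative helpers, and the sorting in intersection is an assumption added here to make the output deterministic (Spark itself does not promise an order).

```python
def union(a, b):
    """Concatenates both inputs; duplicates are kept."""
    return list(a) + list(b)

def intersection(a, b):
    """Elements present in both inputs, each reported once."""
    return sorted(set(a) & set(b))

def subtract(a, b):
    """Elements of a that do not appear in b."""
    exclude = set(b)
    return [x for x in a if x not in exclude]

a = [1, 2, 2, 3]
b = [2, 3, 4]

merged = union(a, b)          # duplicates survive
common = intersection(a, b)   # duplicates collapse to one
only_a = subtract(a, b)       # what is left of a after removing b's elements
```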

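Sampling Function

  • Understanding the concept and parameters of the sampling function
  • Randomly extracts a subset of the elements of an RDD
  • First parameter (withReplacement): whether to sample with replacement; in PySpark's sample() it has no default and must be passed explicitly
  • Second parameter (fraction): the expected fraction of elements to sample, not an exact element count (takeSample() is the variant that takes an exact number)
  • Third parameter (seed): the random seed (default: None)
  • Practicing the sampling function through examples and checking the results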
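
The difference between sampling by fraction and sampling an exact number of elements is easy to miss, so here is a small plain-Python sketch of both. It is a simplified model, not Spark's sampler: sample_fraction and take_sample are hypothetical helpers, and only sampling without replacement is modelled.

```python
import random

def sample_fraction(data, fraction, seed=None):
    """Keeps each element independently with probability `fraction`.

    The result size varies around fraction * len(data).
    """
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]

def take_sample(data, num, seed=None):
    """Returns exactly `num` elements, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(data), num)

data = list(range(100))
approx = sample_fraction(data, 0.1, seed=42)   # about 10 elements, not exactly 10
again = sample_fraction(data, 0.1, seed=42)    # same seed, same sample
exact = take_sample(data, 10, seed=42)         # always 10 elements
```

Fixing the seed makes a sample reproducible, which helps when comparing runs.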

Wide Transformations and the groupBy Transformation

  • Understanding the concept and characteristics of wide transformations
  • Using the groupBy transformation
  • Groups the data of an RDD by a given criterion to create a new RDD
  • Practicing the groupBy transformation through examples and checking the results
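
A plain-Python sketch of what grouping by a criterion produces is shown below. The helper group_by is illustrative only; in Spark, all elements with the same key must end up in the same partition, which is what makes grouping a wide transformation that needs a shuffle.

```python
from collections import defaultdict

def group_by(key_func, data):
    """Groups elements under the key computed by key_func."""
    groups = defaultdict(list)
    for item in data:
        groups[key_func(item)].append(item)
    return dict(groups)

words = ["apple", "avocado", "banana", "blueberry", "cherry"]
by_first_letter = group_by(lambda w: w[0], words)
```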

Wide Transformations and Key-Value RDDs

  • Understanding the concept and use of key-value RDDs
  • How to apply wide transformations to key-value RDDs
  • Working with key-value RDDs through basic wide transformation examples

Advanced Use of Wide Transformations

  • Examples of complex computations built on wide transformations
  • How to group and aggregate using key-value RDDs
  • Understanding the advanced use of wide transformations through practical examples
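
As one example of grouping and aggregating by key, the sketch below computes a per-key average by carrying a (sum, count) pair for each key. It is plain Python with a hypothetical helper name, average_by_key; in PySpark the same pattern is usually written with mapValues and reduceByKey, where the (sum, count) pairs are what gets merged across partitions.

```python
def average_by_key(pairs):
    """Per-key mean via (sum, count) accumulators, merged one record at a time."""
    acc = {}
    for key, value in pairs:
        total, count = acc.get(key, (0, 0))
        acc[key] = (total + value, count + 1)
    return {key: total / count for key, (total, count) in acc.items()}

ratings = [("spark", 4), ("flink", 5), ("spark", 2), ("flink", 3), ("spark", 3)]
means = average_by_key(ratings)
```

Carrying (sum, count) instead of a running average matters because partial sums and counts can be merged in any order, while partial averages cannot.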

Example Data Processing Project Using Transformations and Actions

  • An example that processes data by combining transformations such as union, intersection, and subtract with actions
  • How to develop a project with the execution order of transformations and actions in mind

Cache() and Persist() for Optimizing Spark Performance

  • Transformations and actions: the two kinds of Spark operations
    • Transformation: an operation that creates a new RDD; executed lazily
    • Action: an operation that outputs or saves a result; executed immediately
  • Lazy computation and memory optimization
    • Lazy computation helps minimize disk and network operations and maximize the use of memory
    • It can improve performance especially for repeated work
  • Cache and persist
    • Cache and Persist keep data in memory so that it can be reused
    • Cache: stores data at the default storage level (memory only for RDDs)
    • Persist: stores data at a storage level you choose, such as memory, disk, or both
  • Example: machine learning model training and linear regression
    • When machine learning requires processing the same data repeatedly, cache and persist can improve performance
    • In the linear regression example, the repeated gradient computation is optimized by caching the data
  • Storage Level
    • Data is stored at a chosen storage level such as MEMORY_ONLY, MEMORY_AND_DISK, the serialized (_SER) variants, or DISK_ONLY
    • Cache uses the default storage level, while persist lets the user specify the level
In this part we learned how to use cache and persist to speed up data processing in Spark, and how to improve performance by adjusting the storage level. Next time, we will look more closely at Spark's internal structure and design in a distributed environment.
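
Why caching helps iterative work such as gradient descent can be shown with a toy model. Everything here is an assumption made for illustration: the Dataset class, load_points, and fit_slope are hypothetical, the data is a handful of points on y = 2x, and "recomputing" simply means calling a Python function again.

```python
class Dataset:
    """Toy dataset whose contents are rebuilt on every use unless cached."""

    def __init__(self, compute):
        self._compute = compute      # function that rebuilds the data
        self._use_cache = False
        self._stored = None
        self.compute_calls = 0       # how often the data was rebuilt

    def cache(self):
        """Marks the data to be kept in memory after its first computation."""
        self._use_cache = True
        return self

    def values(self):
        if self._use_cache and self._stored is not None:
            return self._stored
        self.compute_calls += 1
        data = self._compute()
        if self._use_cache:
            self._stored = data
        return data

def load_points():
    # (x, y) pairs lying exactly on y = 2x
    return [(x, 2.0 * x) for x in range(1, 6)]

def fit_slope(dataset, steps=20, lr=0.01):
    """Gradient descent for y = w * x; every step rereads the dataset."""
    w = 0.0
    for _ in range(steps):
        points = dataset.values()
        grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= lr * grad
    return w

uncached = Dataset(load_points)
cached = Dataset(load_points).cache()
w_uncached = fit_slope(uncached)   # rebuilds the data on every step
w_cached = fit_slope(cached)       # builds the data once, then reuses it
```

Both runs reach the same slope, but the uncached dataset is rebuilt once per iteration while the cached one is built a single time.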

Apache Spark for Big Data Processing

Apache Spark is an open-source, high-speed distributed processing system used by companies such as Amazon and Airbnb to tackle the shared problem of big data. The big data problem is usually described along three dimensions: volume, velocity, and variety. The Hadoop ecosystem, which builds on ideas Google published, addresses it with three major parts: HDFS, MapReduce, and YARN. Apache Spark is a project that replaces Hadoop's computation engine (MapReduce), and it is fast because of its in-memory architecture.
Spark processes data by keeping it in memory across multiple nodes and working on it simultaneously. A Spark cluster consists of a driver program that holds the SparkContext, a cluster manager, and worker nodes. Spark also uses lazy evaluation: it postpones computation until a result is actually needed, which helps speed up the overall process. Spark's core data model is the RDD (Resilient Distributed Dataset), an immutable, fault-tolerant distributed collection that is split into multiple partitions.
On data that fits on one machine, Spark is slower than Pandas, but because it scales horizontally it processes big data faster and more efficiently. Compared with Hadoop MapReduce, it is often cited as up to 100 times faster in memory and 10 times faster on disk.
In summary, Apache Spark is an essential tool for big data processing, providing a fast and efficient solution to the challenge of processing large amounts of data.

A Brief History of Apache Spark

In this blog post, we take a brief look at the history of Apache Spark, focusing on its development over versions 1, 2, and 3. Here are the key points:

Apache Spark version 1

  • Officially released in 2014
  • In-memory processing using RDDs
  • The DataFrame API added in version 1.3
  • Engine upgraded through Project Tungsten to optimize memory and CPU efficiency

Apache Spark version 2

  • Released in 2016
  • Further simplification and improved performance
  • Stream processing made possible by Structured Streaming, built on top of Spark SQL
  • The typed Dataset API unified with DataFrame
  • Support for multiple languages, including Python, Java, and Scala

Apache Spark version 3

  • Released in 2020
  • Added MLlib and Spark SQL features
  • Significant performance improvement: roughly twice as fast as Spark 2.4
  • Notable growth in PySpark usage
  • Strengthened support for GPU-based workloads, and for distributed graph computation via GraphX
  • Python 2 support phased out; Kubernetes support strengthened

Spark Components

  • Spark consists of five components: Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX
  • The fundamental parts, such as Spark Core, stay the same even as new features are added
In conclusion, while Apache Spark continues to add features and improve performance, its fundamental parts remain unchanged. It is best to start with the fundamentals, such as Spark Core, and work your way up.

Introduction to RDD in Apache Spark

In this blog post, we learn about RDDs (Resilient Distributed Datasets) in Apache Spark. Here are the key points:
  • RDD stands for "Resilient Distributed Dataset", that is, a fault-tolerant dataset distributed across a cluster.
  • An RDD can be created by loading data from a file with the textFile function of SparkContext.
  • An RDD has five characteristics:
    • Data abstraction: the data is scattered across multiple nodes but can be used as a single object.
    • Resilient and immutable: because the data is immutable, a lost partition can be rebuilt by recomputing it when needed, which lets Spark cope with network, hardware, and memory failures.
    • Type safe: element types are known at compile time, so problems can be detected early.
    • Handles both structured and unstructured data: it works with free text as well as tabular datasets.
    • Lazy evaluation: no computation is performed until the result is needed.
  • RDD operations fall into two groups: transformations and actions.
  • Advantages of using RDDs include flexibility, concise code, and the way they make you think more carefully about how data is manipulated.
Overall, RDDs provide a flexible, resilient, and efficient way to handle large-scale distributed data processing in Apache Spark.
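
The "data abstraction" and "immutable" characteristics can be pictured with a toy class. This is only a single-machine sketch: PartitionedList is a hypothetical name, its partitions are ordinary Python lists rather than data on separate nodes, and immutability is by convention only.

```python
class PartitionedList:
    """Toy model: data lives in several partitions but is used as one object."""

    def __init__(self, partitions):
        self.partitions = partitions   # never modified in place

    @classmethod
    def from_list(cls, data, num_partitions):
        data = list(data)
        size = max(1, -(-len(data) // num_partitions))   # ceiling division
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, func):
        # Returns a new object; the original partitions are left untouched.
        return PartitionedList([[func(x) for x in part] for part in self.partitions])

    def collect(self):
        # Gathers every partition back into one ordinary list.
        return [x for part in self.partitions for x in part]

numbers = PartitionedList.from_list(range(10), num_partitions=3)
doubled = numbers.map(lambda x: x * 2)
```

The caller works with numbers and doubled as single objects and never addresses an individual partition, which is the point of the abstraction.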

Parallel Processing and Distributed Processing

In this lecture, we learned about the differences between parallel processing and distributed processing:

Parallel Processing

  • Runs a task simultaneously on multiple threads
  • Splits the data into multiple pieces and applies the task to each piece in its own thread
  • Working in several threads at once produces results faster
  • May need to combine the results from each thread
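
The split, apply, and combine steps listed above can be sketched with Python's standard thread pool. The function names split and parallel_sum_of_squares are illustrative. The sketch shows the structure of data-parallel code rather than a real speedup, since threads in standard CPython share one interpreter lock for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, pieces):
    """Splits data into roughly equal consecutive chunks."""
    size = max(1, -(-len(data) // pieces))   # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_sum_of_squares(data, workers=4):
    chunks = split(data, workers)                        # 1. split the data
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(                        # 2. same task per chunk
            lambda chunk: sum(x * x for x in chunk), chunks))
    return sum(partials)                                 # 3. combine the results

total = parallel_sum_of_squares(list(range(1, 101)))
```

In the distributed version of the same model, the chunks would be sent to different nodes instead of different threads, and step 3 would involve network communication.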

Distributed Processing

  • Applies the same data-parallel model across machines ("distributed data-parallel")
  • Splits the data into multiple pieces and sends them to multiple nodes
  • Applies the task independently on each node
  • May need to combine the results from each node
  • Requires attention to communication between nodes
Spark implements and abstracts the data-parallel model with RDDs, so code for a distributed environment can be written as if it were ordinary parallel processing. However, communication speed still needs attention to achieve optimal performance.
In the next lecture, we will learn more about communication between nodes in a distributed environment.

Distributed Processing and Latency Issues

In this blog post, we look at the challenges of distributed processing and latency in Spark:
  1. The partial failure problem: Spark's RDDs handle partial failures by allowing data to be recomputed when some nodes fail. Spark takes care of this automatically.
  2. Communication latency as a necessary evil: operations that need communication across nodes are slowed down by network latency, so careful coding is needed to optimize performance.
  3. Impact of the reduceByKey function: reduceByKey requires communication between nodes and can slow the code down. Calling reduceByKey on data that has already been reduced in size, for example by filtering first, can improve performance.
  4. Communication speed: network communication is the slowest operation in Spark, with latency on the order of a million times that of memory operations.
Therefore, when developing Spark code, it is important to consider the impact of network communication on performance and optimize accordingly.
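
The advice about calling reduceByKey on already-reduced data can be illustrated by counting how many records take part in the reduction. This is a deliberately simplified plain-Python model: reduce_by_key is a hypothetical helper, and it assumes every input record counts as one record sent over the network, which ignores optimizations a real shuffle performs.

```python
def reduce_by_key(pairs, func):
    """Merges values per key; also returns how many records were processed."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(pairs)

events = [("kr", 1), ("us", 1), ("kr", 1), ("jp", 1), ("us", 1), ("kr", 1)]

# Reduce first, filter afterwards: all six records take part in the reduction.
all_counts, moved_late = reduce_by_key(events, lambda a, b: a + b)
late = {k: v for k, v in all_counts.items() if k == "kr"}

# Filter first, reduce afterwards: only the three matching records are moved.
kr_only = [(k, v) for k, v in events if k == "kr"]
early, moved_early = reduce_by_key(kr_only, lambda a, b: a + b)
```

Both orders give the same answer, but filtering first halves the number of records that reach the expensive step in this example.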

Key-Value RDD (Pair RDD)

In this lecture, we learned about the key-value RDD, an RDD whose elements are (key, value) pairs. It allows higher-level operations based on keys and is mainly used to compute statistics over the values. It can seemingly be used somewhat like a NoSQL key-value store.
Here are the key takeaways:
  • A key-value RDD is an RDD whose elements are key and value pairs.
  • It is used for higher-level operations based on keys.
  • It can be used to compute statistics over the values.
  • Reduction: groups the data by key and processes each group, reducing the amount of data.
  • Join: an operation that combines multiple RDDs based on their keys.
  • When only the values change, it is important to use mapValues() and flatMapValues() instead of map() so that partitions are used efficiently: these operations touch only the values, and the keys of the RDD are preserved.
  • Apart from this, a key-value RDD is not very different from a standard RDD.
In conclusion, the key-value RDD is a powerful tool for efficient data manipulation and statistical analysis based on keys and values. It is an essential concept for anyone working with big data and Spark.
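
The value-only operations and the key-based join mentioned above can be modelled on lists of (key, value) tuples. The functions map_values, flat_map_values, and join are plain-Python sketches named after the operations they imitate; they show the shape of the results only and say nothing about partitioning.

```python
def map_values(pairs, func):
    """Applies func to the value only; the key is carried over unchanged."""
    return [(key, func(value)) for key, value in pairs]

def flat_map_values(pairs, func):
    """func returns several values per input value; each keeps the original key."""
    return [(key, out) for key, value in pairs for out in func(value)]

def join(left, right):
    """Inner join: one output pair for every matching key combination."""
    return [(k, (lv, rv)) for k, lv in left for rk, rv in right if k == rk]

prices = [("apple", 3), ("pear", 5)]
doubled = map_values(prices, lambda v: v * 2)
tags = flat_map_values([("apple", "red,green")], lambda v: v.split(","))
stock = [("apple", 10), ("kiwi", 7)]
joined = join(prices, stock)   # only "apple" appears on both sides
```

Because map_values never sees the key, the key cannot change, which is why the value-only operations let the existing arrangement of keys be kept.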