This article is a summary of "Spark & Flink Online for Real-Time Big Data Processing" from Fast Campus. It covers the overall definitions; for the programming-related content, taking the lecture itself is recommended.
Apache Spark for Big Data Processing
Apache Spark is an open-source, high-speed distributed processing system used by companies such as Amazon and Airbnb to tackle the common problem of big data. The big data problem is usually broken into three categories: volume, velocity, and variety. The Hadoop ecosystem addressed it with three major components: HDFS, MapReduce, and YARN. Apache Spark began as a project to replace Hadoop's computation engine, and it is fast thanks to its in-memory architecture.
Spark processes data by storing it in memory across multiple nodes and operating on it in parallel. A Spark cluster consists of a driver program, which holds the SparkContext, and worker nodes, which run the executors. Spark also evaluates computations lazily, deferring work until a result is needed, which speeds up processing. Spark's core data model is the RDD (Resilient Distributed Dataset), an immutable, fault-tolerant distributed collection that is split into multiple partitions.
Spark is slower than pandas on a single machine, but because it scales horizontally it is fast and efficient at processing big data. It is commonly cited as up to 100 times faster in memory and 10 times faster on disk than Hadoop MapReduce.
In summary, Apache Spark is an essential tool for big data processing, providing a fast and efficient solution to the challenge of handling large volumes of data.
A Brief History of Apache Spark
In this blog post, we take a brief look at the history of Apache Spark, focusing on its development across versions 1, 2, and 3. The key points:
Apache Spark version 1
- Officially released in 2014
- In-memory processing built on RDDs
- The DataFrame API added in version 1.3
- Memory and CPU efficiency optimized through an engine upgrade, Project Tungsten
Apache Spark version 2
- Released in 2016
- Further simplified APIs and improved performance
- Stream processing via Structured Streaming, built on top of Spark SQL
- Data caching through the Dataset API, a typed layer over DataFrames
- Support for multiple languages, including Python, Java, and Scala
Apache Spark version 3
- Released in 2020
- New MLlib and Spark SQL features
- Significant performance improvement: roughly 2x faster than Spark 2.4
- Notable growth in PySpark usage
- Strengthened support for GPU-based workloads and for distributed graph computation (GraphX)
- Python 2 support dropped; Kubernetes support strengthened
Spark's components
- Spark is divided into five main components: Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX
- The fundamental parts, such as Spark Core, remain unchanged even as new features are added
In conclusion, Apache Spark keeps adding features and improving performance, but its foundations do not change. It is best to start learning from the fundamentals, such as Spark Core, and work your way up.
Introduction to RDD in Apache Spark
In this blog post, we learn about RDDs (Resilient Distributed Datasets) in Apache Spark. The key points:
- RDD stands for "Resilient Distributed Dataset" - in plain terms, an elastic, distributed collection of data.
- An RDD can be created by loading data from a file with SparkContext's textFile function.
- An RDD has five characteristics:
  - Data abstraction: the data is scattered across multiple nodes but can be used as a single object.
  - Resilient and immutable: the data is kept immutable and can be restored when needed, so failures in the network, hardware, and memory can be handled.
  - Type safe: types are determined at compile time, so problems can be detected early.
  - Handles both structured and unstructured data: it can process text data as well as tabular datasets.
  - Lazy evaluation: no computation is performed until a result is needed.
- RDD operations are divided into two kinds: transformations and actions.
- Advantages of RDDs include flexibility, concise code, and the ability to be more deliberate about data manipulation.
Overall, RDDs provide a flexible, resilient, and efficient way to handle large-scale distributed data processing in Apache Spark.
Parallel Processing and Distributed Processing
In this lecture, we learned about the differences between parallel processing and distributed processing:
Parallel processing
- Runs a task in multiple threads simultaneously
- Splits the data into several pieces and applies the task to each piece in a separate thread
- Operating in multiple threads at once yields faster results
- The results of each thread may need to be combined
Distributed processing
- Performs "data-parallel" processing across machines
- Splits the data into several pieces and sends them to multiple nodes
- Each node applies the task independently
- The results of each node may need to be combined
- Communication between nodes must be taken into account
By implementing and abstracting the data-parallel model with RDDs, Spark lets us write code for a distributed environment as if it were simple parallel processing. To get the best performance, however, we must pay attention to communication costs.
The next lecture looks at communication between nodes in a distributed environment in more detail.
Distributed Processing and Latency Issues
In this blog post, we look at the challenges associated with distributed processing and latency in Spark:
1. The partial failure problem - Spark's RDDs help handle partial failures: if some nodes fail, the affected data can be recomputed. This process is automated and handled by Spark itself.
2. Speed as a necessary trade-off - Tasks that require communication across multiple nodes can be slowed down by communication latency. Careful coding is needed to optimize performance.
3. Impact of the reduceByKey function - The performance of Spark code can be affected by reduceByKey. The function requires communication (a shuffle), which can slow the code down; calling reduceByKey on already-reduced data can improve performance.
4. Communication speed - Network communication is the slowest operation in Spark, with latency roughly a million times higher than memory operations.
Therefore, when developing Spark code it is important to consider the impact of network communication on performance and optimize accordingly.
Key-Value RDD (Paired RDD)
In this lecture, we learned about key-value RDDs: RDDs whose elements are key-value pairs. They allow higher-level operations based on keys and are mainly used to compute statistics over the values. They are reportedly used in a NoSQL-like fashion.
The main points:
- A key-value RDD is an RDD whose elements are key-value pairs.
- It is used for higher-level, key-based operations.
- A key-value RDD can be used to compute statistics based on the values.
- Reduction: groups and processes data by key, and is used to shrink the data.
- Join: an operation that combines multiple RDDs based on their keys.
- When working with key-value RDDs, it is important to use mapValues() and flatMapValues() instead of map() for efficient use of partitions: these operations touch only the values, so the RDD's keys (and partitioning) are preserved.
- A key-value RDD is otherwise not very different from a standard RDD.
In conclusion, key-value RDDs are a powerful tool for efficiently manipulating data and computing statistics by key and value - an essential concept for anyone working with big data and Spark.
Understanding Spark's Transformations and Actions
Understanding the concepts of transformations and actions
- Basic concepts of transformations and actions explained
- Transformation: an operation that creates a new RDD from an RDD
- Action: an operation that outputs or saves a result
- Transformations are lazy: they execute only when an action is encountered
- Actions execute immediately
Types of transformations
- Map transformation: a one-to-one conversion, turning each element into another element
- Filter transformation: selects only the elements matching a given condition, producing a new RDD
- Other transformations: various functions such as reduceByKey, groupBy, and flatMap
Using actions, and points of caution
- Collect action: fetches all the data in the RDD and returns it as a list
- Count action: counts the number of elements in the RDD
- Distinct: removes duplicates, extracting only the unique values (strictly a lazy transformation, though covered here alongside actions)
- Caution when using actions: because collect brings back all the data, watch out for resource consumption
Understanding the execution order of transformations and actions
- Transformations: when the program encounters a transformation it records the operation, and executes it when it meets an action
- Actions: functions that start the program's execution, immediately outputting or saving the result
Understanding transformations and actions through practice
- Hands-on practice with transformations such as map and filter through examples
- Practice checking an RDD's data with collect and counting values with count
- Using the distinct function to remove duplicates
Performance optimization of transformations and actions
- Understanding the performance differences between transformations and actions
- Understanding the resource consumption of wide transformations and shuffling
- Introduction to tricks and methods for optimization
The Union transformation
- Concept and use of union
- Combines two RDDs into a new RDD
- Includes all elements, duplicates included
- Practice with union through examples and checking the results
The Intersection transformation
- Concept and use of intersection
- Extracts only the elements common to two RDDs into a new RDD
- Duplicated elements are included only once
- Practice with intersection through examples and checking the results
The Subtract transformation
- Concept and use of subtract
- Creates a new RDD with the elements of the first RDD, excluding those present in the second
- Practice with subtract through examples and checking the results
The Sampling function
- Understanding the sample function's concept and parameters
- Randomly extracts a subset of the RDD's elements
- First parameter: withReplacement, whether to sample with replacement
- Second parameter: fraction, the expected fraction of elements to sample
- Third parameter: seed, the random seed (default None)
- Practice with the sampling function through examples and checking the results
Wide transformations and the groupBy transformation
- Understanding the concept and characteristics of wide transformations
- Using the groupBy transformation
- Groups the RDD's data by a given criterion, creating a new RDD
- Practice with groupBy through examples and checking the results
Wide transformations and key-value RDDs
- Understanding the concept and use of key-value RDDs
- How to apply wide transformations to key-value RDDs
- Handling key-value RDDs through basic wide-transformation examples
Advanced use of wide transformations
- Explanation of complex computations using wide transformations
- How to group and aggregate using key-value RDDs
- Understanding advanced uses of wide transformations through practical examples
A data-processing project example using transformations and actions
- An example combining transformations and actions such as union, intersection, and subtract to process data
- How to develop a project taking the execution order of transformations and actions into account
Cache() and Persist(), used to optimize Spark's performance
- Transformations and actions: Spark's two kinds of operations
  - Transformation: an operation that creates a new RDD, executed lazily
  - Action: an operation that outputs or saves a result, executed immediately
- Lazy computation and memory optimization
  - Lazy computation is useful for minimizing disk and network operations and maximizing memory use
  - It enables performance gains especially on repeated tasks
- Cache and Persist
  - Cache and Persist keep data in memory so it can be reused
  - Cache: stores data in memory only
  - Persist: can store data in memory and on disk
- Example: machine-learning model training and linear regression
  - When data must be processed repeatedly, as in machine learning, cache and persist can improve performance
  - In a linear-regression example, the repeated gradient computation is optimized using cache
- Storage levels
  - Data can be stored at storage levels such as memory only, memory and disk, serialized memory, and disk only
  - Cache uses the default storage level; with Persist the user can specify the level
Through this, we learned how to use cache and persist, and how to adjust storage levels, to optimize the speed of data-processing tasks in Spark. In the next session we look in more detail at Spark's internal structure and design in a distributed environment.