This article is a summary of "Spark & Flink Online for Real-Time Big Data Processing" from Fast Campus. It covers the overall definitions; for the programming-related content, taking the lecture itself is recommended.
Apache Spark for Big Data Processing
Apache Spark is an open-source, high-speed distributed processing system used by companies such as Amazon and Airbnb to tackle the common problem of big data. The big data problem is usually broken into three categories: volume, velocity, and variety. The Hadoop ecosystem addressed it with three major components: HDFS, MapReduce, and YARN. Apache Spark began as a project to replace Hadoop's computation engine, and it is fast thanks to its in-memory architecture.
Spark processes data by storing it in memory across multiple nodes and operating on it in parallel. A Spark cluster consists of a driver program, which holds the SparkContext, and worker nodes, which run the executors. Spark also evaluates computations lazily, deferring work until a result is needed, which speeds up processing. Spark's core data model is the RDD (Resilient Distributed Dataset), an immutable, fault-tolerant distributed collection that is split into multiple partitions.
Spark is slower than pandas on a single machine, but because it scales horizontally it is fast and efficient at processing big data. It is commonly cited as up to 100 times faster in memory and 10 times faster on disk than Hadoop MapReduce.
In summary, Apache Spark is an essential tool for big data processing, providing a fast and efficient solution to the challenge of handling large volumes of data.
A Brief History of Apache Spark
In this blog post, we take a brief look at the history of Apache Spark, focusing on its development across versions 1, 2, and 3. The key points:
Apache Spark version 1
- Officially released in 2014
- In-memory processing built on RDDs
- The DataFrame API added in version 1.3
- Memory and CPU efficiency optimized through an engine upgrade, Project Tungsten
Apache Spark version 2
- Released in 2016
- Further simplified APIs and improved performance
- Stream processing via Structured Streaming, built on top of Spark SQL
- Data caching through the Dataset API, a typed layer over DataFrames
- Support for multiple languages, including Python, Java, and Scala
Apache Spark version 3
- Released in 2020
- New MLlib and Spark SQL features
- Significant performance improvement: roughly 2x faster than Spark 2.4
- Notable growth in PySpark usage
- Strengthened support for GPU-based workloads and for distributed graph computation (GraphX)
- Python 2 support dropped; Kubernetes support strengthened
Spark's components
- Spark is divided into five main components: Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX
- The fundamental parts, such as Spark Core, remain unchanged even as new features are added
In conclusion, Apache Spark keeps adding features and improving performance, but its foundations do not change. It is best to start learning from the fundamentals, such as Spark Core, and work your way up.
Introduction to RDD in Apache Spark
In this blog post, we learn about RDDs (Resilient Distributed Datasets) in Apache Spark. The key points:
- RDD stands for "Resilient Distributed Dataset" - in plain terms, an elastic, distributed collection of data.
- An RDD can be created by loading data from a file with SparkContext's textFile function.
- An RDD has five characteristics:
  - Data abstraction: the data is scattered across multiple nodes but can be used as a single object.
  - Resilient and immutable: the data is kept immutable and can be restored when needed, so failures in the network, hardware, and memory can be handled.
  - Type safe: types are determined at compile time, so problems can be detected early.
  - Handles both structured and unstructured data: it can process text data as well as tabular datasets.
  - Lazy evaluation: no computation is performed until a result is needed.
- RDD operations are divided into two kinds: transformations and actions.
- Advantages of RDDs include flexibility, concise code, and the ability to be more deliberate about data manipulation.
Overall, RDDs provide a flexible, resilient, and efficient way to handle large-scale distributed data processing in Apache Spark.
Parallel Processing and Distributed Processing
In this lecture, we learned about the differences between parallel processing and distributed processing:
Parallel processing
- Runs a task in multiple threads simultaneously
- Splits the data into several pieces and applies the task to each piece in a separate thread
- Operating in multiple threads at once yields faster results
- The results of each thread may need to be combined
Distributed processing
- Performs "data-parallel" processing across machines
- Splits the data into several pieces and sends them to multiple nodes
- Each node applies the task independently
- The results of each node may need to be combined
- Communication between nodes must be taken into account
By implementing and abstracting the data-parallel model with RDDs, Spark lets us write code for a distributed environment as if it were simple parallel processing. To get the best performance, however, we must pay attention to communication costs.
The next lecture looks at communication between nodes in a distributed environment in more detail.
Distributed Processing and Latency Issues
In this blog post, we look at the challenges associated with distributed processing and latency in Spark:
1. The partial failure problem - Spark's RDDs help handle partial failures: if some nodes fail, the affected data can be recomputed. This process is automated and handled by Spark itself.
2. Speed as a necessary trade-off - Tasks that require communication across multiple nodes can be slowed down by communication latency. Careful coding is needed to optimize performance.
3. Impact of the reduceByKey function - The performance of Spark code can be affected by reduceByKey. The function requires communication (a shuffle), which can slow the code down; calling reduceByKey on already-reduced data can improve performance.
4. Communication speed - Network communication is the slowest operation in Spark, with latency roughly a million times higher than memory operations.
Therefore, when developing Spark code it is important to consider the impact of network communication on performance and optimize accordingly.
Key-Value RDD (Paired RDD)
In this lecture, we learned about key-value RDDs: RDDs whose elements are key-value pairs. They allow higher-level operations based on keys and are mainly used to compute statistics over the values. They are reportedly used in a NoSQL-like fashion.
The main points:
- A key-value RDD is an RDD whose elements are key-value pairs.
- It is used for higher-level, key-based operations.
- A key-value RDD can be used to compute statistics based on the values.
- Reduction: groups and processes data by key, and is used to shrink the data.
- Join: an operation that combines multiple RDDs based on their keys.
- When working with key-value RDDs, it is important to use mapValues() and flatMapValues() instead of map() for efficient use of partitions: these operations touch only the values, so the RDD's keys (and partitioning) are preserved.
- A key-value RDD is otherwise not very different from a standard RDD.
In conclusion, key-value RDDs are a powerful tool for efficiently manipulating data and computing statistics by key and value - an essential concept for anyone working with big data and Spark.
Understanding Spark's Transformations and Actions
Understanding the concepts of transformations and actions
- Basic concepts of transformations and actions explained
- Transformation: an operation that creates a new RDD from an RDD
- Action: an operation that outputs or saves a result
- Transformations are lazy: they execute only when an action is encountered
- Actions execute immediately
Types of transformations
- Map transformation: a one-to-one conversion, turning each element into another element
- Filter transformation: selects only the elements matching a given condition, producing a new RDD
- Other transformations: various functions such as reduceByKey, groupBy, and flatMap
Using actions, and points of caution
- Collect action: fetches all the data in the RDD and returns it as a list
- Count action: counts the number of elements in the RDD
- Distinct: removes duplicates, extracting only the unique values (strictly a lazy transformation, though covered here alongside actions)
- Caution when using actions: because collect brings back all the data, watch out for resource consumption
Understanding the execution order of transformations and actions
- Transformations: when the program encounters a transformation it records the operation, and executes it when it meets an action
- Actions: functions that start the program's execution, immediately outputting or saving the result
Understanding transformations and actions through practice
- Hands-on practice with transformations such as map and filter through examples
- Practice checking an RDD's data with collect and counting values with count
- Using the distinct function to remove duplicates
Performance optimization of transformations and actions
- Understanding the performance differences between transformations and actions
- Understanding the resource consumption of wide transformations and shuffling
- Introduction to tricks and methods for optimization
The Union transformation
- Concept and use of union
- Combines two RDDs into a new RDD
- Includes all elements, duplicates included
- Practice with union through examples and checking the results
The Intersection transformation
- Concept and use of intersection
- Extracts only the elements common to two RDDs into a new RDD
- Duplicated elements are included only once
- Practice with intersection through examples and checking the results
The Subtract transformation
- Concept and use of subtract
- Creates a new RDD with the elements of the first RDD, excluding those present in the second
- Practice with subtract through examples and checking the results
The Sampling function
- Understanding the sample function's concept and parameters
- Randomly extracts a subset of the RDD's elements
- First parameter: withReplacement, whether to sample with replacement
- Second parameter: fraction, the expected fraction of elements to sample
- Third parameter: seed, the random seed (default None)
- Practice with the sampling function through examples and checking the results
Wide transformations and the groupBy transformation
- Understanding the concept and characteristics of wide transformations
- Using the groupBy transformation
- Groups the RDD's data by a given criterion, creating a new RDD
- Practice with groupBy through examples and checking the results
Wide transformations and key-value RDDs
- Understanding the concept and use of key-value RDDs
- How to apply wide transformations to key-value RDDs
- Handling key-value RDDs through basic wide-transformation examples
Advanced use of wide transformations
- Explanation of complex computations using wide transformations
- How to group and aggregate using key-value RDDs
- Understanding advanced uses of wide transformations through practical examples
A data-processing project example using transformations and actions
- An example combining transformations and actions such as union, intersection, and subtract to process data
- How to develop a project taking the execution order of transformations and actions into account
Cache() and Persist(), used to optimize Spark's performance
- Transformations and actions: Spark's two kinds of operations
  - Transformation: an operation that creates a new RDD, executed lazily
  - Action: an operation that outputs or saves a result, executed immediately
- Lazy computation and memory optimization
  - Lazy computation is useful for minimizing disk and network operations and maximizing memory use
  - It enables performance gains especially on repeated tasks
- Cache and Persist
  - Cache and Persist keep data in memory so it can be reused
  - Cache: stores data in memory only
  - Persist: can store data in memory and on disk
- Example: machine-learning model training and linear regression
  - When data must be processed repeatedly, as in machine learning, cache and persist can improve performance
  - In a linear-regression example, the repeated gradient computation is optimized using cache
- Storage levels
  - Data can be stored at storage levels such as memory only, memory and disk, serialized memory, and disk only
  - Cache uses the default storage level; with Persist the user can specify the level
Through this, we learned how to use cache and persist, and how to adjust storage levels, to optimize the speed of data-processing tasks in Spark. In the next session we look in more detail at Spark's internal structure and design in a distributed environment.