abeam_python_Groupby

What is Apache Beam?

Apache Beam is a data processing platform. The processing can be for analytics or for ETL (Extract, Transform, Load). Apache Beam does not rely on any single execution engine, and the input can be either streaming data or batch data, read from sources such as a relational database or an in-memory database. In short, Apache Beam is agnostic to the execution platform, to the data, and to the programming language: you can write your logic in Java, Python, or Go.

Terminology

Quickstart

Check versions

python --version
pip --version

Python must be 3.6 or higher, and pip must be 7.0.0 or newer.
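The same check can also be done from inside Python, which is handy in a notebook. The snippet below is only a sketch: the helper names `version_tuple`, `meets_minimum`, and `installed_pip_version` are made up for this example, and the minimum versions are simply the ones stated above.

```python
import sys

# Minimum versions as stated in this README.
MIN_PYTHON = (3, 6)
MIN_PIP = (7, 0, 0)


def version_tuple(text):
    """Convert a dotted version string such as '21.3.1' to a tuple of ints.

    Parsing stops at the first component that does not start with a digit,
    so '20.0.2b1' becomes (20, 0, 2) and '21.3.dev0' becomes (21, 3).
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Return True if `found` is at least `minimum`, padding `found` with zeros."""
    found = tuple(found)
    padded = found + (0,) * (len(minimum) - len(found))
    return padded >= tuple(minimum)


def installed_pip_version():
    """Return the installed pip version as a tuple, or None if pip is missing."""
    try:
        import pip
    except ImportError:
        return None
    return version_tuple(pip.__version__)


if __name__ == "__main__":
    print("Python OK:", meets_minimum(sys.version_info[:3], MIN_PYTHON))
    pip_version = installed_pip_version()
    if pip_version is None:
        print("pip is not installed")
    else:
        print("pip OK:", meets_minimum(pip_version, MIN_PIP))
```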

Install apache beam

python -m pip install apache-beam

To install extra dependencies, use the following command (the quotes keep shells such as zsh from interpreting the square brackets):

python -m pip install "apache-beam[gcp,aws,test,docs]"

For more details, go to this link

Google Colab

Google Colab has Python preinstalled, which makes it easy to start using Apache Beam.

Note: Google Colab works much like a Jupyter notebook.

See my netflixGroupBy.ipynb Colab Python notebook:

Sri Sudheera Colab File

Sri Sudheera Project input file

Sri Sudheera Project output file

Resources

Apache Beam Group By

Kaggle data set

Apache Beam Colab