Apache Beam is a data processing platform. The processing can serve analytics or ETL (Extract, Transform, Load). Apache Beam doesn't rely on any one execution engine. The input can be streaming data or batch data, and it can come from sources such as a relational database or an in-memory database. So Apache Beam is agnostic to the execution platform, to the data, and to the programming language: you can write your logic in Java, Python, or Go.
Pipelines
A pipeline describes the data processing job end to end.
PCollection
Reading the input data produces a PCollection. Applying a transformation to that data and creating new data from it also produces a PCollection.
PTransform
The logic applied to the data is a PTransform (https://beam.apache.org/documentation/programming-guide/#transforms).
PipelineRunner
Specifies where and how the pipeline should execute.
python --version
pip --version
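Python must be 3.6 or higher and pip must be 7.0.0 or newer. Recent Beam releases may require a newer Python, so check the Beam documentation for the version you install.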
python -m pip install apache-beam
To install extra dependencies, use the command below:
pip install apache-beam[gcp,aws,test,docs]
For more details, follow this link.
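After the install finishes, a quick check from Python can confirm that the interpreter meets the minimum version stated above and that the package can be found. This is a minimal sketch using only the standard library. The helper name `check_environment` is made up for illustration and is not part of Apache Beam.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 6)):
    """Return (python_ok, beam_installed) for the current interpreter.

    min_python defaults to the minimum version stated in this README.
    """
    # sys.version_info compares element-wise against a plain tuple.
    python_ok = sys.version_info >= min_python
    # find_spec looks the package up without importing it;
    # it returns None when the package is not installed.
    beam_installed = importlib.util.find_spec("apache_beam") is not None
    return python_ok, beam_installed


if __name__ == "__main__":
    python_ok, beam_installed = check_environment()
    print("Python version OK:", python_ok)
    print("apache-beam installed:", beam_installed)
```

If the second value is False, rerun the install command with the same interpreter, for example `python -m pip install apache-beam`.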
Google Colab has Python preinstalled, so it is easy to start using Apache Beam there.
Note: Google Colab works similarly to a Jupyter notebook.
See my netflixGroupBy.ipynb Colab Python notebook.
Sri Sudheera Project input file
Sri Sudheera Project output file