Python_Spark

Python

In addition to PEP 8, here are some best practices specific to data engineering:

Import only the modules you need
Tag versions for modules and packages in requirements.txt
Include modes like "Debug" for verbose logging to console
Use logging instead of print statements
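
The logging and "Debug" mode points above can be sketched roughly as follows. This is a minimal illustration, not a prescribed setup: the logger name "pipeline", the function name configure_logging and the --debug flag are all assumptions made for the example.

```python
import argparse
import logging

# Hypothetical logger name; use whatever fits the project.
logger = logging.getLogger("pipeline")


def configure_logging(debug: bool = False) -> None:
    """Send log records to the console, verbose only when debug is on."""
    level = logging.DEBUG if debug else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Example pipeline entry point")
    # Hypothetical flag that switches on the "Debug" mode.
    parser.add_argument("--debug", action="store_true",
                        help="verbose logging to console")
    args = parser.parse_args(argv)

    configure_logging(args.debug)
    logger.info("pipeline started")               # always shown
    logger.debug("detailed state: %s", {"rows": 0})  # shown only with --debug
```

Calling main(["--debug"]) prints both messages, while main([]) prints only the INFO line. Because the messages go through logging rather than print, the verbosity can be changed without touching the pipeline code.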

Spark

Only change default configuration on purpose
Create pipelines which are idempotent
Run MERGE only if the source DataFrame is not empty
Define schemas explicitly instead of inferring them
Use ZSTD over SNAPPY for tables that are not frequently updated
Do NOT use Spark for small workloads (< 5 GB)
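
Several of the Spark points above can be combined in one small sketch. It is only an illustration under stated assumptions: the function names merge_if_not_empty and load_orders, the column names, and the CSV and Parquet paths are hypothetical, and the plain overwrite write stands in for a real MERGE statement.

```python
from typing import Any, Callable


def merge_if_not_empty(source_df: Any, merge_fn: Callable[[Any], None]) -> bool:
    """Run merge_fn(source_df) only when the source has at least one row.

    Returns True if the merge ran, False if it was skipped.
    """
    # take(1) fetches at most one row, so it avoids a full count().
    if not source_df.take(1):
        return False
    merge_fn(source_df)
    return True


def load_orders(spark: Any, input_path: str, output_path: str) -> bool:
    """Hypothetical job: read CSV with an explicit schema, write ZSTD Parquet."""
    # Imported here so the helper above stays usable without PySpark installed.
    from pyspark.sql.types import (
        DoubleType, StringType, StructField, StructType, TimestampType,
    )

    # Explicit schema instead of inferSchema; column names are assumptions.
    schema = StructType([
        StructField("order_id", StringType(), nullable=False),
        StructField("amount", DoubleType(), nullable=True),
        StructField("created_at", TimestampType(), nullable=True),
    ])
    source_df = spark.read.schema(schema).option("header", "true").csv(input_path)

    def write(df: Any) -> None:
        # Overwrite keeps a re-run idempotent: same input, same output.
        # ZSTD is chosen here because the table is assumed to be rarely updated.
        (df.write.mode("overwrite")
           .option("compression", "zstd")
           .parquet(output_path))

    return merge_if_not_empty(source_df, write)
```

With a real session this would be called as load_orders(spark, "in/orders.csv", "out/orders"). When the source is empty the function returns False and leaves the existing target untouched.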