Python_Spark
Python
In addition to PEP 8, here are some best practices specific to data engineering:
- Import only the modules you need
- Tag versions for modules and packages in requirements.txt
- Include modes like "Debug" for verbose logging to console
- Use logging instead of print statements
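The logging and "Debug" mode points can be sketched with the standard library alone. This is a minimal illustration, not a prescribed layout: the `--debug` flag, the logger name `pipeline`, and the function names are assumptions made for the example.

```python
import argparse
import logging


def parse_args(argv=None):
    """Parse command-line flags; --debug is a hypothetical flag name."""
    parser = argparse.ArgumentParser(description="Example pipeline entry point")
    parser.add_argument("--debug", action="store_true",
                        help="enable verbose (DEBUG) logging to the console")
    return parser.parse_args(argv)


def configure_logging(debug=False):
    """Log to the console at DEBUG in debug mode, otherwise at INFO."""
    level = logging.DEBUG if debug else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace handlers from any earlier configuration (Python 3.8+)
    )
    return logging.getLogger("pipeline")  # hypothetical logger name


def main(argv=None):
    args = parse_args(argv)
    log = configure_logging(args.debug)
    log.info("Pipeline started")              # shown in both modes
    log.debug("Parsed arguments: %s", args)   # shown only with --debug
    return log
```

Calling `main(["--debug"])` emits both messages, while `main([])` emits only the INFO line. Unlike print statements, the verbosity changes without touching the pipeline code.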
Spark
- Only change default configuration on purpose
- Create pipelines which are idempotent
- Run MERGE only if the source DataFrame is not empty
- Define schemas explicitly instead of inferring them
- Use ZSTD over SNAPPY for tables not frequently updated
- Do NOT use Spark for small workloads (<5 GB)
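Two of the points above, explicit schemas and skipping the merge for an empty source, can be sketched in PySpark as follows. This is one possible approach under stated assumptions: the column names, the JSON source, and the helper names (`build_orders_schema`, `read_orders`, `merge_if_not_empty`) are hypothetical, and `run_merge` stands in for whatever merge logic the pipeline actually uses.

```python
def build_orders_schema():
    """Explicit schema for a hypothetical 'orders' source."""
    # Imported lazily so the module still loads where pyspark is not installed.
    from pyspark.sql.types import (
        DoubleType, StringType, StructField, StructType, TimestampType,
    )
    return StructType([
        StructField("order_id", StringType(), nullable=False),
        StructField("amount", DoubleType(), nullable=True),
        StructField("updated_at", TimestampType(), nullable=True),
    ])


def read_orders(spark, path):
    """Read JSON with the declared schema instead of letting Spark infer it."""
    return spark.read.schema(build_orders_schema()).json(path)


def merge_if_not_empty(source_df, run_merge):
    """Call run_merge(source_df) only when the source has at least one row.

    head(1) fetches at most one row, so the check avoids a full count().
    Returns True if the merge ran, False if it was skipped.
    """
    if not source_df.head(1):
        return False
    run_merge(source_df)
    return True
```

A merge keyed on a stable identifier such as `order_id` also supports idempotency, since re-running the same batch updates existing rows rather than appending duplicates.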