About the job
Job Title: Data Engineer (PySpark)
________________________________________
About the Role
We are looking for a Data Engineer specializing in PySpark and the Cloudera Data Platform (CDP) to join our data engineering team. In this role, you will design, build, and maintain data pipelines that keep data accurate and accessible across the organization. Expertise in big data ecosystems, cloud-native technologies, and large-scale data processing techniques is essential.
The ideal candidate will have extensive hands-on experience in data ingestion, transformation, and optimization on the Cloudera Data Platform, along with a strong track record of applying data engineering best practices. You will work closely with fellow data engineers to build solutions that deliver meaningful business insights.
Key Responsibilities
- Design and develop scalable ETL pipelines using PySpark on CDP, ensuring data integrity.
- Manage data ingestion processes from diverse sources (e.g., relational databases, APIs, file systems) to the data lake or warehouse on CDP.
- Use PySpark to process, cleanse, and transform large datasets to meet analytical and business needs.
- Tune PySpark code and Cloudera components to improve performance and resource utilization.
- Establish data quality checks and validation routines to maintain data accuracy throughout the pipeline.
- Automate workflows using orchestration tools such as Apache Oozie or Apache Airflow within the Cloudera ecosystem.
- Monitor pipeline performance, troubleshoot issues, and maintain the Cloudera Data Platform and associated processes.
- Collaborate with data engineers, analysts, product managers, and other stakeholders to understand data requirements.
- Document data engineering processes, code, and pipeline configurations thoroughly.
Qualifications
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, Information Systems, or a related discipline.
- 3+ years of experience as a Data Engineer, with a focus on PySpark and the Cloudera Data Platform.

