About the job
At Databricks, we strive to revolutionize the data lifecycle, from ingestion to ETL, business intelligence (BI), and machine learning (ML), with our unified platform. We envision a future where the traditional data warehouse architecture is superseded by an innovative architectural model known as the Lakehouse (CIDR 2021 paper). This cutting-edge approach integrates data warehousing with advanced analytics, addressing significant challenges such as data staleness, reliability, cost of ownership, data lock-in, and limited use-case support.
A pivotal step toward this vision is developing the next-generation (decoupled) query engine and structured storage system, designed to outperform conventional data warehouses on relational queries while retaining the flexibility of general-purpose systems like Apache Spark™. This will empower a wide range of workloads, from ETL processes to data science applications.
As a member of our team, you will work in one or more of the following areas, designing and implementing advanced systems that set new benchmarks:
- Query compilation and optimization
- Distributed query execution and scheduling
- Vectorized execution engine
- Data security
- Resource management
- Transaction coordination
- Efficient storage structures (encodings, indexes)
- Automatic physical data optimization

