Data Engineering at Tessian

Read more about Engineering at Tessian: https://stackoverflow.com/jobs/companies/tessian

As a high-growth scale-up, our email datasets are growing at an exponential rate. This is a great problem to have: it allows us to train best-in-class machine learning models to prevent previously unpreventable data breaches. We have scaled to the point where our current data pipelines are not where we want them to be, which is why we're looking to hire our first Data Engineer. You will sit in our Platform team and work day-to-day with our Data Scientists, building out the infrastructure and data pipelines that empower teams to iterate quickly on terabytes of data. We view this as a hugely impactful, high-leverage role, and we strongly believe that if we can query all of our data in near real-time, using scalable systems, then we can deliver more value to our clients through the data breaches we prevent.
Your responsibilities will include:
- Building systems to efficiently handle our ever-increasing volume of data
- Designing and implementing data pipelines as well as owning the vision of what our systems could achieve in the future
- Working with Data Scientists to train, version, test, deploy and monitor our machine learning models in production
- Designing systems to expose data to our product and engineering teams in a performant way
- Mentoring the Data Science team on how to work with data at scale
We'd love to meet someone who:
- Is a highly skilled developer who understands software engineering best practices (Git, CI/CD, testing, code review, etc.)
- Has experience working with distributed data systems such as Spark (or Amazon EMR)
- Has designed and deployed data pipelines and ETL systems for data at scale
- Has a deep knowledge of the AWS ecosystem and has managed AWS production environments
- Has experience with Docker and container orchestration systems like AWS ECS and Kubernetes
- Ideally has been involved in machine learning infrastructure projects from automated training through to deployment
Some interesting projects we’re working on:
- An ETL pipeline on AWS Glue that transforms millions of tiny files into the Parquet format for efficient querying (see the PySpark sketch after this list)
- Redesigning existing pipelines to use Spark, setting ourselves up to handle massive future scale
- Building a new Flask app framework that standardises how we deploy all models to production (see the Flask sketch after this list)
- Dumping terabytes of data from production databases in real time, in a performant, scalable and secure way
- Building a lightweight data labelling system designed specifically for labelling emails at scale
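To give a flavour of the first project above, here is a minimal PySpark sketch of compacting many small files into partitioned Parquet. The bucket names, paths and column names are hypothetical placeholders rather than our actual schema, and a real job would be wired into AWS Glue's job framework:

```python
# Minimal PySpark sketch: compact millions of tiny JSON files into a
# small number of partitioned Parquet files for efficient querying.
# All paths and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("compact-to-parquet").getOrCreate()

# Read the raw drop of tiny files (e.g. one event per file).
raw = spark.read.json("s3://example-raw-bucket/events/")

# Derive a date partition so downstream queries can prune aggressively.
events = raw.withColumn("event_date", F.to_date("received_at"))

# coalesce() bounds the number of output files, turning millions of
# tiny inputs into a handful of large, splittable Parquet files.
(events
    .coalesce(64)
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-curated-bucket/events/"))

spark.stop()
```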
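Similarly, the Flask framework in the third project might reduce every model deployment to a skeleton like the one below. This is a sketch under assumed conventions (a pickled model exposing predict(), a JSON payload with a "features" key); it is not our framework's actual API:

```python
# Minimal Flask model-serving sketch. The model path, route names and
# payload shape are illustrative assumptions, not a real API.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model once at startup; assumes a scikit-learn-style
# object with a predict() method, serialised to model.pkl.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/healthz")
def healthz():
    # Liveness probe for the container orchestrator (ECS / Kubernetes).
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[...], [...]]}.
    payload = request.get_json(force=True)
    predictions = model.predict(payload["features"])
    return jsonify(predictions=predictions.tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Standardising on a skeleton like this makes every model a uniform Docker image that ECS or Kubernetes can health-check and scale in the same way.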