Data Engineering at Tessian

Read more about Engineering at Tessian: https://stackoverflow.com/jobs/companies/tessian

As a high-growth scale-up, our email datasets are growing at an exponential rate. This is a great problem to have: it allows us to train best-in-class machine learning models to prevent previously unpreventable data breaches. We have scaled to the point where our current data pipelines are not where we want them to be, which is why we're looking to hire our first Data Engineer. You will sit in our Platform team and work day-to-day with our Data Scientists, building out the infrastructure and data pipelines that empower teams to iterate quickly on terabytes of data. We view this as a hugely impactful, high-leverage role, and we strongly believe that if we can query all of our data in near real-time, using scalable systems, then we can deliver more value to our clients through the data breaches we prevent.
Your responsibilities will include:
- Building systems to efficiently handle our ever-increasing volume of data
- Designing and implementing data pipelines as well as owning the vision of what our systems could achieve in the future
- Working with Data Scientists to train, version, test, deploy and monitor our machine learning models in production
- Designing systems to expose data to our product and engineering teams in a performant way
- Mentoring the Data Science team on how to work with data at scale
We'd love to meet someone who:
- Is a highly skilled developer who understands software engineering best practices (Git, CI/CD, testing, code review, etc.)
- Has experience working with distributed data systems such as Spark (or Amazon EMR)
- Has designed and deployed data pipelines and ETL systems for data at scale
- Has a deep knowledge of the AWS ecosystem and has managed AWS production environments
- Has experience with Docker and container orchestration systems like AWS ECS and Kubernetes
- Ideally has been involved in machine learning infrastructure projects from automated training through to deployment
Some interesting projects we’re working on:
- An ETL pipeline on AWS Glue that transforms millions of tiny files into the Parquet format for efficient querying (see the PySpark sketch after this list)
- Redesigning existing pipelines to use Spark, setting ourselves up to handle massive future scale
- Building a new Flask app framework that standardises how we deploy all models to production (see the Flask sketch after this list)
- Dumping terabytes of data from production databases in real time, in a performant, scalable and secure way
- Building a lightweight data labelling system designed specifically for labelling emails at scale
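To give a flavour of the first project above, here is a minimal PySpark sketch of compacting many small files into partitioned Parquet. The bucket names, paths and column names are hypothetical placeholders rather than our actual schema, and a real job would be wired into AWS Glue's job framework:

```python
# Minimal PySpark sketch: compact millions of tiny JSON files into a
# small number of partitioned Parquet files for efficient querying.
# All paths and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("compact-to-parquet").getOrCreate()

# Read the raw drop of tiny files (e.g. one event per file).
raw = spark.read.json("s3://example-raw-bucket/events/")

# Derive a date partition so downstream queries can prune aggressively.
events = raw.withColumn("event_date", F.to_date("received_at"))

# coalesce() bounds the number of output files, turning millions of
# tiny inputs into a handful of large, splittable Parquet files.
(events
    .coalesce(64)
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-curated-bucket/events/"))

spark.stop()
```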
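Similarly, the Flask framework in the third project might reduce every model deployment to a skeleton like the one below. This is a sketch under assumed conventions (a pickled model exposing predict(), a JSON payload with a "features" key); it is not our framework's actual API:

```python
# Minimal Flask model-serving sketch. The model path, route names and
# payload shape are illustrative assumptions, not a real API.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model once at startup; assumes a scikit-learn-style
# object with a predict() method, serialised to model.pkl.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/healthz")
def healthz():
    # Liveness probe for the container orchestrator (ECS / Kubernetes).
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[...], [...]]}.
    payload = request.get_json(force=True)
    predictions = model.predict(payload["features"])
    return jsonify(predictions=predictions.tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Standardising on a skeleton like this makes every model a uniform Docker image that ECS or Kubernetes can health-check and scale in the same way.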