Data Engineer - Remote

Our client is a young crypto startup from Silicon Valley.

The company is currently in stealth mode, but it is backed by a16z, Coinbase Ventures, and many others. It aims to help economies transition to cryptocurrencies.

We are looking for the best talent in the industry to join them.

Among the openings is a Data Engineer position. Remote work is possible.

The AI & Biometrics team is building a state-of-the-art iris recognition engine that works at the scale of 1B+ people. To do this, they use a fusion of custom optics, hardware, and on-device machine learning, combined with large-scale data collection in more than 20 countries.

Achieving this level of accuracy demands a robust data pipeline to fuel their machine learning models. From data collected through various field tests, they receive several million images monthly. These images need to be pre-processed and passed through both external and in-house labeling services.

This role is responsible for building, scaling, and maintaining a stable data pipeline. The cross-disciplinary nature of this team requires interfacing with various other teams across the company, including Hardware, Infrastructure, and IoT.

About the Opportunity:

  • Design data pipelines that can scale to handle this large data ingest, including ways to store, process, and load the data with robust features for filtering, pre-processing, post-processing, deduplication, and versioning (a minimal sketch follows this list).
  • Build and refine custom data labeling services that directly influence the quality of the iris recognition engine.
  • Work closely with other stakeholders (data contributors and consumers) to incorporate their data usage needs across a variety of tasks and domains.
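
For illustration only, here is a minimal sketch of what one stage of such an ingest pipeline might look like as an Airflow DAG (Airflow being one of the orchestrators listed below, TaskFlow API, Airflow 2.4+). All bucket paths, task names, and helpers are hypothetical, and the deduplication and preprocessing logic is stubbed out; this is not the team's actual pipeline.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2022, 1, 1), catchup=False)
def image_ingest_pipeline():
    @task
    def list_new_images() -> list:
        # In a real deployment this would list newly uploaded objects in
        # an S3 bucket; stubbed here with a single fake key.
        return ["s3://field-data/2022-01-01/img_0001.png"]

    @task
    def deduplicate(keys: list) -> list:
        # Drop images already ingested, e.g. by checking content hashes
        # against a metadata store; stubbed as an in-memory set here.
        seen = set()
        unique = []
        for key in keys:
            if key not in seen:
                seen.add(key)
                unique.append(key)
        return unique

    @task
    def preprocess(keys: list) -> list:
        # Resize/normalize each image and write a versioned copy.
        return [k.replace("field-data", "preprocessed/v1") for k in keys]

    @task
    def submit_for_labeling(keys: list) -> None:
        # Hand off to an external or in-house labeling service.
        print(f"queueing {len(keys)} images for labeling")

    submit_for_labeling(preprocess(deduplicate(list_new_images())))


image_ingest_pipeline()
```

In a pipeline like this, each stage is an independently retryable task, which is what makes properties like deduplication and versioning tractable at millions of images per month.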

About You:

  • BS/BA in a quantitative field (e.g., CS, Math, Physics), or equivalent professional experience.
  • Enjoy working as part of a fast-moving team, where perfectionism is sometimes at odds with pragmatism and sometimes exactly what it requires.
  • Own problems end-to-end, and are willing to pick up whatever context is needed to get the job done.
  • Want to dig into problems across the stack, whether that means chasing networking issues, performance bottlenecks, or memory leaks, or simply reading unfamiliar code to figure out where potential issues might exist.
  • Believe strongly that high-quality data is crucial for producing state-of-the-art machine learning systems, and are motivated to design workflows that meet the associated challenges.
  • Care about code quality and enjoy building tools that are easy to use and extensible.

Here’s a sampling of services currently running (and planned) in production:

  • Languages (Python / Go)
  • Data Orchestration (Airflow / Dagster)
  • Infrastructure, Storage, and Processing (AWS)
  • Labeling Services for ML Models (MTurk, Flask, Streamlit, Docker) (a skeletal example follows this list)
  • Pipeline Monitoring (Datadog)
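
To give a feel for the labeling-service side of the stack, here is a skeletal Streamlit tool of the kind the list above hints at (recent Streamlit versions). The image directory, label set, and output file are invented for the example; a production service would persist labels to a database or queue rather than a local file.

```python
import json
from pathlib import Path

import streamlit as st

IMAGE_DIR = Path("images/unlabeled")  # hypothetical local image drop
LABELS = ["usable", "blurry", "occluded", "reject"]  # invented label set

st.title("Iris capture QA labeling")

if "idx" not in st.session_state:
    st.session_state["idx"] = 0

images = sorted(IMAGE_DIR.glob("*.png"))
if not images:
    st.info("No unlabeled images found.")
else:
    image = images[st.session_state["idx"] % len(images)]
    st.image(str(image), caption=image.name)

    label = st.radio("Label", LABELS, horizontal=True)
    if st.button("Submit"):
        # Append the decision to a JSONL file; a real service would
        # write to a database or queue instead.
        with open("labels.jsonl", "a") as f:
            f.write(json.dumps({"image": image.name, "label": label}) + "\n")
        st.session_state["idx"] += 1
        st.rerun()
```

Run with `streamlit run label_tool.py` (hypothetical filename); the appeal of this pattern is that a single short Python file, containerized with Docker, is enough to put a labeling UI in front of internal reviewers.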