Senior Data Engineer - Real World Data
Company: Formation Bio
Location: New York City
Posted on: April 2, 2026
Job Description:
About Formation Bio

Formation Bio is a tech- and AI-driven pharma
company differentiated by radically more efficient drug
development. Advancements in AI and drug discovery are creating
more candidate drugs than the industry can progress because of the
high cost and time of clinical trials. Recognizing that this
development bottleneck may ultimately limit the number of new
medicines that can reach patients, Formation Bio, founded in 2016
as TrialSpark Inc., has built technology platforms, processes, and
capabilities to accelerate all aspects of drug development and
clinical trials. Formation Bio partners on, acquires, or in-licenses
drugs from pharma companies, research organizations, and biotechs
to develop programs through clinical proof of concept and beyond,
ultimately helping to bring new medicines to patients. The company
is backed by investors across pharma and tech, including a16z,
Sequoia, Sanofi, Thrive Capital, Sam Altman, John Doerr, Spark
Capital, SV Angel Growth, and others. You can read more at the
following links: Our Vision for AI in Pharma; Our Current Drug
Portfolio; Our Technology & Platform.

At Formation Bio, our values
are the driving force behind our mission to revolutionize the
pharma industry. Every team and individual at the company shares
these same values, and every team and individual plays a key part
in our mission to bring new treatments to patients faster and more
efficiently.

About the Position

We're looking for a Senior Data
Engineer to join the Scientific Data Intelligence (SDI) team at
Formation Bio to help transform Real World Data (RWD)—spanning
electronic health records, claims, and other longitudinal patient
data sources—into structured, analytics-ready assets. In this role,
you'll be partnering closely with our Data Science team not only to
model and transform data, but also to actively analyze it:
answering research questions, generating evidence, and supporting
scientific decision-making across our drug portfolio. This position
sits at the intersection of healthcare data engineering, real-world
evidence analysis, and generative AI. While a strong foundation in
building reliable, scalable pipelines is essential, you'll be
equally expected to roll up your sleeves and work directly with the
data—constructing cohorts, running analyses, and translating
findings into actionable insights for scientific and business
stakeholders. The ideal candidate is a hybrid of data engineer and
applied scientist: someone who can build the infrastructure and
then use it, with familiarity in RWD study design, GenAI fluency
(e.g., LLM-based entity extraction, summarization, classification),
and strong technical expertise with modern data tooling. You'll
play a key role in shaping how real-world patient data becomes
discoverable, structured, and impactful across the organization.
Responsibilities

- Model and transform raw EHR and claims data into clean, canonical,
  and analytics-ready datasets using SQL, Python, and clinical
  standards like OMOP.
- Build and manage scalable data pipelines using Dagster for
  orchestration, dbt for transformation, and Snowflake as the primary
  compute and storage engine.
- Conduct hands-on RWD analyses to answer scientific and strategic
  research questions, including disease epidemiology, treatment
  patterns, patient journey characterization, and comparative
  effectiveness.
- Partner with Data Scientists and clinical leads to design and
  execute observational studies, translating scientific questions
  into well-structured, reproducible analyses.
- Implement data validation, completeness, and observability
  frameworks to ensure real-world datasets are accurate,
  comprehensive, and trustworthy for downstream research and product
  use.
- Apply Generative AI techniques within transformation and analysis
  layers to accelerate data structuring and insight generation.
- Communicate findings clearly to both technical and non-technical
  stakeholders, including summaries for portfolio teams and
  leadership.

About You

- You have 5 years of experience in data engineering, ideally with at
  least 2 years working in healthcare or life sciences, including
  direct exposure to EHR or claims datasets.
- You have experience with ontologies and biomedical schemas (e.g.,
  UMLS, LOINC, ICD-9/10, MeSH) and understand the modalities found
  within RWD: billing claims, lab results, visit notes.
- You're fluent in SQL and Python, and you've built and maintained
  production-grade pipelines that support analytics or scientific
  workflows.
- You have experience building longitudinal patient cohorts from EHR
  or claims data, including index date logic, washout periods, and
  follow-up window construction.
- You have a solid understanding of causal inference frameworks such
  as potential outcomes and target trial emulation.
- You have working familiarity with real-world evidence study design
  concepts (such as active comparator new user designs, time-to-event
  outcomes, confounder adjustment, and causal discovery algorithms)
  sufficient to partner effectively with Data Scientists on causal
  inference workflows.
- You value clarity, documentation, and structured thinking,
  especially when working with complex healthcare data.
- You have hands-on expertise with modern data infrastructure, such
  as Snowflake, dbt, and Dagster.
- You can balance upfront design with speed to execution, slowing
  down when it counts without getting stuck in the details.
- Bonus: You've worked in regulated or privacy-sensitive data
  environments and are familiar with governance models for PHI or
  sensitive data.
- Bonus: You have prior experience working with commercial RWD
  vendors (e.g., Truveta, Optum, Komodo, IQVIA) and understand the
  nuances of licensed claims and EHR datasets, including longitudinal
  patient journey construction and line-of-therapy sequencing.

Formation Bio
is prioritizing hiring in key hubs, primarily the New York City and
Boston metro areas. These positions will follow a hybrid work model
with 1-3 days required at the office. Applicants from the Research
Triangle (NC) and San Francisco Bay Area may also be considered.
Please only apply if you reside in these locations or are willing
to relocate.

Compensation Range: $204,500 - $267,000

Salary ranges
are informed by a number of factors including geographic location.
The range provided includes base salary only. In addition to base
salary, we offer equity, comprehensive benefits, generous perks,
hybrid flexibility, and more. If this range doesn't match your
expectations, please still apply because we may have something else
for you.

You will receive consideration for employment without
regard to race, color, religion, gender, gender identity or
expression, sexual orientation, national origin, genetics,
disability, age, or veteran status.