Our client needed a platform to orchestrate clinical trial data ingestion and processing workflows to enable downstream data science analyses and discovery of biomarkers. The platform had to be generalizable to other (and future) clinical trials and deployable on scalable infrastructure to allow for high volume processing workloads while managing cloud compute resources in a cost effective way. Additionally, the platform must provide high availability and regenerative infrastructure, ensuring a complete solution that supports itself in the long term without significant administrative or maintenance costs.
We designed and implemented a data analysis workflow that orchestrates automatic ingestion and processing pipelines on Kubernetes, leveraging a cloud native stack to customize resources to our needs (Fluentbit, Argo Workflows, Opensearch, and more). Ingested data is decrypted, parsed, and formatted. A multi-step feature extraction workflow handles creating key data features for various modalities, saving artifacts to cloud storage as it executes. Finally, a notification is sent to stakeholders that indicates new data has been processed and is ready for analysis. With additional functionalities such as automatically generating FDA 21 CFR Part 11 compliant lineage logs, interactive quality control interfaces, and a CLI tool for supporting ongoing data science contributions, this platform is the foundational component for future planned clinical trials.