Promote transparency in the data analysis process
Gain visibility into peer-reviewed work
Turn a static published paper into an interactive research workspace for instant exploration
Demonstrate integration of a Python-based machine learning algorithm into the Tag.bio platform
This project began when our client asked us to reproduce an elastic net machine learning model from this research paper. Because the paper provided the data and code needed to reproduce it, the task should have been easy. Instead, it proved more difficult than anticipated and ended with a surprising result.
- We expected reproducing the results to take a day at most; it ended up taking over 2 weeks
- We were unable to compile software to run the published pipeline
- In order to pre-process the data, we had to read the code line-by-line to pinpoint the errors, correct them, and manually re-run the data preparation steps
- The reported versions of the Python packages were outdated
- The number of patients in the study was unclear
- The Git repo references another Git repo that contains many CSV files and Jupyter notebooks but no documentation
Re-running the code as published proved to be impossible. So, rather than trying to fix it, we rewrote the code from scratch, including the data preparation and the ML algorithm implementation.
We confirmed the validity of our rewrite by reproducing the authors’ results. Comparing the percentage of variance explained by the predictive model, the original value is 78.99% and our rewrite gives 78.72%, a difference of 0.27 percentage points.
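For readers unfamiliar with the metric, here is a minimal sketch of how a "percentage of variance explained" can be computed. It assumes the metric is the usual coefficient of determination, R² = 1 − SS_res / SS_tot, expressed as a percentage; the paper's exact definition is not restated here. The function name and the toy values are hypothetical and are not taken from the paper's data.

```python
def variance_explained(observed, predicted):
    """Return R^2 as a percentage: 100 * (1 - SS_res / SS_tot)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_observed = sum(observed) / len(observed)
    # Total variation of the outcome around its mean.
    ss_tot = sum((y - mean_observed) ** 2 for y in observed)
    # Variation left over after the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 100.0 * (1.0 - ss_res / ss_tot)


# Toy values for illustration only (not data from the paper).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
score = variance_explained(observed, predicted)
print(f"variance explained: {score:.2f}%")
```

Two implementations of the same model can then be compared by computing this score for each on the same observations.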
Results from the research paper
Tag.bio’s reproduced results on the Data Portal
Turning the paper into a live data product allowed us to investigate the robustness of the model in real time. We noted that for one patient, patient_id 2389, data had been imputed. When we re-ran the model without that patient’s data, we found a sharp drop in “variance explained”, from 78.99% to 32.58%. We also checked what happens if we exclude another, randomly chosen patient, namely patient_id 1233, and the “variance explained” dropped even further, to 11.81%.
We shared the results with our client by deploying the data as a data product with a range of analysis apps. This data product can be run through the Data Portal, where users can change input parameters in real time to examine the effectiveness of the model.
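The exclusion check described above can be sketched as a leave-one-patient-out loop: refit the model once per excluded patient and record the variance explained each time. This is an illustration under stated assumptions, not the code behind the data product. A one-predictor least-squares line stands in for the elastic net, the score is computed in-sample, and the patient labels and values are made-up toy data.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def variance_explained(xs, ys):
    """In-sample R^2 (as a percentage) of the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = fmean(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 100.0 * (1.0 - ss_res / ss_tot)


def leave_one_out(records):
    """Map each patient label to the score obtained when that patient is excluded."""
    scores = {}
    for excluded in records:
        kept = [pair for label, pair in records.items() if label != excluded]
        xs = [x for x, _ in kept]
        ys = [y for _, y in kept]
        scores[excluded] = variance_explained(xs, ys)
    return scores


# Hypothetical toy cohort: label -> (predictor, outcome). Patient "E" is an
# extreme point that drives most of the fit.
records = {"A": (1, 2), "B": (2, 1), "C": (3, 2), "D": (4, 1), "E": (10, 9)}

baseline = variance_explained([x for x, _ in records.values()],
                              [y for _, y in records.values()])
scores = leave_one_out(records)
most_influential = min(scores, key=scores.get)
print(f"all patients: {baseline:.2f}%")
for label, score in scores.items():
    print(f"without {label}: {score:.2f}%")
```

In this toy cohort, dropping the single extreme patient collapses the score while dropping any other patient does not, which is the kind of fragility the exclusion experiment is meant to surface.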
Not only were we able to reproduce the data analysis presented in the research paper, we were able to make an interactive version of it and find new information.
With the data from the paper live on our platform, we were able to show how imputed values, where present, affect the final results. We were also able to define different cohorts of patients, examine how the model performs on each, and re-run the analyses in real time.
Additionally, we went beyond the results presented in the paper. We were able to show the effect of each predictor on the prediction, as can be seen in the table below. These results were not available within the paper itself, but they indicate the significance and importance of each predictor in the model.
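One common way to read per-predictor effects out of an elastic net is to fit it on standardized predictors and compare the resulting coefficients. The sketch below is a small coordinate-descent implementation written for illustration. It assumes the objective (1/2n)·‖y − Xw‖² + α·ρ·‖w‖₁ + ½·α·(1 − ρ)·‖w‖², with ρ the L1 ratio. The source does not say how the effects on the platform were computed, so the function names, penalty settings, and toy data here are all hypothetical.

```python
from statistics import fmean, pstdev


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 part of the penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def standardize(values):
    """Scale a column to mean 0 and population standard deviation 1."""
    mean, sd = fmean(values), pstdev(values)
    if sd == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(v - mean) / sd for v in values]


def elastic_net_effects(columns, target, alpha=0.1, l1_ratio=0.5, n_sweeps=200):
    """Return {predictor name: standardized elastic net coefficient}."""
    n = len(target)
    scaled = {name: standardize(col) for name, col in columns.items()}
    target_mean = fmean(target)
    # All coefficients start at zero, so the residual is the centered target.
    residual = [y - target_mean for y in target]
    effects = {name: 0.0 for name in scaled}
    l1_penalty = alpha * l1_ratio
    l2_penalty = alpha * (1.0 - l1_ratio)
    for _ in range(n_sweeps):
        for name, x in scaled.items():
            old = effects[name]
            # Covariance of this predictor with the residual that ignores it.
            rho = sum(xi * (ri + old * xi) for xi, ri in zip(x, residual)) / n
            # mean(x^2) is 1 after standardization, hence the 1.0 below.
            new = soft_threshold(rho, l1_penalty) / (1.0 + l2_penalty)
            if new != old:
                residual = [ri - (new - old) * xi for xi, ri in zip(x, residual)]
                effects[name] = new
    return effects


# Hypothetical toy data: the outcome depends on "signal" only.
columns = {"signal": [-3.0, -1.0, 1.0, 3.0], "noise": [1.0, -1.0, -1.0, 1.0]}
target = [-1.0, 3.0, 7.0, 11.0]
effects = elastic_net_effects(columns, target)
for name, value in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:.4f}")
```

Because the predictors are on a common scale, the magnitudes can be ranked directly, and a coefficient shrunk exactly to zero marks a predictor the model does not use.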
What we have demonstrated here is the platform’s capability to provide complete transparency into an iterative research process. This power gives domain experts the confidence to drive reproducible data analyses without necessarily knowing the mathematical intricacies of the underlying analysis algorithms.