Derive actionable information and maximize the value of your data with scalable, decentralized, domain-specific units of architecture.
What is a Tag.bio data product?
A Tag.bio data product is the fundamental building block of the data mesh architecture. It follows a domain-driven approach to ensure that each data product is usable by anyone.
A Tag.bio data product joins three components into an atomic unit of architecture: data mapping, algorithms, and a smart API.
Each data product can be developed independently by a single developer or a cross-functional team, promoting agile workflows. Working on one data product has no impact on the functioning or value of the others, which keeps the mesh of data products scalable. Additionally, each data product is containerized for orchestrated deployment into the data mesh and analysis platform.
Some of the benefits of Tag.bio data products are accessibility and reusability. For example, domain experts can access data products through the analysis platform, while data scientists can access them with their own data science tools. Data scientists can also take components of one data product and reuse them in another. Accessibility standardizes data access, and reusability dramatically reduces redundant work.
Each data product has a single anchoring entity type (sample, patient, medical encounter, etc.) that is used to build cohorts for analysis. Data is merged and joined to map attributes, e.g. omics data and longitudinal clinical data, to these entities.
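As an illustration of this entity-anchored mapping, here is a minimal sketch in pandas. The entity type ("patient") and attribute names are invented for illustration; Tag.bio's internal mapping machinery is not shown here.

```python
import pandas as pd

# Anchoring entity type: patient. Every attribute is mapped onto these rows.
patients = pd.DataFrame({"patient_id": ["P1", "P2", "P3"]})

# Two hypothetical data sources: longitudinal clinical data and omics data.
clinical = pd.DataFrame({
    "patient_id": ["P1", "P2"],
    "age": [54, 61],
})
expression = pd.DataFrame({
    "patient_id": ["P1", "P3"],
    "TP53_expr": [7.2, 5.9],
})

# Left-join each source onto the anchor so every entity keeps a row,
# even when a given attribute is missing for it.
mapped = (
    patients
    .merge(clinical, on="patient_id", how="left")
    .merge(expression, on="patient_id", how="left")
)
print(mapped)
```

The left joins preserve the full set of anchoring entities, so cohorts built later always refer back to the same entity rows regardless of which attributes are populated.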
How to bring data into a data product:
- Identify tabular data source(s) for ingestion via SQL queries or delimited files. Relational databases, data lakes, and CSV files are all potential candidates.
- Define a single entity type to be used as the topic of analysis applications. Data will be merged and joined to map attributes to these entities.
- Define low-code data parsers as configurable JSON templates. You can choose to map in all data from all tables, or just a subset of tables and columns.
- Capture mapped data as a versioned, compressed, immutable snapshot, so that all analyses and UDATs can be attributed to a single versioned snapshot. Future data updates are re-mapped to generate additional snapshots.
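Tag.bio's actual template schema is not documented here, so the following is purely a hypothetical sketch of what a low-code JSON parser template for the steps above might look like. Every field name is invented for illustration:

```json
{
  "entity": "patient",
  "source": {
    "type": "sql",
    "query": "SELECT patient_id, age, sex FROM clinical.visits"
  },
  "mappings": [
    { "column": "age", "attribute": "Age", "type": "numeric" },
    { "column": "sex", "attribute": "Sex", "type": "categorical" }
  ]
}
```

The key idea is that ingestion is declarative: a template names the entity, the tabular source, and the subset of columns to map, rather than requiring custom parsing code.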
Examples of data types:
- DNA-Seq (VCF, MAF)
- RNA-Seq (bulk and single-cell, spatial transcriptomics)
- Flow cytometry
- Compound screening
- Immune repertoire
- Clinical trials (outcomes and biomarkers)
- Longitudinal studies
- Annotation, ontology, pathways
- Machine behavior and maintenance
- Drug response & pKa studies
- Biomanufacturing yields
- Knockdown studies (RNAi, CRISPR)
- Metagenomics
- Gene expression
- DNA methylation
- Somatic mutations
- Germline variants
- High content screening
- Patient registries
- Medical tests
Examples of data sources:
- GEO & ArrayExpress
- Clinical trials
- UK Biobank
- Epic/Clarity EMR
- OMOP (Observational Medical Outcomes Partnership)
- Patient registries
- Any form of tabular data (CSV/TSV)
- Relational databases and Apache Spark (SQL)
- Data warehouses
- Data lakes
The algorithms are computational methods invoked by analysis apps. They can be classical statistical tests, complex scripts, or your own models (R, Python, ML/AI).
Statistics currently available:
- Hypergeometric test for categorical data and sets
- Student’s t-test and Mann-Whitney U test for numeric distributions
- Univariate and Multivariate Linear Regression for numeric relationships
- Paired analysis for repeated (longitudinal) measures
- Matched analysis for control of confounding variables
- Cox regression for survival analysis
- K-means and DBSCAN clustering for segmentation
- PCA, t-SNE and UMAP for projection/embedding
- Fast event sequence queries
- Pathway (systems) analysis via Hypergeometric test and GSEA
- Gene signature analysis via ssGSEA
- Chi-square test for categorical data
- Logistic Regression and Random Forest (also many other options) for prediction/classification
- Pearson/Spearman correlation for associations between numeric variables
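To give a flavor of one of the tests above, here is a standalone sketch of a Mann-Whitney U comparison between two cohorts, using SciPy. The cohort values are invented, and SciPy is used only for illustration; it is not necessarily how the platform implements the test internally.

```python
from scipy.stats import mannwhitneyu

# Hypothetical numeric attribute measured in two cohorts,
# e.g. expression values in cases vs. controls.
cohort_a = [7.2, 6.8, 8.1, 7.5, 6.9]
cohort_b = [5.1, 5.8, 4.9, 6.0, 5.5]

# Non-parametric test: does one cohort tend to have larger values?
stat, p_value = mannwhitneyu(cohort_a, cohort_b, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
```

Because the Mann-Whitney U test is rank-based, it makes no normality assumption, which is why it sits alongside Student's t-test in the list above as an option for numeric distributions.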
Reuse your own scripts or methods by integrating them into a data product, allowing you to leverage the enterprise features of the analysis platform.
Supported integrations:
- R integration (run any R algorithm/test) within an analysis app
- Python integration (run any Python algorithm/test) within an analysis app
- Machine learning libraries, such as the SMILE library
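As a sketch of what "integrating your own method" could look like, here is a small custom Python method written as a plain function returning a result dictionary. The function name, signature, and return convention are all invented for illustration; Tag.bio's actual integration API may differ.

```python
from statistics import mean

def fold_change(cohort: list[float], background: list[float]) -> dict:
    """Hypothetical custom metric an analysis app might invoke:
    mean expression in a cohort relative to a background group."""
    fc = mean(cohort) / mean(background)
    return {
        "fold_change": fc,
        "n_cohort": len(cohort),
        "n_background": len(background),
    }

# Example invocation with made-up values.
result = fold_change([4.0, 6.0], [2.0, 3.0])
print(result)
```

The point of a wrapper like this is separation of concerns: the method stays ordinary Python (or R), while the platform handles cohort construction, execution, and presentation of the returned result.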
The smart API is a universal communication schema that enables information transfer between data products, and from data products to the analysis platform.
Examples of data products communicating with one another:
- Using a data product to annotate data sent from another data product
- Using a data product to monitor analysis app usage by other data products
Examples of a data product communicating with the analysis platform:
- When a data product is visible to and accessible by end users on the analysis platform
- When an end user runs an analysis using an analysis app that is part of the data product
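To make the annotation example above concrete, here is a hypothetical sketch of one data product building a request for another through a smart API. The payload fields and product names are invented; this is not Tag.bio's actual schema, only an illustration of a shared, machine-readable request format.

```python
import json

# One data product asking an annotation data product to annotate
# a list of genes. All field names here are hypothetical.
request = {
    "target": "annotation-product",
    "app": "annotate-genes",
    "params": {"genes": ["TP53", "BRCA1"]},
}

# Serialize to JSON for transport between products.
payload = json.dumps(request)
print(payload)
```

A shared schema like this is what lets any data product call any other without bespoke point-to-point integration code.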