Build and deliver useful domain-driven data products

Derive actionable information and maximize the value of your data with scalable, decentralized, domain-specific units of architecture.

Tag.bio data products

What is a Tag.bio data product?

A Tag.bio data product is the fundamental building block of the data mesh architecture. It follows a domain-driven approach to ensure that each data product is usable by anyone.

A Tag.bio data product has three components (data mapping, algorithms, and a smart API) joined into an atomic unit of architecture.

Each data product can be developed independently by a single developer or a cross-functional team, promoting agile workflows. Working on one data product has no impact on the functioning or value of the other data products – this ensures that the mesh of data products is scalable. Additionally, each data product is containerized for orchestrated deployment into the data mesh and analysis platform.

The benefits of Tag.bio data products include accessibility and reusability. For example, domain experts can access data products through the analysis platform, while data scientists can access them with their own data science tools. Data scientists can also take components of one data product and reuse them in another. Accessibility helps standardize data access, and reusability reduces redundant work.

Data Product Components

Data Map

Each data product has a single anchoring entity type (sample, patient, medical encounter, etc.) that is used to build cohorts for analysis. Data is merged and joined to map attributes, e.g. omics data and longitudinal clinical data, to these entities.

How to bring data into a data product:

  • Identify tabular data source(s) for ingestion via SQL queries or delimited files. Relational databases, data lakes, and CSV files are all potential candidates.
  • Define a single entity type to be used as the topic of analysis applications. Data will be merged and joined to map attributes to these entities.
  • Define and configure low-code data parsers as JSON templates. You can choose to map in all data from all tables, or just a subset of tables and columns.
  • Mapped data is captured as a versioned, compressed, immutable snapshot – all analyses and UDATs can be attributed to a single versioned snapshot. Future data updates are re-mapped to generate additional snapshots.
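
The steps above can be illustrated with a minimal Python sketch. It is not Tag.bio's parser format or API: the template keys (`entity`, `tables`, `columns`, `key_column`, `value_column`), the inline CSV tables, and the function names are all hypothetical, chosen only to show two tables being joined onto one anchoring entity and captured as a compressed snapshot with a content-derived version.

```python
import csv
import gzip
import hashlib
import io
import json

# Hypothetical source tables, inlined as CSV text so the sketch is self-contained.
PATIENTS_CSV = "patient_id,age,sex\nP1,54,F\nP2,61,M\n"
LABS_CSV = "patient_id,test,value\nP1,hba1c,6.1\nP1,ldl,130\nP2,hba1c,7.4\n"

# Hypothetical parser template: which columns to map from which table.
TEMPLATE = {
    "entity": "patient_id",
    "tables": {
        "patients": {"columns": ["age", "sex"]},
        "labs": {"key_column": "test", "value_column": "value"},
    },
}


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def map_entities(template, patients, labs):
    """Join both tables onto the anchoring entity (one record per patient)."""
    entity_key = template["entity"]
    entities = {}
    for row in patients:
        columns = template["tables"]["patients"]["columns"]
        entities[row[entity_key]] = {c: row[c] for c in columns}
    lab_cfg = template["tables"]["labs"]
    for row in labs:
        # Long-format rows become named attributes on the matching entity.
        entity = entities.setdefault(row[entity_key], {})
        entity[row[lab_cfg["key_column"]]] = row[lab_cfg["value_column"]]
    return entities


def snapshot(entities):
    """Serialize deterministically, compress, and derive a content-based version."""
    payload = json.dumps(entities, sort_keys=True).encode("utf-8")
    version = hashlib.sha256(payload).hexdigest()[:12]
    return version, gzip.compress(payload, mtime=0)


entities = map_entities(TEMPLATE, read_rows(PATIENTS_CSV), read_rows(LABS_CSV))
version, blob = snapshot(entities)
print(version, len(blob), "bytes")
```

Deriving the version from the serialized content means that re-mapping unchanged data yields the same version, while any data update produces a new one. That is one simple way to make every analysis attributable to a specific snapshot.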

Examples of data types:

Life sciences:

  • DNA-Seq (VCF, MAF)
  • RNA-Seq (bulk and single-cell, spatial transcriptomics)
  • Proteomics
  • Flow cytometry
  • Compound screening
  • Immune repertoire
  • Clinical trials (outcomes and biomarkers)
  • Longitudinal studies
  • Annotation, ontology, pathways
  • Machine behavior and maintenance
  • Drug response & pKa studies
  • Biomanufacturing yields
  • Knockdown studies (RNAi, CRISPR)
  • Metagenomics
  • Gene expression
  • DNA methylation
  • Somatic mutations
  • Germline variants
  • High content screening

Healthcare:

  • EMR/EHR
  • Patient registries
  • Claims
  • Clinical trials
  • Medical tests
  • Administrative

Examples of data sources:

Life sciences:

  • cBioPortal
  • TCGA
  • dbGaP
  • GEO & ArrayExpress
  • Clinical trials
  • UK Biobank

Healthcare:

  • Epic/Clarity EMR
  • EPSi
  • OMOP (Observational Medical Outcomes Partnership)
  • Cerner
  • REDCap
  • FHIR
  • Patient registries

Generic:

  • Any form of tabular data (CSV/TSV)
  • Relational Databases and Apache Spark (SQL)
  • Data Warehouses
  • Data Lakes

Algorithms

The algorithms are computational methods invoked by analysis apps. The methods can be classical statistics, complex scripts, or your own models (R, Python, ML/AI).

Statistics currently available:
  • Hypergeometric test for categorical data and sets
  • Student’s t and Mann-Whitney U tests for numeric distributions
  • Univariate and Multivariate Linear Regression for numeric relationships
  • Paired analysis for repeated (longitudinal) measures
  • Matched analysis for control of confounding variables
  • Cox regression for survival analysis
  • K-means and DBSCAN clustering for segmentation
  • PCA, t-SNE and UMAP for projection/embedding
  • Fast event sequence queries
  • Pathway (systems) analysis via Hypergeometric test and GSEA
  • Gene signature analysis via ssGSEA
  • Chi-square test for categorical data
  • Logistic regression, random forest, and many other options for prediction/classification
  • Pearson/Spearman correlation for numeric distributions
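
As a worked illustration of the first item in the list, the sketch below computes a one-sided hypergeometric enrichment p-value using only the Python standard library. It shows the statistic itself, not Tag.bio's implementation. The function name and the cohort numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, successes, draws, observed):
    """Return P(X >= observed) when `draws` items are sampled without
    replacement from `population` items, `successes` of which carry the trait."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Hypothetical cohort: 20 patients, 5 carry a mutation; a cohort of 4 contains 2 carriers.
p_value = hypergeom_enrichment_p(population=20, successes=5, draws=4, observed=2)
print(f"p = {p_value:.4f}")
```

A small p-value would suggest that the trait is over-represented in the selected cohort relative to the full set of entities.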

Reuse your own scripts or methods by integrating them into the data product, which lets you leverage the enterprise features of the analysis platform.

Below are supported integrations:

  • R integration: run any R algorithm/test within an analysis app
  • Python integration: run any Python algorithm/test within an analysis app
  • Machine learning libraries, such as the SMILE library
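
To make the idea of plugging in your own method concrete, here is a minimal sketch of one possible pattern: a user-supplied Python function registered under a name and invoked on two cohorts. The integration mechanism is not described in this text, so the registry, the `register` decorator, `run_method`, and the `median_shift` method are all hypothetical and do not represent the platform's actual interface.

```python
import json
from statistics import median

# Hypothetical registry; the real integration mechanism is not described here.
METHODS = {}


def register(name):
    """Decorator that exposes a function under a name an app could reference."""
    def wrap(func):
        METHODS[name] = func
        return func
    return wrap


@register("median_shift")
def median_shift(cohort_a, cohort_b):
    """Example user-supplied method: difference between two cohort medians."""
    med_a, med_b = median(cohort_a), median(cohort_b)
    return {"median_a": med_a, "median_b": med_b, "shift": med_a - med_b}


def run_method(name, cohort_a, cohort_b):
    """Look up a registered method and return its result as JSON text."""
    result = METHODS[name](cohort_a, cohort_b)
    return json.dumps(result, sort_keys=True)


output = run_method("median_shift", [5.1, 6.3, 7.0], [4.0, 4.4, 5.2])
print(output)
```

Returning a plain, JSON-serializable dictionary keeps the custom method independent of whatever renders or stores the result.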

More on Data Science Integration

Smart API

The smart API is a universal communication schema that enables information transfer between data products, and from data products to the analysis platform.

Examples of data products communicating with one another:

  • Using a data product to annotate data sent from another data product
  • Using a data product to monitor analysis app usage by other data products

Examples of a data product communicating with the analysis platform:

  • When a data product is visible and accessible to end users on the analysis platform
  • When an end user runs an analysis with an analysis app that is part of the data product
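
The annotation example above can be sketched as a shared message envelope exchanged as JSON. This illustrates the general idea of a common schema and is not the actual smart API: the `Message` fields, the action names, and the product names `expression` and `annotation` are hypothetical, and `GENE_X` is a placeholder identifier.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Message:
    """Hypothetical envelope shared by every sender and receiver."""
    sender: str
    action: str
    payload: dict


# Hypothetical annotation table owned by an "annotation" data product.
GENE_ANNOTATIONS = {"TP53": "tumor suppressor", "KRAS": "oncogene"}


def handle(raw):
    """Annotation product: decode a request, annotate it, encode the reply."""
    request = Message(**json.loads(raw))
    if request.action != "annotate":
        reply = Message("annotation", "error", {"reason": "unsupported action"})
    else:
        genes = request.payload["genes"]
        annotated = {g: GENE_ANNOTATIONS.get(g, "unknown") for g in genes}
        reply = Message("annotation", "annotated", {"genes": annotated})
    return json.dumps(asdict(reply))


# An "expression" data product asks the annotation product to label two genes.
request = Message("expression", "annotate", {"genes": ["TP53", "GENE_X"]})
reply = json.loads(handle(json.dumps(asdict(request))))
print(reply)
```

Because both sides agree on one envelope, the same request could come from another data product or from the analysis platform without changing the receiver.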

More on Smart APIs

More on the Data Science Impact

Let’s get the conversation started

From a 30-minute demo to an inquiry about our 4-week pilot project, we are here to answer all of your questions!