Derive actionable information and maximize the value of your data with scalable, decentralized, domain-specific units of architecture.
What is a Tag.bio data node?
A data node is a domain-driven data product – the fundamental building block of the data mesh architecture.
A data node represents an application layer on top of a dataset, providing domain-specific functionality and a smart API which connects it into the data mesh.
Each data node can be prototyped and developed independently by a single developer or a cross-functional team, promoting agile workflows. Work on one data node has no impact on the functioning or value of the other data nodes – this ensures that the mesh of data nodes is scalable. Additionally, each data node is containerized for orchestrated deployment into the data mesh so end users can easily access it from the data portal.
How to bring data into a node:
- Identify tabular data source(s) for ingestion via SQL queries or delimited files. Relational databases, data lakes, and CSV files are all potential candidates.
- Define a single entity type to be used as the topic of analysis applications. Data will be merged and joined to map attributes to these entities.
- Define low-code data parsers as configurable JSON templates. You can choose to map in all data from all tables, or just a subset of tables and columns.
- Mapped data is captured as a versioned, compressed, immutable snapshot – all analyses and UDATs can be attributed to a single versioned snapshot. Future data updates are re-mapped to generate additional snapshots.
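The ingestion steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not Tag.bio's actual implementation: the "patient" entity type, the column names, and the snapshot layout are all assumptions made for the example.

```python
# Illustrative sketch of ingestion: join tabular sources onto one entity
# type, then capture a compressed, hash-versioned snapshot.
# All table/field names here are hypothetical.
import gzip
import hashlib
import json

# 1. Tabular source rows (could come from CSV files or SQL queries).
samples = [
    {"patient_id": "P1", "age": "54"},
    {"patient_id": "P2", "age": "61"},
]
labs = [
    {"patient_id": "P1", "glucose": "5.4"},
    {"patient_id": "P2", "glucose": "6.1"},
]

# 2. A single entity type ("patient") is the topic of analysis;
#    attributes from every table are merged onto that entity.
entities = {}
for table in (samples, labs):
    for row in table:
        entities.setdefault(row["patient_id"], {}).update(
            {k: v for k, v in row.items() if k != "patient_id"}
        )

# 3. Capture the mapped data as a compressed snapshot whose content
#    hash makes it immutable and attributable; future data updates
#    would be re-mapped into a new snapshot with a bumped version.
payload = json.dumps(entities, sort_keys=True).encode()
snapshot = {
    "version": 1,
    "sha256": hashlib.sha256(payload).hexdigest(),
    "data": gzip.compress(payload),
}
```

Because the snapshot stores a content hash alongside the version, any analysis result can be traced back to exactly the data it ran against.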
Examples of data types:
- DNA-Seq (VCF, MAF)
- RNA-Seq (bulk and single-cell, spatial transcriptomics)
- Flow cytometry
- Compound screening
- Immune repertoire
- Clinical trials (outcomes and biomarkers)
- Longitudinal studies
- Annotation, ontology, pathways
- Machine behavior and maintenance
- Drug response & pKa studies
- Biomanufacturing yields
- Knockdown studies (RNAi, CRISPR)
- Metagenomics
- Patient registries
- Medical tests
Examples of data sources:
- GEO & ArrayExpress
- Clinical trials
- UK Biobank
- Epic/Clarity EMR
- OMOP (Observational Medical Outcomes Partnership)
- Patient registries
- Any form of tabular data (CSV/TSV)
- Relational Databases and Apache Spark (SQL)
- Data Warehouses
- Data Lakes
Algorithms are computational methods invoked by analysis apps. They can be classical statistics, complex scripts, or your own models (R, Python, ML/AI).
Statistics currently available:
- Hypergeometric test for categorical data and sets
- Student’s t-test and Mann-Whitney U test for numeric distributions
- Univariate and Multivariate Linear Regression for numeric relationships
- Paired analysis for repeated (longitudinal) measures
- Matched analysis for control of confounding variables
- Cox regression for survival analysis
- K-means and DBSCAN clustering for segmentation
- PCA, t-SNE and UMAP for projection/embedding
- Fast event sequence queries
- Pathway (systems) analysis via Hypergeometric test and GSEA
- Gene signature analysis via ssGSEA
- Chi-square test for categorical data
- Logistic regression, random forest, and many other options for prediction/classification
- Pearson/Spearman correlation for relationships between numeric variables
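A few of the listed statistics are available off the shelf in SciPy; the sketch below shows the kind of calls an analysis app might invoke. The data values are made up for illustration.

```python
# Illustrative use of several statistics from the list above via SciPy.
from scipy import stats

control = [5.1, 4.8, 5.3, 5.0, 4.9]
treated = [6.2, 5.9, 6.4, 6.1, 6.0]

# Student's t-test and Mann-Whitney U test on two numeric distributions.
t_stat, t_p = stats.ttest_ind(control, treated)
u_stat, u_p = stats.mannwhitneyu(control, treated)

# Hypergeometric test for set enrichment: probability of drawing 8 or
# more "hits" in a sample of 10 from a population of 100 with 20 hits.
enrich_p = stats.hypergeom.sf(8 - 1, 100, 20, 10)

# Pearson and Spearman correlation between two numeric variables.
r, r_p = stats.pearsonr(control, treated)
rho, rho_p = stats.spearmanr(control, treated)
```

In a data node, these methods would run against entities in a versioned snapshot rather than hard-coded lists.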
Reuse your own scripts or methods by integrating them into the node, allowing you to leverage the enterprise features of the portal.
Below are supported integrations:
- R integration (run any R algorithm/test) within an analysis app
- Python integration (run any Python algorithm/test) within an analysis app
- Machine learning libraries, such as the SMILE library
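One common pattern for this kind of integration is registering your own function under a name that analysis apps can invoke. The sketch below is hypothetical: the `register` decorator, registry, and method name are invented for illustration, and the actual Tag.bio integration API may differ.

```python
# Hypothetical sketch of exposing a custom Python method to analysis apps.
# The registry/decorator pattern here is an assumption, not Tag.bio's API.
from statistics import median

REGISTRY = {}

def register(name):
    """Record a callable under a name an analysis app could invoke."""
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@register("fold_change")
def fold_change(case, control):
    """Median fold change between two numeric groups."""
    return median(case) / median(control)

# An analysis app would then invoke the method by its registered name:
result = REGISTRY["fold_change"]([8.0, 9.0, 10.0], [2.0, 2.5, 2.0])
```

The same pattern extends to R scripts or ML models: the node holds the registry, and apps refer to methods by name.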
The smart API enables node-to-node communication and user-to-node interaction using a standard language.
Examples of node-to-node communication:
- Using a node to annotate data sent from another node
- Using a node to monitor analysis app usage by other nodes
User-to-node interaction takes place through the analysis apps, which are embedded within the smart API.
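To make the idea of a standard language concrete, the sketch below shows what such requests might look like as plain JSON. The endpoint-free message shapes, app names, and field names are all invented for illustration; the real smart API schema may differ.

```python
# Hypothetical JSON messages in a smart-API "standard language".
# App names and field names are assumptions for illustration only.
import json

# A user-to-node interaction: run an analysis app with parameters.
request = {
    "app": "compare-groups",
    "snapshot": "v1",
    "params": {"group_by": "treatment_arm", "measure": "glucose"},
}

# A node-to-node interaction: ask an annotation node to decorate data
# sent from another node.
annotation_request = {
    "app": "annotate",
    "payload": {"genes": ["TP53", "BRCA1"]},
}

# Both travel as plain JSON, so any node or user client can speak it.
wire = json.dumps(request)
```

Because every message is expressed in the same language, a node can consume another node's output as easily as a user can drive it from the data portal.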