The 10,000 Immunomes


The 10,000 Immunomes Project is a reference dataset for human immunology, derived from over 10,000 control subjects in the NIAID ImmPort Database.


Available data include flow cytometry, CyTOF, multiplex ELISA, gene expression, HAI titers, clinical lab tests, HLA type, and others. Check out the Visualize page to view the available data. Visit the Download page to download a subset of the data or the entire dataset for your own use. More information can be found in our Cell Reports publication. Contact us with queries and bug reports. This app was last updated on 11/13/2018.


Please acknowledge the 10,000 Immunomes in your publications by citing the following reference:

Zalocusky KA, Kan MJ, Hu Z, Dunn P, Thomson E, Wiser J, Bhattacharya S, Butte AJ. The 10,000 Immunomes Project: Building a Resource for Human Immunology. Cell Reports. 2018 Oct 9;25(2):513-22. PMID: 30304689






A guide to data normalization procedures from the 10,000 Immunomes


CYTOF

CyTOF data from healthy human blood samples were downloaded from the ImmPort web portal. Every .fcs file was pared down to 5000 events. These .fcs files constitute the “raw” CyTOF: PBMC data. All data were arcsinh transformed; for CyTOF data, the formula f(x) = arcsinh(x/8) was used. Transformation and compensation were done using the preprocessing.batch function in MetaCyto (1). The cell definitions from the Human ImmunoPhenotyping Consortium (2) were used to identify 24 types of immune cells using the searchCluster.batch function in MetaCyto. Specifically, each marker in each cytometry panel was bisected into positive and negative regions, and cells fulfilling the cell definitions were identified. For example, the CD14+ CD33+ CD16- (CD16- monocytes) cell subset corresponds to the cells that fall into the CD14+ region, CD33+ region, and CD16- region concurrently. The proportion of each cell subset in the PBMC was then calculated by dividing the number of cells in the subset by the total number of cells in the blood. These steps together produce the “formatted” CyTOF: PBMC data. These data were then batch-corrected with an established empirical Bayes method (3), using study accession as the batch and age, sex, and race as known covariates, to produce the “formatted and normalized” CyTOF: PBMC data.
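
As a rough illustration of the first two steps (paring each file down to 5000 events and applying the arcsinh transform), here is a minimal Python sketch. It is not the MetaCyto R code. The guide does not say how events were selected, so random sampling without replacement is an assumption, and the function names, seed, and toy intensities are hypothetical. The cofactors 8 (CyTOF) and 150 (flow cytometry) are the ones quoted in this guide.

```python
import math
import random

# Cofactors quoted in this guide: arcsinh(x/8) for CyTOF, arcsinh(x/150) for flow.
COFACTORS = {"cytof": 8.0, "flow": 150.0}


def downsample(events, max_events=5000, seed=0):
    """Return at most max_events rows.

    Assumption: rows are drawn at random without replacement; the guide
    only says each file was "pared down to 5000 events".
    """
    events = list(events)
    if len(events) <= max_events:
        return events
    return random.Random(seed).sample(events, max_events)


def arcsinh_transform(events, platform):
    """Apply f(x) = arcsinh(x / cofactor) to every marker intensity."""
    cofactor = COFACTORS[platform]
    return [[math.asinh(x / cofactor) for x in row] for row in events]


# Toy input: 6000 events x 2 markers with hypothetical raw intensities.
rng = random.Random(1)
raw = [[rng.uniform(0, 1000), rng.uniform(0, 1000)] for _ in range(6000)]

subset = downsample(raw)
transformed = arcsinh_transform(subset, "cytof")
```

The transform is close to linear near zero and close to logarithmic for large intensities, which is why the cofactor differs between the two platforms.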


ELISA

Parsed ELISA data were downloaded from ImmPort. Analyte names were standardized to HUGO gene names where appropriate, and measurements were standardized to a common unit of measurement (pg/mL). These steps produced the “formatted” ELISA data. Because ELISA data are low-throughput, and most subjects have measurements for only one analyte, batch correction was conducted with a simple linear model for each analyte, mean correcting by study accession while accounting for age, sex, and race. These steps produced the “formatted and normalized” ELISA data.
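
One way to read "mean correcting by study accession while accounting for age, sex, and race" is: fit, for one analyte, a linear model with a study term plus the covariates, then subtract only the estimated study term. The Python sketch below follows that reading. It is an assumption about the procedure, not the project's code; the function names, study labels, and toy values are hypothetical, and only age is used as a covariate to keep the example small (sex and race would enter as additional numeric columns).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def mean_correct(values, studies, covariates):
    """Fit y ~ intercept + study + covariates and remove the study effect.

    The first study (alphabetically) is the reference level. Study effects
    are re-centred so the overall mean of the data is unchanged.
    """
    levels = sorted(set(studies))
    design = [
        [1.0] + [1.0 if s == lvl else 0.0 for lvl in levels[1:]] + list(cov)
        for s, cov in zip(studies, covariates)
    ]
    p = len(design[0])
    # Ordinary least squares via the normal equations (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, values)) for i in range(p)]
    beta = solve(xtx, xty)
    effect = {levels[0]: 0.0}
    effect.update(zip(levels[1:], beta[1:len(levels)]))
    mean_effect = sum(effect[s] for s in studies) / len(studies)
    return [y - (effect[s] - mean_effect) for y, s in zip(values, studies)]


# Toy data: two hypothetical studies; study B reads 5 units higher at any age.
studies = ["SDY_A"] * 3 + ["SDY_B"] * 3
ages = [20, 30, 40, 25, 35, 40]
values = [1 + 0.1 * age + (5 if s == "SDY_B" else 0) for age, s in zip(ages, studies)]

corrected = mean_correct(values, studies, [[age] for age in ages])
```

After correction, the two 40-year-old subjects (one from each study) have the same value, while the age trend is left in place.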


FLOW CYTOMETRY

Meta-analysis of cytometry data was conducted using the MetaCyto package (1). Briefly, flow cytometry data were downloaded from the ImmPort web portal. Every .fcs file was pared down to 5000 events. These .fcs files constitute the “raw” Flow Cytometry: PBMC data. Flow cytometry data from ImmPort were compensated for fluorescence spillover using the compensation matrix supplied in each .fcs file. All data from ImmPort were arcsinh transformed; for flow cytometry data, the formula f(x) = arcsinh(x/150) was used. Transformation and compensation were done using the preprocessing.batch function in MetaCyto (1). The cell definitions from the Human ImmunoPhenotyping Consortium (2) were used to identify 24 types of immune cells using the searchCluster.batch function in MetaCyto. Specifically, each marker in each cytometry panel was bisected into positive and negative regions, and cells fulfilling the cell definitions were identified. For example, the CD14+ CD33+ CD16- (CD16- monocytes) cell subset corresponds to the cells that fall into the CD14+ region, CD33+ region, and CD16- region concurrently. The proportion of each cell subset in the PBMC or whole blood was then calculated by dividing the number of cells in the subset by the total number of cells in the blood. These steps together produce the “formatted” Flow Cytometry: PBMC data. Because the flow cytometry data are sparse, batch correction was conducted with a simple linear model for each cell type, mean correcting by study accession while accounting for age, sex, and race. These steps produced the “formatted and normalized” Flow Cytometry data.
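
The bisection-and-intersection logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MetaCyto's implementation: the fixed cutoffs, the function names, and the four toy events are hypothetical, and in practice the cutoff for each marker would be derived from the data rather than hard-coded.

```python
def count_matching(events, definition, cutoffs):
    """Count events that satisfy a definition such as {"CD14": "+", "CD16": "-"}.

    A marker is called positive when its transformed intensity is at or
    above that marker's cutoff, and negative otherwise.
    """
    def matches(event):
        return all(
            (event[marker] >= cutoffs[marker]) == (sign == "+")
            for marker, sign in definition.items()
        )
    return sum(1 for event in events if matches(event))


def subset_proportion(events, definition, cutoffs):
    """Proportion of all events that fall in every required region at once."""
    return count_matching(events, definition, cutoffs) / len(events)


# Hypothetical cutoffs on the arcsinh scale, one per marker.
cutoffs = {"CD14": 2.0, "CD33": 2.0, "CD16": 2.0}

# The example definition from the text: CD14+ CD33+ CD16- (CD16- monocytes).
cd16_neg_monocytes = {"CD14": "+", "CD33": "+", "CD16": "-"}

events = [
    {"CD14": 3.1, "CD33": 2.8, "CD16": 0.4},  # in all three regions
    {"CD14": 3.0, "CD33": 2.5, "CD16": 3.2},  # CD16+, so excluded
    {"CD14": 0.2, "CD33": 0.1, "CD16": 0.3},  # CD14- and CD33-, so excluded
    {"CD14": 2.9, "CD33": 3.3, "CD16": 1.0},  # in all three regions
]

proportion = subset_proportion(events, cd16_neg_monocytes, cutoffs)
```

Markers that a definition does not mention are simply ignored, so the same function works for panels that measure different marker sets.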


GENE EXPRESSION

Gene expression array data were obtained in three formats. Data in their original formats (.CEL files, series matrix files, etc.) constitute the “raw” gene expression data. For data collected on Affymetrix platforms, we used the ReadAffy utility in the affy Bioconductor package to read in raw .CEL files. The rma utility was used to conduct Robust Multichip Average (RMA) background correction (as in (4)), quantile normalization, and log2 transformation of the data. For data collected on Illumina platforms and stored in the Gene Expression Omnibus (GEO) database, we used the getGEO utility in the GEOquery Bioconductor package to read the expression files and the preprocessCore package to conduct RMA background correction, quantile normalization, and log2 transformation of the gene expression data. Finally, for data collected on Illumina platforms but not stored in GEO, we used the read.ilmn utility of the limma Bioconductor package to read in the data, and the neqc function to rma background correct, quantile normalize, and log2 transform the gene expression data. In all instances, probe IDs were converted to Entrez Gene IDs. Where multiple probes mapped to the same Entrez Gene ID, the median value across probes was used to represent the expression value of the corresponding gene. The background-corrected and normalized datasets were combined based on common Entrez Gene IDs, and missing values were imputed with a k-nearest neighbors algorithm (R package: impute, function: impute.knn) using k = 10 and default values for rowmax, colmax, and maxp. Entrez Gene IDs were then converted to HUGO gene names. These steps together produced the “formatted” gene expression files. To create the “formatted and normalized” datasets, we used an established empirical Bayes algorithm for batch correction (3), compensating for possible batch effects while maintaining potential effects of age, race, and sex across datasets.
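
The probe-to-gene step (taking the median across probes that map to the same Entrez Gene ID) is easy to show in isolation. The Python sketch below is illustrative only: the probe IDs and expression values are made up, the function name is hypothetical, and dropping probes with no gene mapping is an assumption that the guide does not state.

```python
from collections import defaultdict
from statistics import median


def collapse_probes(expression, probe_to_gene):
    """Collapse probe-level values to one row per gene.

    expression: {probe_id: [value for each sample]}
    probe_to_gene: {probe_id: Entrez Gene ID}
    Returns {gene_id: [median across that gene's probes, per sample]}.
    Assumption: probes without a gene mapping are dropped.
    """
    rows_by_gene = defaultdict(list)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            rows_by_gene[gene].append(values)
    return {
        gene: [median(column) for column in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }


# Toy log2 expression values for two samples; probe IDs are hypothetical.
expression = {
    "probe_1": [8.0, 9.0],
    "probe_2": [8.4, 9.6],
    "probe_3": [7.0, 9.2],
    "probe_4": [5.0, 5.5],
    "probe_5": [6.0, 6.0],  # no mapping below, so it is dropped
}
probe_to_gene = {
    "probe_1": "7157",  # Entrez Gene ID for TP53
    "probe_2": "7157",
    "probe_3": "7157",
    "probe_4": "3569",  # Entrez Gene ID for IL6
}

collapsed = collapse_probes(expression, probe_to_gene)
```

The median is taken separately for each sample, so a single outlying probe does not dominate the gene-level value.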


HAI TITER

Parsed HAI data were downloaded from ImmPort. Names were standardized to WHO viral nomenclature where necessary. These steps produced the “formatted” HAI data. Because HAI data are low-throughput, and most subjects have measurements for only one to three of the viruses, batch correction was conducted with a simple linear model for each analyte, mean correcting by study accession while accounting for age, sex, and race. These steps produced the “formatted and normalized” HAI data.


LAB TESTS

Parsed lab test data were downloaded from ImmPort and organized into three standard panels: Complete Blood Count (CBC), Fasting Lipid Profile (FLP), and Comprehensive Metabolic Panel (CMP). Because FLP and CMP data are derived from only one study, no further standardization was required. These parsed data constitute the “formatted” lab test data for these two panels, and no “normalized” table is available. CBC data were derived from 12 different studies. As such, names of individual tests as well as units of measurement needed to be standardized for the data to be directly comparable. For example, cells reported as thousands of cells per microliter were variously described as “K/mi”, “K/”, “cells/mm3”, “thou/mcL”, “per”, “1000/microliter”, “10^3/mm3”, “10^3”, “1e3/uL”, “10*3/ul”, “/uL”, or “10^3 cells/uL”, and the names of assays were comparably variable. These standardization steps produced the “formatted” Lab Test: Blood Count data. These data were then batch corrected with a simple linear model for each analyte, mean correcting by study accession while accounting for age, sex, and race, to produce the “formatted and normalized” CBC data.
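
A small lookup table is one way to handle this kind of unit-label cleanup. The Python sketch below is a hypothetical illustration, not the project's code: the canonical label is an arbitrary choice, the function name is invented, and only the unambiguous spellings from the list above are included. It relabels units without rescaling values, on the assumption (taken from the text) that all of these labels already describe thousands of cells per microliter.

```python
# Canonical label chosen for this sketch (an arbitrary, hypothetical choice).
CANONICAL_UNIT = "10^3 cells/uL"

# Unambiguous spellings quoted in the guide. 1 mm3 equals 1 uL, so
# "10^3/mm3" names the same unit as "10^3 cells/uL".
UNIT_ALIASES = {
    "thou/mcL",
    "1000/microliter",
    "10^3/mm3",
    "1e3/uL",
    "10*3/ul",
    "10^3 cells/uL",
}

# Compare case-insensitively so "1E3/UL" and "1e3/uL" are treated alike.
_ALIASES_LOWER = {alias.lower() for alias in UNIT_ALIASES}


def standardize_unit(label):
    """Map a reported unit label to the canonical label.

    Unknown labels raise ValueError so they are reviewed by hand rather
    than silently passed through.
    """
    cleaned = label.strip().lower()
    if cleaned in _ALIASES_LOWER:
        return CANONICAL_UNIT
    raise ValueError(f"unrecognized unit label: {label!r}")


examples = [standardize_unit(u) for u in ["thou/mcL", " 10*3/UL ", "1000/microliter"]]
```

Failing loudly on unknown labels matters here because a truncated label cannot be resolved without checking the originating study.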


MULTIPLEX ELISA

Secreted protein data measured on the multiplex ELISA platform were collected from ImmPort studies SDY22, SDY23, SDY111, SDY113, SDY180, SDY202, SDY305, SDY311, SDY312, SDY315, SDY420, SDY472, SDY478, SDY514, SDY515, SDY519, and SDY720. Data were drawn from the ImmPort parsed data tables using RMySQL or loaded into R from user-submitted unparsed data tables. Across the studies that contribute data, there are disparities in the dilution of samples and in the units of measure in which the data are reported. We corrected for differences in dilution factor and units of measure across experiments and standardized the labels associated with each protein as HUGO gene symbols. This step produced the “formatted” Multiplex ELISA data table. Compensation for batch effects was conducted using an established empirical Bayes algorithm (3), with study accession representing batch and a model matrix that included the age, sex, and race of each subject. Data were log2 transformed before normalization to better fit the assumption that the data are normally distributed. The effectiveness of the log2 transform and of our batch correction efforts is detailed in the manuscript associated with this resource (5). These batch-corrected data represent the “formatted and normalized” Multiplex ELISA data.
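
To make the dilution, unit, and log2 steps concrete, here is a minimal Python sketch. Several details are assumptions rather than facts from this guide: that pg/mL is the common unit for multiplex data (the guide states this only for ELISA), that a reported value is multiplied by its dilution factor to recover the undiluted concentration, and that non-positive values are rejected before taking the logarithm. The function name and example numbers are hypothetical.

```python
import math

# Conversion factors to pg/mL (1 ng = 1e3 pg, 1 ug = 1e6 pg).
TO_PG_PER_ML = {"pg/mL": 1.0, "ng/mL": 1e3, "ug/mL": 1e6}


def to_log2_pg_per_ml(value, unit, dilution_factor=1.0):
    """Return log2 of the undiluted concentration in pg/mL.

    Assumption: `value` was measured on a sample diluted by
    `dilution_factor`, so the undiluted concentration is their product.
    """
    if value <= 0:
        raise ValueError("log2 is undefined for non-positive concentrations")
    concentration = value * dilution_factor * TO_PG_PER_ML[unit]
    return math.log2(concentration)


# A hypothetical reading of 2 ng/mL on a 1:4 diluted sample is 8000 pg/mL.
example = to_log2_pg_per_ml(2.0, "ng/mL", dilution_factor=4.0)
```

Putting every study on the same scale before the log2 transform means that batch differences show up as additive shifts, which is the form that batch-correction models expect.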


VIRUS NEUTRALIZATION TITER

Parsed VNT data were downloaded from ImmPort. Names were standardized to WHO viral nomenclature where necessary. These steps produced the “formatted” VNT data. Because VNT data are low-throughput, and most subjects have measurements for only one to three of the viruses, batch correction was conducted with a simple linear model for each analyte, mean correcting by study accession while accounting for age, sex, and race. These steps produced the “formatted and normalized” VNT data.


REFERENCES


1) Hu Z, Jujjavarapu C, Hughey JJ, Andorf S, Lee H, Gherardini PF, Spitzer MH, et al. Meta-analysis of Cytometry Data Reveals Racial Differences in Immune Cells. Cell Reports. 2018 Jul 31;24(5):1377-88.


2) Finak G, Langweiler M, Jaimes M, Malek M, Taghiyar J, Korin Y, et al. Standardizing Flow Cytometry Immunophenotyping Analysis from the Human ImmunoPhenotyping Consortium. Scientific Reports. 2016 Aug 10;6(1):20686.


3) Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007 Jan 1;8(1):118–27.


4) Irizarry RA, Hobbs B, Collin F, Beazer‐Barclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003 Apr 1;4(2):249–64.


5) Zalocusky KA, Kan MJ, Hu Z, Dunn P, Thomson E, Wiser J, Bhattacharya S, Butte AJ. The 10,000 Immunomes Project: Building a Resource for Human Immunology. Cell Reports. 2018 Oct 9;25(2):513-22. PMID: 30304689