Single-cell DNA methylome and 3D multi-omic atlas of the adult mouse brain

Analysis of Astro-Epen, Purkinje and Hindbrain Neuronal Clusters Using the MERFISH Dataset

Note that the precision of this analysis is highly dependent on the accuracy of CCFv3 registration. We are actively working to improve the accuracy of the information we provide in CCFv3 regions.

After carefully examining the spatial distribution of each cell type, we realized that some of the existing clusters in the Astro–Epen class did not fully capture the rich spatial gradients present in the dataset, and that some hindbrain neuronal clusters were highly heterogeneous. Several factors may contribute to such heterogeneous clusters. First, sampling of these cell types was not comprehensive: these cells can be difficult to capture because of the large number of non-neuronal cells present in these regions. Second, the hindbrain contains cell types that are vulnerable to tissue dissociation, and the cells that survive tend to leak more transcripts. We observed this in the Purkinje neuron population, which was very small in our dataset; after initial clustering we identified two Purkinje neuron clusters with high or low gene counts, and the low-gene-count cluster was designated a new low-quality cluster. Third, the transcriptomic differences among hindbrain neuronal types appear more subtle than those in other brain regions, making them more difficult to categorize. To address these problems, we re-clustered cells from the Astro–Epen class and the few highly heterogeneous hindbrain neuronal clusters with more relaxed thresholds: de.score.th = 80 and min.cells = 8. The refined clusters mapped to the MERFISH dataset with distinct spatial distributions and differential expression of marker genes.

For each cell, we predicted its cell type based on the majority vote of its spatial KNNs at the class, subclass and supertype levels. The confusion matrix between the actual cell-type identity and the predicted cell type (Extended Data Fig. 15) indicates which cell types colocalize in the same spatial locations.
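The majority-vote prediction can be sketched as follows; this is a minimal illustration assuming arrays of spatial coordinates and cell-type labels, with k = 30 as an arbitrary neighbourhood size (the actual parameters may differ).

```python
# Minimal sketch of majority-vote prediction from spatial K nearest neighbours.
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors


def predict_from_spatial_knn(coords, labels, k=30):
    """coords: cells x 2 (or 3) spatial coordinates; labels: cell-type calls at one level."""
    labels = np.asarray(labels)
    # k + 1 neighbours because each cell is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    _, idx = nn.kneighbors(coords)
    predicted = []
    for i, neighbours in enumerate(idx):
        neighbours = neighbours[neighbours != i][:k]  # drop the cell itself
        votes = Counter(labels[neighbours])
        predicted.append(votes.most_common(1)[0][0])
    return np.array(predicted)


# A confusion matrix between `labels` and the returned predictions then shows
# which cell types co-localize in space.
```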

We also developed a hierarchical version of this approach to assign cell-type identities for MERFISH, Multiome snRNA-seq or any external dataset using the 10xv3 dataset as reference, with different gene lists chosen according to the context. We randomly sample 80% of the genes from the marker list and perform 100 rounds of mapping. The mapping probability of a given cell type is the fraction of rounds in which a cell is assigned to that cell type.
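A minimal sketch of the bootstrapped mapping probability is shown below. The nearest-centroid assignment by Pearson correlation is an assumption for illustration, and the function and variable names are hypothetical.

```python
# In each round, 80% of the marker genes are sampled and query cells are assigned
# to the nearest reference cluster centroid; the mapping probability of a cell type
# is the fraction of rounds in which a cell is assigned to it.
import numpy as np


def mapping_probabilities(query, centroids, markers, n_rounds=100, frac=0.8, seed=0):
    """query: cells x genes; centroids: clusters x genes; markers: marker-gene indices."""
    rng = np.random.default_rng(seed)
    n_cells, n_clusters = query.shape[0], centroids.shape[0]
    counts = np.zeros((n_cells, n_clusters))
    for _ in range(n_rounds):
        sub = rng.choice(markers, size=int(frac * len(markers)), replace=False)
        q = query[:, sub] - query[:, sub].mean(axis=1, keepdims=True)
        c = centroids[:, sub] - centroids[:, sub].mean(axis=1, keepdims=True)
        # Pearson correlation between each query cell and each cluster centroid.
        corr = (q @ c.T) / (np.linalg.norm(q, axis=1)[:, None] *
                            np.linalg.norm(c, axis=1)[None, :] + 1e-12)
        counts[np.arange(n_cells), corr.argmax(axis=1)] += 1
    return counts / n_rounds  # cells x clusters matrix of mapping probabilities
```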

To measure the complexity of brain regions with respect to cell types, we used the Shannon diversity index. The Shannon diversity index quantifies the uncertainty in a system by considering the distribution of cell types: it combines richness (the number of distinct cell types) and evenness to provide a measure of diversity, reflecting the information content or disorder in the system. The higher the value, the more uniform the distribution of cell types. We used the diversity function from the R package vegan and computed the Shannon diversity index for CCFv3_level2 regions based on their composition of subclasses and supertypes.
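The analysis used the diversity function from the R package vegan; the following minimal Python equivalent illustrates the computation of H = −Σ p·ln p from cell-type counts in one region.

```python
# Shannon diversity index H = -sum(p * ln p) computed from cell-type counts
# within one CCFv3_level2 region.
import numpy as np


def shannon_diversity(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())


# Example: a region with an even mix of 4 subclasses is more diverse than one
# dominated by a single subclass.
print(shannon_diversity([25, 25, 25, 25]))  # ~1.386 (= ln 4)
print(shannon_diversity([97, 1, 1, 1]))     # ~0.168
```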

We found that, of the 556 terminal regions, only seven were completely missed by the MERFISH dataset.

Alignment to the Allen CCFv3 was performed by matching the above-mentioned scRNA-seq-derived region labels to their corresponding anatomical parcellations of the CCFv3. A label map was generated for each region by aggregating the cells assigned to that region within a 10 × 10 µm grid, transformed to the initial configuration using the computed rigid transforms. Using the corresponding anatomical labels, the ANTs registration framework was used to establish a 2.5D deformable spatial mapping between the MERFISH data and the CCFv3 in three major steps. (1) A 3D global affine (12 degrees of freedom) mapping was performed to align the CCFv3 into the MERFISH space. The resampled sections provided a section-wise 2D target space for each of the MERFISH sections. Because the CCFv3 is a continuous label set, this avoids artifacts that can result if resampling is instead performed on the MERFISH data, which can contain missing sections. (2) After establishing a resampled CCFv3 section for each MERFISH section, 2D affine registrations were performed to align each MERFISH section to the global anatomy of the CCFv3 brain. This addressed misalignments from the initial manual stacking of the MERFISH sections using the midline and provided a global mapping to initialize the local deformable mappings. (3) Finally, a 2D multi-scale, symmetric diffeomorphic registration (step size = 0.2, sigma = 3) was used on each section to map local anatomical differences between the corresponding MERFISH and CCFv3 structures. The global and section-wise mappings from each of these registration steps were preserved and concatenated (with appropriate inversions) to allow point-to-point mapping between the original MERFISH coordinate space and the CCFv3 space.
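As an illustration of the per-section deformable step, a condensed sketch using the ANTsPy (antspyx) interface to the ANTs framework is shown below. The file names and the use of the combined "SyN" transform are placeholders; the actual pipeline runs the 3D global affine, per-section 2D affine and 2D symmetric diffeomorphic steps separately and concatenates the resulting transforms.

```python
# Hypothetical per-section registration sketch with ANTsPy (antspyx).
import ants

merfish_2d = ants.image_read("merfish_section_labelmap.nii.gz")      # placeholder input
ccf_2d = ants.image_read("ccf_resampled_section_labelmap.nii.gz")    # placeholder input

# "SyN" runs an affine initialization followed by the symmetric diffeomorphic step.
reg = ants.registration(fixed=ccf_2d, moving=merfish_2d, type_of_transform="SyN")
merfish_in_ccf = reg["warpedmovout"]

# Map CCFv3 annotation labels back onto the MERFISH section with the inverse transforms,
# using nearest-neighbour interpolation to preserve label values.
labels_on_merfish = ants.apply_transforms(
    fixed=merfish_2d, moving=ccf_2d,
    transformlist=reg["invtransforms"], interpolator="nearestNeighbor")
```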

Sections were washed twice in sample wash buffer. Each section was stained with DAPI and PolyT stain for 15 min and then washed with formamide wash buffer. The formamide wash buffer was removed and replaced with sample prep wash buffer during MERSCOPE setup. 100 µl of RNase Inhibitor (New England BioLabs M0314L) was added to 250 µl of Imaging Buffer Activator (VIZGEN, 203000015), and this mixture was added via the cartridge activation port to a pre-thawed and mixed MERSCOPE Imaging cartridge (VIZGEN, 1040004). The MERSCOPE fluidics system was primed and 15 ml of mineral oil was added to the activation port. The flow chamber was assembled with the hybridized and cleared section coverslip according to VIZGEN specifications, and the imaging session was initiated after collection of a 10× mosaic DAPI image and selection of the imaging area. Specimens that passed the minimum count threshold were processed according to VIZGEN's proprietary protocol.

Sections were left at room temperature for 1.5 h to fully polymerize. Each section was then cleared in 5 ml of VIZGEN Clearing Solution in a humidified oven for 24 h or until clear, following the manufacturer's instructions.

Frozen mouse brains were sectioned using a cryostat with the cryochamber temperature set at −20 °C and the object temperature set at −22 °C. Brains were securely mounted by the cerebellum or by the olfactory region onto cryostat chucks using OCT (Sakura FineTek 4583). The tissue was trimmed with 20–50-µm sections, and 300-µm sections were generated for dissection; regions of interest were dissected from these sections. Nuclei were isolated using the RAISINs method125 with a few modifications, as described in a nuclei isolation protocol developed at AIBS126. Excised tissue dissectates were transferred to a plate containing ice-cold CST buffer and chopped with spring scissors for 10 min. The entire volume of the well was then transferred to a 50-ml conical tube while passing through a 100-µm filter, and the walls of the tube were washed using ST buffer. Next, the suspension was gently transferred to a 15-ml conical tube and centrifuged in a swinging-bucket centrifuge for 5 min at 500 rcf and 4 °C. After centrifugation, most of the supernatant was discarded, and pellets were resuspended in 0.1× lysis buffer and incubated on ice for 2 min. Following addition of 1 ml wash buffer, samples were gently filtered using a 20-µm filter and centrifuged as before. After centrifugation, most of the supernatant was discarded, pellets were resuspended in 10 µl chilled nuclei buffer, and nuclei were counted to determine the concentration. The nuclei concentration was 5,000 nuclei per µl.

Identifying Slc6a1 Expression in Subcortical GABAergic Neurons and Astrocytes Using the Taxonomy Tree and Supertype Clustering

A previous report suggests that co-transmission of dopamine and GABA by dopaminergic neurons in the STR does not depend on cell-autonomous synthesis of GABA but instead on presynaptic uptake of GABA from the extracellular space. Slc6a1 is expressed not only in GABAergic neurons and astrocytes but also in many subcortical glutamatergic neuron types, and the mechanism underlying this broad expression is unclear. We therefore did not include Slc6a1 in our criteria, to avoid false positives.

We first annotated each subclass with its most representative anatomical region and named it using a combination of the region, major neurotransmitter and marker genes. We then ordered the subclasses based on the taxonomy tree and assigned subclass IDs accordingly. Supertype names within each subclass were defined by combining the subclass name with the grouping numbers of supertypes within the subclass. Supertype IDs were assigned sequentially based on the taxonomy-tree order of subclasses and the group order of supertypes within each subclass. Cluster IDs were assigned based on the order of subclasses and supertypes. Final cluster names were assigned by combining the cluster ID with the name of the supertype that the cluster belonged to. We also assigned accession numbers to cell types following the nomenclature proposal from the Allen Institute.

The imputation results for 10xv2, 10xMulti and MERFISH, with the 10xv3 dataset as the anchor, were used to generate the integrated UMAP (Fig. 7a).

Source: A high-resolution transcriptomic and spatial atlas of cell types in the whole mouse brain

An Integration Method for Computing the Average Bias Vector Between mC and RNA through PCs of snmC-seq

We also tested other methods. To make them work for this large and highly complex dataset, we had to down-sample the datasets and dramatically increase the size of the neural network models to achieve reasonable performance. The large hyperparameter space still needs to be explored further to maximize performance.

The PCs derived from the first step were then integrated through these anchors using the same method as Seurat30. Rather than working on the raw feature space as in Seurat, our integration step projected the PCs of scRNA-seq (query, denoted $U_{\mathrm{r}}$) onto the PCs of snmC-seq (reference, denoted $U_{\mathrm{m}}$) while keeping the PCs of the reference dataset unchanged. This approximation considerably reduced the time and memory needed to compute the corrected high-dimensional matrix and redo the dimension reduction. For an anchor $k$ pairing mC cell $k_{\mathrm{m}}$ and RNA cell $k_{\mathrm{r}}$, $B_{k}=U_{\mathrm{m},k_{\mathrm{m}}}-U_{\mathrm{r},k_{\mathrm{r}}}$ was considered the bias vector between mC and RNA. For each query RNA cell, the weighted average bias of its 100 nearest anchors was computed and used to move the cell into the mC space. The distance between a query RNA cell and an anchor was defined as the Euclidean distance in the RNA dimension-reduction space between the query RNA cell and the RNA cell of the anchor, and the weights for the average bias were determined by these distances.
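A minimal sketch of this anchor-based correction is shown below; the exponential distance weighting is an assumption for illustration, and array names are hypothetical.

```python
# For each query RNA cell, average the bias vectors of its 100 nearest anchors
# (distance measured in the RNA PC space) with distance-derived weights and add
# the result to move the cell into the mC PC space.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def correct_rna_to_mc(U_rna, anchor_rna_idx, anchor_bias, n_anchors=100):
    """U_rna: RNA cells x PCs; anchor_rna_idx: index of the RNA cell of each anchor;
    anchor_bias: anchors x PCs bias vectors B_k = U_m[k_m] - U_r[k_r]."""
    nn = NearestNeighbors(n_neighbors=min(n_anchors, len(anchor_rna_idx)))
    nn.fit(U_rna[anchor_rna_idx])
    dist, idx = nn.kneighbors(U_rna)
    # Convert distances to weights (closer anchors contribute more); weighting
    # scheme here is illustrative, not the authors' exact kernel.
    w = np.exp(-dist / (dist.mean(axis=1, keepdims=True) + 1e-12))
    w /= w.sum(axis=1, keepdims=True)
    bias = np.einsum("ck,ckp->cp", w, anchor_bias[idx])
    return U_rna + bias  # corrected coordinates in the mC PC space
```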

Consensus Cell-Type Taxonomy Using a Common Graph and a Permutation-based Statistical Framework

We started with a tree containing the root, classes, subclasses and clusters. At each internal node, we selected markers that best discriminate the clusters under different child nodes and assigned the query cells to the child node with the nearest cluster centroid based on the selected markers. The process was repeated at each level until the query cells were mapped to the leaf level. This algorithm is implemented in the scrattch-mapping package and is publicly accessible (v0.2, https://github.com/AllenInstitute/scrattch.mapping).
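The released implementation is in the R package scrattch.mapping; the following Python sketch only illustrates the idea of descending the tree toward the most correlated child centroid over node-specific markers, with a hypothetical node data structure.

```python
# Hierarchical nearest-centroid mapping sketch: walk from the root to a leaf,
# at each node comparing the query cell with child centroids over that node's markers.
import numpy as np


def map_cell(cell_expr, node, gene_index):
    """node: dict with 'children' (list of child nodes, or None at a leaf),
    'markers' (gene names used at this node), 'centroids' (children x markers matrix)
    and 'name'. gene_index maps gene name -> column index in cell_expr."""
    while node["children"] is not None:
        cols = [gene_index[g] for g in node["markers"]]
        x = cell_expr[cols]
        # Correlation of the cell with each child centroid over node-specific markers
        # (correlation is an illustrative choice of "nearest").
        c = node["centroids"] - node["centroids"].mean(axis=1, keepdims=True)
        xc = x - x.mean()
        corr = (c @ xc) / (np.linalg.norm(c, axis=1) * np.linalg.norm(xc) + 1e-12)
        node = node["children"][int(np.argmax(corr))]
    return node["name"]  # leaf-level cluster label
```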

To produce one consensus cell-type taxonomy based on both the 10xv2 and 10xv3 datasets, we made the clustering method accessible via the scrattch.bigcat package, with minor changes to the analysis. To build the common graph that incorporates samples from all the datasets, both 10xv2 and 10xv3 were used as reference datasets. The key steps in the pipeline are: (1) select anchor cells for each reference dataset; (2) select high-variance genes in each reference dataset, prioritizing shared high-variance genes; (3) compute KNNs both within and across modalities; (4) compute Jaccard similarity based on shared neighbours; (5) perform Leiden clustering based on the Jaccard similarity; (6) merge clusters based on the total number and significance of conserved DEGs across modalities between similar cell types; (7) repeat steps 1–6 for cells within a cluster to obtain finer-resolution clusters until no more clusters can be found; and (8) concatenate all the clusters from all the iterative clustering steps and perform a final merging as in step 6. If a dataset had fewer cells than the required minimal number in a given cluster, that dataset was not used for DEG computation for pairs involving that cluster; this step allows detection of clusters unique to some of the datasets.
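Steps (4) and (5) can be illustrated with the short Python sketch below (the actual implementation is in the R package scrattch.bigcat); the KNN index array, the Jaccard edge construction and the resolution parameter are assumptions for illustration.

```python
# Build a Jaccard similarity graph from shared K nearest neighbours and cluster it
# with the Leiden algorithm.
import numpy as np
import igraph as ig
import leidenalg


def jaccard_leiden(knn, resolution=1.0, seed=0):
    """knn: cells x k array of neighbour indices."""
    n, k = knn.shape
    neighbour_sets = [set(row) for row in knn]
    edge_weights = {}
    for i in range(n):
        for j in knn[i]:
            a, b = (i, int(j)) if i < j else (int(j), i)
            if a == b or (a, b) in edge_weights:
                continue
            shared = len(neighbour_sets[a] & neighbour_sets[b])
            edge_weights[(a, b)] = shared / (2 * k - shared)  # Jaccard on KNN sets
    edges = [e for e, w in edge_weights.items() if w > 0]
    weights = [w for w in edge_weights.values() if w > 0]
    g = ig.Graph(n=n, edges=edges)
    part = leidenalg.find_partition(
        g, leidenalg.RBConfigurationVertexPartition,
        weights=weights, resolution_parameter=resolution, seed=seed)
    return np.array(part.membership)
```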

Using HVF-based clusters to calculate cluster enriched features (CEFs). We first performed clustering based on highly variable features and then used these clusters to select CEFs, which carry stronger marker-gene signatures of potential clusters. The concept of the CEF was introduced in ref. 77, from which our framework was adapted. CEF calling and the permutation-based statistical test are implemented in the ALLCools.clustering.cluster_enriched_features function.

For MERFISH data analysis, the instrument software was used for initial processing. We removed abnormal cells (artefactual segmentations and doublets) from the cell-by-gene matrix in each experiment based on the following criteria: (1) cell volume <30 μm3 or >2,000 μm3; (2) total RNA count <10 or >4,000; (3) total RNA count normalized by cell volume <0.05 or >5; (4) total genes detected <3; and (5) >5 blank probes detected (negative-control probes included in the gene panel). ALLCools integration functions were used to integrate the pass-QC MERFISH cells with the scRNA-seq datasets6.
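A sketch of this cell-level filter, assuming a per-cell metadata table with illustrative column names, is shown below.

```python
# Keep MERFISH cells passing the QC criteria listed above; column names are placeholders.
import pandas as pd


def filter_merfish_cells(meta: pd.DataFrame) -> pd.DataFrame:
    density = meta["total_counts"] / meta["volume"]
    keep = (
        meta["volume"].between(30, 2000)          # cell volume in um^3
        & meta["total_counts"].between(10, 4000)  # total RNA counts
        & density.between(0.05, 5)                # counts normalized by cell volume
        & (meta["n_genes"] >= 3)                  # total genes detected
        & (meta["n_blank_probes"] <= 5)           # negative-control blank probes
    )
    return meta.loc[keep]
```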

An Improved Method for Parallel Computation of KNN based on Annoy Index and Low-Dimensional Datasets

Compared with the previous version, the key improvement is in step 3, the computation of KNNs. We used the BiocNeighbors package (v1.16.0, https://github.com/LTLA/BiocNeighbors) to compute KNNs with the Annoy algorithm (v1.17.1, https://github.com/spotify/annoy), using Euclidean distance within modality and cosine distance across modalities. The Annoy index was built on the anchor cells of the reference dataset, and KNNs were computed in parallel for all query cells. Owing to the significantly increased dataset sizes, the Jaccard similarity graph can become extremely large and impossible to fit in memory. When the cluster membership of each modality was given as input to the integration method, we down-sampled cells by within-modality clusters to ensure preservation of rare cell types; the anchor cells were included in the down-sampled datasets. The cluster memberships of the remaining cells were imputed based on the KNNs computed in step 3 of the clustering.
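The paper uses the R packages BiocNeighbors and Annoy; the sketch below shows equivalent Annoy usage in Python for illustration, with the angular metric standing in for cosine distance.

```python
# Build an Annoy index on the reference anchor cells and query it for all cells.
import numpy as np
from annoy import AnnoyIndex


def annoy_knn(anchors, queries, k=15, metric="angular", n_trees=50):
    """anchors, queries: cells x dimensions arrays in the shared low-dimensional space."""
    dim = anchors.shape[1]
    index = AnnoyIndex(dim, metric)
    for i, vec in enumerate(anchors):
        index.add_item(i, vec.tolist())
    index.build(n_trees)
    # Each query is searched independently, so this loop parallelizes trivially
    # (for example, with multiprocessing or joblib).
    return np.array([index.get_nns_by_vector(q.tolist(), k) for q in queries])
```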

We identified 933 noise clusters with 153,598 cells in 10xv3 and 201 noise clusters with 38,073 cells in 10xv2. The 10xv3 noise clusters were removed from the integration analysis, but the 10xv2 noise clusters were accidentally included. Fortunately, most of the cells from the 10xv2 noise clusters were excluded in further QC steps after integration.

The goal of the integration analysis was to match cells or cell clusters between the two datasets at the highest granularity. We first separated the pass-QC datasets into integration groups based on independent cell-type annotation (described above or provided by the data generators) and dissection information. For instance, non-neuronal cells, IMNs and granule cells ('DG Glut' and 'CB Granule Glut') were separated from other neurons because they showed large global methylation differences from other neurons and were unbalanced in cell numbers across datasets owing to different sampling and sorting strategies. We then performed the integration iteratively within each group, using the co-clustering from the integrated low-dimensional space to match cells or clusters between the two datasets (see below), and proceeded to the next round of integration until the matched cells or clusters fulfilled the criteria. Details of the individual integrations are listed below. Supplementary Table 4 contains a map between the mC and m3C cluster annotations. Jupyter notebooks for each pairwise integration are provided in the Code availability section.

To remove low-quality cells, we developed a stringent QC process. Cells were first classified into broad cell classes after mapping to an existing, preliminary version of the taxonomy, and cell quality was assessed based on gene detection, QC score and doublet score. The QC score was calculated from the log-transformed expression of a set of genes whose expression decreases in poor-quality cells. These are housekeeping genes that are strongly expressed in nearly all cells with a very tight co-expression pattern that is anti-correlated with the nucleus-localized gene Malat1 (Supplementary Table 4). Of the 62 genes chosen, 30 are located on the mitochondrial chromosome, and there is evidence that transcripts of some of the others are translocated to the mitochondrial surface. We therefore used the QC score to measure the integrity of the cytoplasmic content and removed cells at the lowest end of the distribution, because their transcriptomes were very similar to those of single nuclei. Doublets were identified using a modified version of the DoubletFinder algorithm132 (available in scrattch.hicat, https://github.com/AllenInstitute/scrattch.hicat, v1.0.9) and removed when the doublet score was >0.3. Using QC-score and gene-count thresholds tailored to different cell classes, we filtered out 43% and 29% of cells, keeping 2,546,319 cells and 1,769,304 cells for the 10xv3 and 10xv2 data, respectively (Extended Data Fig. 1). Threshold parameters and the numbers of cells filtered are summarized in Supplementary Table 4. For example, for neurons (excluding granule cells) we used a gene-count cutoff of 2,000 and a QC-score cutoff of 200.
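A simplified sketch of the QC-score computation and thresholding is shown below; the CPM normalization and the exact log transform are assumptions for illustration.

```python
# QC score: sum of log-transformed expression of the QC gene set (genes whose
# expression drops in poor-quality cells), then class-specific thresholding.
import numpy as np


def qc_score(counts, qc_gene_idx):
    """counts: cells x genes raw count matrix; qc_gene_idx: columns of the 62 QC genes."""
    cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
    return np.log2(cpm[:, qc_gene_idx] + 1).sum(axis=1)


def pass_qc(counts, qc_gene_idx, gene_counts, qc_cutoff=200, gene_cutoff=2000):
    # Example thresholds quoted in the text for neurons (excluding granule cells).
    return (qc_score(counts, qc_gene_idx) >= qc_cutoff) & (gene_counts >= gene_cutoff)
```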

10x Genomics Multiome (10xMulti) libraries were sequenced on the Illumina NovaSeq 6000, and sequencing reads were aligned to the mouse reference downloaded from 10x Genomics, which includes the Ensembl GRCm38 (v98) FASTA and GENCODE (vM23) GTF files, using the 10x Genomics Cell Ranger ARC (v2.0) workflow with default parameters.

The Chromium Single Cell 3′ Reagent Kit v3 was used for 10xv3 processing. We followed the manufacturer’s instructions for cell capture, barcoding, reverse transcription, cDNA amplification and library construction128. We loaded 13,408 cells per port. We targeted a sequencing depth of 120,000 reads per cell; the actual average achieved was 83,190 ± 85,142 reads per cell across 482 libraries.

Cell isolation is difficult for some types of brain tissue. We therefore collected additional single-nucleus 10x Multiome data in midbrain and hindbrain regions to supplement cell types lost owing to technical limitations.

Transgenic driver lines were used for the isolation and enrichment of specific cell populations. The majority of cells were isolated from the pan-neuronal Snap25-IRES2-cre line; only a small number of donor mice were from other driver lines. For unbiased sampling without FACS, we used either Snap25-IRES2-cre/wt;Ai14/wt or Ai14/wt mice.

Dissected tissue pieces were digested with 30 U ml−1 papain (Worthington PAP2) in ACSF for 30 min at 30 °C. Because of the short incubation period in a dry oven and the indirect heat exchange, we set the oven temperature to 35 °C to reach a target solution temperature of 30 °C. Enzymatic digestion was quenched by exchanging the papain solution three times with quenching buffer (ACSF with 1% FBS and 0.2% BSA). Samples were incubated on ice for 5 min before trituration. The tissue in the quenching buffer was triturated through a fire-polished pipette; the tissue pieces were allowed to settle and the supernatant, containing dissociated cells, was transferred to a new tube. Fresh quenching buffer was added to the remaining tissue pieces, and trituration and supernatant transfer were repeated using fire-polished pipettes. The single-cell suspension was passed through a 70-µm filter into a conical tube containing 500 µl of high-BSA buffer at the bottom to cushion the cells during centrifugation. After the supernatant was discarded, the cell pellet was resuspended. We collected 1,508,284 cells without performing FACS. The concentration of the resuspended cells was quantified, and cells were immediately loaded onto the 10x Genomics Chromium controller.

Single cells were isolated following a cell-isolation protocol developed at AIBS23,121. The brain was dissected, submerged in artificial cerebrospinal fluid (ACSF), embedded in 2% agarose and sliced into 350-μm coronal sections on a compresstome (Precisionary Instruments). Images were taken during slicing. ROIs were then microdissected from the slices and dissociated into single cells as previously described23. Fluorescent images of each slice before and after ROI dissection were taken at the dissection microscope and used to document the exact locations of the ROIs on annotated coronal plates.

Mice were assigned to different time groups; to collect cells during the dark phase of the circadian cycle, a subset of mice was housed on a reversed 12:12 h light:dark cycle so that all brain dissections could be performed in the morning. Cells from 267 donors were collected during the light phase of the cycle, whereas for 50 donors, 1,121,542 cells across the whole brain were collected during the dark phase of the light:dark cycle (Supplementary Table 2).

The average number of donor mice contributing to each cluster is 19, with a median of 14; only 23 clusters have fewer than 4 donor animals. Thus, individual mouse variability should not affect cell-type identities.

We used 95 mice (41 female, 54 male) to collect 2,492,084 cells for 10xv2 and 222 mice (112 female, 110 male) to collect 4,466,283 cells for 10xv3. Animals were euthanized at P50–52 or P60–71. No statistical methods were used to predetermine the sample size. All donor animals are listed in Supplementary Table 2.

All experimental procedures related to the use of mice were approved by the Institutional Animal Care and Use Committee, in accordance with the guidelines of the National Institute of Neurological Disorders and Stroke. Mice were housed within the vivarium of the AIBS in a room with controlled temperature (21–22 °C) and humidity (40–51%), with no more than five adult animals of the same sex per cage. Mice had ad libitum access to food and water and were kept on either a regular or a reversed light:dark cycle. Mice were maintained on the C57BL/6J background. Mice with anophthalmia or microphthalmia were excluded.

Training Random Forest Models for Highly Variable Transcripts and Exons Using f-Regression and the ALLCools Dataset

After collecting all the features, we selected genes with highly variable transcripts and exons among cell groups for model training. Highly variable transcripts were selected on the basis of the following criteria: (1) mean TPM across cell groups of >0.2; (2) TPM standard deviation of >0.3; and (3) transcript body (TSS to TTS) length of >30 kb. Highly variable exons were selected based on their standard deviation and percentage difference. We then trained models for each gene to predict transcript TPMs from mC or 3C features in two steps (see the sketch below). First, we used the f_regression score function to select the top 100 features for each transcript from all the features described above. Second, we used the selected features to train random forest regression models (RandomForestRegressor). In each cross-validation run, we calculated the PCC between predicted and true values as the model performance. We also shuffled the selected features within each sample (Fig. 6c), retrained the model and recalculated the PCC values as the shuffled-model performance.
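A sketch of the two-step training for one transcript, using scikit-learn's f_regression and RandomForestRegressor, is shown below; the cross-validation scheme and hyperparameters are placeholders rather than the authors' settings.

```python
# Per-transcript model: select the top 100 features with f_regression, fit a random
# forest regressor, and score by the Pearson correlation between predicted and
# observed TPM in cross-validation.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold


def train_one_transcript(features, tpm, n_features=100, n_splits=5, seed=0):
    """features: cell groups x mC/3C features; tpm: transcript TPM per cell group."""
    pccs = []
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in kfold.split(features):
        selector = SelectKBest(f_regression, k=min(n_features, features.shape[1]))
        x_train = selector.fit_transform(features[train], tpm[train])
        x_test = selector.transform(features[test])
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(x_train, tpm[train])
        pccs.append(pearsonr(model.predict(x_test), tpm[test])[0])
    return float(np.mean(pccs))  # cross-validated PCC as model performance
```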

First, we quantified mC and m3C intragenic features for predicting alternative isoform and exon usage. We used exons, exon-flanking regions and intragenic DMRs as the mC features of the genes; exon-flanking regions were defined immediately upstream and downstream of each exon. We removed features with variance <0.01 and combined features with >90% overlap in their genome coordinates. For the 3C features, we used the F statistics from the section above.

The snmC-seq and snm3C-seq datasets were stored in the ALLC format, a tab-separated table compressed and indexed by bgzip. Methylation datasets were generated from the ALLC files using the generate-dataset command in the ALLCools package. We used non-overlapping 100-kb chromosome bins (chrom100k) of the mm10 genome for clustering analysis; gene-body regions ±2 kb for cluster annotation and integration with the companion transcriptome dataset; and non-overlapping 5-kb chromosome bins (chrom5k) for integration with the chromatin accessibility dataset. The integration analyses are described in the corresponding sections.

The Taiji framework94 was used to analyse the GRN for each cell subclass. This framework uses the personalized PageRank algorithm to propagate node and edge weight information across the network, calculating the importance of each TF. To add subclass information as network weights, we simplified the network by including only TF and target-gene nodes and weighted each gene node by the inverted gene-body mCH value in the subclass. Specifically, we first performed quantile normalization across all subclasses, then performed robust scaling of the matrix using sklearn.preprocessing.RobustScaler, which scales by a quantile range, and finally inverted the mCH fraction (because gene-body mCH is anti-correlated with gene expression).

To include the mCG fraction, edge weights were also added. Specifically, we performed the same quantile normalization and robust scaling on the mCG fractions of all DMRs involved in the network and calculated the inverted DMR mCG values in the same way.
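A sketch of this node- and edge-weight preprocessing (quantile normalization, robust scaling and inversion) is shown below; the quantile range and the sign-flip inversion are assumptions for illustration.

```python
# Quantile-normalize gene-body mCH (or DMR mCG) fractions across subclasses,
# robust-scale them, and invert so that lower methylation yields a larger weight.
import pandas as pd
import qnorm
from sklearn.preprocessing import RobustScaler


def network_weights(frac: pd.DataFrame) -> pd.DataFrame:
    """frac: features (genes or DMRs) x subclasses matrix of methylation fractions."""
    normed = qnorm.quantile_normalize(frac)             # columns = subclasses
    scaler = RobustScaler(quantile_range=(25.0, 75.0))  # assumed quantile range
    scaled = pd.DataFrame(scaler.fit_transform(normed),
                          index=normed.index, columns=normed.columns)
    return -scaled  # sign flip as a simple inversion (assumption)
```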

Source: Single-cell DNA methylome and 3D multi-omic atlas of the adult mouse brain

A Framework for Building a Gene Regulatory Network Based on DNA Methylome and Chromatin Conformation Profiles

We used an ensemble motif database from SCENIC+ (ref. 43), which contained 49,504 motif position weight matrices (PWMs) collected from 29 sources. Redundant motifs (highly similar PWMs) were combined into 8,045 motif clusters through clustering based on PWM distances calculated using TOMTOM92 by the authors of SCENIC+ (ref. 43). Each cluster was annotated with one or more mouse genes. We used the Cluster-Buster93 implementation in SCENIC+ to calculate the occurrence of motifs in a cluster.

We present a framework for building a GRN based on the DNA methylome and chromatin conformation profiles at the cell-subclass level. We used 212 neuronal cell subclasses, requiring each to have >100 cells in both the snmC and snm3C datasets. The same framework can be applied to other brain cell types or subsets of cells. The GRN was composed of relationships between genes and their binding elements. Pairwise edges were constructed between DMRs and target genes (DMR–target), TFs and target genes (TF–target) and TFs and DMRs (TF–DMR). Each pairwise edge was based on the correlation between the two genome elements across cell subclasses. We performed quantile normalization along subclasses using the Python package qnorm (v.0.8.0)88 to normalize the two matrices involved in calculating each correlation. Gene-body mCH fractions were used as a proxy for TF and target-gene expression, and DMR mCG fractions as a proxy for regulatory element status. Variable genes and TFs were selected if they were identified as CEFs in any subclass.

To investigate the relationship between genes and their surrounding chromatin interactions, we first normalized transcript-body mCH fractions using the Python package qnorm. We then calculated the PCC between the transcript-body mCH fraction and the highly variable interactions whose anchors overlapped the gene body. Similar to the domain-boundary correlation analysis, we shuffled the contact strengths and mCH fractions within each sample and used the shuffled matrices to compute a null distribution and estimate the FDR. We selected FDR < 0.001 as a significant correlation.

We performed the highly variable interaction analysis using the imputed cell-level contacts, with each interaction represented by one 10-kb × 10-kb pixel in the conformation matrix. We filtered out interactions with either anchor overlapping the ENCODE blacklist (v.2)75. We used ANOVA to test whether the contact strength of each interaction differed significantly across subclasses, with the F statistic quantifying how variable the interaction is across the brain. After visually inspecting the contact maps, we used F > 3 as the cut-off for highly variable contacts, which also fulfilled the FDR < 0.001 criterion. ANOVA was performed only on interactions with anchor distances between 50 kb and 5 Mb, given that increasing the distance led to only a limited increase in the number of significantly variable and gene-correlated interactions (Extended Data Fig. 10b).
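The per-interaction ANOVA can be sketched as follows, assuming a cells × interactions matrix of imputed contact strengths and a vector of subclass labels.

```python
# One-way ANOVA per interaction: compare contact strengths across subclasses and
# keep the F statistic as the measure of variability.
import numpy as np
from scipy.stats import f_oneway


def interaction_f_statistics(contacts, subclass_labels):
    """contacts: cells x interactions matrix of (imputed) contact strengths."""
    labels = np.asarray(subclass_labels)
    groups = [contacts[labels == s] for s in np.unique(labels)]
    f_stats = np.empty(contacts.shape[1])
    for j in range(contacts.shape[1]):
        f_stats[j], _ = f_oneway(*[g[:, j] for g in groups])
    return f_stats  # interactions with F > 3 were kept as highly variable
```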

We used the imputed cell-level contact matrices at 25-kb resolution to identify domain boundaries within each cell using the TopDom algorithm89. We first filtered out boundaries overlapping the ENCODE blacklist (v.2)75. The boundary probability of a bin was defined as the proportion of cells in which the bin was called a domain boundary among the total number of cells in the group or subclass. To identify differential domain boundaries between n cell subclasses, we derived an n × 2 contingency table for each 25-kb bin, in which each row contains the number of cells in a subclass with and without the bin called as a boundary. A chi-squared statistic and P value were computed for each bin, and FDR < 1 × 10−3 was used as the cut-off for calling a 25-kb bin as having differential boundary probability.
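A sketch of the per-bin test on the n × 2 contingency table is shown below; FDR correction across bins would follow.

```python
# Chi-squared test of differential boundary probability for one 25-kb bin.
import numpy as np
from scipy.stats import chi2_contingency


def boundary_test(boundary_counts, total_counts):
    """boundary_counts, total_counts: arrays of length n_subclasses for one bin."""
    table = np.column_stack([boundary_counts, total_counts - boundary_counts])
    chi2, pvalue, _, _ = chi2_contingency(table)
    return chi2, pvalue  # P values are then FDR-corrected across bins (cut-off 1e-3)
```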

We first performed quantile normalization along subclasses using the Python package qnorm (v.0.8.0)88 to normalize the mC fractions and compartment scores. We then calculated the PCC between the compartment score of each non-overlapping chromosome bin and the corresponding mCH or mCG fraction of the bin. The 300 most negatively correlated bins were selected for GO enrichment analysis because of their higher standard deviation. We randomly selected gene-length-matched background genes to adjust for long-gene bias in all GO enrichment analyses36. To investigate the developmental relevance indicated by the GO enrichment results, we used the developmental mouse brain scRNA-seq atlas37 at the subtype level (approximately the granularity of subclasses in this study). Genes overlapping the 300 most negatively correlated bins, the 300 most positively correlated bins and 300 low-correlation bins were used to construct the plot in Extended Data Fig. 9d.

Source: Single-cell DNA methylome and 3D multi-omic atlas of the adult mouse brain

Downsampled CCA for Integrating mC with RNA and ATAC Data Using TF–IDF and CCVs

LSI with log term frequency was used to compute the embedding. The binarized cell-by-bin matrix $B$ was converted to $X$ using the term frequencies (each row divided by its row sum) and the inverse document frequencies:

$$X_{ij}=\log ({\mathrm{TFreq}}_{ij}\times 100{,}000+1)\times {\mathrm{IDF}}_{j},$$

where ${\mathrm{TFreq}}_{ij}=B_{ij}/\sum_{j'=1}^{\mathrm{no.\,bins}}B_{ij'}$ and ${\mathrm{IDF}}_{j}=\log (1+\mathrm{no.\,cells}/\sum_{i'=1}^{\mathrm{no.\,cells}}B_{i'j})$. The cell embedding $U$ was computed using the singular value decomposition $X=USV^{T}$. We fit the LSI model with the mCG data $B_{\mathrm{m}}$ to derive $U_{\mathrm{m}}$; the intermediate matrices $S$ and $V$ and the vector IDF were then used to transform the ATAC data $B_{\mathrm{a}}$ to $U_{\mathrm{a}}$ via

$$X_{\mathrm{a},ij}=\log ({\mathrm{TFreq}}_{\mathrm{a},ij}\times 100{,}000+1)\times {\mathrm{IDF}}_{j}$$

and $U_{\mathrm{a}}=X_{\mathrm{a}}VS^{-1}$ (see the sketch below).

CCA was also performed with the downsampling framework, using 100,000 cells from each dataset as the reference and the remaining cells as the query, but taking the TF–IDF-transformed matrices as input: the query matrices $B_{\mathrm{a,qry}}$ and $B_{\mathrm{m,qry}}$ were converted to the TF–IDF space using the IDF of the reference cells, and the query cells were projected onto the same space using the CCVs of the reference cells. The subsequent steps to find anchors and align $U_{\mathrm{m}}$ and $U_{\mathrm{a}}$ were the same as for integrating the mC and RNA data.
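A minimal numpy sketch of the log-TF × IDF transform and the LSI projection defined above is shown below.

```python
# Fit the LSI model on one binarized cell-by-bin matrix and project a second
# matrix with the reference IDF, V and S, following X = U S V^T.
import numpy as np


def fit_lsi(B, n_components=50):
    tf = B / B.sum(axis=1, keepdims=True)
    idf = np.log(1 + B.shape[0] / B.sum(axis=0))
    X = np.log(tf * 100_000 + 1) * idf
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components], S[:n_components], Vt[:n_components], idf


def project_lsi(B_query, S, Vt, idf):
    tf = B_query / B_query.sum(axis=1, keepdims=True)
    X = np.log(tf * 100_000 + 1) * idf
    return X @ Vt.T / S  # U_query = X V S^-1
```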

PCA on gene-body signals was insufficient to capture the open-chromatin heterogeneity in snATAC-seq data10,30. It has been demonstrated that latent semantic indexing (LSI) applied to binarized cell-by-5-kb bin matrices gives promising results. Therefore, to align snATAC-seq data with snmC-seq data at high resolution, we developed an extended framework based on the previously described approach30 that uses binary sparse cell-by-5-kb bin matrices as input.

Because the mC and RNA cell-by-gene matrices are used as the input to CCA, it is hard to scale up to millions of cells. To handle this limitation, we randomly selected 100,000 cells from each dataset ($X_{\mathrm{ref}}$ and $Y_{\mathrm{ref}}$) as a reference to fit the CCA and transformed the other cells ($X_{\mathrm{qry}}$ and $Y_{\mathrm{qry}}$) onto the same CC space. Specifically, the canonical correlation vectors (CCVs) of $X_{\mathrm{ref}}$ and $Y_{\mathrm{ref}}$ (denoted $U_{\mathrm{ref}}$ and $V_{\mathrm{ref}}$, respectively) were computed using $U_{\mathrm{ref}}SV_{\mathrm{ref}}^{T}=X_{\mathrm{ref}}Y_{\mathrm{ref}}^{T}$, where $U_{\mathrm{ref}}^{T}U_{\mathrm{ref}}=I$ and $V_{\mathrm{ref}}^{T}V_{\mathrm{ref}}=I$. Then the CCVs of $X_{\mathrm{qry}}$ and $Y_{\mathrm{qry}}$ (denoted $U_{\mathrm{qry}}$ and $V_{\mathrm{qry}}$, respectively) were computed using $U_{\mathrm{qry}}=X_{\mathrm{qry}}(Y_{\mathrm{ref}}^{T}V_{\mathrm{ref}})/S$ and $V_{\mathrm{qry}}=Y_{\mathrm{qry}}(X_{\mathrm{ref}}^{T}U_{\mathrm{ref}})/S$. The embeddings of the reference and query cells were concatenated for anchor identification.
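A compact numpy sketch of this reference fit and query projection, following the formulas above, is shown below.

```python
# CCVs from the SVD of the reference cross-product matrix; query cells are then
# projected onto the same canonical space using the reference loadings.
import numpy as np


def fit_cca(X_ref, Y_ref, n_cc=50):
    """X_ref, Y_ref: reference cells x shared features for the two modalities."""
    U, S, Vt = np.linalg.svd(X_ref @ Y_ref.T, full_matrices=False)
    return U[:, :n_cc], Vt.T[:, :n_cc], S[:n_cc]


def project_cca(X_qry, Y_qry, X_ref, Y_ref, U_ref, V_ref, S):
    U_qry = X_qry @ (Y_ref.T @ V_ref) / S
    V_qry = Y_qry @ (X_ref.T @ U_ref) / S
    return U_qry, V_qry
```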

We applied the DMR-calling framework across subclasses of the whole mouse brain and across cell clusters within each major region; the combined sources of DMRs captured the diversity of CpG methylation fractions. The combination produced around 10 million unique but overlapping DMRs, which we merged to obtain a final non-overlapping list of 2.56 million DMRs. All non-overlapping DMRs are reported in the Data availability section. For calculating correlations or scanning motif occurrences, we started with the 10 million overlapping DMRs and then selected the DMR with the strongest value (that is, the most significant PCC or the highest motif score) among the overlapping ones, so that the results were non-overlapping.

For each clustering round, we assessed whether a cluster required additional clustering in the next iteration based on two criteria: (1) the final prediction-model accuracy exceeded 0.9; and (2) the cluster size surpassed 20. The four iterative clustering rounds produced 61 (L1), 411 (L2), 1,346 (L3) and 2,533 (L4) clusters. We further separated cells within L4 clusters by their brain dissection regions: dissection regions with more than 20 cells were kept separate, and smaller ones were merged with their nearest regions based on the average distance between them in the PC space of the L4 clustering. The final 4,673 cell groups thus combined L4 clusters and dissection regions. Using the dissection-region information, which provides insight into the physical location of a cell, enhanced the flexibility of the analysis. Furthermore, because generating pseudo-bulk profiles from cell-level data demands substantial computational resources, aggregating cells at this higher granularity initially made later merging more straightforward, such as combining them at the subclass level during subsequent analyses.

Consensus clustering. We first performed Leiden clustering82 200 times using different random seeds. We then combined these result labels to establish preliminary cluster labels, the number of which was typically larger than that from a single Leiden clustering owing to its inherent randomness82. Next, we trained prediction models in the PC space to compute the confusion matrix and predict cluster labels. Finally, we merged clusters with high similarity, guided by the R1- and R2-normalized confusion matrix, to minimize confusion.

Source: Single-cell DNA methylome and 3D multi-omic atlas of the adult mouse brain

Dimension Reduction of the Selected Cell-by-CEF Matrices and Visualization with t-SNE78 and UMAP79 Embeddings

We calculated principal components (PCs) from the selected cell-by-CEF matrices and generated t-SNE78 and UMAP79 embeddings for visualization; the openTSNE80 package was used for t-SNE.
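A minimal sketch of this embedding step with scikit-learn, openTSNE and umap-learn is shown below; parameter values are illustrative.

```python
# PCs from the cell-by-CEF matrix, followed by t-SNE (openTSNE) and UMAP embeddings.
import umap
from openTSNE import TSNE
from sklearn.decomposition import PCA


def embed(cell_by_cef, n_pcs=50):
    pcs = PCA(n_components=n_pcs).fit_transform(cell_by_cef)
    tsne_emb = TSNE(n_jobs=8, random_state=0).fit(pcs)      # openTSNE embedding
    umap_emb = umap.UMAP(random_state=0).fit_transform(pcs)
    return pcs, tsne_emb, umap_emb
```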

Both snmC-seq3 and snm3C-seq samples followed the same library preparation protocol (detailed in Supplementary Methods). This protocol was automated using Beckman Biomek i7 and Tecan Freedom EVO instruments to facilitate large-scale applications. The snmC-seq3 and snm3C-seq libraries were sequenced on an Illumina NovaSeq 6000 instrument, using one S4 flow cell per sixteen 384-well plates in 150-bp paired-end mode. The following software was used: BD Influx, FreedomEVOware, Illumina MiSeq control and NovaSeq 6000 control software.

The large tensor datasets were stored in the chunked and compressed Zarr format66 within the object storage of the Google Cloud Platform. Data analysis was performed using several packages. To facilitate large-scale computation, the Snakemake package65 was used to construct pipelines, whereas the SkyPilot package69 was used to set up cloud environments. Additionally, the ALLCools package (v.1.0.8) was updated to perform methylation-based cellular and genomic analyses, and the scHiCluster package70 (v.1.3.2) was updated for chromatin conformation analyses. Tensor storage and reproducibility-related details (package versions, analysis notebooks and pipeline files) are provided in the Data and Code availability sections; here we mainly describe the key analysis steps and parameters.

Primary QC for DNA methylome cells included the following criteria: (1) overall mCCC level of <0.05; (2) overall mCH level of <0.2; (3) overall mCG level of >0.5; (4) total final reads of >500,000 and <10,000,000; and (5) Bismark mapping rate of >0.5. Note that the mCCC level estimates the upper bound of the cell-level bisulfite non-conversion rate. Additionally, we calculated lambda DNA spike-in methylation levels to estimate the non-conversion rate for each sample; all samples demonstrated a low non-conversion rate (<0.01; Extended Data Fig. 2i). This primary filtering step prevented potential cell or cluster loss; potential doublets and remaining low-quality cells were further assessed by the clustering-based QC. For the 3C modality, cells were additionally required to have between 2,500 and 50,000 cis long-range contacts.

All experimental procedures using live animals were approved by the Salk Institute Animal Care and Use Committee under protocol number 18-00006. C57BL/6J male mice were purchased from the Jackson Laboratory at 7 weeks of age and maintained in the animal barrier facility with food ad libitum for up to 10 days. Brains were extracted (between P56 and P63), sliced and dissected in an ice-cold dissection buffer as previously described9. For snmC-seq3 samples, brains were sliced coronally at 600-µm intervals from the frontal pole across the whole brain, producing 18 slices, and brain regions were dissected according to the Allen Brain Reference Atlas CCF (version 3; CCFv3)59 (Extended Data Fig. 1b). For snm3C-seq samples, brains were sliced coronally at 1,200-µm intervals, resulting in a total of 9 slices, and 2–6 combined brain regions were dissected according to the CCFv3 (Extended Data Fig. 1b). For nuclei isolation, each dissected region was pooled from 3–30 animals, and 2–3 biological replicates were processed per region. Comprehensive brain dissection metadata are provided in Supplementary Table 1. No statistical methods were used to predetermine sample sizes. Two to three biological replicates were used for all single-cell epigenomic experiments as a minimum for reproducibility. Blinding and randomization were not performed during handling of the tissue samples. Details about the annotated voxel file used to register all dissected regions are provided in the Data availability section.
