Deciphering chromatin domain, domain community and chromunity for 3D genome maps with Mactop

Pairwise decomposition of multi-way data

In the pairwise decomposition process, given a set of multi-way reads (R=left{{r}_{1},{r}_{2},ldots ,{r}_{n}right}) consisting of n reads, we decompose it into all possible pairs of reads P. For every read r_i and r_j in R that interact, we define a pairwise relation p_ij. Subsequently, we construct an interaction matrix M, where M_ij represents the frequency of interaction between r_i and r_j. For each ({p}_{{ij}}in P), we update the matrix M as follows: ({M}_{{ij}}={M}_{{ij}}+1).

Constructing consistency matrix

The construction of a consistency matrix from a Hi-C matrix involves a process of iterative clustering with added noise, quantifying the co-occurrence of bin pairs in clusters across multiple iterations. Initially, the original Hi-C matrix is denoted as M, and n is defined as the number of clustering iterations with noise. For each iteration i (where I = 1,2,…,n), noise is added to M to create a perturbed matrix M_i, and clustering is performed to yield the result C_i. The consistency matrix CM is then initialized, with the dimension equals that of M, and for each pair of bins a,b, the frequency Count_a,b of them being clustered together across all n results is counted. The consistency value for each bin pair is calculated (C{M}_{a,b}=frac{{{mbox{Count}}}_{a,b}}{n}). After evaluating all bin pairs, CM emerges as the final consistency matrix, representing the proportion of times each pair of bins was co-clustered, thereby revealing the stable clustering patterns and underlying genomic region relationships in the Hi-C data.

Generate the randomized TADs

To generate randomized TADs, the lengths of original TADs are preserved while their positions are randomized. Initially, each TAD i in the original set is represented by its length ({L}_{i}={en}{d}_{i}-{{{{rm{start}}}}}_{i}). A list of indices (I=[1,2,ldots ,n]) corresponding to these TADs is created, where n is the total number of TADs. This list of indices is then randomized to produce a shuffled sequence I’. New TADs are generated by sequentially positioning them along the genomic region, starting from position 0. For each index j in I’, a new TAD is defined with a start position currentPos and an end position ({endPos}={currentPos}+{L}_{{I}_{j}^{{prime} }}), and currentPos is updated for the next TAD. The initial position of currentPos is defined as the first bin of the data. The randomized TADs comprise tuples (currentPos, endPos), maintaining the original lengths but with randomized genomic locations. This approach ensures that the randomized TAD set mimics the length distribution of the original ones, providing a basis for comparative analyses while varying their spatial positioning.

Markov clustering algorithm (MCL)

Let (G=(V,E)) be a graph where V is the set of vertices and E is the set of edges. Define A as the adjacency matrix of G, where A_ij represents the weight of the edge between vertices i and j. If i and j are not directly connected, ({A}_{{ij}}=0). Firstly, transform A into a stochastic matrix M by normalizing each row to sum to one, so that it represents transition probabilities of a random walk on the graph:

$${M}_{{ij}}=frac{{A}_{{ij}}}{{sum }_{k=1}^{n}{A}_{{ik}}}$$

(1)

The MCL algorithm then iteratively applies two operations, expansion and inflation, to matrix M:

1.

Expansion (Matrix Squaring): M is replaced by M², effectively simulating two steps of a random walk.
2.

Inflation: Each element of M is raised to the power γ (inflation parameter) and renormalized to maintain stochasticity. Inflation controls the granularity of clustering by reinforcing intra-cluster transitions and weakening inter-cluster transitions.

$${M}_{{ij}}=frac{{M}_{{ij}}^{gamma }}{{sum }_{k=1}^{n}{M}_{{ik}}^{gamma }}$$

(2)

After the inflation step, M is normalized to keep the matrix stochastic. These two steps are repeated until M converges, which means that multiple iterations do not significantly change M. The converged matrix M indicates the cluster structure of the graph, where dense blocks along the diagonal represent clusters.

Construct a TAD interaction network

For a set of TADs, denoted as (T=left{{t}_{1},{t}_{2},ldots ,{t}_{n}right}), where each t_i represents a TAD on the genome. We build a weighted graph (G=(T,E,w)) to represent TAD-TAD interaction, where the weight (w(e)) is defined as follows:

$$wleft(eright)=left{begin{array}{c}{mean}(I({t}_{i},{t}_{j})),{if , Z} , > , 0.05\ qquad 0 qquadquad , {if , Z} , < , 0.05end{array}right.$$

(3)

where Z is the z-score, calculated as (frac{bar{{I}_{{{{rm{tads}}}}}}-mu }{sigma }), where (bar{{I}_{{tads}}}) is the mean interaction strength between TADs t_i and t_j, μ is the mean of the background distribution of interactions, and σ is the standard deviation of the background distribution.

Construct a muti-way reads similar network

Given a subset of multi-way reads (tilde{D}), we build a weighted graph (G=(tilde{D},E,w)) of bin-wise concatemer overlap, where the weight (w(e)in [0,1]) on each edge (e=left{{d}_{1},{d}_{2}right}in E) represents the Jaccard similarity of concatemer pair ({d}_{1},{d}_{2}in tilde{D}). A second weighted graph (hat{G}=(tilde{D},hat{E},hat{w})) is then built with weights (hat{w}(e)) that represent the number of (k=25) nearest neighbors (that is, most similar) that are shared by each concatemer pair (ein hat{E}).

Recommended parameters

Mactop has two key parameters: Inflation and Variance. Inflation determines the granularity of clustering, while Variance affects the consistency of the results. We calculated the Silhouette coefficient across various parameter settings for both 10 kb and 50 kb data. The results suggest that the optimal settings are an Inflation value of 1.4 for 10 kb resolution and 1.5 for 50 kb resolution, with a consistent Variance of 0.2 for both (Supplementary Fig. S2G). Additional parameter recommendations are provided in Supplementary Table S1.

Hi-C matrix normalization

We used the iterative correction and eigenvector decomposition method (ICE) for Hi-C data normalization. ICE normalizes Hi-C data by iteratively eliminating biases introduced by experimental procedures and intrinsic genomic properties, assuming equal visibility for all genomic loci. This process results in a corrected matrix of relative contact probabilities, allowing for more accurate and unbiased comparisons within and between datasets. The parameters for the ICE method are listed in Supplementary Table S1.

TAD-calling methods

We compared Mactop with three representative TAD-calling methods: the directionality index (DI) method, the insulation score (IS) method, and TopDom. For all methods, we used default or recommended parameters. The parameter for those methods are listed in Supplementary Table S1.

Directionality index (DI)

The DI method calculates the DI score for each genomic bin to determine its tendency for more frequent interactions with either upstream or downstream regions, thus indicating its directional bias. These scores are then analyzed using a hidden Markov model to ascertain the state of each bin (such as upstream-biased, downstream-biased, or unbiased). Ultimately, bins exhibiting significant directional bias are identified as TAD boundaries.

Insulation score (IS)

The IS method defines a sliding window to calculate an insulation score for each genomic bin on a chromatin contact map. The score is derived from the average number of DNA-DNA interactions within the window. Local minima in the vector of these insulation scores are considered indicators of TAD boundaries, indicating regions where chromatin’s internal interactions exceed interactions with external regions, thereby defining the boundaries of TADs.

TopDom

TopDom applies a fixed-size sliding window on the Hi-C contact matrix to detect interaction patterns between chromatin regions. TopDom identifies local maxima of interaction frequencies within each window as candidate TAD boundaries, followed by calculating scores for these boundaries to determine the actual TAD borders. The scoring system in TopDom is based on the variation in interaction frequency near the boundaries with higher scores indicating a higher likelihood of being TAD boundaries.

TAD evaluation criteria

We evaluate the TAD quality through diverse enrichment metrics alongside internal validation metrics typically employed.

Internal validation metrics

Based on the characteristics of TADs in the Hi-C interaction matrix, various metrics have been designed to assess the properties of chromatin segments. We have chosen the Insulation score, Directionality Index, Contrast index, and the Silhouette Coefficient, a metric for evaluating clustering results, to measure the properties of TAD boundaries.

Insulation score ratio

For a specific bin b, we calculate the insulation scores for defined regions upstream (left({I}_{u}right)) and downstream (left({I}_{d}right)) of it. The length of these regions is denoted as l. Insulation scores for each bin x within these regions are computed using the formula ({I}_{x}=frac{{sum }_{k=x-w}^{x+w}{sum }_{l=y-w}^{y+w}{M}_{k,l}}{{w}^{2}}), where M_k,l is the interaction value in the interaction matrix, and w is the window size. The slope S is then determined as (S=frac{{I}_{max }-{I}_{min }}{2l}), where ({I}_{max }) and ({I}_{min }) are the maximum and minimum insulation scores in the combined upstream and downstream regions. This slope quantifies the variation in insulation scores and serves as a metric to evaluate TAD boundary characteristics in the specified region.

Directionality index (DI)

DI is computed for a defined length ((l)) around the bin b using the formula: (D{I}_{i}=frac{{sum }_{j=i+1}^{i+l}{M}_{i,j}-{sum }_{j=i-l}^{i-1}{M}_{i,j}}{{sum }_{j=i+1}^{i+l}{M}_{i,j}+{sum }_{j=i-l}^{i-1}{M}_{i,j}}), where M_i,j represents the interaction frequency between bins in the Hi-C matrix. The slope of the DI is then determined by calculating the difference between the maximum and minimum DI values within this region, divided by the total length of the region ((2l)). This slope provides a quantitative measure of the change in interaction directionality and is particularly useful for identifying and characterizing TAD boundaries in the genomic region surrounding the selected bin.

Contrast index (CI)

CI is calculated by assessing interactions between a chosen number of upstream and downstream bins (n) around a specified bin b. The formula incorporates the sum of interactions U and D across these n bins upstream and downstream, respectively, and is defined as (U={sum }_{i=b-n}^{b-1}{sum }_{j=b-n}^{b-1}{M}_{i,j}) and (D={sum }_{i=b+1}^{b+n}{sum }_{j=b+1}^{b+n}{M}_{i,j}), where ({M}_{i,j}) denotes the interaction frequency in the Hi-C matrix. CI is then calculated as ({CI}=frac{U+D}{2times left({sum }_{i=b-n}^{b}{sum}_{j=b}^{b+n}{M}_{i,i}right)}), with the denominator summing the internal interactions within the region around b. This index is especially useful for quantitatively analyzing interaction strengths and patterns, aiding in identifying and understanding the dynamics of chromatin interactions around TAD boundaries.

Enrichment analysis

Enrichment of known architectural proteins

We calculate the enrichment of three known architectural proteins (CTCF, RAD21, and SMC3) in the TAD boundaries of five cell lines from Rao et al.¹². TAD boundaries are defined by the starting bin and the ending bin of each predicted TAD, along with one preceding the starting bin and one following the ending bin. Let N be the total number of bins in a chromosome, n_bind be the number of bins with one or more ChIP-seq peaks or accessible motif sites, n_tad be the number of TAD boundary bins, and ({n}_{{tad}-{bind}}) be the number of TAD-boundary bins with a binding event (ChIP-seq peak or accessible motif match site). The fold enrichment for a particular protein is calculated as: (frac{{n}_{{tad}-{bind}}/{n}_{{tad}}}{{n}_{{bind}}/n}). Within each cell line, the fold enrichment across all chromosomes is averaged and then the mean across cell lines is used to rank the TAD-calling methods.

Histone modification enrichment

We use the proportion of predicted TADs that are significantly enriched in histone modification signals (compared to the “null” histone-modification signal distribution of randomly shuffled TADs) as a validation metric to assess the quality of TADs, similar to Roy et al.⁴⁴. For each TAD, we calculate the mean histone modification ChIP-seq signal within the TAD. Next, we find the “null” histone-modification signal distribution from randomly shuffled TADs. To generate randomly shuffled TADs, we take the lengths of all predicted TADs within a chromosome, as well as the lengths of interspersed stretches between the TADs (i.e., “non-TAD” stretches) if a TAD-calling method skips over regions of the genome. Next, we randomly move around the TAD and non-TAD stretches within the chromosome to preserve the TAD length distribution. We repeat this procedure ten times. Then, we compute the mean histone modification ChIP-seq signal within these randomly shuffled TADs, generating the null or background distribution of histone modification signals. The empirical p-value of a predicted TAD’s histone modification signal is calculated as the proportion of randomly shuffled TADs with higher ChIP-seq signal than that of the given TAD. A TAD was significantly enriched if its empirical p-value was less than 0.05, i.e., more than 95% of randomly shuffled TADs have a lower histone modification signal. Finally, we get the proportion of predicted TADs with significant histone modification signals.

Histone modification enrichment across the two TAD categories

We explored whether the TADs demarcated by different types of boundaries showed enrichment of various histone modifications. Thus, we first divided the TADs into different clusters according to their boundary types. We call a TAD stable if it is surrounded by two stable boundaries, and we call a TAD dynamic if its two boundaries contain at least one dynamic boundary. Furthermore, using the method mentioned in the Histone Modification Enrichment section, we examined the number of histone modification enrichments in different types of TADs and divided this by the total number of TADs of that type to obtain the histone modification enrichment proportion for each type of TAD.

Downsampling of the Hi-C interaction matrix

The downsampling of Hi-C data aims at simulating lower sequencing depths. This process methodically decreases the original interaction frequencies, maintaining a specific percentage, denoted as p, of the data. This is achieved through the following:

Define sampling depth percentage

Set the percentage of data to be retained, denoted as p, which is a value between 0 and 100.

Original interaction matrix

The original Hi-C interaction matrix is represented as M, where each element M_i,j indicates the frequency between genomic loci i and j.

Detailed downsampling process

For each interaction element M_i,j, we calculate the expected number of interactions to retain: ({E}_{i,j}={M}_{i,j}times frac{p}{100}), and generate a random number for each interaction, denoted as R_i,j, uniformly distributed between 0 and 1. The downsampled interaction ({M}_{i,j}^{{prime} }) is determined using a thresholding method:

$$left{begin{array}{c}{{lceil }} {E}_{i,j} {{rceil }},quad \ {{{lfloor }} {E}_{i,j}}{{rfloor }} ,hfill end{array}begin{array}{c}{{{rm{if}}}},{R}_{i,j}, < ,{E}_{i,j}-{lfloor {E}_{i,j}rfloor }\ {{{rm{otherwise}}}}hfill end{array}right.$$

(4)

where ({{lceil }}cdot {{rceil }}) is the ceiling function and ({{lfloor }}cdot {{rfloor }}) is the floor function. The matrix M’ represents the downsampled Hi-C data, with each element ({M}_{i,j}^{{prime} }) reflecting the interaction frequency adjusted for the chosen sampling depth p.

Community conservation score

Let C_i represent the community structure in the i-th cell line, and let B_j represent a specific bin on the chromosome. Define the conservation score S_i for bin B_j as:

$${S}_{j}=frac{1}{n}{sum }_{i=1}^{n}I({B}_{i}in {C}_{i})$$

(5)

where n is the total number of cell lines. (I({B}_{j}in {C}_{i})) is an indicator function, which equals 1 if bin B_j is part of a community in the i-th cell line (({B}_{j}in {C}_{i})), and 0 otherwise. For a specific community C within a cell line, define the Community Conservation Score S_C as the average of the conservation scores of all bins within that community:

$${S}_{C}=frac{1}{left|Cright|}{sum}_{{B}_{j}in C}{S}_{j}$$

(6)

where (left|Cright|) is the number of bins in community C.

Statistics and reproducibility

All data were presented as boxplots, showing the median and interquartile range (IQR), based on at least three independent analyses. The normality of data distribution was assessed using the Shapiro-Wilk test for each dataset. For the comparisons between any two groups, an independent two-sample t-test was used, and if the normality assumption was violated, a Mann-Whitney U test was applied. Statistical significance was defined as p ≤ 0.05. All hypothesis tests were performed using the SciPy Python library.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Deciphering chromatin domain, domain community and chromunity for 3D genome maps with Mactop

Pairwise decomposition of multi-way data

Constructing consistency matrix

Generate the randomized TADs

Markov clustering algorithm (MCL)

Construct a TAD interaction network

Construct a muti-way reads similar network

Recommended parameters

Hi-C matrix normalization

TAD-calling methods

Directionality index (DI)

Insulation score (IS)

TopDom

TAD evaluation criteria

Internal validation metrics

Insulation score ratio

Directionality index (DI)

Contrast index (CI)

Enrichment analysis

Enrichment of known architectural proteins

Histone modification enrichment

Histone modification enrichment across the two TAD categories

Downsampling of the Hi-C interaction matrix

Define sampling depth percentage

Original interaction matrix

Detailed downsampling process

Community conservation score

Statistics and reproducibility

Reporting summary

Continue Reading

More posts

Volcanic sulfur on Mars could have been helpful for life

At the Gates’ Tomas Lindberg’s introspective lyricism broke new ground in death metal | Music

The struggle in licensing video streamers is possible with flexibility

Vaccines Are A Hard Business. RFK Jr.’s CDC Is Making It Even Harder