Pairwise decomposition of multi-way data
In the pairwise decomposition process, given a set of multi-way reads (R=left{{r}_{1},{r}_{2},ldots ,{r}_{n}right}) consisting of n reads, we decompose it into all possible pairs of reads P. For every read ri and rj in R that interact, we define a pairwise relation pij. Subsequently, we construct an interaction matrix M, where Mij represents the frequency of interaction between ri and rj. For each ({p}_{{ij}}in P), we update the matrix M as follows: ({M}_{{ij}}={M}_{{ij}}+1).
Constructing consistency matrix
The construction of a consistency matrix from a Hi-C matrix involves a process of iterative clustering with added noise, quantifying the co-occurrence of bin pairs in clusters across multiple iterations. Initially, the original Hi-C matrix is denoted as M, and n is defined as the number of clustering iterations with noise. For each iteration i (where I = 1,2,…,n), noise is added to M to create a perturbed matrix Mi, and clustering is performed to yield the result Ci. The consistency matrix CM is then initialized, with the dimension equals that of M, and for each pair of bins a,b, the frequency Counta,b of them being clustered together across all n results is counted. The consistency value for each bin pair is calculated (C{M}_{a,b}=frac{{{mbox{Count}}}_{a,b}}{n}). After evaluating all bin pairs, CM emerges as the final consistency matrix, representing the proportion of times each pair of bins was co-clustered, thereby revealing the stable clustering patterns and underlying genomic region relationships in the Hi-C data.
Generate the randomized TADs
To generate randomized TADs, the lengths of original TADs are preserved while their positions are randomized. Initially, each TAD i in the original set is represented by its length ({L}_{i}={en}{d}_{i}-{{{{rm{start}}}}}_{i}). A list of indices (I=[1,2,ldots ,n]) corresponding to these TADs is created, where n is the total number of TADs. This list of indices is then randomized to produce a shuffled sequence I’. New TADs are generated by sequentially positioning them along the genomic region, starting from position 0. For each index j in I’, a new TAD is defined with a start position currentPos and an end position ({endPos}={currentPos}+{L}_{{I}_{j}^{{prime} }}), and currentPos is updated for the next TAD. The initial position of currentPos is defined as the first bin of the data. The randomized TADs comprise tuples (currentPos, endPos), maintaining the original lengths but with randomized genomic locations. This approach ensures that the randomized TAD set mimics the length distribution of the original ones, providing a basis for comparative analyses while varying their spatial positioning.
Markov clustering algorithm (MCL)
Let (G=(V,E)) be a graph where V is the set of vertices and E is the set of edges. Define A as the adjacency matrix of G, where Aij represents the weight of the edge between vertices i and j. If i and j are not directly connected, ({A}_{{ij}}=0). Firstly, transform A into a stochastic matrix M by normalizing each row to sum to one, so that it represents transition probabilities of a random walk on the graph:
$${M}_{{ij}}=frac{{A}_{{ij}}}{{sum }_{k=1}^{n}{A}_{{ik}}}$$
(1)
The MCL algorithm then iteratively applies two operations, expansion and inflation, to matrix M:
-
1.
Expansion (Matrix Squaring): M is replaced by M2, effectively simulating two steps of a random walk.
-
2.
Inflation: Each element of M is raised to the power γ (inflation parameter) and renormalized to maintain stochasticity. Inflation controls the granularity of clustering by reinforcing intra-cluster transitions and weakening inter-cluster transitions.
$${M}_{{ij}}=frac{{M}_{{ij}}^{gamma }}{{sum }_{k=1}^{n}{M}_{{ik}}^{gamma }}$$
(2)
After the inflation step, M is normalized to keep the matrix stochastic. These two steps are repeated until M converges, which means that multiple iterations do not significantly change M. The converged matrix M indicates the cluster structure of the graph, where dense blocks along the diagonal represent clusters.
Construct a TAD interaction network
For a set of TADs, denoted as (T=left{{t}_{1},{t}_{2},ldots ,{t}_{n}right}), where each ti represents a TAD on the genome. We build a weighted graph (G=(T,E,w)) to represent TAD-TAD interaction, where the weight (w(e)) is defined as follows:
$$wleft(eright)=left{begin{array}{c}{mean}(I({t}_{i},{t}_{j})),{if , Z} , > , 0.05\ qquad 0 qquadquad , {if , Z} , < , 0.05end{array}right.$$
(3)
where Z is the z-score, calculated as (frac{bar{{I}_{{{{rm{tads}}}}}}-mu }{sigma }), where (bar{{I}_{{tads}}}) is the mean interaction strength between TADs ti and tj, μ is the mean of the background distribution of interactions, and σ is the standard deviation of the background distribution.
Construct a muti-way reads similar network
Given a subset of multi-way reads (tilde{D}), we build a weighted graph (G=(tilde{D},E,w)) of bin-wise concatemer overlap, where the weight (w(e)in [0,1]) on each edge (e=left{{d}_{1},{d}_{2}right}in E) represents the Jaccard similarity of concatemer pair ({d}_{1},{d}_{2}in tilde{D}). A second weighted graph (hat{G}=(tilde{D},hat{E},hat{w})) is then built with weights (hat{w}(e)) that represent the number of (k=25) nearest neighbors (that is, most similar) that are shared by each concatemer pair (ein hat{E}).
Recommended parameters
Mactop has two key parameters: Inflation and Variance. Inflation determines the granularity of clustering, while Variance affects the consistency of the results. We calculated the Silhouette coefficient across various parameter settings for both 10 kb and 50 kb data. The results suggest that the optimal settings are an Inflation value of 1.4 for 10 kb resolution and 1.5 for 50 kb resolution, with a consistent Variance of 0.2 for both (Supplementary Fig. S2G). Additional parameter recommendations are provided in Supplementary Table S1.
Hi-C matrix normalization
We used the iterative correction and eigenvector decomposition method (ICE) for Hi-C data normalization. ICE normalizes Hi-C data by iteratively eliminating biases introduced by experimental procedures and intrinsic genomic properties, assuming equal visibility for all genomic loci. This process results in a corrected matrix of relative contact probabilities, allowing for more accurate and unbiased comparisons within and between datasets. The parameters for the ICE method are listed in Supplementary Table S1.
TAD-calling methods
We compared Mactop with three representative TAD-calling methods: the directionality index (DI) method, the insulation score (IS) method, and TopDom. For all methods, we used default or recommended parameters. The parameter for those methods are listed in Supplementary Table S1.
Directionality index (DI)
The DI method calculates the DI score for each genomic bin to determine its tendency for more frequent interactions with either upstream or downstream regions, thus indicating its directional bias. These scores are then analyzed using a hidden Markov model to ascertain the state of each bin (such as upstream-biased, downstream-biased, or unbiased). Ultimately, bins exhibiting significant directional bias are identified as TAD boundaries.
Insulation score (IS)
The IS method defines a sliding window to calculate an insulation score for each genomic bin on a chromatin contact map. The score is derived from the average number of DNA-DNA interactions within the window. Local minima in the vector of these insulation scores are considered indicators of TAD boundaries, indicating regions where chromatin’s internal interactions exceed interactions with external regions, thereby defining the boundaries of TADs.
TopDom
TopDom applies a fixed-size sliding window on the Hi-C contact matrix to detect interaction patterns between chromatin regions. TopDom identifies local maxima of interaction frequencies within each window as candidate TAD boundaries, followed by calculating scores for these boundaries to determine the actual TAD borders. The scoring system in TopDom is based on the variation in interaction frequency near the boundaries with higher scores indicating a higher likelihood of being TAD boundaries.
TAD evaluation criteria
We evaluate the TAD quality through diverse enrichment metrics alongside internal validation metrics typically employed.
Internal validation metrics
Based on the characteristics of TADs in the Hi-C interaction matrix, various metrics have been designed to assess the properties of chromatin segments. We have chosen the Insulation score, Directionality Index, Contrast index, and the Silhouette Coefficient, a metric for evaluating clustering results, to measure the properties of TAD boundaries.
Insulation score ratio
For a specific bin b, we calculate the insulation scores for defined regions upstream (left({I}_{u}right)) and downstream (left({I}_{d}right)) of it. The length of these regions is denoted as l. Insulation scores for each bin x within these regions are computed using the formula ({I}_{x}=frac{{sum }_{k=x-w}^{x+w}{sum }_{l=y-w}^{y+w}{M}_{k,l}}{{w}^{2}}), where Mk,l is the interaction value in the interaction matrix, and w is the window size. The slope S is then determined as (S=frac{{I}_{max }-{I}_{min }}{2l}), where ({I}_{max }) and ({I}_{min }) are the maximum and minimum insulation scores in the combined upstream and downstream regions. This slope quantifies the variation in insulation scores and serves as a metric to evaluate TAD boundary characteristics in the specified region.
Directionality index (DI)
DI is computed for a defined length ((l)) around the bin b using the formula: (D{I}_{i}=frac{{sum }_{j=i+1}^{i+l}{M}_{i,j}-{sum }_{j=i-l}^{i-1}{M}_{i,j}}{{sum }_{j=i+1}^{i+l}{M}_{i,j}+{sum }_{j=i-l}^{i-1}{M}_{i,j}}), where Mi,j represents the interaction frequency between bins in the Hi-C matrix. The slope of the DI is then determined by calculating the difference between the maximum and minimum DI values within this region, divided by the total length of the region ((2l)). This slope provides a quantitative measure of the change in interaction directionality and is particularly useful for identifying and characterizing TAD boundaries in the genomic region surrounding the selected bin.
Contrast index (CI)
CI is calculated by assessing interactions between a chosen number of upstream and downstream bins (n) around a specified bin b. The formula incorporates the sum of interactions U and D across these n bins upstream and downstream, respectively, and is defined as (U={sum }_{i=b-n}^{b-1}{sum }_{j=b-n}^{b-1}{M}_{i,j}) and (D={sum }_{i=b+1}^{b+n}{sum }_{j=b+1}^{b+n}{M}_{i,j}), where ({M}_{i,j}) denotes the interaction frequency in the Hi-C matrix. CI is then calculated as ({CI}=frac{U+D}{2times left({sum }_{i=b-n}^{b}{sum}_{j=b}^{b+n}{M}_{i,i}right)}), with the denominator summing the internal interactions within the region around b. This index is especially useful for quantitatively analyzing interaction strengths and patterns, aiding in identifying and understanding the dynamics of chromatin interactions around TAD boundaries.
Enrichment analysis
Enrichment of known architectural proteins
We calculate the enrichment of three known architectural proteins (CTCF, RAD21, and SMC3) in the TAD boundaries of five cell lines from Rao et al.12. TAD boundaries are defined by the starting bin and the ending bin of each predicted TAD, along with one preceding the starting bin and one following the ending bin. Let N be the total number of bins in a chromosome, nbind be the number of bins with one or more ChIP-seq peaks or accessible motif sites, ntad be the number of TAD boundary bins, and ({n}_{{tad}-{bind}}) be the number of TAD-boundary bins with a binding event (ChIP-seq peak or accessible motif match site). The fold enrichment for a particular protein is calculated as: (frac{{n}_{{tad}-{bind}}/{n}_{{tad}}}{{n}_{{bind}}/n}). Within each cell line, the fold enrichment across all chromosomes is averaged and then the mean across cell lines is used to rank the TAD-calling methods.
Histone modification enrichment
We use the proportion of predicted TADs that are significantly enriched in histone modification signals (compared to the “null” histone-modification signal distribution of randomly shuffled TADs) as a validation metric to assess the quality of TADs, similar to Roy et al.44. For each TAD, we calculate the mean histone modification ChIP-seq signal within the TAD. Next, we find the “null” histone-modification signal distribution from randomly shuffled TADs. To generate randomly shuffled TADs, we take the lengths of all predicted TADs within a chromosome, as well as the lengths of interspersed stretches between the TADs (i.e., “non-TAD” stretches) if a TAD-calling method skips over regions of the genome. Next, we randomly move around the TAD and non-TAD stretches within the chromosome to preserve the TAD length distribution. We repeat this procedure ten times. Then, we compute the mean histone modification ChIP-seq signal within these randomly shuffled TADs, generating the null or background distribution of histone modification signals. The empirical p-value of a predicted TAD’s histone modification signal is calculated as the proportion of randomly shuffled TADs with higher ChIP-seq signal than that of the given TAD. A TAD was significantly enriched if its empirical p-value was less than 0.05, i.e., more than 95% of randomly shuffled TADs have a lower histone modification signal. Finally, we get the proportion of predicted TADs with significant histone modification signals.
Histone modification enrichment across the two TAD categories
We explored whether the TADs demarcated by different types of boundaries showed enrichment of various histone modifications. Thus, we first divided the TADs into different clusters according to their boundary types. We call a TAD stable if it is surrounded by two stable boundaries, and we call a TAD dynamic if its two boundaries contain at least one dynamic boundary. Furthermore, using the method mentioned in the Histone Modification Enrichment section, we examined the number of histone modification enrichments in different types of TADs and divided this by the total number of TADs of that type to obtain the histone modification enrichment proportion for each type of TAD.
Downsampling of the Hi-C interaction matrix
The downsampling of Hi-C data aims at simulating lower sequencing depths. This process methodically decreases the original interaction frequencies, maintaining a specific percentage, denoted as p, of the data. This is achieved through the following:
Define sampling depth percentage
Set the percentage of data to be retained, denoted as p, which is a value between 0 and 100.
Original interaction matrix
The original Hi-C interaction matrix is represented as M, where each element Mi,j indicates the frequency between genomic loci i and j.
Detailed downsampling process
For each interaction element Mi,j, we calculate the expected number of interactions to retain: ({E}_{i,j}={M}_{i,j}times frac{p}{100}), and generate a random number for each interaction, denoted as Ri,j, uniformly distributed between 0 and 1. The downsampled interaction ({M}_{i,j}^{{prime} }) is determined using a thresholding method:
$$left{begin{array}{c}{{lceil }} {E}_{i,j} {{rceil }},quad \ {{{lfloor }} {E}_{i,j}}{{rfloor }} ,hfill end{array}begin{array}{c}{{{rm{if}}}},{R}_{i,j}, < ,{E}_{i,j}-{lfloor {E}_{i,j}rfloor }\ {{{rm{otherwise}}}}hfill end{array}right.$$
(4)
where ({{lceil }}cdot {{rceil }}) is the ceiling function and ({{lfloor }}cdot {{rfloor }}) is the floor function. The matrix M’ represents the downsampled Hi-C data, with each element ({M}_{i,j}^{{prime} }) reflecting the interaction frequency adjusted for the chosen sampling depth p.
Community conservation score
Let Ci represent the community structure in the i-th cell line, and let Bj represent a specific bin on the chromosome. Define the conservation score Si for bin Bj as:
$${S}_{j}=frac{1}{n}{sum }_{i=1}^{n}I({B}_{i}in {C}_{i})$$
(5)
where n is the total number of cell lines. (I({B}_{j}in {C}_{i})) is an indicator function, which equals 1 if bin Bj is part of a community in the i-th cell line (({B}_{j}in {C}_{i})), and 0 otherwise. For a specific community C within a cell line, define the Community Conservation Score SC as the average of the conservation scores of all bins within that community:
$${S}_{C}=frac{1}{left|Cright|}{sum}_{{B}_{j}in C}{S}_{j}$$
(6)
where (left|Cright|) is the number of bins in community C.
Statistics and reproducibility
All data were presented as boxplots, showing the median and interquartile range (IQR), based on at least three independent analyses. The normality of data distribution was assessed using the Shapiro-Wilk test for each dataset. For the comparisons between any two groups, an independent two-sample t-test was used, and if the normality assumption was violated, a Mann-Whitney U test was applied. Statistical significance was defined as p ≤ 0.05. All hypothesis tests were performed using the SciPy Python library.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.