Analysing community-level spending behaviour contributing to high carbon emissions using stochastic block models

We obtain financial transaction data from ekko [20], a sustainable banking FinTech company, alongside several government datasets: the Index of Multiple Deprivation (IMD) [21], the Living Costs and Food Survey (LCFS) [22], and the UK Multi-Region Input-Output (UKMRIO) data [23]. These datasets provide socio-economic context and environmental impact metrics for the customers of the banking initiative. In this section, we describe the datasets and outline the network construction approach and community detection methodology, which aim to uncover patterns in consumer behaviour and in the carbon emissions associated with everyday spending. The code used to apply the methodology is freely available [24].

Financial transaction dataset

We obtained a debit card transaction dataset from ekko [20], a sustainable banking FinTech we partnered with, containing tens of thousands of transactions made between 2021 and 2023 by 1,362 customers based in the UK. A distinctive feature of these customers is their higher level of environmental consciousness compared to the general population. This is reflected in their engagement with ekko, which focuses on promoting environmentally friendly practices and rewards customers for their transaction activity. These rewards are specifically designed to help customers consider the carbon footprint of their everyday transactions. The FinTech’s mobile application allows customers to view the environmental impact of their spending in real time, along with personalized insights and streamlined features that make it easier to adopt greener choices in their daily lives.

The financial transaction dataset includes key customer metrics such as customer ID, postcode, age, and transaction details including transaction time, Merchant Category Code (MCC), and amount spent. We summarize the age and IMD distributions of the customers and discuss how they compare to national levels in Appendix 1.

The spending categories of customers in our analysis are defined by the Merchant Category Codes (MCCs) assigned by the debit card provider, in this case Mastercard. Each transaction is linked to a merchant category, described by spending type. An extensive list of merchant categories and the transactions that fall within each is available in Mastercard’s reference documentation [25].

However, it is important to note that some spending categories are incompletely represented in the dataset due to the nature of the available transaction data. In particular, utility payments are frequently made via direct debit, which is not included in this dataset. As a result, spending in the MCC 4900 category (“Utilities—Electric, Gas, Heating Oil, Sanitary, Water”) is less than expected. Other services commonly paid via direct debit, such as subscriptions (MCC 4899—“Cable, Satellite, and Other Pay Television and Radio Services”) and rent payments (MCC 6513—“Real Estate Agents and Managers—Rentals”), are also underrepresented in this data. This limitation should be kept in mind when interpreting the absence of energy-related or subscription-based spending and emissions in the figures and cluster analysis.

For the Stochastic Block Modelling (SBM) analysis, we focus on customers who have made at least 30 transactions across at least 10 distinct categories, so that we capture consistent and diverse everyday spending patterns. This threshold retains only customers with sufficiently broad spending behaviour and excludes infrequent or niche spenders who could skew the analysis. With this criterion, we retain a sample of 272 customers, which we consider sufficient for the analysis. In Fig. 1 we present the percentage of transactions in the ten most frequent MCCs for this sample. To further validate our approach, we apply the model to larger, artificially generated transaction datasets in subsection Applying the model on larger dataset. The results show that, although the set of customers varies with the minimum thresholds for transactions and MCCs, stochastic block modelling still yields clusters with consistent underlying patterns. The choice of thresholds therefore does not fundamentally alter the overall structure uncovered by the analysis, but it does change the specific composition of the clusters, because it determines how many customers are included.
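The selection rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the released code: the record layout `(customer_id, mcc, amount)` and the function name `eligible_customers` are assumptions made for the example, while the two thresholds are the ones stated in the text.

```python
from collections import defaultdict

MIN_TRANSACTIONS = 30  # minimum number of transactions (from the text)
MIN_CATEGORIES = 10    # minimum number of distinct MCCs (from the text)


def eligible_customers(transactions, min_tx=MIN_TRANSACTIONS, min_mcc=MIN_CATEGORIES):
    """Return the set of customer IDs that meet both thresholds.

    `transactions` is an iterable of (customer_id, mcc, amount) tuples;
    this layout is a hypothetical one chosen for illustration.
    """
    tx_count = defaultdict(int)  # customer -> number of transactions
    mccs = defaultdict(set)      # customer -> distinct MCCs seen
    for customer_id, mcc, _amount in transactions:
        tx_count[customer_id] += 1
        mccs[customer_id].add(mcc)
    return {
        c for c in tx_count
        if tx_count[c] >= min_tx and len(mccs[c]) >= min_mcc
    }
```

Exposing the thresholds as parameters makes it straightforward to repeat the analysis with different cut-offs and to check how the retained sample changes.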

Fig. 1

Percentage of transactions in the top 10 most frequent Merchant Category Codes (MCCs) across the entire sample (N = 272 customers).

Government datasets: IMD, LCFS, and UKMRIO

This study relies on several openly available government datasets to place spending patterns and their associated carbon emissions in a broader socio-economic and environmental context. The datasets used are the Index of Multiple Deprivation (IMD), the Living Costs and Food Survey (LCFS), and the UK Multi-Region Input-Output (UKMRIO) data.

The Index of Multiple Deprivation (IMD) provides a detailed measure of deprivation at a small area level across England, covering indicators such as income, employment, education, health, crime, housing, and the living environment. We use the English indices of deprivation 2019 dataset, published by the UK Ministry of Housing, Communities & Local Government [21], to understand the socio-economic profile of customers and analyse correlations between deprivation levels and spending behaviour.

The Living Costs and Food Survey (LCFS), collected annually by the Office for National Statistics (ONS), provides detailed data on household expenditure, income, and demographics. The UK Multi-Region Input-Output (UKMRIO) dataset offers a model of the flow of goods and services across UK regions, detailing inter-industry interactions, consumption, and environmental impacts. Together, these datasets allow us to estimate the carbon emissions linked to different categories of expenditure [6].

Data integration and fusion

To integrate the financial transaction data with these government datasets, we follow a structured process. First, carbon emissions for each transaction are estimated using the third approach described in Trendl et al. [6], which derives emissions from financial transaction data combined with MRIO-based carbon multipliers. Specifically, we apply the carbon intensity multipliers developed in that work, which are based on the LCFS and UKMRIO datasets. These multipliers are defined for Classification of Individual Consumption According to Purpose (COICOP) categories, which are mapped to the MCCs in the transaction data using established mappings [2, 26, 27, 28]. This allows us to estimate the emissions of each transaction from its MCC and spending amount, with amounts adjusted for inflation using Consumer Price Index (CPI) values.

Second, the IMD is linked to the transaction data through customer postcodes: each postcode is matched to a Lower Layer Super Output Area (LSOA) in the IMD dataset. This provides deprivation statistics for each customer, enabling an analysis of socio-economic influences on spending behaviour and carbon emissions. Every customer was successfully mapped to the deprivation level of their area.
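The two linkage steps can be illustrated with a short Python sketch. It is not the released implementation, and every value in the lookup tables is a hypothetical placeholder rather than a real multiplier, price index, or IMD record. The sketch also assumes that the multipliers are expressed per pound at the prices of a base year, so that each amount is deflated to base-year prices before the multiplier is applied; the direction of that adjustment is an assumption of this example.

```python
# All lookup values are hypothetical placeholders for illustration only.
MCC_TO_COICOP = {5411: "01.1", 5541: "07.2.2"}   # illustrative MCC -> COICOP mapping
KGCO2E_PER_GBP = {"01.1": 0.5, "07.2.2": 2.0}    # multiplier per pound, base-year prices
CPI = {2019: 100.0, 2022: 125.0}                 # price index by year
BASE_YEAR = 2019                                 # assumed price year of the multipliers
POSTCODE_TO_LSOA = {"AB12CD": "LSOA_A"}          # normalised postcode -> LSOA identifier
LSOA_TO_IMD_DECILE = {"LSOA_A": 3}               # LSOA identifier -> deprivation decile


def transaction_emissions(mcc, amount_gbp, year):
    """Estimate the emissions (kgCO2e) of a single transaction."""
    # Deflate the nominal amount to base-year prices using the CPI ratio.
    real_amount = amount_gbp * CPI[BASE_YEAR] / CPI[year]
    # Look up the multiplier through the MCC -> COICOP mapping.
    return real_amount * KGCO2E_PER_GBP[MCC_TO_COICOP[mcc]]


def imd_decile(postcode):
    """Return the IMD decile for a postcode, or None if it cannot be matched."""
    key = postcode.replace(" ", "").upper()  # normalise spacing and case
    lsoa = POSTCODE_TO_LSOA.get(key)
    return LSOA_TO_IMD_DECILE.get(lsoa)
```

Returning `None` for unmatched postcodes, instead of raising an error, makes it easy to count how many customers fail to link before deciding how to handle them.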

By combining these datasets, we ensure a comprehensive analysis of customer spending patterns, socio-economic contexts, and associated carbon emissions.

Bipartite network creation

Bipartite networks are used to represent systems consisting of two distinct types of nodes. In simple bipartite networks, connections only form between nodes of different types, which makes them ideal for modelling relationships in complex datasets. For example, in ecology, bipartite networks can illustrate interactions between species and their environments, helping researchers understand ecosystem dynamics [29, 30]. Similarly, in recommendation systems, these networks can connect users with items, allowing for tailored recommendations based on user preferences [31]. In economics, bipartite networks can represent relationships between different economic agents, such as consumers and products, facilitating insights into market behaviour [32].

These applications show that bipartite networks are not just theoretical constructs but practical tools for analysing complex interactions between distinct groups across various fields. They can be used for data mining, pattern recognition, and identifying relationships within complex systems, making them useful in both research and applied contexts. In this research, they link customers with the transaction categories they engage with, which supports the identification of consumer spending patterns and the estimation of carbon emissions, and thereby provides insights for policymakers and financial institutions seeking to implement effective carbon reduction strategies.

Fig. 2

An example of a transaction bipartite network, showing the connection between customers (top) and MCCs (bottom). Each edge represents that the customer has had at least one transaction in that merchant category.

To understand customer spending behaviour, we construct a bipartite network with customers as one type of node and MCCs as the other. An edge is created between a customer and an MCC if the customer has made a transaction in that category. See Fig. 2 for an illustration of this bipartite network structure. This approach allows us to present all the data from a large transaction dataset in a single network, connecting customers and categories based on transaction patterns. We can then apply network analysis techniques to analyse the dataset, which provides additional insights compared to classical statistics and machine learning approaches.
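The construction described above can be sketched as follows. This is a minimal, library-free illustration; the record layout `(customer_id, mcc, amount)` and the function name `build_bipartite_edges` are assumptions made for the example.

```python
def build_bipartite_edges(transactions):
    """Build the unweighted customer-MCC bipartite network.

    `transactions` is an iterable of (customer_id, mcc, amount) tuples
    (a hypothetical layout). Returns the two node sets and the edge set;
    an edge (customer, mcc) exists if the customer has at least one
    transaction in that merchant category.
    """
    customers, categories, edges = set(), set(), set()
    for customer_id, mcc, _amount in transactions:
        customers.add(customer_id)
        categories.add(mcc)
        edges.add((customer_id, mcc))  # repeated transactions collapse to one edge
    return customers, categories, edges
```

Because every edge is a (customer, category) pair, the network is bipartite by construction: no edge can join two customers or two categories.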

This network approach also accommodates classification systems different from the MCC used by financial institutions, allowing for the exploration of connections and the identification of consumer communities based on their spending patterns. By incorporating various classifications, such as COICOP used by the UN and multiple policymaking institutions, this approach provides broad analysis of consumer behaviour across different contexts.

Additionally, we define two alternative edge-weighting schemes for the bipartite network to better capture different characteristics of the data. These weight assignments provide different ways to quantify the strength of the relationship between customers and merchant categories (MCCs) based on their transaction behaviour.

First, we use the number of transactions between a customer and an MCC as the edge weight. This means that for each customer-category connection, the weight is simply the count of transactions made by that specific customer in that category. This approach ensures that the network structure reflects not only the set of categories a customer transacts in but also the frequency of transactions within each category. By incorporating transaction counts directly as weights, we preserve the raw behavioural patterns without introducing external assumptions.

Second, we define an alternative weighting based on relative spending per category. Here, the edge weight represents the total amount a customer has spent in a given category, normalised by the average spending of all customers in that category. This normalisation ensures that spending behaviour is interpreted in the context of overall category trends, allowing us to identify customers who spend significantly more or less than average in a given category.
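Both weighting schemes can be sketched in Python as below. This is an illustration under stated assumptions, not the released code: the record layout `(customer_id, mcc, amount)` is hypothetical, amounts are assumed to be positive, and the category average is taken over customers with at least one transaction in that category, which is one possible reading of "average spending of all customers in that category".

```python
from collections import Counter, defaultdict


def count_weights(transactions):
    """Edge weight = number of transactions by a customer in an MCC."""
    return Counter((c, m) for c, m, _amount in transactions)


def relative_spend_weights(transactions):
    """Edge weight = customer's total spend in an MCC divided by the
    mean total spend in that MCC across customers who transacted in it."""
    spend = defaultdict(float)  # (customer, mcc) -> total amount spent
    for c, m, amount in transactions:
        spend[(c, m)] += amount

    totals_by_mcc = defaultdict(list)  # mcc -> per-customer totals
    for (_c, m), total in spend.items():
        totals_by_mcc[m].append(total)
    mean_by_mcc = {m: sum(v) / len(v) for m, v in totals_by_mcc.items()}

    return {(c, m): total / mean_by_mcc[m] for (c, m), total in spend.items()}
```

Under the second scheme a weight of 1 corresponds to average spending in the category, values above 1 to above-average spending, and values below 1 to below-average spending.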

These weighting schemes do not introduce arbitrary modifications; they are derived directly from the transaction data, either as transaction counts or as spending amounts, which keeps the network representation transparent and interpretable. By applying network analysis to these weighted structures, we uncover different aspects of consumer spending behaviour. The results of these analyses and their implications are explored in the Discussion section.

Stochastic block modelling

A Stochastic Block Model (SBM) is a probabilistic model used to analyse the structure of networks by dividing nodes into distinct groups or “blocks” based on connection patterns. Its probabilistic nature makes it well-suited for community detection, allowing us to identify connection patterns within and between groups. Introduced in the 1980s [33], the canonical SBM views networks as structures composed of blocks of nodes, in which the probability of a connection between two nodes is determined by their block memberships and a set of model parameters.
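To make the generative idea concrete, the following sketch samples a network from a canonical SBM with independent edges. It illustrates only the basic model for a simple undirected network, not the degree-corrected hierarchical variant applied in this study; the function name `sample_sbm` and its arguments are hypothetical.

```python
import random
from itertools import combinations


def sample_sbm(block_of, edge_prob, seed=0):
    """Sample an undirected network from a canonical SBM.

    block_of[i] is the block of node i, and edge_prob[r][s] is the
    probability of an edge between a node in block r and a node in
    block s. Each pair of nodes is connected independently.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    edges = []
    for i, j in combinations(range(len(block_of)), 2):
        if rng.random() < edge_prob[block_of[i]][block_of[j]]:
            edges.append((i, j))
    return edges
```

Community detection with an SBM inverts this process: given the observed edges, it infers the block assignments and connection parameters that most plausibly generated them.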

There have been many recent advancements in SBMs expanding their applicability to various networks across different scientific fields. Modifications to the canonical SBM structure include approaches that account for weights in networks [34] and hierarchical communities [35]. Studies have also developed degree-corrected variations [36] and models with overlapping communities [37].

These models have been effectively applied to study community formation in various fields, such as in US Senate political cohesion and co-voting networks [38], connections in healthy human gut microbiomes [39], and relationships in ecology and ethnobiology [40].

SBMs are particularly useful for analysing large-scale networks encountered in real-world applications, such as large financial transaction datasets. While our study focuses on UK consumption data, this approach is not limited to a single region. SBMs can be applied to transaction datasets from other countries, enabling cross-country comparisons of consumer behaviour and spending patterns by identifying similarities and differences in community structures across regions. Their high resolution limit allows numerous communities with specific characteristics to be identified [35, 41]. The probabilistic basis of SBMs grounds the community detection results in the observed data, making them an efficient tool for revealing insights into community organization in complex networks.

In contrast to methods like k-means or modularity maximisation, which may struggle with high-dimensional, sparse, or complex network data, SBMs are better suited for detecting meaningful community structures. SBMs identify statistically significant assortative modules by modelling the full probabilistic structure of the network, enabling them to avoid common pitfalls like resolution limits and overfitting [35, 42]. Moreover, the SBM’s hierarchical and nonparametric properties allow the detection of multi-scale community structures and fine-grained behavioural patterns without prior assumptions about the number or shape of the clusters [19]. This makes them particularly appropriate for analysing behavioural networks derived from financial transactions.

In this study, we apply a degree-corrected nonparametric hierarchical SBM to the bipartite network described above to identify communities of customers based on spending categories. We implement the model using Python and the graph-tool package. For reproducibility, the full code is openly available [24]. This specific SBM method was initially developed for topic modelling through word clusters in documents [43, 44]. We apply it to financial transaction datasets to find communities of customers with similar spending behaviour in bipartite networks of customers and categories. Additionally, we modify the code so that any arbitrary vector of weights between the customer and category nodes can be used, instead of relying solely on the number of repeated transactions as the weight.

However, because the inference is stochastic, running the SBM algorithm only once does not guarantee that the optimal partition is found [19]. To address this variability, we run the algorithm 100 times, as this is typically sufficient for the entropy in the system to stabilise, meaning that the uncertainty in the community assignments reaches a consistent level. Stabilisation of entropy is desirable because it indicates that the algorithm has converged to a reliable partitioning of the network; additional iterations can be run if necessary. We then plot the change in entropy across runs and select the run with the lowest entropy. In this setting the entropy corresponds to the description length of the fitted model, so the run with the lowest entropy is also the one with the highest posterior probability, indicating a well-defined community structure.
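The repeated-run selection can be sketched generically as below. This is an illustration, not the released code: `best_of_n_runs` and `fit_once` are hypothetical names, and `fit_once` stands in for a single stochastic fit. With graph-tool, one possible choice would be to call `minimize_nested_blockmodel_dl` on the network and read the result's `entropy()`, but that wiring is an assumption and is left out so that the sketch runs without the library.

```python
import random


def best_of_n_runs(fit_once, n_runs=100, seed=0):
    """Repeat a stochastic fit and keep the lowest-entropy result.

    `fit_once(rng)` must return a (state, entropy) pair for one run.
    Returns (best_state, best_entropy, trace), where trace[k] is the
    lowest entropy seen after k + 1 runs; plotting the trace shows
    whether the entropy has stabilised.
    """
    rng = random.Random(seed)  # shared generator for reproducibility
    best_state, best_entropy = None, float("inf")
    trace = []
    for _ in range(n_runs):
        state, entropy = fit_once(rng)
        if entropy < best_entropy:  # lower entropy = shorter description length
            best_state, best_entropy = state, entropy
        trace.append(best_entropy)
    return best_state, best_entropy, trace
```

A trace that is flat over its final runs suggests that further runs are unlikely to find a better partition; if it is still decreasing, `n_runs` can be increased.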
