Advancing computational evaluation of adsorption via porous materials by artificial intelligence and computational fluid dynamics

Local outlier factor (LOF)

LOF is a robust and effective algorithm for identifying outliers within a dataset. It operates under the assumption that outliers are data points that deviate significantly from their local neighborhood. By using LOF, one can uncover and subsequently remove outliers from a dataset, enhancing the overall data quality. The LOF score of a data point \(x_i\) is defined in the following manner23:

$$LOF\left(x_i\right)=\frac{\sum_{\forall x_j\in N\left(x_i\right)}\frac{\text{density}\left(x_j\right)}{\text{density}\left(x_i\right)}}{\left|N\left(x_i\right)\right|}$$

Here, \(N(x_i)\) represents the neighborhood of data point \(x_i\), and \(\text{density}(x_i)\) denotes the local density of \(x_i\). The LOF of a data point quantifies how its density compares with the densities of its neighbors. A LOF value significantly greater than 1 suggests that the data point is an outlier24.
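As an illustration, a minimal sketch of this outlier-removal step using scikit-learn's LocalOutlierFactor is shown below; the synthetic dataset, the neighborhood size of 20, and the inlier/outlier labeling convention are illustrative assumptions, not the study's actual configuration.

```python
# Minimal sketch: detecting and removing outliers with LocalOutlierFactor.
# The data and n_neighbors value are illustrative assumptions.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # hypothetical dataset: 200 samples, 3 features

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # +1 for inliers, -1 for outliers
scores = -lof.negative_outlier_factor_  # LOF values; values well above 1 flag outliers

X_clean = X[labels == 1]                # keep only the inliers
print(f"Removed {np.sum(labels == -1)} outliers; {len(X_clean)} samples remain")
```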

GPR (Gaussian process regression)

GPR is a flexible, non-parametric Bayesian technique for regression analysis. Unlike traditional parametric regression methods, GPR makes no explicit assumptions about the functional form of the underlying data distribution. Instead, it models the data as a distribution over functions, allowing for uncertainty quantification and robust predictions19.

The predictive distribution of GPR is derived through Bayesian inference. Given a set of observed data points (X, y), where X denotes the input data and y the corresponding outputs, the goal is to make predictions \(y^*\) for new input points \(X^*\). The predictive distribution of \(y^*\) is expressed as follows25:

$$p\left(y^*\mid X,y,X^*\right)=\mathcal{N}\left(\mu^*,\sigma^*\right)$$

where \(\mu^*\) denotes the mean of the predictive distribution, and \(\sigma^*\) stands for its standard deviation. These quantities can be computed as follows25:

$$\mu^*=\mu\left(X^*\right)+K\left(X^*,X\right)\left[K\left(X,X\right)+\sigma_n^2 I\right]^{-1}\left(y-\mu\left(X\right)\right)$$

$$\sigma^*=K\left(X^*,X^*\right)-K\left(X^*,X\right)\left[K\left(X,X\right)+\sigma_n^2 I\right]^{-1}K\left(X,X^*\right)$$

In the equations above, K(X, X) is the covariance matrix of the training inputs, K(X*, X) is the covariance between the test and training inputs, \(\sigma_n^2\) is the noise variance, and I is the identity matrix.
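A minimal sketch of computing these predictive quantities with scikit-learn's GaussianProcessRegressor follows; the RBF-plus-white-noise kernel and the one-dimensional toy data are illustrative assumptions and do not reflect the study's dataset or kernel choice.

```python
# Minimal sketch: GPR predictive mean and standard deviation with scikit-learn.
# Kernel and data are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))                 # training inputs X
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)    # noisy training targets y

# WhiteKernel models the noise variance sigma_n^2 added to K(X, X)
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_star = np.linspace(0, 10, 100).reshape(-1, 1)      # test inputs X*
mu_star, sigma_star = gpr.predict(X_star, return_std=True)  # predictive mean and std
```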

MLP regression (Multi-layer perceptron regression)

MLP Regression is a variant of artificial neural networks characterized by its multi-layered architecture, where nodes (neurons) are interconnected across these layers. It is a versatile and powerful regression technique capable of modeling complex, nonlinear relationships between inputs and outputs26.

The key equations for MLP regression are the forward-propagation equations for a single neuron26:

$$z_j=\sum_{i=1}^{n}w_{ij}x_i+b_j$$

$$a_j=\sigma\left(z_j\right)$$

In this context, \(z_j\) signifies the weighted summation of inputs for neuron j, \(w_{ij}\) represents the weight of the connection linking neuron i to neuron j, \(b_j\) is the bias term of neuron j, \(\sigma\) is the activation function, and \(a_j\) is the resulting activation (output) of neuron j.
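A minimal NumPy sketch of this forward pass is given below; the layer size, the random weights and biases, and the sigmoid activation are illustrative assumptions rather than the network actually used in the study.

```python
# Minimal sketch of the forward pass described by the two equations above.
# Layer sizes, weights, and the sigmoid activation are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)           # inputs x_i to the layer (n = 4)
W = rng.normal(size=(4, 3))      # weights w_ij for a layer with 3 neurons
b = rng.normal(size=3)           # biases b_j

z = x @ W + b                    # z_j = sum_i w_ij * x_i + b_j
a = sigmoid(z)                   # a_j = sigma(z_j)
print(a)
```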

PR (Polynomial regression)

PR is commonly employed in statistics and ML to model relationships between variables when a polynomial relationship is suspected. Whereas linear regression assumes linearity, polynomial regression can model more complex, nonlinear relationships27. In PR, the relationship between the dependent variable (typically denoted y) and the independent variable (typically denoted x) is expressed as a polynomial function of a chosen degree, often denoted n. The general form of a polynomial regression equation is as follows21:

$$y=\beta_0+\beta_1 x+\beta_2 x^2+\dots+\beta_n x^n+\epsilon$$

In this context, y denotes the dependent variable, which serves as the target we seek to predict or explain, while x signifies the independent variable or predictor upon which y depends. The coefficients \(\beta_0\), \(\beta_1\), ..., \(\beta_n\) are estimated from the data, and \(\epsilon\) is the error term.
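A minimal sketch of estimating these coefficients with scikit-learn follows; the synthetic data and the chosen degree n = 3 are illustrative assumptions.

```python
# Minimal sketch: polynomial regression via polynomial features + linear regression.
# Data and degree are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(100, 1))
y = 1.0 - 2.0 * x.ravel() + 0.5 * x.ravel() ** 3 + 0.1 * rng.normal(size=100)

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)                                    # estimates beta_0 ... beta_n
print(model.named_steps["linearregression"].coef_)
y_pred = model.predict(x)
```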
