Statistical issues in general (mixed effects LR)
October 8, 2025
Mixed effects logistic regression models require individual-level data. Sometimes data must stay at the population level; it can then be aggregated, for example into counts, but aggregation can lose other information. Federated learning has emerged as a better solution: it fits models using data from multiple hospitals without requiring individual-level data to be sent. Federated learning algorithms have been developed for generalized linear mixed models (GLMMs), including logistic regression with a random intercept. These algorithms rely on strategies such as distributed versions of the penalized quasi-likelihood method. Another approach, which further reduces the need for iterative communication with data providers, is noniterative: it accommodates both categorical and continuous covariates and is based on generating artificial data with a Gaussian copula.
Building on these earlier ideas, the authors propose their own noniterative method: pseudo-data are generated to match summary statistics of the original data, which allows logistic mixed models to be estimated without iterative communication. Their approach can also accommodate multiple covariates, and it uses Gauss-Hermite quadrature (implemented in the R software) to approximate the likelihood of a GLMM, which they consider more accurate than a penalized quasi-likelihood method. Their strategy requires that the data providers (i.e. hospitals) supply the data analyst with privacy-preserving summary statistics. It is also assumed that these statistics were computed from data with no missing values.
The summary statistics they refer to can be exact sufficient statistics or polynomial-approximate sufficient statistics. For the polynomial approximation, one can use a Taylor expansion and express the log-likelihood as a degree-K Taylor polynomial plus an error term.
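To make the Taylor idea concrete, here is a minimal Python sketch under simplifying assumptions that are mine, not the authors': a single covariate, fixed effects only (no random intercept), degree K = 2, and expansion of log(1 + exp(eta)) around eta = 0. The function names and the toy data are hypothetical. It shows that once the log-likelihood is replaced by its Taylor polynomial, it depends on the data only through a handful of sums that a provider could share.

```python
import math


def softplus_taylor2(eta):
    """Degree-2 Taylor polynomial of log(1 + exp(eta)) around eta = 0."""
    return math.log(2.0) + eta / 2.0 + eta ** 2 / 8.0


def exact_loglik(b0, b1, xs, ys):
    """Exact logistic log-likelihood; needs individual-level data."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = b0 + b1 * x
        total += y * eta - math.log1p(math.exp(eta))
    return total


def summary_stats(xs, ys):
    """Sums a data provider would share instead of the raw records."""
    return {
        "n": len(xs),
        "sum_x": sum(xs),
        "sum_x2": sum(x * x for x in xs),
        "sum_y": sum(ys),
        "sum_xy": sum(x * y for x, y in zip(xs, ys)),
    }


def approx_loglik(b0, b1, s):
    """Degree-2 approximate log-likelihood computed from the sums only."""
    linear = b0 * s["sum_y"] + b1 * s["sum_xy"]          # sum of y * eta
    sum_eta = b0 * s["n"] + b1 * s["sum_x"]              # sum of eta
    sum_eta2 = (b0 ** 2 * s["n"] + 2 * b0 * b1 * s["sum_x"]
                + b1 ** 2 * s["sum_x2"])                 # sum of eta^2
    return linear - (s["n"] * math.log(2.0) + sum_eta / 2.0 + sum_eta2 / 8.0)


# Toy data, invented purely for illustration.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
ys = [0, 0, 1, 0, 1, 1]
stats = summary_stats(xs, ys)

print(exact_loglik(0.1, 0.4, xs, ys))
print(approx_loglik(0.1, 0.4, stats))
```

The two printed values agree closely here because the linear predictor stays near the expansion point; the gap is the error term, which grows as eta moves away from 0 and shrinks as K increases.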
These polynomial-approximate sufficient statistics are the summary statistics expected from each data provider, so that model estimation can proceed without disclosing individual-level data. For a binary variable, one can follow the same formulas they showed, computing central moments from the Taylor polynomial expansion; categorical predictors with more than two levels must first be converted into dummy variables before central moments are computed. They then proposed generating pseudo-data from the polynomial-approximate sufficient statistics and using these in place of the actual data. They did not aim to reconstruct the actual data, since they want to protect privacy, but rather to mimic the actual data in a way without giving