Main articles in data science

Submitted in Biostatistics : Ribaud,M., Labbe, A., Oualkacha, K. Matrix completion in genetic methylation studies: LMCC, a Linear Model of Coregionalization with informative Covariates (2023).

DNA methylation is an important epigenetic mark that modulates gene expression through the inhibition of transcriptional proteins binding to DNA. As in many other omics experiments, the issue of missing values is an important one, and appropriate imputation techniques are important in avoiding an unnecessary sample size reduction as well as to optimally leverage the information collected. We consider the case where relatively few samples are processed via an expensive high-density Whole Genome Bisulfite Sequencing (WGBS) strategy and a larger number of samples is processed using more affordable low-density, array-based technologies. In such cases, one can impute the low-coverage (array-based) methylation data using the high-density information provided by the WGBS samples. In this paper, we propose an efficient Linear Model of Coregionalisation with informative Covariates (LMCC) to predict missing values based on observed values and covariates. Our model assumes that at each site, the methylation vector of all samples is linked to the set of fixed factors (covariates) and a set of latent factors. Furthermore, we exploit the functional nature of the data and the spatial correlation across sites by assuming some Gaussian processes on the fixed and latent coefficient vectors, respectively. Our simulations show that the use of covariates can significantly improve the accuracy of imputed values, especially in cases where missing data contain some relevant information about the explanatory variable. We also showed that our proposed model is particularly efficient when the number of columns is much greater than the number of rows - which is usually the case in methylation data analysis. Finally, we apply and compare  our proposed method with alternative approaches  on 15 methylation samples measured at 10^6 sites and  obtained by whole genome bisulfite sequencing across different cell and tissue types.

Link to associated ZIprop R package

Classical supervised methods like linear regression and decision trees are not completely adapted for identifying impacting factors on a response variable corresponding to zero-inflated proportion data (ZIPD) that are dependent, continuous and bounded. In this article we propose a within-block permutation-based methodology to identify factors (discrete or continuous) that are significantly correlated with ZIPD, we propose a performance indicator quantifying the percentage of correlation explained by the subset of significant factors, and we show how to predict the ranks of the response variables conditionally on the observation of these factors. The methodology is illustrated on simulated data and on two real data sets dealing with epidemiology. In the first data set, ZIPD correspond to probabilities of transmission of Influenza between horses. In the second data set, ZIPD correspond to probabilities that geographic entities (eg, states and countries) have the same COVID-19 mortality dynamics.

In the robust shape optimization context, the evaluation cost of numerical models is reduced by the use of a response surface. Multi-objective methodologies for robust optimization that consist in simultaneously minimizing the expectation and variance of a function have already been developed to answer to this question. However, efficient estimation in the framework of time-consuming simulation has not been completely explored. In this paper, a robust optimization procedure based on Taylor expansion, kriging prediction and a genetic NSGA-II algorithm is proposed. The two objectives are the Taylor expansion of expectation and variance. The kriging technique is chosen to surrogate the function and its derivatives. Afterwards, NSGA-II is performed on kriging response surfaces or kriging expected improvements to construct a Pareto front. One point or a batch of points is chosen carefully to enrich the learning set of the model. When the budget is reached the non-dominated points provide designs that make compromises between optimization and robustness. Seven relevant strategies based on this main procedure are detailed and compared in two test functions (2D and 6D). In each case, the results are compared when the derivatives are observed and when they are not. The procedure is also applied to an industrial case study where the objective is to optimize the shape of a motor fan. 

In the context of computer experiments, metamodels are largely used to represent the output of computer codes. Among these models, Gaussian process regression (kriging) is very efficient see e.g Snelson (Flexible and efficient Gaussian process models for machine learning. ProQuest LLC, Ann Arbor, MI. Thesis (Ph.D.)–University of London, University College London, London, 2008). In high dimension that is with a large number of input variables, but with few observations, the estimation of the parameters with a classical anisotropic kriging can be completely inaccurate. Because there are equal numbers of ranges and input variables the optimization space becomes too large compared to available information. One way to overcome this drawback is to use an isotropic kernel that only depends on one parameter. However this model is too restrictive. The aim of this paper is twofold. Our first objective is to propose a smooth kernel with as few parameters as warranted. We introduce a kernel which is a tensor product of few isotropic kernels built on well-chosen subgroup of variables. The main difficulty is to find the number and the composition of the groups. Our second objective is to propose algorithmic strategies to overcome this difficulty. Four forward strategies are proposed. They all start with the simplest isotropic kernel and stop when the best model according to BIC criterion is found. They all show very good accuracy results on simulation test cases. But one of them is more efficient. Tested on a real data set, our kernel shows very good prediction results.

In the context of robust shape optimization, the estimation cost of some physical models is reduce with the use of a response surface. A procedure that requires the estimation of moment 1 and 2 is set up for the robust optimization. The step of the optimization procedure and the partitioning of Pareto front are already developed in the literature. However, the research of a criteria to estimate the robustness of each solution at each iteration is not much explored. The function, the first and second derivatives is given by the majority of industrial code. We propose a robust optimization procedure that based on the prediction of the function and its derivatives predicted by a kriging with a Matern 5/2 covariance kernel. The modeling of the second derivative and consequently the prediction of first and the second derivatives are possible with this kernel. In this context we propose to consider the Taylor theorem calculated in each point of the conception space to approximate the variation around these points. This criterion is used as the replacement of the moment 2 usually employed. A Pareto front of the robust solutions (minimization of the function and the robustness criteria) is generated by a genetic algorithm named NSGA-II. This algorithm gives a Pareto front in an reasonable time of calculation. We show the motivations of this method with an academic example.

Collaboration in data science

Link to associated Shiny App 

Discrepancies in population structures, decision making, health systems and numerous other factors result in various COVID-19-mortality dynamics at country scale, and make the forecast of deaths in a country under focus challenging. However, mortality dynamics of countries that are ahead of time implicitly include these factors and can be used as real-life competing predicting models. We precisely propose such a data-driven approach implemented in a publicly available web app timely providing mortality curves comparisons and real-time short-term forecasts for about 100 countries. Here, the approach is applied to compare the mortality trajectories of second-line and front-line European countries facing the COVID-19 epidemic wave. Using data up to mid-April, we show that the second-line countries generally followed relatively mild mortality curves rather than fast and severe ones. Thus, the continuation, after mid-April, of the COVID-19 wave across Europe was likely to be mitigated and not as strong as it was in most of the front-line countries first impacted by the wave (this prediction is corroborated by posterior data).

Collaboration in other fields


To determine if commercially available mouthwash with β-cyclodextrin and citrox (bioflavonoids) (CDCM) could decrease the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) salivary viral load.


In this randomized controlled trial, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) PCR-positive patients aged 18–85 years with asymptomatic to mild coronavirus disease 2019 (COVID-19) symptoms for <8 days were recruited. A total of 176 eligible patients were randomly assigned (1:1) to CDCM or placebo. Three rinses daily were performed for 7 days. Saliva sampling was performed on day 1 at 09.00 (T1), 13.00 (T2) and 18.00 (T3). On the following 6 days, one sample was taken at 15.00. Quantitative RT-PCR was used to detect SARS-CoV-2.


The intention-to-treat analysis demonstrated that, over the course of 1 day, CDCM was significantly more effective than placebo 4 hours after the first dose (p 0.036), with a median percentage (log10 copies/mL) decrease T1–T2 of –12.58% (IQR –29.55% to –0.16%). The second dose maintained the low median value for the CDCM (3.08 log10 copies/mL; IQR 0–4.19), compared with placebo (3.31 log10 copies/mL; IQR 1.18–4.75). At day 7, there was still a greater median percentage (log10 copies/mL) decrease in salivary viral load over time in the CDCM group (–58.62%; IQR –100% to –34.36%) compared with the placebo group (–50.62%; IQR –100% to –27.66%). These results were confirmed by the per-protocol analysis.


This trial supports the relevance of using CDCM on day 1 (4 hours after the initial dose) to reduce the SARS-CoV-2 viral load in saliva. For long-term effect (7 days), CDMC appears to provide a modest benefit compared with placebo in reducing viral load in saliva.

Felidae species show a great diversity in their diet, foraging and hunting strategies, from small to large prey. Whether they belong to solitary or group hunters, the behavior of cats to subdue resisting small or large prey presents crucial differences. It is assumed that pack hunting reduces the per capita risk of each individual. We hypothesize that the sacroiliac articulation plays a key role in stabilizing the predator while subduing and killing prey. Using CT-scan from 59 felid coxal bones, we calculated the angle between both iliac articular surfaces. Correlation of this inter-iliac angle with body size was calculated and ecological stressors were evaluated on inter-iliac angle. Body size significantly influences inter-iliac angle with small cats having a wider angle than big cats. Arboreal species have a significantly larger angle compared to cursorial felids with the smallest value, and to scansorial and terrestrial species with intermediate angles. Felids hunting large prey have a smaller angle than felids hunting small and mixed prey. Within the Panthera lineage, pack hunters (lions) have a larger angle than all other species using solitary hunting strategy. According to the inter-iliac angle, two main groups of felids are determined: (i) predators with an angle of around 40° include small cats (i.e., Felis silvestris, Leopardus wiedii, Leptailurus serval, Lynx Canadensis, L. rufus; median = 43.45°), the only pack-hunting species (i.e., Panthera leo; median = 37.90°), and arboreal cats (i.e., L. wiedii, Neofelis nebulosa; median = 49.05°), (ii) predators with an angle of around 30° include solitary-hunting big cats (i.e., Acinonyx jubatus, P. onca, P. pardus, P. tigris, P. uncia; median = 31.80°). We suggest different pressures of selection to interpret these results. The tightening of the iliac wings around the sacrum probably enhances big cats’ ability for high speed and large prey control. In contrast, pack hunting in lions reduced the selective pressure for large prey.

Felids show remarkable phenotypic similarities and are conservative in behavioral and ecological traits. In contrast, they display a large range in body mass from around 1 kg to more than 300 kg. Body size and locomotory specializations correlate to skull, limb and vertebral skeleton morphology. With an increase in body mass, felids prey selection switches from small to large, from using a rapid skull or spine lethal bite for small prey, to sustained suffocating bite for large prey. Dietary specialization correlates to skull and front limbs morphology but no correlation was found on the spine or on the hind limb. The morphology of the sacroiliac junction in relation to ecological factors remained to be described. We are presenting a study of the overall shape of the iliac auricular surface with qualitative and quantitative analyses of its morphology. Our results demonstrate that body mass, prey selection, and bite type, crucially influence the auricular surface, where no significant effect of locomotor specialization was found. The outline of the surface is significantly more elevated dorso-caudally and the joint surface shows an irregular W-shape topography in big cats whereas the surface in small cats is smoother with a C-shape topography and less of an elevated ridge. Biomechanically, we suggest that a complex auricular surface increases joint stiffness and provides more support in heavier cats, an advantage for subduing big prey successfully during a sustained bite.