Rémi Flamary

Professional website



I am currently a Monge Assistant Professor at the Applied Mathematics department and CMAP Laboratory from École Polytechnique. I am on leave from Université Côte d'Azur in the Department of Electronics and in the Lagrange Laboratory that is part of the Observatoire de la Côte d'Azur. I was a PhD student and teaching assistant at the LITIS Laboratory and my PhD advisor was Alain Rakotomamonjy at Rouen University.

On this website, you can find a list of my publications and download the corresponding software/code. Some of my french teaching material is also available.

Research Interests

  • Machine learning and statistical signal processing
    • Classification, supervised learning
    • Kernel methods, Support Vector Machines
    • Optimization with sparsity, variable selection, mixed norms, non convex regularization
    • Feature learning, data representation, kernel learning
    • Convolutional neural networks, filter learning, image reconstruction
    • Optimal transport, domain adaptation
  • Applications
    • Biomedical engineering, Brain-Computer Interfaces
    • Remote sensing and hyperspectral Imaging
    • Energy and climate
    • Astronomical image processing

Wordcloud of my research interests.

Recent work

I. Redko, T. Vayer, R. Flamary, N. Courty, CO-Optimal Transport, Neural Information Processing Systems (NeurIPS), 2020.
Abstract: Optimal transport (OT) is a powerful geometric and probabilistic tool for finding correspondences and measuring similarity between two distributions. Yet, its original formulation relies on the existence of a cost function between the samples of the two distributions, which makes it impractical for comparing data distributions supported on different topological spaces. To circumvent this limitation, we propose a novel OT problem, named COOT for CO-Optimal Transport, that aims to simultaneously optimize two transport maps between both samples and features. This is different from other approaches that either discard the individual features by focussing on pairwise distances (e.g. Gromov-Wasserstein) or need to model explicitly the relations between the features. COOT leads to interpretable correspondences between both samples and feature representations and holds metric properties. We provide a thorough theoretical analysis of our framework and establish rich connections with the Gromov-Wasserstein distance. We demonstrate its versatility with two machine learning applications in heterogeneous domain adaptation and co-clustering/data summarization, where COOT leads to performance improvements over the competing state-of-the-art methods.
author = {Ivegen Redko and Titouan Vayer and Rémi Flamary and Nicolas Courty},
title = {CO-Optimal Transport},
booktitle = { Neural Information Processing Systems (NeurIPS)},
year = {2020}
D. Marcos, R. Fong, S. Lobry, R. Flamary, N. Courty, D. Tuia, Contextual Semantic Interpretability, Asian Conference on Computer Vision (ACCV), 2020.
Abstract: Convolutional neural networks (CNN) are known to learn an image representation that captures concepts relevant to the task, but do so in an implicit way that hampers model interpretability. However, one could argue that such a representation is hidden in the neurons and can be made explicit by teaching the model to recognize semantically interpretable attributes that are present in the scene. We call such an intermediate layer a \emphsemantic bottleneck. Once the attributes are learned, they can be re-combined to reach the final decision and provide both an accurate prediction and an explicit reasoning behind the CNN decision. In this paper, we look into semantic bottlenecks that capture context: we want attributes to be in groups of a few meaningful elements and participate jointly to the final decision. We use a two-layer semantic bottleneck that gathers attributes into interpretable, sparse groups, allowing them contribute differently to the final output depending on the context. We test our contextual semantic interpretable bottleneck (CSIB) on the task of landscape scenicness estimation and train the semantic interpretable bottleneck using an auxiliary database (SUN Attributes). Our model yields in predictions as accurate as a non-interpretable baseline when applied to a real-world test set of Flickr images, all while providing clear and interpretable explanations for each prediction.
author = {Diego Marcos and Ruth Fong and Sylvain Lobry and Remi Flamary and Nicolas Courty and Devis Tuia},
title = {Contextual Semantic Interpretability},
booktitle = { Asian Conference on Computer Vision (ACCV)},
year = {2020}
K. Fatras, Y. Zine, R. Flamary, R. Gribonval, N. Courty, Learning with minibatch Wasserstein : asymptotic and gradient properties, International Conference on Artificial Intelligence and Statistics (AISTAT), 2020.
Abstract: Optimal transport distances are powerful tools to compare probability distributions and have found many applications in machine learning. Yet their algorithmic complexity prevents their direct use on large scale datasets. To overcome this challenge, practitioners compute these distances on minibatches \em i.e. they average the outcome of several smaller optimal transport problems. We propose in this paper an analysis of this practice, which effects are not well understood so far. We notably argue that it is equivalent to an implicit regularization of the original problem, with appealing properties such as unbiased estimators, gradients and a concentration bound around the expectation, but also with defects such as loss of distance property. Along with this theoretical analysis, we also conduct empirical experiments on gradient flows, GANs or color transfer that highlight the practical interest of this strategy.
author = {Kilian Fatras and Younes Zine and Rémi Flamary and Rémi Gribonval and Nicolas Courty},
title = {Learning with minibatch Wasserstein : asymptotic and gradient properties},
booktitle = { International Conference on Artificial Intelligence and Statistics (AISTAT)},
year = {2020}
T. Vayer, R. Flamary, R. Tavenard, L. Chapel, N. Courty, Sliced Gromov-Wasserstein, Neural Information Processing Systems (NeurIPS), 2019.
Abstract: Recently used in various machine learning contexts, the Gromov-Wasserstein distance (GW) allows for comparing distributions that do not necessarily lie in the same metric space. However, this Optimal Transport (OT) distance requires solving a complex non convex quadratic program which is most of the time very costly both in time and memory. Contrary to GW, the Wasserstein distance (W) enjoys several properties (e.g. duality) that permit large scale optimization. Among those, the Sliced Wasserstein (SW) distance exploits the direct solution of W on the line, that only requires sorting discrete samples in 1D. This paper propose a new divergence based on GW akin to SW. We first derive a closed form for GW when dealing with 1D distributions, based on a new result for the related quadratic assignment problem. We then define a novel OT discrepancy that can deal with large scale distributions via a slicing approach and we show how it relates to the GW distance while being O(n^2) to compute. We illustrate the behavior of this so called Sliced Gromov-Wasserstein (SGW) discrepancy in experiments where we demonstrate its ability to tackle similar problems as GW while being several order of magnitudes faster to compute
author = { Vayer, Titouan and Flamary, Rémi and Tavenard, Romain and Chapel, Laetitia  and  Courty, Nicolas},
title = {Sliced Gromov-Wasserstein},
booktitle = {Neural Information Processing Systems (NeurIPS)},
year = {2019}
T. Vayer, L. Chapel, R. Flamary, R. Tavenard, N. Courty, Optimal Transport for structured data with application on graphs, International Conference on Machine Learning (ICML), 2019.
Abstract: This work considers the problem of computing distances between structured objects such as undirected graphs, seen as probability distributions in a specific metric space. We consider a new transportation distance (i.e. that minimizes a total cost of transporting probability masses) that unveils the geometric nature of the structured objects space. Unlike Wasserstein or Gromov-Wasserstein metrics that focus solely and respectively on features (by considering a metric in the feature space) or structure (by seeing structure as a metric space), our new distance exploits jointly both information, and is consequently called Fused Gromov-Wasserstein (FGW). After discussing its properties and computational aspects, we show results on a graph classification task, where our method outperforms both graph kernels and deep graph convolutional networks. Exploiting further on the metric properties of FGW, interesting geometric objects such as Fréchet means or barycenters of graphs are illustrated and discussed in a clustering context.
author = { Vayer, Titouan and Chapel, Laetitia and Flamary, Rémi and Tavenard, Romain and  Courty, Nicolas},
title = {Optimal Transport for structured data with application on graphs},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2019}
I. Redko, N. Courty, R. Flamary, D. Tuia, Optimal Transport for Multi-source Domain Adaptation under Target Shift, International Conference on Artificial Intelligence and Statistics (AISTAT), 2019.
Abstract: In this paper, we propose to tackle the problem of reducing discrepancies between multiple domains referred to as multi-source domain adaptation and consider it under the target shift assumption: in all domains we aim to solve a classification problem with the same output classes, but with labels' proportions differing across them. We design a method based on optimal transport, a theory that is gaining momentum to tackle adaptation problems in machine learning due to its efficiency in aligning probability distributions. Our method performs multi-source adaptation and target shift correction simultaneously by learning the class probabilities of the unlabeled target sample and the coupling allowing to align two (or more) probability distributions. Experiments on both synthetic and real-world data related to satellite image segmentation task show the superiority of the proposed method over the state-of-the-art.
author = { Redko, I. and Courty, N. and Flamary, R. and Tuia, D.},
title = {Optimal Transport for Multi-source Domain Adaptation under Target Shift},
booktitle = { International Conference on Artificial Intelligence and Statistics (AISTAT)},
year = {2019}
B. B. Damodaran, B. Kellenberger, R. Flamary, D. Tuia, N. Courty, DeepJDOT: Deep Joint distribution optimal transport for unsupervised domain adaptation, European Conference in Computer Visions (ECCV), 2018.
Abstract: In computer vision, one is often confronted with problems of domain shifts, which occur when one applies a classifier trained on a source dataset to target data sharing similar characteristics (e.g. same classes), but also different latent data structures (e.g. different acquisition conditions). In such a situation, the model will perform poorly on the new data, since the classifier is specialized to recognize visual cues specific to the source domain. In this work we explore a solution, named DeepJDOT, to tackle this problem: through a measure of discrepancy on joint deep representations/labels based on optimal transport, we not only learn new data representations aligned between the source and target domain, but also simultaneously preserve the discriminative information used by the classifier. We applied DeepJDOT to a series of visual recognition tasks, where it compares favorably against state-of-the-art deep domain adaptation methods.
author = { Damodaran, Bharath B. and Kellenberger, Benjamin and Flamary, Rémi and Tuia, Devis and Courty, Nicolas},
title = {DeepJDOT: Deep Joint distribution optimal transport for unsupervised domain adaptation},
booktitle = {European Conference in Computer Visions (ECCV)},
year = {2018}
R. Flamary, M. Cuturi, N. Courty, A. Rakotomamonjy, Wasserstein Discriminant Analysis, Machine learning , Vol. 107, pp 1923-1945, 2018.
Abstract: Wasserstein Discriminant Analysis (WDA) is a new supervised method that can improve classification of high-dimensional data by computing a suitable linear map onto a lower dimensional subspace. Following the blueprint of classical Linear Discriminant Analysis (LDA), WDA selects the projection matrix that maximizes the ratio of two quantities: the dispersion of projected points coming from different classes, divided by the dispersion of projected points coming from the same class. To quantify dispersion, WDA uses regularized Wasserstein distances, rather than cross-variance measures which have been usually considered, notably in LDA. Thanks to the the underlying principles of optimal transport, WDA is able to capture both global (at distribution scale) and local (at samples scale) interactions between classes. Regularized Wasserstein distances can be computed using the Sinkhorn matrix scaling algorithm; We show that the optimization of WDA can be tackled using automatic differentiation of Sinkhorn iterations. Numerical experiments show promising results both in terms of prediction and visualization on toy examples and real life datasets such as MNIST and on deep features obtained from a subset of the Caltech dataset.
author = {Flamary, Remi and Cuturi, Marco and Courty, Nicolas and Rakotomamonjy, Alain},
title = {Wasserstein Discriminant Analysis},
journal = { Machine learning },
volume = {107},
pages = {1923-1945},
year = {2018}
N. Courty, R. Flamary, M. Ducoffe, Learning Wasserstein Embeddings, International Conference on Learning Representations (ICLR), 2018.
Abstract: The Wasserstein distance received a lot of attention recently in the community of machine learning, especially for its principled way of comparing distributions. It has found numerous applications in several hard problems, such as domain adaptation, dimensionality reduction or generative models. However, its use is still limited by a heavy computational cost. Our goal is to alleviate this problem by providing an approximation mechanism that allows to break its inherent complexity. It relies on the search of an embedding where the Euclidean distance mimics the Wasserstein distance. We show that such an embedding can be found with a siamese architecture associated with a decoder network that allows to move from the embedding space back to the original input space. Once this embedding has been found, computing optimization problems in the Wasserstein space (e.g. barycenters, principal directions or even archetypes) can be conducted extremely fast. Numerical experiments supporting this idea are conducted on image datasets, and show the wide potential benefits of our method.
author = {Courty, Nicolas and Flamary, Remi and Ducoffe, Melanie},
title = {Learning Wasserstein Embeddings},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2018}


Optimal Transport for Machine Learning Workshop at NeurIPS 2019


We are organizing with Alexandra Suvorikova, Marco Cuturi and Gabriel Peyré the third OTML Workshop at NeurIPS 2019 on 13/14 December 2019.

The list of invited speakers and the call for contribution are both available on the Workshop website.

Optimal Transport for Machine Learning Tutorial at ISBI 2019


I have given a 3h tutorial about Optimal transport for machine learning at ISBI 2019.

You can find the presentation slides and the practical session Python notebook here.

Optimal Transport at Data Science Summer School 2018


We will be giving two one day courses with Marco Cuturi and Nicolas Courty about Optimal transport for machine learning for the Data Science Summer School 2018 (DS3) at Ecole Polytechnique in Paris/Saclay, France.

You can find the presentation slides and the practical session Python notebook on Github.