Projects

2023

Master Thesis: Vision Transformers for Analyzing High-Resolution Pathology Images
Transformers are powerful models that capture long-range dependencies in data using an attention mechanism. However, applying transformers to medical problems is challenging because of the high resolution and complexity of pathology images, as well as the scarcity and noise of labels. This thesis provides a comprehensive overview of state-of-the-art transformer methods for analyzing high-resolution pathology images, using the glioblastoma dataset IvyGAP and a renal cancer dataset as case studies. We discuss how to pre-train transformers with self-supervised learning methods that learn useful representations from unlabeled data. We also explore how to find regions of interest within a whole slide image under different levels of supervision: self-supervised, weakly supervised and strongly supervised. We demonstrate that transformers provide a semantic understanding of the data and outperform convolution-based models on downstream tasks such as classification and segmentation. Finally, we employ posterior networks to estimate the aleatoric and epistemic uncertainty of Vision Transformer (ViT) predictions and evaluate their usefulness for clinical decision making. We propose a novel method to filter potentially mislabeled data in order to make more accurate and confident predictions. Our experiments show that our methods achieve state-of-the-art results on various glioblastoma tasks and provide meaningful insights into the behavior and limitations of ViTs for medical machine learning.

Tags: Computer Vision, Data Science, Machine Learning

2022

BERT fine-tuning for Sentiment Analysis with deployment using Ray Serve

Personal Project, 2022
The purpose of this project is to practice the full pipeline of solving a problem with modern MLOps tools. Pre-trained BERT weights were fine-tuned on a sentiment analysis task on the IMDb dataset. Logging and hyperparameter tuning were done with W&B. A Docker image is available for production, and deployment is done with Ray Serve to expose a REST API.

Tags: Machine Learning, Natural Language Processing

Improving Domain Generalization for Deep Learning-Powered Cancer Cell Detection

Internship, 2022
In 2022, I worked as a research and software intern at NEC Laboratories in Princeton, working alongside world-renowned scientists. In the Medical Machine Learning department, I researched and implemented data augmentation techniques to improve results on a complex multi-class segmentation problem. I motivated the use of GANs to produce new samples for cancer detection, explored semi-supervised methods such as MixMatch, and studied ways to make models generalize better across domains.

Tags: Computer Vision, GANs, Machine Learning, Medical, Self-Supervised

2021

Implementation of GANWriting, Content-Conditioned Generation of Styled Handwritten Word Images

Internship, 2021
Worked on the deployment of a state-of-the-art GAN model for generating handwritten words, with the goal of writing a full library in PyTorch Lightning. This work relied on a deep understanding of computer vision and complex ML systems. GANWriting proposes a novel method that produces credible handwritten word images by conditioning the generative process on both calligraphic style features and textual content. The generator is guided by three complementary learning objectives: to produce realistic images, to imitate a certain handwriting style and to convey a specific textual content.

Tags: Computer Vision, Machine Learning

Exploration of D-Cliques Variations and Edge Cases for Decentralized Federated Learning

Research Project, 90/100, 2021
Researched the decentralized implementation of a Federated Learning algorithm in the Scalable Computing Systems Laboratory at EPFL. Supervised by Prof. Kermarrec, I explored and identified the computational implications of a decentralized approach built on a state-of-the-art topology (D-Cliques) in a research environment. D-Cliques (Bellet et al. 2021) is a recent approach to coordinating and structuring a network for decentralized federated learning (DFL). Because a fully connected topology is impractical, as its number of edges grows quadratically with the number of nodes, D-Cliques offers an intuitive alternative built from locally fully connected neighborhoods.
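
To make the edge-count argument concrete, here is a minimal Python sketch comparing a fully connected network with a clique-based one. The ring between cliques, the clique size of 10 and the function names are assumptions chosen for illustration; the inter-clique topology actually studied in the project is not described here.

```python
def fully_connected_edges(n: int) -> int:
    """Undirected edges when every node is linked to every other node."""
    return n * (n - 1) // 2


def clique_ring_edges(n: int, clique_size: int) -> int:
    """Edges when nodes form fully connected cliques that are joined in a ring.

    Illustrative assumption: n is a multiple of clique_size and neighbouring
    cliques are linked by exactly one edge.
    """
    if n % clique_size != 0:
        raise ValueError("n must be a multiple of clique_size")
    num_cliques = n // clique_size
    intra = num_cliques * fully_connected_edges(clique_size)
    # A ring over k cliques needs k edges for k >= 3, one edge for k == 2.
    inter = num_cliques if num_cliques >= 3 else num_cliques - 1
    return intra + inter


for n in (100, 1000):
    print(n, fully_connected_edges(n), clique_ring_edges(n, clique_size=10))
```

Under these assumptions the fully connected count grows with the square of the number of nodes, while the clique-based count grows linearly once the clique size is fixed.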

Tags: Federated Learning, Machine Learning

Towards Robust and Adaptable Diagnosis of Pneumonia from Chest X-ray Data

Visual Intelligence, 90/100, 2021
Artificial intelligence (AI) researchers and radiologists have recently reported AI systems that accurately diagnose pneumonia from chest X-ray images using deep neural networks when trained on a sufficiently large and homogeneous set of labelled images. However, systems trained by empirical risk minimization (ERM) remain far from robust and adaptable. In fact, ERM has no way of discarding environment-specific features, creating an alarming situation in which the systems appear accurate but fail when tested in new hospitals.

Tags: Computer Vision, GANs, Machine Learning, Medical, Self-Supervised

What is the role of the media coverage in explaining stock market fluctuations?

Applied Data Analysis class, 100/100
During this semester-long project, our group of four studied fluctuations in Apple's stock price and trading volume and their relationship with quotes in the media. Using the Yahoo API, the Quotebank dataset and web parsing, we show a visible and quantifiable correlation between media coverage of Apple and its stock price. The project contains a full introduction to the datasets, a study of meaningful events in Apple's recent history and a fitted time series prediction that takes the project's results into account. The results are presented on an interactive website, which required data visualization skills to tell our data story.

Tags: Data Science

Reinforcement Learning for moon landing in OpenGym

Artificial Neural Networks, 90/100, 2021

Tags: Machine Learning

Movie Recommendation System in Spark for Big Data

Systems for Data Science, 100/100, 2021
Developed and deployed a personalized movie recommendation system written in Scala with Spark. The recommender is modeled as an approximate k-NN system that efficiently predicts the best movie for a user across several machines. Its efficiency and economics are evaluated, and the model has been tested on datasets and benchmarks of multiple scales, such as MovieLens 10M.

Tags: Data Science

Robust Journey Planning for CFF Zurich

Labs in Data Science, 100/100, Best Class Project, 2021
Developed and deployed a journey planner using Spark to compute and visualize the best route and mode of transport in the Zurich transportation system. Using CFF data, we built a predictive model that efficiently solves the transportation problem. Given a desired arrival time, our route planner computes the fastest route between departure and arrival stops within a provided confidence tolerance expressed as interquartiles. For instance: “what route from A to B is the fastest at least Q% of the time if I want to arrive at B before instant T?” We used the Connection Scan Algorithm, handled data with Hive and PySpark, and presented results with ipywidgets.
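
As a rough illustration of the Connection Scan Algorithm mentioned above, the following Python sketch implements its basic deterministic earliest-arrival variant. It is not the project's planner: the confidence tolerance, transfer times and footpaths are left out, and the `Connection` type and the toy timetable are made up for the example.

```python
from typing import NamedTuple, Optional


class Connection(NamedTuple):
    dep_stop: str
    arr_stop: str
    dep_time: int  # minutes after midnight
    arr_time: int  # minutes after midnight


def earliest_arrival(connections: list[Connection], source: str,
                     target: str, start_time: int) -> Optional[int]:
    """Basic Connection Scan: one pass over connections sorted by departure."""
    inf = float("inf")
    best = {source: start_time}  # earliest known arrival time per stop
    for c in sorted(connections, key=lambda conn: conn.dep_time):
        # Later departures cannot improve on the arrival already found.
        if c.dep_time > best.get(target, inf):
            break
        # The connection is usable if we can be at its departure stop in time.
        reachable = best.get(c.dep_stop, inf) <= c.dep_time
        if reachable and c.arr_time < best.get(c.arr_stop, inf):
            best[c.arr_stop] = c.arr_time
    return best.get(target)  # None when the target cannot be reached


# Hypothetical timetable, times in minutes after midnight.
timetable = [
    Connection("A", "B", 480, 490),
    Connection("A", "C", 485, 530),
    Connection("B", "C", 488, 500),  # leaves B before we can arrive there
    Connection("B", "C", 495, 510),
]
print(earliest_arrival(timetable, "A", "C", start_time=480))
```

The single sorted scan is what makes the algorithm attractive for timetable data: no priority queue or graph structure is needed, only a list of connections ordered by departure time.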

Tags: Data Science

Study of California Highway Patrol data on 10 million records using Oracle SQL

Database Systems, 82/100, 2021
In this project we designed the database schema, implemented the relational schema, loaded the data into a DBMS, and evaluated and optimized queries using query plan analysis to study their performance impact in Oracle SQL. The dataset is a subset of California Highway Patrol data from the Statewide Integrated Traffic Records System (SWITRS), covering traffic collisions in the state of California in 2018. These reports were collected by the Highway Patrol and recorded electronically for archival and preservation, based on the forms filed by officers. This data is of great value to urban planners and scientists who want to analyze risks, identify prevalent factors or locations, and improve the traffic situation.

Tags: Data Science

Discussion on the viability of a modern Second Order Method in Non-Convex Optimization training a Deep Convolutional Neural Network

Optimization for ML, 100/100, 2021
Second-order algorithms are among the most powerful optimization algorithms, with superior convergence properties compared to first-order methods such as SGD and Adam. However, computing or approximating the curvature matrix can be very expensive in both per-iteration computation time and memory. In this study we analyze the practicality of using a state-of-the-art second-order method (AdaHessian) for non-convex optimization when training a deep convolutional neural network (ResNet18) on the MNIST database, compared with traditional first-order methods. The advantages and disadvantages of both approaches are discussed, and a final hybrid method combining the advantages of both is proposed.
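
The convergence gap between the two families can be seen on a toy problem. The Python sketch below minimizes an ill-conditioned quadratic with plain gradient descent and with a diagonal Newton step. It is only an illustration under simplifying assumptions: the curvature values, learning rate and function names are made up, the exact Hessian diagonal is used, and AdaHessian itself relies on a stochastic estimate of that diagonal combined with momentum rather than this exact step.

```python
# Toy objective: f(p) = 0.5 * sum(c_i * p_i**2), whose Hessian is diag(CURVATURE).
CURVATURE = [1.0, 100.0]  # ill-conditioned: curvature differs by a factor of 100
LEARNING_RATE = 0.01      # must stay below 2 / max(CURVATURE) for stability


def gradient(p: list[float]) -> list[float]:
    return [c * x for c, x in zip(CURVATURE, p)]


def gradient_descent_step(p: list[float]) -> list[float]:
    """First-order update: the same step size in every direction."""
    return [x - LEARNING_RATE * g for x, g in zip(p, gradient(p))]


def diagonal_newton_step(p: list[float]) -> list[float]:
    """Second-order update: each coordinate is rescaled by its curvature."""
    return [x - g / c for x, g, c in zip(p, gradient(p), CURVATURE)]


def steps_to_converge(update, start: list[float], tol: float = 1e-6,
                      max_steps: int = 100_000) -> int:
    """Count updates until every coordinate is within tol of the minimum at 0."""
    p = list(start)
    for step in range(max_steps):
        if max(abs(x) for x in p) < tol:
            return step
        p = update(p)
    return max_steps


print("gradient descent:", steps_to_converge(gradient_descent_step, [1.0, 1.0]))
print("diagonal Newton: ", steps_to_converge(diagonal_newton_step, [1.0, 1.0]))
```

On a quadratic the Newton-style step lands on the minimum almost immediately, while gradient descent is slowed down by the low-curvature direction. The price is that the curvature must be known or estimated, which is the cost discussed above.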

Tags: Machine Learning

2020

Detecting Rooftop Area for installing PV modules with Deep Learning

Machine Learning, 100/100, 2021
Won 2nd place for best project presentation; the work was presented at CISBAT 2021 and published in the Journal of Physics. With Prof. Castello, we worked on deep learning for the detection of rooftop area available for solar panel installation from satellite images. This computer vision problem relied on a U-Net neural network architecture implemented in PyTorch and reached state-of-the-art performance. You can find out more in the Publication section.

Tags: Computer Vision, Machine Learning

Higgs Boson Challenge: Prediction of generated Higgs Bosons

Machine Learning, 100/100, 2020
The task of the Challenge was to predict whether or not a collision event generates a Higgs boson, an elementary particle in the Standard Model of physics. Scientists cannot observe this particle directly, only its decay signature through different measurements. We analyzed, preprocessed and transformed those decay signatures and used them as features in our machine learning algorithms. We also present the 6 models that we built and how we selected the best one, with the best parameters, by combining grid search and cross-validation. Competing against 200 other teams, we ranked 7th on AIcrowd.

Tags: Data Science, Machine Learning

Study on flag memorization, comparing self-learned and taught memorization techniques.

Digital Education, 100/100, Best Class Project, 2021
This report is a study about memorisation. Memorisation is often required during schooling to master various skills. It speeds up a task by quickly retrieving information from memory instead of looking for it in an external knowledge source. This study compares two ways of having students memorize flags and countries, which are essentially word-image pairs. There are two groups: the first gets the help of a teacher, whereas the second is split into pairs of students who cooperate.

Tags: Data Science

Raphaël Attias
