Data Science Projects

Vision Transformers for Analyzing High-Resolution Pathology Images

Master Thesis
Transformers are powerful models that can capture long-range dependencies in data using an attention mechanism. However, applying transformers to medical problems poses challenges due to the high resolution and complexity of pathology images, as well as the scarcity and noise of labels. In this thesis, we provide a comprehensive overview of state-of-the-art transformer methods for analyzing high-resolution pathology images, using the glioblastoma dataset IvyGAP and a renal cancer dataset as case studies. We discuss how to pre-train transformers using self-supervised learning methods that learn useful representations from unlabeled data. We also explore how to find regions of interest within a whole slide image using different levels of supervision: self-supervised, weakly supervised and strongly supervised. We demonstrate that transformers can provide a semantic understanding of the data and outperform convolution-based models on downstream tasks such as classification and segmentation. Finally, we employ posterior networks to estimate the aleatoric and epistemic uncertainty of Vision Transformer (ViT) predictions and evaluate their usefulness for clinical decision making. We propose a novel method to filter potentially mislabeled data in order to make more accurate and confident predictions. Our experiments show that our methods achieve state-of-the-art results on various glioblastoma tasks and provide meaningful insights into the behavior and limitations of ViTs for medical machine learning.

What is the role of the media coverage in explaining stock market fluctuations?

Applied Data Analysis class, 100/100
During this semester-long project, our group of four studied fluctuations in Apple's stock price and trading volume and their relationship with quotes in the media. Using the Yahoo API, the Quotebank dataset and web parsing, we show a visible and quantifiable correlation between media coverage of Apple and the price of its stock. The project contains a full introduction to the datasets, a study of meaningful events in Apple's recent history and a fitted time series prediction that takes the project's multiple results into account. The results are presented on an interactive website, which required data visualization skills to tell our data story.

Movie Recommendation System in Spark for Big Data

Systems for Data Science, 100/100, 2021
Developed and deployed a movie recommendation system in Scala with Spark. The personalized recommender is modeled as an approximate k-NN system that efficiently predicts the best movie for a user across several machines. Its efficiency and economics are evaluated, and the model has been tested on datasets and benchmarks of multiple scales, such as MovieLens 10M.
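
To illustrate the idea behind a k-NN recommender, here is a minimal single-machine sketch in Python with NumPy. It is an exact (not approximate) user-based variant and is not the project's Scala/Spark code; the function name `knn_predict`, the cosine similarity measure and the "0 means unrated" convention are assumptions made for this example.

```python
import numpy as np

def knn_predict(ratings, user, item, k=2):
    """Predict a rating from the k most similar users who rated the item.

    ratings: 2-D array (users x items); 0 means "not rated" (assumed convention).
    """
    ratings = np.asarray(ratings, dtype=float)

    # Cosine similarity between the target user and every user.
    norms = np.linalg.norm(ratings, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for users with no ratings
    sims = ratings @ ratings[user] / (norms * norms[user])

    # Only other users who actually rated the item can be neighbours.
    candidates = np.where(ratings[:, item] > 0)[0]
    candidates = candidates[candidates != user]
    if candidates.size == 0:
        # Nobody rated the item: fall back to the global mean rating.
        return float(ratings[ratings > 0].mean())

    # Pick the k candidates with the highest similarity.
    top = candidates[np.argsort(sims[candidates])[::-1][:k]]
    weights = np.clip(sims[top], 0.0, None)
    if weights.sum() == 0:
        return float(ratings[top, item].mean())

    # Similarity-weighted average of the neighbours' ratings.
    return float(weights @ ratings[top, item] / weights.sum())
```

An approximate, distributed version would typically avoid computing all pairwise similarities, for example by partitioning users across machines and searching neighbours only within a partition.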

Robust Journey Planning for CFF Zurich

Labs in Data Science, 100/100, Best Class Project, 2021
Developed and deployed a journey planner using Spark to compute and visualize the best route and mode of transport in the Zurich transportation system. Using CFF data, we built a predictive model that efficiently solves the journey planning problem. Given a desired arrival time, our route planner computes the fastest route between departure and arrival stops within a provided confidence tolerance expressed as interquartiles. For instance: “What route from A to B is the fastest at least Q% of the time if I want to arrive at B before instant T?” We used the Connection Scan Algorithm, handled data with Hive and PySpark, and presented results with ipywidgets.
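
As a rough illustration of the Connection Scan Algorithm, the sketch below implements its basic earliest-arrival form in plain Python. It ignores delays, confidence levels and transfer times, and it scans forward from a departure time rather than backward from a desired arrival time as the project did; the `Connection` tuple and the function name are hypothetical, and times are assumed to be minutes since midnight.

```python
import math
from collections import namedtuple

# One timetable entry: a vehicle going from dep_stop to arr_stop without stopping.
Connection = namedtuple("Connection", "dep_stop arr_stop dep_time arr_time")

def earliest_arrival(connections, source, target, start_time):
    """Return (arrival_time, list_of_connections), or (None, []) if unreachable."""
    earliest = {source: start_time}  # best known arrival time per stop
    in_connection = {}               # connection used to reach each stop

    # Single pass over all connections in order of departure time.
    for c in sorted(connections, key=lambda c: c.dep_time):
        reachable = c.dep_time >= earliest.get(c.dep_stop, math.inf)
        improves = c.arr_time < earliest.get(c.arr_stop, math.inf)
        if reachable and improves:
            earliest[c.arr_stop] = c.arr_time
            in_connection[c.arr_stop] = c

    if target not in in_connection:
        return None, []

    # Walk backward from the target to rebuild the journey.
    path, stop = [], target
    while stop != source:
        c = in_connection[stop]
        path.append(c)
        stop = c.dep_stop
    path.reverse()
    return earliest[target], path
```

A probabilistic planner could extend the `reachable` test so that a transfer is accepted only if it succeeds with at least the requested confidence under a delay model, which this sketch does not attempt.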

Study of 10 million California Highway Patrol records using Oracle SQL

Database Systems, 82/100, 2021
In this project, we design the database schema, implement the relational schema, load the data into a DBMS, and evaluate and optimize queries using query plan analysis to assess their performance impact in Oracle SQL. The dataset contains a subset of data from the California Highway Patrol collected from the Statewide Integrated Traffic Records System (SWITRS), covering traffic collisions in the state of California in 2018. These reports were collected by the Highway Patrol and recorded electronically for archival and preservation, based on the forms filed by officers. This data has immense value for urban planners and scientists who want to analyze risks, identify which factors or locations are prevalent, and improve the traffic situation.

Higgs Boson Challenge: Prediction of generated Higgs Bosons

Machine Learning, 100/100, 2020
The task of the challenge was to predict whether or not a collision event generates a Higgs boson, an elementary particle in the Standard Model of physics. Scientists cannot observe this particle directly, only its decay signature through different measurements. We analyzed, preprocessed and transformed those decay signatures and used them as features in our machine learning algorithms. Furthermore, we present the 6 models that we built and how we selected the best one, with the best parameters, by combining grid search and cross-validation. Competing against 200 other teams, we reached 7th place on AICrowd.
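
Combining grid search with cross-validation can be sketched as follows in Python with NumPy. This is an illustrative example, not the project's code: the choice of ridge regression as the classifier, labels in {-1, +1}, the lambda grid and all function names are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression weights for penalty lam."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, k=5, seed=0):
    """Mean misclassification rate over k folds (labels assumed in {-1, +1})."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        pred = np.sign(X[test] @ w)  # classify by the sign of the linear score
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))

def grid_search(X, y, lambdas, k=5):
    """Evaluate every candidate lambda with k-fold CV and return the best one."""
    scores = {lam: cv_error(X, y, lam, k) for lam in lambdas}
    best = min(scores, key=scores.get)
    return best, scores
```

Using the same fold split for every candidate (here via a fixed seed) keeps the comparison between hyperparameter values fair.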

Study on flag memorization: comparing self-learned and taught memorization techniques

Digital Education, 100/100, Best Class Project, 2021
This report is a study of memorization. Memorization is often required during schooling to master various skills; it speeds up a task by quickly retrieving information from memory instead of looking it up in an external knowledge source. This study specifically compares two ways of having students memorize flags and countries, which are essentially word-image pairs. There are two groups: the first gets the help of a teacher, whereas the second is split into pairs of students who cooperate.
