I am a Computer Science Ph.D. student at Simon Fraser University (SFU) advised by Ke Wang. The focus of my current research is on Natural Language Processing (NLP) and Differential Privacy (DP).
I was also fortunate to intern at Amazon Web Services (AWS) as an Applied Scientist with the SageMaker Deep Learning Team in Vancouver, Canada, mentored by Theodore Vasiloudis and managed by Vishaal Kapoor.
I have almost 10 years of experience as a Data Engineer, leading data solutions from ideation to production deployment, with a strong focus on business goals and KPIs.
Dec 27, 2022 | I am honored to have been nominated by SFU to attend the 10th Heidelberg Laureate Forum in Germany; only one student is nominated from the entire university. Attendees meet the recipients of the most prestigious awards in mathematics and computer science. |
Dec 26, 2022 | Paper "TEM: High Utility Metric Differential Privacy on Text" was accepted to SDM 2023. |
Aug 04, 2022 | I have presented my Thesis Proposal. |
Feb 03, 2022 | I have been awarded a student scholarship for AAAI-22. |
Dec 07, 2021 | I will be serving as a PC member for the research track of KDD 2022. |
Dec 01, 2021 | Paper "Incorporating Item Frequency for Differentially Private Set Union" was accepted to AAAI 2022. |
Oct 11, 2021 | Paper "Differentially Private Ensemble Classifiers for Data Streams" was accepted to WSDM 2022. |
Aug 19, 2021 | I will be serving as a PC member for the Fifteenth International Conference on Web Search and Data Mining (WSDM 2022). |
Jul 12, 2021 | Paper "Training Differentially Private Neural Networks With Lottery Tickets" was accepted to ESORICS 2021. |
Jul 08, 2021 | Paper "TEM: High Utility Metric Differential Privacy on Text" was accepted to the ICML 2021 Workshop on Theory and Practice of Differential Privacy. |
Jul 01, 2021 | Paper "BRR: Preserving Privacy of Text Data Efficiently on Device" was accepted to the ICML 2021 Workshop on Machine Learning for Data. |
Nov 30, 2020 | I will be serving as a PC member for the research track of KDD 2021. |
Nov 24, 2020 | I am giving a talk about differential privacy for the graduate data mining class at Simon Fraser University (SFU). |
Sep 21, 2020 | Our paper "BRR: Preserving Privacy of Text Data Efficiently on Device" received a best paper award at the Shareable NLP Technologies workshop at Amazon's internal ML conference (AMLC 2020)! |
May 14, 2020 | Paper "Differentially Private Top-k Selection via Stability on Unknown Domain" was accepted to UAI 2020! |
Feb 17, 2020 | I will be joining Amazon this summer as an Applied Scientist Intern! |
Dec 11, 2019 | I presented my current work on differentially private GANs at the BC AI Student Showcase 2019. |
Nov 24, 2019 | I have been selected as a student volunteer for NeurIPS 2019. |
Oct 01, 2019 | Poster on differentially private lottery tickets accepted to the Workshop on Machine Learning with Guarantees at NeurIPS 2019! |
Please see the complete list of publications here.
TEM: High Utility Metric Differential Privacy on Text Ricardo Silva Carvalho*, Theodore Vasiloudis, Oluwaseyi Feyisetan, Ke Wang --- * Work done during an internship at Amazon SIAM International Conference on Data Mining (SDM 2023) Workshop on Theory and Practice of Differential Privacy (TPDP 2021) at ICML 2021 Paper Poster Amazon Science |
Incorporating Item Frequency for Differentially Private Set Union Ricardo Silva Carvalho, Ke Wang, Lovedeep Gondara AAAI Conference on Artificial Intelligence (AAAI 2022) Paper Poster Code |
Differentially Private Ensemble Classifiers for Data Streams Lovedeep Gondara, Ke Wang, Ricardo Silva Carvalho ACM International WSDM Conference (WSDM 2022) Paper Code |
BRR: Preserving Privacy of Text Data Efficiently on Device Ricardo Silva Carvalho*, Theodore Vasiloudis, Oluwaseyi Feyisetan --- * Work done during an internship at Amazon Workshop on Machine Learning for Data (ML4Data 2021) at ICML 2021 Workshop on Shareable NLP Technologies at Amazon's Machine Learning Conference (AMLC 2020) - best paper award Paper Poster Amazon Science |
Differentially Private Top-k Selection via Stability on Unknown Domain Ricardo Silva Carvalho, Ke Wang, Lovedeep Gondara, Chunyan Miao Conference on Uncertainty in Artificial Intelligence (UAI 2020) Paper Supplementary File Video Code |
Training Differentially Private Neural Networks With Lottery Tickets Lovedeep Gondara, Ricardo Silva Carvalho, Ke Wang European Symposium on Research in Computer Security (ESORICS 2021) Workshop on Machine Learning with Guarantees at NeurIPS 2019 Paper Extended Abstract |
Privacy-Preserving Publication of Sensitive Data using Differentially Private GANs Ricardo Silva Carvalho BC AI Student Showcase 2019 Paper Poster Code |
Project Reviewer | Udacity | Reviewer of 7 projects on the Data Analyst and Data Scientist nanodegrees, involving hypothesis testing with digital advertising, querying databases, exploratory analysis, and machine learning. |
Technical Writer | Medium.com | Towards Data Science | Writer of technical content related to data science. The most popular post has almost 40,000 views. |
Open Source Contributor | Privacy libraries | Contributor to PipelineDP by Google and OpenMined. |
Conference Reviewer (PC Member) | Multiple AI/ML/DM conferences | Reviewer or PC member for AI/ML/DM conferences, such as KDD 2022, AISTATS 2022, WSDM 2022, ICDM 2021, KDD 2021, WSDM 2021, IEEE BigData 2020, KDD 2020, ICDM 2020, ICDE 2020. |
Teaching Assistant | Data Mining at SFU | Sole Teaching Assistant (TA) for the Data Mining course at Simon Fraser University (SFU), with more than 80 undergraduate students enrolled. |
To answer SQL queries to a database while satisfying Differential Privacy (DP), my research focused on the ubiquitous problem of performing GROUP BY queries - more specifically the privacy aspects of releasing the partitions of such aggregate queries.
Consider the following table tbl_patients with sensitive data about patients, and suppose you want to perform the following SQL query:
The query above will group the results by "diagnosis", which are our partitions, and then count the number of "user_id" per partition.
If we use standard DP mechanisms to answer this query, we would basically add noise to the total counts of each partition/diagnosis, and release e.g. the following result on the right:
(shown just for illustration, without differentiating between the added noise and the true count)
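As a minimal sketch of this standard approach, we can add Laplace noise to each partition's count. Here the rows, diagnoses, and epsilon value are made up for illustration, and we assume each user appears in at most one row, so every count has L1 sensitivity 1.

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of the Laplace(0, scale) distribution.
    u = rng.random() - 0.5
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def noisy_group_by_counts(rows, epsilon, rng):
    # rows: list of (user_id, diagnosis). Assuming each user appears in at
    # most one row, each partition count has L1 sensitivity 1, so Laplace
    # noise with scale 1/epsilon suffices.
    counts = {}
    for _, diagnosis in rows:
        counts[diagnosis] = counts.get(diagnosis, 0) + 1
    return {d: c + laplace_noise(1.0 / epsilon, rng) for d, c in counts.items()}

rows = [(1, "flu"), (2, "flu"), (3, "cancer"), (4, "hiv")]
result = noisy_group_by_counts(rows, epsilon=1.0, rng=random.Random(0))
```

Note that the keys of the released dictionary are exactly the diagnoses present in the data.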
What is the problem with releasing this result on the right?
How do we solve this problem?
Consider a dataset/table collected from users, where each user contributes a set of items.
In Differentially Private Set Union (DPSU), our goal is to output as many of the items present in the dataset/table as possible, while satisfying DP.
In other words, we want a DP mechanism that releases a large subset of the union of the users' items.
Previous work related to DPSU [1, 2] designed mechanisms with two phases: first, build a weighted histogram of the items, where each user contributes a bounded total weight to the items they hold; second, add noise to the items' weights and release every item whose noisy weight exceeds a threshold.
Our work introduced the following improvements over previous DPSU mechanisms:
In experiments, compared to the previous state-of-the-art in DPSU, our mechanisms had utility improvements of up to 25%.
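The two-phase template above can be sketched as follows. This is an illustrative baseline, not our paper's exact mechanism: each user splits a unit weight budget across their items (bounding any single user's influence), Laplace noise is added to each item's weight, and only items above a threshold are released. All parameter values are hypothetical.

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of the Laplace(0, scale) distribution.
    u = rng.random() - 0.5
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_set_union(user_items, epsilon, threshold, rng):
    # Phase 1: weighted histogram with bounded user contribution.
    # Each user splits a total budget of 1.0 evenly across their items.
    weights = {}
    for items in user_items:
        if not items:
            continue
        share = 1.0 / len(items)
        for item in items:
            weights[item] = weights.get(item, 0.0) + share
    # Phase 2: add noise and release items above the threshold.
    released = set()
    for item, w in weights.items():
        if w + laplace_noise(1.0 / epsilon, rng) > threshold:
            released.add(item)
    return released

users = [{"cat", "dog"}, {"dog"}, {"dog", "bird"}, {"dog"}]
out = dp_set_union(users, epsilon=2.0, threshold=1.0, rng=random.Random(7))
```

A frequent item like "dog" accumulates a large weight and is released with high probability, while rare items are likely suppressed by the threshold.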
In this topic, my research focuses on making DP top-k selection viable and easy to use as part of scalable systems.
Consider a dataset from users, where each user has a set of binary elements, with a value of 1 indicating the presence of an element. Moreover, we define a "score" for every element, denoting its importance in the dataset, as the sum of the element's values over all users.
Below we see an example of such a dataset:
Elements can be for example:
The histogram above on the right then shows the score of each element (i.e., the sum of each column).
There are a few options for DP selection mechanisms, with the Exponential Mechanism (EM) [1] being one of the most used.
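A minimal sketch of the Exponential Mechanism for selecting a single element is shown below. The tiny binary dataset is made up; scores are the column sums described above, and they have sensitivity 1, since adding or removing one user changes each sum by at most 1.

```python
import math
import random

def exponential_mechanism(scores, epsilon, sensitivity, rng):
    # scores: dict element -> score. Each element is sampled with
    # probability proportional to exp(epsilon * score / (2 * sensitivity)).
    # Subtracting the max score before exponentiating avoids overflow
    # and does not change the probabilities.
    elements = list(scores)
    max_score = max(scores.values())
    weights = [math.exp(epsilon * (scores[e] - max_score) / (2 * sensitivity))
               for e in elements]
    return rng.choices(elements, weights=weights, k=1)[0]

# Binary user data: each row is one user, columns are elements A, B, C.
data = [
    {"A": 1, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 0, "C": 0},
]
scores = {e: sum(user[e] for user in data) for e in data[0]}  # column sums
choice = exponential_mechanism(scores, epsilon=2.0, sensitivity=1.0,
                               rng=random.Random(0))
```

Higher-scoring elements are exponentially more likely to be selected, and the larger epsilon is, the more the output concentrates on the true maximum.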
Now consider a dataset with 1 million elements to select the top-10. How do we do this with DP?
How can we improve the computational efficiency of DP top-k selection?
Restricted domain top-k mechanisms select k elements by choosing from the elements that have the top-
What is the problem with DP top-k selection on Restricted domain?
How can we solve this problem?
Our "framework" for differentially private top-k selection aims to be highly practical.
The framework can be built on top of any regular database system without internal modification, leveraging existing optimized data infrastructure. See the image below.
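As a sketch of this idea, an unmodified database can compute the few highest-scoring candidates with an ordinary aggregation query, and the DP selection step then runs only over those candidates. The schema, data, and the simple noisy-max step below are illustrative assumptions, not our paper's exact mechanism.

```python
import math
import random
import sqlite3

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of the Laplace(0, scale) distribution.
    u = rng.random() - 0.5
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

# Hypothetical table: one row per (user_id, element) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_elements (user_id INTEGER, element TEXT)")
conn.executemany("INSERT INTO tbl_elements VALUES (?, ?)",
                 [(1, "A"), (1, "C"), (2, "A"), (2, "B"), (3, "A")])

# Step 1: the unmodified database returns the top candidates by score,
# using its own optimized aggregation machinery.
candidates = conn.execute(
    "SELECT element, COUNT(user_id) AS score FROM tbl_elements "
    "GROUP BY element ORDER BY score DESC LIMIT 2").fetchall()

# Step 2: a DP selection step runs only over these few candidates,
# here by adding Laplace noise to each score and taking the noisy maximum.
rng = random.Random(0)
epsilon = 1.0
noisy = [(score + laplace_noise(1.0 / epsilon, rng), element)
         for element, score in candidates]
selected = max(noisy)[1]
```

The database never needs internal changes: it only answers a standard ORDER BY ... LIMIT query, and the privacy logic lives entirely in the small selection step on top.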
Aside from being practical, our framework also has provably better utility compared to previous restricted-domain mechanisms [2].
In summary, our framework has FOUR main advantages that make it highly practical:
In Natural Language Processing (NLP), my research focuses on improving the utility of DP text generation algorithms.
Consider a dataset of users, where each user has a set of words.
Below we see an example of such a dataset:
Thus, in this setting, we are protecting users at the individual word level.
It is no surprise that disclosing just a single word is enough to impact a user's privacy, as it may represent a password, a disease diagnosis, or a political preference. So this is a really important problem to address.
For text privatization, standard DP approaches do not work well in the context above.
To deal with this issue, previous research [1] has proposed two modifications:
See below an example of word embeddings where each user's data is a word with their home city.
Using mDP, previous methods [1] generally work with three main steps as we see in the image below.
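Assuming the embed-perturb-project pipeline commonly used in this line of work, the approach can be sketched as follows. The tiny 2-D embeddings are made up, and Gaussian noise stands in for the planar Laplace noise of the mDP literature, purely for illustration.

```python
import math
import random

# Hypothetical 2-D embeddings for a few city names.
embeddings = {
    "vancouver": (0.0, 0.0),
    "seattle":   (0.2, 0.1),
    "london":    (5.0, 4.0),
}

def privatize_word(word, epsilon, rng):
    # Step 1: look up the word's embedding vector.
    x, y = embeddings[word]
    # Step 2: perturb the vector with noise whose scale is calibrated to
    # epsilon (Gaussian here only for illustration).
    scale = 1.0 / epsilon
    noisy = (x + rng.gauss(0, scale), y + rng.gauss(0, scale))
    # Step 3: release the word whose embedding is nearest to the noisy point.
    return min(embeddings, key=lambda w: math.dist(embeddings[w], noisy))

out = privatize_word("vancouver", epsilon=5.0, rng=random.Random(1))
```

Nearby words (here "seattle") are likely outputs, while distant words (here "london") are unlikely, which is exactly the distance-scaled guarantee of metric DP.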
What are the problems with the privatization approach above?
How do we solve the problems above?
Our approach to text privatization is completely different from previous work: we pose the problem as a selection task.
See the following image as an example:
Moreover, as usual in selection problems, we do not release the scores, only the selected element.
As mentioned above, we also use binary embeddings to represent the words.
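A minimal sketch of this selection view (not the exact TEM mechanism): score every candidate word by the negative Hamming distance between its binary embedding and the input word's embedding, then select one word with the Exponential Mechanism. The tiny vocabulary and its binary embeddings are made up.

```python
import math
import random

# Hypothetical binary embeddings for a tiny vocabulary.
vocab = {
    "vancouver": (1, 0, 1, 1),
    "seattle":   (1, 0, 1, 0),
    "london":    (0, 1, 0, 0),
}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def select_word(word, epsilon, rng):
    # Score each candidate by negative distance to the input word; only
    # the selected word is released, never the scores themselves.
    scores = {w: -hamming(vocab[word], emb) for w, emb in vocab.items()}
    words = list(scores)
    max_score = max(scores.values())
    # Exponential Mechanism: probability proportional to
    # exp(epsilon * score / 2); shift by the max for numerical stability.
    weights = [math.exp(epsilon * (scores[w] - max_score) / 2) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

out = select_word("vancouver", epsilon=4.0, rng=random.Random(0))
```

Words close to the input in embedding space are exponentially more likely to be selected, so utility degrades gracefully with distance.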
Therefore, our approach has the following THREE main practical advantages:
Under construction! Come back soon.