THE CAUSALITY PROJECT: May 2022

Friday, May 27, 2022

Week 1 Day 4+5: Dive into the Because module and begin with the RCoT algorithm

Dive into Because Module

On Thursday we had our weekly team meeting. Roger continued his lecture on causality and the HPCC Systems causality toolkit. We shared our progress, and I presented a test result that did not meet expectations. After the meeting I explored the problem myself, which was also a good way to start understanding in detail how the Because module calculates conditional probability.

Problem description

The causal model is a typical inverted-V structure over IVA, IVB and IVC. In this structure, IVA and IVC should be independent given IVB.
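To make the expectation concrete, here is a minimal sketch that simulates an inverted-V structure and checks it numerically. It does not use the Because module. The equations (IVB drawn from a logistic distribution, IVA and IVC each equal to IVB plus logistic noise), the noise scales, and the helper names are my own assumptions for illustration. Partial correlation is used as a simple stand-in for a conditional independence test, and it only captures linear dependence.

```python
import math
import random


def logistic(rng, loc=0.0, scale=1.0):
    """Draw one logistic sample by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return loc + scale * math.log(u / (1.0 - u))


def corr(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    rxy, rxz, ryz = corr(xs, ys), corr(xs, zs), corr(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


rng = random.Random(0)
n = 20000
# Assumed model: IVB is the common cause of IVA and IVC.
ivb = [logistic(rng) for _ in range(n)]
iva = [b + logistic(rng, scale=0.5) for b in ivb]
ivc = [b + logistic(rng, scale=0.5) for b in ivb]

marginal = corr(iva, ivc)                  # IVA and IVC look strongly related
conditional = partial_corr(iva, ivc, ivb)  # the relation vanishes given IVB
print(f"corr(IVA, IVC)       = {marginal:.3f}")
print(f"corr(IVA, IVC | IVB) = {conditional:.3f}")
```

Under these assumptions the marginal correlation should be clearly positive while the partial correlation given IVB should be close to zero, which is the behaviour the test was expected to show.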

The result did not meet this expectation. Given IVB, the expectation of IVC was still strongly influenced by the value of IVA. I repeated the tests with data sizes of 100000 and 1000000, and the results consistently contradicted the expectation.

This problem made me want to know how the Because module performs conditional calculations.

Tackling the problem

I used debug mode to inspect the intermediate variables in probTest.py.

Specifically, I examined the line ivcGupper = ps.E('IVC', [('IVA', upper), 'IVB'], power=pwr), where upper=0.901. Since the power is set to 2, 5 values of IVB are sampled to approximate the conditionalized expectation, and the first iteration uses IVB=-0.009. Since IVC is determined by IVB plus logistic noise with a mean of 0, we should expect the distribution of IVC to have a mean around 0. We can check the resulting distribution:

The rest of the PDF is all 0. We can see that the distribution is skewed toward 1, which means it is strongly influenced by the value of IVA. After examining the SubSpace obtained for this condition, the problem became clear: a large portion of the IVC values is around 0.60. The reason is how SubSpace filters the data.

From my understanding, the data-generating process in this example is too deterministic (the noise is too small compared to the causal effect). This means that when IVA is at its upper-bound value, there is simply not enough data with IVB around 0. In this situation, the interval is expanded to acquire enough data, and that expansion skews the resulting distribution.
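The effect described above can be reproduced with a small sketch. This is not how SubSpace is implemented. The widening rule (multiply the window by 1.5 until enough points are found), the noise scale of 0.05, and the function name cond_mean_ivc are all assumptions made for illustration.

```python
import math
import random


def logistic(rng, loc=0.0, scale=1.0):
    """Draw one logistic sample by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return loc + scale * math.log(u / (1.0 - u))


rng = random.Random(1)
n = 100000
rows = []
for _ in range(n):
    b = logistic(rng)
    # Assumed model: very small noise, so IVA and IVC stay close to IVB.
    a = b + logistic(rng, scale=0.05)
    c = b + logistic(rng, scale=0.05)
    rows.append((a, b, c))


def cond_mean_ivc(rows, iva_target, ivb_target, half_width, min_points):
    """Estimate E[IVC | IVA ~ iva_target, IVB ~ ivb_target].

    The IVB window is widened until at least min_points rows match
    (a hypothetical rule, used only to illustrate the skew).
    """
    near_a = [r for r in rows if abs(r[0] - iva_target) <= half_width]
    if len(near_a) < min_points:
        raise ValueError("not enough rows near the IVA target")
    width = half_width
    while True:
        subset = [r for r in near_a if abs(r[1] - ivb_target) <= width]
        if len(subset) >= min_points:
            break
        width *= 1.5
    mean = sum(r[2] for r in subset) / len(subset)
    return mean, width, len(subset)


# Rows that satisfy both conditions with the original narrow window.
narrow_count = sum(
    1 for r in rows if abs(r[0] - 0.9) <= 0.05 and abs(r[1]) <= 0.05
)
estimate, final_width, used = cond_mean_ivc(rows, 0.9, 0.0, 0.05, 100)
print("rows in the narrow window:", narrow_count)
print("final IVB half-width:", round(final_width, 3), "rows used:", used)
print("estimated E[IVC | IVA~0.9, IVB~0]:", round(estimate, 3))
```

With noise this small, almost no rows have IVA near 0.9 and IVB near 0 at the same time, so the window has to grow until it reaches the rows where IVB is close to IVA. The estimate should then land far from 0, even though IVC depends only on IVB in the assumed model.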

Conclusion

Before this work, I did not really understand why calculating conditional probability is such a difficult problem. Now I am beginning to understand why it is hard: each added condition increases the dimensionality and rapidly decreases the number of data samples available. Traditional methods have a very hard time dealing with this.
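A quick way to see how fast the data runs out is to count the samples that survive as conditions are added. The sketch below assumes independent standard normal variables and a window of plus or minus 0.1 around 0 for each condition. Both choices are arbitrary and only meant to illustrate the trend.

```python
import random

rng = random.Random(2)
n = 100000
max_conditions = 4
half_width = 0.1  # assumed window: each conditioned variable within +/- 0.1 of 0

# One row per sample, one column per variable we might condition on.
samples = [[rng.gauss(0.0, 1.0) for _ in range(max_conditions)] for _ in range(n)]

counts = []
for k in range(max_conditions + 1):
    # Keep only the rows whose first k variables all fall inside the window.
    kept = sum(1 for row in samples if all(abs(v) <= half_width for v in row[:k]))
    counts.append(kept)
    print(f"{k} condition(s): {kept} samples left")
```

Each extra condition keeps only a small fraction of the previous rows, so the number of usable samples shrinks roughly geometrically with the number of conditions.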

Besides this, I also examined:

how the Because module calculates dependency
how the Because module calculates ACE and CDE
how the Causality submodule calculates the intervention formula

RCoT algorithm

I have begun to work on the RCoT algorithm:

Read the wiki about this work and watched the presentation given by Mayank.
Read the RKHS blog to get an intuitive understanding of RKHS (great article, very easy to understand).
Began reading the original paper on the RCoT algorithm.

Plan for next week

Understand the RCoT algorithm by reading the papers and code
Integrate the code with the Because module
Run some tests on the performance and accuracy of RCoT

Wednesday, May 25, 2022

Week 1 Day 2+3: First glance at the Because module

On Tuesday, I had a team meeting with Roger and Arun. In the meeting, Roger shared the basic concepts of causality and an overview of the HPCC Systems causality toolkit.

Other than that, over these two days I mainly worked on the test examples in the Because module.

Synth submodule: By running genTest.py and reading the code in gen_data.py, I learned the process of generating synthetic data and how to write a model description to make use of this powerful tool. It is the basis for the upcoming experiments.

Probability submodule: By running the test examples, I learned the capabilities of the main class, ProbSpace, such as calculating the probability or conditional probability for any combination of variables, approximating the distribution of variables, calculating basic statistics of a distribution, and plotting the PDF. Some details still need to be examined, such as how DP, JP and UP are used to calculate the conditional PDF, how dependency is calculated, and the effect of the power parameter.

Causality submodule: By running the test examples, I learned the capabilities of the main class, cGraph, which combines the hypothesized causal model with the data. ProbSpace is applied to the data for the statistical calculations. Graph algorithms are applied to the causal model to calculate causal formulas such as the do operation, and to deduce independence and dependence. It can calculate ACE, CDE and CIE, scan the data to find causal relationships, and validate the hypothesized causal model against the independence and dependence relationships between variables. Implementation details still need to be checked, such as how the scan over the graph is done and how dependence relationships are deduced from the graph.
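One standard way to deduce independence from a causal graph is d-separation. The sketch below is not taken from cGraph, and I have not yet checked how cGraph does it. It uses the textbook moralized-ancestral-graph method, and the function names and the dictionary-of-parents graph format are my own assumptions.

```python
from collections import deque


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(parents, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG `parents`.

    `parents` maps each node to a list of its parents.
    """
    given = set(given)
    keep = ancestors(parents, {x, y} | given)

    # Build the moral graph of the ancestral subgraph: undirected
    # parent-child edges, plus edges between parents of a common child.
    adj = {node: set() for node in keep}
    for child in keep:
        pars = [p for p in parents.get(child, ()) if p in keep]
        for p in pars:
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(pars):
            for q in pars[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)

    # x and y are d-separated if no path connects them once the
    # conditioning nodes are blocked.
    queue = deque([x])
    seen = {x}
    while queue:
        node = queue.popleft()
        if node == y:
            return False
        for nb in adj[node]:
            if nb not in seen and nb not in given:
                seen.add(nb)
                queue.append(nb)
    return True


# Inverted-V (common cause): IVB -> IVA and IVB -> IVC.
fork = {"IVA": ["IVB"], "IVC": ["IVB"], "IVB": []}
fork_marginal = d_separated(fork, "IVA", "IVC")
fork_given_b = d_separated(fork, "IVA", "IVC", ["IVB"])

# Collider, for contrast: A -> B <- C.
collider = {"A": [], "C": [], "B": ["A", "C"]}
collider_marginal = d_separated(collider, "A", "C")
collider_given_b = d_separated(collider, "A", "C", ["B"])

print(fork_marginal, fork_given_b, collider_marginal, collider_given_b)
```

For the inverted-V model, IVA and IVC are connected through IVB and become separated once IVB is given. The collider behaves the opposite way, which is why the direction of the edges matters when deducing independencies from a graph.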

Some questions came up while running the tests and will be discussed further. Some typos were also discovered in the code and will be corrected.

Monday, May 23, 2022

Week 1 Day 1: Setup

Today is the first day of my internship with LexisNexis.

For this internship, my main job is to improve the underlying causal inference algorithms so that they work on real datasets, evaluate alternative algorithms, assess the issues encountered, and develop new causal inference algorithms. My initial task is to read and understand the code of the Because module, which is a causal inference package, and to learn the causal inference algorithms it implements.

But first of all, I need to set everything up and build the environment for future work.

Today I mainly accomplished:

Had a short meeting with Lorraine and set up my email and Teams accounts.
Finished the RSG Phishing and Training Programs and the CDA-RSG Cyber Defense Onboarding Curriculum.
Had a kick-off meeting with Roger to discuss my initial work. Roger also shared some test examples with me to help me better understand the Because module.
Installed the Because module and ran some tests and experiments on the Synth submodule.
