Dive into the Because Module
On Thursday we had our weekly team meeting. Roger continued his lecture on causality and the HPCC Systems causality toolkit. We each shared our progress; in particular, I presented a test result that didn't meet expectations. After the meeting I explored the problem on my own, which also turned out to be a good way to understand in detail how the Because module calculates conditional probabilities.
Problem description
The causal model is a typical inverted-V structure over IVA, IVB, and IVC. It is easy to see that, given IVB, IVA and IVC should be independent.
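To make the setup concrete, here is a minimal sketch of one data-generating process consistent with that structure. The structural equations and coefficients are my assumptions for illustration (except that IVC is IVB plus mean-0 logistic noise, which is noted below); they are not necessarily the ones used in the actual test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed inverted-V structure: IVA <- IVB -> IVC.
IVB = rng.normal(0.0, 1.0, n)
IVA = IVB + rng.normal(0.0, 0.1, n)      # illustrative equation, small noise
IVC = IVB + rng.logistic(0.0, 0.1, n)    # IVB plus mean-0 logistic noise

# d-separation says IVA and IVC are independent given IVB. An empirical
# check inside a thin slice of IVB should show near-zero correlation.
mask = np.abs(IVB) < 0.05
print(np.corrcoef(IVA[mask], IVC[mask])[0, 1])
```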
The result did not meet that expectation: even given IVB, the expectation of IVC remained strongly influenced by the value of IVA. I repeated the test with data sizes of 100,000 and 1,000,000, and the results consistently contradicted the expectation.
This problem motivated me to find out how the Because module performs conditional calculations.
Tackling the problem
I ran probTest.py in debug mode and inspected the intermediate variables.
Specifically, I examined the line ivcGupper = ps.E('IVC', [('IVA', upper), 'IVB'], power=pwr), where upper = 0.901. Since power is set to 2, five IVB values are sampled to approximate the conditionalized expectation, and the first loop iteration uses IVB = -0.009. Since IVC is determined by IVB plus a logistic noise term with mean 0, we should expect the distribution of IVC here to have a mean around 0.
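As I currently understand it, the call approximates the expectation by discretizing the unbound conditioning variable (IVB) into a few representative points and averaging the fully conditioned estimates. The sketch below is my reconstruction on raw arrays; the function name, the quantile grid, and the tolerance are hypothetical, not the Because module's actual internals.

```python
import numpy as np

def approx_cond_expectation(data, target, fixed, fixed_val, sampled,
                            n_pts=5, tol=0.05):
    # Hypothetical reconstruction: marginalize `sampled` over a few
    # representative points (power=2 appears to trigger 5 of them),
    # estimating the expectation in each filtered subspace.
    pts = np.quantile(data[sampled], np.linspace(0.1, 0.9, n_pts))
    estimates, weights = [], []
    for b in pts:
        # Keep only records where both conditions hold within a tolerance.
        mask = (np.abs(data[fixed] - fixed_val) < tol) & \
               (np.abs(data[sampled] - b) < tol)
        if mask.sum() == 0:
            continue
        estimates.append(data[target][mask].mean())
        weights.append(mask.sum())
    return np.average(estimates, weights=weights)
```

Calling approx_cond_expectation(data, 'IVC', 'IVA', 0.901, 'IVB') would then mimic, very roughly, what the E() call above has to do.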
Checking the resulting distribution, the PDF is 0 everywhere outside a narrow band, and the distribution is skewed toward 1, heavily influenced by the value of IVA. After examining the SubSpace produced for this condition, the problem showed up: a large portion of the IVC values sit around 0.60. The reason lies in how SubSpace obtains its filtered data.
From my understanding, the data-generating process in this example is too deterministic (the noise is small relative to the causal effect). When IVA is at its upper-bound value, there is simply not enough data with IVB around 0, so the filter expands its interval to acquire enough samples, pulling in records whose IVB is far from the requested value. The end result is a skewed distribution.
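The mechanism is easy to reproduce in isolation. Below is a minimal sketch using the same illustrative structural equations as the earlier sketch; the expand-until-enough-samples rule (doubling the width) is an assumption for illustration, not Because's exact filtering logic.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
IVB = rng.normal(0.0, 1.0, n)
IVA = IVB + rng.normal(0.0, 0.1, n)       # same illustrative equations
IVC = IVB + rng.logistic(0.0, 0.1, n)

def expanding_mean(cond_mask, var, center, min_count=100, width=0.05):
    # Widen the window around `center` until enough samples survive.
    # The doubling rule is an illustrative assumption.
    while True:
        mask = cond_mask & (np.abs(var - center) < width)
        if mask.sum() >= min_count:
            return IVC[mask].mean(), width
        width *= 2.0

upper_mask = np.abs(IVA - 0.901) < 0.05   # condition: IVA near its upper value
mean_ivc, final_width = expanding_mean(upper_mask, IVB, -0.009)
print(mean_ivc, final_width)  # mean lands near IVA's level, nowhere near 0
```

Under these assumptions, the window around IVB = -0.009 has to grow far past its requested center before it collects 100 samples, and the records it finally admits all have IVB, and therefore IVC, near IVA's level rather than near 0.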
Conclusion
Before doing the work above, I didn't really understand why calculating conditional probabilities is such a difficult problem. Now I begin to see why: every condition adds a dimension and rapidly shrinks the number of usable samples, so traditional filter-and-count methods struggle badly.
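A quick sketch makes the collapse concrete (the variables and window width are made up): each additional condition keeps only the small fraction of the remaining data that falls inside its window, so sample counts shrink geometrically.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
X = rng.normal(size=(n, 4))   # four standard-normal variables (assumed)

mask = np.ones(n, dtype=bool)
for d in range(4):
    mask &= np.abs(X[:, d]) < 0.05   # condition variable d to be near 0
    print(f"after {d + 1} condition(s): {mask.sum()} samples remain")
# Each condition keeps only ~4% of the remaining data, so even a million
# samples collapse to roughly 40k, 1.6k, 60, and then just a handful.
```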
Besides this, I also examined:
- how the Because module calculates dependency
- how the Because module calculates ACE and CDE (a small adjustment-formula sketch follows this list)
- how the Causality submodule calculates the intervention formula
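For ACE, the standard identity (which I believe an intervention calculation reduces to when a back-door set is available) is E[Y | do(X=x)] = sum_z E[Y | x, z] P(z). Here is a self-contained toy sketch of that adjustment; the data and variable names are made up, and this is not the Because module's code.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
Z = rng.binomial(1, 0.5, n)                  # binary confounder (assumed)
X = rng.binomial(1, 0.2 + 0.6 * Z)           # treatment depends on Z
Y = 2.0 * X + 1.5 * Z + rng.normal(0, 1, n)  # true ACE = 2.0

def e_y_do(x):
    # Back-door adjustment: E[Y | do(X=x)] = sum_z E[Y | x, z] P(z)
    return sum(Y[(X == x) & (Z == z)].mean() * (Z == z).mean()
               for z in (0, 1))

ace = e_y_do(1) - e_y_do(0)
naive = Y[X == 1].mean() - Y[X == 0].mean()
print(f"adjusted ACE ~ {ace:.2f} (truth 2.0), naive ~ {naive:.2f}")
```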
RCoT algorithm
I have begun working on the RCoT algorithm:
- Read the wiki page about this work and watched the presentation given by Mayank.
- Read the RKHS blog post to get an intuitive understanding of RKHS (great article, very easy to understand!).
- Began reading the original RCoT paper (a sketch of the random-feature idea it relies on follows this list).
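From my reading so far, RCoT gets its speed by replacing exact kernel evaluations with random Fourier features in the Rahimi-Recht style. Below is a minimal sketch of that building block only, not RCoT itself; the feature count and bandwidth are illustrative choices.

```python
import numpy as np

def random_fourier_features(X, n_features=100, gamma=1.0, rng=None):
    # Approximate the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)
    # by the inner product z(x) . z(y) of random cosine features
    # (Rahimi & Recht). Parameter choices here are illustrative.
    rng = rng or np.random.default_rng(0)
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Quick sanity check: feature inner products track the exact kernel.
rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, n_features=2000, rng=rng)
exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(np.abs(Z @ Z.T - exact).max())  # small approximation error
```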
Plan for next week
- Understand the RCoT algorithm by reading the paper and the code
- Integrate the code with the Because module
- Run some tests on the performance and accuracy of RCoT