Friday, June 24, 2022

Week 5: Still RCoT and LiNGAM

 RCoT


Last week I integrated RCoT into the Because module, and Roger finished his work on the traditional probability method for conditional probability. Now it is time to compare their performance.

RCoT vs Prob (Traditional Probability Method)


I ran the experiments on two versions of the test datasets, both based on the original datasets. The first is simply the original set. In the second, I added more non-linear relationships to the simple cases and removed the non-linear relationships from the complex case.

I tried different values of Power, the main parameter for Prob, and used the default settings for RCoT, since last week's experiments showed that the results do not improve when the number of Fourier Features is increased.

The results showed that:
  • For the Prob method, increasing Power does not change whether a relation is judged dependent; it only slightly improves the margin between dependent and independent cases, and it increases the running time linearly.
  • The results for RCoT are more accurate than Prob in most cases, and the margin is also larger.
  • The running time for RCoT is similar to that of Prob with Power set to 1, so Prob is essentially always slower than RCoT.
Since RCoT showed better performance on both accuracy and running time, we have set the default dependence method to RCoT.

Backward Compatibility with Older Versions of Python


The original implementation of the lgb4 algorithm used the math.comb function to calculate the number of combinations. However, math.comb is only available in Python 3.8 and higher.

Therefore, I modified it to support older versions of Python.
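
A minimal sketch of the kind of fallback involved, assuming a factorial-based replacement (the exact modification in lgb4 may differ):

    # math.comb requires Python >= 3.8; a factorial-based helper works on
    # older interpreters. Illustrative sketch, not the exact lgb4 code.
    from math import factorial

    def comb(n, k):
        """Number of ways to choose k items from n without repetition."""
        if k < 0 or k > n:
            return 0
        return factorial(n) // (factorial(k) * factorial(n - k))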

LiNGAM


Non-Gaussian Problem


It has puzzled Roger and me for a long time why LiNGAM does not work with normally distributed noise. This week I looked into it and worked out a reasonable answer to the problem with a mathematical proof.

The basic idea (a condensed derivation follows the list) is:
  1. We only consider the exact case in LiNGAM, which means the relations between all variables are linear. In this proof we only consider two variables where one causes the other linearly, as in y = w*x + noise().
  2. Intuitively, we might expect that running a linear regression of x on y gives a coefficient of 1/w, but we can actually prove that the coefficient is w/(1+w^2) (assuming x and the noise both have unit variance).
  3. We can then explicitly write out the residual and y, and prove that the residual and y are independent if and only if the noise for both x and y comes from a normal distribution.
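
A condensed version of the derivation, under my added assumption that x and the noise e are zero-mean with unit variance:

    y = w x + e, \qquad
    b = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(y)}
      = \frac{w \sigma_x^2}{w^2 \sigma_x^2 + \sigma_e^2}
      = \frac{w}{1 + w^2}

    r = x - b y = (1 - b w)\, x - b\, e

Both r and y = w*x + e are linear combinations of the independent variables x and e with nonzero coefficients, so by the Darmois-Skitovich theorem they can only be independent when x and e are Gaussian. With any non-Gaussian noise, r and y remain dependent, which is exactly the asymmetry LiNGAM exploits to identify the direction.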

Develop New Direct Test Program and Data Model


I developed a new direct test program this week, which supports running the same method many times in one execution. Each run generates a new dataset from the model provided. At the end, the program reports the mean results for both the forward and backward directions and the margin between them. We expect the result for the forward direction to be around 0.5 and the result for the backward direction to be close to 0; a better method should have a larger margin between them. A sketch of the loop follows.
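
A minimal sketch of the program's main loop, where model.generate and test_direction are illustrative stand-ins for the actual data model and test method:

    # Hedged sketch of the direct test loop (names are illustrative).
    # Each run draws a fresh dataset and tests both directions.
    import numpy as np

    def direct_test(model, method, runs=10, n=1000):
        forward, backward = [], []
        for _ in range(runs):
            x, y = model.generate(n)                       # new dataset per run
            forward.append(test_direction(x, y, method))   # expect ~0.5
            backward.append(test_direction(y, x, method))  # expect ~0
        f, b = np.mean(forward), np.mean(backward)
        return f, b, f - b  # larger margin means a better method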

The new data model (listing not reproduced here) basically contains the following test cases; illustrative generators follow the list:
  • linear relations with normal and non-normal noise
  • many kinds of non-linear relations
  • noise dependent on the variable
  • reversed versions of relations
  • linear and non-linear combinations of multiple variables
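
The following hypothetical generators illustrate the kinds of cases above (my illustration, not the actual data model):

    # Illustrative generators only; the real data model is omitted here.
    import numpy as np

    rng = np.random.default_rng()
    n = 1000

    x     = rng.normal(size=n)                     # exogenous, normal noise
    y_lin = 2 * x + rng.logistic(size=n)           # linear, non-normal noise
    y_sq  = x ** 2 + rng.normal(size=n)            # non-linear relation
    y_dep = x + np.abs(x) * rng.normal(size=n)     # noise dependent on variable
    z     = np.tanh(x) + 0.5 * y_lin + rng.normal(size=n)  # multiple parents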

Experiment on Different Regression Methods

The different methods tested and their running times were recorded in a table (not reproduced here).

I also ran all methods on 1000 samples with 10 runs to check the accuracy. The conclusions are:
  • LinearRegression, or any kind of linear model, is strictly limited to linear relations, as expected. Though these models are the fastest, their limitations make them very hard to apply in practice.
  • The other non-linear regression models share very similar results. They all fail on the (N, N3) pair, where none of them can fit well due to its extreme values around 0. They fail on (M, M4), which is expected since the noise there depends on M. (IVB, IVA) is incorrect since the noises are normal. The result for (M, M6) is reversed, which is exactly what we expected. The only difference is on (EXP, EXP5), which is complex; some models give correct results there.
  • The running time varies a lot. Overall, I think Ridge Regression with Random Fourier Features has the best speed and tests the results correctly as expected (see the sketch below).
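
A minimal sketch of what ridge regression on random Fourier features looks like, using sklearn's RBFSampler as the feature map (my illustration, not the project code):

    # RBFSampler approximates an RBF kernel's feature map with random
    # Fourier features; Ridge then fits a linear model on those features.
    import numpy as np
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    x = rng.normal(size=(1000, 1))
    y = np.tanh(2 * x[:, 0]) + 0.1 * rng.normal(size=1000)  # non-linear relation

    model = make_pipeline(
        RBFSampler(gamma=1.0, n_components=100, random_state=0),
        Ridge(alpha=1.0),
    )
    model.fit(x, y)
    residual = y - model.predict(x)  # this residual then goes to the independence test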

Non-linear Regression


In the LiNGAM algorithm, when we want to extend the algorithm to non-linear relations, we must use non-linear regression methods. Non-linear regression is also required for the future work on using machine learning to estimate conditional expectations. Therefore, it is worth understanding the non-linear models, how they learn from data, and what properties they have, in order to choose the best one among them.

So I will start learning the basic ideas behind them and take some notes here.

Kernel Ridge Regression


KRR (Kernel Ridge Regression), as its name suggests, combines a kernel function with ridge regression. It first uses the kernel function to map the original data into a feature space, then uses ridge regression to learn a linear function with an L2 penalty. With a non-linear kernel, we get a non-linear function in the original space.
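
For reference, the KRR solution has a simple closed form (a standard textbook result, not specific to this project). With kernel matrix K, targets y, and ridge penalty \lambda, the dual coefficients and the prediction are:

    \alpha = (K + \lambda I)^{-1} y, \qquad
    \hat{f}(x) = \sum_{i=1}^{n} \alpha_i \, k(x_i, x)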

So KRR is in fact very similar to SVR (Support Vector Regression). The main difference is the loss function: KRR uses the traditional least-squared error, while SVR uses an e-insensitive loss, which ignores cases where the error is less than a threshold e.

Therefore, KRR trains faster than SVR, while SVR predicts faster since it only uses part of the training samples (the support vectors) to predict. (The KRR implemented in sklearn and the one in the rff implementation are different, which may explain the different running times.)

Support Vector Regression


The defining property of SVR is that it ignores cases whose prediction error is less than a threshold e, and only tries to fit the hard cases.
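
Written out, the e-insensitive loss is:

    L_\varepsilon(y, \hat{f}(x)) = \max(0, \; |y - \hat{f}(x)| - \varepsilon)

Samples with error below \varepsilon contribute zero loss, so only the hard samples end up as support vectors, which is also why SVR can predict using only part of the training set.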

Plan for Next Week

  • Keep working on LiNGAM to compare the non-linear methods and their boundaries.
  • Start on the machine learning method for estimating conditional expectations.

Friday, June 17, 2022

Week 4: Finish RCoT and Begin with DirectLiNGAM

 1 RCoT algorithm


1.1 Experiment and deal with R parameter


Last week, I ran some experiments on the R parameter, and the results went completely against our intuition. Therefore, my first job this week was to examine why these results occurred.

It turns out that the only use of the parameter R is to determine how many samples are used to calculate the median of the distances between all samples. When R covers all samples, we get the actual median of the distances. When R is less than the total number of samples, we get an approximation of the median based on R samples. The median of the distances is used to determine the scale of the parameter W in the Random Fourier Features: larger distances result in a smaller W, adjusting the frequency of the cosine functions so the generated Fourier Features fit the scale of the data.

The catch is that we normalize every variable first. Therefore, the median of the distances is always close to 1; in experiments, it varied from about 0.8 to 1.2. So there is no benefit to setting a larger R: it does not improve the approximation of the median of the distances and only increases the running time.

What's more, I could also directly set the approximation to 1. In experiments, this did not impact the results but improved the running time a lot. In fact, calculating the distances is the most time-consuming part of the algorithm; when I removed it, the algorithm's running time decreased dramatically.

In the current version of the RCoT algorithm, the approximation is set to 1.
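
A sketch of the simplification, with illustrative names (not the actual RCoT code):

    # After normalization, the median pairwise distance is always ~1,
    # so the R-sample median estimate can be replaced by a constant.
    import numpy as np

    def normalize(x):
        return (x - x.mean(axis=0)) / x.std(axis=0)

    def median_distance(x, r):
        idx = np.random.choice(len(x), size=min(r, len(x)), replace=False)
        sub = x[idx]
        d = np.sqrt(((sub[:, None] - sub[None, :]) ** 2).sum(axis=-1))
        return np.median(d[np.triu_indices_from(d, k=1)])

    sigma = 1.0  # replaces median_distance(normalize(x), r) in the current version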

1.2 Experiment and deal with the number of Fourier Features


After removing the influence of the R parameter, it became possible to experiment on the parameter that determines the number of Fourier Features.

It turns out that even with 2 conditional and non-conditional features, the algorithm begins to be able to test direct linear relations correctly, but it is hard for it to detect conditional dependence.

With 5 conditional and non-conditional features, it begins to be able to detect more complicated relations, and can detect direct relations accurately.

With 25 conditional features, the algorithm can detect linear dependence accurately, and beyond that it does not improve much even with 500 conditional features and 50 non-conditional features (at this setting, the running time is much longer than before, even with the parameter R removed).

In short, increasing the number of features has limited influence on the accuracy of the algorithm. It benefits the algorithm while the number is small, but once it passes some threshold (25 and 5 in this case), the benefit stops. The algorithm can always correctly detect linear relations, but makes errors on non-linear relations no matter how large we set the number of features.



Above were figures (not reproduced here) showing examples of the non-linear inverted-V structure that the algorithm can never detect correctly.

1.3 Integrate RCoT algorithm into Because module


The Pull Request has been created and merged into the main branch of Because.

Basically, it adds a new method for dependence testing and enables the independence test program to use the RCoT algorithm.

2 DirectLiNGAM


2.1 Improve Direct Test program


In order to test the direction detection ability of DirectLiNGAM and to compare different methods, it is important to create many conditions and examples that probe the boundaries of their ability.

The Direct Test Dataset now has an SEM (listing not reproduced here), which includes these cases:
  • linear and non-linear relations
  • normal and non-normal noises
  • large and small noises
  • monotonic and non-monotonic relations
Further cases can be added:
  • more complex relations between more than 2 variables
  • larger and smaller noises, to find the detection boundary

The direct test is run in both directions for every relation, to test the ability to detect direction.

2.2 Implement DirectLiNGAM non-linear variant


The original DirectLiNGAM method only uses linear regression, so it cannot detect non-linear relationships. For example, if the actual relation is a square function, the residual is highly correlated with the variable and the direction cannot be detected.

Now, if we use a non-linear regression method, it may be possible to extend the LiNGAM algorithm to the non-linear case.

There are many possible non-linear regression methods; for now we just use the SVM method to test the possibility. More methods and experiments are needed.




The code and the results were shown above (not reproduced here). The code is very straightforward: we run a regression first, get the residual, and run RCoT on the variable and the residual to test the dependency between them. If RCoT returns a p-value close to 0, it has detected a strong dependency between them, meaning the current direction should be wrong, under the assumption that the relation has the form y = f(x) + e, where f(x) is a non-linear function. If the p-value is not close to 0, the residual and the variable could be independent, and the direction could be correct.
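
A hedged reconstruction of the idea (the original code screenshot is omitted; rcot_pvalue is a stand-in for the RCoT test integrated into Because):

    # Regress y on x, take the residual, and test the residual's
    # dependence on x with RCoT. A p-value near 0 means the residual
    # depends on x, so the direction x -> y is likely wrong.
    from sklearn.svm import SVR

    def direction_pvalue(x, y):
        """p-value for the direction x -> y under y = f(x) + e."""
        model = SVR(kernel="rbf").fit(x.reshape(-1, 1), y)
        residual = y - model.predict(x.reshape(-1, 1))
        return rcot_pvalue(x, residual)  # stand-in for the Because RCoT test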

From the preliminary experiments we can observe that, for these simple examples (direct relations that fit the y = f(x) + e assumption we make), the algorithm detects the direction correctly for all kinds of non-linear and linear examples, except for 3 cases:
  • N and N3: it may be too hard for SVM to regress N3 = 1/N, since it grows extremely large around 0.
  • IVA and IVB: both variables have normal noise (why normal noise fails should be looked into in the future; I don't fully understand it yet either).
  • EXP and EXP2: the Hyperbolic Tangent function or the exponential noise may be the reason. (Additional experiments were run on these, and this time the algorithm detected correctly. An additional relation with an exponential distribution was also set up, and it turns out the algorithm can detect direction with exponential noise.)
Further experiments should be done to test:
  • more non-linear regression methods
  • more non-linear relations
  • more complex relations consisting of multiple variables
Since the data generating process and the RCoT algorithm are both random, it is also important to run multiple times on different datasets with the same settings to eliminate the randomness. A thorough test program is under development.

3 Bug Report


When I tried to compare the RCoT algorithm and the traditional Prob method on a dependency test, I encountered an error (message not reproduced here).

It turns out that when conditioning on two variables, the Subspace gives 0 data.

I tried looking into the filtering function, but I don't actually understand how it works, so I don't know why this happened.

4 Plan for Next Week

  • Develop the direct test program
  • Run thorough experiments on the different methods mentioned above
  • Look into the problems mentioned above, such as the normal noise issue

Friday, June 10, 2022

Week 3: RCoT and LiNGAM

 DirectLiNGAM

I began reading the original DirectLiNGAM paper; below is my summary (a sketch of the procedure follows the list):

  • DirectLiNGAM needs no initial guess or hyperparameters, and is guaranteed to converge within a fixed, small number of steps if the data fits the model.
  • DirectLiNGAM first finds an exogenous variable, then removes it from the model by eliminating all of its effects on the other variables. This leaves a new model over the residuals; we repeat this step until the last variable. At the end we have the estimated causal relations, and we can use least-squares estimation (LSE) to compute the causal effects between variables.
  • To find an exogenous variable, we run LSE and test the independence of every variable against its residuals from LSE. This step requires an independence test and is very time consuming.
  • With prior knowledge, we can accelerate the algorithm by finding the exogenous variable directly or more easily.
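
A hedged sketch of the loop described above, where most_independent stands in for the independence measure used in the paper:

    # Estimate a causal ordering by repeatedly extracting the most
    # exogenous-looking variable and regressing out its linear effect.
    import numpy as np

    def direct_lingam_order(X):
        X = X.copy()
        remaining = list(range(X.shape[1]))
        order = []
        while remaining:
            x = most_independent(X, remaining)  # exogenous candidate
            order.append(x)
            remaining.remove(x)
            for j in remaining:
                # Remove x's linear effect, leaving the residual.
                b = np.cov(X[:, j], X[:, x])[0, 1] / np.var(X[:, x])
                X[:, j] -= b * X[:, x]
        return order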

Continuing Experiments on RCoT


Experiment results are recorded in a Google doc.

Bug Fix


While running experiments on non-linear relationships, a bug occurred: the lgb4 algorithm may produce complex numbers in an intermediate step and crash the program. I fixed it by finding the original Python code for lgb4 and updating the related code.

Searching for the Detection Boundary between Linear and Non-linear


The goal of the experiments recorded above is to find what kinds of causal relations RCoT can detect, and what kinds it cannot.

The conclusions of the experiments are:
  • RCoT can easily detect linear causal relationships and correctly classify whether the relation is dependent or independent.
  • Adding even a mild non-linear relationship like a square term causes errors on difficult examples.
  • More complex non-linear relationships cause more errors.
  • For the common cause structure, it is interesting that a relation like y = abs(x) ** 1.5 can be detected, while y = x ** 2 cannot.

Evaluate the Effect of Parameter R


In the RCoT algorithm, the parameter R indicates how many samples are used for the Fourier Features. Intuitively, increasing R should increase the running time but also the accuracy.

I modified the code to make R an adjustable parameter and ran some experiments on it.

However, the experiment results show that:
  • The running time is highly influenced by the value of R, and the relationship is almost O(R^2). One conditional dependency test with R=100 takes less than 1 second, one test with R=500 takes roughly 6-7 s, one with R=1000 roughly 25 s, and one with R=2000 roughly 110 s. It is therefore almost impossible to use all samples when the sample size is above 10K.
  • Changing the R value does not influence the test results much. For non-linear relationships, even when I increase R to 3000, the RCoT algorithm still cannot correctly classify whether a conditional relation is dependent. For linear relationships, even when I decrease R to 10, the RCoT algorithm still gives correct results for all tests, and the margin does not decrease significantly.

Direct Test


I lightly modified the code provided for testing the direction algorithm and ran some simple tests on the current Pairwise LiNGAM algorithm (Hyperbolic Tangent (HT) variant).

The results did not turn out well, even when I increased the noise to make the relationship less deterministic.




The algorithm cannot give the correct results for the A, B, C and the N, N2 relationships, which should be very simple examples.

Plan for Next Week


  • Integrate RCoT into ProbSpace fully, write appropriate comments, and make a pull request for it.
  • Look into the LiNGAM algorithm, read related papers, examine why it doesn't work, and find alternative algorithms.

Friday, June 3, 2022

Week 2: Integrate and experiment with RCoT algorithm

 RCoT algorithm


Learning RCoT algorithm


Continuing the task from last week, I finished reading the RCoT and RIT papers and now have a rough understanding of the algorithm. On Tuesday, Roger also gave me a lecture on RKHS and RCoT, which built an intuitive understanding of these concepts. Though the mathematical proofs are still a little difficult for me to understand, the whole process began to make sense.

I also studied the RCoT algorithm by stepping through the code in debug mode, which is very useful for understanding what the algorithm actually does.

Integrate RCoT algorithm


After learning the algorithm, I integrated the code into the Because module and completed the test program for RCoT.

It is easy to switch between the traditional probability method and the RCoT algorithm. A few things should be noted:
  • I fixed a bug in the RCoT algorithm, which used only the first variable in the list of conditional variables.
  • I modified the function normalize in normalize.py, making it clearer and removing the need for an extra matrix transpose after normalization.
  • I changed one line in the RCoT algorithm from L = sc.cholesky((Czz + (np.eye(num_f) * 1e-10)), lower=True) to L = sc.cholesky((Czz + (np.eye(num_f) * 1e-6)), lower=True) to avoid an error from scipy saying that Czz is not positive definite and cannot be decomposed.
  • I changed the default value of the parameter num_f (number of features for the conditioning set) from 5 to 100, which matches the original R code. This modification improves the algorithm's accuracy without increasing the running time much.
  • I chose .99 as the threshold to determine whether two variables are dependent.

Experiments with RCoT algorithm


Some experiment results are recorded in the doc below (at the time of writing, the doc uses .95 as the threshold for RCoT; I haven't added new records since Thursday's meeting):
https://docs.google.com/document/d/12C91ALhd9XIxHtONafTw8MlbwRTQdfrrYWMOIZIA75E/edit?usp=sharing

After Thursday's meeting, I ran a lot of extra experiments, and a few things from them should be noted:
  • I found the original R code and ran the RCoT algorithm on the same dataset using the two implementations. Since the algorithm is random, and the same random seed does not produce the same results in R and Python, it is impossible to get exactly the same numbers. Therefore, I ran the same query, like dependence(A, C, [B, D]), multiple times in both R and Python to compare the results. It turns out the two implementations match, with no big differences between them. Since the p-value is itself a random variable, I would say the two implementations provide the same results.
  • After I changed num_f from 5 to 100, the RCoT algorithm now easily detects dependence and independence correctly on simple linear relations. On data generated from complex non-linear relations, the algorithm may give wrong results, especially on examples like dependence(A, C, [B, D]) and dependence(B, D, [A, C]). Since the original R code also makes mistakes on these examples, I think they are genuinely difficult examples for the RCoT algorithm.
  • The randomness in the RCoT algorithm is very strong. Since the p-value is uniformly distributed when the null hypothesis is true, and the algorithm itself is random, it sometimes gives incorrect or subtle results (around 0.5), and the results from different runs can vary a lot.
  • From the experiments, setting the threshold to .99 and translating the results with the formula (1-p) ** log(0.5, 0.99) always provides satisfactory results (see the sketch below). Given a strongly dependent pair of variables, the RCoT algorithm always returns a p-value of 0, or less than 0.01, while for independent pairs the result varies from 0 to 1. With this formula, a p-value in [0.05, 1] is mapped to almost 0, and a p-value in [0, 0.001] is mapped to almost 1. Examples in [0.001, 0.05] can be treated as subtle cases, to be given an extra test if needed.
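
A small sketch of the translation formula described above (my illustration; the function name is not from the module):

    # Map an RCoT p-value to a dependence score in [0, 1]; the .99
    # threshold (p = 0.01) maps exactly to a score of 0.5.
    import math

    def dependence_score(p):
        return (1 - p) ** math.log(0.5, 0.99)

    dependence_score(0.01)    # = 0.5, right at the threshold
    dependence_score(0.0001)  # ~0.99, strong dependence
    dependence_score(0.05)    # ~0.03, treated as independent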

LPB and Kolmogorov-Smirnov


After the experiments on RCoT, I also looked into the LPB code and learned what that algorithm is trying to achieve. I also learned about the Kolmogorov-Smirnov test, which is used in the original probability method.

Other Stuff


I spent some time finishing the company courses Harassment Prevention Training and CDA-RSG Cyber Defense Onboarding Curriculum.

Plan for next week


  • After Roger fixes the problem in the probability method, I will run thorough experiments on both methods and compare them.
  • Evaluate LPB as an alternative to Kolmogorov-Smirnov in our current independence test.
  • Understand the DirectLiNGAM algorithm and our Causal Direction Conjecture.

Week 10: Sensitive Feature in RCoT

 1 Sensitive Feature in RCoT The need for adding a parameter sensitive into RCoT algorithm is that, for data from reality, the variables alw...