Friday, June 17, 2022

Week 4: Finish RCoT and Begin with DirectLiNGAM

 1 RCoT algorithm


1.1 Experiment and deal with R parameter


In the last week, I did some experiments on R parameter, and the results were fully against our intuition. Therefore, in this week, my first job is to examine why this results occurred.

It turns out that the only use of the parameter R is to determine how many samples to use to calculate the median of distances between all samples. When R is set to all samples, we give the actual median of distances. When R is less than the total number of samples, we give an approximation of median based on R samples. The median of distances will be used to determine the scale of parameter W in Fourier Random Features. Larger distances result in smaller W, which increases the frequency of the cosine function to generate more fitted Fourier Features.

The problem behind this is that, in fact, we will normalize every variable at first. Therefore, the median of distances is always close to 1. In experiments, the distances vary from 0.8~1.2. Then there are no benefit for us to set larger R, which will not improve the approximation of median of distances but only increase the running time.

What's more, I could also directly set the approximation to 1. In experiments, this action didn't impact the results but improve the running time a lot. In fact, the part of calculating distances is the most time consuming part in the algorithm. When I remove it from the algorithm, the algorithm's running time decreases dramatically.

In the current version of RCoT algorithm, I set approximation to 1.

1.2 Experiment and deal with the number of Fourier Features


After remove the influence of R parameter, it is possible to experiment on the parameter which determine the number of Fourier Features.

It turns out that even with 2 conditional and non-conditional features, the algorithm begins to have the ability to test direct linear relation correctly. But it is hard to detect conditional dependence.

With 5 conditional and non-conditional features, it begin to have the ability to detect more complicated relation, and can detect direct relation accurately.

When set conditional features to 25, the algorithm can detect linear dependent accurately, and from that, the algorithm won't improve much even we set conditional features to 500 and non-conditional features to 50 (in this setting, the running time is much longer than before we have not remove parameter R). 

It turns out that increase the number features has limited influence on the accuracy of the algorithm. It will benefit the algorithm when the number is small at first, but when it is over some threshold (25 and 5 in this case), the benefit stops. The algorithm can always correctly detect linear relation, but has errors on non-linear relation no matter how large we set the number of features.



Above are the examples of non-linear inverted V structure that the algorithm can never detect correctly.

1.3 Integrate RCoT algorithm into Because module


The Pull Request has been created and merged into the main branch of Because.

Basically it adds a new method for dependence test and enable the independence test program to test on RCoT algorithm.

2 DirectLiNGAM


2.1 Improve Direct Test program


In order to test the direct detection ability of the DirectLiNGAM and to compare different methods., it is important to create a lot of conditions and examples to test the boundary of their ability.

Now the Direct Test Dataset has SEM:


Which include cases:
  • linear and non-linear relations
  • normal and non-normal noises
  • large and small noises
  • monotonically and non-monotonically relations
Further cases can be added:
  • More complex relation between more than 2 variables
  • larger and smaller noise to find the detect boundary

Direct Test is done on both direction for every relation to test the ability of detect direction.

2.2 Implement DirectLiNGAM non-linear variant


In the original DirectLiNGAM method, it only uses a linear regression, so it doesn't have the ability to detect non-linear relationship. For example, if the actual relation is a square function, then the residual is highly correlated with variables and the direction can not be detected.

Now, if we use non-linear regression method, then it is possible that we can extend the LiNGAM algorithm into non-linear condition.

The choice of non-linear regression method is a lot, and now we just use SVM method to test the possibility. More methods and experiments are needed to be done.




Above are the code and the results. The code is very straight forward. We do a regression first, get the residual and run RCoT on variable and residual to test the dependency between them. If RCoT return a p-value close to 0, then it detects a strong dependency between them, which means the current direction should be wrong if we assume that the relation is like y = f(x) + e, where f(x) is a non-linear function. If p-value are not close to 0, then the residual and the variable could be independence, and the direction could be true.

From the preliminary experiments we can observe that for these simple examples (just direct relation and fit the assumption we make about the relation, which is y = f(x) + e), the algorithm can detect direction correctly for all kinds of non-linear and linear examples, except for 3 cases:
  • N and N3: it may be too hard for SVM to regression on N3 = 1/N since it goes too large around 0.
  • IVA and IVB: it is normal noise for both variables (the reason why normal noise can not work should be looked into in the future, I also don't quite understand much)
  • EXP and EXP2: Hyperbolic Tangent function or exponential noise may be the reason. (additional experiments have done on these and in this time the algorithm detect correctly. Additional relation has set for exponential distribution and turns out the algorithm can detect direction with exponential noise.)
Further experiments should be done on it to test:
  • more non-linear regression method
  • more non-linear relation
  • more complex relation consisting of multiple variables
Since the data generating process and RCoT algorithm have randomness, it is also important to run multiples times on different dataset with the same setting to eliminate the randomness. A through test program is under development.

3 Bug Report


When I try to compare RCoT algorithm and traditional prob method on dependency test, I encountered an error:


It turns out that when running conditioning on two variables, the Subspace gives 0 data:


I tried looking into filtering function but I don't actually understand how it worked.. So I don't know why it happened.

4 Plan for Next Week

  • Develop the direct test program
  • Do experiments on different method mentioned above thoroughly
  • Look into the problem mentioned above, like the normal noise

No comments:

Post a Comment

Week 10: Sensitive Feature in RCoT

 1 Sensitive Feature in RCoT The need for adding a parameter sensitive into RCoT algorithm is that, for data from reality, the variables alw...