GSOC2014 - Implementing ichol and ilu functions

Monday, 18 August 2014

Firm pencils down.

During this week I have been reorganizing all the code, docs and tests in a better way for integrating into Octave. As Rik kindly suggested, I decided to organize things this way:

Inside libinterp/dldfcn directory I have created two files, __ichol__.cc and __ilu__.cc

Within those files there are the dld functions that implements the each of the algorithms. They are ment to be built-in functions and follows the __foo__.cc naming convention.

* __ilu__.cc: contains __ilu0__() , __iluc__() and __ilutp__()

* __ichol__.cc: contains __ichol0__() and __icholt__().

I have moved all the tests from .cc files to .m scripts so no tests are performed for built-in functions.

The code is ready to be pulled from my repo to be reviewed :

https://edu159@bitbucket.org/edu159/octave-edu159

Practical examples: preconditioned conjugate gradient

It is interesting to show how preconditioning techniques can improve the convergency of some iterative solvers. In this case I am running a Matlab example using the Poisson matrix (that is positive definite) obtained with gallery() function. The scritp:

1. In this first case the convergency of pcg using ICHOL(0) algorithm, modified ICHOL(0) algorithm and without preconditioning are compared.

N = 100;

A = gallery ('Poisson', N);

b = ones (size (A, 1), 1);

tol = 1e-6; maxit = 100;

[x0, fl0, rr0, it0, rv0] = pcg (A, b, tol, maxit);

L1 = ichol (A);

[x1, fl1, rr1, it1, rv1] = pcg(A, b, tol, maxit, L1, L1');

opts.type = 'nofill'; opts.michol = 'on';

L2 = ichol (A, opts);

e = ones (size (A, 2), 1);

norm(A * e - L2 * (L2' * e))

[x2, fl2, rr2, it2, rv2] = pcg (A, b, tol, maxit, L2, L2');

semilogy (0:maxit, rv0 ./ norm (b), 'b.');

hold on;

semilogy (0:it1, rv1 ./ norm(b), 'r.');

semilogy (0:it2, rv2 ./ norm(b), 'k.');

xlabel ('iterations');

ylabel ('error');

legend ('No Preconditioner', 'IC(0)', 'MIC(0)');

Octave

Matlab

2. In this second part of the script what is compared is the convergency of ICHOLT algorithm with different values of droptol.

L3 = ichol(A, struct('type', 'ict', 'droptol', 1e-1));

[x3, fl3, rr3, it3, rv3] = pcg (A, b, tol, maxit, L3, L3');

L4 = ichol (A, struct ('type', 'ict', 'droptol', 1e-2));

[x4, fl4, rr4, it4, rv4] = pcg (A, b, tol, maxit, L4, L4');

L5 = ichol (A, struct ('type', 'ict', 'droptol', 1e-3));

[x5, fl5, rr5, it5, rv5] = pcg (A, b, tol, maxit, L5, L5');

figure; semilogy (0:maxit, rv0 ./ norm (b), 'b-', 'linewidth', 2);

hold on;

semilogy (0:it3, rv3 ./ norm(b), 'b-.', 'linewidth', 2);

semilogy (0:it4, rv4 ./ norm(b), 'b--', 'linewidth', 2);

semilogy (0:it5, rv5 ./ norm(b), 'b:', 'linewidth', 2);

ylabel ('error');

xlabel ('iterations');

legend ('No Preconditioner', 'ICT(1e-1)', 'ICT(1e-2)', ...

'ICT(1e-3)', 'Location', 'SouthEast');

Octave

Matlab

As it can be seen Octave plots are the same as Matlab's ones. Both lead to a decrease in the number of steps upt to convergence of the pcg method. ILU algorithms could also have been used here, but due to the simetry of the problem matrix ICHOL is faster.

Regards,

Eduardo

Monday, 11 August 2014

Soft pencils down.

It's been quite long since I posted here due to some personal situations. Anyway to sum up: I have finished ilu and ichol functions as I have planned in the beginning with great results.

Things done after mid-term evaluation:

Implementing ICHOLT and ICHOL0 algorithms.
Fixing several bugs in ILU algorithms and introducing some enhancements for big sparse matrices with verly low densities.

The files involved in ichol, within the repository, are:

src/icholt.cc
src/ichol0.cc
ichol.m

You can clone the code from the repo:

https://edu159@bitbucket.org/edu159/gsoc2014-edu15

Before going into the details of the algorithms' implementation, I want to point out some details about how ichol behave in MATLAB.

In the real case the matrix must be symetric positive definite. In the complex case the input matrix must be hermitian. That means: diagonal elements of the input and output matrix have to be non-zero, positive and real values. So that, at each iteration those conditions have to be fullfilled.
If ichol is called just as L = ichol (A), Matlab ignores complex numbers and only work with their real part. Using L = ichol (A, setup) call, complex numbers are considered. Seriusly I do not understand why they do that and I have not followed that behaviour. Anyway if to be 100% compatible I must change that, it would be only a line of code extra.

Details of implementation

-->src/ichol0.cc

In this file is located the implementation of ICHOL(0) algorithm. The zero-pattern of the output matrix is the same as the input one so it is known from the beginning how much memory is needed to be allocated. The milu = ['on'|'off'] parameter indicates whether the dropped elements are added to the pivot or not (that keeps the colum sumation).

I will show two examples, one that corresponds to a big matrix with a very low density and the one that used Kai last year in his blog.

Example 1:

A = gallery ('poisson', 500);
size (A)
ans =
          250000   250000
tic; L = ichol (A); toc;
Elapsed time is 0.031718 seconds.
density = nnz (A) / (size (A)(1))^2
density =    1.9968e-05
norm (A - L*L', 'fro') / norm (A, 'fro')
ans = 0.0924207846384523
norm(A-(L*L').*spones(A),'fro')./norm(A,'fro')
ans =    2.28617974245061e-17

It can be seen that the product L*L' is quite different from A, but the product L*L' will match A on its pattern (that is expected for the ICHOL(0) algorithm. The execution time is just given to give an idea of how fast the code is. It is executed in a i7 2.4GHz.

Example 2:

This example is taken from that post, written by Kai the past year. He faced problems with the michol option, obtaining different results from Matlab.

input:

A = [ 0.37, -0.05, -0.05, -0.07;

-0.05, 0.116, 0.0, -0.05;

-0.05, 0.0, 0.116, -0.05;

-0.07, -0.05, -0.05, 0.202];

A = sparse (A);
opts.michol = 'on';
L = ichol (A, opts);

Octave:
ans =

   0.60828   0.00000   0.00000   0.00000
-0.08220   0.32014   0.00000   0.00000
-0.08220   0.00000   0.32014   0.00000
-0.11508 -0.18573 -0.18573   0.34607

Matlab:
ans =

    0.6083         0         0         0
   -0.0822    0.3201         0         0
   -0.0822         0    0.3201         0
   -0.1151   -0.1857   -0.1857    0.3461

Works fine.

-->src/icholt.cc

This file contains the implementation of ICHOLT algorithm. In this case the final structure of the output matrix is unknown. Therefore, a policy should be adopted for allocating memory. After trying different ways of doing that I end up using that one:

// max_len is the maximun length of ridx and data arrays for the output sparse matrix.
max_len = sm.nnz ();
max_len += (0.1 * max_len) > n ? 0.1 * max_len : n;

What is done here is just to increment 10% of the actual size of the ridx and data internal arrays of the output sparse matrix. But only if that amount is larger than the dimension of the input matrix (n). In other case the increment in size is just n. That policy seems to work very well in every case I tested and do not slow down the process at all due to reallocations.

Example 3:

icholt accepts a parameter for controling the sparsity of the ouput matrix called droptol. If droptol = 0 then the complete factorization takes place. If we increase that value the output matrix will become more sparse as more elements will be dropped. Taking the same matrix than in example 1:

A = gallery ('poisson', 500);
opts.type= 'ict'

% Complete factorization
opts.droptol = 0;
tic;L = ichol(A, opts);toc;
Elapsed time is 46.0734 seconds.
norm (A - L*L', 'fro') / norm (A, 'fro')
ans =    7.8595e-16

% droptol = 1e-2
opts.droptol=1e-2
tic;L = ichol(A, opts);toc;
Elapsed time is 0.0650802 seconds.
norm (A - L*L', 'fro') / norm (A, 'fro')
ans = 0.016734

% droptol = 1e-3
opts.droptol=1e-3
tic;L = ichol(A, opts);toc;
Elapsed time is 0.183416 seconds.
norm (A - L*L', 'fro') / norm (A, 'fro')
ans = 0.0021773

% droptol = 1e-4
opts.droptol=1e-4
tic;L = ichol(A, opts);toc;
Elapsed time is 0.589693 seconds.
norm (A - L*L', 'fro') / norm (A, 'fro')
ans =    2.4820e-04

As it can be seen, the higher the droptol parameter is, the sparser the matrix become. That lead to less execution times but on the other hand a higher error is obtained in the factorization. The complete factorization obviously have practically no error. Cool.

Location of source files inside Octave core

Now I've finished with the development of the algorithms, the final step is to integrate them into Octave core. For doing so I will create a subrepo of the default Octave repository and add the files. I have chosen the location for the functions looking at the last year repository Kai set.

Location:
libinterp/dldfcn: ilutp.cc ilu0.cc iluc.cc ichol0.cc icholt.cc
scripts/sparse: ilu.m ichol.m

That is just a sugestion and should be revised and accepted by the maintainers.

Future contributions

There is a week left that I want to use it to start (and hopefully finish) the development of sprandsym extra parameters that Matlab have but Octave does not. As I submitted in the past a changeset for a similar functionality in sprand and sprandn, it will be much easier to implement for me.

Also I am interested in developing some sparse linear solvers like minres and lsqr that Octave lacks. They are tightly related to the preconditioners I have been working on, and would be nice if they could be assigned to me for developing them.

Regards,

Eduardo

Thursday, 3 July 2014

ILU function - implementation details

Hi all,

The purpose of this post is to explain the details behind the implementation of the ilu function, my work during this first period of GSOC program. The files involved are:

src/ilu0.cc
src/iluc.cc
src/ilutp.cc
ilu.m

You can pull the repo using mercurial from:

https://edu159@bitbucket.org/edu159/gsoc2014-edu159

--> src/ilu0.cc

This file contains the implementation of ILU0 algorithm, the easiest one. In this version the zero-pattern of the input matrix is not modified so it is known the final structure of the output matrix. That simplifies things. For the milu=['col'|'row'] option, it is needed to implement both the IKJ and JKI versions of the algorithm to efficiently compute the compensation of the diagonal element with dropped elements. I managed to do both in the same function, just changing a few lines of code. Lets use Matlab's documentation example:

Example 1:

A = gallery('neumann', 1600) + speye(1600);
setup.type = 'nofill';
setup.milu = 'row';
[L,U] = ilu(A,setup);
e = ones(size(A,2),1);
norm(A*e-L*U*e)

ans =    1.4660e-14 (Very low, good)

The following little function can be used, when milu = ['row'|'col'] to check that all the columns/rows preserve its sumation (not only with ilu0 but with iluc and ilutp). Just run it after calling ilu in any form.

benchmark/check_sums.m (It can be found here in the repo)

function check_sums (A, L, U, milu)
b = L * U;
dim = 1;
if (milu == 'row')
    dim = 2;
endif
c = sum (b, dim);
d = sum (A, dim);
v = abs (c - d);
num_zeros = length (find (v > sqrt (eps)));
printf('Number of rows-columns not well compensated: %d\n', num_zeros);
if (num_zeros > 0)
    v (find (v > sqrt (eps)))
endif
endfunction

NOTE: I have found in Matlab 2013a that the row and col sumation does not work well always, and the row and column compensation fails for ilutp and iluc. I will show an example later.

--> src/ilutp.cc

This algorithm is the trickiest one due to pivoting, and has caused me more than one headache during its coding because it is not well described in Saad's book, just a few indications. I have found here several bugs in Matlab's 2013a implementation that make me a bit reticent about trusting results correctness.

Error 1

A = sparse ([3 4 4 3 -1.1; 2 0 8 9 2.2; 2 1 9 9 1; 3.2 10 2.3 2 4.5; 9 2 6 2 1]);
setup =
       type: 'ilutp'
       milu: 'col'
       droptol: 0.2000
       thresh: 0
     udiag: 0

>> [L, U, P] = ilu(a,setup);

sum(A(:, 2))    => 17
sum(L*U(:, 2) => 14.4857

Clearly the sum of the second column is not preserved :/.

Error 2

A = sparse([3 1.5 0 0 1.44; 0 1 0 0 -2;0 0 8 0 0; 0 2 0 2 -4.5; 0 -1 0 0 1]);

setup =
       type: 'ilutp'
       milu: 'col'
       droptol: 0.5000
       thresh: 0.2000
     udiag: 0

>> [L, U, P] = ilu(a,setup);

The output is:

U =
    3.0000    1.5000 0        0         0
         0         0         0        0         0
         0         0     8.0000    0         0
         0         0         0    2.0000    Inf
         0         0         0         0       -Inf

L =
     1     0     0     0     0
     0     1     0     0     0
     0     0     1     0     0
     0   Inf    0     1     0
     0     0     0     0     1

What are those Inf doing there? Clearly the are not detecting correctly 0 pivots.

Error 3

A= sparse([3 1 0 0 4; 3 1 0 0 -2;0 0 8 0 0; 0 4 0 4 -4.5; 0 -1 0 0 1]);

setup =
       type: 'ilutp'
       milu: 'row'
       droptol: 0
       thresh: 0
       udiag: 0

>> [L, U, P] = ilu(a,setup);

Output:

L =

    1.0000         0         0         0         0
    1.0000         0         0         0         0
         0           0    1.0000     0         0
         0    1.0000         0    1.0000     0
         0   -0.2500         0         0    1.0000

That 0 cannot be there. By construction L has to be a lower unit triangular matrix and that zero element spoils the L*U product. Again WRONG.

I have encountered more issues when testing Matlab using some testing matrices with 2000x2000 and 5000x5000 dimensions. With them my output is not the same as Matlab's (nnz of L and U are different from Matlab's), but taking into account the errors I found, I trust the most my version and not theirs. BTW in my case the rows and columns sums were preserved, theirs not. Obviously I have checked that those examples behave correctly in my code detecting 0 pivots

A similar example can be run as with ilu0:

Example 2:

A = gallery('neumann', 1600) + speye(1600);
setup.droptol = 1e-2;
setup.type = 'ilutp';
setup.thresh = 0.5;
setup.milu = 'row';
[L,U] = ilu(A,setup);
e = ones(size(A,2),1);
norm(A*e-L*U*e)

ans = 2.5170e-14 (Nice)

Pivoting: It worths to mention how pivoting is performed in that algorithm. When milu = 'row' the U matrix is column permuted (IKJ version used) but when milu=['off',|'col'] L is the permuted one and it is row permuted (JKI version used). Both algorithms share a lot of similarities and the code is designed to work in one version or another depending on milu option. That way code duplication is avoided. That was one of my primary fears when I realized that both versions were needed to attain Matlab compatibility.

--> src/iluc.cc

This is the file containing the crout version of ILU. This version is an enhancement of pure IKJ and JKI variants of gaussian eliminations. At iteration k the k:n section of k column and k row is computed. The enhancement is noticed in the execution time for the same input matrix. The following example is a comparison between my versions of ilutp and iluc:

For a 2000x2000 matrix ( I have not included this matrix in the repository due to it size):

With setup.droptol = 0.01 and setup.milu = 'off'.

Octave:
ilutp --> 12.3458 seconds
iluc    --> 6.31089 seconds

Matlab:
ilutp --> 12.868686 seconds
iluc    --> 7.498106 seconds

That is just to illustrate the performance of different versions.

NOTE: In iluc the dropping strategy for elements in U (stored as CRS) is to drop the element aij if (abs(aij) < droptol * norm(A(i, :))). For the L part (stored as CCS) aij is dropped if (abs(aij) < droptol * norm(A(:, j))).

Finally the numeric example:

Example 3:

A = gallery('neumann', 1600) + speye(1600);
setup.droptol = 1e-2;
setup.type = 'crout';
setup.milu = 'row';
[L,U] = ilu(A,setup);
e = ones(size(A,2),1);
norm(A*e-L*U*e)

ans = 2.5212e-14 (Nice)

That is all I wanted to show till now. I have written tests for the functions and adapted several ones from Kai last year work. However I want to add some more function-specific ones for validating results. The last thing pending is to place the source files inside the Octave source tree. I am not totally sure where they should go. On the other hand I have already started to work on ichol function and next week I'll report about my progress.

I know the post is a bit long but I think it is needed due to the poor verbosity I had through the blog during this period. I am aware of that (Jordi pointed me out a few days ago) and I will take into account for the following weeks.

Regards,

Eduardo

Thursday, 26 June 2014

Mid-term pre-post.

Hi all,

This is a short post just to clarify my state at midterm. As I have arranged with Kai, at the beginning of the next week I will write a verbose post to explain all the details related to the development of "ilu" function. I need a couple of days to tidy up all the code and make it presentable.

The state of the code now is functional. It lacks of tests and documentation and need a bit of re-factorization I will do this weekend. I will list several relevant points about the implementation of the function.

It is programmed to output the same as Matlab.

My version is at least as fast as Matlab's, outperforming by small amounts of time in large output cases.

I found a bug. At least on Matlab2013a regarding "ilutp" option when using milu="col". The col sum is not preserved in at least one case so I found in my testing cases that my function does not output the same. I will explain with more detail that issue next week.

In the end I am happy with the performance of the function and that I'm meeting the time-line proposed at the beginning of GSOC (with a few days of delay maybe).

Future work: The second part of the GSOC I will implement "ichol" function. There are several points to discuss about its development with Kai because he implemented the code last year but there were some kind of issues with the licensing of it. This period is a bit longer and I will have no classes nor exams. Because of that, if I have time remaining at the end, I can start implementing "minres" or "lsqr" algorithms that Octave lacks of too. So there will be no time wasted.

See you,

Eduardo

Sunday, 15 June 2014

ILUTP ready, steady, .... almost go.

It has been a bit more than two week since my last posting. I just wanted something solid enough to show before doing it again :). Because one image is better than a 1000 words. This is the state of my project till now:

In green color is what it is finished and working (obvious...) and in pink what it is partially finished. Red stuff is not working at all.

ILUTP implementation:

As I did with ilu0 function, I started the implementation of ilutp using the IKJ variant of the Gaussian elimination as prof. Saad does in his book. For working efficiently with CCS(column compressed storage) structure of sparse matrices it is only needed a transposition before and after the process. So I came up with a working version without pivoting using this strategy a week before this post (src/ilutp_crs.cc file in the repository). All OK till that point. Well ... it was not all OK. When pivoting comes into play, all get messy. It is not feasible to do row-pivoting efficiently after transposing the matrix and using the CCS structure with the IKJ algorithm. What I realized is that Matlab, by default, implements for milu="col" and milu="off" options a JKI variant of the algorithm. This way row- pivoting can be used and no transposition is needed using the CCS structure. So for the whole last week I had to almost rewrite entirely the function to implement it in the JKI way. That was a serious delay because I was not familiar with that variant. On the other hand I also got to the conclusion that milu="row" option demands a IKJ implementation with column pivoting. It can be infer from the documentation:

"....When SETUP.milu == 'row', U is a column permuted upper triangular factor. Otherwise, L is a row-permuted unit lower triangular factor."

Column pivoting means that if CCS is used as storage structure (Octave does), the strategy must be to [transpose - perform IKJ algorithm with column pivoting - transpose again]. So it is needed another implementation. That is the reason milu="row" is not working with ilutp. I had no time to implement that variant with pivoting. However, I have half way traversed because of my early IKJ implementation. So I am working on it.

I am taking special care to output exactly the same as Matlab, that means figuring out some nuances of their implementation that can only be understood after trial and error experimentation with their ilu version. I tried to test intensively the function and for my test cases my version outputs the same as Matlab's.

I have integrated the ilu0 and ilutp function inside a m-file wrapper called ilu.m located in the root directory of the repository. The file was written last year by Kai and need to be changed a bit. But for now it is OK to provide a user-friendly interface to try my functions. Use it the same way as you were in Matlab.

A quick script to test it could be:

A = sprand(100, 0.5);
setup.thresh = 0.4;
setup.droptol = 0.005;
setup.type = 'ilutp';
[L, U, P] = ilu(a, setup);

To get the code pull from here:

https://edu159@bitbucket.org/edu159/gsoc2014-edu159

Just execute make in the root directory and then open the Octave interpreter inside it too.

For the next week I am planning to finish the implementation for the milu option in both ilu0 and ilutp. (You can find the files as src/ilutp.cc and src/ilu0.cc in the project directory)

P.D: For who cares about performance ( I do), my version is a bit faster than Matlab's. You can try it for big matrices. I did, and for low values of droptol (means few terms of the matrix will be dropped), using pivoting and relatively big matrices (5000x5000) my version lasted around 200 secs and Matlab 220 secs. For a 2000x2000 one, the times were 19secs Matlab's, 13 secs mine. The numbers are just for you to get an idea. But they are good news.

See you!

Monday, 26 May 2014

Weekly post: Matlab ilutp behaviour disclosed and implemented in a m-file.

Here I am again.

This week has been more about researching than coding. I have finally been able to reproduce the output from the ilutp(ilu with threshold and pivoting) Matlab's algorithm with an m-script (named ILU_pc.m in my project's directory). The fact is that Matlab does not implement the algorithm as is described in Yousef Saad's book in a few ways. Because of that I had to do reverse engineering, testing many cases and matrices. That is the function, ugly as hell, but is just for testing purposes.

function [A, P] = ILU_pc(A, tau, thresh)

B = A;
n = length(A);
P = speye(n);
for i = 1:n
    for k = i:n
      A(k:n,k) *= thresh;
      A(k,k) /= thresh;
      [m,mi] = max(abs(A(k:n,k)))
      A(k,k) *= thresh;
      A(k:n,k) /= thresh;
      mi = mi + k -1;
      tmp = A(mi,:);
      A(mi,:) = A(k,:);
      A(k,:) = tmp;
      e = speye(n);
      e(mi,mi) = 0; e(k,mi) = 1;
      e(k,k) = 0; e(mi,k) = 1;
      P = e*P;
    endfor
    for k = 1:i-1
         if ( (A(i,k) == 0) || (abs(A(i,k)) < (tau*norm(B(:,k)))))
            A(i,k) = 0;
            continue
         endif
         A(i,k) = A(i,k) / A(k,k);
         A(i,k+1:n) = A(i,k+1:n) - A(i,k) * A(k,k+1:n);
    endfor
endfor

for i = 1:n
    for j = i+1:n
      if (abs(A(i,j)) < (tau*norm(B(:,j))))
        A(i,j) = 0;
      end
    end
end
end

The next goal to achieve is obviously to implement the function as .oct file translating this algorithm into a sparse one using Octave's internal data types.

All the testing I did was at college using their Matlab license. That delayed me because I couldn't do almost nothing in the weekend. Now I have a function that reproduce the behavior of Matlab's version I can test against it my c++ code.

See you next week!

Monday, 19 May 2014

The starting line.

As code period is starting today, I want to write a brief timeline for the first period of the GSOC here:

FIRST PERIOD

19 May-20 June: Implement ilu related functions (ilu0.cc, iluc.cc, ilutp.cc) and merge them together with ilu.m script

20-25 June: Automated test writing and documentation. Integration to mainstream octave code should be achieved here.

27 June: (Millstone 1) ilu.m is fully functional and integrated with Octave core.

Taking the idea from Kai's last year blog, I will keep track of what is already done with the following figure.

Regarding repository setup, Kai helped me to configure a subrepository using bitbucket service. At present, it only contains an outdated Octave development version just to make sure things work. For cloning:

hg clone https://edu159@bitbucket.org/edu159/octave-subrepo

However, I would not need to use this subrepo until the final integration of my code into Octave. For development purposes I have set another repository for daily work, as I am working with .oct files that compile standalone. Here is the repo you should check for my updated work.

hg clone https://edu159@bitbucket.org/edu159/gsoc2014-edu159

See you next week!