
Race Condition in Multithreaded OpenBLAS on IBM OpenPower 8 #1071

Closed
grisuthedragon opened this issue Jan 20, 2017 · 34 comments

@grisuthedragon
Contributor

grisuthedragon commented Jan 20, 2017

Following the discussion about compiling HPL on the POWER8 platform, I tried several of my codes with OpenBLAS from the development branch on an IBM OpenPower 8.
The machine is an IBM Model 8335-GTB (2x10 cores, 8-way SMT, 160 virtual cores) with CentOS 7.3, gcc 5.4, glibc 2.17 and kernel 4.8.

I compiled the current OpenBLAS development version ( ab2033f ) using

make USE_OPENMP=1 USE_THREAD=1 

and linked it against my code (Fortran, without any OpenMP-parallelized parts). The code solves a Sylvester equation AXB+CXD=F (A, B, C, D, F, X are 1024x1024 matrices) and relies on the following BLAS operations (determined using the profiling feature of OpenBLAS):

  • dswap
  • dscal
  • dnrm2
  • idamax
  • drot
  • dgemv
  • dger
  • dtrmv
  • dgemm
  • dlaswp (see below in the posts)

The code solves the equation, computing X 25 times, and checks the forward error. When I execute the code with OMP_NUM_THREADS=20 set (to avoid over-subscription of the cores), more than half of the performed computations are wrong (ferr := || X_true - X_computed || / || X_true ||):

Run:   0 	 ferr = 6.12559e-12
Run:   1 	 ferr = 6.12559e-12
Run:   2 	 ferr = 0.0187795
Run:   3 	 ferr = 0.63829
Run:   4 	 ferr = 6.12559e-12
Run:   5 	 ferr = 0.0902158
Run:   6 	 ferr = 0.0154748
Run:   7 	 ferr = 4367.12
Run:   8 	 ferr = 4.22541
Run:   9 	 ferr = 6.12559e-12
Run:  10 	 ferr = 6.12559e-12
Run:  11 	 ferr = 0.00973169
Run:  12 	 ferr = 30.6905
Run:  13 	 ferr = 6.12559e-12
Run:  14 	 ferr = 12.3574
Run:  15 	 ferr = 334.143
Run:  16 	 ferr = 574.977
Run:  17 	 ferr = 0.0585436
Run:  18 	 ferr = 258.05
Run:  19 	 ferr = 401.401
Run:  20 	 ferr = 493.521
Run:  21 	 ferr = 0.112386
Run:  22 	 ferr = 0.0393082
Run:  23 	 ferr = 0.258171
Run:  24 	 ferr = 543.353

If I run the same code with OMP_NUM_THREADS=1, or with the reference BLAS implementation and OMP_NUM_THREADS unset (to ensure that really nothing in my code depends on threading), I obtain a forward error ferr of approx. 10^-12 for all runs of the benchmark. I already checked what happens if I restrict the number of threads to 20 at compile time (make ... NUM_THREADS=20); this still yields wrong results, but more of the runs are correct. Disabling threading in OpenBLAS completely gives correct results as well.

From this observation I concluded that something in the threading goes wrong, i.e. there is a race condition.

Interestingly, if I use the PGI for OpenPower compiler suite, which ships a BLAS implementation based on OpenBLAS (I think a slightly modified one so that it can be compiled with the PGI compiler on the ppc64le architecture), the same error appears. This means the bug is not in the GNU OpenMP implementation, because PGI uses its own separate one.

Unfortunately, I do not have a minimal non-working example for this bug yet, because the code mentioned above is part of current research.

@martin-frbg
Collaborator

I wonder if valgrind/helgrind is available for power8, or if you might have access to some comparable thread debugger? I made an amateurish attempt to fix the races I came upon in #1052, but may well have missed something ifdef'd for non-x86 platforms (conversely, you may even want to try a version preceding 87c7d10 of Jan 8 in case I managed to break something for you).

@grisuthedragon
Contributor Author

grisuthedragon commented Jan 21, 2017

Helgrind is available, and I have prepared a smaller example (solving for a 256x256 matrix) for it; otherwise a run would take days. I noticed that even for this small problem I get wrong results in <10% of the cases. I compiled the current OpenBLAS again with debug symbols enabled and configured for 20 OpenMP threads, and I get the following helgrind output:

https://gist.github.com/grisuthedragon/43a36d248454fd02c52ea18aa1b2614f

(It is too long to be posted here, hence the gist link.)

The error log also shows some entries from the HDF5 library, but even examples where the data is not read via HDF5 compute wrong results.

PS: I also tried 87c7d10, but this does not change anything. Since I see the same errors in PGI's OpenBLAS (released Nov 16, 2016), the change in 87c7d10 does not influence this behavior. I also tested on a 16-core x86-64 Haswell Xeon, and everything works fine, at least as far as I can tell after 1000 runs of the code.

@martin-frbg
Collaborator

Thanks for the helgrind.log - at first glance it looks as if at least part of the problem was in libgomp, but I now see from the helgrind manual that it simply cannot keep track of OMP's threading primitives unless gcc, including libgomp, was specially compiled with the --disable-linux-futex option. Looking further, according to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55561 another option seems to be to compile everything with -fsanitize=thread to let gcc itself do the checking, but this again would need a libgomp specifically recompiled with --disable-linux-futex (that bugzilla entry seems to have last been edited around the time of gcc 4.9, so it is unclear whether your 5.4 contains any improvements over that stage). I am not sure how to proceed.

@martin-frbg
Collaborator

IFF the reduced testcase runs quickly enough, you could try commenting out the DGEMVNKERNEL and/or DGEMMKERNEL line(s) in KERNEL.POWER8 to see if the issue crept in with the more recent optimizations.

@brada4
Contributor

brada4 commented Jan 21, 2017

helgrind points a finger at interface/lapack/laswp.c.
Is it possible to edit that file so that line 82,
if (nthreads == 1) {
becomes
if (1) {
i.e. disable threading in just this one function?

The rest of the output looks like a heap of false positives, as in the gcc bugzilla...

@grisuthedragon
Contributor Author

@martin-frbg Unfortunately, the thread sanitizer is only available on x86_64 as far as I know. But I recompiled gcc with --disable-linux-futex and will do a helgrind run again. Commenting out the specialized kernels is on my to-do list as well.
@brada4 This does not change anything, but it leads me to a new idea.

In general, if I remember correctly, I ran into a similar error on a Power 7 (Big Endian) two years ago, but back then I stopped using OpenBLAS on that machine and switched to ATLAS.

@grisuthedragon
Contributor Author

grisuthedragon commented Jan 22, 2017

Turning on --disable-linux-futex slows down the computations incredibly and makes helgrind run about 10x slower. But running the original large problem without helgrind then gives the correct results (100 runs).

Then I tried the following, this time with the compiler with linux-futex enabled: I compiled OpenBLAS without LAPACK support and used Netlib LAPACK, to reduce the number of possible error sources a bit. This does not yield correct results, but I can conclude that the dlaswp function seems to be OK. On top of this setup I took a look at how the GEMM call is done and noticed that the old threading model can still be turned on. So I compiled OpenBLAS with

make USE_OPENMP=1 USE_THREAD=1 NUM_THREADS=20 NO_LAPACK=1 USE_SIMPLE_THREADED_LEVEL3=1

Now I get the correct results as well (checked for 1000 runs of the benchmark). From this point of view the problem must be somewhere inside the new GEMM threading in combination with OpenMP. I will not say GNU OpenMP, because the PGI version shows the same strange behavior.

PS: The OpenBLAS version delivered with the 2016.10 community edition of the PGI compiler suite is 0.2.18.

@martin-frbg
Collaborator

Thanks for the extensive testing. Now you seem to have pushed the problem back into generic code, when I was hoping it was something specific to the POWER platform? But if it is indeed the gemm threading, then hopefully the existing gemm checks and benchmarks are sufficient to expose the problem. (Not sure if I am able to tackle this problem though...)

@martin-frbg
Collaborator

martin-frbg commented Jan 22, 2017

Hmm. A quick check on Haswell, running helgrind on "benchmark/dgemm.goto 100 100 1" indeed yields complaints about a race between two instances of inner_thread without any locks held. (One at line 433 of level3_thread.c called from blas_thread_server, the other at line 494 of the same file, called from dgemm_thread_nn via exec_blas). And they go away when USE_SIMPLE_THREADED_LEVEL3 is set. Food for thought at least...

@brada4
Contributor

brada4 commented Jan 23, 2017

@martin-frbg the other martin's helgrind output looked as if laswp reads past where it is expected to.

@martin-frbg
Collaborator

@brada4 Not sure we can trust the original log, where helgrind had no idea what OMP was doing behind its back. Also, grisuthedragon has apparently narrowed it down to GEMM already by removing the OpenBLAS dlaswp from the equation, and a thread data race demonstrably occurs in level3_thread.c even on x86_64 when the more elaborate multithreading code is used. (It is possible that the likelihood of the race occurring depends on the implementation of YIELDING for the platform - at least I remember from #990 that doing away with sched_yield() on x86_64 seemed to increase the likelihood of lockups - hopefully now fixed through #995.)
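(Side note for anyone reading along: YIELDING is a macro from OpenBLAS's common.h that the busy-wait loops in the threading code use. Paraphrased from memory rather than quoted, the default definition is roughly

#ifndef YIELDING
#define YIELDING    sched_yield()
#endif

while some targets override it with a short run of nop instructions instead, so how eagerly a waiting thread gives up its time slice differs from platform to platform.)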

@grisuthedragon
Contributor Author

So, as promised, here is the helgrind output of the small example with --disable-linux-futex in the compiler but with the Power8 optimizations enabled:

https://gist.github.com/grisuthedragon/87df1ae18702fb30d1ee546c08085362

As in the small Haswell experiment of @martin-frbg, the inner_thread routine is involved.

Due to the disabled futex, one run takes up to 6 hours -.-

@brada4
Contributor

brada4 commented Jan 26, 2017

I looked through your log - it looks like all false positives; i.e., a one-cent optimisation might be possible by letting the threads of the next step in the (in->kernel->out) processing path start on warm data, but it does not explain the numerical mayhem.

Let's try another thing - all 0s, all 1s, checkerboard (all combinations of those passed to dgemm). Any off-pattern stripes would mean one of the 3 functions has an off-by-one error choosing strides; try something you can easily see on the screen.
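A rough sketch of such a test (hypothetical, written against the C interface; the matrix size, patterns and tolerance here are arbitrary choices of mine, not anything taken from OpenBLAS):

#include <cblas.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Fill an n x n column-major matrix with all 0s, all 1s or a checkerboard. */
static void fill(double *M, int n, int pattern)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            M[i + (size_t)j * n] = (pattern == 0) ? 0.0
                                 : (pattern == 1) ? 1.0
                                 : (double)((i + j) & 1);
}

int main(void)
{
    const int n = 256;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    double *R = malloc((size_t)n * n * sizeof(double));

    for (int pa = 0; pa < 3; pa++)
        for (int pb = 0; pb < 3; pb++) {
            fill(A, n, pa);
            fill(B, n, pb);

            /* exact reference via a naive triple loop */
            for (int j = 0; j < n; j++)
                for (int i = 0; i < n; i++) {
                    double s = 0.0;
                    for (int k = 0; k < n; k++)
                        s += A[i + (size_t)k * n] * B[k + (size_t)j * n];
                    R[i + (size_t)j * n] = s;
                }

            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, A, n, B, n, 0.0, C, n);

            /* any stripe that breaks the expected pattern shows up here */
            for (int j = 0; j < n; j++)
                for (int i = 0; i < n; i++)
                    if (fabs(C[i + (size_t)j * n] - R[i + (size_t)j * n]) > 1e-8)
                        printf("patterns (%d,%d): mismatch at (%d,%d): %g vs %g\n",
                               pa, pb, i, j, C[i + (size_t)j * n], R[i + (size_t)j * n]);
        }

    free(A); free(B); free(C); free(R);
    return 0;
}

Built against libopenblas (plus -lm), any off-pattern stripe in the output would point at a stride/addressing problem rather than a race.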

@martin-frbg
Collaborator

Wouldn't an error in stride calculation lead to rather obvious and unavoidable errors on any platform running more than a single thread, in contrast to thread data races, where "merely" the likelihood of bad access patterns increases with the number of threads?
(I have managed to reduce the number of warnings from my helgrind testcase a bit already, but further changes have only introduced lockups so far.)

@brada4
Contributor

brada4 commented Jan 26, 2017

Could be base-0 vs. base-1 numbering etc... You never know.

@grisuthedragon
Contributor Author

So guys, I started looking into the gemm scheduler, but I do not get how this thing really works. Is there any paper/preprint/whatever publication describing the "new" scheduling? I only found the papers by Goto and Robert van de Geijn about the optimization of the kernels.

@martin-frbg
Collaborator

I looked already but did not find any, nor was there any explanation to be found in the various old versions of libgoto I had archived. The gotoblas mailing list archive is gone from the utexas website and is not available at archive.org either. On arxiv.org I did come across a very recent whitepaper by Smith & van de Geijn (labeled FLAME working note 83) that touches on how Goto's algorithm tries to optimize cache use but it does not address the topic of parallelization at all.

@brada4
Contributor

brada4 commented Feb 24, 2017

It does not look complex by any means:
driver/level3/gemm_thread_variable.c
If the GEMM parallelisation were at fault, don't you think it would have shown up on Haswell already, where nobody has complained yet?

@grisuthedragon
Contributor Author

grisuthedragon commented May 17, 2017

After a few weeks I ran into problems related to this issue again. This time I have a piece of code which reproduces the error. The attached code below computes the QR decomposition of a matrix using the tile-QR approach and is parallelized using OpenMP 4. Details about the algorithm can be found here; a stubbed sketch of the task structure is at the end of this comment.

System Details:

  • CentOS 7.3, little endian
  • IBM Power8 (SL822C) (20 Cores, 160 Threads)
  • GCC 5.4.1

For the first experiment I compiled the current OpenBLAS using

make USE_OPENMP=1 

and compiled the attached code by

gfortran -fopenmp tileqr.f90 path/to/openblas/libopenblas.a 

As the result of a test run I get:

$ OMP_THREAD_LIMIT=20 OMP_NUM_THREADS=20 ./a.out 10240 10240 256 32
LAPACK   -- Time:   46.312    Gflops/s:   30.913
Tiled    -- Time:    6.366    Gflops/s:  224.905
Max diff between R from tiled   and LAPACK is: 0.9E+05

which obviously did not return the correct result.

Using a single threaded OpenBLAS compiled with

make USE_THREAD=0 USE_OPENMP=0

I get

$  OMP_THREAD_LIMIT=20 OMP_NUM_THREADS=20 ./a.out 10240 10240 256 32
LAPACK   -- Time:   73.346    Gflops/s:   19.519
Tiled    -- Time:    5.364    Gflops/s:  266.911
 
Max diff between R from tiled   and LAPACK is: 0.2+226

which is wrong as well. Running the same code with RefBLAS and RefLAPACK, the maximum difference is on the order of 10^(-16). The same holds true if I use IBM ESSL as the BLAS library together with RefLAPACK. From this point of view it seems that the OpenBLAS functions, even in single-threaded operation mode, are not thread safe.

tileqr.f90.txt
The code only works if m % b = 0 and n % b = 0, which is guaranteed above. Furthermore, the performance of DGEQRT from LAPACK seems to be poor.
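For anyone who does not want to dig through the Fortran: the task structure of the tile QR is roughly the following. This is a heavily stubbed, hypothetical C/OpenMP sketch of the dependency pattern only; the kernel names are placeholders, not the routines used in tileqr.f90, and the tile count is arbitrary.

#include <stdio.h>
#include <stdlib.h>

#define NT 4        /* number of tile rows/columns (placeholder value) */
#define TS 256      /* tile size, as in the runs above */

/* Placeholder tile kernels - stand-ins for the factorization and update
 * steps, not real BLAS/LAPACK calls. */
static void factor_diag(double *Akk)                          { (void)Akk; }
static void update_row (const double *Akk, double *Akn)       { (void)Akk; (void)Akn; }
static void factor_sub (double *Akk, double *Amk)             { (void)Akk; (void)Amk; }
static void update_sub (const double *Amk, double *Akn, double *Amn)
{ (void)Amk; (void)Akn; (void)Amn; }

int main(void)
{
    static double *A[NT][NT];
    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++)
            A[i][j] = calloc((size_t)TS * TS, sizeof(double));

    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: A[k][k])
        factor_diag(A[k][k]);                 /* QR of the diagonal tile */

        for (int n = k + 1; n < NT; n++) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[k][n])
            update_row(A[k][k], A[k][n]);     /* apply Q^T to the k-th tile row */
        }
        for (int m = k + 1; m < NT; m++) {
            #pragma omp task depend(inout: A[k][k]) depend(inout: A[m][k])
            factor_sub(A[k][k], A[m][k]);     /* annihilate tile (m,k) against (k,k) */

            for (int n = k + 1; n < NT; n++) {
                #pragma omp task depend(in: A[m][k]) depend(inout: A[k][n]) depend(inout: A[m][n])
                update_sub(A[m][k], A[k][n], A[m][n]);   /* propagate the update */
            }
        }
    }   /* all tasks have completed at the implicit barrier here */

    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++)
            free(A[i][j]);
    puts("task graph executed");
    return 0;
}

Compiled with -fopenmp this only runs the (empty) task graph, but it shows which tiles each task reads and writes, and therefore where the BLAS/LAPACK kernels inside the real implementation get called concurrently from several OpenMP threads.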

@brada4
Contributor

brada4 commented May 17, 2017

It is an accuracy problem, not thread safety.

@grisuthedragon
Contributor Author

grisuthedragon commented May 18, 2017

@brada4 On the one hand, it works on x86-64 with different OpenBLAS and MKL versions, so this cannot be an accuracy problem. On the other hand, I am computing a QR decomposition, which is built on top of orthogonal/unitary transformations, so from the numerical point of view the differences between LAPACK and the new implementation cannot be explained that way. Furthermore, the fact that the single- and multi-threaded OpenBLAS results differ by more than 200 orders of magnitude cannot be explained by accuracy problems. (A minimal stand-alone way to probe this is sketched at the end of this comment.)

Another argument against an accuracy problem is that if I set the number of threads to 1 at runtime, I obtain the correct result (independent of the threading capabilities of OpenBLAS):

$ OMP_THREAD_LIMIT=1 OMP_NUM_THREADS=1 ./a.out 5120 5120 256 32
LAPACK   -- Time:    8.671    Gflops/s:   20.637
Tiled    -- Time:    9.205    Gflops/s:   19.440
 
Max diff between R from tiled   and LAPACK is: 0.1E-15
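For reference, a minimal stand-alone probe of the thread-safety question - this is not my tile-QR code, just a hypothetical sketch against the C interface - looks something like the following: every OpenMP thread repeatedly calls dgemm on the same read-only inputs into its own output buffer and compares against a reference computed before the parallel region. With a deterministic, thread-safe BLAS the difference should always be exactly zero.

#include <cblas.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 256

int main(void)
{
    double *A    = malloc((size_t)N * N * sizeof(double));
    double *B    = malloc((size_t)N * N * sizeof(double));
    double *Cref = malloc((size_t)N * N * sizeof(double));

    srand(7);
    for (size_t i = 0; i < (size_t)N * N; i++) {
        A[i] = (double)rand() / RAND_MAX;
        B[i] = (double)rand() / RAND_MAX;
    }

    /* reference result, computed before any OpenMP threads exist */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, Cref, N);

    int failures = 0;
    #pragma omp parallel reduction(+: failures)
    {
        double *C = malloc((size_t)N * N * sizeof(double));
        for (int run = 0; run < 100; run++) {
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N, 1.0, A, N, B, N, 0.0, C, N);
            double maxdiff = 0.0;
            for (size_t i = 0; i < (size_t)N * N; i++)
                maxdiff = fmax(maxdiff, fabs(C[i] - Cref[i]));
            if (maxdiff != 0.0) failures++;    /* result differs from serial run */
        }
        free(C);
    }

    printf("runs with a result different from the serial reference: %d\n", failures);
    free(A); free(B); free(Cref);
    return failures != 0;
}

Linked against the single-threaded libopenblas and compiled with -fopenmp, a non-zero count here would point at a thread-safety problem rather than an accuracy problem.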

@martin-frbg
Collaborator

Possibly not a race this time but side effects of the allegedly broken assembly kernels for PPC (issue #1078)

@grisuthedragon
Contributor Author

@martin-frbg I think that is the only plausible explanation at the moment, because on x86-64 (Sandy Bridge, Haswell) I have never observed such behavior.

I am not able to write the assembly code myself. The only thing I can do is perform tests on the POWER platforms (Power 8 little endian / Power 7 big endian).

@grisuthedragon
Contributor Author

In order to see which optimized routine the errors are coming from, I had the following idea: I replaced KERNEL.POWER8 by the most generic version, which relies only on routines written in pure C: KERNEL.POWER8
Then I compiled OpenBLAS using:

make USE_OPENMP=1 CC=gcc-5 CXX=g++-5 FC=gfortran-5 NO_LAPACK=1

which unfortunately leads to some errors in the BLAS tests (for double, complex, and complex*16 the results are similar):

cat test/SBLAT3.SUMM 
 TESTS OF THE REAL             LEVEL 3 BLAS

 THE FOLLOWING PARAMETER VALUES WILL BE USED:
   FOR N                   0     1     2     3     7    31
   FOR ALPHA             0.0   1.0   0.7
   FOR BETA              0.0   1.0   1.3

 ROUTINES PASS COMPUTATIONAL TESTS IF TEST RATIO IS LESS THAN   16.00

 RELATIVE MACHINE PRECISION IS TAKEN TO BE  1.2E-07

 SGEMM  PASSED THE TESTS OF ERROR-EXITS

 SGEMM  PASSED THE COMPUTATIONAL TESTS ( 17496 CALLS)

 SSYMM  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.262339         -0.849231E-01
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* SSYMM  FAILED ON CALL NUMBER:
    382: SSYMM ('R','U',  1,  7, 1.0, A,  8, B,  2, 0.0, C,  2)    .

 STRMM  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.485095          0.186814    
      THESE ARE THE RESULTS FOR COLUMN   2
 ******* STRMM  FAILED ON CALL NUMBER:
    758: STRMM ('R','U','N','U',  1,  7, 1.0, A,  8, B,  2)        .

 STRSM  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.186813         -0.166775E-01
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* STRSM  FAILED ON CALL NUMBER:
    764: STRSM ('R','U','T','U',  1,  7, 1.0, A,  8, B,  2)        .

 SSYRK  PASSED THE TESTS OF ERROR-EXITS

 SSYRK  PASSED THE COMPUTATIONAL TESTS (  1944 CALLS)

 SSYR2K PASSED THE TESTS OF ERROR-EXITS

 SSYR2K PASSED THE COMPUTATIONAL TESTS (  1944 CALLS)

 END OF TESTS
 

Level 1 and Level 2 operations work without any error. If we get this working correctly, then I can plug the optimized routines back into KERNEL.POWER8 one by one and we will see which one breaks at least my application.

@martin-frbg
Collaborator

The stock KERNEL.POWER8 file has a number of commented-out entries for xSYMV functions near the end - maybe it would make sense to enable these (first?) to see if switching back to their generic implementations helps.

@grisuthedragon
Contributor Author

If I only enable the generic implementations of the xSYMV routines, all BLAS tests pass, but for my application this does not change anything.

@martin-frbg
Collaborator

Getting the BLAS tests to pass would appear to be an important first step in any case. Alan Modra has fixed the inline assembly contained in the local .c implementations, so the likely candidates for further tests are the implementations in the various power8 .S files - xGEMM, xTRSM, xTRMM - or do I take your reply to mean that switching everything back to the non-optimized defaults does not make your code run correctly either?

@grisuthedragon
Contributor Author

grisuthedragon commented May 18, 2017

I think having a generic OpenBLAS that does not rely on any platform-specific optimizations, i.e. is written in pure (ANSI/ISO) C, would be preferable for finding errors. Regardless of the problem I currently have with the POWER architecture, this would allow us to run OpenBLAS on arbitrary platforms that only provide a C/Fortran compiler. On the one hand we could check whether the general OpenBLAS concept works and search for errors in that part; on the other hand, once we know that the general concept works, we could enable the optimized kernels step by step to figure out which one causes the problems. I think that is a more systematic way to find these complicated errors.

If there are no other race conditions in OpenBLAS, then my code should run correctly with the unoptimized routines. As I said, it works on x86-64 with OpenBLAS/MKL/RefBLAS/ATLAS and on ppc64le with RefBLAS/ESSL, so the bug most likely lies inside the ppc64-specific routines.

@brada4
Contributor

brada4 commented May 18, 2017

By the way, you could exclude just the three routines the tests show as defective, not the others...

I ran your sample on Sandy Bridge and Broadwell 10-core CPUs; for me LAPACK is a tiny bit faster than the tiled routine, but it slows down a bit as the thread count is increased past some threshold (faster until 3-4 threads, then slightly down) - can you check for the same effect once you get the accuracy under control?

@grisuthedragon
Contributor Author

@brada4
On my workstation (Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, 4 cores, gcc 5.4.1) I get:

OMP_NUM_THREADS=4 ./a.out 5120 5120  256 32
LAPACK   -- Time:    5.515    Gflops/s:   32.447
Tiled    -- Time:    3.421    Gflops/s:   52.311
 
Max diff between R from tiled   and LAPACK is: 0.2E-16

@brada4
Contributor

brada4 commented May 19, 2017

I suspect this is in one way or another similar to the performance issue in #1051. Try perf record + perf report; I see sched_yield + schedule taking a larger percentage of CPU time the more CPUs you have (it still looks almost optimal at 4 cores, but gets worse after that).

@grisuthedragon
Contributor Author

grisuthedragon commented May 19, 2017

@brada4 Performance is not the topic of this issue. In order to obtain good performance with the tile-QR approach one has to change the data layout of the matrices, and the code presented above was only meant to reproduce the issue. Once I get correct results on the 20-core POWER server, I will take a look at the performance.

@martin-frbg
Collaborator

So could you give an update on the actual issue please - do you still see your test failing on ppc with everything replaced by generic functions, or did you not get around to switching out the individual xGEMM etc. implementations yet? (And I assume you are still building with the USE_SIMPLE_THREADED_LEVEL3=1 option to work around the original issue, though I did not see this mentioned in your recent messages?)

@grisuthedragon
Contributor Author

I have started a new, clean issue as #1191.
