forked from KhronosGroup/webcl-validator
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathDESIGN.txt
564 lines (411 loc) · 19.3 KB
/
DESIGN.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
Overview of the validator design
================================
General overview
----------------
The WebCL validator is an executable tool that takes a WebCL C source
file as its input and outputs a validated version of the input
source. The validated source contains checks that prevent illegal
memory references. If validation can't be done, the tool outputs
nothing but relevant error messages. If everything goes well the
transformed output source can be fed to an OpenCL driver.
The validator has been implemented as a Clang tool. There are actually
multiple Clang tools embedded into a single executable. Each of them
implement different validation phases. The first phase tool
preprocesses the input source. Intermediate phases normalize and
simplify the input source. The last phase runs the validation
algorithm.
Preprocessing and interfacing
-----------------------------
Transformations need to be applied on a source that doesn't contain
any preprocessing directives because the validator phase needs to know
what it's going to validate. That wouldn't be possible if there were
any (complicated) branching directives in the code.
Since WebCL C code may contain conditional preprocessor branches, any
implicit preprocessor definitions need to be known. In practice these
implicit definitions are the features and extensions supported by the
used OpenCL driver. More specifically, the supported features and
extensions depend on the used OpenCL device. Therefore the validator
needs to know which device is used for building its output.
If the validator knows the target device (e.g. it can be passed as a
command line option), it can deduce the implicit definitions by
querying the device properties. However, this would introduce a
dependency to the used OpenCL driver. The validator can function
without OpenCL driver dependency if the relevant implicit defines are
passed to the tool as command line options. For this reason each
implicit define should be passed to the tool with -D option.
Currently supported options (extensions) are:
-Dcl_khr_fp16
-Dcl_khr_fp64
-Dcl_khr_gl_sharing
-Dcl_khr_initialize_memory
Normalization of input
----------------------
The input sources are normalized after preprocessing. Normalizations
are transformations that simplify the structure of the code but don't
change the functionality. For example, one possible normalization is
the separation of type definitions from variable declarations:
struct Type { int field; } variable;
->
struct Type { int field; };
struct Type variable;
Normalizations are done only to simplify transformations. Sometimes
multiple transformations need to be applied to same source code
locations and it's much easier to implement one specific
transformation at a time than to keep track of a very complex
transformation consisting of multiple partial transformations. This is
because Clang's source code rewriting mechanism allows only one
transformation to be applied reliably to a specific source code
location.
Therefore there are multiple passes, each pass applying no more than a
single transformation to a single source code location and each
subsequent pass reparsing the output of the previous pass.
Validation approach
-------------------
The validation solution described here is based on an already existing
LLVM IR based implementation. This alternative solution was invented
as a replacement for a smart pointer based solution that wasn't found
to be good enough in practice.
The project sources are available at
https://github.com/toaarnio/llvm-ir-memprotect. The people involved in
that project have also done initial performance analysis and therefore
they should have some rough benchmark results available. However, they
haven't done extensive benchmarking yet.
It's best to start explaining this alternative solution with example
code.
Example transformation
----------------------
Consider the following OpenCL C code:
int helper(int *ptr)
{
return *ptr;
}
__kernel void entry(__global int *res)
{
int val = 0;
res[get_global_id(0)] = helper(&val);
}
The code will be transformed as:
typedef struct {
int entry__val;
} _WclPrivates;
typedef struct {
__global int *entry__res_min;
__global int *entry__res_max;
} _WclGlobalLimits;
typedef struct {
_WclGlobalLimits gl;
_WclPrivates pa;
} _WclProgramAllocations;
int helper(_WclProgramAllocations *_wcl_allocs, int *ptr)
{
return *CHECK_PRIVATE(_wcl_allocs, ptr);
}
__kernel void entry(__global int *res, unsigned long _wcl_res_size)
{
_WclProgramAllocations _wcl_allocations_allocation = {
{ &res[0], &res[_wcl_res_size] }, // _WclGlobalLimits
{ 0 } // _WclPrivates
};
_WclProgramAllocations *_wcl_allocs = &_wcl_allocations_allocation;
*CHECK_GLOBAL(_wcl_allocs, res + get_global_id(0)) =
helper(_wcl_allocs, &_wcl_allocs->pa.entry__val);
}
Introduction to required transformations
----------------------------------------
Variable relocation:
Any variable that requires its address to be known, must be relocated
to an address space specific structure. In the example 'val' was
relocated to '_WclPrivates'. Similar structures are required for local
and constant address spaces.
Variable renaming:
Relocated variables may need to be renamed, because identically named
variables may be relocated to the same address space structure from
multiple places. In the example 'val' was renamed as 'entry__val', the
new name must be unique within the address space structure.
Type relocation:
The types of relocated variables need to be relocated as well (or at
least forward declarations need to be generated). If the type of 'val'
would have been 'struct V { int v; }' instead of 'int', the type would
have had to be introduced before '_WclPrivates'.
Type renaming:
Just like relocated variables, some relocated types need to be
uniquely renamed.
Static limit check:
Accesses for memory that has been defined inside the kernel can be
checked by determining the begin and end addresses of the
corresponding address space structure. In the example 'CHECK_PRIVATE'
can clamp 'ptr' to be within instance of '_WclPrivates'. The limits of
this check are known at compile time.
Dynamic limit check:
Size of memory regions passed to kernels need to be defined. In the
example '_wcl_res_size' was introduced for 'res'. The extra size
parameters can be used to define limits for passed memory
regions. This is shown in '_WclGlobalLimits'. Similar limit structures
are needed for local and constant address spaces as well. In the
example 'CHECK_GLOBAL' will use the limit information to clamp
'res + get_global_id(0)'.
Initialization:
All address space structures (for internal allocations) and limit
structures (for external allocations) are put together into
'_WclProgramAllocations'. Instances of the generated structures need to
be created and initialized in the beginning of each kernel.
Transformation:
A pointer to the main allocation structure, '_wcl_allocs', needs to be
passed to any function that performs pointer checks. The signatures of
these functions need to be modfied so that they can accept this extra
argument. See the signature and call of 'helper'.
Each reference to relocated variables needs to be transformed. In the
example 'val' was transformed to '_wcl_allocs->pa.entry__val'.
Issues with required transformations
------------------------------------
Variable relocation:
Simple implementation allocates too much space for address space
structures and relocates all variables, not only those whose address
is needed:
__kernel void entry(__global int *res)
{
int val = 0; // relocated even though not necessary
res[get_global_id(0)] = val;
}
In the ideal case each kernel requires its own structures, otherwise
too many unused variables may be relocated:
int helper(int *ptr)
{
return *ptr;
}
__kernel void entry1(__global int *res)
{
int val1 = 0; // relocated
res[get_global_id(0)] = helper(&val1);
}
__kernel void entry2(__global int *res)
{
int val2 = 0; // relocated to the same structure
res[get_global_id(0)] = helper(&val2);
}
Static limit check:
In principle, each internally allocated variable can be relocated to a
single address space structure. Therefore each pointer access needs to
perform only a single region check.
However, in some very rare cases the initialization of a single
constant address space structure may become challenging:
__constant int val = { 0 };
__constant int * __constant ptr = &val;
An initializer for transformed code, that is accepted by C compilers,
makes at least AMDs OpenCL C compiler crash:
typedef struct {
int val;
__constant int *ptr;
} _WclConstants;
__constant _WclConstants _wcl_constants = { 0, &_wcl_constants.val };
Dynamic limit check:
Each memory region passed to a kernel fragments the corresponding
address space:
__kernel void entry(
__constant int *c1, __constant int *c2,
__global int *g1, __global int *g2,
__local int *l1, __local int *l2)
{
...
}
In the above example each pointer access needs to perform three region
checks (2 dynamic + 1 static) for local and constant address
spaces. Two region checks are required for global address space
pointers.
Comparison with smart pointers
------------------------------
By a smart pointer we mean a structure that contains limits for a
pointer:
typedef struct { char *min; char *ptr; char *max; } SmartPtr;
Generality:
The alternative method doesn't restrict pointer operations in any way,
whereas keeping track of smart pointer limits can become challenging
in some cases. Consider the following example:
int value = 0xabcdef;
int *pointer = &value;
int *base = 0;
unsigned long offset = pointer - base;
int value2 = *(base + offset);
In this case 'base' can't have valid limits and therefore evaluating
'*(base + offset)' must fail with smart pointers.
Size required by pointers:
Alternative solution doesn't suffer size overhead regarding
pointers.
This issue would be more pronounced with large smart pointer
arrays. Each smart pointer would require additional min/max addresses
to be stored and would consume three times more memory/registers,
which would have a negative effect on memory consumption.
Initialization and assignment of pointers:
Smart pointer approach requires assigment to limit fields whenever a
pointer is created (e.g. with &-operator) or assigned (possibly passed
as argument). This degrades performance because whenever a pointer is
assigned we need to store/load three addresses instead of just one.
Preloading memory access limit values:
Since in smart pointer approach each pointer has its own limits, many
different pointers cannot share the same limits. This means that we
would need to do indirect load operation for every pointer access to
get limits for that single memory access.
With the altenative approach we have single limits for private address
space, which we can load just once in running kernel and reuse the
loaded values for every memory access of private address space. For
other address spaces we have few separate limits, which the compiler
should be able to optimize in such a way that the required limit loads
would be done just once in each function.
Number of region checks in pointer accesses:
Smart pointers require only one region check for each pointer access,
where min and max limits are always loaded indirectly, whereas the
alternative approach may require multiple checks and usually limits
needs to be loaded just once for each limit range in the program.
Only one check is needed for pointers accessing private memory. Local,
global and constant memory accesses might need multiple checks depending on
how many different memory areas are passed to kernels and how easy it
is to resolve exact limits of each memory access.
The user can usually rearrange the code in such a way that local and
constant memory objects don't need to be passed. Global memory is
commonly copied to local memory before it's used and therefore global
memory checks can also be minimized if the user wants to. Compiler
also knows at compile time if inefficient checks are needed and can
guide the programmer how to possibly avoid the checks.
Also simple dependence analysis can resolve single limits for the
cases where kernel arguments are accessed directly through a base
pointer.
It's therefore possible to write code that doesn't require multiple
region checks for core parts of the code. The validator tool could
provide optimization hints and warnings if it recognizes that too many
region checks are required.
Coding guidelines for programmer to avoid checks
------------------------------------------------
Avoid having to use unneccessary amount of *(), -> or [] operators,
because each of them causes check to be added to code.
NOTE: In the future, dependecy analysis can help to remove checks in
many cases.
Memory access validation
------------------------
All types of memory accesses are recognized and normalized to *()
access:
table[index]; -> (*(table + index));
struct_ptr->field; -> (*(struct_ptr)).field;
vector_type_ptr->x -> (*(vector_type_ptr)).x;
*(any_pointer); -> *(any_pointer);
The address inside *() notation is then clamped to correct range:
*(WCL_CLAMP((type), ptr, min_ptr, max_ptr, null_ptr))
Null pointers
-------------
The validation algorithm keeps track of the largest *() memory access
for each address space. If there is a memory access that doesn't fit
into any valid memory area, two courses of action can be taken.
A special null variable can be introduced and the invalid memory
access will then access that variable. The null variable is simply an
initialized memory region that is big enough for the largest *()
memory access. This is done for locals and constants.
Code can be emitted at the beginning of each kernel that simply makes
the kernel return if there is a risk that some memory accesses can't
be made safely. This is done for globals and privates.
Memory initialization
---------------------
Global memory is not initialized. Constant memory passed as a kernel
parameter is also left uninitialized.
Local memory, which doesn't support variable initializers, is
initialized at start of each workitem by calling a parallel
initialization algorithm.
Private memory is initialized at start of each workitem by using an
initializer. Constants memory is initialized in the same manner.
Collecting variables to address space structures
------------------------------------------------
Basically all variables, whose contents may be accessed with a
pointer, must be collected. For example, whole structures are
collected if the address of one of their fields is taken:
struct { int a; int b; } variable;
foo(&(variable.a));
This means that function parameters may need to be collected as well:
void bar(int parameter)
{
foo(¶meter);
}
Safe wrappers for vload and vstore
----------------------------------
All variants of vload are wrapped by the following algorithm:
- For all calls to vload/vstore:
- Generate a function that
- Checks that the pointer argument is within bounds
(using similar mechanism as for clamping pointers)
- If so, execute the original call, otherwise do nothing
The wrappers themselves care very little of the types and the
generated wrapping functions just use the same types as the original
call (some checks are made, for example the pointer argument must be
of pointer type). The burden of checking if a certain vload variant
for certain types exists is left to the OpenCL implementation,
although all the list of valid vload and vstore variants are listed.
Safety checking for read_imagef, read_imagei
--------------------------------------------
The image2d_t argument of read_imagef and read_imagei is checked to be
the right type. The sampler_t argument is required to be an integer
constant expression, an error is reported if it is not. The argument
is decoded and then re-encoded by building a binary or expression
using the constants related to sampler_t, ie. CLK_ADDRESS_CLAMP |
CLK_FILTER_LINEAR. This handles the case of the original code
containing an integer literal, thus always producing an expression
that the OpenCL implementation can understand. While using integer
literals may not be the sane thing to do, it will still result in safe
code.
Safety checking for image2d_t
-----------------------------
The following restrictions are placed to image2d_t, which prevents the
user from mutating image2d_t handles or creating his own:
* All function call arguments of type image2d_t must originate
directly from the functions parameters
* All references to values of type image2d_t must be arguments to a
function call
* image2d_t values that originate from kernel parameters are not
relocated
This allows normal usage (image2d_t flows from kernel arguments to a
function that makes use of it) while preventing attacks:
* While the user is able to craft an instance of image2d_t, he will
not be able to pass it to a function (because in such cases it must
originate directly from a function parameter and a cast of a local
variable is not such a parameter).
* While the user is able to declare a variable of type image2d_t* he
will not be able to take an address of an existing image2d_t because
it would not then be a direct function argument. He will also not be
able to pass its dereferenced version to a function as an argument
as that would not be directly from a function parameter.
* While the user may be able to clear all the memory accessible to the
kernel, the image2d_t parameters for the kernel are not relocated to
the memory the user is able to access. Therefore they cannot be
altered.
Future improvements
-------------------
Here is a list of improvements that could be done in the future:
* Add dependecy analysis to follow dependencies of limits to respect
between function calls.
* Inject safe builtin implementations for the builtin functions which
take pointers as arguments for the builtins that are not yet
wrapped.
* Add separate version of helper functions for each kernel and create
different limit structures for each kernel.
* Find worst case scenario from call graph about needed private memory
and write stack emulation, instead of having all the private
variables separately in private address space structure.
* Add value range analysis, so that checks can be moved out of loops
to start of program, or in extreme case completely create
pre-validation kernel, which checks that arguments are ok before
just executing the code (this would allow e.g. autovectorization to
work normally).
Implementation details and brief introduction to code base
==========================================================
main: Creates and runs cascaded tools. Each tool takes OpenCL C code
as input and outputs the code with some transformations.
WebCLAction, WebCLTool: Components that represent the preprocessing,
normalization and validation passes.
WebCLMatcher: Components that implement normalization passes using AST
matchers.
WebCLConsumer: Runs the memory access validation algorithm. Calls
visitors to check and analyse the AST and then runs further passes
that transform the code based on earlier analysis.
WebCLVisitor: Components for validation pass that check and analyze
ASTs, but don't do any transformations.
WebCLPass: Components for validation pass that perform
transformations.
WebCLTransformer: Provides a way for validation passes to create
transformations. This should be mostly a brainless class that knows
how transformations are done, but doesn't know when to do them.