DESIGN.txt

Overview of the validator design
================================


General overview
----------------

The WebCL validator is an executable tool that takes a WebCL C source
file as its input and outputs a validated version of the input
source. The validated source contains checks that prevent illegal
memory references. If validation can't be done, the tool outputs
nothing but relevant error messages. If everything goes well the
transformed output source can be fed to an OpenCL driver.

The validator has been implemented as a Clang tool. There are actually
multiple Clang tools embedded into a single executable. Each of them
implement different validation phases. The first phase tool
preprocesses the input source. Intermediate phases normalize and
simplify the input source. The last phase runs the validation
algorithm.


Preprocessing and interfacing
-----------------------------

Transformations need to be applied on a source that doesn't contain
any preprocessing directives because the validator phase needs to know
what it's going to validate. That wouldn't be possible if there were
any (complicated) branching directives in the code.

Since WebCL C code may contain conditional preprocessor branches, any
implicit preprocessor definitions need to be known. In practice these
implicit definitions are the features and extensions supported by the
used OpenCL driver. More specifically, the supported features and
extensions depend on the used OpenCL device. Therefore the validator
needs to know which device is used for building its output.

If the validator knows the target device (e.g. it can be passed as a
command line option), it can deduce the implicit definitions by
querying the device properties. However, this would introduce a
dependency to the used OpenCL driver. The validator can function
without OpenCL driver dependency if the relevant implicit defines are
passed to the tool as command line options. For this reason each
implicit define should be passed to the tool with -D option.

Currently supported options (extensions) are:
    -Dcl_khr_fp16
    -Dcl_khr_fp64
    -Dcl_khr_gl_sharing
    -Dcl_khr_initialize_memory


Normalization of input
----------------------

The input sources are normalized after preprocessing. Normalizations
are transformations that simplify the structure of the code but don't
change the functionality. For example, one possible normalization is
the separation of type definitions from variable declarations:

struct Type { int field; } variable;
->
struct Type { int field; };
struct Type variable;

Normalizations are done only to simplify transformations. Sometimes
multiple transformations need to be applied to same source code
locations and it's much easier to implement one specific
transformation at a time than to keep track of a very complex
transformation consisting of multiple partial transformations. This is
because Clang's source code rewriting mechanism allows only one
transformation to be applied reliably to a specific source code
location.

Therefore there are multiple passes, each pass applying no more than a
single transformation to a single source code location and each
subsequent pass reparsing the output of the previous pass.


Validation approach
-------------------

The validation solution described here is based on an already existing
LLVM IR based implementation. This alternative solution was invented
as a replacement for a smart pointer based solution that wasn't found
to be good enough in practice.

The project sources are available at
https://github.com/toaarnio/llvm-ir-memprotect. The people involved in
that project have also done initial performance analysis and therefore
they should have some rough benchmark results available. However, they
haven't done extensive benchmarking yet.

It's best to start explaining this alternative solution with example
code.


Example transformation
----------------------

Consider the following OpenCL C code:

int helper(int *ptr)
{
    return *ptr;
}

__kernel void entry(__global int *res)
{
    int val = 0;
    res[get_global_id(0)] = helper(&val);
}

The code will be transformed as:

typedef struct {
    int entry__val;
} _WclPrivates;

typedef struct {
    __global int *entry__res_min;
    __global int *entry__res_max;
} _WclGlobalLimits;

typedef struct {
    _WclGlobalLimits gl;
    _WclPrivates pa;
} _WclProgramAllocations;

int helper(_WclProgramAllocations *_wcl_allocs, int *ptr)
{
    return *CHECK_PRIVATE(_wcl_allocs, ptr);
}

__kernel void entry(__global int *res, unsigned long _wcl_res_size)
{
    _WclProgramAllocations _wcl_allocations_allocation = {
        { &res[0], &res[_wcl_res_size] }, // _WclGlobalLimits
        { 0 } // _WclPrivates
    };
    _WclProgramAllocations *_wcl_allocs = &_wcl_allocations_allocation;

    *CHECK_GLOBAL(_wcl_allocs, res + get_global_id(0)) =
        helper(_wcl_allocs, &_wcl_allocs->pa.entry__val);
}


Introduction to required transformations
----------------------------------------

Variable relocation:

Any variable that requires its address to be known, must be relocated
to an address space specific structure. In the example 'val' was
relocated to '_WclPrivates'. Similar structures are required for local
and constant address spaces.

Variable renaming:

Relocated variables may need to be renamed, because identically named
variables may be relocated to the same address space structure from
multiple places. In the example 'val' was renamed as 'entry__val', the
new name must be unique within the address space structure.

Type relocation:

The types of relocated variables need to be relocated as well (or at
least forward declarations need to be generated). If the type of 'val'
would have been 'struct V { int v; }' instead of 'int', the type would
have had to be introduced before '_WclPrivates'.

Type renaming:

Just like relocated variables, some relocated types need to be
uniquely renamed.

Static limit check:

Accesses for memory that has been defined inside the kernel can be
checked by determining the begin and end addresses of the
corresponding address space structure. In the example 'CHECK_PRIVATE'
can clamp 'ptr' to be within instance of '_WclPrivates'. The limits of
this check are known at compile time.

Dynamic limit check:

Size of memory regions passed to kernels need to be defined. In the
example '_wcl_res_size' was introduced for 'res'. The extra size
parameters can be used to define limits for passed memory
regions. This is shown in '_WclGlobalLimits'. Similar limit structures
are needed for local and constant address spaces as well. In the
example 'CHECK_GLOBAL' will use the limit information to clamp
'res + get_global_id(0)'.

Initialization:

All address space structures (for internal allocations) and limit
structures (for external allocations) are put together into
'_WclProgramAllocations'. Instances of the generated structures need to
be created and initialized in the beginning of each kernel.

Transformation:

A pointer to the main allocation structure, '_wcl_allocs', needs to be
passed to any function that performs pointer checks. The signatures of
these functions need to be modfied so that they can accept this extra
argument. See the signature and call of 'helper'.

Each reference to relocated variables needs to be transformed. In the
example 'val' was transformed to '_wcl_allocs->pa.entry__val'.


Issues with required transformations
------------------------------------

Variable relocation:

Simple implementation allocates too much space for address space
structures and relocates all variables, not only those whose address
is needed:

__kernel void entry(__global int *res)
{
    int val = 0; // relocated even though not necessary
    res[get_global_id(0)] = val;
}

In the ideal case each kernel requires its own structures, otherwise
too many unused variables may be relocated:

int helper(int *ptr)
{
    return *ptr;
}

__kernel void entry1(__global int *res)
{
    int val1 = 0; // relocated
    res[get_global_id(0)] = helper(&val1);
}

__kernel void entry2(__global int *res)
{
    int val2 = 0; // relocated to the same structure
    res[get_global_id(0)] = helper(&val2);
}

Static limit check:

In principle, each internally allocated variable can be relocated to a
single address space structure. Therefore each pointer access needs to
perform only a single region check.

However, in some very rare cases the initialization of a single
constant address space structure may become challenging:

__constant int val = { 0 };
__constant int * __constant ptr = &val;

An initializer for transformed code, that is accepted by C compilers,
makes at least AMDs OpenCL C compiler crash:

typedef struct {
    int val;
    __constant int *ptr;
} _WclConstants;

__constant _WclConstants _wcl_constants = { 0, &_wcl_constants.val };

Dynamic limit check:

Each memory region passed to a kernel fragments the corresponding
address space:

__kernel void entry(
    __constant int *c1, __constant int *c2,
    __global int *g1, __global int *g2,
    __local int *l1, __local int *l2)
{
    ...
}

In the above example each pointer access needs to perform three region
checks (2 dynamic + 1 static) for local and constant address
spaces. Two region checks are required for global address space
pointers.


Comparison with smart pointers
------------------------------

By a smart pointer we mean a structure that contains limits for a
pointer:

typedef struct { char *min; char *ptr; char *max; } SmartPtr;

Generality:

The alternative method doesn't restrict pointer operations in any way,
whereas keeping track of smart pointer limits can become challenging
in some cases. Consider the following example:

    int value = 0xabcdef;
    int *pointer = &value;
    int *base = 0;
    unsigned long offset = pointer - base;
    int value2 = *(base + offset);

In this case 'base' can't have valid limits and therefore evaluating
'*(base + offset)' must fail with smart pointers.

Size required by pointers:

Alternative solution doesn't suffer size overhead regarding
pointers.

This issue would be more pronounced with large smart pointer
arrays. Each smart pointer would require additional min/max addresses
to be stored and would consume three times more memory/registers,
which would have a negative effect on memory consumption.

Initialization and assignment of pointers:

Smart pointer approach requires assigment to limit fields whenever a
pointer is created (e.g. with &-operator) or assigned (possibly passed
as argument). This degrades performance because whenever a pointer is
assigned we need to store/load three addresses instead of just one.

Preloading memory access limit values:

Since in smart pointer approach each pointer has its own limits, many
different pointers cannot share the same limits. This means that we
would need to do indirect load operation for every pointer access to
get limits for that single memory access.

With the altenative approach we have single limits for private address
space, which we can load just once in running kernel and reuse the
loaded values for every memory access of private address space. For
other address spaces we have few separate limits, which the compiler
should be able to optimize in such a way that the required limit loads
would be done just once in each function.

Number of region checks in pointer accesses:

Smart pointers require only one region check for each pointer access,
where min and max limits are always loaded indirectly, whereas the
alternative approach may require multiple checks and usually limits
needs to be loaded just once for each limit range in the program.

Only one check is needed for pointers accessing private memory. Local,
global and constant memory accesses might need multiple checks depending on
how many different memory areas are passed to kernels and how easy it
is to resolve exact limits of each memory access.

The user can usually rearrange the code in such a way that local and
constant memory objects don't need to be passed. Global memory is
commonly copied to local memory before it's used and therefore global
memory checks can also be minimized if the user wants to. Compiler
also knows at compile time if inefficient checks are needed and can
guide the programmer how to possibly avoid the checks.

Also simple dependence analysis can resolve single limits for the
cases where kernel arguments are accessed directly through a base
pointer.

It's therefore possible to write code that doesn't require multiple
region checks for core parts of the code. The validator tool could
provide optimization hints and warnings if it recognizes that too many
region checks are required.


Coding guidelines for programmer to avoid checks
------------------------------------------------

Avoid having to use unneccessary amount of *(), -> or [] operators,
because each of them causes check to be added to code.

NOTE: In the future, dependecy analysis can help to remove checks in
      many cases.


Memory access validation
------------------------

All types of memory accesses are recognized and normalized to *()
access:

table[index]; -> (*(table + index));

struct_ptr->field; -> (*(struct_ptr)).field;

vector_type_ptr->x -> (*(vector_type_ptr)).x;

*(any_pointer); -> *(any_pointer);

The address inside *() notation is then clamped to correct range:

*(WCL_CLAMP((type), ptr, min_ptr, max_ptr, null_ptr))


Null pointers
-------------

The validation algorithm keeps track of the largest *() memory access
for each address space. If there is a memory access that doesn't fit
into any valid memory area, two courses of action can be taken.

A special null variable can be introduced and the invalid memory
access will then access that variable. The null variable is simply an
initialized memory region that is big enough for the largest *()
memory access. This is done for locals and constants.

Code can be emitted at the beginning of each kernel that simply makes
the kernel return if there is a risk that some memory accesses can't
be made safely. This is done for globals and privates.


Memory initialization
---------------------

Global memory is not initialized. Constant memory passed as a kernel
parameter is also left uninitialized.

Local memory, which doesn't support variable initializers, is
initialized at start of each workitem by calling a parallel
initialization algorithm.

Private memory is initialized at start of each workitem by using an
initializer. Constants memory is initialized in the same manner.


Collecting variables to address space structures
------------------------------------------------

Basically all variables, whose contents may be accessed with a
pointer, must be collected. For example, whole structures are
collected if the address of one of their fields is taken:

struct { int a; int b; } variable;
foo(&(variable.a));

This means that function parameters may need to be collected as well:

void bar(int parameter)
{
    foo(&parameter);
}

Safe wrappers for vload and vstore
----------------------------------

All variants of vload are wrapped by the following algorithm:
- For all calls to vload/vstore:
  - Generate a function that
    - Checks that the pointer argument is within bounds
      (using similar mechanism as for clamping pointers)
    - If so, execute the original call, otherwise do nothing

The wrappers themselves care very little of the types and the
generated wrapping functions just use the same types as the original
call (some checks are made, for example the pointer argument must be
of pointer type). The burden of checking if a certain vload variant
for certain types exists is left to the OpenCL implementation,
although all the list of valid vload and vstore variants are listed.

Safety checking for read_imagef, read_imagei
--------------------------------------------

The image2d_t argument of read_imagef and read_imagei is checked to be
the right type. The sampler_t argument is required to be an integer
constant expression, an error is reported if it is not. The argument
is decoded and then re-encoded by building a binary or expression
using the constants related to sampler_t, ie. CLK_ADDRESS_CLAMP |
CLK_FILTER_LINEAR. This handles the case of the original code
containing an integer literal, thus always producing an expression
that the OpenCL implementation can understand. While using integer
literals may not be the sane thing to do, it will still result in safe
code.

Safety checking for image2d_t
-----------------------------

The following restrictions are placed to image2d_t, which prevents the
user from mutating image2d_t handles or creating his own:

* All function call arguments of type image2d_t must originate
  directly from the functions parameters

* All references to values of type image2d_t must be arguments to a
  function call

* image2d_t values that originate from kernel parameters are not
  relocated

This allows normal usage (image2d_t flows from kernel arguments to a
function that makes use of it) while preventing attacks:

* While the user is able to craft an instance of image2d_t, he will
  not be able to pass it to a function (because in such cases it must
  originate directly from a function parameter and a cast of a local
  variable is not such a parameter).

* While the user is able to declare a variable of type image2d_t* he
  will not be able to take an address of an existing image2d_t because
  it would not then be a direct function argument. He will also not be
  able to pass its dereferenced version to a function as an argument
  as that would not be directly from a function parameter.

* While the user may be able to clear all the memory accessible to the
  kernel, the image2d_t parameters for the kernel are not relocated to
  the memory the user is able to access. Therefore they cannot be
  altered.

Future improvements
-------------------

Here is a list of improvements that could be done in the future:

* Add dependecy analysis to follow dependencies of limits to respect
  between function calls.

* Inject safe builtin implementations for the builtin functions which
  take pointers as arguments for the builtins that are not yet
  wrapped.

* Add separate version of helper functions for each kernel and create
  different limit structures for each kernel.

* Find worst case scenario from call graph about needed private memory
  and write stack emulation, instead of having all the private
  variables separately in private address space structure.

* Add value range analysis, so that checks can be moved out of loops
  to start of program, or in extreme case completely create
  pre-validation kernel, which checks that arguments are ok before
  just executing the code (this would allow e.g. autovectorization to
  work normally).


Implementation details and brief introduction to code base
==========================================================

main: Creates and runs cascaded tools. Each tool takes OpenCL C code
as input and outputs the code with some transformations.

WebCLAction, WebCLTool: Components that represent the preprocessing,
normalization and validation passes.

WebCLMatcher: Components that implement normalization passes using AST
matchers.

WebCLConsumer: Runs the memory access validation algorithm. Calls
visitors to check and analyse the AST and then runs further passes
that transform the code based on earlier analysis.

WebCLVisitor: Components for validation pass that check and analyze
ASTs, but don't do any transformations.

WebCLPass: Components for validation pass that perform
transformations.

WebCLTransformer: Provides a way for validation passes to create
transformations. This should be mostly a brainless class that knows
how transformations are done, but doesn't know when to do them.