-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Process XOR/ROL optimisations #21
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't really comment on the algorithmic correctness of the code, but I've got a few comments for you on the implementation itself.
|
||
while (new_len > 0 && src[new_len - 1] == pad_byte) | ||
new_len--; | ||
|
||
if (new_len == max_len) | ||
return src; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this optimization have a measurable impact?
As I understand it, the src
parameter is a copy of the value specified by the caller, and the value returned by this function is similarly a copy of the value specified in the return
statement, so in this case, there are already two copies being performed. (If we're lucky, the compiler is optimizing away at least one of them.) Given that, I can't tell just by reading the code whether skipping the call to substr()
actually saves anything. (That is, if the compiler has to copy the return value, then the overhead of using substr()
to generate the value copied is probably "in the noise", but adding the optimization makes the code more complicated and introduces an extra (arguably unnecessary) branch.
In any case, you should consider passing the argument as a reference to a const
value, which would remove one of the two copies. (Alternatively, if we had access to C++-11 semantics, you would want to consider using std::move()
to create the return value.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand the problem. In about every other language, both static like C# and dynamically typed like Python Ruby, strings are passed around in O(1). Do I understand correctly that in C++ those strings are passed by full copy in O(N)? If so, then I dont know about the optimisations...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ offers all three options:
- pass by
const
reference, where the callee receives a reference (what you're calling O(1)) to the caller's object (here a string), but the callee may not modify it; - pass by mutable reference, where the callee receives a reference to the caller's object, and is able to modify it;
- pass by value, where the callee receives a "copy" of the caller's object -- "copy" is in quotes, because the value is produced using the (possibly default-generated) "copy constructor", which, really, could do anything, including creating a reference to the original object.
However, you have to specify the one that you want...and, the most straightforward syntax produces arguably the safest and simplest effect: pass by value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its funny that with all that convoluted answers that you gave me, it makes me think that you actually dont know... which kind of doesnt come as a suprise, I think whoever said "nobodody really understands quantum mechanics" DEFINATELY meant C++!!
I am working on a benchmark, I will test it. Standby.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apologize for the confusion. Which passing mechanism is used is made clear by the function's signature in the declaration/definition.
Pass by "const reference":
std::string kaitai::kstream::bytes_strip_right(const std::string& src, char pad_byte);
The callee and caller share the storage for the object (here, src
), but the callee is prevented (at compile-time) from modifying it. However, it looks, inside the function, like the object was declared in the local scope. (I.e., src
is not a pointer, and it is accessed with "dot" notation, as src.length()
.)
Pass by "reference":
std::string kaitai::kstream::bytes_strip_right(std::string& src, char pad_byte);
This is the old-fashioned "pass by pointer" hidden under a bunch of syntactic sugar -- the callee and the caller share the storage for the object, the callee accesses it as though it were declared locally, and the callee has the ability to modify it. This works well for "output parameters" and "in-out parameters", but it is somewhat dangerous, as it's not always apparent to the reader that the caller's value can be modified.
Pass by "value":
std::string kaitai::kstream::bytes_strip_right(std::string src, char pad_byte);
This one does what it says: only the "value" of the object gets passed, and the callee receives a copy-constructed version of the object. By default, that is a full and separate copy of it, which means that the caller is protected from anything that the callee might do, and the callee has full freedom to (ab)use the object as it wishes (once again, the object is accessed as if it were a local declaration); however, there is a performance penalty associated with this if the object is large or complicated to copy.
So, the code as it stands is passing by value, which is arguably the safest, most conservative, least confusing, and probably most expensive approach.
However, that said, it is possible that the compiler can apply some optimizations transparently. The most obvious is for it to allocate the return value in the caller and use that storage to receive the callee's copy, which removes the need for one of the two copy operations. (There are other options, depending on what the caller does with the result and on which version of the C++ compiler is being used.)
@@ -417,6 +421,9 @@ std::string kaitai::kstream::bytes_terminate(std::string src, char term, bool in | |||
if (include && new_len < max_len) | |||
new_len++; | |||
|
|||
if (new_len == max_len) | |||
return src; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto my previous comment.
kaitai/kaitaistream.cpp
Outdated
|
||
return result; | ||
} | ||
|
||
std::string kaitai::kstream::process_xor_many(std::string data, std::string key) { | ||
size_t len = data.length(); | ||
size_t kl = key.length(); | ||
if (len == 1) | ||
return process_xor_one(data, key[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this optimization produce a measurable impact?
That is, is it really cheaper to execute a branch and a subroutine call than it is to execute the loop overhead for a single iteration? (I can't tell by reading the code.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All XOR and ROL optimisations are assuming that data is much larger than the key in xor, and group in rol. Therefore, the impl with cheaper main loop is considered better. Therefore, if data is passed in O(1) time, then yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the definition of this function stands above, the input data will be copied during the invocation.
Nevertheless, if the length of the data is 1, then it will result in only one iteration of the loop, and so the question is, which is cheaper -- executing a branch and invoking a (possibly in-lined) simple function, or setting up the loop and then exiting it after the first iteration. I can't tell.
kaitai/kaitaistream.cpp
Outdated
ki++; | ||
if (ki >= kl) | ||
ki = 0; | ||
result[i] = data[i] ^ key[i % kl]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than allocating a single-byte string and overwriting the byte(s), I suggest allocating an empty string and using an append()
operation to add byte(s) to the end of it. I don't know which is faster, but it would be worth an experiment.
Alternatively, since you know how long the input string is, you know how long the output string is, so you should be able to pre-allocate the output string. (I don't know that this will make any difference for small strings, as I suspect that std::string()
pre-allocates a minimum size already, but it might be very helpful for large translations.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I have to ask: does the introduction of the modulo operator produce a measurable impact?
It's not obvious from reading that the modulo operation will perform better than the add-and-reset operation from the original code. (The answer will likely depend on the hardware that the code is run on -- division can be a very expensive operation, whereas conditional assignment can be very cheap.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came to understanding, although I have to admit that I dont remember where exactly I learned this, so dont ask me for reference or benchmark data, that modulo operation is as cheap as any other "atomic" operation.
You see, in Python which is my primary lang, there is literally no difference in perfomance between >> or | or + or * or / or %. That is because a simple operator call involves a lot of red tape, checking if both operands/objects are integers or floats, etc. The actual "C operation" is one of many instructions.
In lower level langs like C# which is my previous primary language, the difference is a bit more pronounced but not as much as people make you believe. At university they teach you that bitwise ops are cheaper than integer addition then multiplication etc. In reality as benchmarks show, the difference is so small, that you just dont care.
So assuming that this theory holds for C++ as well, single op (modulo) is better than several ops (addition, if, branch misprediction).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh its also consistent with the other runtimes implementations. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The real answer is, there are levels of "cheap". The fact of the matter is that, while any form of division is probably several times as expensive as a shift, logical-OR, addition, or multiplication, if the operands are not already in the processor memory cache, then the difference doesn't really matter.
Analogously, it is less expensive to execute a modulo operation than it is to load an instruction which is not in the instruction memory cache. So, there is an argument in favor of fewer instructions (regardless of what they are), under the notion that it will fit better into cache. Similarly, a mis-predicted branch is a performance pothole (relative to any single operation); however, many architectures offer predicated instructions (including assignment) or speculative execution, which remove the need for branches or cover for many branching problems, respectively.
So, the cost of a modulo is probably not better than the cost of the other approach; but the cost of fetching an operand (or instruction) from L2 cache is probably 10 times more than either one; and, the cost of fetching the operand from memory is probably 100 times more than that. I would recommend whatever is most easily understood by the maintainers, and the trust the compiler to make it "go".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we need a benchmark to know for sure. Please standby.
kaitai/kaitaistream.cpp
Outdated
} | ||
|
||
return result; | ||
} | ||
|
||
std::string kaitai::kstream::process_rotate_left(std::string data, int amount) { | ||
// NOTE: perhaps it should be heap allocated? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless you intend to deallocate it (or, unless you're going to try to allocate and initialize it lazily), I see no point in allocating it from the heap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I dont intend to deallocate it. I am concerned with permanently using stack space. This adds 2KB to stack. What is a stack capacity nowadays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The answer on StackOverflow ranges from 16KB to 8MB. So perma adding 2KB of precomputed data to the stack is reasonable but just might be detrimental on some less common machines.
https://stackoverflow.com/a/1826072/2375119
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not allocated on the stack. There are three basic areas (more or less) from which memory can come: dynamic allocation, stack, and static storage. This one comes from the last. (It's similar to the area where the code resides, but it's writable.) So, it won't impose on the available stack space at all (which is one of the reasons for declaring it to be static
).
For the record, stack sizes (and even locations) are generally configurable if you know where the knobs are. :-)
kaitai/kaitaistream.cpp
Outdated
uint8_t bits = data[i]; | ||
result[i] = (bits << amount) | (bits >> (8 - amount)); | ||
if (groupSize == 1) { | ||
// NOTE: perhaps its `amount * 256` in the index? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this note is trying to say. (Replace the pronoun, perhaps?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please nevermind. Forget you saw it.
} | ||
|
||
return result; | ||
if (len % groupSize != 0) | ||
throw std::runtime_error("process_rotate_left: data length must be a multiple of group size"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than runtime_error
, I think invalid_argument
would probably be more appropriate.
kaitai/kaitaistream.cpp
Outdated
throw std::runtime_error("process_rotate_left: data length must be a multiple of group size"); | ||
|
||
if (amount % 8 == 0) { | ||
size_t[groupSize] indices; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look like legal C++ syntax, to me...does it compile? I would expect
size_t indices[groupSize];
Ditto lines 520 & 521.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah you meant the syntax was wrong, gosh... stupid me. Fixed!
} | ||
|
||
{ | ||
int amount1 = amount % 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be tempted to write this as
int amount1 = amount & 7;
} | ||
} | ||
|
||
std::string kaitai::kstream::process_rotate_left(std::string data, int amount, int groupSize = 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think my C++ compiler permits specifying default parameter values in the definition if they are already specified in the declaration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have absolutely no idea about this. C++ syntax is completely outside my area of expertise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand...but, does the code as is compile?
0fca875
to
2c8e817
Compare
All points made by @webbnh were resolved, either fixed or disregarded, aside of one: whether std::string is passed around as cheap O(1) or by full copy in O(N). As I am not proficient with C++, I wouldnt know. Oh two things actually: also whether optional arguments are declared properly. Last commit amended. |
2c8e817
to
109e2bc
Compare
I would be highly surprised that any precalculation like that would be faster for C++. Anything related to C++ generally needs rigorous testing, and especially these performance-related things. So, it's not even a question of running a benchmark, it needs a lot of benchmarks to be considered — i.e. different OSes / compilers combinations, x86 / x86_64 targets, etc. There's a huge potential of improvement, but The suggested changes are not harmless, unfortunately. You allocate global array of 2KB, which would cause trouble with dynamically linked library, and it could cause trouble with super low-RAM embedded systems where we can't allow to eat extra 2KB of RAM and unconditionally allocating 2KB of static memory is a bad idea as well. There are also numerous coding style flaws, but that's relatively minor. |
In C# benchmarks, array lookup is definately faster than bitwise
I think that is just balooney. It may be that someone working on embedded systems code would want to use Kaitai but that doesnt mean that we have to support it at the cost of making ridiculous restrictions. If they want to use Kaitai in embedded code, they can fork it and strip it of any non-used functionality. We just dont need to go to that much lenghts. Most processes on the desktop you are using have several MB of working space. 2KB is like next to nothing. If you demand, I can change the code to generate that translation table on usage but it will lower benchmarks where the data is short, <256 bytes. Its not worth it. |
Benchmarking VM-powered languages like C# is hard, microbenchmarking such intricate matters as a single inlinable method is super hard. It's certainly not as simple as running a single executable with a Stopwatch. At the very least, it should be using something like https://github.com/dotnet/BenchmarkDotNet, ensure that we've past the warmup phase, that the code got compiled, etc. |
Benchmark for C++, results identical to C# with respect to "which of the two is slower". It shows (again) that array lookup is faster than several bitwise shifts/ors but its not faster than a simple xor. It also shows the same that modulo is slower than check-and-wrap. #include <iostream>
#include <string>
#include <boost/timer/timer.hpp>
using namespace std;
using namespace boost::timer;
string pass_copy(string data)
{
data[0] = 1;
return data;
}
string & pass_reference(string & data)
{
data[0] = 1;
return data;
}
string loop_xor(string data)
{
size_t len = data.length();
string result(len, ' ');
uint8_t key = 1;
for (size_t i = 0; i < len; i++) {
result[i] = data[i] ^ key;
}
return result;
}
string loop_shifts(string data)
{
size_t len = data.length();
string result(len, ' ');
int amount = 1;
int antiamount = 7;
for (size_t i = 0; i < len; i++) {
uint8_t x = data[i];
result[i] = (x << amount)|(x >> antiamount);
}
return result;
}
string loop_lookup(string data)
{
size_t len = data.length();
string result(len, ' ');
uint8_t translate[256];
for (size_t i = 0; i < len; i++) {
result[i] = translate[data[i]];
}
return result;
}
string loop_modulo(string data)
{
size_t len = data.length();
string result(len, ' ');
size_t keylen = 10;
uint8_t key[10];
for (size_t i = 0; i < len; i++) {
result[i] = data[i] ^ key[i % keylen];
}
return result;
}
string loop_checkwrap(string data)
{
size_t len = data.length();
string result(len, ' ');
size_t keylen = 10;
uint8_t key[10];
size_t k = 0;
for (size_t i = 0; i < len; i++) {
result[i] = data[i] ^ key[k];
k++;
if (k == keylen)
k = 0;
}
return result;
}
int main()
{
string data(1024*1024*1024, ' ');
{
cout << "Pass data by copy (current impl)" << endl;
auto_cpu_timer timer;
string result = pass_copy(data);
}
{
cout << "Pass data by reference" << endl;
auto_cpu_timer timer;
string result = pass_reference(data);
}
{
cout << "Loop with ^ key" << endl;
auto_cpu_timer timer;
string result = loop_xor(data);
}
{
cout << "Loop with << >> |" << endl;
auto_cpu_timer timer;
string result = loop_shifts(data);
}
{
cout << "Loop with translate[...]" << endl;
auto_cpu_timer timer;
string result = loop_lookup(data);
}
{
cout << "Loop with data[i %% keylen]" << endl;
auto_cpu_timer timer;
string result = loop_modulo(data);
}
{
cout << "Loop with data[k]; if(k==keylen)k=0" << endl;
auto_cpu_timer timer;
string result = loop_checkwrap(data);
}
}
|
03a9c25
to
f378a58
Compare
Code was updated according to benchmarks, and is therefore READY TO MERGE, although it would be prudent for someone to check if the runtime compiles properly with some KSY-compiled cpp, just to be sure. |
f378a58
to
bc8baa5
Compare
I'm not sure the code is ready for merge: I started looking at it and I think I found an error. (I'll do a more detailed review and post comments.) Also, I'm not convinced that your program actually demonstrates that passing by copy and reference is equally cheap: since your code doesn't do anything with the results of the subroutines, I suspect that the compiler is within its rights to optimize away the whole call (that is, the definitions are in-scope, and the implementations have no side-effects, so the compiler could be able to tell that it doesn't matter whether the functions are actually called or not). |
bc8baa5
to
6fccdb8
Compare
Identical code as in C# and Python runtimes, well almost identical.
I am unable to run tests so someone may need to fix type cases. Please assist.