Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rule::unify_source method execution takes about 30% of ForwardChainer execution time #54

Open
vsbogd opened this issue Dec 27, 2019 · 6 comments
Labels
enhancement New feature or request

Comments

@vsbogd
Copy link
Collaborator

vsbogd commented Dec 27, 2019

I profiled ForwardChainer to find out how performance can be improved and found that it spends significant amount of time in the Rule::unify_source method. It depends on rules complexity. The more complex rules are used the bigger this amount of time is. The following benchmark demonstrates this: opencog/benchmark#27

The problem is that unify_source does almost nothing. First part converts rule's BindLink into another one with random variable naming (RewriteLink::alpha_convert). And second one constructs new BindLink with some variables grounded.

Proper fix probably should be raised to opencog/atomspace, but I am raising the issue here as main concern is URE performance.

Flame graph of the code benchmarked (better opening in web-browser):
perf.svg.gz

Following steps require getting perf and FlameGraph tool (https://github.com/brendangregg/FlameGraph). Intel CPU with support of Intel Processor Trace is required as well.

Steps to reproduce:
Run benchmark in the background:

./micro/benchmark --benchmark_filter=ForwardChainer_Basic --benchmark_min_time=600 &

Collect profile using Intel Processor Trace. You need only 1 second of profile it will be too large otherwise:

perf record -e intel_pt//u -p `pidof benchmark` sleep 1

Build flame graph of the do_step call (it can take a lot of time):

perf script --itrace=i100nsg | ./FlameGraph/stackcollapse-perf.pl | fgrep 'benchmark;BM_ForwardChainer_Basic;opencog::ForwardChainer::do_chain;opencog::ForwardChainer::do_step_rec;opencog::ForwardChainer::do_step;' | ./FlameGraph/flamegraph.pl > perf.svg
@vsbogd vsbogd added the enhancement New feature or request label Dec 27, 2019
@linas
Copy link
Member

linas commented Dec 28, 2019

Not to hijack the conversation, but the flame-graph also points at opencog::SourceSet which looses huge amounts of time in boost which calls operator== which calls pthread stuff. Over the last decade, I've repeatedly discovered that boost is a performance nightmare -- rip it out of your code, everything will run 2x or 3x faster. I mean, it seems odd to me that pthread-anything should be more than 1% of the total performance; that indicates thundering herds in operator==

Also: what's __GI__exp and __ieee7? These seem to be eating lots of time...

Also: as a general rule, you want to run any/all atomspace things with HUGETLB_MORECORE=yes LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so.0 as this will in general give you a 50% performance boost.

@linas
Copy link
Member

linas commented Dec 28, 2019

If you haven't configed hugepages, see below; its cut-n-paste from atomspace-postgres; you have to adjust these instructions for your app:

Tuning HugePages

Using 2MB-sized HugePages can also offer a large performance boost,
both for Postgres, and for the atomspace, especially when working with a
large atomspace. The procedure to set these up is a bit complicated.
First, add a hugepages user-group, and add postgres to it:

   sudo groupadd hugepages
   sudo gpasswd -a postgres hugepages

Then you need to find out what the group id (gid) was:

   id postgres

Suppose that this shows group id 1234 for hugepages. This needs to
be added to /etc/sysctl.conf as well. So edit, and add:

   vm.nr_hugepages = 16384       # 32GB of hugepages, 25% of RAM.
   vm.hugetlb_shm_group=1234

Don't forget to sudo sysctl -p /etc/sysctl.conf again.
Finally, the ability to use those pages. Add to /etc/security/limits.conf:

    @hugepages      soft    memlock         unlimited
    @hugepages      hard    memlock         unlimited

@linas
Copy link
Member

linas commented Dec 28, 2019

actually, you may as well set hugepages to 80% of RAM, or whatever the typical working-set size for the atomspace is.

@vsbogd
Copy link
Collaborator Author

vsbogd commented Dec 28, 2019

the flame-graph also points at opencog::SourceSet which looses huge amounts of time in boost which calls operator== which calls pthread stuff

Linas thank you for spotting this. I believe pthread stuff is a mutex lock at:

bool Source::operator==(const Source& other) const
{
std::lock_guard<std::mutex> lock(_whole_mutex);
return body == other.body && vardecl == other.vardecl;
}

so root cause is not boost but intensive locking.

Also: what's __GI__exp and __ieee7? These seem to be eating lots of time...

I believe it is exponent calculation here:

double SourceSet::get_weight(const Source& src) const
{
// TODO: we could take into account some sort of fitness. For
// instance if the fitness is maximize confidence, then we could
// factor in the confidence of the source, as the higher the
// confidence of the source, the higher the confidence of the
// conclusion.
return src.exhausted ? 0.0 : complexity_factor(src);
}
double SourceSet::complexity_factor(const Source& src) const
{
return exp(-_config.get_complexity_penalty() * src.complexity);
}

@linas
Copy link
Member

linas commented Dec 28, 2019

Hmm. Given that exp is fast (should be less than 50-100 clock cycles) the fact that its showing up a lot means it's being called a lot. Probably too much. Similar remarks for the mutex -- it should be no more than dozens of clocks, unless there's a huge amount of lock contention...

@ngeiswei
Copy link
Member

ngeiswei commented Jan 7, 2020

Thanks. Great findings!

Regarding the locks, these are just crude half-way steps towards parallel URE. Once we move to C++17 I'll use more refined (unique/shared) mutexes. Maybe some thread pool, etc, not sure yet.

Regarding exp, it is used to calculate the probability distribution over sources, obviously this should be cached.

ngeiswei added a commit to ngeiswei/ure that referenced this issue Feb 14, 2020
Calculate once at insertion rather than at each iteration.

Address the exp part of opencog#54
ngeiswei added a commit to ngeiswei/ure that referenced this issue Feb 21, 2020
Calculate once at insertion rather than at each iteration.

Address the exp part of opencog#54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants