intel mkl optimized tensorflow performance degradation #23238

Closed
patelprateek opened this issue Oct 25, 2018 · 29 comments
Labels: comp:mkl, stale, stat:awaiting response

Comments

@patelprateek

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Deep Learning VM
    Version: m10
    Based on: Debian GNU/Linux 9.5 (stretch) (GNU/Linux 4.9.0-8-amd64 x86_64)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): deep-learning image
  • TensorFlow version (use command below): 1.11
  • Python version: 2.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory: N/A

You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the current behavior
Running a deep model and some wide linear models; inference performance is very poor, 2-4x slower than running the same inference without MKL.
Describe the expected behavior
Performance should actually improve with the Intel MKL-optimized build.

Code to reproduce the issue
Code for a deep and wide linear model, or the logistic regression example code from the TensorFlow examples.

Other info / logs
When running the Google Deep Learning image version M9 on a GPU machine (image: tf-latest-cu92, version M9): note that inference runs only on CPU, because I turn off visibility for CUDA devices, so the TensorFlow code executes on CPU only. The image family claims the packages are Intel-optimized, but when I run the benchmarks with verbosity on, I do not observe any MKL-related output.
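For reference, this is roughly how I force CPU-only inference and enable MKL verbosity (a simplified sketch; the actual benchmark script differs):

import os

# Hide all CUDA devices so TensorFlow places all ops on CPU, and ask MKL to
# log every primitive it executes. Both variables must be set before
# TensorFlow (and the MKL runtime it loads) is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["MKL_VERBOSE"] = "1"

import tensorflow as tf  # imported only after the environment is configured
# ... build the model and run the inference benchmark here ...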

I then start another Deep Learning image (tf-latest-cpu, version M10). Running the exact same code on this machine with the environment variable export MKL_VERBOSE=1, I can observe a lot of OpenMP thread settings, KMP_xxx settings, and MKL instructions logged with timing information. I did not observe any of that in the M9 GPU image, even though in both places I see the following logs when I execute the command:
M9 gpu image
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x55fd25117d40,1,0x55fd25117d40,1) 1.61ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
1.11.0

M10 cpu image :
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime

User settings:

KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32

Effective settings:

KMP_ABORT_DELAY=0
KMP_ADAPTIVE_LOCK_PROPS='1,1024'
KMP_ALIGN_ALLOC=64
KMP_ALL_THREADPRIVATE=128
KMP_ATOMIC_MODE=2
KMP_BLOCKTIME=0
KMP_CPUINFO_FILE: value is not defined
KMP_DETERMINISTIC_REDUCTION=false
KMP_DEVICE_THREAD_LIMIT=2147483647
KMP_DISP_HAND_THREAD=false
KMP_DISP_NUM_BUFFERS=7
KMP_DUPLICATE_LIB_OK=false
KMP_FORCE_REDUCTION: value is not defined
KMP_FOREIGN_THREADS_THREADPRIVATE=true
KMP_FORKJOIN_BARRIER='2,2'
KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
KMP_FORKJOIN_FRAMES=true
KMP_FORKJOIN_FRAMES_MODE=3
KMP_GTID_MODE=3
KMP_HANDLE_SIGNALS=false
KMP_HOT_TEAMS_MAX_LEVEL=1
KMP_HOT_TEAMS_MODE=0
KMP_INIT_AT_FORK=true
KMP_INIT_WAIT=2048
KMP_ITT_PREPARE_DELAY=0
KMP_LIBRARY=throughput
KMP_LOCK_KIND=queuing
KMP_MALLOC_POOL_INCR=1M
KMP_NEXT_WAIT=1024
KMP_NUM_LOCKS_IN_BLOCK=1
KMP_PLAIN_BARRIER='2,2'
KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
KMP_REDUCTION_BARRIER='1,1'
KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
KMP_SCHEDULE='static,balanced;guided,iterative'
KMP_SETTINGS=true
KMP_SPIN_BACKOFF_PARAMS='4096,100'
KMP_STACKOFFSET=64
KMP_STACKPAD=0
KMP_STACKSIZE=4M
KMP_STORAGE_MAP=false
KMP_TASKING=2
KMP_TASKLOOP_MIN_TASKS=0
KMP_TASK_STEALING_CONSTRAINT=1
KMP_TEAMS_THREAD_LIMIT=32
KMP_TOPOLOGY_METHOD=all
KMP_USER_LEVEL_MWAIT=false
KMP_VERSION=false
KMP_WARNINGS=true
OMP_AFFINITY_FORMAT='OMP: pid %P tid %T thread %n bound to OS proc set {%a}'
OMP_ALLOCATOR=omp_default_mem_alloc
OMP_CANCELLATION=false
OMP_DEBUG=disabled
OMP_DEFAULT_DEVICE=0
OMP_DISPLAY_AFFINITY=false
OMP_DISPLAY_ENV=false
OMP_DYNAMIC=false
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_MAX_TASK_PRIORITY=0
OMP_NESTED=false
OMP_NUM_THREADS='32'
OMP_PLACES: value is not defined
OMP_PROC_BIND='intel'
OMP_SCHEDULE='static'
OMP_STACKSIZE=4M
OMP_TARGET_OFFLOAD=DEFAULT
OMP_THREAD_LIMIT=2147483647
OMP_TOOL=enabled
OMP_TOOL_LIBRARIES: value is not defined
OMP_WAIT_POLICY=PASSIVE
KMP_AFFINITY='verbose,warnings,respect,granularity=fine,compact,1,0'

OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-31
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 16 cores/pkg x 2 threads/core (16 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 0 core 3 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 0 core 4 thread 1

OMP: Info #250: KMP_AFFINITY: pid 8331 tid 8331 thread 0 bound to OS proc set 0
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x5622b7736500,1,0x5622b7736500,1) 2.54ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
1.11.0

So I assume Intel MKL is being used in the M10 image, whereas MKL is not being used in the M9 image (note: I have turned off visibility for CUDA devices, so only CPU inference is being compared). I observe a 2-4x performance degradation with Intel MKL.
The MKL-suggested flags are set as recommended:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32

Any ideas on how to debug the root cause and get the maximum performance for my models?

@ymodak ymodak added the comp:mkl MKL related issues label Oct 25, 2018
@ymodak ymodak assigned ymodak and TensorFlow-MKL and unassigned ymodak Oct 25, 2018
@ymodak ymodak added stat:awaiting response Status - Awaiting response from author stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed stat:awaiting response Status - Awaiting response from author labels Oct 26, 2018
@wei-v-wang
Contributor

Since it is a deep and wide linear model, can you do "export OMP_NUM_THREADS=1" as a first step?
And, if you haven't already, can you please try setting inter_op_parallelism_threads and intra_op_parallelism_threads, similar to https://github.com/NervanaSystems/tensorflow-models/commit/55d55abc71483723743c0273b9c1fd8e0c7d8391#diff-00c5d001cb14a21f6d7dbf16d4e55032R90?
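A minimal TF 1.x sketch of those session settings (the thread counts here are only starting points to tune, not recommendations):

import tensorflow as tf

# intra_op controls parallelism inside a single op (e.g. a large matmul);
# inter_op controls how many independent ops may run concurrently.
config = tf.ConfigProto(
    intra_op_parallelism_threads=16,   # e.g. number of physical cores
    inter_op_parallelism_threads=2)    # e.g. number of sockets
sess = tf.Session(config=config)
# ... run inference with this session ...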

@patelprateek
Author

@wei-v-wang : the link you mentioned doesn't work for me. Can you please share the link again, or let me know what config for inter- and intra-op parallelism I should try? I will post the results back here.

Also, it is not just the wide and deep linear models; I am observing similarly 2-3x worse inference latency for a deep cross network model as well. Could you please explain the reasoning behind OMP_NUM_THREADS=1? That would help us better understand the internal workings.

@wei-v-wang
Contributor

Sorry, here is the updated link: https://github.com/tensorflow/models/blob/master/official/wide_deep/wide_deep_run_loop.py#L87-L88

If some application is not bound by compute, changing OMP_NUM_THREADS might help.

I think for wide/deep models, inter_op/intra_op has been providing some help. Please definitely enable it in your model and give it a try.

@patelprateek
Author

@wei-v-wang : the link you provided changes the inter- and intra-op thread settings, but when I run the code it still prints out:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32

so I am not sure it is taking effect. Are those two different settings?

@wei-v-wang
Contributor

In order to change OMP_NUM_THREADS, please use "export OMP_NUM_THREADS=". The link I provided only changes the inter- and intra-op settings.
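To illustrate the difference (a rough sketch; the value is just an example to experiment with):

import os

# OMP_NUM_THREADS limits the OpenMP threads used inside each MKL primitive;
# it is separate from TensorFlow's inter/intra op thread pools and must be
# set before TensorFlow is imported.
os.environ["OMP_NUM_THREADS"] = "1"

import tensorflow as tf  # import only after the variable is set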

@patelprateek
Author

OK, so I tried a bunch of parameters. Machine type: 32 cores, 2 logical threads per core.
I tried: number of intra-op threads = OMP threads: [4, 8, 16, 32, 64]
inter-op threads = number of physical cores and number of sockets: [2, 8, 16, 32]

The best performance I could get for a batch size of 1k: 48 microseconds.
The best I get without MKL, without much tuning (number of inter- and intra-op threads being the same: 16/32/64): 23 microseconds.

Any other settings I need to try?
Can we tell whether the MKL library and ISA are even being taken advantage of, by looking at some ops that should definitely perform better?
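For example, would something like the following be a reasonable way to check (a rough sketch; sess and fetches stand for my benchmark session and output tensors)?

import tensorflow as tf

# Trace one inference step and look for MKL-rewritten ops; in MKL builds
# the graph rewrite pass gives them op types prefixed with "_Mkl".
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(fetches, options=run_options, run_metadata=run_metadata)

for dev_stats in run_metadata.step_stats.dev_stats:
    for node_stats in dev_stats.node_stats:
        if "Mkl" in node_stats.timeline_label:
            print(node_stats.node_name, node_stats.timeline_label)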

@patelprateek
Author

I definitely found that setting the number of OMP threads to a lower count helped, and the same for inter-op parallelism.
But performance for the current model is still 2-3x worse in general.

@wei-v-wang
Contributor

Since it is inference, I have one last suggestion:
Could you please prefix your runs with "numactl -c 1 -m 1 python ..."? The rest of the configuration can remain the same. This is to use just one socket, to rule out memory-access overhead across two sockets.

If you still observe ~2X slowness with TF w/MKLDNN, can you please share your model script with us?

@wei-v-wang
Contributor

Sorry, I should have given all the BKMs (best known methods) in one batch. Here is another important one that I missed:

export OMP_NUM_THREADS=x
export KMP_BLOCKTIME=1
numactl -c 1 -m 1 python ... <inter_op> <intra_op>

@patelprateek
Author

numactl -c 1 -m 1 python ...
libnuma: Warning: node argument 1 is out of range
<1> is invalid

@patelprateek
Author

Here is the machine config:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping: 0
CPU MHz: 2200.000
BogoMIPS: 4400.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 56320K
NUMA node0 CPU(s): 0-31

@patelprateek
Author

I tried numactl -c 0 -m 0 python; still, the best I get is around 48 microseconds, with OMP threads and inter-op threads = 6 and blocktime = 1.

@wei-v-wang
Contributor

@patelprateek I see, it is a single-socket system, so numactl does not help here. Is it possible for us to get your customized model?

@patelprateek
Author

@wei-v-wang : I will try to get you that if it really helps debugging, but I would need privacy and legal approval.
Are there any steps you want me to take to help debug this? Basically, I want to understand what ops are being used in my model (both with and without MKL), to see if that helps us understand why the MKL optimization degrades performance.

As for the model: I have wide and deep linear models using the tf.estimator and Dataset APIs.
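For context, this is roughly how I wire the thread settings into the estimator (a simplified sketch; the model_dir and feature columns are placeholders):

import tensorflow as tf

# The session-level thread pools are passed to the estimator via RunConfig.
session_config = tf.ConfigProto(
    intra_op_parallelism_threads=16,
    inter_op_parallelism_threads=2)
run_config = tf.estimator.RunConfig(
    model_dir="/tmp/model_dir",                                # placeholder path
    session_config=session_config)
estimator = tf.estimator.LinearClassifier(
    feature_columns=[tf.feature_column.numeric_column("x")],   # placeholder column
    config=run_config)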

@wei-v-wang
Contributor

OK, I see. To simplify things, as you said, Wide and Deep (wide only) is a good proxy for your model. I will double-check the performance comparison using just this wide and deep linear model; hopefully the learnings can be applied to your custom model.
BTW, are you using a private or a public dataset? The performance may vary depending on the dataset size you are using.

@patelprateek
Author

The dataset is private. I can get more details about the types of features and the number of crosses if that helps, but this is all for inference, not training.

@patelprateek
Author

@wei-v-wang : I am trying to rewrite the model graph to anonymize the features. This works quite well except for a few sparse features for which I also have an embedding. Do you happen to know a tool/library that can help do this and take care of the edge cases I am missing?
My graph-rewrite code is pretty trivial: it iterates over all nodes, searches for feature names, and replaces them with some ids. For some reason I can't get the model scores to match when I apply this translation to sparse features that use an embedding layer. Any caveats you know of?
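My rewrite is essentially the following (a simplified sketch; feature_map is my private mapping from real feature names to anonymous ids):

from tensorflow.core.framework import graph_pb2

def anonymize(graph_def, feature_map):
    # Return a copy of graph_def with feature names replaced by anonymous ids.
    out = graph_pb2.GraphDef()
    out.CopyFrom(graph_def)
    for node in out.node:
        for old, new in feature_map.items():
            node.name = node.name.replace(old, new)
        # Inputs refer to other nodes by name, so they have to be rewritten
        # with exactly the same mapping or the graph becomes inconsistent.
        for i, inp in enumerate(node.input):
            for old, new in feature_map.items():
                inp = inp.replace(old, new)
            node.input[i] = inp
    return out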

@patelprateek
Author

@wei-v-wang : any updates on how this issue can be resolved? Did you find a perf regression in the new DL image?

@wei-v-wang
Contributor

@patelprateek Sorry for the delay, can you please try PR #24272?

Before it is merged, you can use: https://github.com/Intel-tensorflow/tensorflow/tree/sriniva2/small_alloc_fix

@TensorFlow-MKL

still monitoring. Waiting for PR: #24777

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 18, 2019
@wangcj05

wangcj05 commented Aug 9, 2019

I think I have a similar issue.

@jkihgit

jkihgit commented Oct 9, 2019

Same here: a 4x slowdown using MKL.
So far I have tried the Anaconda version and the Google container (both latest releases).
HW is a Xeon 6132, 2 sockets, HT on.

@wei-v-wang
Contributor

@dare0021 Apologies for the issue; we are starting to address this. I will provide more frequent updates here.

@aashay96

Hey, I am facing the same issue; inference is 3-4x slower. Is there any update or solution?

@NeoZhangJianyu

NeoZhangJianyu commented Dec 3, 2020

@patelprateek
@wangcj05
@dare0021
@aashay96

This topic has been open for a very long time.
Intel Optimization for TensorFlow has improved a lot since this issue was created.

To unlock the performance potential of TensorFlow with MKL, users need to set a few optimization parameters.

Example for an Intel Core CPU (4 cores/socket, 1 socket):

export TF_ENABLE_MKL_NATIVE_FORMAT=1  
export TF_NUM_INTEROP_THREADS=1
export TF_NUM_INTRAOP_THREADS=4
export OMP_NUM_THREADS=4
export KMP_BLOCKTIME=1
export KMP_AFFINITY=granularity=fine,compact,1,0

TF_ENABLE_MKL_NATIVE_FORMAT is the key optimization and is friendly to Keras model inference.
TF_NUM_INTEROP_THREADS can be set to 1, the number of sockets, or another number no larger than the number of cores.
TF_NUM_INTRAOP_THREADS and OMP_NUM_THREADS are set to the number of cores per socket, or another value no larger than it.
KMP_BLOCKTIME can be set to 0, 1, or another number.

PS: the recommended values may not be the right values for your model. The best way is to test the performance to find the right values.
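For example, a rough sketch of setting the same thread-pool sizes from Python instead of environment variables (this must run before TensorFlow executes any op):

import tensorflow as tf

# Programmatic equivalents of TF_NUM_INTEROP_THREADS / TF_NUM_INTRAOP_THREADS
# for the 4-core, 1-socket example above.
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(4)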

Now, users can install TensorFlow with MKL via pip or conda:

python -m pip install intel-tensorflow
conda install tensorflow-mkl

Or build it from source code with '--config=mkl'.

Please refer to the Intel® Optimization for TensorFlow* Installation Guide

@NeoZhangJianyu

@patelprateek
What do you think of the suggestion?
If there is still an issue, could you share it?

Thank you!

@sushreebarsa sushreebarsa self-assigned this Apr 20, 2022
@sushreebarsa
Contributor

@patelprateek Could you please refer to this comment and try with the latest TF versions (2.4 or later), as older TF versions (1.x) are not actively supported? Please let us know if it helps. Thanks!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Apr 20, 2022
@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Apr 27, 2022
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.
