Intel MKL optimized TensorFlow performance degradation #23238
Comments
It is "deep and wide linear model", can you do "export OMP_NUM_THREADS=1" as a first step? |
@wei-v-wang : the link you mentioned doesn't work for me. Can you please share the link again, or maybe let me know what config for inter and intra op parallelism I should try? I will post back the results here. Also, not just for wide and linear models: I am observing similarly 2-3x worse latency (inference) for a deep cross network model as well. Could you please also explain the reasoning behind OMP_NUM_THREADS=1? This will help us better understand the internal workings.
Sorry, here is the updated link: https://github.com/tensorflow/models/blob/master/official/wide_deep/wide_deep_run_loop.py#L87-L88 If an application is not bound by compute, changing OMP_NUM_THREADS might help. I think for wide/deep models, tuning inter_op/intra_op has been providing some help. Please definitely enable it in your model and give it a try.
@wei-v-wang : the link you provided changes the inter and intra op thread settings, but when I run the code it still prints out: , so I am not sure if it is taking effect. Are those two different settings?
In order to change OMP_NUM_THREADS, please use "export OMP_NUM_THREADS=<value>". The link I provided only changes the inter and intra op settings.
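(A minimal sketch for readers following the thread, assuming the TF 1.x APIs used here; values are placeholders, not recommendations. OMP_NUM_THREADS is read by the MKL/OpenMP runtime and must be set before TensorFlow initializes, while inter_op/intra_op are session configuration fields, so they are indeed two different settings.)

import os

# Must be set before TensorFlow loads MKL / creates its thread pools.
os.environ["OMP_NUM_THREADS"] = "1"   # placeholder, tune per machine

import tensorflow as tf

# inter_op/intra_op are separate knobs passed through the session config.
session_config = tf.ConfigProto(
    inter_op_parallelism_threads=2,   # placeholder
    intra_op_parallelism_threads=4)   # placeholder

# For tf.estimator models, the same config is passed via RunConfig.
run_config = tf.estimator.RunConfig(session_config=session_config)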
OK, so I tried a bunch of parameters. Machine type: 32 cores, 2 logical threads per core. The best performance I could get for a batch size of 1k is 48 microseconds. Any other settings I should try?
I definitely found that setting the number of OMP threads to a lower count helped, and the same for inter op parallelism.
Since it is inference, I have one last suggestion: if you still observe ~2X slowness with TF w/MKLDNN, can you please share your model script with us?
Sorry, I should have given out all the BKMs (best known methods) in a batch, but here is another important one that I missed:
export OMP_NUM_THREADS=x
numactl -c 1 -m 1 python ...
Here is the machine config:
I tried numactl -c 0 -m 0 python; still, the best I get is around 48 microseconds with OMP threads and inter op threads = 6 and KMP_BLOCKTIME = 1.
@patelprateek I see, it is a single-socket system, so numactl does not help here. Is it possible for us to get your customized model?
@wei-v-wang : I will try to get you that if it really helps debugging, but I would need to get some privacy and legal approval. As for the model: I have wide and deep linear models using the tf.estimator and Dataset APIs.
OK, I see. To simplify things, as you said, Wide and Deep (wide only) is a good proxy for your model. I will double check the performance comparison just using this wide and deep linear model. Hopefully the learnings can be applied to your custom model.
The data set is private. I can get more details about the types of features and the number of crosses if that helps, but this is all for inference, not training.
@wei-v-wang : I am trying to rewrite the model graph to anonymize the features. This works quite well except for a few sparse features for which I also have an embedding. Do you happen to know a tool/library that can help do this and take care of the edge cases I am missing?
@wei-v-wang : any updates on how this issue can be resolved? Did you find a perf regression in the new DL image?
@patelprateek Sorry for the delay; can you please try this PR #24272? Before it is merged, you can use: https://github.com/Intel-tensorflow/tensorflow/tree/sriniva2/small_alloc_fix
Still monitoring. Waiting for PR #24777.
I think I have a similar issue.
Same here |
@dare0021 Apologies for the issue; we are starting to address this. I will provide more frequent updates here.
Hey, facing the same issue: inference is 3-4x slower. Is there any update or any solution?
@patelprateek This topic has been open for a very long time. To unlock the performance potential of TensorFlow with MKL, users need to set a few knobs for optimization. Example for an Intel Core CPU (4 cores/socket, 1 socket):
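(A minimal sketch of that setup; the specific values below are illustrative assumptions for such a 4-core, single-socket machine, not the commenter's exact recommendations, shown here with the TF 2.x threading API.)

import os

# Environment variables must be set before TensorFlow is imported.
os.environ["TF_ENABLE_MKL_NATIVE_FORMAT"] = "1"              # key optimization noted below
os.environ["OMP_NUM_THREADS"] = "4"                          # number of physical cores (assumed)
os.environ["KMP_BLOCKTIME"] = "1"                            # illustrative
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"  # illustrative

import tensorflow as tf

# Thread-pool knobs in TF 2.x; again illustrative, not tuned values.
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(1)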
TF_ENABLE_MKL_NATIVE_FORMAT is a key optimization and is friendly to Keras model inference. PS: the recommended values may not be the right values for your model; the best way is to test performance to find the right values. Nowadays, you can install TensorFlow with MKL via pip or conda:
Or build it from source code with '--config=mkl'. Please refer to the Intel® Optimization for TensorFlow* Installation Guide.
@patelprateek Thank you!
@patelprateek Could you please refer to this comment and try with the latest TF versions (2.4 or later), as older TF versions (1.x) are not actively supported? Please let us know if it helps. Thanks!
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you. |
Closing as stale. Please reopen if you'd like to work on this further. |
System information
Version: M10
Based on: Debian GNU/Linux 9.5 (stretch) (GNU/Linux 4.9.0-8-amd64 x86_64)
You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
Describe the current behavior
Running a deep model and some wide linear models. Inference performance is very poor: 2-4x slower relative to running inference without MKL.
Describe the expected behavior
Performance should actually improve with the Intel MKL optimizations.
Code to reproduce the issue
Code for a deep and wide linear model, or the logistic regression example code from the TensorFlow examples; a minimal proxy is sketched below.
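(A minimal, self-contained proxy in the tf.estimator style discussed above; this is not the reporter's actual model, and the feature names, sizes, and random data below are made up for illustration.)

import numpy as np
import tensorflow as tf

# Hypothetical feature columns standing in for the private model's features.
item_id = tf.feature_column.categorical_column_with_hash_bucket(
    "item_id", hash_bucket_size=1000)
deep_columns = [
    tf.feature_column.numeric_column("price"),
    tf.feature_column.embedding_column(item_id, dimension=8),
]

estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=[item_id],
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[64, 32])

def input_fn():
    # One random batch of 1k examples, mirroring the batch size mentioned above.
    features = {
        "item_id": np.array(["item_%d" % i for i in np.random.randint(0, 1000, 1000)]),
        "price": np.random.rand(1000).astype(np.float32),
    }
    labels = np.random.randint(0, 2, 1000)
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(1000)

estimator.train(input_fn=input_fn, steps=1)               # one step, just to create a checkpoint
predictions = list(estimator.predict(input_fn=input_fn))  # labels are ignored during predict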
Other info / logs
When running on the Google Deep Learning image version M9 on a GPU machine (image: tf-latest-cu92, version M9). Note: inference runs only on the CPU because I turn off visibility of the CUDA devices, so the TensorFlow code runs only on the CPU. The image family says these are Intel-optimized packages, but when I run the benchmarks with verbosity on, I do not observe any MKL-related output.
I started another Deep Learning image (tf-latest-cpu, version M10). Running the exact same code on this machine with the environment variable export MKL_VERBOSE=1, I can observe a lot of OpenMP thread settings, KMP_xxx settings, and MKL calls logged with timing information. I didn't observe any such thing in the M9 GPU image, even though in both places, when I execute the command, I observe the following logs:
M9 GPU image:
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x55fd25117d40,1,0x55fd25117d40,1) 1.61ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
1.11.0
M10 CPU image:
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
User settings:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32
Effective settings:
KMP_ABORT_DELAY=0
KMP_ADAPTIVE_LOCK_PROPS='1,1024'
KMP_ALIGN_ALLOC=64
KMP_ALL_THREADPRIVATE=128
KMP_ATOMIC_MODE=2
KMP_BLOCKTIME=0
KMP_CPUINFO_FILE: value is not defined
KMP_DETERMINISTIC_REDUCTION=false
KMP_DEVICE_THREAD_LIMIT=2147483647
KMP_DISP_HAND_THREAD=false
KMP_DISP_NUM_BUFFERS=7
KMP_DUPLICATE_LIB_OK=false
KMP_FORCE_REDUCTION: value is not defined
KMP_FOREIGN_THREADS_THREADPRIVATE=true
KMP_FORKJOIN_BARRIER='2,2'
KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
KMP_FORKJOIN_FRAMES=true
KMP_FORKJOIN_FRAMES_MODE=3
KMP_GTID_MODE=3
KMP_HANDLE_SIGNALS=false
KMP_HOT_TEAMS_MAX_LEVEL=1
KMP_HOT_TEAMS_MODE=0
KMP_INIT_AT_FORK=true
KMP_INIT_WAIT=2048
KMP_ITT_PREPARE_DELAY=0
KMP_LIBRARY=throughput
KMP_LOCK_KIND=queuing
KMP_MALLOC_POOL_INCR=1M
KMP_NEXT_WAIT=1024
KMP_NUM_LOCKS_IN_BLOCK=1
KMP_PLAIN_BARRIER='2,2'
KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
KMP_REDUCTION_BARRIER='1,1'
KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
KMP_SCHEDULE='static,balanced;guided,iterative'
KMP_SETTINGS=true
KMP_SPIN_BACKOFF_PARAMS='4096,100'
KMP_STACKOFFSET=64
KMP_STACKPAD=0
KMP_STACKSIZE=4M
KMP_STORAGE_MAP=false
KMP_TASKING=2
KMP_TASKLOOP_MIN_TASKS=0
KMP_TASK_STEALING_CONSTRAINT=1
KMP_TEAMS_THREAD_LIMIT=32
KMP_TOPOLOGY_METHOD=all
KMP_USER_LEVEL_MWAIT=false
KMP_VERSION=false
KMP_WARNINGS=true
OMP_AFFINITY_FORMAT='OMP: pid %P tid %T thread %n bound to OS proc set {%a}'
OMP_ALLOCATOR=omp_default_mem_alloc
OMP_CANCELLATION=false
OMP_DEBUG=disabled
OMP_DEFAULT_DEVICE=0
OMP_DISPLAY_AFFINITY=false
OMP_DISPLAY_ENV=false
OMP_DYNAMIC=false
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_MAX_TASK_PRIORITY=0
OMP_NESTED=false
OMP_NUM_THREADS='32'
OMP_PLACES: value is not defined
OMP_PROC_BIND='intel'
OMP_SCHEDULE='static'
OMP_STACKSIZE=4M
OMP_TARGET_OFFLOAD=DEFAULT
OMP_THREAD_LIMIT=2147483647
OMP_TOOL=enabled
OMP_TOOL_LIBRARIES: value is not defined
OMP_WAIT_POLICY=PASSIVE
KMP_AFFINITY='verbose,warnings,respect,granularity=fine,compact,1,0'
OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-31
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 16 cores/pkg x 2 threads/core (16 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 0 core 3 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 0 core 4 thread 1
OMP: Info #250: KMP_AFFINITY: pid 8331 tid 8331 thread 0 bound to OS proc set 0
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x5622b7736500,1,0x5622b7736500,1) 2.54ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
1.11.0
So I assume Intel MKL is being used in the M10 image, whereas MKL is not being used in the M9 image (note: I have turned off visibility of the CUDA devices, so only CPU inference is being compared). I observe a 2-4x performance degradation with Intel MKL.
The MKL-suggested flags are set as recommended:
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32
Any ideas on how to debug the root cause and get maximum performance for my models?
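(For reference, a sketch of the kind of micro-benchmark that could produce latency numbers like the ones above; this is not the reporter's actual benchmarking code. Here `estimator` and `input_fn` refer to a model and input pipeline like the proxy sketched in the "Code to reproduce the issue" section.)

import time

def run_once():
    # Drain one full pass of predict(); each call rebuilds the graph and
    # reloads the checkpoint, so this measures end-to-end inference cost.
    return sum(1 for _ in estimator.predict(input_fn=input_fn))

run_once()  # warm-up pass
start = time.time()
num_examples = run_once()
elapsed = time.time() - start
print("examples: %d  wall time: %.3f s  per-example: %.1f us"
      % (num_examples, elapsed, 1e6 * elapsed / num_examples))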