PROJECT NOT UNDER ACTIVE MANAGEMENT
This project will no longer be maintained by Intel.
Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.
Intel no longer accepts patches to this project.
If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.
Contact: webadmin@linux.intel.com
This reference kit uses the Intel® Optimized version of XGBoost* to achieve fast training and inference times, then converts the gradient boosting model to a daal4py version, included in Intel® Extension for Scikit-Learn*, to accelerate inference further. With this use case you will learn to use Intel® tools to build a Digital Twin model that reflects the response (leakage current) of Metal-Oxide-Semiconductor Field-Effect Transistors (MOSFETs) to the gate voltage received. The model supports design exploration and helps save cost compared with physical experimentation. Visit Developer Catalog for more workflow examples.
- Building a MOSFET Digital Twin for Design Exploration: Modeling Sub-threshold Voltage Leakage Current using XGBoostRegressor
A Digital Twin ([1],[2]) is a virtual model designed to accurately reflect the behaviour of a physical object during its lifecycle; it can be updated with real-time data, machine learning and simulation. To create a Digital Twin, the object in question is outfitted with sensors located in vital areas of functionality. These areas are chosen according to the impact their information has on the desired output of the studied object. Examples of data produced by the sensors are temperature, humidity, pressure, distance, voltage, current and resistance. Once the data is studied and analyzed, it can be fed to a virtual model to run simulations, study critical behaviours, perform experimental optimizations and provide valuable insights, drawn from the response to input variables or conditions, that can be applied to the original physical object. Digital Twins have an impact in many areas of study across different industries because they cost less than a real physical twin object on which to perform tests, and such tests may cause the object to stop working or even cause catastrophic reactions. Digital Twins can also predict the lifespan of the object under certain conditions with predictive analytics, support its maintenance methods and manage complex connections within systems of systems.
For this reference kit we have chosen to model the behavior of Metal-Oxide-Semiconductor Field-Effect Transistors (MOSFETs), which are commonly used in consumer electronics and power devices. For MOSFETs the "leakage current" is a key indicator of performance. Hence, understanding how the leakage current varies as a function of the input conditions is critical.
The device includes three components (source, drain and gate). The source-drain current is a function of the operating gate voltage,
A Machine Learning (ML) solution or an ML-powered MOSFET Digital Twin can be a valuable substitute which will predict leakage current from input values which include
In addition, this use case uses Intel® tools to speed up the whole pipeline. They are briefly described below; to go directly to the links for each of the Intel® tools, see the Learn More section.
Scikit-learn* (often referred to as sklearn) is a Python* module for machine learning. Intel® Extension for Scikit-learn* seamlessly speeds up your scikit-learn applications for Intel® CPUs and GPUs across single- and multi-node configurations. This extension package dynamically patches scikit-learn estimators while improving performance for your machine learning algorithms.
The extension is part of the Intel® AI Analytics Toolkit (AI Kit) that provides flexibility to use machine learning tools with your existing AI packages.
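As a minimal sketch of the dynamic patching described above: `patch_sklearn()` from the `sklearnex` package is called before any scikit-learn estimators are imported. The fallback branch and the `accelerated` flag are assumptions added here so the snippet also runs where the extension is not installed.

```python
# Patch scikit-learn with Intel(R) Extension for Scikit-learn* when available.
try:
    from sklearnex import patch_sklearn

    # Call this before importing any scikit-learn estimators.
    patch_sklearn()
    accelerated = True
except ImportError:
    # Assumption for illustration: fall back to stock scikit-learn.
    accelerated = False

print(f"Intel extension active: {accelerated}")
```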
XGBoost* is an open source gradient boosting machine learning library. It performs well across a variety of data and problem types, so it often pushes the limits of compute resources. Using XGBoost* on Intel® CPUs takes advantage of software accelerations powered by oneAPI, without requiring any code changes. Software optimizations deliver the maximum performance for your existing hardware. This enables faster iterations during development and training, and lower latency during inference. Please keep in mind that to train an XGBoost* model using Intel® optimizations, the 'tree_method' parameter should be set to 'hist'.
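A short sketch of how the `tree_method` setting might be passed to an XGBoost* regressor. Only `tree_method="hist"` comes from the text above; the other parameter values are illustrative assumptions, and the import is guarded so the snippet runs even without XGBoost* installed.

```python
# Parameters for an XGBoost* regressor.
xgb_params = {
    "tree_method": "hist",  # needed for the Intel-optimized training path
    "n_estimators": 100,    # assumption: not taken from the reference kit
    "max_depth": 6,         # assumption: not taken from the reference kit
}

try:
    from xgboost import XGBRegressor

    # The estimator is only constructed here; call model.fit(X, y) to train it.
    model = XGBRegressor(**xgb_params)
except ImportError:
    model = None  # xgboost is not installed in this environment

print(xgb_params["tree_method"])
```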
Modin* is a drop-in replacement for pandas, enabling data scientists to scale to distributed DataFrame processing without having to change API code. Intel® Distribution of Modin* adds optimizations to further accelerate processing on Intel® hardware.
daal4py included in Intel® oneAPI Data Analytics Library (oneDAL)* as part of the Intel® Extension for Scikit-learn*, is an easy-to-use Python* API that provides superior performance for your machine learning algorithms and frameworks. Designed for data scientists, it provides a simple way to utilize powerful Intel® DAAL machine learning algorithms in a flexible and customizable manner. For scaling capabilities, daal4py also provides you the option to process and analyze data via batch, streaming, or distributed processing modes, allowing you to choose the option to best fit your system's needs.
For more details, visit the Building a MOSFET Digital Twin for Design Exploration: Modeling Sub-threshold Voltage Leakage Current using XGBoostRegressor GitHub repository.
A schematic of the proposed reference architecture is shown in the following figure. The portion of the diagram enclosed in the red dashed line is the section of the workload for generating the synthetic data. The dashed green line section corresponds to the XGBoost* optimization process.
daal4py's speedy frameworks are best known as a way to accelerate machine learning algorithms from Scikit-Learn*; however, this guide provides you with the information to use the daal4py algorithms directly.
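One way to use daal4py directly is to convert a fitted XGBoost* model and run inference with the converted model. The sketch below assumes a daal4py release that provides `get_gbt_model_from_xgboost` and `gbt_regression_prediction`, such as the one pinned by this kit; newer releases may expose a different conversion entry point. The helper name `predict_with_daal4py` is hypothetical.

```python
def predict_with_daal4py(xgb_model, features):
    """Convert a fitted XGBRegressor to daal4py and predict with it.

    `features` is expected to be a 2-D NumPy array or pandas DataFrame.
    """
    # Imported lazily so this module loads even without daal4py installed.
    import daal4py as d4p

    # Convert the underlying booster into a daal4py gradient boosted trees model.
    daal_model = d4p.get_gbt_model_from_xgboost(xgb_model.get_booster())

    # Run batch inference with the converted model.
    result = d4p.gbt_regression_prediction().compute(features, daal_model)

    # The prediction comes back as an (n_samples, 1) array; flatten it.
    return result.prediction.ravel()
```

The sample logs in this document report a "daal4py Prediction time" next to the plain XGBoost* prediction time, which is the comparison this kind of conversion enables.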
The main data generator script is located in (src/utils/synthetic_datagen.py). The following figure describes how the leakage current is calculated from voltage values and other parameters.
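As a rough sketch of that calculation, assume the generator follows the textbook sub-threshold model I = (W/L) * exp((Vgs - Vth) / (eta * kT/q)) and reports log-leakage as -log10(I). The function names and constants below are illustrative and are not taken from `synthetic_datagen.py`.

```python
import math

BOLTZMANN = 1.380649e-23             # J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def sub_threshold_leakage(w_l, vgs, vth, eta, temp):
    """Normalized sub-threshold current; voltages in volts, temp in kelvin."""
    thermal_voltage = BOLTZMANN * temp / ELEMENTARY_CHARGE
    return w_l * math.exp((vgs - vth) / (eta * thermal_voltage))


def log_leakage(current):
    """Negative base-10 logarithm, so smaller currents give larger values."""
    return -math.log10(current)


# Inputs copied from the first row of the Example 1 sample output.
current = sub_threshold_leakage(
    w_l=0.001, vgs=0.01, vth=1.05, eta=1.217129, temp=330.082475
)
print(current, log_leakage(current))
```

With the inputs from the first row of the Example 1 output later in this document, this sketch lands close to the log-leakage of about 16.04 reported there, which suggests the assumed model is at least a reasonable approximation.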
The proposed reference solution is built primarily using an XGBoost* Regressor. However, separate experiments were conducted using a Linear Regressor to serve as a reference point for Mean Squared Error (MSE) values and confirm that XGBoost* outperforms a simple Linear Regressor.
Hyperparameter tuning happens during the training phase: the GridSearchCV function from Intel® Extension for Scikit-learn* returns the best estimator and the best parameters. The best estimator is the one with the highest score (or the smallest loss, if specified) on the left-out data, and the best parameters are the settings that gave the best results on the hold-out data.
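To illustrate what a cross-validated grid search does, here is a dependency-free sketch. It is not GridSearchCV itself: the one-parameter ridge model, the `lam` grid and the toy data are stand-ins chosen so the snippet runs with the standard library only.

```python
from itertools import product
from statistics import mean


def fit_ridge_1d(xs, ys, lam):
    """Closed-form slope of y = w * x with an L2 penalty lam."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the line y = w * x."""
    return mean((w * x - y) ** 2 for x, y in zip(xs, ys))


def grid_search_cv(xs, ys, param_grid, n_folds=4):
    """Return (best_params, best_score, n_fits) using k-fold cross-validation."""
    names = list(param_grid)
    candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]
    best_params, best_score, n_fits = None, float("inf"), 0
    for params in candidates:
        fold_scores = []
        for fold in range(n_folds):
            # Every n_folds-th sample is left out; the rest is used for fitting.
            train_x = [x for i, x in enumerate(xs) if i % n_folds != fold]
            train_y = [y for i, y in enumerate(ys) if i % n_folds != fold]
            test_x = [x for i, x in enumerate(xs) if i % n_folds == fold]
            test_y = [y for i, y in enumerate(ys) if i % n_folds == fold]
            w = fit_ridge_1d(train_x, train_y, **params)
            fold_scores.append(mse(w, test_x, test_y))
            n_fits += 1
        score = mean(fold_scores)  # smallest loss wins
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_fits


xs = [0.1 * i for i in range(1, 41)]
ys = [2.0 * x for x in xs]
grid = {"lam": [0.0, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]}
best_params, best_score, n_fits = grid_search_cv(xs, ys, grid)
print(best_params, n_fits)
```

With 8 candidates and 4 folds the loop performs 32 fits, the same count that appears in the Example 2 log ("Fitting 4 folds for each of 8 candidates, totalling 32 fits").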
In addition to the standard training/hyperparameter tuning/prediction sections of an ML pipeline, we will also build a semi-supervised learning component to enable continuous learning. Here we will start by training a conventional ML model and then use it to create pseudo-response values for non-training, synthetic data. Both the original and synthetic pseudo-response data will be combined and used to train a "semi-supervised" model.
This process can continue iteratively to simulate self-learning - similar to a digital twin - from influx of "fresh" data from devices. The model.pkl file is the XGBRegressor model which will be stored to be later used for inferencing.
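The pseudo-labeling round described above can be sketched as follows. A simple least-squares line stands in for the XGBRegressor so the example has no dependencies; all function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x


def predict(model, xs):
    a, b = model
    return [a * x + b for x in xs]


def semi_supervised_round(labeled_x, labeled_y, unlabeled_x):
    """One round: fit, pseudo-label the new inputs, refit on the union."""
    base_model = fit_line(labeled_x, labeled_y)
    pseudo_y = predict(base_model, unlabeled_x)  # pseudo-response values
    combined_x = list(labeled_x) + list(unlabeled_x)
    combined_y = list(labeled_y) + pseudo_y
    return fit_line(combined_x, combined_y), combined_x, combined_y


model, all_x, all_y = semi_supervised_round(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], [4.0, 5.0]
)
print(model, len(all_x))
```

Calling `semi_supervised_round` again with the combined data and a new batch of unlabeled inputs mimics the iterative influx of "fresh" data.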
Once the development exercise is complete, the final model can be deployed into production and used as a digital replica of a MOSFET device for simulating the leakage behavior of a real device, or it can be used as one of the components of a more complex Digital Twin system.
Recommended Hardware | Precision |
---|---|
CPU: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz | FP32 |
RAM: 187 GB | |
Recommended Free Disk Space: 22 GB or more | |
Code was tested on Ubuntu* 22.04 LTS.
The workflow below represents the end-to-end process for this use case, using the scripts described in the Get Started section.
This workflow gives a first-approach model of MOSFET behaviour using a synthetic dataset. If you want to use this pipeline in a production environment, it is critical to replace the synthetic dataset with a pre-analyzed dataset gathered from your production environment.
Define the environment variables that store the paths to your desired workspace folders. These variables are referenced in the next steps for an easier walkthrough.
WORKSPACE:
Path where the current repository will be cloned in the next steps.
DATA_DIR:
Path where the dataset must be placed.
OUTPUT_DIR:
Path where the pipeline logs will be saved.
export WORKSPACE=$PWD/digital-twin
export DATA_DIR=$WORKSPACE/data
export OUTPUT_DIR=$WORKSPACE/logs
Create the workspace directory and clone the Workflow Repository into the WORKSPACE path.
mkdir -p $WORKSPACE && cd $WORKSPACE
git clone https://github.com/oneapi-src/digital-twin.git $WORKSPACE
mkdir -p $DATA_DIR/models
mkdir -p $OUTPUT_DIR
Refer to Conda Installing on Linux for more details.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
This reference kit uses libmamba solver for fast environment creation. The dependencies file is located in $WORKSPACE/env/intel_env.yml.
Packages | Version |
---|---|
python | 3.10 |
intelpython3_full | 2024.0.0 |
modin-all | 0.24.1 |
Suggested libmamba setup
conda install -n base conda-libmamba-solver
conda config --set solver libmamba
Environment creation
conda env create -f $WORKSPACE/env/intel_env.yml
conda activate digital_twin_intel
Note: The environment only needs to be set up once, and it must include the dependencies listed above. To list your environments, use conda env list or conda info -e.
The execution of the reference kit is compatible with the following environments:
- Bare Metal
Before running the following steps, make sure your environment is complete according to the Get Started section.
Requirements:
Run Workflow
Go to the WORKSPACE directory
cd $WORKSPACE
For the pipeline to run, make sure to have the use case environment activated:
conda activate digital_twin_intel
Once the virtual environment is created and activated, we can run the benchmarks to evaluate the performance gain. The training and inference portions of the benchmark are run with the Python script MOSFET_main.py.
How does this script work?
The benchmark script takes the following arguments:
usage: MOSFET_main.py [-h] [-l LOGFILE] -m MODEL [-mf MODELFILE] [-n N_DATA_LEN] [-d DATA_PATH] [-x [X_COLS ...]]
[-y Y_COL]
optional arguments:
-h, --help show this help message and exit
-l LOGFILE, --logfile LOGFILE
log file to output benchmarking results to
-m MODEL, --model MODEL
type of model lr:linreg, xgb:xgboost, xgbh: xgb with hyperparameter tuning, xgbfull:
-mf MODELFILE, --modelfile MODELFILE
name for the built model please add extension if desired
-n N_DATA_LEN, --n_data_len N_DATA_LEN
option for data length. Provide 1 2 or 3, default 1
-d DATA_PATH, --data_path DATA_PATH
path to the customized csv dataset, optional
-x [X_COLS ...], --x_cols [X_COLS ...]
provide the independent columns of customized dataset space separated
-y Y_COL, --y_col Y_COL
provide the dependent column of customized dataset
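For readers who want to script around this interface, the help text above can be mirrored with `argparse` roughly as follows. This is a reconstruction from the usage message, not the parser in `MOSFET_main.py`; the `type=int` conversion and the `None` defaults are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented MOSFET_main.py options."""
    parser = argparse.ArgumentParser(prog="MOSFET_main.py")
    parser.add_argument("-l", "--logfile",
                        help="log file to output benchmarking results to")
    parser.add_argument("-m", "--model", required=True,
                        help="type of model: lr, xgb, xgbh or xgbfull")
    parser.add_argument("-mf", "--modelfile",
                        help="name for the built model, add extension if desired")
    parser.add_argument("-n", "--n_data_len", type=int, default=1,
                        help="option for data length: 1, 2 or 3 (default 1)")
    parser.add_argument("-d", "--data_path",
                        help="path to the customized csv dataset, optional")
    parser.add_argument("-x", "--x_cols", nargs="*",
                        help="independent columns of the customized dataset")
    parser.add_argument("-y", "--y_col",
                        help="dependent column of the customized dataset")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-m", "xgb", "-n", "2"])
print(args.model, args.n_data_len)
```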
Training Types
The pipeline has four different training types to run for reference:
- Linear Regression ( lr )
- XGBoost ( xgb )
- XGBoost Hyperparameter ( xgbh )
- XGBFull ( xgbfull )
Data Length
As mentioned previously, the dataset used for this use case is synthetic. The main data generator script is located in (src/utils/synthetic_datagen.py).
Valid values for data length factors are:
- 1: 120000 rows 10 columns datashape.
- 2: 960000 rows 10 columns datashape.
- 3: 3240000 rows 10 columns datashape.
Note: The training script uses at most 2.5M rows; if the dataset is larger, only 2.5M rows are taken.
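The relationship between the data length factor and the number of rows actually used for training can be summarized in a few lines. The row counts and the 2.5M cap come from the text above; the names `ROWS_BY_FACTOR`, `MAX_TRAINING_ROWS` and `rows_used_for_training` are hypothetical.

```python
ROWS_BY_FACTOR = {1: 120_000, 2: 960_000, 3: 3_240_000}
MAX_TRAINING_ROWS = 2_500_000


def rows_used_for_training(n_data_len):
    """Rows the training step would consume for a given data length factor."""
    if n_data_len not in ROWS_BY_FACTOR:
        raise ValueError("n_data_len must be 1, 2 or 3")
    # Larger datasets are truncated to the training cap.
    return min(ROWS_BY_FACTOR[n_data_len], MAX_TRAINING_ROWS)


for factor in (1, 2, 3):
    print(factor, rows_used_for_training(factor))
```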
To run the pipeline with default values:
python $WORKSPACE/src/MOSFET_main.py -m <training-type>
To run the pipeline with a custom model name and log path:
python $WORKSPACE/src/MOSFET_main.py -m <training-type> -mf <model-name>.pkl -l $OUTPUT_DIR/<log-name>.log -n 1
Note: After a successful run you can find the model in the $WORKSPACE path under the name you gave it. The logs are in the $OUTPUT_DIR folder if your setup matches the previous example; otherwise they are in the path you provided.
Example 1:
To run a simple XGBoost* training, with default values:
python $WORKSPACE/src/MOSFET_main.py -m xgb
To run a simple XGBoost* training, with model name "xgb_model.pkl" , with logs saved in "$OUTPUT_DIR/xgb_log.log" and data length 2:
python $WORKSPACE/src/MOSFET_main.py -m xgb -mf $DATA_DIR/models/xgb_model.pkl -l $OUTPUT_DIR/xgb_log.log -n 2
Example 2:
To run an XGBoost* training with hyperparameter tuning, with default values:
python $WORKSPACE/src/MOSFET_main.py -m xgbh
To run an XGBoost* training with hyperparameter tuning, with model name "xgbh_model.pkl", with logs saved in "$OUTPUT_DIR/xgbh_log.log" and data length 1:
python $WORKSPACE/src/MOSFET_main.py -m xgbh -mf $DATA_DIR/models/xgbh_model.pkl -l $OUTPUT_DIR/xgbh_log.log -n 1
Note: Reusing the same logfile name and path across calls appends to the existing log file.
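The append behaviour can be reproduced with Python's `logging` module, whose `FileHandler` opens files in append mode. Whether `MOSFET_main.py` configures its logging this way is an assumption; the helper name `get_benchmark_logger` is hypothetical.

```python
import logging


def get_benchmark_logger(logfile):
    """Return a logger that appends plain messages to `logfile`."""
    logger = logging.getLogger(f"benchmark:{logfile}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the root logger
    if not logger.handlers:
        # mode="a" keeps results from earlier runs in the same file.
        handler = logging.FileHandler(logfile, mode="a")
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger
```

To keep runs separate instead, pass a different `-l` path per run.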
Before proceeding to the cleaning process, it is strongly recommended to make a backup of the data that the user wants to keep. To clean the previously downloaded and generated data, run the following commands:
conda deactivate  # run this line only if digital_twin_intel is active
conda env remove -n digital_twin_intel
rm -rf "$OUTPUT_DIR" "$DATA_DIR"
Referring to the examples in this section, the following outputs represent successful runs.
Example 1 :
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
===== Running benchmarks for oneAPI tech =====
===== Generating Synthetic Data =====
--------- Synthetic Dataset Overview ---------
w_l vgs vth eta temp sub-vth w_l_bins vgs_bins vth_bins log-leakage
0 0.001 0.01 1.050000 1.217129 330.082475 9.035183e-17 1 1 1 16.044063
1 0.001 0.01 1.062342 1.201704 282.045813 2.233716e-19 1 1 1 18.650972
2 0.001 0.01 1.074684 1.200153 281.472996 1.292874e-19 1 1 1 18.888444
3 0.001 0.01 1.087025 1.175888 284.751179 6.035123e-20 1 1 1 19.219314
4 0.001 0.01 1.099367 1.211889 356.945319 2.028265e-16 1 1 1 15.692875
Done ✓
Data saved in: //frameworks.ai.platform.sample-apps.digital-twin//data/synthetic_data.csv
Synthetic data shape 960000 11
Synthetic dataset 'X' columns: ['w_l', 'vgs', 'vth', 'eta','temp', 'w_l_bins', 'vgs_bins', 'vth_bins']
Synthetic dataset 'Y' target column: 'log-leakage'
INFO:sklearnex: sklearn.model_selection.train_test_split: running accelerated version on CPU
sklearn.model_selection.train_test_split: running accelerated version on CPU
INFO:sklearnex: sklearn.model_selection.train_test_split: running accelerated version on CPU
sklearn.model_selection.train_test_split: running accelerated version on CPU
===== Running Benchmarks for XGB Regression =====
Training time = 3.133789300918579
Prediction time = 0.042740821838378906
daal4py Prediction time = 0.01962566375732422
Mean SQ Error: 0.017
daal4py Mean SQ Error: 0.017
Example 2 :
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
===== Running benchmarks for oneAPI tech =====
===== Generating Synthetic Data =====
--------- Synthetic Dataset Overview ---------
w_l vgs vth eta temp sub-vth w_l_bins vgs_bins vth_bins log-leakage
0 0.001 0.01 1.050 1.208536 293.649819 1.701777e-18 1 1 1 17.769097
1 0.001 0.01 1.075 1.204200 320.383640 1.223601e-17 1 1 1 16.912360
2 0.001 0.01 1.100 1.221108 312.065480 3.785925e-18 1 1 1 17.421828
3 0.001 0.01 1.125 1.217852 279.375132 3.034652e-20 1 1 1 19.517891
4 0.001 0.01 1.150 1.204661 240.956293 1.615252e-23 1 1 1 22.791760
Done ✓
Data saved in: //frameworks.ai.platform.sample-apps.digital-twin//data/synthetic_data.csv
Synthetic data shape 120000 11
Synthetic dataset 'X' columns: ['w_l', 'vgs', 'vth', 'eta','temp', 'w_l_bins', 'vgs_bins', 'vth_bins']
Synthetic dataset 'Y' target column: 'log-leakage'
INFO:sklearnex: sklearn.model_selection.train_test_split: running accelerated version on CPU
sklearn.model_selection.train_test_split: running accelerated version on CPU
INFO:sklearnex: sklearn.model_selection.train_test_split: running accelerated version on CPU
sklearn.model_selection.train_test_split: running accelerated version on CPU
===== Running Benchmarks for XGB Hyperparameter Training =====
Fitting 4 folds for each of 8 candidates, totalling 32 fits
Training time = 14.580235242843628
Prediction time = 0.46219873428344727
daal4py Prediction time = 0.004792928695678711
Mean SQ Error: 0.015
daal4py Mean SQ Error: 0.015
Because MOSFET devices are so common, any performance gain in model development will be amplified significantly in a deployed model. This offers a significant advantage in model solution scalability. Because leakage current is a key indicator of performance, a digital twin which can predict the leakage current of MOSFET devices at scale will be extremely valuable. To deploy this solution, the model.pkl file created by training/hyperparameter tuning can be used to build end-user applications (APIs to handle client requests) with standard open-source packages such as Flask or FastAPI.
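The core of such an API is a function that turns a request body into predictions, which a Flask or FastAPI route could then wrap. The sketch below is framework-free. The feature list matches the synthetic dataset 'X' columns shown in the sample output, while the `instances` payload key, the response shape and the `ConstantModel` stand-in for the unpickled model are assumptions.

```python
import json

FEATURES = ["w_l", "vgs", "vth", "eta", "temp", "w_l_bins", "vgs_bins", "vth_bins"]


class ConstantModel:
    """Hypothetical stand-in for the model loaded from model.pkl."""

    def predict(self, rows):
        return [16.0 for _ in rows]


def handle_predict_request(model, body):
    """Return (status_code, json_response) for a JSON request body."""
    try:
        payload = json.loads(body)
        rows = [[float(record[name]) for name in FEATURES]
                for record in payload["instances"]]
    except (KeyError, TypeError, ValueError) as exc:
        return 400, json.dumps({"error": f"invalid request: {exc}"})
    # A real model would typically receive a NumPy array or DataFrame here.
    predictions = [float(value) for value in model.predict(rows)]
    return 200, json.dumps({"log_leakage": predictions})


request_body = json.dumps({"instances": [{
    "w_l": 0.001, "vgs": 0.01, "vth": 1.05, "eta": 1.2, "temp": 300,
    "w_l_bins": 1, "vgs_bins": 1, "vth_bins": 1,
}]})
status, response = handle_predict_request(ConstantModel(), request_body)
print(status, response)
```

Keeping validation in a plain function makes it testable without starting a web server.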
To use this use case with your own dataset, take note of the column names of your dataset (independent columns and target column), since some scripts point directly to the columns used in the synthetic dataset. The dataset must be in CSV format to guarantee the functionality of this section. Follow these steps to use your own dataset.
Step 1. Make sure these steps are completed successfully:
Step 2. Place your customized dataset in the dataset dir
mv /path/to/your/customized/dataset.csv $DATA_DIR/
Step 3. Run your workflow with your data
The script MOSFET_main.py can receive the independent columns of your dataset, which correspond to the 'X' part of the dataframes, and the target 'Y' column of your data, which corresponds to the variable you want to predict, as you can see in the help message below:
usage: MOSFET_main.py [-h] [-l LOGFILE] -m MODEL [-mf MODELFILE] [-n N_DATA_LEN] [-d DATA_PATH] [-x [X_COLS ...]]
[-y Y_COL]
optional arguments:
-h, --help show this help message and exit
-l LOGFILE, --logfile LOGFILE
log file to output benchmarking results to
-m MODEL, --model MODEL
type of model lr:linreg, xgb:xgboost, xgbh: xgb with hyperparameter tuning, xgbfull:
-mf MODELFILE, --modelfile MODELFILE
name for the built model please add extension if desired
-n N_DATA_LEN, --n_data_len N_DATA_LEN
option for data length. Provide 1 2 or 3, default 1
-d DATA_PATH, --data_path DATA_PATH
path to the customized csv dataset, optional
-x [X_COLS ...], --x_cols [X_COLS ...]
provide the independent columns of customized dataset space separated
-y Y_COL, --y_col Y_COL
provide the dependent column of customized dataset
Example 1: Let's say your customized dataset follows this structure:
w_l vgs vth eta temperature sub-vth w_l_b vgs_b vth_b curr-log-leakage
0 0.001 0.01 1.050 1.197438 316.039904 1.401728e-17 1 1 1 16.853336
1 0.001 0.01 1.075 1.205384 343.338401 1.069032e-16 1 1 1 15.971009
2 0.001 0.01 1.100 1.191140 296.339600 2.713053e-19 1 1 1 18.566542
3 0.001 0.01 1.125 1.201279 311.210269 9.215524e-19 1 1 1 18.035480
4 0.001 0.01 1.150 1.216623 309.201781 5.407484e-19 1 1 1 18.267005
Then your target variable is named curr-log-leakage.
Your independent column names are: 'w_l' 'vgs' 'vth' 'eta' 'temperature' 'sub-vth' 'w_l_b' 'vgs_b' and 'vth_b'.
So the X and Y arguments should take the form:
-x w_l vgs vth eta temperature sub-vth w_l_b vgs_b vth_b
-y curr-log-leakage
The full command then takes the form:
python $WORKSPACE/src/MOSFET_main.py -m <training-type> -mf <model-name>.pkl -l $OUTPUT_DIR/<log-name>.log -d $DATA_DIR/<customized-dataset-name>.csv -x w_l vgs vth eta temperature sub-vth w_l_b vgs_b vth_b -y curr-log-leakage
Now, let's put it all together with examples:
Example 1.1:
To run a simple XGBoost* training, with model name "xgb_model.pkl" , with logs saved in "$OUTPUT_DIR/xgb_log.log" with your own dataset with name "modified.csv":
python $WORKSPACE/src/MOSFET_main.py -m xgb -mf xgb_model.pkl -l $OUTPUT_DIR/xgb_log.log -d $DATA_DIR/modified.csv -x w_l vgs vth eta temperature sub-vth w_l_b vgs_b vth_b -y curr-log-leakage
Example 1.2:
To run an XGBoost* training with hyperparameter tuning, with model name "xgbh_model.pkl", with logs saved in "$OUTPUT_DIR/xgbh_log.log" with your own dataset with name "modified.csv":
python $WORKSPACE/src/MOSFET_main.py -m xgbh -mf xgbh_model.pkl -l $OUTPUT_DIR/xgbh_log.log -d $DATA_DIR/modified.csv -x w_l vgs vth eta temperature sub-vth w_l_b vgs_b vth_b -y curr-log-leakage
If you have questions about the rest of the parameters used, please refer to this section.
Note: Customized data pipelines only work with the XGB, LR and XGBH training types.
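Because the `-x` and `-y` values must match the CSV header exactly, it can help to validate them before starting a long run. The following sketch uses only the standard `csv` module and assumes a comma-separated file with a header row; `load_xy` is a hypothetical helper, not part of the reference kit.

```python
import csv


def load_xy(csv_path, x_cols, y_col):
    """Read the feature columns and the target column from a CSV file."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        # Fail early, with a clear message, on misspelled column names.
        missing = [name for name in [*x_cols, y_col] if name not in header]
        if missing:
            raise KeyError(f"columns not found in {csv_path}: {missing}")
        features, target = [], []
        for row in reader:
            features.append([float(row[name]) for name in x_cols])
            target.append(float(row[y_col]))
    return features, target
```

For example, `load_xy("modified.csv", ["vgs", "vth"], "curr-log-leakage")` would return the two feature columns and the target as lists of floats, or raise a `KeyError` naming any column that is absent from the header.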
Visit Intel® Extension for Scikit-learn for more.
Visit Intel® Optimization for XGBoost for more.
Visit Intel® Distribution of Modin for more.
Visit Python* API (daal4py) for Intel® oneAPI Data Analytics Library (oneDAL) for more.
The End-to-end Digital Twin team tracks both bugs and enhancement requests using GitHub issues. Before submitting a suggestion or bug report, search the DLSA GitHub issues to see if your issue has already been reported.
Intel® has released XGBoost* optimizations as part of the general XGBoost* packages. Please keep in mind the performance benefit will be a result of both Intel® Optimizations as well as version updates. No code changes are needed to realize these performance gains apart from just updating the XGBoost* version, except for explicitly listing tree_method as hist (as all training optimizations from Intel® are limited to the hist tree method). However, the daal4py optimizations are still relevant to the use case as this can further improve the performance of end-user applications.
The base code was sourced from the following github repository: https://github.com/tirthajyoti/Digital-Twin/blob/main/MOSFET-1.ipynb
[1]: IBM. (2022). What is a Digital Twin. Www.ibm.com. https://www.ibm.com/topics/what-is-a-digital-twin
[2]: Cheat sheet: What is Digital Twin? Internet of Things blog. (2020, December 4). IBM Blog. https://www.ibm.com/blog/iot-cheat-sheet-digital-twin/
The dataset used here is synthetic. Intel® Corporation does not own the rights to this data set and does not confer any rights to it.
To the extent that any public or non-Intel datasets or models are referenced by or accessed using tools or code on this site those datasets or models are provided by the third party indicated as the content source. Intel® does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license. Intel® expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel® is not liable for any liability or damages relating to your use of public content.