\section{Objective 1: \texttt{mpbenchmark}}
As discussed in the introduction, \texttt{mpbenchmark} performs jet engine calculations using multiple threads and outputs the time taken to complete these calculations. The published source code of \texttt{mpbenchmark} includes a \texttt{C\#} implementation\cite{mpbenchmark_code}, which served as a useful reference for an object-oriented implementation. The proposed \texttt{C++} object-oriented design comprises three main classes:
\begin{enumerate}
\item \texttt{FileDataLoader}: loads data from the input file and writes output data to the output file.
\item \texttt{SharedPerformanceData}: stores the data loaded from the input file in an array and stores output data in a separate array; importantly, it allows threads to access specific parts of the input data in a thread-safe manner.
\item \texttt{Worker}: contains the functions that perform the main calculations, compute the deadlines missed and produce the output data. It defines \texttt{operator()}, which encapsulates the main calculations; this design is known as a \texttt{Functor}.
\end{enumerate}
The \texttt{C++} object-oriented class design can be visualised using a UML class diagram shown in Figure \ref{fig:mpbenchmark_UML_diagram}.
\begin{figure}[H] % Positioning preference: here, top, bottom, page
\centering
\includegraphics[width=1\textwidth, height=20cm]{figures/mpbenchmark_class.png} % Adjust the path and width as needed
\caption{UML class diagram of the proposed \texttt{C++} solution.}
\label{fig:mpbenchmark_UML_diagram} % Use this label to reference the figure
\end{figure}
Figure~\ref{fig:mpbenchmark_UML_diagram2} shows a UML sequence diagram of how the three classes in Figure~\ref{fig:mpbenchmark_UML_diagram} are used. The ``\texttt{main}'' participant in the sequence diagram represents the \texttt{.cpp} file containing the \texttt{int main(...)} function.
\begin{figure}[htbp] % Positioning preference: here, top, bottom, page
\centering
\includegraphics[width=1\textwidth, height=20cm]{figures/mpbenchmark_sequence.png} % Adjust the path and width as needed
\caption{UML sequence diagram of the proposed \texttt{C++} solution.}
\label{fig:mpbenchmark_UML_diagram2} % Use this label to reference the figure
\end{figure}
The key features of the proposed solution leverage the modern enhancements introduced to the \texttt{C++} language in its 2011 update. Since then, \texttt{C++} has undergone significant transformations that include advanced multi-threading capabilities and improved memory management techniques. Significantly, the introduction of \texttt{std::thread} and \texttt{std::mutex} in modern \texttt{C++} offers a more convenient and cross-platform approach to multi-threading compared to traditional \texttt{POSIX} threads, which are not only more cumbersome to use but also restricted to \texttt{Unix}-based systems. The utilization of these features in the proposed solution enhances its portability and ease of use. Additionally, the proposed solution incorporates the \texttt{std::vector} class, a dynamic array sequence container that significantly simplifies memory management. Unlike in the \texttt{C} language, where manual handling of dynamic memory can be error-prone, \texttt{std::vector} automates this aspect, thereby reducing complexity and potential bugs. While modern \texttt{C++} offers numerous advantages over \texttt{C}, including object-oriented programming features and performance parity, it does come with a steeper learning curve. This learning curve is often seen as a trade-off for the powerful features and robustness provided by \texttt{C++}\cite{evolution_of_C++}.
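To make the threading pattern concrete, the following is a minimal sketch of how a functor-based worker can be launched with \texttt{std::thread} and protected with \texttt{std::mutex}; the class and variable names are illustrative placeholders, not the actual \texttt{mpbenchmark} code.
\begin{lstlisting}[
caption={Minimal sketch of launching functor workers with \texttt{std::thread}, \texttt{std::mutex} and \texttt{std::vector}. Names are illustrative only.},
label={lst:modern_cpp_threads_sketch}
]
#include <thread>
#include <mutex>
#include <vector>

std::mutex results_mutex; // protects shared output data

struct Worker {
    int id;
    void operator()() {
        // ... perform this thread's share of the calculations ...
        std::lock_guard<std::mutex> lock(results_mutex);
        // ... write this thread's results to shared storage ...
    }
};

int main() {
    const int num_threads = 4;
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i)
        threads.emplace_back(Worker{i}); // start one worker per thread
    for (auto& t : threads)
        t.join();                        // wait for all workers to finish
    return 0;
}
\end{lstlisting}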
The proposed solution employed the \texttt{fmt} library in \texttt{C++}\cite{fmt_printing_library} for output, providing an enhanced printing interface compared to the \texttt{std::cout} stream from the \texttt{C++} standard library. Although the compiler on the target system supported the \texttt{C++20} standard, it lacked support for the \texttt{std::format()} function, which provides better printing capabilities in \texttt{C++}\cite{std_format_gcc_compiler_version}. Additionally, a new command line argument was introduced to adjust the number of threads used, complementing the existing argument that selects the simulation engine. For compilation and linking, the industry-standard \texttt{CMake}\cite{cmake_about} software was utilised. In the \texttt{CMakeLists.txt} file, essential configurations included specifying the \texttt{C++20} standard, incorporating the \texttt{fmt} library, and compiling with the \texttt{-O2} optimisation flag. This optimisation flag, also used by the original authors of \texttt{mpbenchmark}\cite{mpbenchmark_paper}, was retained in the proposed solution for consistency.
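For reference, printing with \texttt{fmt} takes roughly the following form; the message text and values are illustrative and not taken from the actual solution.
\begin{lstlisting}[
caption={Minimal example of printing with the \texttt{fmt} library (illustrative values).},
label={lst:fmt_print_sketch}
]
#include <fmt/core.h>

int main() {
    double elapsed_ms = 42.5; // placeholder timing value
    int threads = 4;          // placeholder thread count
    // fmt::print offers type-safe, Python-style format strings
    fmt::print("Run completed in {:.2f} ms using {} threads\n", elapsed_ms, threads);
    return 0;
}
\end{lstlisting}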
An alternative solution was developed to further enhance the performance of \texttt{mpbenchmark}. The \texttt{Valgrind}/\texttt{Callgrind} function profiler was deployed to find potential bottlenecks. The application was compiled with debug information and optimisations turned off by specifying the \texttt{Debug} build type and the \texttt{-g} and \texttt{-O0} flags in the \texttt{CMake} file. \texttt{Callgrind} profiling reports the ``self-cost'' of different parts of the code, i.e.\ the number of CPU cycles consumed directly by a specific function itself, which can be used to identify potential areas for optimisation\cite{Valgrind2024}. The profiling results showed that the part of the application that approximates the value of $\pi$ had the highest self-cost. This code snippet is shown in Listing~\ref{lst:pi_approximation_1} below:
\begin{lstlisting}[
caption={Part of the code with the highest self-cost. It approximates $\pi$ using numerical integration.},
label={lst:pi_approximation_1}
]
// initialise variables for the pi calculation
const long num_steps = 1000000;
double step = 1.0 / static_cast<double>(num_steps);
double x{}, pi{}, sum{};
// performing numerical integration using the midpoint Riemann sum
for (int i = 0; i < num_steps; i++) {
x = (i + 0.5) * step;
sum += 4.0 / (1.0 + x * x);
}
pi = sum * step;
\end{lstlisting}
The problem with this part of the code was that it already lay inside the parallel region and was executed by every thread. It may seem like a good candidate for an OpenMP parallel \texttt{for} loop; however, creating nested threads beyond the number of hardware threads available on the system does not always lead to higher performance and in many cases degrades it. Another way to improve performance is to use SIMD intrinsics. An advantage of \texttt{C++} (and \texttt{C}) is that SIMD intrinsics can be deployed directly, whereas higher-level languages such as \texttt{Java} or \texttt{C\#} make them very difficult to access.
SIMD intrinsics vary with the target CPU: \texttt{x86} processors (found in most laptops and desktops) use \texttt{AVX2} instructions, whereas \texttt{ARM} processors (commonly found in Apple products and embedded devices) use \texttt{NEON} instructions. This project uses both, as the proposed \texttt{C++} solution is deployed on a desktop (\texttt{x86} processor) and on Raspberry Pi devices (\texttt{ARM} processor). The SIMD-enhanced code is summarised algorithmically in the steps shown in Figure~\ref{fig:_simd_algorithm}, with an illustrative code sketch given after the figure.
\begin{figure}[htbp]
\centering
\begin{enumerate}
\item Initialise \texttt{256-bit} wide vectors: each vector can hold four double-precision (\texttt{64-bit}) floating point numbers. The main initialisations would be a vector to hold four loop indices (\texttt{vec\_i}), a vector to calculate four values of $x$ (\texttt{vec\_x}), a vector to hold the result of the integrand (\texttt{vec\_temp}) and a vector to accumulate the sum (\texttt{vec\_sum}) after each loop iteration.
\item \texttt{for} loop: iterate \texttt{num\_steps/4} times:
\begin{itemize}
\item step 1: calculate the four midpoints $x$-values simultaneously using the vector \texttt{vec\_i} and hold result in \texttt{vec\_x}. Original formula used: \texttt{(i + 0.5) * step}.
\item step 2: compute the value of the integrand in parallel using the four calculated $x$-values stored in \texttt{vec\_x}, store result in \texttt{vec\_temp}. Original formula used: \texttt{4 / (1 + x * x)}.
\item step 3: accumulate the values from \texttt{vec\_temp} into the \texttt{vec\_sum} vector.
\item step 4: increment loop indices vector \texttt{vec\_i} by \texttt{4}.
\end{itemize}
\item Final summation: upon the completion of the loop, perform a horizontal sum on the vector (\texttt{vec\_sum}) that held the accumulated values.
\item Calculation of $\pi$: the sum is multiplied by the step size to approximate the value of $\pi$. Original formula used: \texttt{pi = sum * step}.
\end{enumerate}
\caption{Algorithm for calculating $\pi$ using SIMD (\texttt{AVX2}) instructions.}
\label{fig:_simd_algorithm}
\end{figure}
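The listing below gives a minimal \texttt{AVX2} sketch of the algorithm in Figure~\ref{fig:_simd_algorithm}; the function and variable names are illustrative assumptions, and the full implementation used in the project is provided in [Appendix~\ref{sec:app_obj1}].
\begin{lstlisting}[
caption={Illustrative \texttt{AVX2} sketch of the $\pi$ approximation from Figure~\ref{fig:_simd_algorithm} (requires compiling with AVX support).},
label={lst:pi_avx2_sketch}
]
#include <immintrin.h>

double approximate_pi_avx2(long num_steps) {
    const double step = 1.0 / static_cast<double>(num_steps);
    __m256d vec_step = _mm256_set1_pd(step);
    __m256d vec_half = _mm256_set1_pd(0.5);
    __m256d vec_four = _mm256_set1_pd(4.0);
    __m256d vec_one  = _mm256_set1_pd(1.0);
    __m256d vec_incr = _mm256_set1_pd(4.0);
    __m256d vec_i    = _mm256_setr_pd(0.0, 1.0, 2.0, 3.0); // four loop indices
    __m256d vec_sum  = _mm256_setzero_pd();

    for (long i = 0; i < num_steps / 4; ++i) {
        // step 1: x = (i + 0.5) * step for four indices at once
        __m256d vec_x = _mm256_mul_pd(_mm256_add_pd(vec_i, vec_half), vec_step);
        // step 2: integrand 4 / (1 + x * x)
        __m256d vec_temp = _mm256_div_pd(vec_four,
            _mm256_add_pd(vec_one, _mm256_mul_pd(vec_x, vec_x)));
        // step 3: accumulate the four partial results
        vec_sum = _mm256_add_pd(vec_sum, vec_temp);
        // step 4: advance the four indices by 4
        vec_i = _mm256_add_pd(vec_i, vec_incr);
    }
    // final horizontal sum of the four lanes
    double lanes[4];
    _mm256_storeu_pd(lanes, vec_sum);
    double sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    return sum * step;
}
\end{lstlisting}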
As discussed, to utilise SIMD intrinsics on Raspberry Pi devices, \texttt{NEON} instructions must be employed. \texttt{NEON} comes with limitations, notably restricted support for double-precision floating point and a vector width of only \texttt{128 bits}, compared to the \texttt{256-bit} vectors of \texttt{AVX2}\cite{neon_reference}. To accommodate \texttt{NEON}, two solutions were developed: one using single-precision floating point (\texttt{float}) and the other using double-precision floating point (\texttt{double}). The \texttt{float} version performs four computations simultaneously, while the \texttt{double} version manages only two, since a \texttt{128-bit} vector can store four \texttt{float} values or two \texttt{double} values. Typically, \texttt{float} variables offer less decimal precision than \texttt{double} variables; these two SIMD-enhanced solutions, along with their decimal accuracy, are compared in the results and discussion section. The \texttt{NEON} implementation follows a similar algorithm to the one shown in Figure~\ref{fig:_simd_algorithm}. The \texttt{CMake} file also required some changes to enable the \texttt{AVX2} instructions; these, along with the code snippets for the SIMD-enhanced code, can be found in the [Appendix~\ref{sec:app_obj1}]. An illustrative single-precision \texttt{NEON} sketch of the same algorithm is given below.
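The sketch assumes an \texttt{AArch64} (64-bit) build, since \texttt{vdivq\_f32} and \texttt{vaddvq\_f32} are only available there; the names are again illustrative rather than the project code.
\begin{lstlisting}[
caption={Illustrative single-precision \texttt{NEON} sketch of the $\pi$ approximation (\texttt{AArch64}).},
label={lst:pi_neon_sketch}
]
#include <arm_neon.h>

float approximate_pi_neon(long num_steps) {
    const float step = 1.0f / static_cast<float>(num_steps);
    const float init_idx[4] = {0.0f, 1.0f, 2.0f, 3.0f};
    float32x4_t vec_i    = vld1q_f32(init_idx);   // four loop indices
    float32x4_t vec_step = vdupq_n_f32(step);
    float32x4_t vec_half = vdupq_n_f32(0.5f);
    float32x4_t vec_four = vdupq_n_f32(4.0f);
    float32x4_t vec_one  = vdupq_n_f32(1.0f);
    float32x4_t vec_incr = vdupq_n_f32(4.0f);
    float32x4_t vec_sum  = vdupq_n_f32(0.0f);

    for (long i = 0; i < num_steps / 4; ++i) {
        // x = (i + 0.5) * step for four indices at once
        float32x4_t vec_x = vmulq_f32(vaddq_f32(vec_i, vec_half), vec_step);
        // integrand 4 / (1 + x * x); vdivq_f32 requires AArch64
        float32x4_t vec_temp = vdivq_f32(vec_four,
            vaddq_f32(vec_one, vmulq_f32(vec_x, vec_x)));
        vec_sum = vaddq_f32(vec_sum, vec_temp);   // accumulate
        vec_i   = vaddq_f32(vec_i, vec_incr);     // advance indices by 4
    }
    float sum = vaddvq_f32(vec_sum);              // horizontal sum (AArch64)
    return sum * step;
}
\end{lstlisting}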
To collect benchmark data, the runtime was recorded and saved to a specified \texttt{.txt} file; the number of deadlines missed was saved to a separate \texttt{.txt} file. The application underwent 103 runs, with the first three designated as warm-up runs; the subsequent 100 runs were averaged and used for plotting bar charts and speedup plots. For the languages \texttt{Java} and \texttt{C\#}, data were collected over 103 runs, whereas for the compiled languages \texttt{C}, \texttt{Ada}, and \texttt{C++}, the application was executed 203 times. The number of threads was varied using the \texttt{taskset} command in Linux; specifically for the \texttt{C++} application, the second command line argument was used to set the number of threads. This approach aligns with the methodology described by the authors of \texttt{mpbenchmark}\cite{mpbenchmark_paper}. An example bash script demonstrating how the application was executed multiple times is shown in Listing~\ref{lst:benchmark_collection}:
\begin{lstlisting}[
caption={Bash script to run the application multiple times with the \texttt{taskset} command. Command line arguments: \texttt{mpbenchmark [engine\_type] [threads]}.},
label={lst:benchmark_collection}
]
# Loop to run the executable 103 times, using "3" as the default engine type and 2 cores/threads
for i in {1..103}
do
taskset -c 0,2 ./mpbenchmark 3 2
done
# Loop to run the executable 103 times, using "3" as the default engine type and 4 cores/threads
for i in {1..103}
do
taskset -c 0,2,4,6 ./mpbenchmark 3 4
done
\end{lstlisting}
The process of collecting benchmark times is visualised in Figure~\ref{fig:results_collection}. The \texttt{Python} script used to plot the graphs was trivial and is not included in this report. The full system specification of the desktop and Raspberry Pi devices used to collect data can be found in the [Appendix~\ref{tab:spec_comparisons}]. The results are displayed as benchmark and speedup plots. A benchmark plot shows the mean run time of the application on the vertical axis and the number of cores or threads on the horizontal axis; here lower is better, as a lower run time indicates better performance. The error bars indicate the 95\% confidence interval of the mean. A speedup plot shows the speedup ratio on the vertical axis and the number of cores or threads on the horizontal axis; here higher is better, as a higher speedup indicates better scalability of the application. The ``speedup'' is the ratio of the original single-threaded run time to the run time using a certain number of threads. For example, if an application takes 100 ms using 1 thread and 20 ms using 8 threads, the speedup for 8 threads is $\frac{100}{20} = 5.00$; as a ratio it is unitless.
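For reference, the error bars and speedups were computed as follows; the confidence-interval formula is the standard normal approximation and is stated here as an assumption, since it is not spelled out elsewhere in this report. With mean run time $\bar{t}$, sample standard deviation $s$ and $n = 100$ timed runs,
\[
\bar{t} \pm 1.96\,\frac{s}{\sqrt{n}}, \qquad S(p) = \frac{\bar{t}_1}{\bar{t}_p},
\]
where $S(p)$ is the speedup for $p$ threads and $\bar{t}_1$ is the mean single-threaded run time.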
In summary, for desktop (\texttt{x86}) processors, two main solutions were developed: a novel \texttt{C++} solution and a SIMD-enhanced \texttt{C++} solution following the algorithm in Figure~\ref{fig:_simd_algorithm}. For the Raspberry Pi devices, three solutions were created: the first is the novel \texttt{C++} solution, identical to that on the desktop, and the other two are the SIMD-enhanced versions utilising \texttt{NEON} instructions with single- and double-precision floating point variables.
\begin{figure}[htbp] % Positioning preference: here, top, bottom, page
\centering
\includegraphics[width=0.40\textwidth]{figures/benchmarking_method.png} % Adjust the path and width as needed
\caption{Benchmark results collection strategy.}
\label{fig:results_collection} % Use this label to reference the figure
\end{figure}
\section{Objective 2: \texttt{MobileNet}}
% Step 1: MobileNet setup
% Step 2: Valgrind results
% Step 3: OpenMP parallelisations and SIMD
% Step 4: Raspberry Pi different compilations
Setting up \texttt{MobileNet} from the \texttt{GitHub} repository\cite{mobilenet_repo} was a time-consuming task. Details of the setup process can be found in the [Appendix Listing~\ref{lst:mobilenet_updates}]. Once set up, the \texttt{MobileNet} project was profiled with the \texttt{Callgrind} function profiler to find bottlenecks and possible parallelisation points. To use \texttt{Callgrind}, the project was compiled with debug information and no optimisations; this required setting the build type to \texttt{Debug} and adding the \texttt{-g} and \texttt{-O0} flags in the \texttt{CMakeLists.txt} file. The \texttt{Callgrind} results showed that functions from the convolution and batch normalisation layers had the highest self-cost, namely \texttt{ConvLayer::forward()}, \texttt{BatchNormalLayer::forward()} and \texttt{ConvLayer::Addpad()}. The detailed \texttt{Callgrind} profiling results can be seen in [Appendix Figure~\ref{fig:mobilenet_profiling}].
The CPU-intensive function inside the convolution layer contained a number of nested \texttt{for} loops with large iteration counts. The \texttt{OpenMP} library was used to parallelise these loops: the \texttt{collapse} clause and the \texttt{dynamic} scheduling type of the parallel \texttt{for} loop were applied inside the \texttt{ConvLayer::forward()} function. In \texttt{OpenMP}, the \texttt{parallel for} directive, enhanced by the \texttt{collapse(3)} clause and \texttt{dynamic} scheduling, efficiently manages the execution of nested loops in parallel. By collapsing three nested loops into a single iteration space, the \texttt{collapse} clause simplifies load balancing across multiple threads by increasing the pool of iterations available for distribution. Concurrently, \texttt{dynamic} scheduling assigns iterations to threads at run time, adapting to varying execution times among iterations and ensuring an even workload distribution\cite{openmp_guide}.
To parallelise the other two functions, \texttt{BatchNormalLayer::forward()} and \texttt{ConvLayer::Addpad()}, only the \texttt{collapse} clause was utilised. The \texttt{ConvLayer::forward()} function contained some computations that were good candidates for SIMD optimisation; using SIMD here was more straightforward than in Objective 1, as the \texttt{OpenMP} library offers a simple way of using SIMD without the developer having to deal with \texttt{AVX2} or \texttt{NEON} instructions directly. For the Raspberry Pi processor, it was decided to parallelise only the \texttt{ConvLayer::forward()} and \texttt{BatchNormalLayer::forward()} functions and not to use the SIMD enhancements, as the performance gains from SIMD instructions are more limited on embedded processors due to their narrower registers\cite{embedded_processors_reduced_simd_performance}. Having said that, results from parallelising all three functions and from the SIMD-enhanced code were compared to find the optimal solution; this is discussed in the results and discussion section.
To address the different compilation targets, the \texttt{CMake} file was altered and a macro \texttt{EMBEDDED\_PROC} was added. To compile the \texttt{MobileNet} project for \texttt{x86} processors, \texttt{EMBEDDED\_PROC} is set to \texttt{OFF}, which applies parallelisation to all three functions along with the SIMD enhancements. To compile the project for the Raspberry Pi processor, the user sets \texttt{EMBEDDED\_PROC} to \texttt{ON}, which parallelises only the two functions discussed and does not apply the SIMD enhancements. This can be seen in lines 15--18 of Listing~\ref{lst:mobilenet_parallel}. The parallelisation of the other two functions, \texttt{BatchNormalLayer::forward()} and \texttt{ConvLayer::Addpad()}, is very similar to that of \texttt{ConvLayer::forward()} and can be found in [Appendix Listings~\ref{lst:mobilenet_function1_parallel} and~\ref{lst:mobilenet_function2_parallel}].
\begin{lstlisting}[
caption={Parallelising the \texttt{ConvLayer::forward()} function and applying SIMD depending on the macro \texttt{EMBEDDED\_PROC}.},
label={lst:mobilenet_parallel}
]
void ConvLayer::forward(float *pfInput)
{
// collapse 3 "for" loops and use dynamic scheduling
#pragma omp parallel for collapse(3) schedule(dynamic)
for (int g = 0; g < m_nGroup; g++)
{
for (int nOutmapIndex = 0; nOutmapIndex < m_nOutputGroupNum; nOutmapIndex++)
{
for (int i = 0; i < m_nOutputWidth; i++)
{
// other code to be ignored ...
// Only use OpenMP SIMD optimizations if EMBEDDED_PROC is not defined
#ifndef EMBEDDED_PROC
#pragma omp simd reduction(+:fSum)
#endif
for (int n = 0; n < m_nKernelWidth; n++)
{
nKernelIndex = nKernelStart + m * m_nKernelWidth + n;
nInputIndex = nInputIndexStart + m * m_nInputPadWidth + n;
fSum += m_pfInputPad[nInputIndex] * m_pfWeight[nKernelIndex];
}
}
}
}
}
\end{lstlisting}
In line 16 of the code snippet above (Listing~\ref{lst:mobilenet_parallel}), the \texttt{reduction} clause on \texttt{fSum} was deployed; this ensures that, after the vectorised operations, the partial results held in different elements of the SIMD vector register are correctly summed into the single scalar \texttt{fSum}. This avoids manual management of the partial results, simplifying the code and ensuring correctness.
Moreover, the arrays must also be ``aligned'' to make full use of the SIMD clause; this is important in several key ways. First, it ensures efficient utilisation of the CPU's cache lines, reducing cache misses by aligning data accesses with the processor's cache system and thereby accelerating data retrieval. Second, it facilitates efficient SIMD operations, as modern SIMD instructions require data to be aligned with the size of their registers (here, 256 bits or 32 bytes) to avoid penalties from misalignment, such as additional cycles for data realignment. Alignment also prevents faults that occur when data is accessed at addresses not aligned to the required boundaries, particularly on systems where such misalignment leads to crashes. Lastly, by aligning memory accesses with the hardware's memory interface, the code optimises throughput and maximises memory bandwidth usage. Together, these factors ensure that the SIMD-optimised convolutional layer computations are performed with maximal efficiency and stability\cite{simd_array_alignment}. A 32-byte alignment was used for the arrays \texttt{m\_pfInputPad} and \texttt{m\_pfWeight}, as shown in Listing~\ref{lst:mobilenet_array_alignment}.
\begin{lstlisting}[
caption={Making arrays that are 32-byte aligned for SIMD clause.},
label={lst:mobilenet_array_alignment}
]
ConvLayer::ConvLayer(/*Constructor arguments not shown ... */)
{
// ignore other code ...
// Creating arrays which are 32-byte aligned for SIMD optimisations
size_t alignment = 32;
size_t inputPadSize = m_nInputNum * m_nInputPadWidth * m_nInputPadWidth * sizeof(float);
size_t weightSize = m_nOutputNum * m_nInputGroupNum * m_nKernelSize * sizeof(float);
size_t outputSize = m_nOutputNum * m_nOutputSize * sizeof(float);
m_pfInputPad = static_cast<float*>(aligned_alloc(alignment, inputPadSize));
m_pfWeight = static_cast<float*>(aligned_alloc(alignment, weightSize));
}
\end{lstlisting}
The \texttt{MobileNet} application was designed with three command line arguments. The first argument determines whether to classify only one image or all images in the \texttt{data} folder. The second argument specifies whether the runtime should be saved to a \texttt{.txt} file, and the third sets the number of threads used. For data collection, the time taken to classify a single image was recorded while varying the number of threads. The number of threads utilised by the application was adjusted using the command line and \texttt{OpenMP}'s \texttt{omp\_set\_num\_threads()} function. For benchmarking, the project was compiled with the \texttt{-O3} optimisation flag and was executed 103 times with different thread counts; the runtime and speedup plots were derived from the average of the 100 timing results stored in the \texttt{.txt} file. The method of data collection largely mirrors the process shown in Figure~\ref{fig:results_collection}, with the exception that the \texttt{taskset} command was not employed to adjust the number of CPU cores; instead, thread adjustments were made through the command line. The detailed code snippet of the command line arguments can be found in the [Appendix Listing~\ref{lst:mobilenet_command_line_arguments}].
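As a rough illustration of how the thread-count argument can be wired to \texttt{OpenMP} (a sketch only; the actual argument parsing used in the project is given in the Appendix), the relevant call is \texttt{omp\_set\_num\_threads()}:
\begin{lstlisting}[
caption={Illustrative sketch of setting the thread count from a command line argument with \texttt{omp\_set\_num\_threads()}.},
label={lst:omp_threads_sketch}
]
#include <cstdlib>
#include <omp.h>

int main(int argc, char* argv[]) {
    // third command line argument: number of threads (position is illustrative)
    int num_threads = (argc > 3) ? std::atoi(argv[3]) : 1;
    omp_set_num_threads(num_threads); // subsequent parallel regions use this count
    // ... load the model, classify the image(s) and record the runtime ...
    return 0;
}
\end{lstlisting}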
\section{Objective 3: \texttt{DeBaTE-FI} platform}
% Talk about profiling the application
% C++ libraries for telnetpp and integration into the Debate-FI via Pybind-11
% Multi-processing and mutli-threaidng design improvement
The application was profiled to find bottlenecks and possible points of optimisation. The \texttt{py-spy} tool was used to profile the application and save the output as a \texttt{.json} file\cite{py-spy}. The results were then viewed using the \texttt{speedscope} web application\cite{speedscope_app}. Since the application used multi-processing, the relevant processes were identified and their process IDs (\texttt{pid}) were used to run \texttt{py-spy}, as shown in Listing~\ref{lst:py_spy_application}:
\begin{lstlisting}[
caption={Running \texttt{py-spy} on the specified process.},
label={lst:py_spy_application}
]
py-spy record --pid <PID> --format speedscope -o profile.json
\end{lstlisting}
Profiling results showed that the functions from \texttt{Python}'s \texttt{telnetlib} library had the highest self-cost. To optimise the application, an open-source \texttt{C++} \texttt{Telnet} library was utilised. This library, called \texttt{telnetpp}\cite{telnetpp_library}, along with \texttt{serverpp}\cite{serverpp_library}, was integrated into the application using \texttt{pybind11}, a lightweight library for integrating \texttt{C++} into \texttt{Python} applications\cite{pybind11}. The \texttt{telnetpp} library required some supporting functions before it could be integrated into the application; thus, a wrapper class in \texttt{C++} was created that allowed \texttt{telnetpp} to seamlessly replace the previous \texttt{telnetlib} \texttt{Python} library. The wrapper \texttt{C++} class was named \texttt{telnetlibcpp} for consistency and is shown in the UML class diagram in Figure~\ref{fig:telnetlibcpp_UML}; an illustrative binding sketch follows the figure. The main functions used, \texttt{Readout()}, \texttt{Exec()}, and \texttt{write()}, have exactly the same names as the functions from the \texttt{Python} library. Profiling results and detailed code snippets of the \texttt{C++} functions can be found in [Appendix~\ref{sec:app_obj3}].
\begin{figure}[htbp] % Positioning preference: here, top, bottom, page
\centering
\includegraphics[width=1\textwidth, height=15cm]{figures/telnetlib_C++_class.png} % Adjust the path and width as needed
\caption{\texttt{C++} class to emulate \texttt{Python's} \texttt{telnetlib} library functions.}
\label{fig:telnetlibcpp_UML} % Use this label to reference the figure
\end{figure}
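The bindings themselves take roughly the following form; this is a minimal sketch with stub bodies, and the class layout is an assumption, whereas the actual wrapper that delegates to \texttt{telnetpp}/\texttt{serverpp} is given in the Appendix.
\begin{lstlisting}[
caption={Illustrative \texttt{pybind11} binding sketch for a \texttt{telnetlibcpp}-style wrapper class (stub bodies only).},
label={lst:pybind11_binding_sketch}
]
#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

// Hypothetical stand-in for the real wrapper, which delegates to telnetpp/serverpp.
class TelnetLibCpp {
public:
    TelnetLibCpp(const std::string& host, int port) : m_host(host), m_port(port) {}
    std::string Readout() { return {}; }                     // read pending output
    std::string Exec(const std::string& /*cmd*/) { return {}; } // send command, return reply
    void write(const std::string& /*data*/) {}               // send raw data
private:
    std::string m_host;
    int m_port;
};

// Expose the wrapper to Python as a module named telnetlibcpp
PYBIND11_MODULE(telnetlibcpp, m) {
    py::class_<TelnetLibCpp>(m, "Telnet")
        .def(py::init<const std::string&, int>())
        .def("Readout", &TelnetLibCpp::Readout)
        .def("Exec",    &TelnetLibCpp::Exec)
        .def("write",   &TelnetLibCpp::write);
}
\end{lstlisting}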
Using \texttt{CMake} and \texttt{pybind11}, the \texttt{C++} class was compiled into a shared library (\texttt{.so}) file, which was then moved into the same directory as the application. This allowed the \texttt{Python} application to call the \texttt{C++} functions and thereby use the \texttt{telnetpp} library.
Another solution was developed which aimed to optimise the application's multi-processing and multi-threading design. A simplified version of the original application architecture is shown in the UML sequence diagram below (Figure~\ref{fig:original_application_arch}):
\begin{figure}[htbp] % Positioning preference: here, top, bottom, page
\centering
\includegraphics[width=1\textwidth, height=15cm]{figures/original_process_design.png} % Adjust the path and width as needed
\caption{The sequence of the original application architecture. CPU Core 2 has to split CPU time between handling serial port data and sending commands.}
\label{fig:original_application_arch} % Use this label to reference the figure
\end{figure}
Figure~\ref{fig:original_application_arch} summarises how the original application's convoluted design led to poor performance:
\begin{enumerate}
\item Two classes \texttt{device} and \texttt{deviceControl} spawned processes.
\item The \texttt{deviceControl} process had its workload split between sending commands to the MCU and saving serial port data to a \texttt{.txt} file.
\item The \texttt{device} process appeared to perform the redundant task of saving serial port data into a queue.
\end{enumerate}
This design was improved as follows:
\begin{enumerate}
\item Two classes \texttt{device} and \texttt{deviceControl} spawn processes with each having their own designated tasks.
\item The \texttt{deviceControl} process now only sends data to the MCU.
\item The \texttt{device} process only focuses on receiving data from the serial port and saving it to the \texttt{.txt} file.
\end{enumerate}
The key to the improved design is removing the redundant work: previously, both processes handled the serial port data, a task that requires only one process. When two threads are used inside a process, CPU time is split between them. In the improved design, each process performs only one task and, more importantly, the \texttt{deviceControl} process has a reduced workload, so its CPU time is fully allocated to sending commands to the MCU. The challenge was to keep the two processes synchronised. This was accomplished using the \texttt{Event()} object from \texttt{Python}'s \texttt{multiprocessing} library. When the \texttt{deviceControl} process started, it signalled the \texttt{device} process to start checking and storing serial port data; when the \texttt{deviceControl} process finished, it signalled the \texttt{device} process to stop, at which point the accumulated serial port data was saved to the \texttt{.txt} file. This is visualised in the UML sequence diagram in Figure~\ref{fig:improved_application_design}:
\begin{figure}[htbp] % Positioning preference: here, top, bottom, page
\centering
\includegraphics[width=1\textwidth, height=15cm]{figures/improved_process_design.png} % Adjust the path and width as needed
\caption{The improved multi-processing design of the application. CPU Core 2 now allocates all the CPU time to sending commands to the MCU.}
\label{fig:improved_application_design} % Use this label to reference the figure
\end{figure}
Results were collected using four \texttt{STM32F767ZI} MCUs for both of these solutions. In collaboration with Alex Henneman, results were also collected on the main setup (see Figure~\ref{fig:debate_fi_setup}), which utilised 36 MCUs running the second solution, to verify the performance observed on the local setup. To collect benchmark data, the application was run only once with a varying number of boards and the run time was saved to a \texttt{.csv} file. The run times usually spanned minutes, making multiple runs extremely time-consuming; moreover, the variation in run times was negligible, unlike the applications in the first two objectives, where run times were in milliseconds and more prone to variation. The benchmark collection strategy was otherwise similar to the one shown in Figure~\ref{fig:results_collection}, except that the application was run only once and the times were saved in a \texttt{.csv} file instead of a \texttt{.txt} file.
\begin{figure}[htbp] % Positioning preference: here, top, bottom, page
\centering
\includegraphics[width=1\textwidth, height=15cm]{figures/debate_fi_setup.png} % Adjust the path and width as needed
\caption{\texttt{DeBaTE-FI} platform application on the main setup using 36 \texttt{STM32} boards\cite{debate_fi_publication}.}
\label{fig:debate_fi_setup} % Use this label to reference the figure
\end{figure}