"2:20pm. Done with the bath, breakfast and chores. Also I had my fun …
…with the Shagri-la Frontier manga. I am not entirely caught up, but I have a bit left before I am caught up with official translations. After that I'll have 60 chapters of unofficial translations to go through. Yeah, this is the life. Why do I have to work when I am sick and my hand is still injured? I don't. But, I want to a bit every day, so I am going to do so today as well. ...Right now I am looking at HNHired and TS jobs exceed C# by 10x. Tsk. Anyway, that is not what I wanted to talk about. I remembered what my goal was. I have great plans, but they'll also go under if I cannot clear the very first hurdle. I need to be able to clear 100 million hands per seconds for the NL Holdem game with a simple linear model. In 2021, I got 10-100k hands per second for the program written in Python + PyTorch. If I cannot get 1,000x that with a beefy GPU suited for the task, I am just wasting my time here. I can forget about ML and go back to webdev and writing sci-fi. If I can get 100m and more, then I can start thinking about doing this quest for real. I have the game, and once I add matmult, I'll have a linear model. I don't even need a softmax, I'll just use an argmax reduction to select the action. Who cares. It does't matter whether the game is correct, or whether the player is good. The most important thing at this juncture would be give myself a reality check. I can do a ton of stuff with my programming skills, but none of that matters unless the hardware supports it. 2:40pm. It is really a pain to write this out, but let me continue doing it. Yesterday, I studied the Cuda matmult samples and they were very informative. If cuBLASDx was't released 1.5 weeks ago, I'd be starting work on translating them into Spiral. But since it was, for the next few days, I'll be studying its examples as well, before trying out the library. 2:45pm. https://docs.nvidia.com/cuda/cublasdx/examples.html I already went through the docs for this, so now it is time that I finally download the library and check these out. https://developer.nvidia.com/mathdx ...Why the hell does this only support Linux? Can I use it from WSL? https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl It seems I can. Ok, let me busy myself with this for a bit. https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl 2:55pm. It seems that all of this wouldn't have worked with my old GTX 970. ...Ok, enough of me writing this journal. It's putting strain on my hand. > From this point you should be able to run any existing Linux application which requires CUDA. Do not install any driver within the WSL environment. For building a CUDA application, you will need CUDA Toolkit. Read the next section for further information. Let me restart. The WSL should be installed right now. 3:15pm. https://docs.nvidia.com/cuda/cufftdx/installation.html You know, even though I've installed WSL as my first move, I should instead try this library out in Windows. Since it is a Cuda device specific library it is unlikely for it to depend on Linux specific functionality. Why would it? 3:50 PM. Ok. It seems I did something stupid. I wrote in this journal, and now my right hand is **** **** I don't think I'll be doing any programming today. 
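To make the linear-model idea from earlier concrete, here is a minimal sketch in plain CUDA C++ of the "matmult + argmax" step: one thread per hand, each scoring every action against a weight matrix and picking the best index. The names and sizes (`policy_argmax`, `FEATURES`, `ACTIONS`, row-major weights) are illustrative placeholders, not the actual game's layout.

```cpp
// Sketch of "linear model + argmax" action selection: one thread per hand,
// each thread scores every action against a weight matrix and writes the index
// of the best one. All names and sizes here are assumed placeholders.
#include <cuda_runtime.h>
#include <math_constants.h>

constexpr int FEATURES = 64; // assumed per-hand feature count
constexpr int ACTIONS  = 4;  // e.g. fold / call / raise / all-in

__global__ void policy_argmax(const float* __restrict__ features, // [n_hands * FEATURES]
                              const float* __restrict__ weights,  // [ACTIONS * FEATURES]
                              int*         __restrict__ actions,  // [n_hands]
                              int n_hands) {
    int hand = blockIdx.x * blockDim.x + threadIdx.x;
    if (hand >= n_hands) return;

    const float* f = features + hand * FEATURES;
    float best_score  = -CUDART_INF_F;
    int   best_action = 0;
    for (int a = 0; a < ACTIONS; ++a) {
        float score = 0.0f;
        for (int i = 0; i < FEATURES; ++i)
            score += weights[a * FEATURES + i] * f[i];
        if (score > best_score) { best_score = score; best_action = a; }
    }
    // No softmax needed; the argmax index is the chosen action.
    actions[hand] = best_action;
}
```

In a real run the matmult part would go through cuBLASDx or a tiled kernel rather than this naive loop, but the model side of the benchmark really is this small; the question is purely whether the game simulation plus this can clear 100 million hands per second.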
```cpp
#include <iostream>
#include <vector>

#include <cuda_runtime_api.h>
#include <cublasdx.hpp>

#include "common.hpp"
#include "block_io.hpp"
#include "reference.hpp"

template<class BLAS, class ValueType = typename BLAS::value_type>
__launch_bounds__(BLAS::max_threads_per_block) //
    __global__                                 //
    void gemm_kernel(const ValueType* a,
                     const ValueType* b,
                     const ValueType* c,
                     const ValueType  alpha,
                     const ValueType  beta,
                     ValueType*       output) {
    using value_type = ValueType;
    extern __shared__ __align__(16) char smem[];

    constexpr unsigned int block_size = BLAS::block_dim.x * BLAS::block_dim.y * BLAS::block_dim.z;

    value_type* smem_a = reinterpret_cast<value_type*>(smem);
    value_type* smem_b = reinterpret_cast<value_type*>(smem) + BLAS::a_size;
    value_type* smem_c = reinterpret_cast<value_type*>(smem) + BLAS::a_size + BLAS::b_size;

    example::io<BLAS>::a_fast_load<block_size>(smem_a, a);
    example::io<BLAS>::b_fast_load<block_size>(smem_b, b);
    example::io<BLAS>::c_fast_load<block_size>(smem_c, c);
    __syncthreads();

    BLAS().execute(alpha, smem_a, smem_b, beta, smem_c);
    __syncthreads();

    example::io<BLAS>::c_fast_store<block_size>(output, smem_c);
}

// This is an example of fp32 general matrix-matrix multiplication (GEMM) performed
// in a single CUDA block:
//
//     C = alpha * A * B + beta * C
//
// * A, B, and C are matrices containing real single precision floating-point values.
// * alpha and beta are real single precision floating-point values.
//
// Input data is generated on host using random number generators, and later copied to
// the global memory. Next, kernel with GEMM is executed, and then the matrix C (the result)
// is copied back to host memory. The results are verified against cuBLAS.
//
// In this example the number of threads participating in the GEMM operation is imposed by providing
// BlockDim operator in definition of the GEMM. If BlockDim operator is not used, cuBLASDx automatically
// selects number of threads. Block dimensions are provided via BLAS::block_dim trait.
template<unsigned int Arch>
int simple_gemm() {
    // Parameters m, n, k define the dimensions of matrices A, B, and C
    constexpr unsigned int m = 32;
    constexpr unsigned int n = 16;
    constexpr unsigned int k = 64;

    // If matrix A is not transposed its logical dimensions are: [m, k] (m rows, k columns)
    // If matrix B is not transposed its logical dimensions are: [k, n]
    // If matrix A is transposed its logical dimensions are: [k, m]
    // If matrix B is transposed its logical dimensions are: [n, k]
    // The dimensions of matrix C are: [m, n]
    constexpr auto a_transpose_mode = cublasdx::transpose_mode::non_transposed;
    constexpr auto b_transpose_mode = cublasdx::transpose_mode::transposed;

    // Selected CUDA block size (1D)
    constexpr unsigned int block_size = 256;

    // GEMM definition using cuBLASDx operators:
    // 1. The size, the precision, and the type (real or complex) are set.
    // 2. The BLAS function is selected: MM (matrix multiplication).
    // 3. The transpose modes of A and B matrices are set.
    // 4. Block operator informs that GEMM should be performed on CUDA block level.
    // 5. BlockDim operator sets CUDA block dimensions that the kernel will be executed with.
    // 6. Targeted CUDA compute capability is selected with SM operator.
    using BLAS = decltype(cublasdx::Size<m, n, k>() + cublasdx::Precision<float>() + cublasdx::Type<cublasdx::type::real>() +
                          cublasdx::Function<cublasdx::function::MM>() + cublasdx::TransposeMode<a_transpose_mode, b_transpose_mode>() +
                          cublasdx::Block() + cublasdx::BlockDim<block_size>() + cublasdx::SM<Arch>());

#if CUBLASDX_EXAMPLE_DETAIL_NVCC_12_2_BUG_WORKAROUND
    using value_type = example::value_type_t<BLAS>;
#else
    using value_type = typename BLAS::value_type;
#endif

    // Allocate managed memory for a, b, c, and output
    value_type* inputs;
    value_type* output;
    // BLAS::a_size/b_size/c_size include padding (take into account the leading dimension if set)
    auto inputs_size       = BLAS::a_size + BLAS::b_size + BLAS::c_size;
    auto inputs_size_bytes = inputs_size * sizeof(value_type);
    CUDA_CHECK_AND_EXIT(cudaMallocManaged(&inputs, inputs_size_bytes));
    CUDA_CHECK_AND_EXIT(cudaMallocManaged(&output, BLAS::c_size * sizeof(value_type)));
    value_type* a = inputs;
    value_type* b = a + (BLAS::a_size);
    value_type* c = b + (BLAS::b_size);

    value_type alpha = value_type(1.0);
    value_type beta  = value_type(2.0);

    // Fill the A, B, C matrices with random values
    auto host_a = example::get_random_data<value_type>(0.1, 1.0, BLAS::a_size);
    auto host_b = example::get_random_data<value_type>(0.1, 1.0, BLAS::b_size);
    auto host_c = example::get_random_data<value_type>(0.1, 1.0, BLAS::c_size);
    CUDA_CHECK_AND_EXIT(cudaMemcpy(a, host_a.data(), BLAS::a_size * sizeof(value_type), cudaMemcpyHostToDevice));
    CUDA_CHECK_AND_EXIT(cudaMemcpy(b, host_b.data(), BLAS::b_size * sizeof(value_type), cudaMemcpyHostToDevice));
    CUDA_CHECK_AND_EXIT(cudaMemcpy(c, host_c.data(), BLAS::c_size * sizeof(value_type), cudaMemcpyHostToDevice));
    CUDA_CHECK_AND_EXIT(cudaDeviceSynchronize());

    // Increase max dynamic shared memory for the kernel if needed
    CUDA_CHECK_AND_EXIT(
        cudaFuncSetAttribute(gemm_kernel<BLAS>, cudaFuncAttributeMaxDynamicSharedMemorySize, BLAS::shared_memory_size));

    // Execute kernel
    gemm_kernel<BLAS><<<1, BLAS::block_dim, BLAS::shared_memory_size>>>(a, b, c, alpha, beta, output);
    CUDA_CHECK_AND_EXIT(cudaDeviceSynchronize());

    // Copy results back to host
    std::vector<value_type> host_output(BLAS::c_size);
    CUDA_CHECK_AND_EXIT(
        cudaMemcpy(host_output.data(), output, BLAS::c_size * sizeof(value_type), cudaMemcpyDeviceToHost));
    CUDA_CHECK_AND_EXIT(cudaDeviceSynchronize());

    // Free device memory
    CUDA_CHECK_AND_EXIT(cudaFree(inputs));
    CUDA_CHECK_AND_EXIT(cudaFree(output));

    // Calculate reference
    auto reference_host_output = example::reference_gemm<BLAS>(alpha, host_a, host_b, beta, host_c);

    // Check against reference
    if (example::check(host_output, reference_host_output)) {
        std::cout << "Success" << std::endl;
        return 0;
    }
    std::cout << "Failure" << std::endl;
    return 1;
}

template<unsigned int Arch>
struct simple_gemm_functor {
    int operator()() { return simple_gemm<Arch>(); }
};

int main(int, char**) {
    return example::sm_runner<simple_gemm_functor>();
}
```

This is what I am studying right now. I am looking at it, and I am really impressed at how the C crowd cannot create proper tensors. Instead, it's this kind of trash code, where they calculate the tensor dimensions manually.

4:05pm. Well, regardless, it doesn't matter whether their tooling is poor. It's not like I'll be using it myself.
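What I mean by proper tensors: in the kernel above, `smem_a`, `smem_b`, and `smem_c` are all carved out of shared memory by hand from `BLAS::a_size` and `BLAS::b_size`. A tiny view type would be enough to carry the sizes along with the pointers. The sketch below is purely hypothetical and is not anything cuBLASDx provides:

```cpp
// Hypothetical sketch: a minimal device-side view pairing a pointer with its
// extent, so the A/B/C shared-memory regions are carved out in one place
// instead of via repeated manual offset arithmetic. Not part of cuBLASDx.
template<class T>
struct view {
    T*           data;
    unsigned int size;
};

template<class T>
__device__ void carve_smem(char* smem,
                           unsigned int a_size, unsigned int b_size, unsigned int c_size,
                           view<T>& a, view<T>& b, view<T>& c) {
    T* p = reinterpret_cast<T*>(smem);
    a = {p, a_size};                    // A occupies the first a_size elements
    b = {p + a_size, b_size};           // B follows A
    c = {p + a_size + b_size, c_size};  // C follows B
}
```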
https://docs.nvidia.com/cuda/cublasdx/introduction1.html

```cpp
#include <cublasdx.hpp>
using namespace cublasdx;

constexpr auto t_mode = cublasdx::transpose_mode::non_transposed;

using GEMM = decltype(Size<32, 32, 32>()
                    + Precision<double>()
                    + Type<type::real>()
                    + TransposeMode<t_mode, t_mode>()
                    + Function<function::MM>()
                    + SM<700>()
                    + Block());
```

I don't feel like doing much today, but I should do at least one thing, which is to get this fragment to compile. If I can cross this initial hurdle, then I can move on to actually translating this example to Spiral. I really like how VS Code is showing me the source when I click on it.

```spiral
run grids blocks (fun () =>
    global "#include <cublasdx.hpp>"
    global "using namespace cublasdx;"
    global "constexpr auto t_mode = cublasdx::transpose_mode::non_transposed;"
    global "using GEMM = decltype(Size<32, 32, 32>() + Precision<double>() + Type<type::real>() + TransposeMode<t_mode, t_mode>() + Function<function::MM>() + SM<700>() + Block());"
    ()
    )
```

All I need is a little bit of courage. Let me see if this compiles.

```
stdout/stderr:
G:\nvidia-mathdx-24.01.0\nvidia\mathdx\24.01\include\cublasdx\include\detail/system_checks.hpp(61): fatal error C1189: #error: cuBLASDx does not support compilation with MSVC yet
```

I guess I'll have to go through WSL. Let me give it a try.

///

Hello, this is Bing. I can help you with installing pip in Windows Subsystem for Linux (WSL). 😊

Pip is a package manager for Python that allows you to install and manage Python libraries and modules. To install pip in WSL, you need to have Python and WSL installed on your Windows machine. Here are the steps to follow:

1. Open a WSL terminal and update your package lists with `sudo apt update`.
2. Install Python 3 and pip 3 with `sudo apt install python3 python3-pip`. You can check the versions with `python3 --version` and `pip3 --version`.
3. Add the pip 3 path to your .bashrc file with `echo 'export PATH="$PATH:/home/$USER/.local/bin"' >> ~/.bashrc`. This will allow you to use pip 3 without sudo.
4. Reload your .bashrc file with `source ~/.bashrc` or restart your WSL terminal.
5. You can now use pip 3 to install Python packages with `pip3 install [package]`.

For more information, you can refer to these web pages:

- [Set Up Python on Windows Subsystem for Linux (WSL) - Plain English](^1^)
- [Python setup on the Windows subsystem for Linux (WSL)](^4^)

I hope this helps. If you have any other questions, feel free to ask me. 😊

Source: Conversation with Bing, 1/23/2024
(1) Set Up Python on Windows Subsystem for Linux (WSL) - Plain English. https://plainenglish.io/blog/setting-up-python-on-windows-subsystem-for-linux-wsl-26510f1b2d80.
(2) Python setup on the Windows subsystem for Linux (WSL). https://medium.com/@rhdzmota/python-development-on-the-windows-subsystem-for-linux-wsl-17a0fa1839d.
(3) Unable to install pip into WSL (Ubuntu)?. https://askubuntu.com/questions/1384406/unable-to-install-pip-into-wsl-ubuntu.
(4) I can't use pip on WSL Ubuntu · Issue #4020 · microsoft/WSL. microsoft/WSL#4020.
(5) Installing python3-pip to WSL Ubuntu 18.04 from Powershell. https://superuser.com/questions/1524255/installing-python3-pip-to-wsl-ubuntu-18-04-from-powershell.
(6) undefined. http://archive.ubuntu.com/ubuntu.
(7) undefined. http://security.ubuntu.com/ubuntu.
(8) en.wikipedia.org. https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux.

///

I'm going to stop here for the day. My hand is causing me too much discomfort. And so is this **** CUDA library.
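For later, a standalone compile check of that fragment under WSL could look like the sketch below. This is just my sketch, with an assumed (not verified) nvcc invocation; the `static_assert` is only there to force the descriptor to be instantiated.

```cpp
// Standalone compile check for the cuBLASDx GEMM descriptor from the docs.
// Assumed build line (paths and flags are placeholders):
//   nvcc -std=c++17 -arch=sm_70 -I<mathdx>/include check.cu
#include <cublasdx.hpp>
using namespace cublasdx;

constexpr auto t_mode = cublasdx::transpose_mode::non_transposed;

using GEMM = decltype(Size<32, 32, 32>()
                    + Precision<double>()
                    + Type<type::real>()
                    + TransposeMode<t_mode, t_mode>()
                    + Function<function::MM>()
                    + SM<700>()
                    + Block());

int main() {
    // Touching a trait forces the descriptor to be instantiated; if this file
    // compiles under nvcc on WSL, the failure above really is just the MSVC check.
    static_assert(sizeof(typename GEMM::value_type) == sizeof(double),
                  "GEMM is a real double-precision descriptor");
    return 0;
}
```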
I installed WSL, but now I have to install Python, pip and everything else.

> RuntimeError: CuPy failed to load libnvrtc.so.12: OSError: libnvrtc.so.12: cannot open shared object file: No such file or directory

Yeah, I should just stop. The new library, rather than being an asset, is just turning out to be a huge time waster. I installed `pip`, and I installed the CuPy library. But now I'm getting this error when I try to run the script. It doesn't seem likely that I will get this to work using WSL. Until Windows gets supported, it seems like I will be translating those CUDA samples to Spiral, and then using those.

Actually, no, I'm giving up too quickly. I think what I need to do is actually install the CUDA Toolkit for WSL. The docs mentioned that. That has to be it: though WSL has the GPU drivers, it doesn't have the toolkit needed to compile CUDA programs.

Let me close here for real. The pain is a bit too much. A bunch of exercises is no substitute for a healthy hand.