[BUG] regexp: inconsistent number of tokens produced by string `split` and word boundaries `\b` and `\B` #11102

anthony-chang · 2022-06-13T19:06:48Z

Describe the bug
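The number of tokens differs between cuDF and Python (pandas) when `split` is used with the `\b` and `\B` word-boundary patterns: cuDF omits the trailing empty token that pandas returns when the pattern matches at the end of the string.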

Steps/Code to reproduce bug

>>> import pandas as pd
>>> import cudf
>>> cudf.Series(['a', 'ab', '-+']).str.split(r'\b', regex=True)
0     [, a]
1    [, ab]
2      [-+]
dtype: list
>>> pd.Series(['a', 'ab', '-+']).str.split(r'\b', regex=True)
0     [, a, ]
1    [, ab, ]
2        [-+]
dtype: object

>>> cudf.Series(['a', 'ab', '-+']).str.split(r'\B', regex=True)
0         [a]
1      [a, b]
2    [, -, +]
dtype: list
>>> pd.Series(['a', 'ab', '-+']).str.split(r'\B', regex=True)
0           [a]
1        [a, b]
2    [, -, +, ]
dtype: object

Expected behavior
I expect cuDF to produce the same number of tokens as pandas for each input string.
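
For reference, the pandas results above appear to match what Python's built-in `re.split` returns. The following is a minimal sketch, assuming Python 3.7 or later (where `re.split` splits on empty matches). The helper name `token_counts` is hypothetical and only used for illustration:

```python
import re

# Same inputs as in the reproduction steps above.
samples = ['a', 'ab', '-+']

def token_counts(pattern, strings):
    """Return the number of tokens re.split produces for each string."""
    return [len(re.split(pattern, s)) for s in strings]

print(token_counts(r'\b', samples))  # [3, 3, 1]
print(token_counts(r'\B', samples))  # [1, 2, 4]
```

These counts agree with the pandas output in the report, while the cuDF output has one token fewer wherever the pattern matches at the end of the string.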

Environment overview (please complete the following information)

Environment location: bare-metal
Method of cuDF install: miniconda

Environment details


 ***git***
 Not inside a git repository

 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=18.04
 DISTRIB_CODENAME=bionic
 DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
 NAME="Ubuntu"
 VERSION="18.04.5 LTS (Bionic Beaver)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 18.04.5 LTS"
 VERSION_ID="18.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=bionic
 UBUNTU_CODENAME=bionic
 Linux c240m5-01 5.4.0-109-generic #123~18.04.1-Ubuntu SMP Fri Apr 8 09:48:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

 ***GPU Information***
 Mon Jun 13 12:05:43 2022
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  Tesla T4            On   | 00000000:19:00.0 Off |                    0 |
 | N/A   43C    P0    26W /  70W |   2535MiB / 15109MiB |     12%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   1  Tesla T4            On   | 00000000:5E:00.0 Off |                    0 |
 | N/A   45C    P0    27W /  70W |   1252MiB / 15109MiB |     12%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   2  Tesla T4            On   | 00000000:86:00.0 Off |                    0 |
 | N/A   43C    P0    27W /  70W |   1252MiB / 15109MiB |     12%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   3  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
 | N/A   44C    P0    27W /  70W |  10079MiB / 15109MiB |     12%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+

 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |    0   N/A  N/A      1924      G   /usr/lib/xorg/Xorg                  4MiB |
 |    0   N/A  N/A     11278      C   /opt/conda/bin/python            1283MiB |
 |    0   N/A  N/A     45995      C   python                           1245MiB |
 |    1   N/A  N/A      1924      G   /usr/lib/xorg/Xorg                  4MiB |
 |    1   N/A  N/A     19372      C   python                           1245MiB |
 |    2   N/A  N/A      1924      G   /usr/lib/xorg/Xorg                  4MiB |
 |    2   N/A  N/A     19372      C   python                           1245MiB |
 |    3   N/A  N/A      1924      G   /usr/lib/xorg/Xorg                  4MiB |
 |    3   N/A  N/A     11278      C   /opt/conda/bin/python            8827MiB |
 |    3   N/A  N/A     45995      C   python                           1245MiB |
 +-----------------------------------------------------------------------------+

 ***CPU***
 Architecture:        x86_64
 CPU op-mode(s):      32-bit, 64-bit
 Byte Order:          Little Endian
 CPU(s):              72
 On-line CPU(s) list: 0-71
 Thread(s) per core:  2
 Core(s) per socket:  18
 Socket(s):           2
 NUMA node(s):        2
 Vendor ID:           GenuineIntel
 CPU family:          6
 Model:               85
 Model name:          Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
 Stepping:            4
 CPU MHz:             1200.042
 CPU max MHz:         3700.0000
 CPU min MHz:         1200.0000
 BogoMIPS:            6000.00
 Virtualization:      VT-x
 L1d cache:           32K
 L1i cache:           32K
 L2 cache:            1024K
 L3 cache:            25344K
 NUMA node0 CPU(s):   0-17,36-53
 NUMA node1 CPU(s):   18-35,54-71
 Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d

 ***CMake***

 ***g++***
 /usr/bin/g++
 g++ (Ubuntu 9.3.0-11ubuntu0~18.04.1) 9.3.0
 Copyright (C) 2019 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


 ***nvcc***

 ***Python***
 /home/antchang/miniconda3/envs/cudf/bin/python
 Python 3.9.13

 ***Environment Variables***
 PATH                            : /home/antchang/miniconda3/envs/cudf/bin:/home/antchang/.poetry/bin:/home/antchang/miniconda3/condabin:/home/antchang/.pyenv/shims:/home/antchang/.pyenv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/antchang/spark/bin:/home/antchang/spark/sbin
 LD_LIBRARY_PATH                 :
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    : /home/antchang/miniconda3/envs/cudf
 PYTHON_PATH                     :

 ***conda packages***
 /home/antchang/miniconda3/condabin/conda
 # packages in environment at /home/antchang/miniconda3/envs/cudf:
 #
 # Name                    Version                   Build  Channel
 _libgcc_mutex             0.1                 conda_forge    conda-forge
 _openmp_mutex             4.5                       2_gnu    conda-forge
 abseil-cpp                20210324.2           h9c3ff4c_0    conda-forge
 arrow-cpp                 7.0.0           py39he577829_7_cuda    conda-forge
 arrow-cpp-proc            3.0.0                      cuda    conda-forge
 aws-c-cal                 0.5.11               h95a6274_0    conda-forge
 aws-c-common              0.6.2                h7f98852_0    conda-forge
 aws-c-event-stream        0.2.7               h3541f99_13    conda-forge
 aws-c-io                  0.10.5               hfb6a706_0    conda-forge
 aws-checksums             0.1.11               ha31a3da_7    conda-forge
 aws-sdk-cpp               1.8.186              hb4091e7_3    conda-forge
 bzip2                     1.0.8                h7f98852_4    conda-forge
 c-ares                    1.18.1               h7f98852_0    conda-forge
 ca-certificates           2022.5.18.1          ha878542_0    conda-forge
 cachetools                5.0.0              pyhd8ed1ab_0    conda-forge
 cuda-python               11.7.0           py39h3fd9d12_0    nvidia
 cudatoolkit               11.6.0               habf752d_9    nvidia
 cudf                      22.06.00a220531 cuda_11_py39_gd0b4e3032c_317    rapidsai-nightly
 cupy                      10.5.0           py39hc3c280e_0    conda-forge
 dlpack                    0.5                  h9c3ff4c_0    conda-forge
 fastavro                  1.4.12           py39hb9d737c_0    conda-forge
 fastrlock                 0.8              py39h5a03fae_2    conda-forge
 fsspec                    2022.5.0           pyhd8ed1ab_0    conda-forge
 gflags                    2.2.2             he1b5a44_1004    conda-forge
 glog                      0.6.0                h6f12383_0    conda-forge
 grpc-cpp                  1.45.2               hd8f4eba_3    conda-forge
 keyutils                  1.6.1                h166bdaf_0    conda-forge
 krb5                      1.19.3               h3790be6_0    conda-forge
 ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
 libblas                   3.9.0           14_linux64_openblas    conda-forge
 libbrotlicommon           1.0.9                h166bdaf_7    conda-forge
 libbrotlidec              1.0.9                h166bdaf_7    conda-forge
 libbrotlienc              1.0.9                h166bdaf_7    conda-forge
 libcblas                  3.9.0           14_linux64_openblas    conda-forge
 libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
 libcudf                   22.06.00a220531 cuda11_gd0b4e3032c_317    rapidsai-nightly
 libcurl                   7.83.1               h7bff187_0    conda-forge
 libedit                   3.1.20191231         he28a2e2_2    conda-forge
 libev                     4.33                 h516909a_1    conda-forge
 libevent                  2.1.10               h9b69904_4    conda-forge
 libffi                    3.4.2                h7f98852_5    conda-forge
 libgcc-ng                 12.1.0              h8d9b700_16    conda-forge
 libgfortran-ng            12.1.0              h69a702a_16    conda-forge
 libgfortran5              12.1.0              hdcd56e2_16    conda-forge
 libgomp                   12.1.0              h8d9b700_16    conda-forge
 libgoogle-cloud           1.40.2               habd0e3a_0    conda-forge
 liblapack                 3.9.0           14_linux64_openblas    conda-forge
 libllvm11                 11.1.0               hf817b99_3    conda-forge
 libnghttp2                1.47.0               h727a467_0    conda-forge
 libnsl                    2.0.0                h7f98852_0    conda-forge
 libopenblas               0.3.20          pthreads_h78a6416_0    conda-forge
 libprotobuf               3.20.1               h6239696_0    conda-forge
 librmm                    22.06.00a220531 cuda11_g914cb4c8_75    rapidsai-nightly
 libssh2                   1.10.0               ha56f1ee_2    conda-forge
 libstdcxx-ng              12.1.0              ha89aaad_16    conda-forge
 libthrift                 0.16.0               h519c5ea_1    conda-forge
 libutf8proc               2.7.0                h7f98852_0    conda-forge
 libuuid                   2.32.1            h7f98852_1000    conda-forge
 libzlib                   1.2.12               h166bdaf_0    conda-forge
 llvmlite                  0.38.1           py39h7d9a04d_0    conda-forge
 lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
 ncurses                   6.3                  h27087fc_1    conda-forge
 numba                     0.55.1           py39h66db6d7_1    conda-forge
 numpy                     1.21.6           py39h18676bf_0    conda-forge
 nvtx                      0.2.3            py39h3811e60_1    conda-forge
 openssl                   1.1.1o               h166bdaf_0    conda-forge
 orc                       1.7.3                h6c59b99_1    conda-forge
 packaging                 21.3               pyhd8ed1ab_0    conda-forge
 pandas                    1.4.2            py39h1832856_2    conda-forge
 parquet-cpp               1.5.1                         2    conda-forge
 pip                       22.1.2             pyhd8ed1ab_0    conda-forge
 protobuf                  3.20.1           py39h5a03fae_0    conda-forge
 ptxcompiler               0.2.0            py39h107f55c_0    rapidsai-nightly
 pyarrow                   7.0.0           py39h1ed2e5d_7_cuda    conda-forge
 pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
 python                    3.9.13          h9a8a25e_0_cpython    conda-forge
 python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
 python_abi                3.9                      2_cp39    conda-forge
 pytz                      2022.1             pyhd8ed1ab_0    conda-forge
 re2                       2022.04.01           h27087fc_0    conda-forge
 readline                  8.1                  h46c0cb4_0    conda-forge
 rmm                       22.06.00a220531 cuda11_py39_g914cb4c8_75    rapidsai-nightly
 s2n                       1.0.10               h9b69904_0    conda-forge
 setuptools                62.3.2           py39hf3d152e_0    conda-forge
 six                       1.16.0             pyh6c4a22f_0    conda-forge
 snappy                    1.1.9                hbd366e4_1    conda-forge
 spdlog                    1.8.5                h4bd325d_1    conda-forge
 sqlite                    3.38.5               h4ff8645_0    conda-forge
 tk                        8.6.12               h27826a3_0    conda-forge
 typing_extensions         4.2.0              pyha770c72_1    conda-forge
 tzdata                    2022a                h191b570_0    conda-forge
 wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
 xz                        5.2.5                h516909a_1    conda-forge
 zlib                      1.2.12               h166bdaf_0    conda-forge
 zstd                      1.5.2                h8a70e8d_1    conda-forge

Additional context
None


Closes #11102

Fixes the matching logic for detecting word boundaries `\b` and `\B` at the end of a string in `split_re()` and `split_record_re()` as well as the internal `count_matches()` function. Additional specific gtests are included to check the corrected behavior.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Bradley Dice (https://github.com/bdice)

URL: #11106
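
To illustrate why a match at the end of the string matters, here is a rough Python sketch of splitting on regex matches. It is not the libcudf implementation: `split_on_matches` and its `include_end` flag are hypothetical names, and the flag simply skips a zero-width match at the end of the string so that the output resembles what was reported for cuDF before the fix.

```python
import re

def split_on_matches(pattern, text, include_end=True):
    """Split text at every match of pattern (illustrative only).

    With include_end=False, a zero-width match located at the very end
    of the string is ignored, which drops the trailing empty token.
    """
    tokens = []
    last = 0
    for m in re.finditer(pattern, text):
        is_empty_at_end = m.start() == m.end() == len(text)
        if is_empty_at_end and not include_end:
            continue
        tokens.append(text[last:m.start()])
        last = m.end()
    # Whatever follows the last match is the final token.
    tokens.append(text[last:])
    return tokens

for s in ['a', 'ab', '-+']:
    print(split_on_matches(r'\b', s), split_on_matches(r'\b', s, include_end=False))
```

With `include_end=True` the tokens agree with `re.split` for the inputs in this issue. With `include_end=False` the trailing empty token disappears, as in the cuDF output reported above.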

anthony-chang added the Needs Triage and bug labels Jun 13, 2022

anthony-chang mentioned this issue Jun 13, 2022

[BUG] regexp: word boundaries \b and \B inconsistent with Java/Python around _ #11062

Closed

davidwendt self-assigned this Jun 13, 2022

davidwendt mentioned this issue Jun 14, 2022

Fix split_re matching logic for word boundaries #11106

Merged

anthony-chang mentioned this issue Jun 22, 2022

[BUG] Inconsistent handling of word boundaries \b and \B with StringSplit for regular expressions NVIDIA/spark-rapids#5478

Open

rapids-bot bot closed this as completed in #11106 Jun 28, 2022

bdice removed the Needs Triage label Mar 4, 2024
