Spark 2.2.0: problem reading CSV cells that contain double-quoted, multi-line data #7

lqshow · 2018-01-07T03:46:38Z

Overview

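Spark versions before 2.2.0 cannot correctly read CSV files in which a cell contains a multi-line value (an embedded LF). Spark 2.2.0 fixes this by adding the multiLine option, but once that option is enabled the encoding option no longer takes effect, so Chinese text stored in a non-UTF-8 encoding is read back as garbled characters.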
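The garbling itself is easy to reproduce outside Spark. The following is a minimal sketch, using only the JDK, of what happens when GB18030 bytes are decoded as UTF-8, which is effectively the result when the encoding setting is ignored. The class and method names (`EncodingMismatchDemo`, `decode`) are made up for illustration and are not part of Spark.

```java
import java.nio.charset.Charset;

class EncodingMismatchDemo {
    // Decode raw bytes using the named charset; malformed input is
    // replaced with U+FFFD rather than raising an error.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character Chinese word for "Chinese".
        String original = "\u4e2d\u6587";
        byte[] gbBytes = original.getBytes(Charset.forName("GB18030"));

        // Decoding with the charset the bytes were written in round-trips.
        String right = decode(gbBytes, "GB18030");
        // Decoding the same bytes as UTF-8 does not.
        String wrong = decode(gbBytes, "UTF-8");

        System.out.println("GB18030 round-trips: " + right.equals(original));
        System.out.println("UTF-8 round-trips:   " + wrong.equals(original));
    }
}
```

Any fix therefore has to get the charset applied before the CSV parser sees the text, which is what both workarounds under Solution do.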

Issue

Solution

Use newAPIHadoopFile with CSVInputFormat

Apache Crunch Maven dependency

<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-core</artifactId>
  <version>0.15.0</version>
</dependency>

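// Imports: org.apache.crunch.io.text.csv.CSVInputFormat, org.apache.hadoop.conf.Configuration,
// org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text
Configuration conf = new Configuration();
// Tell Crunch's CSVInputFormat which charset the file is written in.
conf.set("csv.inputfileencoding", "gb18030");

// CSVInputFormat yields (LongWritable, Text) pairs; each value is one CSV record,
// which may span several physical lines when a quoted cell contains line breaks.
jsc.newAPIHadoopFile("/Users/linqiong/Downloads/bb_gbk.csv",
  CSVInputFormat.class, LongWritable.class, Text.class, conf)
  .map(s -> s._2().toString());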

Work around it by reading the file as binary

opencsv Maven dependency

<dependency>
  <groupId>com.opencsv</groupId>
  <artifactId>opencsv</artifactId>
  <version>4.1</version>
</dependency>

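// Imports: java.io.*, java.util.*, com.opencsv.CSVReader, org.apache.spark.api.java.JavaRDD,
// org.apache.spark.input.PortableDataStream, org.apache.spark.sql.Row, org.apache.spark.sql.RowFactory
JavaRDD<Row> rowRDD = jsc.binaryFiles("/Users/linqiong/Downloads/bb_gbk.csv")
    .flatMap(file -> {
        // Each element is a (path, stream) pair covering one whole file.
        PortableDataStream ds = file._2();
        DataInputStream dis = ds.open();
        List<String[]> data = new ArrayList<>();
        // Decode the bytes as GB18030 before opencsv parses them; the charset name must be a string.
        try (CSVReader reader = new CSVReader(new BufferedReader(new InputStreamReader(dis, "GB18030")))) {
            String[] nextLine;
            while ((nextLine = reader.readNext()) != null) {
                // nextLine[] holds the values of one CSV record, which may span several lines
                data.add(nextLine);
            }
        }
        return data.iterator();
    }).map(fields -> RowFactory.create((Object[]) fields));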
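To make clear why this works where line-based reading does not, here is a rough sketch of the two steps the snippet above delegates to the JDK and opencsv: decode the bytes with the correct charset first, then split records with a parser that tracks whether it is inside double quotes. This is a simplified illustration, not opencsv's implementation; `QuotedCsvSketch` and `parse` are hypothetical names, and the sketch assumes comma separators, `""` as the escaped quote, and no blank lines.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class QuotedCsvSketch {
    // Decode the bytes with the given charset, then split them into records.
    // A line break only ends a record when it appears outside double quotes.
    static List<List<String>> parse(byte[] bytes, String charsetName) {
        String text = new String(bytes, Charset.forName(charsetName));
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (inQuotes) {
                if (ch != '"') {
                    field.append(ch);      // keeps embedded line breaks
                } else if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                    field.append('"');     // "" inside quotes is a literal quote
                    i++;
                } else {
                    inQuotes = false;      // closing quote
                }
            } else if (ch == '"') {
                inQuotes = true;           // opening quote
            } else if (ch == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (ch == '\n') {
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else if (ch != '\r') {
                field.append(ch);          // drop CR so CRLF files also work
            }
        }
        // Flush a final record that has no trailing line break.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Three records; the second has a quoted cell spanning two lines.
        String csv = "id,note\n1,\"line one\nline two\"\n2,\u4e2d\u6587\n";
        byte[] bytes = csv.getBytes(Charset.forName("GB18030"));
        List<List<String>> records = parse(bytes, "GB18030");
        System.out.println(records.size()); // prints 3
    }
}
```

Splitting the same input on every line break would produce four rows instead of three, which is the behavior seen in Spark versions before 2.2.0. In real code opencsv or Crunch should do the parsing; the sketch only shows the order of operations.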

Reference

lqshow added the spark label Jan 7, 2018
