Spark 2.2.0: problem reading multi-line data inside double-quoted CSV cells #7

Open
lqshow opened this issue Jan 7, 2018 · 0 comments

lqshow commented Jan 7, 2018

Overview

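Spark versions before 2.2.0 cannot correctly read CSV files whose cells contain multi-line values (embedded LF). Spark 2.2.0 fixed this by adding the multiLine option, but once that option is enabled the encoding option stops taking effect, so Chinese text stored in a non-UTF-8 encoding is read back as garbled characters.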
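To make the garbling concrete, here is a minimal sketch that uses only the JDK, not Spark. It assumes the effect of the ignored encoding option is equivalent to decoding GB18030 bytes as UTF-8; the class name EncodingMismatchDemo and the sample string are made up for illustration.

```java
import java.nio.charset.Charset;

public class EncodingMismatchDemo {
    // Decode raw bytes with the named charset.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character word for "Chinese" (hypothetical sample cell)
        String original = "\u4e2d\u6587";
        // Bytes as they would sit in a GB18030-encoded CSV file
        byte[] raw = original.getBytes(Charset.forName("GB18030"));

        // Wrong charset: what a reader does when the encoding setting is ignored
        String wrong = decode(raw, "UTF-8");
        // Right charset: the one the file was actually written in
        String right = decode(raw, "GB18030");

        System.out.println("decoded as UTF-8 matches original:   " + original.equals(wrong));
        System.out.println("decoded as GB18030 matches original: " + original.equals(right));
    }
}
```

Any workaround therefore has to get the real charset to whatever component turns bytes into characters, which is what both solutions below do.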

Issue

Solution

  • Use newAPIHadoopFile with CSVInputFormat

Apache Crunch Maven dependency

<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-core</artifactId>
  <version>0.15.0</version>
</dependency>
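// Needs org.apache.crunch.io.text.csv.CSVInputFormat and org.apache.hadoop.io.LongWritable / Text
Configuration conf = new Configuration();
// Charset that CSVInputFormat should use to decode the file
conf.set("csv.inputfileencoding", "gb18030");

// Key = offset, value = one logical CSV record (quoted line breaks kept inside the record).
// The key/value classes are passed explicitly instead of null.
JavaRDD<String> records = jsc.newAPIHadoopFile("/Users/linqiong/Downloads/bb_gbk.csv",
  CSVInputFormat.class, LongWritable.class, Text.class, conf)
  .map(s -> s._2().toString());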
  • Read the file as binary and parse it with a CSV library

opencsv maven dependency

<dependency>
  <groupId>com.opencsv</groupId>
  <artifactId>opencsv</artifactId>
  <version>4.1</version>
</dependency>
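// Needs com.opencsv.CSVReader, org.apache.spark.input.PortableDataStream,
// org.apache.spark.sql.Row / RowFactory, plus java.io and java.util classes.
JavaRDD<Row> rowRDD = jsc.binaryFiles("/Users/linqiong/Downloads/bb_gbk.csv")
    .flatMap(file -> {
        // binaryFiles yields one (path, stream) pair per file
        PortableDataStream ds = file._2();
        DataInputStream dis = ds.open();
        // All records of one file are buffered in memory on a single executor
        List<String[]> data = new ArrayList<>();
        // Decode with the file's real charset; the charset name must be a String
        try (CSVReader reader = new CSVReader(new BufferedReader(new InputStreamReader(dis, "GB18030")))) {
            String[] nextLine;
            while ((nextLine = reader.readNext()) != null) {
                // nextLine[] holds the values of one CSV record, quoted line breaks included
                data.add(nextLine);
            }
        }
        return data.iterator();
    }).map(fields -> RowFactory.create((Object[]) fields));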
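
Both workarounds depend on a reader that treats a line break inside double quotes as part of the cell rather than as the end of a record. The sketch below shows that idea in plain Java. It is a simplified illustration, not the actual logic of opencsv or Crunch, and the class name QuotedRecordSplitter and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedRecordSplitter {
    // Split CSV text into records; LF inside double quotes is kept as data.
    public static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // Every quote toggles the state; an escaped quote ("") toggles twice and nets out
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // An unquoted LF ends the record; drop the CR of a CRLF ending
                int len = current.length();
                if (len > 0 && current.charAt(len - 1) == '\r') {
                    current.setLength(len - 1);
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // Keep a final record that has no trailing line break
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,note\n1,\"first line\nsecond line\"\n2,plain\n";
        System.out.println("naive split on LF: " + csv.split("\n").length + " pieces");
        System.out.println("quote-aware split: " + splitRecords(csv).size() + " records");
    }
}
```

A plain line-based reader cuts the quoted cell in two, which is the behavior seen in Spark before 2.2.0; the quote-aware pass keeps it in one record.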

Reference

@lqshow lqshow added the spark label Jan 7, 2018