Overview

Versions of Spark below 2.2.0 cannot correctly read CSV files whose cells contain multi-line values (embedded LF characters). Spark 2.2.0 fixes this by adding the multiLine option, but once multiLine is enabled the encoding option is ignored, so Chinese files in a non-UTF-8 encoding (e.g. GBK/GB18030) are read back as garbled text.
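For reference, this is the kind of read that runs into the problem: with multiLine enabled, the encoding option has no effect, so a GBK file comes back as mojibake. The path, header option, and encoding below are placeholders.

// Spark 2.2.0: multiLine fixes embedded line breaks, but the encoding option is
// then ignored, so non-UTF-8 input is decoded incorrectly.
Dataset<Row> df = spark.read()
    .option("header", "true")
    .option("multiLine", "true")
    .option("encoding", "GBK")   // ignored once multiLine is enabled
    .csv("/path/to/gbk_file.csv");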
Issue

In Spark 2.2.0 the option wholeFile was renamed to multiLine, for both CSV and JSON.

Solution
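Both workarounds below bypass Spark's built-in CSV reader, decode the file themselves, and then hand the data back to Spark. They assume an existing SparkSession and JavaSparkContext; a minimal sketch of that setup (the app name and master are placeholders):

SparkSession spark = SparkSession.builder()
    .appName("gbk-csv")      // placeholder application name
    .master("local[*]")      // placeholder master
    .getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());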
Apache crunch maven dependency

<dependency>
    <groupId>org.apache.crunch</groupId>
    <artifactId>crunch-core</artifactId>
    <version>0.15.0</version>
</dependency>
// Read the GB18030-encoded CSV with Crunch's CSVInputFormat, which understands
// multi-line (quoted LF) records and honours the csv.inputfileencoding setting.
Configuration conf = new Configuration();
conf.set("csv.inputfileencoding", "gb18030");
JavaRDD<String> records = jsc
    .newAPIHadoopFile("/Users/linqiong/Downloads/bb_gbk.csv",
        CSVInputFormat.class, LongWritable.class, Text.class, conf)
    .map(s -> s._2().toString());
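Each element of records is one complete, already-decoded CSV record (including any quoted line breaks), so a quick sanity check is simply to print a few of them; Chinese text should now display correctly instead of mojibake. The variable name records is the one introduced in the snippet above.

// Print the first few decoded records to verify the GB18030 decoding worked.
for (String record : records.take(5)) {
    System.out.println(record);
}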
opencsv maven dependency

<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>4.1</version>
</dependency>
// Read the whole file as a binary stream, decode it explicitly as GB18030, and let
// opencsv's CSVReader handle quoted multi-line values.
JavaRDD<Row> rowRDD = jsc.binaryFiles("/Users/linqiong/Downloads/bb_gbk.csv")
    .flatMap(file -> {
        PortableDataStream ds = file._2();
        DataInputStream dis = ds.open();
        List<String[]> data = new ArrayList<>();
        try (CSVReader reader = new CSVReader(
                new BufferedReader(new InputStreamReader(dis, Charset.forName("GB18030"))))) {
            String[] nextLine;
            while ((nextLine = reader.readNext()) != null) {
                // nextLine is the array of field values for one CSV record
                data.add(nextLine);
            }
        }
        return data.iterator();
    })
    .map(fields -> RowFactory.create((Object[]) fields));
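The rows produced this way are plain arrays of strings, so turning them into a DataFrame still requires a schema. A minimal sketch, assuming a hypothetical two-column layout (the column names, types, and count are placeholders; the first record may also need to be dropped if it is a header row):

// Hypothetical schema; adjust the column names, types, and count to the real file.
StructType schema = new StructType()
    .add("id", DataTypes.StringType)
    .add("comment", DataTypes.StringType);
Dataset<Row> df = spark.createDataFrame(rowRDD, schema);
df.show();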