[SPARK-17417][CORE] Fix # of partitions for Reliable RDD checkpointing
## What changes were proposed in this pull request?
Currently the number of part files is effectively limited by the five-digit %05d name format: once a partition index needs more than five digits, recreating the RDD breaks because the recovery logic sorts the file names as strings, which places e.g. part-100000 before part-99999 and makes the fail-fast name check reject the checkpoint. More details can be found in the JIRA description [here](https://issues.apache.org/jira/browse/SPARK-17417).
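
To make the failure mode concrete, here is a minimal, self-contained sketch (illustrative only, not part of the patch; the object and file names are made up) contrasting the old string sort with the numeric sort this change introduces:

```scala
object PartFileSortDemo {
  def main(args: Array[String]): Unit = {
    // Once a partition index needs six digits, lexicographic order
    // diverges from numeric order: '1' < '9', so the string
    // "part-100000" sorts before "part-99999".
    val names = Seq("part-100000", "part-00001", "part-99999")

    // Old behavior: sort the file names as strings.
    println(names.sorted)
    // List(part-00001, part-100000, part-99999)  <- wrong order

    // New behavior: strip the prefix and sort by the numeric index.
    println(names.sortBy(_.stripPrefix("part-").toInt))
    // List(part-00001, part-99999, part-100000)  <- correct order
  }
}
```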

## How was this patch tested?
I tested this patch by checkpointing an RDD, manually renaming the part files to the old format, and then accessing the RDD; it was recreated successfully from the old format. I also verified loading a sample Parquet file, saving it in several formats (CSV, JSON, text, Parquet, ORC), and reading each back successfully from the saved files. I couldn't launch the unit tests from my local box, so I will wait for the Jenkins output.
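
For reference, a hedged sketch of the kind of round trip described above (the checkpoint directory, partition count, and object name are hypothetical; this is not the author's exact test):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointRoundTrip {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("/tmp/checkpoints") // hypothetical path

    // Many partitions means many part-XXXXX files in the checkpoint dir.
    val rdd = sc.parallelize(1 to 1000000, 200)
    rdd.checkpoint()
    rdd.count() // an action materializes the reliable checkpoint

    // Later actions read the checkpointed part files back; the fix
    // ensures they are listed in numeric rather than string order.
    println(rdd.count())
    sc.stop()
  }
}
```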

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #15370 from dhruve/bug/SPARK-17417.

(cherry picked from commit 4bafaca)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
dhruve authored and Tom Graves committed Oct 10, 2016
1 parent d27df35 commit d719e9a
Showing 1 changed file with 2 additions and 2 deletions.
core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala

```diff
@@ -69,10 +69,10 @@ private[spark] class ReliableCheckpointRDD[T: ClassTag](
     val inputFiles = fs.listStatus(cpath)
       .map(_.getPath)
       .filter(_.getName.startsWith("part-"))
-      .sortBy(_.toString)
+      .sortBy(_.getName.stripPrefix("part-").toInt)
     // Fail fast if input files are invalid
     inputFiles.zipWithIndex.foreach { case (path, i) =>
-      if (!path.toString.endsWith(ReliableCheckpointRDD.checkpointFileName(i))) {
+      if (path.getName != ReliableCheckpointRDD.checkpointFileName(i)) {
         throw new SparkException(s"Invalid checkpoint file: $path")
       }
     }
```
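As context for the patched equality check, here is a small sketch of the part-file naming scheme (mirroring `ReliableCheckpointRDD.checkpointFileName`, which formats the partition index as `part-%05d`; the demo object is made up):

```scala
object CheckpointNamingDemo {
  // Mirrors ReliableCheckpointRDD.checkpointFileName: zero-pad the
  // partition index to five digits. %05d pads but never truncates,
  // so indices beyond 99999 simply yield longer names.
  def checkpointFileName(partitionIndex: Int): String =
    "part-%05d".format(partitionIndex)

  def main(args: Array[String]): Unit = {
    assert(checkpointFileName(7) == "part-00007")
    assert(checkpointFileName(99999) == "part-99999")
    // Beyond five digits the name just grows, so the exact getName
    // comparison in the patched code still matches after the numeric sort.
    assert(checkpointFileName(123456) == "part-123456")
    println("naming checks passed")
  }
}
```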
