Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter #2044

Closed
wants to merge 1 commit into from

Conversation

chutium
Copy link
Contributor

@chutium chutium commented Aug 19, 2014

if (!fs.getFileStatus(path).isDir) throw Exception make no sense after this commit #1370

be careful if someone is working on SPARK-2551, make sure the new change passes test case test("Read a parquet file instead of a directory")

@SparkQA
Copy link

SparkQA commented Aug 19, 2014

QA tests have started for PR 2044 at commit 4ae477f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 19, 2014

QA tests have finished for PR 2044 at commit 4ae477f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Copy link
Contributor

A normal, complete Parquet file on HDFS should be directory with a _metadata file. If I understand this PR correctly, it actually tries to handle a special yet abnormal case. Personally, I don't think we should encourage users to generate and/or use potentially broken file segments.

@chutium
Copy link
Contributor Author

chutium commented Aug 20, 2014

i am confused about this definition "Parquet file on HDFS should be directory"...

this was reported in user list http://apache-spark-user-list.1001560.n3.nabble.com/sqlContext-parquetFile-path-fails-if-path-is-a-file-but-succeeds-if-a-directory-td12345.html

i think he is also right, parameter of sqlContext.parquetFile should be a dir of parquet files (better with a _metadata file in it), but this should not be a mandatory requirement, our javadoc
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SQLContext.html#parquetFile(java.lang.String)
says also, this method "Loads a Parquet file", not "loads a dir of parqauet files"...

otherwise maybe we should rename this method to sqlContext.parquetDir or somthing

and this PR is not "try to handle", it is more like "try to ignore useless check" :) it is already handled by children.find(...)

@marmbrus
Copy link
Contributor

I tend to agree here with attempting to provide more functionality as long as it doesn't complicate the code base. Seems like parquet works fine without this check.

@marmbrus
Copy link
Contributor

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have started for PR 2044 at commit 4ae477f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have finished for PR 2044 at commit 4ae477f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ExplainCommand(plan: LogicalPlan, extended: Boolean = false) extends Command

@liancheng
Copy link
Contributor

Hmm... the last Jenkins build info has already been cleaned up, don't know why it failed.

@liancheng
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 27, 2014

QA tests have started for PR 2044 at commit 4ae477f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 27, 2014

QA tests have finished for PR 2044 at commit 4ae477f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Copy link
Contributor

The build failure was caused by unrelated test suites, should be good to go.

asfgit pushed a commit that referenced this pull request Aug 27, 2014
…gle file as parameter

```if (!fs.getFileStatus(path).isDir) throw Exception``` make no sense after this commit #1370

be careful if someone is working on SPARK-2551, make sure the new change passes test case ```test("Read a parquet file instead of a directory")```

Author: chutium <teng.qiu@gmail.com>

Closes #2044 from chutium/parquet-singlefile and squashes the following commits:

4ae477f [chutium] [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter

(cherry picked from commit 48f4278)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@marmbrus
Copy link
Contributor

Thanks! I've merged this into master and 1.1.

@asfgit asfgit closed this in 48f4278 Aug 27, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…gle file as parameter

```if (!fs.getFileStatus(path).isDir) throw Exception``` make no sense after this commit apache#1370

be careful if someone is working on SPARK-2551, make sure the new change passes test case ```test("Read a parquet file instead of a directory")```

Author: chutium <teng.qiu@gmail.com>

Closes apache#2044 from chutium/parquet-singlefile and squashes the following commits:

4ae477f [chutium] [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter
@mohnishkodnani
Copy link

Can we have this take a parent directory that has other subdirectories with parquet files, instead of the parquet files being the immediate children of the directory.

szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Sep 24, 2024
)

Co-authored-by: Steve Vaughan Jr <s_vaughan@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants