[SPARK-2094][SQL] "Exactly once" semantics for DDL and command statements #1071
@@ -22,45 +22,69 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{SQLContext, Row}
 import org.apache.spark.sql.catalyst.expressions.{GenericRow, Attribute}

+trait PhysicalCommand {
+  /**
+   * A concrete command should override this lazy field to wrap up any side effects caused by the
+   * command or any other computation that should be evaluated exactly once. The value of this field
+   * can be used as the contents of the corresponding RDD generated from the physical plan of this
+   * command.
+   *
+   * The `execute()` method of all the physical command classes should reference `sideEffect` so
+   * that the command can be executed eagerly right after the command query is created.
+   */
+  protected[sql] lazy val sideEffectResult: Seq[Any] = Seq.empty[Any]
Review thread on the default value of `sideEffectResult`:

- By default, maybe we could have this return "OK" like Hive does?
- Makes sense. And a little off topic, why is … ?
- +1 for following the same-name convention.
- @marmbrus Checked Hive 0.12 source code and confirmed that the "OK" is actually a log rather than the result. Thus essentially we can't distinguish a command (e.g. …) …
- Okay, fine to not return OK. Regarding `emptyResult`, that is a bad name, but it is used for things like …
- After some thought I think it's not only about naming, the semantics is wrong: … There does exist a … Currently …
- We use a Project to build the actual result. However, looks like I already did the separation... so that plan sounds good to me :)
+}
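As background for the thread above: the "exactly once" behavior hinges on Scala's `lazy val` semantics, where the initializer runs the first time the field is forced and every later access reuses the cached value. Below is a minimal standalone sketch of that pattern; `Command`, `SetFlag`, and the mutable map are made up for illustration and are not Spark code.

```scala
import scala.collection.mutable

// Stand-in for PhysicalCommand: the side effect lives in a lazy field.
trait Command {
  protected lazy val sideEffectResult: Seq[Any] = Seq.empty
  def execute(): Seq[Any] = sideEffectResult
}

// Hypothetical command that mutates a fake "conf" exactly once.
class SetFlag(conf: mutable.Map[String, String], k: String, v: String) extends Command {
  override protected lazy val sideEffectResult: Seq[Any] = {
    conf(k) = v        // the side effect
    Seq(k -> v)        // the value later turned into result rows
  }
}

object LazyValDemo extends App {
  val conf = mutable.Map.empty[String, String]
  val cmd = new SetFlag(conf, "answer", "42")
  cmd.execute()
  cmd.execute()        // second call does not re-run the initializer
  println(conf)        // Map(answer -> 42): the mutation happened once
}
```

Note that the caching is per instance: a freshly planned node gets a fresh, unevaluated lazy val, which is exactly the re-planning issue raised in the last review thread below.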

 /**
  * :: DeveloperApi ::
  */
 @DeveloperApi
-case class SetCommandPhysical(key: Option[String], value: Option[String], output: Seq[Attribute])
-  (@transient context: SQLContext) extends LeafNode {
-  def execute(): RDD[Row] = (key, value) match {
-    // Set value for key k; the action itself would
-    // have been performed in QueryExecution eagerly.
-    case (Some(k), Some(v)) => context.emptyResult
+case class SetCommandPhysical(
Review thread on naming the physical command classes:

- In general we have not been naming our physical operators differently, but have been relying on the package to differentiate, as this is less redundant (i.e. …).
- Agree, just followed @concretevitamin's naming style; would love to rename all these physical commands.
+    key: Option[String], value: Option[String], output: Seq[Attribute])(
+    @transient context: SQLContext)
+  extends LeafNode with PhysicalCommand {
+
+  override protected[sql] lazy val sideEffectResult: Seq[(String, String)] = (key, value) match {
+    // Set value for key k.
+    case (Some(k), Some(v)) =>
+      context.set(k, v)
+      Array.empty[(String, String)]
Review thread on echoing the newly set key-value pair:

- Have we decided not to echo the newly set key-value pair?
- Just followed the original logic here, but I do agree that echoing the pair would be more user friendly.
- Hmm... Just confirmed that Hive 0.12 doesn't return anything in this case (not even an "OK" string), so I'd prefer to leave it as is to mimic Hive.
- We are already differing from Hive in the behavior of "SET". I don't see a reason to stick to Hive semantics in this case.
- OK, made the command return the newly set pair.

     // Query the value bound to key k.
-    case (Some(k), None) =>
-      val resultString = context.getOption(k) match {
-        case Some(v) => s"$k=$v"
-        case None => s"$k is undefined"
-      }
-      context.sparkContext.parallelize(Seq(new GenericRow(Array[Any](resultString))), 1)
+    case (Some(k), _) =>
+      Array(k -> context.getOption(k).getOrElse("<undefined>"))
+
     // Query all key-value pairs that are set in the SQLConf of the context.
     case (None, None) =>
-      val pairs = context.getAll
-      val rows = pairs.map { case (k, v) =>
-        new GenericRow(Array[Any](s"$k=$v"))
-      }.toSeq
-      // Assume config parameters can fit into one split (machine) ;)
-      context.sparkContext.parallelize(rows, 1)
-    // The only other case is invalid semantics and is impossible.
-    case _ => context.emptyResult
+      context.getAll
+
+    case _ =>
+      throw new IllegalArgumentException()
   }

+  def execute(): RDD[Row] = {
+    val rows = sideEffectResult.map { case (k, v) => new GenericRow(Array[Any](k, v)) }
+    context.sparkContext.parallelize(rows, 1)
+  }
+
   override def otherCopyArgs = context :: Nil
 }
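To make the three arms above concrete, here is a rough sketch of what `sideEffectResult` would contain for each SET form, with a plain mutable map standing in for the SQLContext settings. This is illustrative only and uses no Spark API; the object and method names are invented.

```scala
import scala.collection.mutable

object SetCommandSketch extends App {
  val conf = mutable.Map("spark.sql.shuffle.partitions" -> "200")

  // Mirrors the match in sideEffectResult above, minus RDDs and SQLContext.
  def setCommand(key: Option[String], value: Option[String]): Seq[(String, String)] =
    (key, value) match {
      case (Some(k), Some(v)) =>  // SET k=v: perform the side effect
        conf(k) = v
        Seq.empty                 // this revision returns no rows for a plain SET
      case (Some(k), _) =>        // SET k: look up a single key
        Seq(k -> conf.getOrElse(k, "<undefined>"))
      case (None, None) =>        // SET: return every configured pair
        conf.toSeq
      case _ =>
        throw new IllegalArgumentException()
    }

  println(setCommand(Some("spark.sql.shuffle.partitions"), None)) // List((spark.sql.shuffle.partitions,200))
  println(setCommand(None, None))                                 // all pairs currently set
}
```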

 /**
  * :: DeveloperApi ::
  */
 @DeveloperApi
-case class ExplainCommandPhysical(child: SparkPlan, output: Seq[Attribute])
-  (@transient context: SQLContext) extends UnaryNode {
+case class ExplainCommandPhysical(
+    child: SparkPlan, output: Seq[Attribute])(
+    @transient context: SQLContext)
+  extends UnaryNode with PhysicalCommand {
+
+  // Actually "EXPLAIN" command doesn't cause any side effect.
+  override protected[sql] lazy val sideEffectResult: Seq[String] = this.toString.split("\n")

   def execute(): RDD[Row] = {
-    val planString = new GenericRow(Array[Any](child.toString))
-    context.sparkContext.parallelize(Seq(planString))
+    val explanation = sideEffectResult.mkString("\n")
+    context.sparkContext.parallelize(Seq(new GenericRow(Array[Any](explanation))), 1)
   }

   override def otherCopyArgs = context :: Nil
@@ -71,18 +95,19 @@ case class ExplainCommandPhysical(child: SparkPlan, output: Seq[Attribute])
  */
 @DeveloperApi
 case class CacheCommandPhysical(tableName: String, doCache: Boolean)(@transient context: SQLContext)
-  extends LeafNode {
+  extends LeafNode with PhysicalCommand {

-  lazy val commandSideEffect = {
+  override protected[sql] lazy val sideEffectResult = {
     if (doCache) {
       context.cacheTable(tableName)
     } else {
       context.uncacheTable(tableName)
     }
+    Seq.empty[Any]
   }

   override def execute(): RDD[Row] = {
-    commandSideEffect
+    sideEffectResult
     context.emptyResult
   }
Review thread on re-planning and the "exactly once" constraint:

- Realized that many `SchemaRDD` actions other than `collect()` and DSL methods reuse `logicalPlan` and break the "exactly once" constraint when planning the local plan (new physical plan nodes for DDL/command statements are created, causing the side effect to take place again). So I replaced `logicalPlan` with the executed physical plan wrapped in a `SparkLogicalPlan` to prevent multiple physical plan instantiations for the same DDL/command statement.
- Good catch. This is probably the problem I was seeing with double "UNCACHE TABLE".
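A hedged sketch of the failure mode described in this thread, and of the effect of reusing the executed plan. The names below (`plan`, `CacheCommandNode`, `ReplanningDemo`) are invented for illustration and are not the actual Spark classes: each call to the planner builds a new physical node, each new node carries its own unevaluated lazy val, and so the side effect fires once per planning rather than once per statement.

```scala
object ReplanningDemo extends App {
  var timesCached = 0  // counts how often the side effect actually runs

  // Stand-in for a physical command node such as CacheCommandPhysical.
  class CacheCommandNode {
    lazy val sideEffectResult: Seq[Any] = { timesCached += 1; Seq.empty }
    def execute(): Seq[Any] = sideEffectResult
  }

  // Stand-in for the planner: a fresh physical node on every invocation.
  def plan(): CacheCommandNode = new CacheCommandNode()

  // Without reuse: every action re-plans the statement, so the side effect
  // runs again for each action.
  plan().execute()
  plan().execute()
  assert(timesCached == 2)

  // With reuse (the fix described above): keep the executed physical plan
  // (conceptually, wrapped back into the logical plan) so every subsequent
  // action hits the same node instance.
  timesCached = 0
  val executedPlan = plan()
  executedPlan.execute()
  executedPlan.execute()
  assert(timesCached == 1)
}
```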