SHC with Spark Structured Streaming #205
You can write your own sink provider that extends StreamSinkProvider; this is my implementation:
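A minimal sketch of such a provider, assuming the shc-core HBaseTableCatalog API (the hbasecat option name is an illustrative choice, not a fixed SHC key):

package org.apache.spark.sql.execution.streaming

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

// Sink that writes every micro-batch to HBase through the shc-core batch writer.
class HBaseSink(options: Map[String, String]) extends Sink {
  // Catalog JSON passed via .option("hbasecat", ...)
  private val hBaseCatalog = options.getOrElse("hbasecat", "")

  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    // Re-create the DataFrame from the underlying RDD so the batch write is no longer
    // tied to the streaming plan; otherwise Spark throws "Queries with streaming
    // sources must be executed with writeStream.start()".
    val df = data.sparkSession.createDataFrame(data.rdd, data.schema)
    df.write
      .options(Map(HBaseTableCatalog.tableCatalog -> hBaseCatalog,
                   HBaseTableCatalog.newTable -> "5"))   // create the table with 5 regions if missing
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
}

// Registers the sink so it can be used as format("hbase") (via META-INF/services)
// or by its fully qualified class name.
class HBaseSinkProvider extends StreamSinkProvider with DataSourceRegister {
  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink =
    new HBaseSink(parameters)

  override def shortName(): String = "hbase"
}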
This is an example of how to use it:
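A sketch of such a call, assuming the provider above (catalog is an SHC catalog JSON string defined elsewhere, and the checkpoint path is arbitrary):

val query = inputDf.writeStream
  .format("hbase")   // or "org.apache.spark.sql.execution.streaming.HBaseSinkProvider" if the short name is not registered
  .option("hbasecat", catalog)
  .option("checkpointLocation", "/tmp/checkpoint")
  .outputMode("append")
  .start()

query.awaitTermination()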
Thanks for your answer, exactly the type of solution I was looking for. I only had time to test it quickly, but it seems to be working perfectly!
Excellent, glad to help!!!
Thank you, this helps a lot.
I've implemented your solution with HBaseSinkProvider by following these steps:
My code is written in Python; I'm including it below. The error is:
pyspark.sql.utils.StreamingQueryException: u'Queries with streaming sources must be executed with writeStream.start();;
def consume(schema_name, brokers, topic, group_id):
Try writing to the console or to a file; do you get the same error?
No, when I write the records to the console, everything is OK. I'm using the following Python code instead (a Scala equivalent is sketched below):
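A Scala equivalent of that console check (ds stands for the streaming Dataset being written):

// Write the same streaming data to the console instead of HBase.
val consoleQuery = ds.writeStream
  .format("console")
  .outputMode("append")
  .option("truncate", "false")   // show full column values
  .start()

consoleQuery.awaitTermination()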
The output is something like this:
Unfortunately I am without a computer right now. Try setting outputMode to "update"; if that does not help and you cannot find a solution, email me after July 5 and I will try to help.
Unfortunately, "update" mode does not work either. I received the same error (see below). Thank you in advance.
Please try this version of shc (https://github.com/sutugin/shc) and compile it against the corresponding HBase/Phoenix version. I have used it without the Avro format and it works perfectly.
Hello @sutugin. I've implemented your solution, but the data is not getting updated in HBase, and no exception is thrown either. Can you suggest anything in this regard?
Hi @swarup5s, if you share your implementation code, how you use it, and the logs, maybe we can find the problem together.
Hi @sutugin, thanks for your help. Appreciate it.

// this class is under ...org/apache/spark/sql/execution/streaming/
import org.apache.spark.sql.execution.datasources.hbase.Logging
import org.apache.spark.sql.execution.datasources.hbase._

class HBaseSink(options: Map[String, String]) extends Sink with Logging {
  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    // ...

class HBaseSinkProvider extends StreamSinkProvider with DataSourceRegister {
  def shortName(): String = "hbase"
  /**
  def catalog = s"""{
  // ...

def withCatalog(cat: String): DataFrame = {
  val streamdf = spark
// ... some biz logic here and a join between the batch and the streaming DataFrame; finaldf is the streaming DF
try {
I've made some changes. Now a StreamingQueryException is thrown, but being a novice I'm not sure what is going wrong. Here are the logs:
org.apache.spark.sql.streaming.StreamingQueryException: null
Current State: ACTIVE
Logical Plan:
On the other hand, the message is successfully written to the console.
Hello @sutugin, first of all thank you for your great help. I'm experiencing @swarup5s's problem when I call:
The same happens whenever I call something like this. I think the problem is outside your code, somewhere else; maybe it is more Spark related?
Hi @sympho410!
Hello @sutugin, I noticed just now that you replied!
Now it works properly. Thank you again for your help :)
Hello @sutugin and @sympho410, I am also working on a similar kind of problem: I want to do bulk puts to HBase from Spark Structured Streaming. I see the code above tries to do that, but what I am not able to understand is the use of the catalog here. It seems like a predefined schema of sorts, but since HBase is schema-less, meaning I can add any new column in the future, how can I fix a catalog in advance? Thanks in advance!
It seems to me the purpose of the catalog is to properly structure the data for serialization and deserialization. The need to specify the schema is a feature of this library's implementation and is not tied to Structured Streaming. You can try to work around this limitation by generating a schema on the fly, based on the schema of the data inside each batch, but then you must be sure that all rows inside the batch have the same schema, or try to use a foreach writer and derive the schema for each row separately.
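As an illustration of the "generate the schema on the fly" idea, a sketch that builds an SHC catalog JSON from a batch's schema (table name, row key column, and column family are assumptions, and every value is mapped as a string for simplicity):

import org.apache.spark.sql.types.StructType

// Build an SHC catalog JSON from a DataFrame schema: the row key column goes to
// the special "rowkey" family, every other column into a single column family.
def catalogFor(tableName: String, rowKey: String, schema: StructType, family: String = "cf"): String = {
  val columns = schema.fieldNames.map { name =>
    val cf = if (name == rowKey) "rowkey" else family
    s""""$name":{"cf":"$cf", "col":"$name", "type":"string"}"""
  }.mkString(",")
  s"""{"table":{"namespace":"default", "name":"$tableName"},
     |"rowkey":"$rowKey",
     |"columns":{$columns}}""".stripMargin
}

// Example (placeholders): val catalog = catalogFor("mytable", "id", batchDf.schema)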
Can anyone please provide a compiled jar with HBaseSink included? I tried building the shc project but I get this error:
I tried implementing the HBaseSink class in my own project and used SHC as a Maven dependency, but it is not working; I get an error:
It would be very helpful if I could get the compiled jar. Thanks.
Try building from my fork (https://github.com/sutugin/shc), though I have not updated it for a long time. Just specify in pom.xml the Spark version that is current for you; for me it is
@sutugin Thanks for replying. I'm working on the Databricks platform, which runs Spark 2.4.3, so I have access to the .foreachBatch API and the above is not needed anymore.
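For reference, the foreachBatch route can reuse the plain SHC batch writer without any custom sink. A sketch (streamingDf and catalog are placeholders; shown for Scala 2.11, on Scala 2.12 the foreachBatch overloads may need an explicitly typed function):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Spark 2.4+: write each micro-batch with the regular SHC batch writer.
val query = streamingDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchDf.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
                   HBaseTableCatalog.newTable -> "5"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()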
@omkarahane Good idea, I already wrote to someone about this method: #238 (comment)
@sutugin I'm still facing an issue; it is different from the one I mentioned above. Here is the link where I have given all the details, please see if you can help. Thanks.
@omkarahane, try making a "fat" jar with the sbt dependency libraryDependencies += "com.hortonworks.shc" % "shc-core" % "1.1.0.3.1.2.1-1".
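A build.sbt along those lines might look like this (the Hortonworks repository URL, the sbt-assembly plugin, and all version numbers are assumptions to adjust to your cluster):

// build.sbt (sketch); project/plugins.sbt also needs:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
name := "shc-streaming-job"
scalaVersion := "2.11.12"

resolvers += "Hortonworks" at "http://repo.hortonworks.com/content/groups/public/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.4.3" % "provided",
  "com.hortonworks.shc" % "shc-core" % "1.1.0.3.1.2.1-1"
)

// Build the fat jar with: sbt assembly
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat   // keep data source registrations
  case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
  case _                                         => MergeStrategy.first
}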
I tried running the job with a fat jar created using Maven, and the issue still wasn't resolved. I guess fat jars created with sbt and Maven would be almost the same?
@omkarahane, maybe this will help you: #223 (comment)
@sutugin, thanks a lot, you pointed me in the right direction: the HBase jars were missing. I added those jars and installed them as a library on the cluster so the job has access to them, which solved my initial problem, but now I'm getting another exception:
java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
This also seems to be a dependency issue. This is what I have tried:
Still getting the same error.
@omkarahane, similar problems are described here:
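A json4s NoSuchMethodError like the one above usually means two incompatible json4s versions meet on the classpath (Spark's own and the one pulled in by shc-core). One common workaround when building a fat jar is to shade json4s; a sketch for sbt-assembly (assuming the plugin is already in the build):

// build.sbt: rename the bundled json4s packages inside the fat jar so they
// no longer clash with the json4s version shipped with the Spark runtime.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.json4s.**" -> "shaded.json4s.@1").inAll
)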
@sutugin @merfill But I get an error. I am using Java:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import java.io.IOException;
import java.io.Serializable;
import java.util.Properties;
public class KafkaStructStream implements Serializable {
private String servers;
private String jks;
private String schema;
public KafkaStructStream(String[] args) {
// this.servers = args[0];
// this.jks = args[1];
}
private Dataset<Row> initStructKafka() throws IOException {
Properties prop = Config.getProp();
this.schema = prop.getProperty("hbase.traffic.schema");
SparkSession spark = SparkSession
.builder()
.appName("Kafka")
.master("local[*]")
.getOrCreate();
return spark.readStream().format("kafka")
.option("kafka.bootstrap.servers", prop.getProperty("kafka.broker.list"))
.option("kafka.ssl.truststore.location", Config.getPath(Config.KAFKA_JKS))
// .option("kafka.bootstrap.servers", this.servers)
// .option("kafka.ssl.truststore.location", this.jks)
.option("kafka.ssl.truststore.password", prop.getProperty("kafka.jks.passwd"))
.option("kafka.security.protocol", "SSL")
.option("kafka.ssl.endpoint.identification.algorithm", "")
.option("startingOffsets", "latest")
// .option("subscribe", kafkaProp.getProperty("kafka.topic"))
.option("subscribe", "traffic")
.load()
.selectExpr("CAST(topic AS STRING)", "CAST(value AS STRING)");
}
private void run() {
Dataset<Row> df = null;
try {
df = initStructKafka();
} catch (IOException e) {
e.printStackTrace();
System.exit(1);
}
df.printSchema();
StructType trafficSchema = new StructType()
.add("guid", DataTypes.StringType)
.add("time", DataTypes.LongType)
.add("end_time", DataTypes.LongType)
.add("srcip", DataTypes.StringType)
.add("srcmac", DataTypes.StringType)
.add("srcport", DataTypes.IntegerType)
.add("destip", DataTypes.StringType)
.add("destmac", DataTypes.StringType)
.add("destport", DataTypes.IntegerType)
.add("proto", DataTypes.StringType)
.add("appproto", DataTypes.StringType)
.add("upsize", DataTypes.LongType)
.add("downsize", DataTypes.LongType);
Dataset<Row> ds = df.select(functions.from_json(df.col("value").cast(DataTypes.StringType), trafficSchema).as("data")).select("data.*");
StreamingQuery query = ds.writeStream()
.format("HBase.HBaseSinkProvider")
.option("HBaseTableCatalog.tableCatalog", this.schema)
.option("checkpointLocation", "/tmp/checkpoint")
.start();
// StreamingQuery query = ds.writeStream().format("console")
// .trigger(Trigger.Continuous("2 seconds"))
// .start();
try {
query.awaitTermination();
} catch (StreamingQueryException e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
KafkaStructStream k = new KafkaStructStream(args);
k.run();
}
}

ERROR:
Hi @631068264, if you build from my fork, try specifying the format: "org.apache.spark.sql.execution.streaming.HBaseStreamSinkProvider" or "hbase" |
Hi, I'm doing Spark Structured Streaming of Kafka-ingested messages and storing the data in HBase after processing. The issue that pops up is:
**ERROR ConnectionManager$HConnectionImplementation: The node /hbase is not in ZooKeeper. It should have been written by the master. Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master.**
I tried passing the hbase-site.xml in the spark-submit, but no luck. The hbase-site.xml has the property "zookeeper.znode.parent", which is "/hbase-unsecure". My spark-submit parameters are:
The stack version in the cluster is given below: Hadoop 2.7.3
Please find the build.sbt and the Scala classes attached for your reference. Kindly let me know if there is any HBase configuration (zookeeper quorum, zookeeper client port, zookeeper znode parent) which we can set in the step where we write data to a table, that is, df.write.
Hi, @Saimukunth!
If you write in batch mode, not Structured Streaming, is the error reproduced?
I think the problem is the same as #150
If I do it through legacy streaming, the error is not reproduced; I'm able to insert the data into HBase, because I create the HBase connection with the appropriate configuration.
Thanks,
Venkatesh Raman
@Saimukunth Try to do as described here (#150 (comment)) in the second case |
Hi, thanks, I was able to resolve the above issue using the shc jar. Data is getting inserted into HBase, but not in the way I wanted.
**Expected:**
**Actual:**
This is my HBase catalog file:
In the HBaseStreamSinkProvider, I'm writing to HBase using:
Is there any way I can play around with the HBaseTableCatalog class to get the desired result? Thanks and regards,
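For reference, the catalog is what decides where each field ends up in HBase: every entry maps a DataFrame column to a column family ("cf") and qualifier ("col"), and the column mapped to the "rowkey" family becomes the row key. A sketch with placeholder table and column names:

// Placeholder catalog: "key" becomes the HBase row key, "col1" and "col2"
// are written into column family "cf1" under the qualifiers given by "col".
val catalog = """{
  |"table":{"namespace":"default", "name":"mytable"},
  |"rowkey":"key",
  |"columns":{
  |  "key":{"cf":"rowkey", "col":"key", "type":"string"},
  |  "col1":{"cf":"cf1", "col":"col1", "type":"string"},
  |  "col2":{"cf":"cf1", "col":"col2", "type":"bigint"}
  |}
  |}""".stripMargin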
Hi,
I have a Spark Structured Streaming application where I'd like to write streaming data to HBase using SHC. It reads data from a location where new csv files are continuously being created. The defined catalog works for writing a DataFrame with identical data into HBase.
The key components of my streaming application are a DataStreamReader and a DataStreamWriter.
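A rough sketch of such a reader/writer pair (spark, csvSchema, catalog, and the paths are placeholders, not taken from the original application):

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Streaming reader: pick up new csv files as they appear in a directory.
val input = spark.readStream
  .schema(csvSchema)
  .csv("/path/to/incoming")

// Streaming writer: this is the call that fails, because the SHC batch data source
// (org.apache.spark.sql.execution.datasources.hbase) does not implement a streaming sink.
val query = input.writeStream
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .option(HBaseTableCatalog.tableCatalog, catalog)
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()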
When running the application I'm getting the following message:
Exception in thread "main" java.lang.UnsupportedOperationException: Data source org.apache.spark.sql.execution.datasources.hbase does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:285)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:286)
at my.package.SHCStreamingApplication$.main(SHCStreamingApplication.scala:153)
at my.package.SHCStreamingApplication.main(SHCStreamingApplication.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Does anyone know a solution or way/workaround to still use the SHC for writing structured streaming data to HBase?
Thanks in advance!