[SPARK-2377] Python API for Streaming #2538
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# | ||
# Licensed to the Apache Software Foundation (ASF) under one or more | ||
# contributor license agreements. See the NOTICE file distributed with | ||
# this work for additional information regarding copyright ownership. | ||
# The ASF licenses this file to You under the Apache License, Version 2.0 | ||
# (the "License"); you may not use this file except in compliance with | ||
# the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
|
||
""" | ||
Counts words in new text files created in the given directory | ||
Usage: hdfs_wordcount.py <directory> | ||
<directory> is the directory that Spark Streaming will use to find and read new text files. | ||
|
||
To run this on your local machine on directory `localdir`, run this example | ||
$ bin/spark-submit examples/src/main/python/streaming/hdfs_wordcount.py localdir | ||
|
||
Then create a text file in `localdir` and the words in the file will get counted. | ||
""" | ||
|
||
import sys | ||
|
||
from pyspark import SparkContext | ||
from pyspark.streaming import StreamingContext | ||
|
||
if __name__ == "__main__": | ||
    if len(sys.argv) != 2: | ||
        print >> sys.stderr, "Usage: hdfs_wordcount.py <directory>" | ||
        exit(-1) | ||
|
||
    sc = SparkContext(appName="PythonStreamingHDFSWordCount") | ||
    ssc = StreamingContext(sc, 1) | ||
|
||
    lines = ssc.textFileStream(sys.argv[1]) | ||
    counts = lines.flatMap(lambda line: line.split(" "))\ | ||
                  .map(lambda x: (x, 1))\ | ||
                  .reduceByKey(lambda a, b: a+b) | ||
    counts.pprint() | ||
|
||
    ssc.start() | ||
    ssc.awaitTermination() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# | ||
# Licensed to the Apache Software Foundation (ASF) under one or more | ||
# contributor license agreements. See the NOTICE file distributed with | ||
# this work for additional information regarding copyright ownership. | ||
# The ASF licenses this file to You under the Apache License, Version 2.0 | ||
# (the "License"); you may not use this file except in compliance with | ||
# the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
|
||
""" | ||
Counts words in UTF8 encoded, '\n' delimited text received from the network every second. | ||
Usage: network_wordcount.py <hostname> <port> | ||
<hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data. | ||
|
||
To run this on your local machine, you need to first run a Netcat server | ||
`$ nc -lk 9999` | ||
and then run the example | ||
`$ bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999` | ||
""" | ||
|
||
import sys | ||
|
||
from pyspark import SparkContext | ||
from pyspark.streaming import StreamingContext | ||
|
||
if __name__ == "__main__": | ||
Review comment: I forgot to mention, can you add instructions on how to run the example (along with nc, etc.) as doc comments? See the comments in the Scala / Java NetworkWordCount.
Reply: done
||
    if len(sys.argv) != 3: | ||
        print >> sys.stderr, "Usage: network_wordcount.py <hostname> <port>" | ||
        exit(-1) | ||
    sc = SparkContext(appName="PythonStreamingNetworkWordCount") | ||
    ssc = StreamingContext(sc, 1) | ||
|
||
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) | ||
    counts = lines.flatMap(lambda line: line.split(" "))\ | ||
                  .map(lambda word: (word, 1))\ | ||
                  .reduceByKey(lambda a, b: a+b) | ||
    counts.pprint() | ||
|
||
    ssc.start() | ||
    ssc.awaitTermination() |
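For readers without a Spark installation, the per-batch pipeline in `network_wordcount.py` (`flatMap`, then `map`, then `reduceByKey`) can be approximated in plain Python. This is only a sketch of the semantics, not how Spark executes it: `count_words` and the sample `batches` are hypothetical names introduced here, and each inner list stands in for the lines received during one batch interval.

```python
from collections import Counter


def count_words(batch_lines):
    """Count words in one batch of lines (hypothetical helper, not a Spark API)."""
    # flatMap step: split each line on single spaces and flatten into one list
    words = [word for line in batch_lines for word in line.split(" ")]
    # map + reduceByKey step: pair each word with 1 and sum the ones per word
    counts = Counter()
    for word in words:
        counts[word] += 1
    return dict(counts)


# Each inner list plays the role of one batch; batches are counted independently.
batches = [["a b a", "c"], ["b b"]]
for batch in batches:
    print(count_words(batch))
```

Because the split is on a single space, an empty line contributes one empty-string "word", which the sketch reproduces. Counts are not carried from one batch to the next here.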
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
# | ||
# Licensed to the Apache Software Foundation (ASF) under one or more | ||
# contributor license agreements. See the NOTICE file distributed with | ||
# this work for additional information regarding copyright ownership. | ||
# The ASF licenses this file to You under the Apache License, Version 2.0 | ||
# (the "License"); you may not use this file except in compliance with | ||
# the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
|
||
""" | ||
Counts words in UTF8 encoded, '\n' delimited text received from the | ||
network every second. | ||
|
||
Usage: stateful_network_wordcount.py <hostname> <port> | ||
<hostname> and <port> describe the TCP server that Spark Streaming | ||
would connect to receive data. | ||
|
||
To run this on your local machine, you need to first run a Netcat server | ||
`$ nc -lk 9999` | ||
and then run the example | ||
`$ bin/spark-submit examples/src/main/python/streaming/stateful_network_wordcount.py \ | ||
localhost 9999` | ||
""" | ||
|
||
import sys | ||
|
||
from pyspark import SparkContext | ||
from pyspark.streaming import StreamingContext | ||
|
||
if __name__ == "__main__": | ||
    if len(sys.argv) != 3: | ||
        print >> sys.stderr, "Usage: stateful_network_wordcount.py <hostname> <port>" | ||
        exit(-1) | ||
    sc = SparkContext(appName="PythonStreamingStatefulNetworkWordCount") | ||
    ssc = StreamingContext(sc, 1) | ||
    ssc.checkpoint("checkpoint") | ||
|
||
    def updateFunc(new_values, last_sum): | ||
        return sum(new_values) + (last_sum or 0) | ||
|
||
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) | ||
    running_counts = lines.flatMap(lambda line: line.split(" "))\ | ||
                          .map(lambda word: (word, 1))\ | ||
                          .updateStateByKey(updateFunc) | ||
|
||
    running_counts.pprint() | ||
|
||
    ssc.start() | ||
    ssc.awaitTermination() |
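The running total kept by `updateStateByKey(updateFunc)` can be illustrated without Spark. The sketch below reuses the same update logic and applies it batch by batch to an ordinary dictionary. `apply_batch` is a hypothetical stand-in written for this illustration, not part of the PySpark API, and it ignores checkpointing and partitioning entirely.

```python
def update_func(new_values, last_sum):
    # Same logic as updateFunc in the example: add this batch's values
    # to the previous total (None for a key seen for the first time).
    return sum(new_values) + (last_sum or 0)


def apply_batch(state, batch_pairs):
    """Hypothetical stand-in for one updateStateByKey step over one batch."""
    # Group this batch's values by key
    grouped = {}
    for key, value in batch_pairs:
        grouped.setdefault(key, []).append(value)
    # Update every key that is either already in the state or new in this batch
    new_state = {}
    for key in set(state) | set(grouped):
        new_state[key] = update_func(grouped.get(key, []), state.get(key))
    return new_state


state = {}
for batch in (["a", "b", "a"], ["b"], []):
    state = apply_batch(state, [(word, 1) for word in batch])
    print(sorted(state.items()))
```

Keys with no new values in a batch are still passed through the update function with an empty list, so their totals persist across batches.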
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,6 +13,7 @@ Contents: | |
|
||
pyspark | ||
pyspark.sql | ||
pyspark.streaming | ||
pyspark.mllib | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -114,6 +114,9 @@ def __ne__(self, other): | |
    def __repr__(self): | ||
        return "<%s object>" % self.__class__.__name__ | ||
|
||
    def __hash__(self): | ||
        return hash(str(self)) | ||
Review comment: Similar question: are the changes in this file necessary for streaming, or were they part of the refactoring?
Reply: This is necessary; we need to check the serializers of DStreams.
Reply: Gotcha.
||
|
||
|
||
class FramedSerializer(Serializer): | ||
|
||
|
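The `__hash__` added to `Serializer` in the hunk above pairs with the class's equality methods. A minimal sketch of the effect follows, under stated assumptions: the `__eq__` shown is assumed (only `__ne__`, `__repr__`, and `__hash__` are visible in this diff), and `PickleLikeSerializer` is a hypothetical subclass used only for illustration. The point is that two serializers which compare equal also hash alike, so they behave as one entry in sets and dictionaries.

```python
class Serializer(object):
    """Simplified stand-in for the Serializer base class in this diff."""

    def __eq__(self, other):
        # Assumed for this sketch: same class means equal
        return isinstance(other, self.__class__)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "<%s object>" % self.__class__.__name__

    def __hash__(self):
        # str(self) falls back to __repr__, which depends only on the class
        # name, so instances of the same class produce the same hash
        return hash(str(self))


class PickleLikeSerializer(Serializer):
    """Hypothetical subclass, not a real PySpark serializer."""
    pass


a, b = PickleLikeSerializer(), PickleLikeSerializer()
print(a == b, hash(a) == hash(b), len({a, b}))
```

Without a matching `__hash__`, equal instances would either be unhashable (Python 3, when only `__eq__` is defined) or hash by identity (Python 2), which makes comparisons through sets or dictionary keys unreliable.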
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# | ||
# Licensed to the Apache Software Foundation (ASF) under one or more | ||
# contributor license agreements. See the NOTICE file distributed with | ||
# this work for additional information regarding copyright ownership. | ||
# The ASF licenses this file to You under the Apache License, Version 2.0 | ||
# (the "License"); you may not use this file except in compliance with | ||
# the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
|
||
from pyspark.streaming.context import StreamingContext | ||
from pyspark.streaming.dstream import DStream | ||
|
||
__all__ = ['StreamingContext', 'DStream'] |
Review comment: Similar to network_wordcount.py, can you add comments on how to run this example?
Reply: done