forked from sriksun/Ivory
Ivory Overview
Ivory is a feed processing and feed management system that makes it
easier for end consumers to onboard their feed processing and feed
management on Hadoop clusters.
Why?
* Dependencies across data processing pipelines are not easy to
establish. Gaps here typically lead to either incorrect/partial
processing or expensive reprocessing. Defining the same feed in
multiple places can lead to inconsistencies.
* Input data may not always arrive on time, so processing must be able
to start without waiting for all data to arrive and accommodate
late data separately.
* Feed management services such as feed retention, replication across
clusters and archival are burdensome for individual pipeline owners
and are better offered as a service for all customers.
* It should be easy to onboard new workflows/pipelines
* Smoother integration with metastore/catalog
* Provide notifications to end customers based on the availability of
feed groups (logical groups of related feeds that are likely to be
used together)
Usage
a. Setup cluster definition
$IVORY_HOME/bin/ivory entity -submit -type cluster -file /cluster/definition.xml -url http://ivory-server:ivory-port
b. Setup feed definition
$IVORY_HOME/bin/ivory entity -submit -type feed -file /feed1/definition.xml -url http://ivory-server:ivory-port
$IVORY_HOME/bin/ivory entity -submit -type feed -file /feed2/definition.xml -url http://ivory-server:ivory-port
c. Setup process definition
$IVORY_HOME/bin/ivory entity -submit -type process -file /process/definition.xml -url http://ivory-server:ivory-port
d. Once submitted, entity definition, status and dependency can be queried.
$IVORY_HOME/bin/ivory entity -type [cluster|feed|process] -name <<name>> [-definition|-status|-dependency] -url http://ivory-server:ivory-port
or entities for a particular type can be listed through
$IVORY_HOME/bin/ivory entity -type [cluster|feed|process] -list
e. Schedule process
$IVORY_HOME/bin/ivory entity -type process -name process -schedule -url http://ivory-server:ivory-port
f. Once scheduled, entities can be suspended, resumed or deleted (post submit)
$IVORY_HOME/bin/ivory entity -type [cluster|feed|process] -name <<name>> [-suspend|-delete|-resume] -url http://ivory-server:ivory-port
g. Once scheduled, process instances can be managed through the ivory CLI
$IVORY_HOME/bin/ivory instance -processName <<name>> [-kill|-suspend|-resume|-re-run] -start "yyyy-MM-dd'T'HH:mm'Z'" -url http://ivory-server:ivory-port
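Steps a to e above can be strung together in a small wrapper script. The following is only a sketch: the `ivory`, `submit` and `onboard` function names, the `IVORY_URL` and `DRY_RUN` variables, and the `/opt/ivory` default are all made up for illustration and are not part of Ivory. The commands themselves are copied from the steps above. With `DRY_RUN=1` (the default here) the commands are printed rather than executed.

```shell
#!/bin/sh
# Hypothetical wrapper around the CLI calls listed in steps a-e.
# IVORY_URL, DRY_RUN and the /opt/ivory default are assumptions of this sketch.
IVORY_HOME="${IVORY_HOME:-/opt/ivory}"
IVORY_URL="${IVORY_URL:-http://ivory-server:ivory-port}"
DRY_RUN="${DRY_RUN:-1}"

# Run (or, in dry-run mode, print) one ivory command with -url appended.
ivory() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$IVORY_HOME/bin/ivory $* -url $IVORY_URL"
    else
        "$IVORY_HOME/bin/ivory" "$@" -url "$IVORY_URL"
    fi
}

# submit <type> <definition-file>
submit() {
    ivory entity -submit -type "$1" -file "$2"
}

# Submit cluster, feeds and process in dependency order, then schedule.
# Stops at the first command that fails.
onboard() {
    submit cluster /cluster/definition.xml &&
    submit feed /feed1/definition.xml &&
    submit feed /feed2/definition.xml &&
    submit process /process/definition.xml &&
    ivory entity -type process -name process -schedule
}
```

Calling `onboard` after sourcing the script prints the five commands in order. Setting `DRY_RUN=0` would execute them against the server instead.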
Example configurations
Cluster:
<?xml version="1.0"?>
<!--
    Production cluster configuration
-->
<cluster colo="ua2" description="" name="staging-red" xmlns="uri:ivory:cluster:0.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <interfaces>
        <interface type="readonly" endpoint="hftp://gsgw1001.red.ua2.inmobi.com:50070"
                   version="0.20.2-cdh3u0" />
        <interface type="write" endpoint="hdfs://gsgw1001.red.ua2.inmobi.com:54310"
                   version="0.20.2-cdh3u0" />
        <interface type="execute" endpoint="gsgw1001.red.ua2.inmobi.com:54311" version="0.20.2-cdh3u0" />
        <interface type="workflow" endpoint="http://gs1134.blue.ua2.inmobi.com:11000/oozie/"
                   version="3.1.4" />
        <interface type="messaging" endpoint="tcp://gs1134.blue.ua2.inmobi.com:61616?daemon=true"
                   version="5.1.6" />
    </interfaces>
    <locations>
        <location name="staging" path="/projects/ivory/staging" />
        <location name="temp" path="/tmp" />
        <location name="working" path="/projects/ivory/working" />
    </locations>
    <properties/>
</cluster>
Feed:
<?xml version="1.0" encoding="UTF-8"?>
<!--
    Hourly ad carrier summary. Generated by hourly processing of rr logs
-->
<feed description="RRHourlyAdCarrierSummary" name="RRHourlyAdCarrierSummary" xmlns="uri:ivory:feed:0.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <partitions/>
    <groups>rmchourly</groups>
    <frequency>hours</frequency>
    <periodicity>1</periodicity>
    <late-arrival cut-off="hours(6)" />
    <clusters>
        <cluster name="staging-red" type="source">
            <validity start="2009-01-01T00:00Z" end="2099-12-31T00:00Z" timezone="UTC" />
            <retention limit="months(24)" action="delete" />
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/projects/bi/rmc/rr/${YEAR}-${MONTH}-${DAY}-${HOUR}.concat/HourlyAdCarrierSummary" />
        <location type="stats" path="/none" />
        <location type="meta" path="/none" />
    </locations>
    <ACL owner="rmcuser" group="users" permission="0755" />
    <schema location="/none" provider="none" />
    <properties/>
</feed>
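The data location in the feed example uses the placeholders ${YEAR}, ${MONTH}, ${DAY} and ${HOUR}. The sketch below shows how such a template might resolve for a single feed instance. It assumes each placeholder is replaced by the matching field of the instance time, written in the same yyyy-MM-dd'T'HH:mm'Z' layout as the validity attributes. This README does not spell out the substitution rule, and `expand_feed_path` is a hypothetical helper, not part of Ivory.

```shell
#!/bin/sh
# expand_feed_path <template> <instance-time>
# Hypothetical helper for illustration only. Assumes the instance time
# looks like 2012-04-03T06:00Z and that placeholders map to its fields.
expand_feed_path() {
    template=$1
    ts=$2
    # Slice the fixed-width timestamp into its fields.
    year=$(printf '%s' "$ts" | cut -c1-4)
    month=$(printf '%s' "$ts" | cut -c6-7)
    day=$(printf '%s' "$ts" | cut -c9-10)
    hour=$(printf '%s' "$ts" | cut -c12-13)
    # Replace each literal ${NAME} placeholder in the template.
    printf '%s\n' "$template" | sed \
        -e 's/\${YEAR}/'"$year"'/g' \
        -e 's/\${MONTH}/'"$month"'/g' \
        -e 's/\${DAY}/'"$day"'/g' \
        -e 's/\${HOUR}/'"$hour"'/g'
}

expand_feed_path '/projects/bi/rmc/rr/${YEAR}-${MONTH}-${DAY}-${HOUR}.concat/HourlyAdCarrierSummary' '2012-04-03T06:00Z'
```

Under these assumptions the call at the end prints /projects/bi/rmc/rr/2012-04-03-06.concat/HourlyAdCarrierSummary.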
Process:
<?xml version="1.0" encoding="UTF-8"?>
<!--
    RMC Daily process, produces 34 new feeds
-->
<process name="rmc-daily">
    <cluster name="staging-red" />
    <frequency>days(1)</frequency>
    <validity start="2012-04-03T06:00Z" end="2022-12-30T00:00Z" timezone="UTC" />
    <inputs>
        <input name="WapAd" feed="WapAd" start="today(0,0)" end="today(0,0)" />
    </inputs>
    <outputs>
        <output name="TrafficDailyAdSiteSummary" feed="TrafficDailyAdSiteSummary" instance="yesterday(0,0)" />
    </outputs>
    <properties>
        <property name="lastday" value="${formatTime(yesterday(0,0), 'yyyy-MM-dd')}" />
    </properties>
    <workflow engine="oozie" path="/projects/bi/rmc/pipelines/workflow/rmcdaily" />
    <retry policy="backoff" delay="5" delayUnit="minutes" attempts="3" />
</process>
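The feed and process examples write durations as unit(count), for example hours(6) for the late-arrival cut-off, months(24) for the retention limit and days(1) for the process frequency. When scripting checks over definition files, it can help to split such an expression into its unit and count. The sketch below is based only on the notation visible in these examples; `parse_duration` is a hypothetical helper, and the full set of units Ivory accepts is not listed in this README.

```shell
#!/bin/sh
# parse_duration <expr>
# Hypothetical helper: splits "unit(count)" into "unit count".
# Returns non-zero for anything that does not match that shape,
# such as today(0,0), which takes two arguments.
parse_duration() {
    parsed=$(printf '%s\n' "$1" |
        sed -n 's/^\([a-z][a-z]*\)(\([0-9][0-9]*\))$/\1 \2/p')
    [ -n "$parsed" ] || return 1
    printf '%s\n' "$parsed"
}

parse_duration 'hours(6)'
parse_duration 'months(24)'
parse_duration 'days(1)'
```

The three calls print "hours 6", "months 24" and "days 1", one per line.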