Enhancement Request: Alerts from Vitals #158
Comments
Actually, I'm going to try and figure this out and report back here in case others want an example for the same thing. My first attempt is using the telegraf.local file: [[inputs.http]] Getting a parse error in the telegraf logs, so going to experiment some more. |
Awesome! Thanks for opening this @DerickJohnson - Let us know what you come up with. FYI - The pypowerwall proxy has a macro that aggregates all alerts if that helps: |
Try |
Thank you @BuongiornoTexas! Yes, I noticed that issue in the docs about it looking for an object or array of objects. I’m going to continue later in the evening looking at the processors for influx to convert the array to an object or array of objects. I want to experiment with something like a cached set of alerts with true/false values so as new alerts come in, they get cached and show a false value so over time I can see which are flagging and choose to filter out ones that are not relevant. I’ll update here as I make progress. |
The array is unfortunately dynamically sized for only the alerts showing up and they won’t be in the same position even as the alerts present are constant over multiple calls. |
You could follow jasoncox's example and create a pypowerwall proxy page similar to vitals that returns a json dictionary of all active alerts, or maybe even all possible alerts and true/false values for the alerts that are active/inactive. This would deal with the problem of uniqueness. |
Didn’t think of that, I’ll check it out! |
Alright, so I got something working, but I'm still trying to figure out the best way of representing it in a Grafana panel. I ended up creating an additional proxy page, per @BuongiornoTexas's recommendation, that returns a dictionary with a value of 1 for each alert that is currently on. I then added a starlark processor to keep track of alerts that have flagged in the past, so that the metric still shows a value for alerts that no longer show up (there may be a way to utilize "null, undefined or NaN" values, but the couple of methods I tried didn't work as I had expected in the chart). I'm going to experiment more tomorrow with the Grafana component. Here are the minor updates I made in the meantime:
server.py
    elif self.path == '/alerts/pw':  # should this be a different url? I followed temps
        # Alerts in dictionary format
        pwalerts = {}
        alerts = pw.alerts()
        for alert in alerts:
            pwalerts[alert] = 1
        message = json.dumps(pwalerts)
Which shows the data at the endpoint /alerts/pw for now like this:
telegraf.local
[[inputs.http]]
urls = [
"http://pypowerwall:8675/alerts/pw"
]
name_override = "alerts"
method = "GET"
insecure_skip_verify = true
timeout = "4s"
data_format = "json"
[[processors.starlark]]
source = '''
state = {
    "last": {}
}
def dict_union(x, y):
    z = {}
    z.update(x)
    z.update(y)
    return z
def apply(metric):
    url = metric.tags.get("url")
    last = state["last"]
    if url and url == "http://pypowerwall:8675/alerts/pw":
        base = {x: 0 for x in metric.fields.keys()}  # For updating existing total key set
        current = {x: 1 for x in metric.fields.keys()}  # Currently flagging keys
        result = dict_union(last, current)
        state["last"] = dict_union(last, base)
        new_metric = Metric("all_alerts")
        for k, v in result.items():
            new_metric.fields[str(k)] = v
        return new_metric
    else:
        return metric
'''
This reformats the data so that alerts that no longer show up still have a key with a value of 0, for continued coverage (as long as telegraf is up). I then used the query SELECT *::field from raw.all_alerts and the "Status History" visualization to get this:
So there is still some work to do to finalize a couple of things in the visuals (to make sure the data format is right). Once things look good, I can submit this as an additional commented example in the telegraf.local.sample file, along with the new endpoint in server.py (and allow people to create their own visuals), or include a visual as well. That is, if this is something you think others will want to use; of course, it could just be for me, since I have so many errors all the time 😄 As a side note, I'm primarily a JavaScript developer, so I had to learn a few things to get familiar with the setup. If anything is done incorrectly, I apologize. I couldn't use any of my fancy object spread or array operators that I'm used to. |
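For anyone who wants to reason about the processor outside of Telegraf, the same bookkeeping can be sketched in plain Python. This is only an illustration of the idea (remember every alert ever seen, report 1 for alerts active now and 0 for the rest); the function name `track_alerts` and the sample polls are made up for the example and are not part of pypowerwall or Telegraf.

```python
def track_alerts(seen, active):
    """Return (fields, seen): every alert seen so far mapped to 1 (active now) or 0."""
    # Hypothetical helper: mirrors the starlark state dictionary with a plain set.
    seen = set(seen) | set(active)
    fields = {name: (1 if name in active else 0) for name in sorted(seen)}
    return fields, seen


# Sample polls are illustrative; the alert names are ones mentioned in this thread.
seen = set()
for poll in (["FWUpdateFailed", "PVInverterComms"], ["PVInverterComms"]):
    fields, seen = track_alerts(seen, poll)
    print(fields)
```

As in the starlark version, the remembered set lives only in memory, so it starts empty again whenever the process restarts.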
Nice job @DerickJohnson ! On the extension of server.py, I could accept that as a PR for pypowerwall if you wanted to submit it. I think the dynamic horizontal history graph is useful. I suggest removing the "OK" and "Err" text and just use a color vs. blank (transparent) to indicate state, since some alerts are actually positive (not Err). The only problem is that some of these alerts seem to have a long TTL. For example, "FWUpdateSucceeded" stays lit, similar to the "FWUpdateFailed". I suppose it would still be useful information to see (appear/disappear). I have also thought it would be nice to have a vertical scrolling time log of alerts. Each alert would have a timestamp of when it shows up, perhaps even events we define like (PW at 100%, Storm Event, Grid offline, Reserve Level changed). It could be joined to your panel. Something like this (ignore the data - just example): |
Absolutely, I haven't perfected that visual yet since what you mentioned is true about some alerts being positive when present so not really something to want to be alerted about (all the data is probably good to keep but not to show). I'll submit that PR and keep working on the visual. I like that scrolling time log idea as well. |
Hey Jason! I'm going to close this issue as the request is complete now. I ended up using the state transition visual to help me find out why (or at least get more information around) the system getting stuck in an infinite loop and not being able to operate. I think I narrowed it down to losing connection to the meters that do all the measurements. The PVInverter comms alert flagging seems to be a leading indicator before it dies (not just in this instance but in the many others I have). Here's a fun visual to show you what I was working with and how the alerts timeline is helping. You can see the disabledRelay and PVInverterComms around the time it goes out and the PVInverterComms never returns (the reason it stays red is because even after a reboot, it's there with systemConnectedToGrid as the first alerts). All others are reset: I also did a forced grid outage in there just to see what alerts came up :-). The blue color is for "informational" alerts as discussed earlier in the thread. I used overrides in Grafana for that. |
@DerickJohnson - This is amazing! Can you share the panel or dashboard JSON for anyone else wanting to set it up? Did this help you make a case with Tesla? |
Absolutely, here is the Panel JSON:
{
"id": 65,
"gridPos": {
"h": 16,
"w": 16,
"x": 0,
"y": 1
},
"type": "state-timeline",
"title": "Alerts",
"transformations": [],
"datasource": {
"type": "influxdb",
"uid": "q8odLDzgz"
},
"pluginVersion": "9.1.2",
"fieldConfig": {
"defaults": {
"custom": {
"lineWidth": 0,
"fillOpacity": 70,
"spanNulls": true
},
"color": {
"mode": "continuous-GrYlRd"
},
"mappings": [
{
"options": {
"0": {
"color": "transparent",
"index": 0
},
"1": {
"color": "red",
"index": 1
}
},
"type": "value"
},
{
"options": {
"match": "null+nan",
"result": {
"color": "transparent",
"index": 2
}
},
"type": "special"
}
],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "transparent",
"value": null
}
]
},
"unit": "none"
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "FWUpdateSucceeded"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "transparent",
"index": 1
},
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "GridCodesWrite"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "transparent",
"index": 1
},
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "PINV_a067_overvoltageNeutralChassis"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "transparent",
"index": 1
},
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "POD_w110_SW_EOC"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "transparent",
"index": 1
},
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "PVS_a019_MciStringC"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "transparent",
"index": 1
},
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "PVS_a020_MciStringD"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "transparent",
"index": 1
},
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "PodCommissionTime"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "SYNC_a001_SW_App_Boot"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "transparent",
"index": 1
},
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "SYNC_a044_IslanderDisconnectWithin2s"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "transparent",
"index": 1
},
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "SystemConnectedToGrid"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "transparent",
"index": 1
},
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "THC_w061_CAN_TX_FIFO_Overflow"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "transparent",
"index": 1
},
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
},
{
"matcher": {
"id": "byName",
"options": "PodCommissionTime"
},
"properties": [
{
"id": "mappings",
"value": [
{
"options": {
"0": {
"color": "transparent",
"index": 1
},
"1": {
"color": "blue",
"index": 0
}
},
"type": "value"
}
]
}
]
}
]
},
"options": {
"mergeValues": true,
"showValue": "never",
"alignValue": "left",
"rowHeight": 0.51,
"legend": {
"showLegend": true,
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "influxdb",
"uid": "q8odLDzgz"
},
"groupBy": [
{
"params": [
"$__interval"
],
"type": "time"
},
{
"params": [
"null"
],
"type": "fill"
}
],
"hide": false,
"measurement": "alerts4",
"orderByTime": "ASC",
"policy": "raw",
"query": "SELECT *::field from raw.all_alerts where $timeFilter",
"rawQuery": true,
"refId": "A",
"resultFormat": "table",
"select": [
[
{
"params": [
"HighCPU"
],
"type": "field"
},
{
"params": [],
"type": "count"
}
]
],
"tags": []
}
]
}
The "all_alerts" measurement comes from the starlark processor I posted earlier in the thread. For most people it's probably unnecessary, and the query could target the "alerts" measurement directly. It was my attempt to keep alerts that disappear over time in the view (a 0 value is also easier to color-code than a value that doesn't show up). I'll let you know if it ends up helping them root cause it! |
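Most of the panel JSON above is the same "blue when 1, transparent when 0" override repeated once per informational alert. If the list grows, the overrides could be generated rather than hand-edited. The sketch below is one way to do that; the helper name `informational_override` and the short `INFORMATIONAL` list are assumptions for the example, and the output would still need to be pasted into the panel's "overrides" array by hand.

```python
import json


def informational_override(alert_name, color="blue"):
    """Build one byName override in the same shape as the panel JSON in this thread."""
    # Hypothetical helper, not part of Grafana or the dashboard repository.
    return {
        "matcher": {"id": "byName", "options": alert_name},
        "properties": [
            {
                "id": "mappings",
                "value": [
                    {
                        "options": {
                            "0": {"color": "transparent", "index": 1},
                            "1": {"color": color, "index": 0},
                        },
                        "type": "value",
                    }
                ],
            }
        ],
    }


# A few of the alerts treated as informational in the panel above.
INFORMATIONAL = ["FWUpdateSucceeded", "GridCodesWrite", "SystemConnectedToGrid"]
overrides = [informational_override(name) for name in INFORMATIONAL]
print(json.dumps(overrides, indent=2))
```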
Thanks @DerickJohnson! I love this. I added your telegraf/starlark processor to my test rig. I really like having the alert data in the dashboard! I believe this is handy enough to add to the standard telegraf.conf along with a basic panel at the bottom of the stock dashboard.json. I agree with you that we could just add the /alerts/pw data without the starlark processor for most use cases (but we could leave a commented out version in telegraf.local if someone wanted to easily turn it on). I picked a neutral "blue" for all alerts but I like "dimming" the normal alerts (a union between what you had in there and what I had) and somehow highlighting the rest. I removed the starlark processor and tested an "Off Grid" event. Because the processor wasn't sending in zeroes, it didn't detect the "SystemConnectedToGrid" had fallen off. I switched to basic thresholds and tweaked the alerts related to grid for more color. ;) The one issue I see is that the "state timeline" panel doesn't auto-size so in your case (someone who sees a lot of different alerts), you will get a compressed squashed list. It is easy for the user to expand but not as dynamic as I would like. In any case, I do think this would be useful to the community. Thank you! |
@jasonacox, that's amazing! I just pulled in your changes and will keep that panel as well. I like the aesthetics of the different shades/thresholding. This data has been super interesting, not only am I seeing a wide range of alerts over time (this is the laundry list I have): But also to see little "blips" that I wouldn't have noticed otherwise: Thanks again for all the work on this. It's been a lot of fun! |
I agree, I never knew about the blips! The time series logging of these alerts is brilliant. Thanks @DerickJohnson . It is super clear by the number of alerts that there is something seriously wrong with your system. I hope you are able to get it fixed soon! While it's no consolation, your broken system has been a treasure of discovery for the community. Thanks for telling your story and contributing to the dashboard! I've added the new alerts you mention above to our growing list on the pyPowerwall README. Thanks again! 🙏 |
I tried to add this to my system today, but I am either doing something wrong or there is an error in the code. I've made a lot of customizations to my setup, so I don't use the upgrade or install scripts, which could be my issue. I removed pypowerwall and recreated it. When I go to http://[IP]:8675/version I get: {"version": "22.26.2 8cd8cac4", "vint": 222602} I added the inputs.http to telegraf.conf as per the current code:
Nothing is going into InfluxDB and Telegraf is logging:
When I go to http://pypowerwall:8675/alerts/pw in my browser, I get a 404 Not Found error and a redirect to the main :8675/ page When I go to http://pypowerwall:8675/alerts (without the /pw at the end) I get:
Do I have something configured wrong or should the telegraf.conf file not have that "/pw" on the end of the URL? Edit: I tried without the /pw and just got a different error.
Is this maybe a Powerwall+ only feature? I have two Powerwall 2's, but not the +. |
Hi @youzer-name - what pypowerwall version are you running?
# Show pypowerwall version
docker logs pypowerwall
Does it say
The http://pypowerwall:8675/alerts/pw endpoint was only added in t24. I am wondering if you might need to pull the latest version since you received a 404 error for that url. |
@mcbirse |
Thanks @mcbirse ! @youzer-name FYI - You can also get the version of pypowerwall by hitting the endpoint: http://pypowerwall:8675/stats
If it won't pull the latest, the way to upgrade pypowerwall (which is in upgrade.sh):
# stop and delete pypowerwall
docker stop pypowerwall
docker rm pypowerwall
docker images | grep pypowerwall | awk '{print $3}' | xargs docker rmi -f
# restart stack
./compose-dash.sh up -d
BTW, I would love to see screenshots of Alerts to see how they differ between systems. Over time we can better color code the alerts. |
Installed just before 5pm. At 6:56pm I put a kettle on, which triggered the change in state in RealPowerAvailableLimited, though the reported SOC was still 100%. That was the end of export for the day. The Powerwall was discharging from that point, feeding the home, but the reported SOC didn't drop below 100% until 7:56pm, when POD_w110_SW_EOC dropped off. |
Nice!! Thanks @wreiske - you have some interesting things going on there! One bug that I have found with the state-timeline graph (it is beta after all) is that it will glitch and not always align the labels to the lines if you zoom the browser window. I noticed that happened to you when you posted that. A refresh will fix it but is a bit annoying. Still, the data is awesome to have in a graph form. |
Just a quick note - I know this is a closed issue, but this is where the discussion is, and I figured people who cared would see this: I had the firmware upgrade to 22.26.5 today, and this is the alerts panel covering the outage: Of most interest to me is the Max CPU alert for about 10 minutes after the upgrade. |
@BJReplay - that's cool, thanks for posting! I think it's helpful to post alert examples that are seen under certain conditions. I have some examples from today also, and not sure where else to post these, so this seems like a good place. Just curious, have you updated to the latest Oh and, just noting your firmware upgrade - guessing mine may be imminent in that case as I think you are in Aus too? Anyway, today I had some electrical work being done on my house, so it was a great opportunity to obtain some alert data from the Powerwall. The electrician had to shut off the grid supply, as well as switch off all breakers in the gateway. My server running Powerwall-Dashboard is on a UPS though, and connected via ethernet direct to the gateway - so monitoring was still active (at least until the server shutdown after about 50mins when the UPS got low on battery). Below are the alerts received during this scenario (power switched off just after 10:40am to just before midday). And what the Tesla app was showing: I also have a shell script running that polls the gateway for some basic data such as grid status etc. and sends alert e-mails on changes. This highlighted some never seen before grid status values of:
As below:
The system switched back and forth between
Unfortunately I did not get any alert data or grid status codes from when all the power was turned back on, as my server had shut down by that time. |
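The "poll the gateway and send an e-mail on changes" approach described above comes down to comparing each reading with the previous one. A minimal Python sketch of just that comparison is below. It is not the shell script from the comment; `status_changes` is a made-up name, and the readings are placeholder values ("UP", "DOWN" and None) rather than the actual status strings seen during the outage.

```python
def status_changes(samples):
    """Yield (previous, current) each time a reading differs from the one before it."""
    # Hypothetical helper: the first reading only sets the baseline.
    previous = None
    first = True
    for current in samples:
        if not first and current != previous:
            yield previous, current
        previous = current
        first = False


# Placeholder readings; a real poller would fetch these from the gateway on a timer.
readings = ["UP", "UP", "DOWN", "DOWN", None, "UP"]
for old, new in status_changes(readings):
    print(f"grid status changed: {old} -> {new}")
```

In a real poller, the body of the loop is where the notification (e-mail or otherwise) would be sent.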
@BJReplay Yes! I see the Max CPU alert all the time when the gateway has been rebooted (and you see "updating devices" on the gateway 192.168.91.1 page). @mcbirse I also had a bunch of those same alerts when my system unexpectedly went down during a grid outage (the wait for user one I remember specifically). Really cool to see all the different alert scenarios. It's definitely helped me understand more about the system operation. For example, I think the _EOC alert we see might mean "end of charge" or something like that, since it happens when my battery is full. I also found a couple of explanations for other alerts like "battery unexpected power" in this document: https://sunbridgesolar.com/wp-content/uploads/2021/03/Tesla_Powerhub_Manual_User.pdf. Most are the self-explanatory ones, but a couple were helpful. |
@mcbirse No, I haven't - I have a heavily customised dashboard, and I haven't bothered. I guess I will load it, save the panel as a library panel, then load that panel into my customised panel. |
@BJReplay - no worries, all good. I have a custom dashboard setup as well. Typically I merge new changes into mine in a somewhat manual process... by comparing the .json files in VSCode (or BeyondCompare sometimes) and then merging the new elements I want. Considering yours is heavily customised though, it might be easier to just edit your alerts panel manually if you like and add the rename by regex, as below: |
@mcbirse and @BJReplay - this is gold! Thanks for documenting. @DerickJohnson It would be good to capture some of the info in the Alerts Table in that document - seems most of it would apply to residential Powerwall users too. I have been documenting Device names and Alerts as I discover them here: https://github.com/jasonacox/pypowerwall#devices-and-alerts - I'll try to add what makes sense.
@mcbirse great discovery!! This is something I should add to pypowerwall's |
Hi @jasonacox - I agree that makes sense, those status responses should be classified as "DOWN" rather than returning Null/None. The Tesla app itself was displaying "Grid outage" but with some extra text after that in this case of "Powerwall Inactive". Regardless, it classified it as a grid outage as well. |
Thanks for the Alerts feature, it seems to have caught one of my Powerwalls dying.
Thought others might find this info useful. |
Thanks @ibmaster ! |
Hi Jason!
I love the work you've done with pypowerwall and the dashboard. It's helped IMMENSELY when trying to talk to support when they often don't have the information they need to help.
I was wondering if there was an easy way to add the alerts information from :8675/vitals to the influxdb for monitoring long term (to see how they change over time in different scenarios). I want to try and create a panel for the information but I didn't see it in the current set of vitals.
My powerwall seems to get stuck in infinite loops trying to update firmware and then failing to update (FWUpdateFailed) bringing the whole system down. I use the CLI to pull the alerts but they change constantly. It would be nice to capture how they change in the dashboard to see if there are any patterns. Let me know if there's an easy way to get these into influx.
Thank you again for all the great work!