Line protocol write API #2696
This changes the implementation of point to minimize the extra processing needed to parse and marshal point data through the system.
Measurement, tags, tag values, field names, and field values should all be able to have spaces, providing they are escaped. I assume the list of characters to escape are The default consistency should be changed to |
Overall looks good except for the updates to support spaces. Can you also update the Go client to use this endpoint and format instead? |
Ok. I'll fix the escaping and default consistency. I'll do a separate pr for the client. |
Good idea. Simple, but we can still curl it up. |
Perhaps that first sentence should just be |
Will this line protocol support Unicode characters in strings or only ASCII? |
@beckettsean you should note prominently in the docs that the tags should be sorted by the tag keys. |
Sorted for optimal performance, right? If they aren't sorted, we will sort them, but I think that currently drops performance by 50%? |
@beckettsean I added a test for unicode just now. Seems to work for string field values at least. I'll need to check the other spots though. |
@pauldix For the docs, Tag Keys should be ASCII sorted? Alpha sorted? basically, what's the order for the following characters:
Are any of those illegal characters for this protocol? |
@corylanou that's right. In the docs for the protocol we should push them in the right direction. Meaning they should sort the tags. Also, even though the timestamp is optional, we should push them to include it. This becomes important if they're running a cluster and they get a response back on the client side that said it was only a |
@pauldix we will make clear in the docs the potential gotchas with not supplying a timestamp. The particular issue you describe only affects clusters, correct? For single node setups the writes are atomic so no |
@beckettsean technically it's still possible with a single server if they're writing data with different timestamps that end up dividing the points across multiple shards. However, if they do a write of a bunch of points with no timestamp specified, and it's a single server, then a partial write isn't possible. It'll either succeed entirely or fail i.e. Atomic. |
Similar to what @pauldix mentioned. It would be great if measurement, tag key, tag values, field keys, and field values all followed the query language's
And string literals are single quoted. Not sure if those rules make sense for the line format, but users will probably expect it to be similar to the query language identifier definition. |
how about rather than a new endpoint, continue to use |
@gunnaraasen Those rules don't quite make sense for the line format, mainly because the format is so strict that we know where the tag keys vs tag values are. The query language is much more flexible so it has more limitations. Requiring double quotes around the identifiers (measurement names, tag keys, and field keys) would be unnecessary for the line protocol for writing and would bloat the message. |
@neonstalwart the problem is that previously we didn't require a Going forward, the JSON write endpoint is going to be deprecated and this new endpoint is going to be the preferred way of writing data in. Also, in my experience, many people have trouble interacting with HTTP APIs that require you to set things in the headers. I know it's part of the thing, but if we're optimizing for ease of use, a different endpoint is the best. |
that's kind of a weak argument given timestamp -> time, name -> measurement. |
@neonstalwart I guess that's true since we broke it before. Not buying the usability argument? :) |
😞 would you reconsider? i think HTTP+JSON is more or less the de facto way i interact with things these days. it's of course not the only way but with the move towards putting each service in its own container and using HTTP to communicate between components it seems very common in my experience.
i agree that many people find it difficult to use HTTP. i just think that a |
That sounds like a huge benefit, and overall I'm definitely a fan of the line protocol approach for this use case (and of course the performance impacts as a result). However I'm wondering if this series-key optimization may be problematic if clients aren't pre-sorting the tags as suggested for performance purposes. If the parser re-sorts if necessary, but this shortcut is being used, wouldn't these keys still be the unsorted version? In the clustering redesign blog post, an assumption was made that equal points are duplicates. Would there be issues with receiving these two ('equal') points, but not really considering them duplicates?
It seems highly unlikely that a client would send 'duplicates' in different orders like this, but not knowing the full impact of the assumptions being made around these keys and handling of duplicates, my developer's spidey sense was tingling with potential edge case issues when I read the above quote. |
The problem you speak of @allgeek is accounted for deep in the system. Rest-assured the two example points you show are considered the same point, and the tags are sorted by key on every point before performing the identity check: https://github.com/influxdb/influxdb/blob/master/tsdb/meta.go#L1099 |
@allgeek For best performance, you should send them in pre-sorted if you can. If they are not sorted, they will be sorted before being stored. If they are already sorted, we don't attempt a sort. The parsing throughput drops by approximately 50% for unsorted tags, and the cost is proportional to the number of tags present. It is still ~16x faster than JSON though and moves the bottleneck closer to the disks. The two points you show would have the same key but different timestamps since a timestamp is not shown. They would be two different points in the same series. If they both had the same timestamp, then they would be duplicate points. |
@jwilder is correct to point out the requirement for an identical timestamp. Just to be clear, I assumed the same timestamp for each point, in your example. |
Directly related to this PR, I hacked out a quick rubygem that facilitates using the LineProtocol. I wrote it so that I can follow this up with easy integration into our sensu infrastructure. Comments/concerns/complaints/PR's are warmly welcomed: https://github.com/randywallace/influxdb-lineprotocol-writer-ruby From my testing, it does correctly sort tag keys automatically. I spent about 4 hours on this, and in that time couldn't find a reasonable way to get nanosecond/microsecond precision in ruby (although for our use cases it isn't at all useful); I also didn't test SSL. If anyone wants to help with that, it's appreciated. |
spaces within strings (ie within double quotes) seem to need escaping, is this normal ?
is this normal ? why is this escaping needed ? |
@xfmoulet Tag values should not be double-quoted. Just escape the spaces with
|
@jwilder I would like to revisit the assumptions made above considering the recent improvements to JSON parsing claimed here: https://github.com/buger/jsonparser#benchmarks Rather than abandoning JSON support entirely would it be better to just use a faster library? |
@edlane The JSON endpoint was disabled back in |
@jwilder Yes, That is why I asked the question. |
@edlane We've moved on from JSON on the write path and are very unlikely to add it back. We have a proposal for a v2 line protocol that we're considering as well. JSON also presents some problems with sending We have the same performance/memory issues on the query side related to JSON and have been adding support for other formats csv, msgpack. Switching the marshaller to something more performant for the query side might be worthwhile though. |
The which appears fixable here: |
pyformance sends nan values to influxdb I use influxdb to retrieve the Sawtooth parameters to display them on Grafana. But when I launch my Sawtooth validator component I get this error "Warning Influx: Bad requests". |
I would just like to confirm again with InfluxDB 2.0: there is no longer any possibility of writing data into the server in JSON format, correct? Only via the line protocol as shown in the documentation with cURL? My attempt with Postman using JSON format was also unsuccessful. This was done by sending an additional header "Content-Type: application/json" |
@yonglizhong correct, the API expects line protocol as a |
@jwilder If the key of a tag has a newline character, the value of the tag is not displayed. Hope to solve this problem, thank you. |
This PR adds a new write HTTP endpoint (`/write_points`) that uses a text-based line protocol instead of JSON. The protocol is a list of points separated by newlines (`\n`).
Each point is composed of three blocks separated by whitespace. The first block is the measurement name and tags separated by commas. The second block is fields separated by commas. The last block is optional and is the timestamp for the point as a unix epoch in nanoseconds.
Each point must have a measurement name. Tags are optional. Measurement names, tags and values cannot contain spaces. If the value contains a comma, it needs to be escaped with `\,`.
Each point must have at least one value. The format of a field is `name=value`. Fields can be one of four types: integer, float, boolean or string. Integers are all numeric and cannot have a decimal point (`.`). Floats are all numeric and must have a decimal point. Booleans are the values `true` and `false`. Strings must be surrounded by double quotes (`"`). If the value contains a quote, it must be escaped (`\"`). There can be no spaces between consecutive field values.
For example,
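A rough Go sketch of how a client might build one such line under the rules above. `formatPoint`, `formatField`, and the sample names are hypothetical and not part of this PR; escaping spaces with a backslash is an assumption taken from the review discussion, and tags are emitted sorted by key as recommended for performance.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// escaper backslash-escapes commas (per the PR description) and spaces
// (an assumption based on the review comments, not confirmed by the PR text).
var escaper = strings.NewReplacer(",", `\,`, " ", `\ `)

// formatField renders one field value following the four field types.
// NaN and infinite floats are not handled in this sketch.
func formatField(v interface{}) string {
	switch x := v.(type) {
	case int64:
		return strconv.FormatInt(x, 10) // integers: no decimal point
	case float64:
		s := strconv.FormatFloat(x, 'f', -1, 64)
		if !strings.Contains(s, ".") {
			s += ".0" // floats must carry a decimal point
		}
		return s
	case bool:
		return strconv.FormatBool(x)
	case string:
		return `"` + strings.ReplaceAll(x, `"`, `\"`) + `"`
	default:
		return fmt.Sprintf("%v", x)
	}
}

// sortedKeys returns the map's keys in byte-wise ascending order.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// formatPoint builds "measurement,tags fields [timestamp]".
// A ts of 0 means "omit the timestamp" (a convention of this sketch only).
func formatPoint(measurement string, tags map[string]string, fields map[string]interface{}, ts int64) string {
	var b strings.Builder
	b.WriteString(escaper.Replace(measurement))
	for _, k := range sortedKeys(tags) {
		b.WriteString("," + escaper.Replace(k) + "=" + escaper.Replace(tags[k]))
	}
	b.WriteByte(' ')
	for i, k := range sortedKeys(fields) {
		if i > 0 {
			b.WriteByte(',')
		}
		b.WriteString(escaper.Replace(k) + "=" + formatField(fields[k]))
	}
	if ts != 0 {
		b.WriteString(" " + strconv.FormatInt(ts, 10))
	}
	return b.String()
}

func main() {
	line := formatPoint("cpu",
		map[string]string{"region": "us west", "host": "serverA"},
		map[string]interface{}{"value": 0.64, "count": int64(3), "ok": true},
		1434055562000000000)
	fmt.Println(line)
}
```

Fields are sorted here only to make the output deterministic; the PR description asks for sorted tags, not sorted fields.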
Points written in this format should be sent to the `/write_points` endpoint. The request should be a `POST` with the points in the body of the request. The content can also be `gzip` encoded.
The following URL params may also be sent:
- `db`: required. The database to write points.
- `rp`: optional. The retention policy to write points. If not specified, the default retention policy will be used.
- `precision`: optional. The precision of the time stamps (`n`, `u`, `ms`, `s`, `m`, `h`). If not specified, `n` is used.
- `consistency`: optional. The write consistency level required for the write to succeed. Can be one of `one`, `any`, `all`, `quorum`. Defaults to `all`.
- `u`: optional. The username for authentication.
- `p`: optional. The password for authentication.
A successful response to the request will return a `204`. If a parameter or point is not valid, a `400` will be returned.
PR Notes:
The parser has been tuned to minimize allocations and extra work during parsing. For example, the raw byte slice read in is held onto as much as possible until there is a need to modify it. Similarly, values are not unmarshaled into Go types until necessary. It also tries to validate the input using a single pass over the data as much as possible. Tags need to be sorted so it is preferable to send them in already sorted to avoid sorting on the server. The sort has been tuned as well so that it performs consistently over a large range of inputs.
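To illustrate the "send tags already sorted" point, here is a minimal Go sketch of the check-then-sort idea: sort by tag key only when the input is out of order. This is not the tuned sort from the PR; the `tag` type and `ensureSorted` are hypothetical names, and the comparison is plain byte-wise string ordering.

```go
package main

import (
	"fmt"
	"sort"
)

// tag is a hypothetical key/value pair used only for this sketch.
type tag struct{ key, value string }

// ensureSorted sorts tags by key only when they are out of order,
// and reports whether a sort was actually needed.
func ensureSorted(tags []tag) bool {
	less := func(i, j int) bool { return tags[i].key < tags[j].key }
	if sort.SliceIsSorted(tags, less) {
		return false // already sorted: skip the extra work
	}
	sort.Slice(tags, less)
	return true
}

func main() {
	tags := []tag{{"region", "uswest"}, {"host", "serverA"}}
	fmt.Println(ensureSorted(tags), tags) // first call has to sort
	fmt.Println(ensureSorted(tags), tags) // second call finds them sorted
}
```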
My local benchmarks have parsing performing at around 750k-2M points/sec depending on the shape of the point data.
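The write request described in this PR (a `POST` to `/write_points` with `db` and optional `precision` as URL params, `204` on success, `400` on invalid input) could be exercised from Go roughly as follows. This is a sketch under assumptions: `writePoints` and `newStubServer` are hypothetical helpers, the `text/plain` content type is a guess, and the stub server only stands in for InfluxDB so the example runs on its own.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// writePoints POSTs line-protocol text to <baseURL>/write_points.
// The content type is an assumption; the PR text does not specify one.
func writePoints(baseURL, db, precision, lines string) error {
	q := url.Values{}
	q.Set("db", db)
	if precision != "" {
		q.Set("precision", precision)
	}
	resp, err := http.Post(baseURL+"/write_points?"+q.Encode(), "text/plain", strings.NewReader(lines))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent {
		msg, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("write failed: %s: %s", resp.Status, strings.TrimSpace(string(msg)))
	}
	return nil
}

// newStubServer mimics only the status codes described in the PR:
// 204 for a POST with a db param, 400 otherwise. It does no parsing.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/write_points" || r.URL.Query().Get("db") == "" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	// Second-precision timestamp, so precision=s is passed explicitly.
	err := writePoints(srv.URL, "mydb", "s", "cpu,host=serverA value=0.64 1434055562\n")
	fmt.Println("write error:", err)
}
```

Against a real server, `srv.URL` would be replaced by the server's base URL; the other params (`rp`, `consistency`, `u`, `p`) could be added to the query the same way.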