Line protocol write API #2696

Merged
merged 5 commits into from
May 29, 2015

jwilder
Contributor

@jwilder jwilder commented May 29, 2015

This PR adds a new HTTP write endpoint (/write_points) that uses a text-based line protocol instead of JSON. The protocol is a list of points separated by newlines (\n).

Each point is composed of three blocks separated by whitespace. The first block is the measurement name and tags separated by commas. The second block is fields separated by commas. The last block is optional and is the timestamp for the point as a unix epoch in nanoseconds.

measurement[,tag=value,tag2=value2...] field=value[,field2=value2...] [unixnano]

Each point must have a measurement name. Tags are optional. Measurement names, tag keys, and tag values cannot contain spaces. If a value contains a comma, the comma must be escaped as \,.

Each point must have at least one field. The format of a field is name=value. Fields can be one of four types: integer, float, boolean or string. Integers are all numeric and cannot have a decimal point (.). Floats are all numeric and must have a decimal point. Booleans are the values true and false. Strings must be surrounded by double quotes ("). If a string value contains a double quote, it must be escaped as \". There can be no spaces between consecutive field values.

For example,

cpu,host=serverA,region=us-west value=1.0 10000000000
cpu,host=serverB,region=us-west value=3.3 10000000000
cpu,host=serverB,region=us-east user=123415235,event="overloaded" 20000000000
mem,host=serverB,region=us-east swapping=true 2000000000
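
As a minimal sketch of the format described above, the following Python builds one line from a measurement, tags, fields, and an optional timestamp. The function names `format_value` and `format_point` are hypothetical and not part of this PR; the sketch follows only the rules in this description (commas in tag values escaped as \,, double quotes in strings escaped as \") and sorts tags by key before emitting them.

```python
def format_value(value):
    """Render one field value as one of the four types in the description."""
    if isinstance(value, bool):
        # Check bool before int: in Python, bool is a subclass of int.
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)  # integers: digits only, no decimal point
    if isinstance(value, float):
        # repr() keeps a decimal point for ordinary magnitudes (1.0 -> "1.0").
        # Very large or small floats render in exponent form, and NaN/inf
        # are not handled; this sketch does not try to cover those cases.
        return repr(value)
    # Anything else is treated as a string: quote it and escape inner quotes.
    return '"' + str(value).replace('"', '\\"') + '"'


def format_point(measurement, tags, fields, timestamp=None):
    """Build 'measurement[,tag=value...] field=value[,...] [timestamp]'."""
    if not fields:
        raise ValueError("a point needs at least one field")
    key = measurement
    for name in sorted(tags):  # emit tags sorted by key
        key += "," + name + "=" + str(tags[name]).replace(",", "\\,")
    body = ",".join(n + "=" + format_value(v) for n, v in fields.items())
    line = key + " " + body
    if timestamp is not None:
        line += " " + str(timestamp)
    return line


line = format_point(
    "cpu",
    {"region": "us-east", "host": "serverB"},
    {"user": 123415235, "event": "overloaded"},
    20000000000,
)
```

With these inputs, `line` matches the third example point above, with the tags reordered so that host comes before region.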

Points written in this format should be sent to the /write_points endpoint. The request should be a POST with the points in the body of the request. The content can also be gzip encoded.

The following URL params may also be sent:

  • db: required The database to write points
  • rp: optional The retention policy to write points. If not specified, the default retention policy will be used.
  • precision: optional The precision of the timestamps (n, u, ms, s, m, h). If not specified, n is used.
  • consistency: optional The write consistency level required for the write to succeed. Can be one of one, any, all, quorum. Defaults to all.
  • u: optional The username for authentication
  • p: optional The password for authentication

A successful response to the request will return a 204. If a parameter or point is not valid, a 400 will be returned.
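
A rough sketch of how a client might assemble such a request with Python's standard library is shown below. The helper name `build_write_request`, the host `http://localhost:8086`, and the database name `mydb` are assumptions for illustration. Sending a `Content-Encoding: gzip` header with a compressed body is also an assumption based on normal HTTP practice; the description only says the content can be gzip encoded.

```python
import gzip
import urllib.parse
import urllib.request


def build_write_request(base_url, db, lines, rp=None, precision=None,
                        consistency=None, compress=False):
    """Build (but do not send) a POST to /write_points with the given points."""
    params = {"db": db}  # db is the only required parameter
    if rp is not None:
        params["rp"] = rp
    if precision is not None:
        params["precision"] = precision
    if consistency is not None:
        params["consistency"] = consistency
    url = base_url.rstrip("/") + "/write_points?" + urllib.parse.urlencode(params)

    body = "\n".join(lines).encode("utf-8")  # points are separated by newlines
    headers = {}
    if compress:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"  # assumed header, see lead-in
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_write_request(
    "http://localhost:8086",  # hypothetical server address
    "mydb",                   # hypothetical database name
    ["cpu,host=serverA,region=us-west value=1.0 10"],
    precision="s",
)
```

Passing `req` to `urllib.request.urlopen` would send it. Per the description, a successful write returns 204 and an invalid parameter or point returns 400, which `urlopen` surfaces as an `HTTPError`.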


PR Notes:

The parser has been tuned to minimize allocations and extra work during parsing. For example, the raw byte slice read in is held onto as much as possible until there is a need to modify it. Similarly, values are not unmarshaled into Go types until necessary. It also tries to validate the input using a single pass over the data as much as possible. Tags need to be sorted so it is preferable to send them in already sorted to avoid sorting on the server. The sort has been tuned as well so that it performs consistently over a large range of inputs.

My local benchmarks show parsing at around 750k to 2M points/sec, depending on the shape of the point data.

jwilder added 3 commits May 29, 2015 11:18
This changes the implementation of point to minimize the extra
processing needed to parse and marshal point data through the system.
@pauldix
Member

pauldix commented May 29, 2015

Measurement names, tag keys, tag values, field names, and field values should all be able to contain spaces, provided they are escaped. I assume the list of characters to escape is: comma (,), double quote ("), equals sign (=), and space.

The default consistency should be changed to one.

@pauldix
Member

pauldix commented May 29, 2015

Overall looks good except for the updates to support spaces. Can you also update the Go client to use this endpoint and format instead?

@jwilder
Contributor Author

jwilder commented May 29, 2015

Ok. I'll fix the escaping and default consistency. I'll do a separate pr for the client.

@otoolep
Contributor

otoolep commented May 29, 2015

Good idea. Simple, but we can still curl it up.

@beckettsean
Contributor

"The last block is optional and is the timestamp for the point as a unix epoch in nanoseconds." and "precision: optional The precision of the time stamps (n, u, ms, s, m, h). If not specified, n is used." seem to be in conflict with each other.

Perhaps that first sentence should just be "The last block is optional and is the timestamp for the point as a unix epoch." The documentation for the optional precision query param does say the default is nanoseconds if none is supplied.
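
To make the interaction between the timestamp block and the precision parameter concrete, here is a small sketch that converts a nanosecond epoch into the unit named by precision. It assumes the conventional reading of the unit letters (u as microseconds, m as minutes, h as hours); the thread itself only confirms that n means nanoseconds. The names `NANOS_PER_UNIT` and `to_precision` are hypothetical.

```python
# Nanoseconds in one unit of each precision value (assumed meanings).
NANOS_PER_UNIT = {
    "n": 1,
    "u": 1_000,
    "ms": 1_000_000,
    "s": 1_000_000_000,
    "m": 60 * 1_000_000_000,
    "h": 3_600 * 1_000_000_000,
}


def to_precision(unix_nanos, precision="n"):
    """Express a nanosecond epoch in the given precision, truncating."""
    return unix_nanos // NANOS_PER_UNIT[precision]


seconds = to_precision(10_000_000_000, "s")
```

Under these assumptions, the timestamp 10000000000 written with the default precision and the timestamp 10 written with precision=s would refer to the same instant.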

@beckettsean
Contributor

Will this line protocol support Unicode characters in strings or only ASCII?

@pauldix
Member

pauldix commented May 29, 2015

@beckettsean you should note prominently in the docs that the tags should be sorted by the tag keys.

@corylanou
Contributor

Sorted for optimal performance, right? If they aren't sorted, we will sort them, but I think that currently drops performance by 50%?

@jwilder
Contributor Author

jwilder commented May 29, 2015

@beckettsean I added a test for unicode just now. Seems to work for string field values at least. I'll need to check the other spots though.

jwilder added a commit that referenced this pull request May 29, 2015
@jwilder jwilder merged commit 99bc7d2 into alpha1 May 29, 2015
@jwilder jwilder deleted the jw-write-path branch May 29, 2015 20:38
@beckettsean
Contributor

@pauldix For the docs, Tag Keys should be ASCII sorted? Alpha sorted?

basically, what's the order for the following characters:

1 a A å Å _ - , . <space> <tab>

Are any of those illegal characters for this protocol?
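
The thread does not answer the ordering question directly. Purely as an illustration, if tag keys were compared byte by byte on their UTF-8 encoding (an assumption, not something confirmed here), the characters listed above would order as in this sketch:

```python
# The characters from the question; the non-ASCII ones are written as
# escapes so the precomposed code points are unambiguous.
candidates = ["1", "a", "A", "\u00e5", "\u00c5", "_", "-", ",", ".", " ", "\t"]

# Byte-wise comparison of the UTF-8 encoding (assumed comparison rule).
byte_order = sorted(candidates, key=lambda s: s.encode("utf-8"))
```

Under that assumption the order is: tab, space, comma, hyphen, period, 1, A, underscore, a, Å, å. Uppercase ASCII sorts before lowercase, and both non-ASCII letters sort after all ASCII characters.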

@pauldix
Member

pauldix commented May 29, 2015

@corylanou that's right. In the docs for the protocol we should push them in the right direction. Meaning they should sort the tags.

Also, even though the timestamp is optional, we should push them to include it. This becomes important if they're running a cluster and they get a response back on the client side that said it was only a partial write. In most cases they would want to repost the data. However, they'll potentially get duplicates. UNLESS they include the timestamp for each point in their request to write.
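
The duplicate problem can be shown with a toy model. It is not the server's storage code: it assumes a point is identified by its series key plus timestamp, and that a point sent without a timestamp is stamped from a server-side clock at write time. The names `write_batch` and `points_after_retry` are hypothetical.

```python
import itertools


def write_batch(store, points, server_clock):
    """Store points keyed by (series key, timestamp); last write wins."""
    for series_key, value, timestamp in points:
        if timestamp is None:
            # No client timestamp: the server assigns the current time.
            timestamp = next(server_clock)
        store[(series_key, timestamp)] = value


def points_after_retry(points):
    """Write the same batch twice, as a client would after a partial write."""
    store = {}
    server_clock = itertools.count(start=1)  # stand-in for wall-clock time
    write_batch(store, points, server_clock)
    write_batch(store, points, server_clock)  # the repost
    return len(store)


with_timestamp = [("cpu,host=serverA", 1.0, 10_000_000_000)]
without_timestamp = [("cpu,host=serverA", 1.0, None)]
```

In this model, reposting `with_timestamp` leaves one stored point because the second write overwrites the first, while reposting `without_timestamp` leaves two.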

@beckettsean
Contributor

@pauldix we will make clear in the docs the potential gotchas with not supplying a timestamp. The particular issue you describe only affects clusters, correct? For single node setups the writes are atomic so no partial write is possible, as far as I understand it.

@pauldix
Member

pauldix commented May 29, 2015

@beckettsean technically it's still possible with a single server if they're writing data with different timestamps that end up dividing the points across multiple shards.

However, if they do a write of a bunch of points with no timestamp specified, and it's a single server, then a partial write isn't possible. It'll either succeed entirely or fail, i.e., it's atomic.

@gunnaraasen
Contributor

Similar to what @pauldix mentioned. It would be great if measurement, tag key, tag values, field keys, and field values all followed the query language's identifier definition as closely as possible.

  • unquoted identifiers must start with an upper or lowercase ASCII character or "_"
  • unquoted identifiers may contain only ASCII letters, decimal digits, and "_"
  • double quoted identifiers can contain any unicode character other than a new line
  • double quoted identifiers can contain escaped " characters (i.e., \")

And string literals are single quoted.

Not sure if those rules make sense for the line format, but users will probably expect it to be similar to the query language identifier definition.

@neonstalwart
Contributor

how about rather than a new endpoint, continue to use /write but switch behavior based on the Content-Type header? this would be text/plain and the JSON would be application/json. that's a fairly common way to interact via HTTP.

@pauldix
Member

pauldix commented May 29, 2015

@gunnaraasen Those rules don't quite make sense for the line format mainly because this is so strict we know where the tag keys vs tag values are. The query language is much more flexible so it has more limitations.

Requiring double quotes around the identifiers (measurement names, tag keys, and field keys) would be unnecessary for the line protocol for writing and would bloat the message.

@pauldix
Member

pauldix commented May 29, 2015

@neonstalwart the problem is that previously we didn't require a Content-Type to be set, so that could break the existing write API for a bunch of people.

Going forward, the JSON write endpoint is going to be deprecated and this new endpoint is going to be the preferred way of writing data in.

Also, in my experience, many people have trouble interacting with HTTP APIs that require you to set things in the headers. I know it's part of the thing, but if we're optimizing for ease of use, a different endpoint is the best.

@neonstalwart
Contributor

"the problem is that previously we didn't require a Content-Type to be set, so that could break the existing write API for a bunch of people."

that's kind of a weak argument given timestamp -> time, name -> measurement.

@pauldix
Member

pauldix commented May 29, 2015

@neonstalwart I guess that's true since we broke it before. Not buying the usability argument? :)

@neonstalwart
Contributor

"Going forward, the JSON write endpoint is going to be deprecated and this new endpoint is going to be the preferred way of writing data in."

😞 would you reconsider? i think HTTP+JSON is more or less the de facto way i interact with things these days. it's of course not the only way but with the move towards putting each service in its own container and using HTTP to communicate between components it seems very common in my experience.

"Also, in my experience, many people have trouble interacting with HTTP APIs that require you to set things in the headers. I know it's part of the thing, but if we're optimizing for ease of use, a different endpoint is the best."

i agree that many people find it difficult to use HTTP. i just think that a /write that accepts json and a /write_points that accepts plain text is kind of awkward. to be kind of brutal, as a consumer considering a product, the API feels cheap (not well crafted and thought out) which leaves me wondering about the quality of the code. the API is your public face of this product.

@allgeek

allgeek commented Jun 9, 2015

"Everywhere under the hood we use a key to identify series. We use this in the underlying storage and we use it to route requests and writes within a cluster. In the case of the line protocol, that's the bytes up to the first space. We can get at it without doing any additional work or allocations."

That sounds like a huge benefit, and overall I'm definitely a fan of the line protocol approach for this use case (and of course the performance impacts as a result). However I'm wondering if this series-key optimization may be problematic if clients aren't pre-sorting the tags as suggested for performance purposes. If the parser re-sorts if necessary, but this shortcut is being used, wouldn't these keys still be the unsorted version? In the clustering redesign blog post, an assumption was made that equal points are duplicates. Would there be issues with receiving these two ('equal') points, but not really considering them duplicates?

cpu,host=serverA,region=uswest value=23.2

cpu,region=uswest,host=serverA value=23.2

It seems highly unlikely that a client would send 'duplicates' in different orders like this, but not knowing the full impact of the assumptions being made around these keys and handling of duplicates, my developer's spidey sense was tingling with potential edge case issues when I read the above quote.

@otoolep
Contributor

otoolep commented Jun 10, 2015

The problem you speak of @allgeek is accounted for deep in the system. Rest assured, the two example points you show are considered the same point, and the tags are sorted by key on every point before performing the identity check:

https://github.com/influxdb/influxdb/blob/master/tsdb/meta.go#L1099

@jwilder
Contributor Author

jwilder commented Jun 10, 2015

@allgeek For best performance, you should send them in pre-sorted if you can. If they are not sorted, they will be sorted before being stored. If they are already sorted, we don't attempt a sort. The parsing throughput drops by approximately 50% for unsorted tags and is proportional to the number of tags present. It is still ~16x faster than JSON though and moves the bottleneck closer to the disks.

The two points you show would have the same key but different timestamps, since no timestamp is shown. They would be two different points in the same series. If they both had the same timestamp, then they would be duplicate points.
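
A minimal sketch of why the two orderings end up with the same key: take the bytes up to the first space, split off the tags, and sort them by tag key. This is an illustration only, not the server's parser. The function name `canonical_series_key` is hypothetical, and the sketch assumes no escaped spaces or commas in the key.

```python
def canonical_series_key(line):
    """Return the series key of a line with its tags sorted by tag key."""
    key = line.split(" ", 1)[0]  # bytes up to the first space
    measurement, *tag_pairs = key.split(",")
    # Sort "tag=value" pairs by the tag name on the left of "=".
    tag_pairs.sort(key=lambda pair: pair.split("=", 1)[0])
    return ",".join([measurement] + tag_pairs)


key_a = canonical_series_key("cpu,host=serverA,region=uswest value=23.2")
key_b = canonical_series_key("cpu,region=uswest,host=serverA value=23.2")
```

Both calls produce `cpu,host=serverA,region=uswest`, so the two lines land in the same series. Whether they are duplicates then depends on the timestamp.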

@otoolep
Contributor

otoolep commented Jun 10, 2015

@jwilder is correct to point out the requirement for an identical timestamp. Just to be clear, I assumed the same timestamp for each point, in your example.

@randywallace

Directly related to this PR, I hacked out a quick rubygem that facilitates using the LineProtocol. I wrote it so that I can follow this up with easy integration into our sensu infrastructure. Comments/concerns/complaints/PR's are warmly welcomed: https://github.com/randywallace/influxdb-lineprotocol-writer-ruby

From my testing, it correctly sorts tag keys automatically.

I spent about 4 hours on this, and in that time couldn't find a reasonable way to get nanosecond/microsecond precision in Ruby (although for our use cases it isn't at all useful); I also didn't test SSL. If anyone wants to help with that, it's appreciated.

@xfmoulet

Spaces within strings (i.e., within double quotes) seem to need escaping. Is this normal?
Example:

    test,a="hello" value=1 --> 204 no content
    test,a="hello there" value=1 --> 400 bad request
    test,a="hello\ there" value=1 --> 204 no content

Is this normal? Why is this escaping needed?

@jwilder
Contributor Author

jwilder commented Jun 22, 2015

@xfmoulet Tag values should not be double-quoted. Just escape the spaces with \ and don't surround with double-quotes.

test,a=hello\ there value=1
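
A small helper along these lines could apply that escaping on the client. The name `escape_tag_value` is hypothetical. Escaping spaces follows the advice above; also escaping commas and equals signs is an assumption taken from the character list proposed earlier in this thread.

```python
def escape_tag_value(value):
    """Backslash-escape characters that would otherwise end a tag value."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


# Tag values are written bare, without surrounding double quotes.
line = "test,a=" + escape_tag_value("hello there") + " value=1"
```

Here `line` comes out as the line shown above, with a backslash before the space.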

@edlane

edlane commented Aug 17, 2016

@jwilder I would like to revisit the assumptions made above considering the recent improvements to JSON parsing claimed here: https://github.com/buger/jsonparser#benchmarks
JSON parsing performance has been an acknowledged embarrassment to the Go community in the past, losing out to Python, Ruby, Lua, ... and of course to C.

Rather than abandoning JSON support entirely, would it be better to just use a faster library?

@jwilder
Contributor Author

jwilder commented Aug 17, 2016

@edlane The JSON endpoint was disabled back in 0.11 and removed in 0.12.

@edlane

edlane commented Aug 17, 2016

@jwilder Yes, that is why I asked the question.

@jwilder
Contributor Author

jwilder commented Aug 17, 2016

@edlane We've moved on from JSON on the write path and are very unlikely to add it back. We have a proposal for a v2 line protocol that we're considering as well. JSON also presents some problems with sending int64 values because it only has float numbers.
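
The int64 concern can be illustrated with a minimal sketch, assuming a JSON consumer that decodes every number into a 64-bit float (many decoders do, though not all). A 64-bit float has a 53-bit significand, so integers beyond 2**53 do not all survive the round trip:

```python
# Smallest positive integer that a 64-bit float cannot represent exactly.
big = 2**53 + 1                    # 9007199254740993
round_tripped = int(float(big))    # what a float-only decoder would hand back

# The largest int64 value is affected too.
max_int64 = 2**63 - 1
max_round_tripped = int(float(max_int64))
```

In this sketch `round_tripped` is 9007199254740992, one less than `big`, and `max_round_tripped` is 2**63, which no longer fits in an int64.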

We have the same performance/memory issues on the query side related to JSON and have been adding support for other formats (CSV, msgpack). Switching the marshaller to something more performant for the query side might be worthwhile though.

@edlane

edlane commented Aug 17, 2016

@Mckane10

Mckane10 commented Jun 8, 2020

pyformance sends nan values to influxdb

I use influxdb to retrieve the Sawtooth parameters to display them on Grafana. But when I launch my Sawtooth validator component I get this error "Warning Influx: Bad requests".
Can you please help me?
If there is a correction method, which file is it on?

@yonglizhong

yonglizhong commented Jun 11, 2021

I would just like to confirm again for InfluxDB 2.0: it is no longer possible to write data to the server in JSON format, only via the line protocol as shown in the documentation with cURL, yes? My attempt with Postman using JSON format was also unsuccessful. I sent an additional header, "Content-Type: application/json".

@danxmoran
Contributor

@yonglizhong correct, the API expects line protocol as a text/plain request.

@YYF-CHINA

@jwilder If the key of a tag contains a newline character, the value of the tag is not displayed. I hope this problem can be solved, thank you.
