[docs] Add benchmarks docs #721 (#738)
add performance docs
michaelvlach authored Sep 17, 2023
1 parent 0f0bb50 commit 7af754f
Showing 4 changed files with 700 additions and 4 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -10,6 +10,7 @@ The Agnesoft Graph Database (aka _agdb_) is persistent memory mapped graph datab
- [Roadmap](#roadmap)
- [Reference](#reference)
- [Efficient agdb](#efficient-agdb)
- [Performance](#performance)
- [Concepts](#concepts)
- [Queries](#queries)
- [But why?](#but-why)
@@ -116,6 +117,8 @@ The following are planned features in priority order:

## [Efficient agdb](docs/efficient_agdb.md)

## [Performance](docs/performance.md)

## [Concepts](docs/concepts.md)

## [Queries](docs/queries.md)
8 changes: 4 additions & 4 deletions docs/but_why.md
@@ -64,10 +64,10 @@ The one file is the database and the data.

# What about sharding, replication and performance at scale?

Most databases tackle the issue of (poor) performance at scale by scaling up using replication/sharding strategies. While these techniques are definitely useful and they are planned for `agdb` they should be avoid as much as possible. The increase in complexity when using replication or sharding is dramatic and it is only worth it if there is no other choice.
Most databases tackle the issue of (poor) performance at scale by scaling up using replication/sharding strategies. While these techniques are definitely useful, and they are planned for `agdb`, they should be avoided as much as possible. The increase in complexity when using replication and/or sharding is dramatic and it has an adverse performance impact, meaning it is only worth it if there is no other choice.

The `agdb` is designed so that it performs well regardless of its size. Most read operations are O(1) and there is no limit on concurrency on them. Most write operations are O(1) amortized. The O(n) complexities are limited to individual node traversals, e.g. reading a 1000 connected nodes will take 1000 O(1) operations = O(n) same as reading 1000 rows in a table. However if you structure your data well (meaning you do not blindly connect everything to everything) you can have as large data set as your hardware can fit without issues if you can query only subset of the graph (subgraph) since your query will have performance based on that subgraph and not all the data stored in the database.
The `agdb` is designed so that it performs well regardless of the data set size. Direct access operations are O(1) and there is no limit on their concurrency. Write operations are O(1) amortized, however they are exclusive - there can be only one write operation running on the database at any given time, preventing any other read or write operations from running at the same time. You will still get O(n) complexity when searching the (sub)graph, as reading 1000 connected nodes will take 1000 O(1) operations = O(n), same as reading 1000 rows in a table. However, if the data does not indiscriminately connect everything to everything, one can have as large a data set as the hardware can fit without performance issues. The key is querying only a subset of the graph (a subgraph), since the query's performance will be based on that subgraph and not on all the data stored in the database.

The point here is that you will need to scale out only when your database starts exceeding limits of a single machine. Adding data replication/backup will be relatively easy feature. Sharding would be only slightly harder but the database has been written in a way that it can be used relatively easily. The obvious downside is the huge performance dip for such a setup. To alleviate this the local caches could be used but as mentioned this only further adds to complexity.
The point here is that scaling has a significant cost regardless of technology or clever tricks. Scaling techniques should be considered only when the database starts exceeding the limits of a single machine, because adding data replication/backup will mean a huge performance hit. To mitigate it to some extent, caching can be used, but it can never be as performant as a local database. So while the features "at scale" are definitely coming, you should avoid using them as much as possible even when available.

So while features "at scale" are definitely coming you should avoid using them as much as possible.
[For real world performance see the dedicated documentation.](performance.md)
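
To illustrate the "query only a subgraph" point from the paragraph above, here is a minimal, hypothetical sketch in Rust. It assumes the agdb `Db`/`QueryBuilder` API (node aliases, `exec`/`exec_mut`); exact method names, argument conversions and error handling may differ between agdb versions, so treat it as an illustration rather than authoritative documentation:

```rust
// Illustrative sketch only - builder method names are assumptions based on
// the agdb query builder pattern and may not match the current crate exactly.
use agdb::{Db, QueryBuilder};

fn main() {
    // Single-file, memory mapped database; created if it does not exist.
    let mut db = Db::new("subgraph_example.agdb").expect("failed to open database");

    // Insert three nodes identified by aliases.
    db.exec_mut(
        &QueryBuilder::insert()
            .nodes()
            .aliases(vec!["root", "a", "b"])
            .query(),
    )
    .expect("insert nodes failed");

    // Connect "root" to the other two nodes.
    db.exec_mut(
        &QueryBuilder::insert()
            .edges()
            .from("root")
            .to(vec!["a", "b"])
            .query(),
    )
    .expect("insert edges failed");

    // Search starting from "root" only: the cost of this query is
    // proportional to the subgraph reachable from that node, not to the
    // total amount of data stored in the database.
    let subgraph = db
        .exec(&QueryBuilder::search().from("root").query())
        .expect("search failed");

    println!("elements in subgraph: {}", subgraph.result);
}
```

The point of the sketch is the last query: because the search is anchored at a specific node, its cost scales with that node's reachable subgraph rather than with the database as a whole.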
