Visum is a Cost Optimization Platform for Cloud Native Applications. The first version of Visum takes a bottom-up approach and focuses on cost optimization of Apache Spark applications running on AWS. However, Apache Spark applications are a special case of Cloud Native Applications, so Visum can later be extended to handle any Cloud Native Application.
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.
*Figure: Apache Spark*
ESG research found that 43% of respondents consider the cloud their primary deployment environment for Apache Spark. This makes a lot of sense, because the cloud provides scalability, reliability, availability, and massive economies of scale. Another strong selling point of cloud deployment is the low barrier to entry in the form of managed services. Each of the Big 3 cloud providers has its own offering for running Apache Spark as a managed service: EMR on AWS, HDInsight on Azure, and Dataproc on GCP.
Running high-scale Apache Spark applications on public clouds is expensive. For example, on AWS EMR an application is billed roughly as execution time × number of cores × price per core-minute, plus the cost of storage and the managed service fee. Processing 1 TB of daily traffic can cost as much as 1M USD per year. At that scale, any problem or bad code change that adds 10% to the execution time costs an additional 100K USD per year, so the incentive to keep costs under control is very high.
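To make the arithmetic concrete, here is a back-of-the-envelope sketch of that cost model. All rates and cluster dimensions below are illustrative assumptions, not actual AWS pricing:

```scala
// Back-of-the-envelope EMR cost model: compute cost + storage + service fee.
// Every rate and dimension below is an illustrative assumption, not real AWS pricing.
object SparkCostEstimate {
  val PricePerCoreMinuteUsd = 0.0011    // assumed EC2 price per core-minute
  val EmrFeePerCoreMinuteUsd = 0.00026  // assumed EMR service fee per core-minute
  val StorageUsdPerYear = 50000.0       // assumed yearly S3/EBS cost

  def yearlyCostUsd(minutesPerDay: Double, cores: Int): Double = {
    val coreMinutesPerYear = minutesPerDay * cores * 365
    coreMinutesPerYear * (PricePerCoreMinuteUsd + EmrFeePerCoreMinuteUsd) + StorageUsdPerYear
  }

  def main(args: Array[String]): Unit = {
    val baseline = yearlyCostUsd(minutesPerDay = 1200, cores = 1600)
    val regressed = yearlyCostUsd(minutesPerDay = 1200 * 1.1, cores = 1600) // +10% runtime
    println(f"baseline:  $$${baseline}%,.0f / year")
    println(f"regressed: $$${regressed}%,.0f / year (waste: $$${regressed - baseline}%,.0f)")
  }
}
```

With these assumed numbers, the baseline lands near 1M USD per year, and a 10% runtime regression wastes roughly 95K USD per year, matching the order of magnitude above.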
Spark applications can use a dynamic allocation policy, meaning that an application may request resources at runtime when there is demand and give them back to the cluster when they are no longer used. Some applications go through such allocate/release cycles several times during a run. Spark Listeners and the Spark UI can help with cost tracking and observability. The following article shows how Spark Listeners and the Spark UI can be used for observability and cloud cost tracking; a minimal listener sketch follows the figure below.
*Figure: Monitoring Apache Spark Jobs*
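As a minimal sketch of that approach, the listener below accumulates the core-minutes consumed by executors as dynamic allocation adds and removes them; multiplying core-minutes by a price per core-minute yields a runtime cost estimate. The `SparkListener` callbacks and the `spark.dynamicAllocation.*` settings are standard Spark; the accounting logic and configuration values are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}
import scala.collection.concurrent.TrieMap

// Tracks how many core-minutes executors consume, so runtime cost can be
// estimated as core-minutes * price per core-minute.
class CoreMinutesListener extends SparkListener {
  private val started = TrieMap.empty[String, (Long, Int)] // executorId -> (startTime, cores)
  @volatile var coreMinutes: Double = 0.0

  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
    started.put(e.executorId, (e.time, e.executorInfo.totalCores))

  override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit =
    started.remove(e.executorId).foreach { case (start, cores) =>
      coreMinutes += (e.time - start) / 60000.0 * cores
    }
  // Note: executors still alive at shutdown are not flushed in this sketch.
}

object CostTrackingApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cost-tracking-demo")
      // Dynamic allocation: executors come and go with demand (illustrative bounds).
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "1")
      .config("spark.dynamicAllocation.maxExecutors", "50")
      .getOrCreate()

    val listener = new CoreMinutesListener
    spark.sparkContext.addSparkListener(listener)

    // ... run the actual job here ...

    spark.stop()
    println(f"consumed ~${listener.coreMinutes}%.1f core-minutes")
  }
}
```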
Performing cost optimization of Apache Spark is a hard task. Many things can potentially go wrong: executors can fail, latency to external data sources can increase, the nature of the input data can change, cloud APIs can be used incorrectly, the JVM can misbehave, and many more. One should be a world-class expert in all three domains: the JVM, Apache Spark, and a specific cloud provider such as AWS, Azure, or GCP.
Essentially, there are three problems when dealing with Spark application costs:

- **Reactive way of work.** Dev teams react to the past rather than anticipate the future. The cost optimization process starts only when cloud costs skyrocket and a lot of money has already been wasted. In fact, 'small' problems like 10% of additional execution time (which can amount to 100K USD per year and more) are never handled at all.

- **The highest level of expertise is required.** Optimizing the cost of Apache Spark applications requires deep, comprehensive knowledge in all three domains: Apache Spark, AWS APIs, and the JVM. In addition, when working with PySpark it is necessary to master Python as well. It is rare for one person to have all these skills.

- **Long time to fix.** Even when detected, cost optimization issues are not prioritized for handling. Dev teams work on what is urgent: designing and developing new features, handling product bugs, and so on. Cost optimization tasks fall into the non-functional bucket, and such tasks are hard to justify without an exact dollar figure for the waste. Even when prioritized, it takes time to find the problem, fix the code, and deploy it to production.
The whole idea of Visum is to find 'bad patterns' automatically. To do so, Visum intercepts events from the Spark scheduler, the JVM, and AWS, and performs stream analytics on them in real time. Visum performs all the steps usually performed by analytics pipelines: ingestion, normalization, enrichment, and pattern recognition (a schematic sketch follows the figure below).
*Figure: Visum Data Flow*
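Visum's internals are not public, so the following is only a schematic sketch of what such an ingestion → normalization → enrichment → pattern recognition pipeline could look like; every type and the sample 'idle executors' rule are hypothetical:

```scala
// Schematic pipeline: ingestion -> normalization -> enrichment -> pattern recognition.
// All types and the sample rule are hypothetical illustrations, not Visum's actual model.
case class RawEvent(source: String, name: String, timestampMs: Long, payload: Map[String, String])
case class NormalizedEvent(appId: String, kind: String, timestampMs: Long, metrics: Map[String, Double])
case class EnrichedEvent(event: NormalizedEvent, pricePerCoreMinuteUsd: Double)
case class DetectedIssue(appId: String, pattern: String, estimatedWasteUsd: Double)

object PatternPipeline {
  // Normalization: map heterogeneous raw events onto one schema.
  def normalize(raw: RawEvent): NormalizedEvent =
    NormalizedEvent(
      appId = raw.payload.getOrElse("appId", "unknown"),
      kind = s"${raw.source}/${raw.name}",
      timestampMs = raw.timestampMs,
      metrics = raw.payload.collect {
        case (k, v) if v.nonEmpty && v.forall(c => c.isDigit || c == '.') => k -> v.toDouble
      }
    )

  // Enrichment: attach pricing context (here a fixed assumed rate).
  def enrich(e: NormalizedEvent): EnrichedEvent =
    EnrichedEvent(e, pricePerCoreMinuteUsd = 0.0014)

  // Pattern recognition, example rule: idle executors burn money without doing work.
  def detect(e: EnrichedEvent): Option[DetectedIssue] =
    for {
      idle <- e.event.metrics.get("idleCoreMinutes") if idle > 0
    } yield DetectedIssue(e.event.appId, "idle-executors", idle * e.pricePerCoreMinuteUsd)

  def run(events: Iterator[RawEvent]): Iterator[DetectedIssue] =
    events.map(normalize).map(enrich).flatMap(e => detect(e).iterator)
}
```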
In the final step, Visum generates two reports. The first is a detailed report of detected issues, with the estimated wasted cost of each issue, a reference to the source code where the issue occurred, and a link to the knowledge base that explains the problem.
*Figure: Visum Waste Report*
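For illustration only, an entry in such a report could carry fields like the following; this schema is hypothetical, not Visum's actual format:

```scala
// Hypothetical shape of one waste-report entry; all field names are illustrative.
case class WasteReportEntry(
  issue: String,                    // detected 'bad pattern', e.g. "idle-executors"
  estimatedWasteUsdPerYear: Double, // dollar figure that justifies prioritization
  sourceRef: String,                // file:line where the issue originated
  knowledgeBaseUrl: String          // article explaining the problem and the fix
)
```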
The second is a Benchmark Report, which contains a performance analysis per method.
*Figure: Visum Benchmark Report*