Making Data Science Fast and Scalable with Apache Spark

  • Speaker : Nuno Diegues
  • Length : 30 mins
  • Language : English

Description

Most businesses now deal with a large stream of incoming data, which forces them to cope with an ever-growing scale while extracting insights from that data as fast as possible. Apache Spark is becoming a de facto solution to that challenge: at its core lies a distributed, fault-tolerant computing system that lets you express many types of computations (map-reduce, data mining, machine learning, streaming, ...) through its various APIs. At Feedzai we face exactly that challenge, as we process large amounts of data to power our Machine Learning systems and automatically prevent fraud across transactions amounting to billions of euros. In this talk we will tell you all about our Spark deployment and how we use it to empower our Data Science pipelines to extract meaning from the large amounts of streaming data arriving at our systems. We will talk about Spark's low-level APIs, which we use to generate Spark programs on the fly that scale to process large amounts of data in parallel. Along the way, we will delve into the challenges involved, such as correctly partitioning the work, optimizing Spark shuffles, reducing serialization costs, deploying to large clusters, and the other difficulties you would face when putting Spark to use under high performance requirements.

Speaker Bio

I am a Software Engineer at Feedzai, working both on real-time systems that process data really fast and help our clients detect patterns in it (say, fraud detection) and on off-line jobs that process large amounts of data for Data Science pipelines. Of course, this means working with lots of interesting technology and getting various types of distributed systems to work in practice. Before this, I obtained a Ph.D. degree from Instituto Superior Tecnico (hopefully this will be official by October!), where I also worked on distributed systems, making them faster yet simpler to use.

Links

Click here to see the full calendar and pick your favorite talks