MesosCon18 has ended
Welcome to MesosCon 2018, which will be held at The Village (969 Market St, San Francisco) on November 5-7, bringing together users and developers to share and learn about the project and its growing ecosystem.




Monday, November 5
 

8:00am

Registration / Breakfast
Monday November 5, 2018 8:00am - 8:50am
Break Area

9:00am

Keynote: Opening
Monday November 5, 2018 9:00am - 9:10am
A
  • Host Organization Yelp

9:10am

Keynote: Working Group Updates
Monday November 5, 2018 9:10am - 9:40am
A
  • Host Organization Yelp

9:40am

Keynote: Adapting Mesos to Your Needs: A Migration Story
Speakers

Qui Nguyen

Software Engineer, Yelp
Qui Nguyen is a software engineer at Yelp on Distributed Systems Compute, the team building the cluster infrastructure on top of Apache Mesos that supports Yelp's large service-oriented architecture. She has previously spoken about stream processing for Yelp ads at PyCon 2017 and...


Monday November 5, 2018 9:40am - 10:00am
A
  • Host Organization Yelp

10:00am

Break
Monday November 5, 2018 10:00am - 10:30am
Break Area

10:30am

Working Group Meeting: Containerization
In-person Mesos Containerization Working Group meeting.

Check out the working group notes.

Monday November 5, 2018 10:30am - 11:15am
A

10:30am

Working Group Meeting: Performance
In-person Mesos Performance Working Group meeting.

Check out the working group notes.

Slides can be found here.

Speakers

Benjamin Mahler

Software Engineer, Mesosphere
Benjamin Mahler is a committer and PMC member of Apache Mesos and has been working on Mesos since 2012. Benjamin now works at Mesosphere as a technical lead and has given Mesos-related talks at several conferences and companies. His interests include distributed systems, fault tolerance...


Monday November 5, 2018 10:30am - 11:15am
B

11:15am

Working Group Meeting: API
In-person Mesos API Working Group meeting.

Check out the working group notes.

Monday November 5, 2018 11:15am - 12:00pm
B

11:15am

Working Group Meeting: Community & Operations
In-person Mesos Community & Operations Working Groups meeting.

Check out the working group notes.

Monday November 5, 2018 11:15am - 12:00pm
A

12:00pm

Lunch
Monday November 5, 2018 12:00pm - 1:00pm
Break Area

1:00pm

Scaling Mesos to thousands of frameworks
One of the strengths of Mesos is the ability to simultaneously run diverse schedulers. We recently had a use case that required running thousands of instances of different frameworks (Marathon, Cassandra, Spark, Jenkins) on a single cluster. While it is well understood that Mesos clusters can scale to tens of thousands of agents, running thousands of frameworks on them is still uncharted territory.

We set out to do scale tests to explore this territory. In this talk, we will present the treasures that we found and dragons encountered along the way.

We will describe the tooling we developed to monitor and execute the tests, the challenges we faced on the allocator such as offer starvation and fragmentation, the inevitable performance problems and surprising behaviours.

We will share what we learned, the things we fixed, and some of the best practices that we recommend.

Speakers

Gastón Kleiman

Staff Software Engineer, Mesosphere
Gastón Kleiman, Apache Mesos PMC/Committer, is a Staff Software Engineer at Mesosphere. He fell in love with distributed systems and infrastructure automation while contracting for Google, where he got to use Borg, MapReduce and other cool technology. That led him to work at Amazon...

Meng Zhu

Software Engineer, Mesosphere Inc.
Meng Zhu is an Apache Mesos committer and PMC member at Mesosphere who primarily works on resource allocation in Mesos. Previously, he received his PhD in Computer Engineering from the University of Rochester, where he worked on operating system resource management.



Monday November 5, 2018 1:00pm - 1:40pm
A

1:00pm

Shipping Reliably at Scale
Apache Mesos and Apache Aurora enable engineering teams at Twitter to run production services at scale, without the pain associated with traditional management of bare-metal hardware. Within Ads Serving, we face the interesting challenge of running (deploying, operating, staffing oncall) production systems that contain code from many teams. Others depend on us for shipping their feature, and we carry the pager for their code in addition to ours.

Historically, we have invested in release testing tooling to help provide signal on the viability of a future deployment. While this was helpful, it ultimately had gaps in coverage that resulted in failed canaries and delayed deployments. One example is that our ML based ads serving system uses a feedback control mechanism to trade off ad quality for service availability. This introduces additional challenges to validate the health of a change list or a release candidate.

In this talk, we will discuss the multi-faceted problems we were facing, our design for tackling these challenges, and the results we have observed since going live. We will cover Aurora-based solutions for automated load-testing of code reviews (each diff, before merge), release candidate load testing (multiple diffs in a deployment), canary analysis, and deployment of multiple logical clusters across multiple datacenters … all with two clicks :)

Speakers

Brian Brophy

Staff Site Reliability Engineer, Twitter
Brian Brophy is a Staff Site Reliability Engineer at Twitter with a passion for music, puzzles, security, automation, performance, and scale.

Jianhang Gao

Software Engineer, Twitter
Jianhang Gao is a Staff Software Engineer who works at Twitter focusing on Adserver architecture and performance. Jianhang holds a Ph.D. in Electrical and Computer Engineering from UC Davis.



Monday November 5, 2018 1:00pm - 1:40pm
B

1:50pm

Moving toward network isolated containers in Mesos
Applications on a shared host compete for access to resources. Network resources are particularly sensitive to such contention and, if not managed properly, can lead to severe performance degradation for applications. It is very easy for one noisy neighbor to impact the observed response time of other applications.

At Criteo, we are running some network-intensive, business-critical applications on our Mesos clusters.

We chose to develop a custom solution to offer our end users both network isolation and network bandwidth as a first class resource handled by Mesos and its schedulers.

This talk presents our journey so far: our approach to adding a new resource offered to all users of our frameworks, a deep dive into our solution, and the challenges we faced. It offers lessons learned from both Mesos ecosystem development and operations.

Speakers

Frederic Boismenu

SRE, Criteo
Coming from quantitative finance, with over 10 years experience developing and operating custom software managing market data streaming and storage, Frederic Boismenu is now focused on SRE things such as operating Mesos clusters at Criteo.

Clément Michaud

Site Reliability Engineer, Criteo



Monday November 5, 2018 1:50pm - 2:30pm
B

1:50pm

Peloton - Colocating Mixed Workloads on Mesos with Unified Resource Scheduling
With the increasing scale of Uber’s business, efficient use of cluster resources is important to reduce the cost per trip. As we have learned when operating Mesos clusters in production, it is a challenge to overcommit resources for latency-sensitive services due to their large spread of resource usage patterns. Uber also has significant demand on running large-scale batch jobs for marketplace intelligence, fraud detection, maps, self-driving vehicles etc.

In this talk, we will present Peloton, a Unified Resource Scheduler for collocating heterogeneous workloads in shared Mesos clusters. The goal of Peloton is to manage compute resources more efficiently while providing hierarchical max-min fairness guarantees for different teams. Peloton schedules large-scale batch jobs with millions of tasks and also supports distributed TensorFlow jobs with thousands of GPUs.
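As a rough illustration of the max-min fairness idea mentioned above, the following Python sketch divides a single resource among teams by "water-filling": small demands are satisfied in full, and whatever is left is split equally among the rest. This is not Peloton's implementation, and it is flattened to one level rather than a hierarchy; the function name `max_min_fair` and the team names are hypothetical.

```python
def max_min_fair(capacity, demands):
    """Max-min fair split of one resource among users (single level)."""
    allocation = {name: 0.0 for name in demands}
    remaining = float(capacity)
    # Users that still want more than they have been given so far.
    unsatisfied = {name: float(d) for name, d in demands.items() if d > 0}

    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        # Anyone whose demand fits inside an equal share is fully satisfied.
        satisfied = [n for n, d in unsatisfied.items() if d <= share]
        if not satisfied:
            # Nobody fits: everyone left gets the same equal share.
            for n in unsatisfied:
                allocation[n] += share
            remaining = 0.0
            break
        for n in satisfied:
            allocation[n] += unsatisfied[n]
            remaining -= unsatisfied[n]
            del unsatisfied[n]
    return allocation


# Hypothetical teams sharing 10 units of a resource.
print(max_min_fair(10, {"maps": 2, "fraud": 4, "ml": 10}))
```

A hierarchical variant would apply the same split recursively, first across organizations and then across the teams inside each one.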

Speakers

Mayank Bansal

Staff Engineer, Uber
Mayank Bansal currently works as a Sr. Engineer at Uber on the data infrastructure team. He is an Apache Hadoop committer and an Oozie PMC member and committer. Previously, he worked at eBay on the Hadoop platform team, leading the YARN and MapReduce effort. Prior to that he was working at Yahoo and...

Min Cai

Sr. Staff Engineer, Uber
Min Cai is a Staff Engineer on the Compute Platform team at Uber working on all-active datacenters, cluster management and micro-service deployment systems. He received his Ph.D. degree in Computer Science from Univ. of Southern California. Before joining Uber, he was a Sr. Staff Engineer...


Monday November 5, 2018 1:50pm - 2:30pm
A
  • Host Organization Uber

2:40pm

Autoscaling Mesos for Spark Workloads
Speakers

Stuart Elston

Software Engineer, Yelp
Stu Elston is a software engineer at Yelp on the Distributed Systems-Compute team, which maintains infrastructure to provide resources for running arbitrary workloads on Yelp’s Mesos clusters. He has previously worked in data infrastructure and backend development roles at Yelp...

Huadong Liu

Software Engineer, Yelp
Huadong Liu is a software engineer on the Distributed Systems Compute team at Yelp. Previously, Huadong was a software engineer at NetApp, where he worked on high performance clustering infrastructure. Huadong received his PhD in computer science from the University of Tennessee...



Monday November 5, 2018 2:40pm - 3:20pm
B
  • Host Organization Yelp

2:40pm

Designing Network for Multi-Kubernetes on Mesos
Recently, Mesosphere added support for Multi-Kubernetes on Mesos. This enables both Kubernetes and Mesos workloads to run side-by-side on the same cluster. Beyond that, different Kubernetes clusters can run on the same Mesos cluster. This talk focuses on the network design alternatives for achieving this and the challenges and pitfalls of each, and finally details the approach chosen. The talk draws on the implementation in DC/OS; however, the approach is generic enough to apply to other platforms on top of Mesos.

Speakers

Deepak Goel

Sr. Staff Software Engineer, Mesosphere Inc.


Monday November 5, 2018 2:40pm - 3:20pm
A

3:30pm

Kraken: p2p docker image distribution system
The Docker container is a foundational building block of Uber infrastructure, but distributing Docker images to all hosts in a system is a scaling problem we have faced at Uber for some time.

We started with a single Docker registry in 2015 and gradually added replicas and cache layers. However, as Uber grew larger and more complex, the number of hosts and containers grew exponentially. As the deployment system became more automated, rebalances and rollbacks also happened more frequently. The Docker registry couldn’t keep up with the number of containers being deployed, which grew 10x every year to millions of docker pull requests per day. Eventually we decided to take a P2P distribution approach to tackle this problem; the project is called Kraken.

Kraken is a P2P Docker image distribution system. It’s loosely based on the BitTorrent protocol, fully compatible with the Docker registry API, and supports pluggable storage backends like S3, HDFS, etc. It solved the scaling problems we saw under different scenarios and greatly sped up container deployment.

This talk will cover:
- Docker registry scaling challenges
- Architecture of Kraken
- Design and optimization decisions we made
- Production performance numbers

Speakers

Yiran Wang

Senior Software Engineer, Uber
Works on building and distribution of container images at Uber.



Monday November 5, 2018 3:30pm - 4:10pm
B
  • Host Organization Uber

3:30pm

Provisioning storage for stateful services with CSI and Mesos
Statically provisioning and re-configuring dedicated storage resources has proven to be a maintenance burden when running co-located tasks of heterogeneous frameworks atop a single Mesos cluster. Users have long desired a combination of flexible, plug-and-play storage options and self-provisioning frameworks that leverage the resource accounting native to Mesos clusters.

Since its v1.5 release, Mesos has supported the Container Storage Interface (CSI) industry standard that aims to simplify storage driver integration for operators of platforms like Mesos, as well as for storage vendors. Resource Provider (RP) components form the backbone of the vendor resource supply chain within Mesos. They allow for both proper accounting and life cycle management of resources whose control plane is implemented by external code: the CSI plugin. By leveraging recent additions to the Mesos APIs, it is now possible to write frameworks that dynamically self-provision "mount"- and "block"-type disk resources that are fit for the task at hand.

This talk provides an overview of the new storage components and primitives that have shipped in recent Mesos releases and sketches an outline of a self-provisioning framework that uses the newly available APIs.

Speakers

Benjamin Bannier

Senior Software Engineer, Mesosphere
Benjamin is an Apache Mesos committer working at Mesosphere, where he builds out storage support in Mesos after having spent time on ultrafast distributed databases at ParStream and researching the Quark-Gluon Plasma at RHIC in another life as a particle physicist.

James DeFelice

Distributed Applications Engineer, Mesosphere
James is a Tech Lead at Mesosphere, Inc., currently focused on framework development and storage. Before joining Mesosphere, he spent time building on-demand VM provisioning platforms and supporting Mesos users in the wild.

Chun-Hung Hsiao

Senior Software Engineer, Mesosphere
Chun-Hung is an Apache Mesos committer at Mesosphere who primarily works on storage resource providers in Mesos. He received his PhD in computer science from the University of Michigan, where he worked on improving software reliability for event-driven systems.

Jan Schlicht

Software Engineer, Mesosphere
Jan Schlicht is a Software Engineer at Mesosphere, working on storage features. His work included adding CSI support for Apache Mesos.


Monday November 5, 2018 3:30pm - 4:10pm
A

4:20pm

Break
Monday November 5, 2018 4:20pm - 4:25pm
Break Area

4:45pm

Townhall: Container Orchestration
Monday November 5, 2018 4:45pm - 5:30pm
B

4:45pm

Townhall: DC/OS
Monday November 5, 2018 4:45pm - 5:30pm
C

4:45pm

Townhall: Mesos
Monday November 5, 2018 4:45pm - 5:30pm
A
 
Tuesday, November 6
 

8:00am

Registration / Breakfast
Tuesday November 6, 2018 8:00am - 8:50am
Break Area

9:00am

Keynote: Opening
Program Committee

Gastón Kleiman

Staff Software Engineer, Mesosphere
Gastón Kleiman, Apache Mesos PMC/Committer, is a Staff Software Engineer at Mesosphere. He fell in love with distributed systems and infrastructure automation while contracting for Google, where he got to use Borg, MapReduce and other cool technology. That led him to work at Amazon...

Jörg Schad

Technical Lead, Mesosphere


Tuesday November 6, 2018 9:00am - 9:10am
A

9:10am

Keynote: Resource Management for ML Frameworks and Applications
Speakers

Alexey Tumanov

PostDoctoral Researcher, University of California Berkeley
Alexey Tumanov is a PostDoctoral Researcher at the University of California Berkeley, working with Ion Stoica on next generation distributed systems for machine learning applications and frameworks in RISELab. He completed his PhD at Carnegie Mellon University, advised by Greg Ganger...


Tuesday November 6, 2018 9:10am - 9:40am
A

9:40am

Keynote: User Panel
Tuesday November 6, 2018 9:40am - 10:00am
A

10:00am

Break
Tuesday November 6, 2018 10:00am - 10:30am
Break Area

10:30am

Unconference
Tuesday November 6, 2018 10:30am - 12:00pm
A

12:00pm

Lunch
Tuesday November 6, 2018 12:00pm - 1:00pm
Break Area

1:00pm

Load testing Mesos
Yelp has been using Apache Mesos to power its production clusters for four years, and these clusters run at significant scale. We began to wonder how much further we could scale our clusters before Mesos bottlenecked our growth, so we performed a number of load tests to attempt to understand these limits. In this talk, we discuss the results of these load tests; specifically,

* how well the Mesos master APIs perform under heavy load, particularly with a large number of frameworks present
* how well the Mesos containerizer (aka the Universal Container Runtime, or UCR) performs compared to the Docker containerizer
* how the Mesos UCR performs when launching a group of containers, or "pods", using the Container Networking Interface to allocate an IP-per-container

All of our load test results (and the load testing harness) will be made available to the community for further analysis.

Speakers

David R Morrison

Software Engineer, Yelp
David R. Morrison is a tech lead on the Distributed Systems team at Yelp, where he has developed auto-scaling code for Yelp's most expensive clusters. Previously, David was a researcher at Inverse Limit, where he received funding from DARPA and Google's ATAP. David received his PhD...


Tuesday November 6, 2018 1:00pm - 1:40pm
A
  • Host Organization Yelp

1:00pm

Secrets for services: a story about Mesos and Vault
- Details of a production-tested secret distribution mechanism for Mesos that is framework agnostic
- A rundown of the pros and cons of other ways of distributing secrets on Mesos
- A discussion on reliability and how we can make sure that our tasks keep running if our secret store goes down
- A discussion about the Mesos modules architecture and ideas for how to make it better

Speakers

Matthew Mead-Briggs

Software Engineer, Yelp
Matthew Mead-Briggs is currently a Software Engineer with Yelp in London. He works on the Distributed Systems team building compute platforms for Engineers at Yelp. Matt is a long-suffering infrastructure engineer with a thirst for code. He has worked on many cloud and infrastructure...


Tuesday November 6, 2018 1:00pm - 1:40pm
B
  • Host Organization Yelp

1:50pm

Automating large scale cluster management
At Uber, we manage a Mesos fleet of tens of thousands of hosts, and as we scale up to span multiple datacenters and cloud platforms, we've developed a system called CLM (Cluster Lifecycle Manager) to automatically manage host maintenance, cluster operations, and infra upgrades without impacting running services' SLAs. CLM serves as a missing layer between service orchestration and host / resource management. By using an extensible system to gather issues on our hosts from multiple sources, such as hardware failures or misconfigurations, we are able to repair or remove those hosts before they impact production. We use goalstate config to specify the expected state of clusters and automatically converge on that, while integrating with the orchestration layer to ensure we operate safely without causing any disruption to service health.
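The goal-state idea described above can be sketched in a few lines of Python: compare the declared state of each host with what is actually observed and list the changes needed to converge. This is only an illustration under assumed data shapes, not how CLM works; the function name `plan_convergence`, the host names, and the version strings are all hypothetical.

```python
def plan_convergence(goal, actual):
    """Return (host, current, desired) steps needed to reach the goal state."""
    steps = []
    for host in sorted(goal):
        current = actual.get(host)  # None means the host has not reported yet
        if current != goal[host]:
            steps.append((host, current, goal[host]))
    return steps


# Hypothetical goal state: every host should run the same agent build.
goal = {"host-1": "agent-1.7", "host-2": "agent-1.7", "host-3": "agent-1.7"}
actual = {"host-1": "agent-1.7", "host-2": "agent-1.6"}

for step in plan_convergence(goal, actual):
    print(step)
```

A system of the kind the abstract describes would also consult the orchestration layer before acting on each step, so that service health is not disrupted; the sketch leaves that check out.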

Speakers

Iain Becker

Staff Software Engineer, Uber
Iain is a Staff engineer at Uber working on cluster management automation. Before Uber he worked on deployment and test automation for search infra at Facebook, and search infra at Google.

Yunpeng Liu

Sr Software Engineer, Uber



Tuesday November 6, 2018 1:50pm - 2:30pm
B
  • Host Organization Uber

1:50pm

Next Journey Mesos Containerization
Mesos has its own container runtime (the Universal Container Runtime, or UCR), and companies have been using it in production for years. Built on plain Linux kernel namespaces and cgroups, it supports different container image formats and offers extensible container storage, networking, and security. Today it supports major container standards such as CSI and CNI, with OCI support coming soon.

In the past year, Mesos containerization has seen many accomplishments: Linux ambient capabilities, file-based secrets with the new secret resolver module, cgroup blkio isolation, automatic container image GC, standalone containers, Windows support improvements, persistent volume resizing, and many other new features. In this talk, we will review these achievements and discuss the future of Mesos containerization.

Speakers

Gilbert Song

Staff Software Engineer, Mesosphere
Gilbert Song, Apache Mesos PMC/Committer, is a Tech Lead at Mesosphere. He has been contributing to Mesos for years and mainly focuses on Mesos Containerization. He holds a Master’s degree in Computer Engineering from University of California, Santa Barbara. He is passionate about...


Tuesday November 6, 2018 1:50pm - 2:30pm
A

2:40pm

Mesos on Windows
This talk covers the last year of work porting Mesos to Windows, particularly with improvements to stout, libprocess, the containerizer, and the inclusion of libarchive (for Linux too!). We'll be talking about challenges we faced and solutions we found, as well as what the Windows port needs next.

Speakers

Akash Gupta

Software Engineer, Microsoft

Program Committee

Andrew Schwartzmeyer

Software Engineer, Microsoft
Andrew Schwartzmeyer is a cross-platform software engineer and open-source evangelist at Microsoft, where he leads the effort to bring full Windows support to Mesos. Andrew has been an open-source contributor since his college days, and previously brought PowerShell to Linux. In his...



Tuesday November 6, 2018 2:40pm - 3:20pm
A

2:40pm

Monitoring Apache Mesos with Telegraf and Prometheus
Monitoring the workload on a Mesos cluster is a challenge for which no industry standard exists. There are several distinct problems to solve, specific to Mesos.

Firstly, the sheer scale of possible metrics. Mesos enables and encourages microservice architectures. More services means more metrics - and a scale issue.

Secondly, metric identity resolution. This means understanding the relationships between metrics from nodes, frameworks, executors, tasks and containers.

Lastly, cardinality. We must tag metrics with metadata for context, but each combination of label and tags increases the cardinality of the dataset.

We will outline each problem, round up the state of metrics with regard to orchestrated environments, and present a reference solution using Telegraf and Prometheus. All tools (and our own libraries) are free, open source and will be available immediately for use with Mesos.
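To make the cardinality point concrete, here is a small Python sketch (not taken from the talk) that computes an upper bound on the number of distinct time series for one metric as the product of the number of distinct values per label. The function name `series_upper_bound`, the label names, and the counts are invented for illustration.

```python
from math import prod


def series_upper_bound(label_values):
    """Upper bound on distinct series for one metric name.

    label_values maps each label to the set of values it can take; every
    combination of values is a potentially distinct time series.
    """
    return prod(len(values) for values in label_values.values())


# Hypothetical label sets for a single metric such as CPU usage.
labels = {
    "framework": {"marathon", "spark", "cassandra"},
    "agent": {f"agent-{i}" for i in range(100)},
    "task_state": {"running", "staging", "killed", "failed"},
}
print(series_upper_bound(labels))  # 3 * 100 * 4 combinations
```

Adding one more label with ten possible values would multiply the bound by ten, which is why each extra tag has to earn its place.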

Speakers

Philip Norman

Senior Software Engineer, Mesosphere
Originally from London, Philip worked in Germany before moving to San Francisco. He leads the observability team at Mesosphere and is passionate about metrics, monitoring, and open source software. He likes to ride bikes.



Tuesday November 6, 2018 2:40pm - 3:20pm
B

3:30pm

Multi-tenant Spark workflows in Auto Scalable Mesos clusters
Recommendation algorithms have been the core of the Netflix product from very early on. Because of their importance, we continually seek to run our machine learning workflows in a reliable, scalable and robust manner.
We will present our design choices on building a Mesos-centric multi-tenant architecture for running Spark-based machine learning workflows that power the algorithms behind Netflix recommendations. We will also share our experience using the auto-scaling capabilities of Amazon Web Services to dynamically change the size of our clusters to support the allocation of thousands of Spark jobs running daily. We will discuss how we are leveraging Apache Spark to deploy batch jobs as well as the interactive use of Zeppelin Notebooks efficiently in this shared environment.
We will cover a few aspects of this multi-tenant platform, such as the Spark scheduler for Mesos, dynamic resource allocation, metrics and dashboards, and Spark history logs.

Speakers

Pablo Delgado

Machine Learning Engineer, Netflix
Pablo Delgado is a Senior Software Engineer who works on building machine learning infrastructure for Personalized Recommendation Algorithms at Netflix. Previously he worked on the recommendation systems stack for personal restaurant recommendations at OpenTable. Pablo obtained...



Tuesday November 6, 2018 3:30pm - 4:10pm
A

3:30pm

Using and extending the new Mesos CLI
The Command Line Interface (CLI) of Apache Mesos has recently been rewritten in Python 3. Aside from the refactoring, the CLI offers new features for developers and operators. In this talk, we will see:
- How to configure the CLI to work with your cluster.
- How to use the main CLI commands, `mesos task attach` and `mesos task exec`.
- How to extend the CLI by developing your own plugins.
After this talk, you will be capable of monitoring your clusters using this new interface and extending this component to make it work with your workflow.

Speakers

Armand Grillet

Software Engineer, Mesosphere
Armand Grillet is a software engineer at Mesosphere currently working on the DC/OS CLI, after a year operating clusters deployed to test the scalability of Apache Mesos and DC/OS. He has contributed to the Apache Mesos project since 2016 by working on components used by developers...


Tuesday November 6, 2018 3:30pm - 4:10pm
B

4:20pm

Break
Tuesday November 6, 2018 4:20pm - 4:25pm
Break Area

4:45pm

AMA Panel/Wrap up
Tuesday November 6, 2018 4:45pm - 5:30pm
A
 
Wednesday, November 7
 

9:00am

Hackathon
Wednesday November 7, 2018 9:00am - 3:00pm
Mesosphere HQ 225 Bush St, 7th floor, San Francisco, CA 94104