Unveiling the Data Magic
Roles in Data Science Teams
Summary
This video uses Netflix's data-driven evolution to show how data science reshapes modern businesses. Drawing on insights from Gibson Biddle, a former VP and chief product officer at Netflix, it recounts the shift from star ratings to a thumbs-up, thumbs-down system with a percentage match, a change prompted by observed consumer behavior. It emphasizes the role of data in decision-making and the parts that data engineers, data scientists, and machine learning models play in understanding and predicting consumer patterns. It also covers the infrastructure required to manage and analyze very large data volumes, and how machine learning is embedded into regular operations to produce forecasts and insights, a trend not exclusive to Netflix but common across major organizations.
Highlights
- Netflix's shift to thumbs-up, thumbs-down ratings based on consumer data was revolutionary. 👍
- Gibson Biddle shares insights on how this change improved user satisfaction. 👥
- Data analysis isn't just for show; it's a critical component of Netflix's strategy. 🚀
- Netflix processes 700 billion events daily, showcasing the power of data collection. 🔢
- ETL - Extract, Transform, Load - is at the heart of making data usable. 💾
- Machine learning models predict everything from Uber arrivals to movie recommendations. 📈
- The roles within data teams are becoming more specialized and sophisticated. 🛠️
- The future might see the merging of data science into regular software development roles. 🏗️
Key Takeaways
- Netflix's rating evolution shows the power of simplifying data presentation. 🌟
- Data-driven cultures aren't just buzzwords; they are the backbone of companies like Netflix. 📊
- From 700 billion events to business insights, data pipelines are crucial. 🔄
- ETL processes transform raw data, making it meaningful and actionable. 🔄
- Data roles are ever-evolving, from engineers to data scientists, each playing a vital part. 🧑‍🔬
- Machine learning models keep Netflix and others ahead in customer satisfaction. 🤖
- The future hints at integrating data science as a core aspect of software development. 🔮
Overview
The journey explores how Netflix revolutionized movie rating systems, moving from the traditional star format to a more consumer-friendly percentage match system. This change came about after deep-diving into customer behaviors, showing us how data can redefine experiences. Gibson Biddle helps unravel these insights, illustrating the tangible impact of data decisions on enhancing viewer satisfaction.
Through the lens of data infrastructure, the video elucidates how Netflix and other enterprises harness data pipelines and ETL processes to distill over 700 billion daily events into actionable insights. The meticulous architecture behind these operations ensures that businesses can leverage data efficiently to inform everything from content recommendations to strategic decisions.
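The extract, transform, load sequence mentioned above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample events, the `events` table, and the use of an in-memory SQLite database as a stand-in for a data warehouse. It does not reflect how Netflix actually implements its pipelines.

```python
import sqlite3
from datetime import datetime

# Hypothetical raw events, as they might arrive from their origin.
RAW_EVENTS = [
    {"user_id": "u1", "event": "click", "ts": "2020-05-01T10:00:00"},
    {"user_id": "u2", "event": "PAUSE", "ts": "2020-05-01T10:00:05"},
    {"user_id": None, "event": "click", "ts": "2020-05-01T10:00:07"},  # gap: no user
    {"user_id": "u3", "event": "click", "ts": "not-a-timestamp"},      # error: bad time
]


def extract(source):
    """Extract: copy raw records from their origin into a staging list."""
    return list(source)


def transform(staged):
    """Transform: remove erroneous records, fill gaps, normalize formats."""
    clean = []
    for rec in staged:
        try:
            ts = datetime.fromisoformat(rec["ts"])
        except ValueError:
            continue  # drop records whose timestamp cannot be parsed
        user_id = rec["user_id"] or "unknown"  # fill in a missing user
        clean.append((user_id, rec["event"].lower(), ts.isoformat()))
    return clean


def load(rows, conn):
    """Load: push prepared rows into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id TEXT, event TEXT, ts TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(source, conn):
    load(transform(extract(source)), conn)


# In-memory SQLite plays the role of the warehouse in this sketch.
warehouse = sqlite3.connect(":memory:")
run_etl(RAW_EVENTS, warehouse)

# A BI-style query against the structured data.
counts = warehouse.execute(
    "SELECT event, COUNT(*) FROM events GROUP BY event ORDER BY event"
).fetchall()
print(counts)
```

Keeping the three steps as separate functions mirrors the staging-area and warehouse split: the raw records stay untouched until `transform` has cleaned them, and only structured rows reach the table that reports query.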
Looking ahead, the landscape of data roles is shifting. As automation increases, particularly in machine learning model deployment, the amalgamation of data expertise into broader software development practices seems imminent. This evolution is not merely a trend but a testament to data's growing omnipresence in shaping future business strategies and technological advancements.
Chapters
- 00:00 - 01:30: Introduction and Netflix's Rating System In 2017, Netflix transitioned from a five-star rating system to a thumbs-up, thumbs-down system, which led to dissatisfaction among users who found this new binary method reductive. The new system was supposed to base recommendations on a match percentage. However, Netflix discovered that users often rated movies highly for their perceived quality rather than personal enjoyment.
- 01:30 - 03:00: Data-Driven Organizations The chapter delves into the way data analysis functions within organizations, using Netflix as a prime example.
- 03:00 - 05:00: Data Infrastructure and Pipelines The chapter discusses the importance of data infrastructure and pipelines in an organization like Netflix. It references an anecdote about Adam Sandler's movies and 'Schindler's List' to illustrate how Netflix gauges viewer enjoyment and satisfaction through a streamlined feedback system. This system is designed to enhance user experience by avoiding bias in ratings. The chapter emphasizes that these insights into customer preferences are made possible by both a strong data culture and robust data infrastructure.
- 05:00 - 07:00: Data Warehouses and Business Intelligence The chapter discusses the concept of a data-driven organization, using Netflix as a prime example. It explains how Netflix records over 700 billion events daily, capturing user interactions such as logins, clicks, pauses, and subtitle activations. The data collected is made accessible to thousands within the organization via visualization tools like Tableau and Jupyter, illustrating how data warehouses facilitate business intelligence.
- 07:00 - 10:00: Roles in Data Engineering and Data Science The chapter discusses how businesses, including major ones like Netflix, are leveraging big data and artificial intelligence to make informed decisions. It highlights the importance of data engineering and data science in creating environments where users can generate reports, query information, and subsequently use this data for business strategies, from simple decisions like choosing thumbnails to significant investments like selecting new shows for Netflix. It mentions that, according to some estimates, about 97% of Fortune 1000 companies are investing in data initiatives.
- 10:00 - 13:00: Machine Learning and Production Models The chapter delves into the role of data infrastructure technology and the engineers behind it. It draws an analogy between data infrastructure and physical pipelines, where data pipelines have origins (like user interactions) and destinations, mimicking the flow in gas or liquid pipelines.
- 13:00 - 15:00: The Future of Data Science The chapter discusses the omnipresence of data generation in today's world, noting that it's more challenging to identify what doesn't produce data compared to what does. It mentions sources of data like vehicle tracking devices and turbine vibration sensors in power plants. It highlights that even the absence of data can be informative. Furthermore, it explains the initial process of data handling where generated data travels to a staging area, serving as a repository for raw data. Before usage, this raw data requires processing to remove errors, fill in gaps, and make necessary alterations.
Roles in Data Science Teams Transcription
- 00:00 - 00:30 [Music] in 2017 Netflix changed its five-star rating system to a simple thumbs-up thumbs-down now the service was recommending movies based on the match percentage and people hated it how can we reduce all the nuance that lives in cinematic art to a primitive binary reaction in reality what Netflix found was that people were giving high rates to those movies that they believed were good not necessarily those they've really enjoyed watching at least that's
- 00:30 - 01:00 what the data said so how does data analysis work in organizations like Netflix and what are the roles in data science teams this is Gibson Biddle a former VP and chief product officer at Netflix when talking about consumer insights he explained an unexpected customer behavior that led to changing the whole rating system in shifting to percentage match Netflix acknowledged that while you may rate a leave your
- 01:00 - 01:30 brains at the door Adam Sandler comedy only three stars you enjoy watching it and as much as you feel good about watching a Schindler's List and give it five stars it doesn't increase your overall enjoyment and keeping subscribers entertained is kind of critical for Netflix so they simplified the feedback system to avoid bias but these insights into customers are impressive by themselves and they wouldn't be possible without two things the culture that fosters the use of data and a powerful data infrastructure in
- 01:30 - 02:00 tech jargon it's called a data-driven organization you've likely heard this buzz phrase hundreds of times but what does it really mean Netflix alone records more than 700 billion events every day from logins and clicks on movie thumbnails to pausing the video and turning on subtitles all this data is available to thousands of users inside the organization anyone can access it using visualization tools like Tableau or Jupyter or they can get to it
- 02:00 - 02:30 via a big data portal an environment that lets users check reports generate them or query any information they need then this data is used to make business decisions from smaller ones like which thumbnails to show you to really serious ones like in which shows should Netflix invest next but Netflix isn't alone according to some estimates about 97 percent of Fortune 1000 businesses invest in data initiatives including artificial intelligence and big data buzzwords again but let's have a look at
- 02:30 - 03:00 the real data infrastructure technology and data engineers that make it work to describe how data infrastructure works technicians borrowed the term from liquid and gas transportation similar to physical pipelines data pipelines have their own origins destinations and intermediate stations so it's a pretty apt metaphor the origin of data may be anything from clicks on a reserve button and pull-to-refresh to conversation records with customer
- 03:00 - 03:30 support from vehicle tracking devices to turbine vibration sensors on power plants in today's world it's actually harder to say what cannot generate data rather than what can even no data can tell us something once the data item is generated it travels down its pipe to a staging area right here this is the place where all raw data is kept raw data isn't yet ready to be used it must be prepared you have to remove the errors from it fill in the gaps change
- 03:30 - 04:00 its format or merge data from different sources to get a more nuanced view as soon as these operations are done the data now structured and clean can continue on its journey all these operations happen automatically they are described in three words extract extracting data from its origin and getting it to a staging area transform preparing data for use and load push prepared data further ETL for short all prepared data falls into another storage a data warehouse
- 04:00 - 04:30 unlike the staging area a warehouse is a place where all stored records are structured and prepared for use just like in the library with its classification system finally you can query visualize and download information from a warehouse to do that you must have business intelligence or BI software it presents data to final users data analysts and business analysts who carry out essential tasks they access data explore it visualize it and try to make
- 04:30 - 05:00 business sense of it did our marketing campaign work out well what's our worst performing channel they act like a sensory system supporting an organization with historical data and getting insights to management and ultimately anyone who makes decisions okay who's in charge of building this whole pipeline traditionally these specialists are called data engineers mostly tech people adept at what's known as plumbing moving data from its origins to destinations across the pipeline and
- 05:00 - 05:30 transforming it on the way they design pipeline architecture set up ETL processes configure the warehouse and connect it with reporting tools Airbnb for instance has about 50 data engineers sometimes you might encounter a more granular approach with several extra roles involved data quality engineers for instance make sure that data is captured and transformed correctly having biased or incorrect data is too expensive when trying to derive decisions from it there may be a
- 05:30 - 06:00 separate engineer responsible for ETL only and also a business intelligence developer focusing solely on integrating reporting and visualization tools however reporting tools don't make headlines and a data engineer wasn't called the sexiest job of the 21st century but machine learning does and a data scientist was what everybody knows is that data science is particularly good at taking data and answering complex questions about it how much will
- 06:00 - 06:30 the company earn in the next quarter how soon will your Uber driver arrive how likely is it that you'll enjoy Schindler's List the same as Uncut Gems there are actually two ways of answering such questions data scientists make use of BI tools and warehouse data as business analysts and data analysts do so they would sit here and get the data from the warehouse sometimes data scientists would use a data lake another type of storage that keeps unstructured raw data they'll create a predictive
- 06:30 - 07:00 model and suggest a forecast that will be used by management one time reporting and it works for revenue estimates but it doesn't help with predicting the Uber arrival time the real value of machine learning is production models those that work automatically and generate answers to complex questions regularly sometimes thousands of times per second and things are much more complicated with them to make the model work you also need an infrastructure sometimes a big one have
- 07:00 - 07:30 a look at this dramatic image not in the way most people consider the meaning of this word obviously but for data scientists it really is dramatic notice this tiny box in the middle which says yeah let's zoom it in please it says ml code the paper is called hidden technical debt in machine learning systems by Google engineers and the image compares the amount of machine learning code to the rest of the systems that make machine learning code useful without them this tiny box however
- 07:30 - 08:00 brilliant it may be is a relatively small piece of code in Python or in Java but it's actually pretty hard to arrive at this model data scientists explore data from warehouses and lakes experiment with it choose algorithms and train models to come up with the final ML code it takes a deep understanding of statistics databases machine learning algorithms and a subject field in his famous tweet Josh Wills former head of data engineering at Slack said that a data scientist is the person who is
- 08:00 - 08:30 better at statistics than any software engineer and better at software engineering than any statistician what about the rest of those boxes okay imagine yourself isolating and ordering food at Uber Eats once you confirm your order the app must estimate the time of delivery your phone sends location restaurant and order data to a server with a delivery prediction ML model deployed but this data isn't enough the model also gets additional data from a separate database that contains say an
- 08:30 - 09:00 average time for your restaurant to prepare a meal and a wealth of other details once all the data is here the model returns a prediction to you but the process doesn't stop there the prediction itself gets saved in a separate database your delivery person shows up and the real time of arrival will also be captured to record the ground truth the model performance against it and explore the model via analysis tools to update it later and all this data will eventually appear in a data lake and a warehouse in reality
- 09:00 - 09:30 the Uber Eats service alone uses hundreds of different models working simultaneously to score recommendations search rankings of restaurants and estimate delivery time if you have that level of complexity you also need a clever system to update or retire models as well as prioritize some models over others to manage computing resources that's a lot to process usually this job falls on the shoulders of data engineers or machine learning engineers ML engineers take charge of the production side of things they aren't as much into statistics and
- 09:30 - 10:00 subject matter as data scientists but they know how to configure production models automate extraction of specific data from multiple sources and verify data quality before use finally if you run machine learning with hundreds of models deployed you need a data architect role to make the work of the whole data platform consistent this person would be responsible for the platform itself and its capabilities rather than how specific models solve real-life problems these six roles are those you frequently meet today but
- 10:00 - 10:30 things will be changing in the future look at how people imagined our time in 1982 if you ever glance out of a window in 2019 when Blade Runner takes place you didn't see the dystopian architecture flying cars or multi-story commercial holograms in fact the real future looks like this or this or even like this you can't touch data you'll have a hard time explaining what data means but that's what defines the real future we're living in today and data
- 10:30 - 11:00 science and business intelligence will soon be taken for granted Adam Waxman head of core technology at Foursquare believes there won't be data scientists or ML engineers anymore since we'll keep automating model training and building production environments much of the data science work will become a common function inside software development thank you for watching if data is what you deal with every day tell us more about your work in the comment section below you may also send meaningful signals to
- 11:00 - 11:30 YouTube's machine learning algorithms if you liked the video and want to see more
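The delivery-time flow described in the transcript (roughly 08:00 to 09:00) can be sketched in Python: a prediction is made from the order plus extra data from a separate store, the prediction is saved, and the real arrival time is recorded later as ground truth to measure the model. This is a hedged illustration only. The `AVG_PREP_MINUTES` lookup, the `PredictionLog` class, and the fixed three-minutes-per-kilometre rule are hypothetical stand-ins, not Uber's actual system or a trained model.

```python
from dataclasses import dataclass, field

# Hypothetical "separate database": average meal preparation time in minutes.
AVG_PREP_MINUTES = {"pizza_place": 15.0, "sushi_bar": 25.0}


@dataclass
class PredictionLog:
    """Stand-in for the stores that keep predictions and ground truth."""
    predictions: dict = field(default_factory=dict)
    actuals: dict = field(default_factory=dict)

    def mean_absolute_error(self):
        """Compare saved predictions against recorded arrival times."""
        pairs = [
            (self.predictions[order_id], actual)
            for order_id, actual in self.actuals.items()
            if order_id in self.predictions
        ]
        if not pairs:
            return None
        return sum(abs(p - a) for p, a in pairs) / len(pairs)


def predict_delivery_minutes(restaurant, distance_km):
    """Placeholder model: prep time plus an assumed 3 minutes per km."""
    prep = AVG_PREP_MINUTES.get(restaurant, 20.0)
    return prep + 3.0 * distance_km


def handle_order(order_id, restaurant, distance_km, log):
    """Serve a prediction and save it for later evaluation."""
    eta = predict_delivery_minutes(restaurant, distance_km)
    log.predictions[order_id] = eta
    return eta


def record_arrival(order_id, actual_minutes, log):
    """Capture the real arrival time as ground truth."""
    log.actuals[order_id] = actual_minutes


log = PredictionLog()
eta_1 = handle_order("o1", "pizza_place", 2.0, log)
eta_2 = handle_order("o2", "sushi_bar", 1.0, log)
record_arrival("o1", 25.0, log)
record_arrival("o2", 26.0, log)
print(eta_1, eta_2, log.mean_absolute_error())
```

Storing each prediction next to its eventual ground truth is what makes it possible to track model performance over time and decide when a model needs updating.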