PySpark Interview Preparation for 2025

PySpark Interview Questions (2025) | PySpark Real Time Scenarios

    Summary

    Prepare for your PySpark interviews in 2025 with the latest and most comprehensive guide, crafted by Ansh Lamba. This resource covers critical PySpark interview questions with real-time scenarios that align with today's technological advancements. It offers both conceptual and coding challenges, including solutions to handle the latest PySpark features such as Delta Lake time travel, adaptive query execution (AQE), and schema enforcement. Structured interviews will test your practical abilities, ensuring readiness for upcoming tech hiring rounds.

      Highlights

      • Gain a thorough understanding of PySpark architecture, including Spark context and session.
      • Learn how to optimize PySpark performance with advanced techniques like Z-ordering and Delta Lake.
      • Get insights into handling large data sets efficiently using partitioning and broadcast variables.
      • Tackle real-world coding challenges with innovative solutions provided in the video.
      • Join community forums for continuous learning and problem-solving.

      Key Takeaways

      • Unlock the secrets of PySpark with Ansh Lamba's expert guidance 🎓!
      • Navigate through complex PySpark scenarios effortlessly 🚀!
      • Learn to optimize data handling using Delta Lake and AQE in PySpark 💾!
      • Master PySpark interviews by understanding real-world scenarios 📚!
      • Engage with community resources and boost your PySpark skills 🔗!

      Overview

      Ansh Lamba presents an exclusive video dedicated to prepping candidates for PySpark technical interviews slated for 2025. His comprehensive guide not only tackles the frequent conceptual questions but also dives deep into complex coding scenarios, equipping you with a tactical edge over other applicants.

        The video breaks down significant concepts like Delta Lake, which enhances data management and query optimization. Learn to leverage PySpark's capabilities through practical examples, ensuring you adapt to dynamic data environments with ease.

          Emphasizing community learning, Ansh encourages engagement through forums and collaboration. This PySpark preparation journey is about mastering individual skills while contributing to and benefiting from collective knowledge resources.

            Chapters

            • 00:00 - 01:30: Introduction and Explanation The chapter opens with an emphasis on preparing for PySpark interview questions for 2025. It highlights the importance of staying current with the latest conceptual and coding questions for PySpark. Noting that popular coding platforms lack PySpark questions, the speaker promises a solution that bridges this gap for the technical interview rounds of 2025.
            • 01:30 - 03:00: Purpose and Need for Special Practicing The chapter discusses a specially built notebook designed to simulate practice on an interview coding platform. It lets users work through real-time, scenario-based coding questions, each supplied with a sample data set so they can write code and see the output. It also includes the latest conceptual PySpark questions anticipated for 2025, aiming to prepare users effectively for upcoming challenges.
            • 03:00 - 04:30: Challenge with Platform and Solution by Channel The chapter titled 'Challenge with Platform and Solution by Channel' covers the latest updates from the world of PySpark. The speaker describes solutions to potential challenges, aiming to equip listeners with the knowledge needed to answer interview questions confidently. The speaker mentions having spent significant time developing these solutions to provide them for free, encouraging viewers to subscribe and share the video widely.
            • 04:30 - 06:00: Interactive Q&A: Removing Duplicates In this chapter titled 'Interactive Q&A: Removing Duplicates,' the speaker enthusiastically introduces the content, suggesting it will address real questions posed in data-related interviews. The speaker mentions a significant amount of time was spent preparing this video, which they describe as a comprehensive solution to interview preparation. The chapter starts by setting up the context for discussing the architectural aspects relevant to these interview questions, but does not delve into specific details within the given transcript.
            • 06:00 - 07:30: Handling Inconsistent Schemas The chapter focuses on handling inconsistent schemas in the context of a 'PySpark interview questions' video. It seems to elaborate on a solution designed to simulate the experience of answering conceptual questions as if it were on a major, well-known platform, though the name is not specified. The tool or method discussed aims to provide a comprehensive practice environment for users to enhance their skills in answering PySpark-related conceptual interview questions.
            • 07:30 - 09:00: Comparison of Apache Spark and Hadoop MapReduce The chapter discusses the comparison between Apache Spark and Hadoop MapReduce, emphasizing how the difficulty of finding coding questions, data sets, and scenarios to practice on is addressed by Ansh Lamba's notebook. The speaker feels that a major gap in learning has been filled with this resource.
            • 09:00 - 10:30: Handling Missing Values The chapter titled "Handling Missing Values" focuses on solutions for dealing with missing data in datasets. The presenter describes their approach without unnecessary delays, showing the creation of a specialized notebook to address the problem. The discussion mentions using markers, suggesting some interactive or visual elements in the presentation, although the specifics of the solution are not detailed in the transcript.
            • 10:30 - 12:00: Calculating Top Users by Actions The chapter titled 'Calculating Top Users by Actions' seems to focus on practical, real-time scenarios that are similar to questions encountered on coding platforms, particularly those relevant for interview preparation. The speaker mentions having prepared a data frame for the reader to work with, enabling them to practice solving problems using this data frame without difficulty. The aim is to engage with real-time problem-solving, specifically in calculating the top users by their actions.
            • 12:00 - 13:30: Finding Recent Transactions with Window Functions This chapter delves into using window functions to find recent transactions within a data frame. It challenges the reader to apply their knowledge and skills in PySpark to solve problems using the provided data frame. The focus is on applying hands-on coding techniques to manipulate and analyze data, including newer technologies such as Delta Lake and streaming. The chapter encourages practical learning by solving up-to-date questions that reflect the latest industry practices.
            • 13:30 - 15:00: Date-based Filtering and Calculation This chapter covers the topic of date-based filtering and calculating using various aggregation functions. Instead of directly showing questions and code, the focus is on teaching the approach needed to tackle such questions, particularly in the context of job interviews. The chapter emphasizes understanding the methodology rather than just memorization, aiming to equip readers with the skills necessary to confidently handle interview challenges.
            • 15:00 - 16:30: Text Processing: Most Frequent Words The chapter titled 'Text Processing: Most Frequent Words' discusses the human tendency to make mistakes due to anxiety, especially in situations like interviews. It emphasizes the importance of having a calm approach and understanding the logic behind questions. The speaker offers reassurance by suggesting a collaborative learning process, where guidance will be provided on approaching questions and writing code effectively in order to grasp concepts easily.
            • 16:30 - 18:00: Cumulative Sum Calculation This chapter emphasizes the importance of understanding conceptual questions, particularly in the context of interviews. The speaker highlights the need to grasp the concepts thoroughly rather than just memorizing them. While acknowledging the necessity of learning certain terminologies or technical names, the speaker's primary advice is to focus on comprehension for confident answering.
            • 18:00 - 19:30: Unique Use of Row Number for Order Preservation The chapter discusses the difficulty of recalling memorized definitions during interviews when anxious and not fully focused, and the importance of being well-prepared. It notes that the popular practice platforms do not offer PySpark as an option, which makes this content unusual, and describes some other platforms as very expensive.
            • 19:30 - 21:00: Basic Aggregation Question: Average Duration The chapter presents Ansh Lamba's free notebook as a resource for preparing for and succeeding in interviews in 2025. It emphasizes that while paid platforms can be expensive, this resource is a free alternative. The speaker adds that viewers who are already proficient and believe they can solve any PySpark question can work through the questions without his help. The chapter encourages leveraging this resource for effective interview preparation.
            • 21:00 - 22:30: Advanced Aggregation: Sales per Product per Month The chapter discusses advanced aggregation techniques for calculating sales per product per month. The content primarily targets individuals who are not experts but are seeking to improve their skills through practice. The speaker encourages viewers to attempt problem-solving independently but offers solutions for reference if needed. The chapter acknowledges both beginners and those who may already feel confident in their skills, emphasizing the importance of preparation and practice, especially for interviews.
            • 22:30 - 24:00: Spark Context and Spark Session Explanation The chapter titled 'Spark Context and Spark Session Explanation' starts with a discussion about the effort and time it took to understand and prepare the content. The speaker mentions being surprised by the amount of work involved in formulating the best questions and solutions, as well as preparing a comprehensive plan and platform. This introduction suggests that the chapter will delve into the complexities and nuances of Spark Context and Spark Session, likely offering detailed insights and explanations based on thorough preparation and analysis.
            • 24:00 - 25:30: Data Pipeline Optimization The chapter, titled 'Data Pipeline Optimization,' focuses on preparing data frames as part of the data pipeline process. The speaker emphasizes the importance of feeling confident before interviews and expresses a desire to help others by providing valuable resources. Despite a light-hearted moment of humor, the underlying message is serious and aimed at offering support to those seeking quality Spark resources.
            • 25:30 - 27:00: Handling Schema and Corrupt Records This chapter focuses on handling schema and corrupt records, emphasizing the importance of resources and community responsibility in data management. The speaker encourages the audience to share resources, engage with the community by subscribing and commenting, and stresses the happiness derived from community interactions.
            • 27:00 - 28:30: RDD, Data Frame, and Data Set Differences The chapter titled 'RDD, Data Frame, and Data Set Differences' appears to begin with a personal anecdote or casual conversation as indicated by the transcript provided. The actual content related to the chapter's title is not included in the transcript excerpt shared, leaving the explanation or distinctions between RDDs (Resilient Distributed Datasets), DataFrames, and Datasets absent. To generate an accurate summary reflecting the chapter title, further content from the chapter would be needed.
            • 28:30 - 30:00: Understanding Query Optimization The chapter begins with instructions on how to download the provided notebook to follow along with the video. The speaker encourages viewers to run the notebook and fill in the solution boxes as they go along. Additionally, the speaker mentions the creation of a Telegram channel where all video resources will be uploaded for easier access. There's also mention of a secondary Telegram channel designed to offer help to those who might face difficulties. This segment is primarily an introduction to the practical resources available for understanding query optimization within the context of the video.
            • 30:00 - 31:30: The Role of Partitions and Data Handling in Spark The chapter discusses the importance of community support in the context of handling errors and challenges while working on Spark projects. It encourages individuals to participate in community forums to seek help and also contribute by sharing their knowledge to assist others. By doing so, it helps the community and the data domain grow collectively, promoting the sharing of information and cooperative problem-solving in the realm of data processing and Spark.
            • 31:30 - 33:00: Working with Nested and Compressed Data The chapter focuses on the importance of collaboration and community in the IT domain, specifically among developers. It emphasizes the benefits of solving errors not just for oneself but for others as well, which enhances one's debugging skills. The chapter illustrates how participating in community platforms like Telegram channels can be mutually beneficial; by helping others, one improves their own skills, and in times of need, can find support for their own issues.
            • 33:00 - 34:30: Optimized Data Solutions in PySpark The chapter emphasizes the importance of community and collaboration among data professionals. It encourages members, ranging from senior to junior data engineers, to engage in conversations and assist one another in their professional growth. The author highlights the role of sharing resources, such as a PySpark notebook, within the community to facilitate learning and development. Overall, the message is a call to action for community members to communicate and support each other as they advance in their data careers.
            • 34:30 - 36:00: Exploring Specific Functions and their Uses The chapter discusses the importance of community interaction and communication. The speaker emphasizes creating a group specifically for enjoyment and contribution to conversations. They also announce plans to engage with the community by coming live at least once a month for casual chitchats and discussions. The focus is on maintaining a connected and interactive community environment.
            • 36:00 - 37:30: Final Thoughts and Encouragement In the final chapter titled 'Final Thoughts and Encouragement,' the speaker wraps up discussions related to the channel's purpose and interactive offerings. They emphasize the intention to create special mini-projects that can be built within one to two hours during live sessions or cover hot topics worth discussing. These initiatives aim to enhance engagement and learning experiences for the audience. Furthermore, they encourage active participation and contributions from the community members, highlighting that there will be specific links shared in the description for resources and involvement.

            PySpark Interview Questions (2025) | PySpark Real Time Scenarios Transcription

            • 00:00 - 00:30 are you preparing for PySpark interview questions in 2025 and you need latest interview questions for 2025 that consist your all the conceptual PySpark questions plus PySpark coding questions as well to ace the technical interview rounds in 2025 well I have a great solution for you I know that even on the popular coding platforms we do not have questions for PySpark do not feel sad because
            • 00:30 - 01:00 I have built something special for you which will make you feel that you are practicing interview question actually on those platforms where you will have real-time scenario based PySpark coding questions and a sample data set for all those questions so you can actually code and see the output really yes not only this I will also provide all the latest PySpark conceptual questions for 2025
            • 01:00 - 01:30 including all the latest updates in the world of PySpark and I will explain all all these solutions in detail so that you can confidently answer those questions in the interview and just get placed I have spent so much of time to actually create this solution so that you do not need to pay to all those platforms so just hit that subscribe button and just make this video viral
            • 01:30 - 02:00 so buckle up because it's time to actually solve the questions the real questions let's go welcome welcome my data fam to this amazing video so let me tell you let me tell you first I spent so much of time to actually prepare this video it's not just a video it's like your solution to ace your interview so without wasting time I just want to discuss what is the architecture of
            • 02:00 - 02:30 video and what exactly we need to do because Lamba you just mentioned that you have created a kind of solution which will make us feel that we are actually practicing the questions in the platform such as like famous famous platform I won't take name but yeah you know like what I'm talking about so the thing is this PySpark interview questions video this is your one-stop solution to your obviously PySpark conceptual questions obviously
            • 02:30 - 03:00 okay and the main area the major the major the major area that I feel is was missing right now because now it is filled with this Ansh Lamba so whenever I want to practice any PySpark question the coding question the coding round where is the platform where's the question where's the data set where's the scenario where's that it's here it's here so what I did what I did I just tried to create a solution we know Ansh
            • 03:00 - 03:30 just tell us what's that solution okay without wasting time let me just show you what I have done so basically I know this is the kind of notebook that ah wait my escape button just ignore so basically hey wait let me just use marker as well okay now let's feel it okay so first of all what we will do basically I have created a notebook a specialized
            • 03:30 - 04:00 notebook for you which will which will provide you the real-time scenarios like this the same type of scenario that you see on the coding platforms to prepare for the interview rounds okay then then okay we can just find this question but I have prepared the data frame as well so that you can actually use this data frame without any hassle you just need to use your data frame that's it and you need to solve this problem you need to just deal with this real time
            • 04:00 - 04:30 scenario and that's all so how this will work this is your question this is your data frame so there will be a box here the empty box where you need to use your brain use your knowledge use your PySpark skills and you need to actually code it run it and boom you have your platform ready and trust me I have just put all the latest questions all the questions which involve latest technology such as say Delta Lake streaming
            • 04:30 - 05:00 everything everything and some amazing aggregation functions like I can't even like discuss what else I didn't put because everything is available in this video plus I will not be showing you the question and the code directly no you know me so I will be just showing you how you need to approach the question because whenever you sit in the interviews obviously there's a fear of interview okay like you are just giving
            • 05:00 - 05:30 interviews I don't know like it's human nature right because you will be worried about the result so you can make mistakes and at that time you cannot even think of the logic like how you need to actually approach any question so I will sit with you your boy will sit with you and will just tell you how you need to approach the question how you need to actually write the code so that you can easily grab the concept plus let's talk about the conceptual questions so I will just try my best and
            • 05:30 - 06:00 I think I have tried my best to actually explain the conceptual questions which which are like bread and butter if you're just giving the interviews and I have tried to explain you everything in such a way that you can understand it absorb it so that you can actually answer it confidently you do not need to learn anything you need to understand it I know there are some terminologies that you need to learn like some technical names but after that you just need to understand it and if you are
            • 06:00 - 06:30 learning then obviously when you are afraid of like interviews when you are not in your 100% consciousness you cannot remember those definitions so I decided to explain you the stuff and just make you prepared for the interviews all all all set now you do not need to pay for those platform by the way we do not even have PySpark available in those platforms but there are some platforms and I would say they are so so so
            • 06:30 - 07:00 expensive I won't say do not waste your money but still if you do not want to spend some amount on those platforms Ansh Lamba Ansh Lamba Ansh Lamba is here and is all set to help you to just ace your interview in 2025 and let me just tell you this will definitely help you a lot if you are a pro if you think that you can just solve any PySpark questions bro then definitely you you can you can
            • 07:00 - 07:30 directly solve all the questions without my help but if you feel stuck if you feel that okay what can be the possible solution you can definitely refer um my solution like how I am approaching but I am making this video for most of the people who are not like Pro like you we are like normal human beings who need to just practice who need to feel sure before the interviews we are normal human being I know there are some people who are like bro like we know everything we know everything bro we do not need you okay okay okay okay no worries no
            • 07:30 - 08:00 worries no worries all sorted all sorted all sorted and let me tell you bro I spend so much of time I'm not just flexing this thing okay okay I just spend so much of time no bro literally I was thinking that it won't take much efforts but when I actually started working on this bro my mind was what because I need to figure out the best questions I need to figure out the best Solutions I need to just prepare the plan platform I need to just put the
            • 08:00 - 08:30 questions I need to just prepare the data frame I need to just oh bro let me drink some water no no just kidding so my main agenda is just to make you feel sure and make you feel confident before your interviews and bro just make this video viral because I want to help most of the people who are just looking for some PySpark resources because I know there are not much resources available and I do not want
            • 08:30 - 09:00 them to just pay for like some silly resources just use this resource and all done and all done just make sure you share this resource everywhere it's your responsibility it's your it's your responsibility my data fam and if you have not subscribed my channel do it right now right now share it with others I don't know I don't know just just just do it just do it just do it just do it and drop a lovely comment for me I feel feel happy when I read the comments I am
            • 09:00 - 09:30 I think self-obsessed I'm just kidding so it's up to you and I know you love me a lot so I can feel that love when I just read the comments so thank you so much man thank you so much bro says everyone everyone everyone everyone finally buckle up and grab your coffee by the way I just had a cup of coffee so I cannot have one more cup so you can and let's do it man let's do it it so now the question is from where you
            • 09:30 - 10:00 will be able to download this notebook in order to start with this video obviously because you just need to download this notebook so that you can just run it and then you will also be able to just fill those solution boxes along with this video so the answer is I I have just created a Telegram channel where I'll be just uploading all the resources for my videos plus I have also created one Telegram channel for help like let's say if anyone is facing some
            • 10:00 - 10:30 errors like I always like upload so many videos regarding projects as well so I have seen like people face errors while creating those projects so it is the best place if you have any query just drop it in the group and someone in the group can help you so I'm just talking to you as well so if you know how to debug that error you should help others and that's how the community grows okay that's how this whole data domain grows and not just the dat data domain it's
            • 10:30 - 11:00 for like all the the whole IT domain specifically developers okay and like what is the benefit if you just try to solve a error so solve an error for others like bro so your debugging skills will be improved plus when you will be having errors there will be someone for you as well who will guide you how to solve those errors and that is the role of a community so you can just help others in the Telegram channel and if if you want help definitely just uh drop
            • 11:00 - 11:30 the message and I just request you all just try to help each other and just grow this community stronger stronger stronger okay so I will just drop this notebook in the channel and in the group it is your duty to just have like normal conversations with each other just try to talk to each other like help each other because there can be some senior data engineers in the in the community they can be like just junior data engineers they can be someone who is just transitioning into a data role so
            • 11:30 - 12:00 just try to talk to each other so that community that group is just for you enjoy and obviously I I will try my best to just uh contribute in that particular conversation and and and one more thing one more thing so why this channel is so so important so I have decided to come live I think at least once a month okay so I will just try my best where we will be having just like normal chitchat and obviously some discussions in the world
            • 12:00 - 12:30 of data engineering plus I will just try to prepare some special mini mini project where we can just build in like one to two hours in the live or let's say a a Hot Topic that we should discuss so that channel is for you so you can just find this notebook in that particular Channel as well and plus make sure that you are an active contributor in that group as well so there'll be like two links in the description so just make sure you check both the links now what what is the prequest to start
            • 12:30 - 13:00 this video basically I would say just basic PySpark knowledge just basic and I would say even if you not have any PySpark knowledge you will be able to start with this video trust me and obviously if you are just preparing for the interviews I hope that you have like little bit of knowledge of PySpark okay so second thing is we just need a Databricks account because that is the best place to practice PySpark so I will just show you how you can just create a free Databricks account we will be using a Community Edition account and then you can just load my notebook there
            • 13:00 - 13:30 and then finally let's get started with this video and finally complete all the questions and feel confident and grab the offer and and and and and do not forget to say thank to me on on on on let me just type it for you on so I'm just kidding do not need to thank me just drop a message that you
            • 13:30 - 14:00 got selected or anything so I'll be like more than happy to hear that so it is just for that because that is the best thing that I can just expect in return okay so if you are Su if you are becoming s successful if you're just cracking the offers that means I am happy and it's it's just the you can say the best thing I can get in return simple and sorted so now you know you do not need to pay for any platform
            • 14:00 - 14:30 obviously I wouldn't say like those platforms are bad those platforms are really really good but it is like my contribution to your success as well so I hope that I can match the level of difficulty plus the scenarios and well and then definitely with the with the course of time within the video you will see that level is really really high with the course of time and you're learning a lot because I'm here to just explain that code for you as well right so it is like having a platform okay
            • 14:30 - 15:00 having all the answers as well plus your buddy to explain that code as well all the three components by the way these are six okay all the three components for free for free yeah for free enjoy grow and love me a lot just kidding okay so let's get started with this video because I'm really really really excited to cover all the questions let's go so in order to create a free databas account we will be creating a databas
            • 15:00 - 15:30 communiation account which is the best best best form of databas account available uh in which just we can just run our pbar code and do a lot of stuff in the world of datab brakes so let's quickly create a free datab Community Edition account so just go on Google and just type datab Community Edition then obviously you will see a lot of links okay so just try to find this one databas Community Edition login or sign up for databas Community Edition so let's click C on this sign up page then
            • 15:30 - 16:00 obviously it will just take you to this page where you can just log to your databas account if you already have one like maybe you can have an instance in your Cloud but don't worry we are not using that so simply click on try datab and then you will see this for personal use or for professional use so we'll simply pick for personal use because we are creating a Community Edition account we are not creating an a database account with cl okay simply
            • 16:00 - 16:30 pick get Community Edition then it will just say sign up for Community Edition now you just need to pick the email that you have not used before for your database communication account okay so I will just simply put my credentials so you just need to put the credential and I already have one communition account so I will just use the same in your case when you will be just putting your email ID it will just send an OTP to your email then you just need to put that email ID here and once it is done then obviously just continue
            • 16:30 - 17:00 just click on continue with email okay then once you have created your account you simply need to just go on Google again and just add this signin as suffix so that you can just click on the signin page or I think it will just directly take you to the signup page but I'm just telling you for your future login so simply click on this login button because this time we just need to sign in instead of sign up okay so simply use the same email ID that you have used while creating your database Community Edition Okay so this is your datab BRS Community Edition
            • 17:00 - 17:30 workspace, and I know, if you are already familiar with Databricks, it looks almost the same. That's why we use it: just for the sake of practice, when we do not want to spend money while practicing, mainly PySpark. I think that is the best thing you can do while practicing your PySpark code, where you do not need to actually access data stored in the cloud; you are just practicing. And obviously, while you are preparing for the interviews, that is the best thing that
            • 17:30 - 18:00 you can use for your PySpark interviews. And obviously you would be familiar with Databricks, like what notebooks are, because that's why you are going for PySpark interviews. So just click on this Workspace; we will quickly create a workspace folder, and just ignore my existing workspaces. I will simply click on this folder button and call it, let's say, "PySpark interview". Then, after creating this
            • 18:00 - 18:30 folder, you need to click on this folder, and now you will be using my notebook, in which I have put all the questions so that you can easily run the data frames for those questions and see the questions in sequence. It will save a lot of time for you; you do not need to manually create the data frames again and again, because you are preparing for your interviews, and that is very precious time, and you cannot afford to
            • 18:30 - 19:00 waste your time creating those data frames again and again. So now you need to download this notebook, and you already know where you can find it. Simply go to these three dots and click on this Import button. Now you need to upload the notebook that you have downloaded from my channel: click on this Browse button and pick that notebook from your PC. So what's up, my fam? All set? All set, bro. Okay, first of all, after attaching this notebook,
            • 19:00 - 19:30 remember you need to turn on the cluster. If you are new, simply click on this drop-down, because it may be that you have not used Databricks earlier; it's fine, bro. Click on this drop-down and you will see "Create new resource" if you have not turned on your cluster yet. Simply click on "Create new resource" and it will create a new resource for free. It is basically a node, a machine that is running for you to run your Spark code; it is just a basic machine, and it is free, so enjoy, bro,
            • 19:30 - 20:00 enjoy. First of all, as we can see, this is our question. The question says: while ingesting customer data from an external source, you notice duplicate entries. This is a very common scenario that everyone faces. So how would you remove duplicates and retain only the latest entry based on a timestamp column? Because obviously, whenever a record is coming... just try to understand
            • 20:00 - 20:30 the scenario, because I'll be explaining the scenario in detail so that you do not need to spend time on it during your interviews. The wording of the scenarios can be different, but hopefully the scenarios will be like this: obviously there will be a scenario for, say, removing duplicates, and the fundamentals of that scenario will be the same, like removing duplicates. So I will try my best to explain the scenario to you, because if you're spending a few hours on
            • 20:30 - 21:00 this video, I want you to learn the maximum from it, and after watching this video you do not need to go anywhere else, bro. Just sit in the interviews and say "ask me any question, because I have prepared all the questions from Ansh Lamba's channel". No, no, do not say it like this. Maybe the interviewer would have already prepared some questions from my video, so it will be like the cherry on the cake, and you can just say to each other, "we are both Ansh Lamba's data fam". Just kidding, just
            • 21:00 - 21:30 kidding. Okay, so first of all, let's see our data frame. I have prepared this data frame for you, so when you run this, you will see the issue. Let me zoom the screen a little bit. Yeah, perfect. A little bit more; no, no, it's too much; yeah, it's fine. When you look at the table, you will see that this is your duplicate column, which is product ID. And which column is
            • 21:30 - 22:00 responsible for the timestamp? It is obviously date. You can see that one entry for a product ID came on the 1st of December and the second entry for that ID came on the 2nd of December. So, what is the question saying? Before even writing the code, make sure that you have understood all the points, because you will not get a second chance when you are writing the code in front of the interviewer. Some interviews can
            • 22:00 - 22:30 be based on screening rounds, where some software will run your code and can just say "hey, this is not the output, this is not right". But there can be scenarios where an actual person will be monitoring you, and then you cannot say "oh wait, I didn't understand the question" or "oh, I didn't read all the points". No, do not behave like this. Read the question like a pro, and after reading the question, write down the points in your notebook. Can you use the
            • 22:30 - 23:00 notebook during interviews? Obviously. If you are sharing your screen, you can write those points in your notebook, and if someone asks what you're doing: "I'm just making notes, just bullet points, because I'm a developer and I need to consider all the things so that I do not miss anything." That's a good practice, and it will be a good point to highlight as well. So make bullet points. And what are the bullet points here? The first thing is duplicate entries, so you will create a bullet on the product ID column;
            • 23:00 - 23:30 let's say, just write "product ID: duplicate". And when you need to pick one value out of maybe two, three or ten, what will be the deciding factor? It will be a timestamp column, whichever value is the latest one. So write the second bullet point: "latest date". Why? Because the column name is date. Just write "latest date", that's it. It will leave a positive impression, because it's also
            • 23:30 - 24:00 about the vibe; it's also about how a person approaches the question. It's not always just about writing code. The person is not hiring a machine that can write code; the person is hiring a problem solver, someone who knows how to approach a question, how to approach a query if any stakeholder comes to you. Okay, simple; just try it, because that's why this video is different. This video
            • 24:00 - 24:30 is unique. This video is not just "hey, give us the question, give us the code, that's it". No, bro. Ansh Lamba is unique, so obviously his videos are unique, and I accept that, but it is for your benefit, bro. Okay, so this is the date column, and just tell me one thing: which entry will you pick? Obviously you will pick the 2nd of December. Why?
            • 24:30 - 25:00 Because this is the latest date; you will not pick the 1st of December. So you need to pick sales as 150. Why are we having duplicate values? There can be so many reasons: maybe a discrepancy in the source, maybe there's an adjustment for that product ID. So now, how to mitigate this? First of all, if you are an
            • 25:00 - 25:30 efficient data engineer, or if you're just applying for a role, this will be a quick test for you as well. I will explain this thing only for this question, and then I will just apply it in all the questions. Do not ask me "hey, this was not given in the question, why are you doing this?", because I know this thing is really important and this is the time to discuss it. If you look at this column, what is the data type of this column? It's string, right? Let me just zoom
            • 25:30 - 26:00 in a little bit. Yeah, perfect. 100%. Yep, perfect. Just go away, man. Okay, it will fade out. Okay, so what is the data type of this column? It is string. So first we need to cast this string data type into a date data type; only then can we approach the problem. So this will be a quick test for you: before even approaching a solution, are you
            • 26:00 - 26:30 looking at the schema? Are you looking at the data frame? Because this is important, and only then can you actually proceed. These are some things that you should always consider. Not all of these things will be mentioned in the question. Why, bro? They're hiring a person who will be a data engineer, and what will be the roles and responsibilities of that data engineer? That data engineer will be
            • 26:30 - 27:00 receiving questions, queries and problems from the stakeholders, from the non-tech people, like managers or, let's say, team leads, or anyone. So will you say "hey, I didn't write the code because you didn't tell me that this date column is in string format and not in date format"? Will you say this? You are a developer; you need to take care of everything. They will just give you
            • 27:00 - 27:30 the problem: "Hey, data engineer, we are having duplicates. You need to remove the duplicates, and in order to pick one value you can decide on the date column, which is providing you the dates, so just pick the latest." That's it. They will not give you all the information. Okay, so first of all, how can we approach this question? I will start by casting it, so I will simply write a comment, hash:
            • 27:30 - 28:00 "casting date column from string to date format". Then I can simply say df = df. ... how can we cast a column? Oh wait, wait, if you're new to my channel, because my fam knows everything about me: if you're new to this channel, I have created a dedicated, I think six-hour-long, video just on PySpark,
            • 28:00 - 28:30 and yes, this is the video that is coming up on the screen. That video has covered, not the questions, but all the functions which are required, or which you should know if you are sitting for interviews, because you could need any function for any question, right? So if you have not watched that video, just make sure, after completing all these questions, that you mark down which questions you were not able to
            • 28:30 - 29:00 solve. Before reattempting those questions, watch that video first and then give it a second try. If you are able to solve all the questions here, still make sure you watch that video after this video. And if you are a beginner, then it is
            • 29:00 - 29:30 your responsibility, after watching this video, to make sure you watch that video as well. I know this video will be covering almost everything, but if you are dedicated, if you are the person who says "I want to cover this thing and this thing and this thing, I do not want to leave anything out", because if you're trying to reach that goal that you have set, you should not leave anything out, right? So that video is also for
            • 29:30 - 30:00 you. Just make sure you watch that video after watching this video. So save this video, like this video, so that it goes directly to your liked videos. If you say "hey, why are you promoting it, why should we like it?", okay, just click on that Save button so that it goes to your library, and then you can like it later, because I believe that you will love that video. Okay, so first of all, let's cast the column. Obviously we will not be discussing in detail how we should cast the column, because I know you know
            • 30:00 - 30:30 all these basic things, since we are approaching PySpark questions. And I know those who are pros will say "we know everything, bro". Hold on, I know you know everything; we need to make sure that everyone is on the same page. So all this information is also important for those who are growing, who are not pros like you, so make sure you respect those people as well. Very good. I know the maximum of, not the
            • 30:30 - 31:00 maximum, all of my data community is very, very good. I'm just talking about some new people who are like "who is this guy, and why is he so talkative?" Can't help it, bro. And don't worry, just spend some time with me and you will fall in love with my videos, not with me. Don't worry, I am a very boring person in real life, bro. Trust me, I'm the most boring person you will ever meet,
            • 31:00 - 31:30 the most boring, trust me, and the most annoying. Do not fall in love with me, just fall in love with my videos, that's it. So now, first of all, let's cast this column quickly: df.withColumn ... oh, very good: before we even start building anything, we should first import all the libraries. Perfect. So this is also your task. I will create this import cell here too, so
            • 31:30 - 32:00 that you can start these questions and you will have all the libraries imported. But this is a tip for you: always make sure that you are importing all the libraries. Okay: from pyspark.sql.functions import *, so I will import all the functions, because I do not want to import every single function again and again. No, no, no. And from pyspark.sql.types import *. Okay, types;
            • 32:00 - 32:30 run this. Perfect, got it. Now we can use col: df.withColumn ... why is it not giving me the suggestion for col? Why? Oh, because this is not the paid version. In the paid version it just writes the code for us; that's why I didn't pick the paid version, I don't want that. So df.withColumn: what is the column name? It is date. Then we need to say which function
            • 32:30 - 33:00 we need to apply: we need to apply col("date").cast, and we need to pick DateType. DateType is the type available in pyspark.sql.types. Okay, so simple, it is done: now we have converted its type to date from text, from string. Simple, DateType. Then there's another
            • 33:00 - 33:30 function as well that you can use. I can show you, and it's up to you which one you want to use. So df = df.withColumn, then you will simply say "date", and then you need to pick to_date. Why is it not giving me the suggestion? Why? Is it a real test? I think so. So to_date, then we simply need to pick the column (who is texting me, bro?), col and then "date". So it will also
            • 33:30 - 34:00 work. So I can just comment it out; this one just converts to date as well. Or I can simply remove it, bro; one way is fine, I just wanted to show you. So now this is fine. Now, first of all, I will sort the data; let me write the comment "drop duplicates". Okay, so what I will do is, first of all, sort the data. Okay, so
            • 34:00 - 34:30 df.orderBy; I want to sort the data on date, and actually on product ID as well. Perfect: product ID first, then date. Then I will say what the sort order is, so I will say ascending = [1, 0]. Why 1 and 0? Because I want to apply ascending order on product ID and descending order on date. Okay, then once
            • 34:30 - 35:00 it is sorted, I can simply use the function called dropDuplicates. Now this is a good catch, and you should not get confused: you have two ways of writing the drop duplicates function, with an underscore (drop_duplicates) and without an underscore (dropDuplicates). If you use the one without the underscore, make sure you are using a capital D right after the first word. These are small things, and obviously, if you are my fam, you should not feel confused. Okay, so dropDuplicates, then we need to assign the
            • 35:00 - 35:30 subset. Yes, why? We could simply pass the column name, but again a tip: let's say in the interview you may encounter a question in which multiple columns have duplicates, or you need to decide on multiple columns. In that scenario you need to put all the columns in there, so that's why I personally always use subset; it's easy to remember as well. subset equals
            • 35:30 - 36:00 product ID. Okay, then you can simply say df.display(), and then your data is ready, in which duplicates are removed and the data is sorted based on the latest date as well. Now let's jump to question two. Question number two says: while processing data from multiple files (obviously you will be ingesting data from many files, or let's say from many sources
            • 36:00 - 36:30 from where you will get multiple files), you see inconsistent schemas. What does that mean? I'll show you, don't worry. You need to merge them into a single data frame. How would you handle this inconsistency in PySpark? Just have this idea in your mind: let's say this is your source, this is your source, and this is your source, and from here you are getting a few columns, let's say three columns, from here you're getting four columns, and from here you're getting just two columns. But you need to
            • 36:30 - 37:00 create only one data frame for all these sources; that is the requirement. How can you do that? For that you need to keep one thing in mind. For this question we do not have any data frame; you need to refer to an imaginary data frame, because you are pulling the data from the source, so you need to create the data frame yourself; you do not have one ready. Okay: spark.read.format, and now, obviously, if no
            • 37:00 - 37:30 data format is given, I can pick any data format, so I will simply pick Parquet. And then this is the main thing: you know that you have many files and there is no fixed schema for them, so you need to use this option. You will use .option, and within it you will simply write mergeSchema, and then you will
            • 37:30 - 38:00 set it to true. What will this do? It will merge the schemas of all the files, and then it will create one data frame on top of all those files. So if any file has more than the expected number of columns, it will add those columns and produce one data frame on top of the particular folder in which we have the files. Then everything is fine, and we can simply say .load, and I can use the location that I
            • 38:00 - 38:30 always use when I'm working with cloud storage; if it is S3, you can simply say S3; it can be any location. Oh, it's not "as3", it's "s3". So for now you can pick any file path, because the location is also not given; they just want to test whether you know this option or not. This is mergeSchema; it is very important, and nowadays it is even more important because schemas are not consistent across files. So yeah: files, data, and within this I have another folder called data files. Simple. Okay, so
            • 38:30 - 39:00 this is the data frame that you will be creating: you will get one data frame on top of all the files that have inconsistent schemas, or a schema mismatch. So your data frame will be created. So let's finally discuss this question. This is one of the most important questions, and it is the question that I can predict will be asked in your interviews for sure. Plus, it can be tweaked into so many
            • 39:00 - 39:30 forms. Let's say this question can be asked as "why did we stop using Hadoop MapReduce?", because we were using Hadoop MapReduce, or "why is Spark better than Hadoop MapReduce?". This question can be tweaked in so many ways, but you know what? The answer is the same, only if you know the answer that I'm going to tell you right now. If you know this whole concept that I'm going to discuss, only then can you answer all
            • 39:30 - 40:00 those tweaked questions. Otherwise, if you're just memorizing the answer from Google, you cannot answer all the questions; but after actually understanding this concept, you can answer all those forms of the question, trust me. So basically, Hadoop MapReduce, what is that? Hadoop MapReduce was the framework that we were using to process our data in
            • 40:00 - 40:30 parallel, so it is similar to Spark; that is why we have this question, "why are we using Spark over Hadoop MapReduce?". This was our go-to framework before Spark, as you all know. But why are we using Spark then, and why did we stop using Hadoop MapReduce? The thing is, this word MapReduce actually consists of two words, map and reduce. So this map was the process in which
            • 40:30 - 41:00 our data was being distributed among nodes, among machines, among a cluster of machines, whatever you want to call it, and this distributed data was then combined with the help of reduce, and then we had the complete data set, the complete data frame or complete output. Okay, so what was the issue with that? Everything was fine, right? The issue was that this map step was actually writing
            • 41:00 - 41:30 all the intermediate data to disk. What can the intermediate data be? Any kind of processing, any transformation, everything; all that data was being stored on disk. What's wrong with that? The reduce step then needs to go to the disk again to collect the data, because only then can it combine all the data that was distributed among so many machines. So it was actually taking a lot of
            • 41:30 - 42:00 time as data grew. Okay, then Spark came into the picture, and Spark started processing and distributing the data in memory, and reading the data from memory, so it did not have to go to the disk to store the data first and then read it back, and that saved a lot of time. That is the
            • 42:00 - 42:30 first thing, and that is the best thing. Now a follow-up question can be asked, and this is really important; it can be really tricky. They can ask: "But Spark still writes data to disk. First of all, why? And then what's the difference; why do we need Spark if it still writes to disk?" So the answer is: yes, it writes data to disk, but mostly only when we tell it to write the data to disk,
            • 42:30 - 43:00 for example when we use the persist option and ask it to write data to disk using that parameter, or when it cannot handle all your intermediate data because the data cannot fit in memory; only then does it write data to disk. It is not writing all the data to disk; it writes to disk only in those scenarios. Now, if it writes data to disk under
            • 43:00 - 43:30 some circumstances, then Spark and MapReduce will be the same, right? No. Spark is still faster than MapReduce because it has query optimization techniques: the Catalyst optimizer, logical plans and so on. So even if it writes data to disk in some scenarios, it is still faster than Hadoop MapReduce. So this is the major difference, and this is your answer to all those
            • 43:30 - 44:00 questions: why MapReduce, why blah blah blah. Okay, second answer; well, not a second answer, a second point on why we are using Spark. Basically, Hadoop MapReduce was designed for batch processing, but Spark is meant for both: it is efficient with both batch and stream processing, and that is the need of the hour. Right now we need a framework which can process
            • 44:00 - 44:30 the data in real time, and Spark is very efficient at that, as we all know. So now you know the concept, so just be confident, and whenever you hear this question, just say "bro, I will answer this question, I will nail it". Okay, so this was your question, and it is really important; just make sure you are taking notes. So now let's see what our next question is. This is our third, not third, I
            • 44:30 - 45:00 think fourth question, yep, because the third question was the theoretical question; but according to the notebook it's the third, so it's fine. So this is our fourth question. What is this question? It says that you are working with a real-time data pipeline (oh, we were just talking about real-time, streaming data, and now we have a real-time data pipeline), and you notice some missing values in your streaming data. The column name is category, and we need to handle null or missing values in that scenario, and
            • 45:00 - 45:30 this is the data frame that they are referring to; run the stream, just to initiate the stream, the streaming query. So in that scenario, this is our streaming query. What can we do? We can simply say df = df.fillna. We have a function called fillna, but within it I will pass a dictionary of key-value pairs, because I need to specify the column inside it, which column I
            • 45:30 - 46:00 need to pick. If I just use fillna, I can pass a single value, but I want to specify the column as well, because there can be multiple columns. In this scenario it is just one column, but there may be two or three columns, and in that scenario we will pass a dictionary. So we will simply say "category", and then, if we have any specific value for category, that's fine; otherwise we can simply say "NA", so that instead of
            • 46:00 - 46:30 missing values or null values it will hold some value. I know it is NA, but it is understandable, because we can query this data based on this NA and we can show in the dashboards that these values are NA. So it is still fine; we have some value to hold in it. If you want to assign multiple values, you can simply add a comma and another key-value pair. So it is fine. So this is your code for this question, and this was really easy, I know, but the catch was this thing, because they can
            • 46:30 - 47:00 ask you to add more and more columns. So that was it for this one. Now let's see what we have in our next question. Here we have our next question, where you need to calculate the total number of actions performed by each user in a system. And how would you find the top five most active users based on this information? So, this is how our data frame
            • 47:00 - 47:30 looks. We have user ID and actions, and we first need to find the total number of actions per user. As you can see, we have multiple rows per user, user two, user two, and this table can go on. Then we have total actions: first of all we need to find the total actions performed by each user, and on top of that we need to find the top five. Okay, just the top five. So how can we do it? It is very simple, so let me
            • 47:30 - 48:00 write the code for you. We'll simply say df = df.groupBy; we'll be using the groupBy function, and we need to group by the user ID. Then what do we need to perform? We need to perform an aggregation using the sum function, and we need to find the sum of the actions column. Perfect. And then I will use an alias; to be an efficient data engineer you should also use an alias whenever you are writing a SQL query or doing any
            • 48:00 - 48:30 aggregation. It shows that you're professional; it shows that you know how to write queries or transform data in the real world as well. You cannot just leave your aggregation without an alias, right? So I will simply say total actions. Okay, and then once it is aggregated, what I will do is apply an orderBy based on the total actions. So we'll simply say
            • 48:30 - 49:00 orderBy on total actions. Why? Because we need to fetch the users with the maximum number of actions. Then it says we need just the top five, so we can use the limit function and say limit(5), that's it. And then we can simply run it with df.display(). And one more thing: we need to
            • 49:00 - 49:30 say ascending = False, or ascending = 0, it's your choice; I will simply write ascending = False. Perfect, let me run it. Oh, perfect: we have got the users in descending order. User 2 has 11 actions, the top one; user 4, 10; user 1, 5; user 3, 2. We only have four users, so it gave us all the records, but if we had more users we would get just the top five. So it makes sense, right? You can see the data; it
            • 49:30 - 50:00 is sorted based on the total number of actions. Perfect. So now it is all done; that was all about this question, there is nothing special in it. So let's see what we have in our next question. Here we have our next question, and this question will require something special, so I will let you know what that is. Basically, while processing sales transaction data, you need to identify the most recent transaction of each
            • 50:00 - 50:30 customer. Let's say you have many rows per customer, like customer 1, customer 2, customer 1, customer 2; then you need to get the most recent one for each, let's say this one and this one. Now, we know this question is similar to the one we did before, where we were removing the duplicates, but this time I do not want to use the dropDuplicates method. So how do we do that? Because
            • 50:30 - 51:00 obviously it is a professional method as well so now I need to use window functions for it window functions yes window functions and obviously it will just show how much you know about let's say window functions as well so I will just try to complete the solution with the help of window functions and just do it yourself if you know window function if not I will just tell you how you can do that okay so first of all all obviously it's a quick test for you what we will do the first thing obviously
            • 51:00 - 51:30 convert this to date. This time we have transaction date instead of just date, and we have customers instead of users. So first of all I will simply say df = df.withColumn, then transaction date, and then I will simply pick, um,
            • 51:30 - 52:00 col of transaction date, dot cast, and this time it will be date type. Let's use cast, because previously we used to_date, so it's fine. Okay, so now here is the thing. What do we need to do? First of all, we need to pick each customer's most recent transaction date, and without dropDuplicates. Now you will ask: will you get questions
            • 52:00 - 52:30 that say you need to do this without dropDuplicates, or questions that say approach this with the help of window functions? Yes. If they want to test your skills, they can definitely mention it; trust me. Window functions are really important; they are like the bread and butter if you know SQL, so they will definitely be involved. They can mention this, so that's why it's my duty to write the code for you using window
            • 52:30 - 53:00 functions as well. So first of all, how do we need to do this? Basically, I will use a function called row_number, or let's say I will use the function called dense_rank. What I will do is rank the customers based on the transaction date and create a new column called flag. Let's say this is flag, and this will give the ranking to each
            • 53:00 - 53:30 customer based on the transaction date. So this row will get ranking one and this will get ranking two. Obviously we need to sort the latest one first, so this will get ranking two and this will get ranking one, and then I will filter all the records on flag equal to one. Let's do it, bro. This column is already typecast, so I do not need to run this because I know it will run fine; if I see some errors I will just
            • 53:30 - 54:00 correct it. Don't wait, because we are developing the code and we are not testing the cast; we can test it later. Okay, so now we will simply say df = df.withColumn. I will create a new column called flag, and this will be a dense_rank function, over. Over what? Over Window dot. Don't worry, I need to import the Window
            • 54:00 - 54:30 library; I will import it, don't worry. Window.partitionBy customer ID, and then I need to sort my data based on transaction date. Perfect. And I need to sort this data in descending order, so I will simply say desc. So now my data frame is sorted, right? So now what I can
            • 54:30 - 55:00 do, I will simply pick. I can see a red mark here. Okay, are we missing some column? Yeah, so it is fine. Now our data frame is ready, df is ready, but we need to filter: dot filter, and what we need to filter is col of flag equals equals one. That's it.
            • 55:00 - 55:30 Because that's what we want, and we will simply say df.display(). Before running this, I can simply say from pyspark... By the way, this can be another question for you. Maybe it will not be written, but it can be tested, to check your knowledge: do you actually know which library to import to use window functions? So it's pyspark.sql.window, with w in lowercase, then we will say import
            • 55:30 - 56:00 Window, with W in uppercase. So I will run this and let's see what we get. Perfect, I got all the latest records, and this flag column is just for reference so that you can see that this is the first value that we have brought. And just to confirm, I should see customer one with sales 200 based on the data. Customer one with sales 200, perfect, this is the latest
            • 56:00 - 56:30 sale based on the latest date, and customer two with 250. So this way you can use window functions in scenarios where you need to rank the data frame and then filter a specific value. Why did I use this function, and why is it very much in demand when they ask this question? So the thing is, if I'm
            • 56:30 - 57:00 not using dropDuplicates, there's a reason behind it. Obviously, if it is written in the question then we have to use it. But what if we want to find, let's say, customer ID with customer sales for their third transaction date? Maybe this can be the scenario. At that time, how will you eliminate the duplicates based on that third date? Obviously there are workarounds, but then it will be such a long transformation. But with the help of window functions,
            • 57:00 - 57:30 which are basically functions that apply a transformation at the row level, we can get the right index, whether it's three, four, five, six, seven, eight, nine, any index. We can get any sale of a customer: the third sale, fourth sale, fifth sale, second sale, any sale. But with the other approach we can only get the minimum or maximum, like first and last, that's it.
            • 57:30 - 58:00 We cannot play much with indexing there, but this is the key. So by learning this solution, you now know you can work with any question that requires indexing or, you can say, ranking. So that was this question; let's see what we have in the next one. Our next question is: you need to identify customers who haven't made any purchases in the last 30 days. So basically we need to
            • 58:00 - 58:30 apply a filter. If you look at the data set, we have customers and their last purchase date, and we need to filter those customers who haven't made any purchase in the last 30 days. That means, if I make the bullet points that I always talk about: first of all, we need to compare the date with the current date, that is, today's date. Then we need to also make
            • 58:30 - 59:00 sure that the difference is greater than 30. These two should be the bullet points, right? So how can we do that? First of all I will convert this column into date type. So I will simply say df = df.withColumn, and it says last purchase date. Then, obviously, first we need to cast it, so now we will simply
            • 59:00 - 59:30 say to_date, and then we can say last purchase date, so this will be converted into the date format. Perfect. Now what we will do is simply say df = df.withColumn. So basically we will, uh, create a new column in which we will
            • 59:30 - 60:00 store the number of days, the difference between today and the last purchase date. We definitely want to see what the difference is; only then can we apply a filter, right? First we need to get the difference. So you need to approach the question in the form of steps. If you figure out, okay, I want those users which have a difference greater than 30, first you need to have that difference. So for that
            • 60:00 - 60:30 I will simply say, uh, gap, let's call it that. I'll use a function called date difference; it's called date_diff, or let's say datediff. I think it's datediff, yeah. Then we need to provide the start date and end date. The end date can obviously be today's date, so we'll simply say current_date, and what will be the start date? The start date will obviously be the, uh,
            • 60:30 - 61:00 last purchase date. Okay, done. Now it will give me the gap. Then I will apply the filter: dot filter, and then I will say col of gap greater than 30. Then I can say df.display(). Perfect, let's see what we have.
            • 61:00 - 61:30 So perfect, we got two customers because the gap is more than 30 days: one is 40 days and the second is 45 days. And if you validate the data, that makes sense, because one user's last purchase date was the first of December 2025, so he has made the purchase in the future. I think I typed 12 instead of 01, so it's fine, because we are going to have data for
            • 61:30 - 62:00 the future as well. If you're dealing with finance data, where we have future dates, it is possible, no need to worry. So now let's see what we have in the next question. It's time to cover some questions related to text plus arrays; it's a combination of both. So this is your question, in which it is written: while analyzing customer reviews. Obviously, if you're working in an organization, most probably you will be working
            • 62:00 - 62:30 with customer data as well, because customers are everywhere. So, in order to review their reviews, okay, makes sense, you need to identify the most frequently used words. This can be used by data scientists as well, right? But forget about the data scientist; be a data engineer: this is the requirement and you need to fulfill it. Yes, there can be a scenario where you will be asked by a data scientist to complete this
            • 62:30 - 63:00 request. Then how can you do it, and how does our data frame look? Okay, so these are my customers, customer 1, customer 2, customer 3, and these are the feedbacks. I want to get the most frequently used words. So in order to proceed with this kind of question, or let's say any question where you want to categorize the words, basically what I'll be
            • 63:00 - 63:30 doing, and what you should also do, is this. This is your feedback column, and within it we have many words: "the product is great", and then "great" again, "great product". So what we will be doing is categorize each word, then find the count, and then apply, let's say, orderBy on that count. "Ansh Lamba, you have discussed so
            • 63:30 - 64:00 many bullet points; when you are actually doing it we can watch you, and meanwhile we can try solving this question on our own." Perfect, that's perfect. Okay, so what I'll be doing: I can say df.withColumn, yeah, hold on, and I will simply say, uh, feedback. First of all, I will convert this text into an array, simple. So
            • 64:00 - 64:30 for that I will be using the split function. I'll simply say split this feedback column based on the delimiter, which is a space; then it will store all this information in a list. Perfect. Now I want to show you the display, because we need to apply many transformations within this code, but you need to understand
            • 64:30 - 65:00 what I'm actually doing. So I will simply say df.display(), because I want you to learn as well, bro. It's not just about learning the code or remembering the code; you should grasp the knowledge. So as I was saying, this is converted into an array, so now I have a list ready. Sorted. What I can do now is transpose this list into values, because what I want to
            • 65:00 - 65:30 do is break this list so that I get each and every single value in that column. For that I will be using explode. Explode will explode the list, simple as the name suggests. Really, bro, I can show you. See, I will simply write explode, then I can wrap this whole thing in parentheses. Parentheses or braces, or curly braces, okay, round
            • 65:30 - 66:00 brackets, simple. In coding we have many different kinds of brackets and many names, so simply say round brackets. Sorted. Okay, so this will explode the values. Now I want you to see the values, so I will rerun this code from here. Why? Because I have already transformed something into df, so this is the code that it will be running. So now this has transformed the
            • 66:00 - 66:30 column into individual values. What I will do now is group my data, let's say df_group, and I will say df.groupBy feedback. Perfect. So it will group by the word, and then I need a count, so I'll simply say agg, count of the feedback column. That makes
            • 66:30 - 67:00 sense because we need the count, and I can put the alias as, uh, what would be a good alias? The alias can be, let's say, number of times appeared, or let's say count. For now let me keep it as word count; it will be a better alias, because "Lamba, you were just giving us the lecture that we should choose a good alias", so I accept that. Okay, alias word count. So it will count
            • 67:00 - 67:30 how many times each word has been repeated. Okay, now I will simply rerun this code from here again. Now I should see df_group; let's see what we have inside this, because I'm not just giving you the questions, I'm giving you the scenarios and trying to explain the concept. Got it?
            • 67:30 - 68:00 So obviously this was a small data set, so all the words appeared only once. One thing I want you to tweak here: as you can see, great and Great, great actually appeared two times, but the count didn't pick up the second one. Why? Because of the uppercase G. So in order to tackle this situation, and obviously these things can be add-ons, what you can actually do is
            • 68:00 - 68:30 first of all apply another transformation, let's say dot withColumn. So here we can apply another withColumn, and then I will say word count. Perfect. And what transformation do I want to apply? lower, simply lower, and then word count. Perfect, now let me run
            • 68:30 - 69:00 this. Oops, uh, oh sorry, I need to use feedback, because that was the previous column, and then we renamed it; I looked at the new data set instead of the previous one. So sorry for that, no need to worry, just rerun this. Perfect, this was the thing that I was talking about. Now if I
            • 69:00 - 69:30 want to sort my data based on the most frequent words, what are the most frequent words? Now great is the most frequent word in my small data set, right? So these are the scenarios where you can play with word counts, and you can use those words while building, let's say, an ML model. Or not that you will be building it, but you need to feed the data to the stakeholders, right? So this was a really good question and I hope you liked it. Now let's see what we have in our next question, or what's our
            • 69:30 - 70:00 next question. Let's see. Bro, if you are applying for any financial consultancy company, or anything that requires dealing with finance data, accounts data or anything like that, this question is like bread and butter for you. And it's not just about cracking the interviews; in day-to-day activities as well you have to deal with these kinds of scenarios. What's
            • 70:00 - 70:30 that? Basically, you need to calculate the cumulative sum. What is a cumulative sum? Obviously I'm not a finance guy, really; we are mathematics guys, and I don't even know the M of mathematics. By the way, I scored, I think, 95 in mathematics in my 12th, so I thought I was very good at mathematics. Then I took a bachelor's in mathematics and realized that I was wrong, because, yeah, I don't want to
            • 70:30 - 71:00 discuss that, bro, forget about that. That was a phase, and that is gone. Okay, so you need to calculate the cumulative sum of sales over time for each period. Basically, to sum up the question, or to make you understand it, this is your data: let's say product 1, product 2, product 3, 4, and so on, and these are the dates. I want to find the cumulative sum.
            • 71:00 - 71:30 So how will my output look? As you can see, my data is like this. Then I need to make sure that product 1 will have sales as 100, but for the next period, which is the 3rd of December 2023, my data will show sales as 250, because 100 is added to 150, and so on. That is called a cumulative sum. So in order to find
            • 71:30 - 72:00 the cumulative sum we have to use a window function again. This time we will not be using a dedicated window function; we will be using aggregation functions, but we will treat those functions as window functions. "Ansh Lamba, life is already really complex, why are you making it more complex?" Am I making it complex, bro? I'm trying my best to make it simpler, trust me. So I'll show you what I mean. Okay, so
            • 72:00 - 72:30 first of all, I'm just making you feel confident, bro. You are saying that I'm making your life complex? Really, they are making your life complex. Who? Your interviewers. I'm making it simpler so that you can answer all the coding rounds; then drop me a message that you are selected, and enjoy. So what will we be doing? Simple: I will
            • 72:30 - 73:00 create, or let's say recreate, the sales column. How can we do that? I will simply say df dot... actually, I can create a new column so that you can clearly see the cumulative sum column. Let me show you: df.withColumn, and this time I will simply say cumulative sum, cum sum, yeah, it makes sense, cum sum. So now I will use a window function, or rather it's not a window function, it's basically a normal aggregation function called sum, but I will treat it
            • 73:00 - 73:30 as a window function. So I'll simply say sum of our sales column. Sorted so far. This will find the sum of sales, but we need to add dot over, and then we need to say how to partition the data, how to segregate the data, how to create the slabs. You need to visualize it, bro, just visualize it. It's so beautiful, my
            • 73:30 - 74:00 visualization; I should not say that, but yeah. So dot over, and after over I need to define the, you can say, slab or any kind of partition. So I will simply say Window.partitionBy, simple. And how do we need to partition our data? Based on product ID, simple. Then after that, what will be the order of that
            • 74:00 - 74:30 cumulative sum? It will be the date column. So I will simply say dot orderBy and then the date column. Now it is sorted. One thing that I need to do first is cast this column, because this column is again in string format. So df... it will take just a few minutes, and now you are well versed
            • 74:30 - 75:00 with that. So I'll simply say date, to_date, uh, perfect, and then date. Perfect. And then I can simply say df again. Now if I zoom in a little bit, you should see the code: cum sum is the column name, sum of sales dot over, because we are treating it as a window function, because we are applying a row-level transformation, right? Oh my hair, bro. Okay, Window.partitionBy, then we
            • 75:00 - 75:30 have defined the partition column, which is product ID, then orderBy, which is by date obviously. By default it will take ascending order, and for a cumulative sum we should always take ascending order, but it's not a big deal: if you want to use desc, we can just say desc. Now just tell your finance guy that... oh wait, hold on, I hit Shift+Enter by mistake.
            • 75:30 - 76:00 I was supposed to do it like this, and then I wanted to run this. Perfect. So just tell your finance guy that your data is ready, "now just do your finance stuff". No, no, don't talk to him or her like this, because obviously you need your salary, right? Okay, so this is my product ID and this is my cum sum: 100, then on the next date or period I
            • 76:00 - 76:30 will see the sum cumulatively added: 100, then 150 makes it 250, common mathematics, then 200 makes it 450, and so on. This is a very important question and you should know how to deal with it. So this was all about this question; let's see what we have in the next question. In this particular question we have a
            • 76:30 - 77:00 special addition. So the thing is, we need to work with a data pipeline in which we have some duplicate rows. Bro, there are so many questions on duplicate rows and duplicate data; removing duplicates is the basic thing that you will do almost daily. Why? Because, if you know some fundamentals of data warehousing, duplicates can break the whole data warehouse. That is why it is so
            • 77:00 - 77:30 important to deal with duplicates and to apply as many staging steps as you can before loading your data into the data warehouse. Simple. Now there's a special addition to this question. What's that? Basically there's a special requirement from our stakeholder, who is saying that you need to remove duplicate rows but you need to keep the order as it is. You cannot just pick the maximum of the value
            • 77:30 - 78:00 based on the date column and keep that. "No, I don't care about the date column, I do not care if you are using the maximum date for some values or the minimum date for others. This is my table: keep the order as it is and keep the first records without caring about anything else, without, let's say,
            • 78:00 - 78:30 touching anything. You just need to pick the very first record available." How can we do that? Here we will again be using a window function, and this time a special window function called row_number. So what is row_number? You should know, bro; if not, basically row_number assigns the values 1, 2, 3, 4, 5. Based on what? Based on nothing, I'm not kidding. My phone's battery is low, no
            • 78:30 - 79:00 worries, I'll charge it. So what it will do is simply provide 1, 2, 3, 4, 5, 6. It does not care if you have the same values. Let's say here, if you look at the age column, what should be the ranking for 25? You will say one, let's say. And what should be the ranking for the next 25? If the first is one, then obviously you will say one. row_number will say: I do not
            • 79:00 - 79:30 care, this is two. How? I don't know; because it appeared in the second position, that's it, it is second. So it does not care about ties, but that doesn't mean it is a useless function; see, it is very useful in this particular situation. By the way, the second useful application of row_number is generating a surrogate key. Now do not ask me what a surrogate key is. You do not know what a surrogate key is, bro? What are
            • 79:30 - 80:00 you doing, bro? So basically a surrogate key is a pseudo key, a pseudo business key, that we assign in order to replace the primary key in the dimensions. Why? Because we keep the primary key that we receive as the business key, but applying joins on it takes a lot of time and creates complexity because it is so long. Surrogate keys are simple, 1, 2, 3, 4, 5, 6, so it is very easy to apply joins. Sorted, right? So now see,
            • 80:00 - 80:30 you're gaining extra knowledge as well, because not everything will be covered in the form of a question like "what is a surrogate key". But it can be a quick question: you are writing the code, and the interviewer asks you, "Hey bro, what is an application of row_number other than this?" and you think, "I didn't cover this question." That is why I'm explaining the concepts and the actual scenarios, how to approach them, how to
            • 80:30 - 81:00 approach them, right? It's not just a coding session or a coding class; you need to solve the problems. You are a developer, but you need to develop the solution instead of developing a website. That is the difference between a website developer and a data engineer, right? You do not just need to develop, you need to solve problems. I know that DSA is not very popular in data engineering, but still we
            • 81:00 - 81:30 solve problems, bro. We solve problems not just with the help of coding: we first understand the problem, then we approach it with different, you can say, applications. "Okay, just keep your mouth shut and write the code." Okay, I heard you. So simple, what I will do is say df.withColumn, and then I can create a new column, just to show you, as row flag. This time I will be using a transformation called, uh, window, not
            • 81:30 - 82:00 window, just row_number, yeah, row_number. Then I will simply say dot over, and then I need to define the partitionBy first, obviously, and then the orderBy. The partitionBy will obviously be name, so Window.partitionBy and it will be name, and then I will simply say orderBy, and the orderBy will be my age. Perfect. Then it will assign values like 1, 2, 3, 4, 5, 6, 7, 8,
            • 82:00 - 82:30 and then I will simply say dot filter, to filter my data frame, with col of row flag, oops, equals equals 1, simple. Then I will simply say df.display(), and just display this result to your interviewer, and he or she will be very happy. No? What's that, bro? Oh, I just added
            • 82:30 - 83:00 an underscore, do not worry. Now you should be happy. Simple, bro. See, now I got Jane, John, Alice, with just one value for John. And which value is it? The first one. Okay, so this was all about the row_number function. It is a really handy function, so you should know window functions in detail. See,
            • 83:00 - 83:30 there are so many questions. So let's see what we have in our next question. "I don't know how you implement this; first of all just tell us this thing." So basically it's a very cute and simple question, but still important. It doesn't mean that if I'm creating a video I should only cover the complex questions so that the video looks very catchy. It's not like that. When you're preparing for
            • 83:30 - 84:00 interviews you should cover everything: simple things, complex things, some crazy things. Everything is important, bro. Do not focus only on the difficult things, "I just need to prepare for this, I just need to prepare for data skew and all". Bro, cover everything; everything means everything. Okay, so this is very simple: these are my users, and I need to perform a simple aggregation and
            • 84:00 - 84:30 grab the average duration for each user. How can we do that? It is very simple. You can simply write df = df.groupBy, and I need to apply the groupBy on user ID, and then I need agg, and the agg will be the average of duration, dot
            • 84:30 - 85:00 alias, because Ansh Lamba said always provide an alias name, we cannot skip it. Yeah, you should not. Now you can say average duration, and then you can say df.display(). Perfect, let's see what we have. What happened? Oh, I just put a single quote here, so it's fine, just run it. So this is the result: user 1 and user 2, 55 minutes and 60
            • 85:00 - 85:30 minutes. You can validate this information as well: the average of 50 and 60 is 55, just trust me, because I have done my bachelor's in mathematics, bro; this is simple mathematics, it is 55, right? And the average of 45 and 75 is 60. Just trust me, or no, take out your calculator and validate it. So this
            • 85:30 - 86:00 question says that you need to find the product with the highest sales for each month. And what do we have in our data set? We have product ID, date and sales. So within, let's say, month 12, we have products and then we have sales. So actually we need to apply a double
            • 86:00 - 86:30 aggregation: one is based on date, and the second is based on product ID. As you can see, you need to find the product with the highest sales for each month. So how will we be doing it? Just telling you the approach: let's say in the first, uh, month, or let's say on one day, we have two products, product 1 and product 2, and we have sales of
            • 86:30 - 87:00 100 and 150. Sorted. And there can be multiple records as well; this is a small data set, but there can be multiple records. So actually we need to apply the aggregation on date, for every month, and then apply the groupBy on product ID as well, so that we can find the total sales available. As you can see, we have sales for the next year
            • 87:00 - 87:30 as well. So we need to find the total sales: we need to first aggregate the data on date and then on product, and then we will get the top product ID from each month. So this contains multiple transformations; let's see how we can approach it. This is really important, just focus on this part. So first of all I will convert my date column
            • 87:30 - 88:00 into date format. So I'll simply say withColumn date, and then I will say to_date, and then the date column. Perfect, this is done. Once it is converted into a date, the real transformation starts. First of all I need to grab the month from the date column. If you
            • 88:00 - 88:30 see, my date has the complete date, like 2023-12-01 and 2023-12-02, so we just need to grab the month. For that I'll simply say df equals df.withColumn, and I can transform this same column, date, and I will say month, and then col of date, or let's say just date. So this will just
            • 88:30 - 89:00 grab the month out of the date column. This is sorted. Now on top of this data frame I will apply a group by. Wait, I just added .groupBy on the same expression, so on top of this data frame I will apply group by. And what will be my group by column? First of all it will be date; let's say priority will be given to date. So now it will apply group by
            • 89:00 - 89:30 on date, but I need to apply group by on product ID as well, so I will simply add product ID. So now I have applied aggregation on both the columns. Then I will say .agg, and I will simply say sum of sales. Perfect. So far I have aggregation on month, so this will
            • 89:30 - 90:00 go to this column and it will say month 12 and product one, and it will apply sum on both the values, 100 and 200. Sorted. Then it will take the same date, obviously, because the month is the same, it's 12; then it will go to product two and apply sum on 150 and 250. So far the data is just aggregated. It has
            • 90:00 - 90:30 just aggregated the data. So now what do we actually need? First of all, let me show you the output for this data frame so that you can actually understand what we are doing. "Oh, Ansh Lamba, you forgot to give an alias." Okay, okay, we got you. So what, bro? I was just trying to explain
            • 90:30 - 91:00 your stuff. Wait, let me complete this code and then I can do it. What's wrong in that? Oh, we just need to rerun it, because we already applied the aggregation. No worries, it is fine. So first of all, as you can see, the month is the same, with product one and product two, and they have total sales of 300 and 400. Now what we
            • 91:00 - 91:30 need to grab is the product with the highest sales for each month. On top of this, we will be applying dense rank. On what? On sales, obviously, but partition by will be on date instead of product ID, and that was
            • 91:30 - 92:00 the main catch of this question. So what I will do: I will simply say df equals df.withColumn, and then I will name the column, let's say, ranking. Then I will say
            • 92:00 - 92:30 dense_rank, then .over, Window.partitionBy, and partition by will be on date. Perfect. And then what will be the order of my ranking? It will be .orderBy, and the column is, as we all know,
            • 92:30 - 93:00 sum of sales; because we didn't rename it, it's sum(sales). But we need to wrap it in col first, because we need to use .desc() as well. Perfect, every bracket is closed. Now I will just filter on ranking, or col of
            • 93:00 - 93:30 ranking, equals to 1. Now let me show you. For that we need to rerun it again, because we need a fresh df. What does that mean, equals? Oops, it just removed one equals sign. And what's wrong with this? Actually I think we just messed up one of the parentheses, so let's say
            • 93:30 - 94:00 col of sum of sales. Oh man, there were too many brackets, so let me confirm before submitting: col of sum(sales), then .desc(), and this parenthesis is closing this one, and this one is closing this one. Let's run this. What's wrong with that? Sales cannot be resolved? Really?
            • 94:00 - 94:30 Actually, do we need to rerun it again? Yeah, I think we need to rerun it. What's wrong with this code? Let me see. So basically, Column is not callable. Why? We have already run this, and that is the only issue that
            • 94:30 - 95:00 it throws, regarding the Column object. Let me rerun it. Oh, we need to restart it. Column object is not callable. Why, and where is that? I think something is wrong with the parentheses. No, it is fine: ranking equals equals 1, and then filter col of ranking. I think there's something wrong
            • 95:00 - 95:30 with the brackets, let me check. Wait, I think the code is fine; there's something wrong with the indentation, because we have written the whole chain as a single statement spread across lines. I got it: we just need to wrap it in parentheses here. Let me rerun the code, and now I think it should run fine. The code was fine. Perfect, I knew it. So
            • 95:30 - 96:00 simple. Now as you can see, we got product two for the month, because product two has more sales: 150 + 250 becomes 400. So we got the top product for each month. This can be applied in scenarios where you are analyzing product performance month by month, or it can be tweaked for weekly, yearly, or quarterly performance as well.
            • 96:00 - 96:30 Well, so it totally depends, but now you know the fundamental thing: how you need to approach that question. So I think now we should level up our questions, because we have already solved how many questions? 12. So let's level up. Let's see what we have in our next question: what is the role of Spark context in PySpark? So basically, Spark
            • 96:30 - 97:00 context is basically the entry point for Spark. It is the connection that we build: it creates the connection to the cluster so that the machines, CPU, memory, everything, become available and we can start our processing. Obviously we will not be processing ourselves, but our code will start processing with the help of Spark context. So you can
            • 97:00 - 97:30 treat it as a kind of bridge between the driver program and the cluster manager. It acts as a bridge, it fills that gap, and it establishes the connection to that particular cluster manager or cluster. Simple, sorted; it is very simple, do not make it complicated. So this is all about Spark context. Now it's time to cover the most basic, most fundamental, and most important question. Obviously there
            • 97:30 - 98:00 can be a direct question, "explain the Spark architecture in just 2 to 3 minutes", or there can be cross-questioning based on this architecture, so you should actually know what the Spark architecture is. It is very simple, do not make it complicated. Let me show you. First of all, this is your cluster. Just to be clear about what these boxes are: this is one cluster with machines,
            • 98:00 - 98:30 or nodes. Whenever you submit the code, let's say you, your manager, or your interviewer submits the code or application, it first goes to the cluster manager. This cluster manager can be YARN, that is Hadoop YARN, or let's say the Spark standalone cluster manager. Then it picks one node, one machine, within
            • 98:30 - 99:00 the cluster. Hold on, I'll talk about the thing you are thinking of right now as well. So it will pick one machine, create a driver program on it, and say it is the driver node. Then within that driver node, Spark context will be created. That is the starting point of our Spark application, and that's why I discussed Spark context first and then I
            • 99:00 - 99:30 discussed this. So it will create a Spark context, which is the connection with the cluster manager. Why is this connection important? The driver program will look at the code and say, okay, we need to do these transformations. Then it will say, hey, we need two more machines, two worker nodes, to get this work done, and it will communicate that information through Spark context. So the cluster manager will say,
            • 99:30 - 100:00 okay sir, let me give you those. This is step number two: it will say, I will assign you two worker nodes. Done, your machines are ready; now it's just between you and your machines, I am out. So now the driver program, or driver node, will be communicating with the worker nodes. This is the architecture. The driver node will pass on the
            • 100:00 - 100:30 information that you need to do these transformations, you need to apply this and this, and then the workers will follow those guidelines. That's it, because the driver node is the brain. Now, the thing you were thinking about: just hold on. When I say that the cluster manager will pick one machine within the cluster, it totally depends, because there are two types of deploy mode. If we are running our program in
            • 100:30 - 101:00 cluster mode, then it is fine, it will pick the driver node within the cluster. But if we are running our program in client mode, then the driver node will reside outside of the cluster; it can be your local machine, it can be any other VM, anything, but it will reside outside the cluster. Sorted, this is your Spark architecture. Come on, do not make it complicated, do not make it
            • 101:00 - 101:30 a big thing. Obviously it's a big thing, but not when you are actually understanding it; it was a big thing when it was launched. In terms of understanding, it should be simple. And now I know that you know this architecture, so let's see what we have in our next question. We have a question, and this is really, really important, and trust me,
            • 101:30 - 102:00 this is really important. So: you are working with a large Delta table. This is one hook, and you need to be very careful while approaching the question. So you are working with a large Delta table that is frequently updated by multiple users, maybe you, maybe some other developers or other pipelines. The data is stored in the form of partitions, and sometimes updates can
            • 102:00 - 102:30 cause inconsistent reads due to concurrent transactions. Now, how would you ensure ACID compliance and avoid data corruption, in PySpark as well? Boom. No data frame, nothing; you have to design the solution and implement it as well. So it's a mix of designing the solution and then implementing it.
            • 102:30 - 103:00 What will you do? First of all, we have a large Delta table, so the first part of the question is how we will ensure ACID transactions. In order to ensure ACID transactions we rely on the Delta log. The Delta log takes care of the ACID guarantees; ACID stands for atomicity, consistency, isolation, durability. When we write the data in Delta format, it automatically creates that Delta log for us. Sorted.
            • 103:00 - 103:30 The second part is the implementation part. It says we need to avoid data corruption. So what do we actually need to do? We need to apply an upsert condition; update plus insert is called upsert. How can we apply it? To design the solution, we can say df is our new data, so I can simply write a comment, # new data, and df is my new data,
            • 103:30 - 104:00 which can be anything, like spark.read.format, and this can be in any format, let's say Parquet format. This is just a definition. Then we can say .load and then the path. Perfect. Now we need to make sure that our destination table is not corrupted. What does that mean? Let's say I'm inserting some records
            • 104:00 - 104:30 for IDs 1, 2, 3, but after 10 minutes someone else inserts some data for IDs 3, 4, 5. Now ID 3 will be duplicated, because there are so many inserts, and this can make our data corrupted. So how do we overcome this? We will apply an upsert condition, or let's say a merge condition. So what do we need to do?
            • 104:30 - 105:00 First of all we will import the Delta tables module. We could say import delta.tables, or, because I like importing that way, from delta.tables import DeltaTable. Then I will start working on my implementation part. What I will do is create an object, a DeltaTable
            • 105:00 - 105:30 object, on top of the data which is stored there in Delta format. I will call it dlt; this is just a variable. I'll simply say DeltaTable.forPath, and then I will pass the Spark session and the path. Once I have passed the path, my object is created. This is also called
            • 105:30 - 106:00 the destination, or the old data, or let's say our target. Now I will apply the merge condition, so I could simply say dlt.merge. But in the real world we always try to provide an alias to this object, because it is easier for someone who is looking at the code for the first time; that person can simply say, okay, this is our source, this is our target. So I'll simply say .alias and name it trg, my target.
            • 106:00 - 106:30 Then I will say .merge. With which data frame do I need to apply the merge? It is df, and I will provide an alias to this as well: this is my new data, so this is my source. Then what will be the condition? It can be, let's say, source.id == trg.id. Simple. Once it is done, I
            • 106:30 - 107:00 will simply go to the next line and say whenNotMatchedInsertAll, because I want to insert all the data, all the columns, when the ID is not matching. That is fine, but when matched, whenMatchedUpdateAll, because when the ID is matching, do not insert the data, just update the information. That is called
            • 107:00 - 107:30 upsert. Then I will simply say .execute(). This is the implementation that I will be doing, and it avoids the duplicate IDs and the inconsistency we talked about. So this was regarding upsert, or merge, and where you can apply it. By the way, this will be the solution for slowly changing dimensions as well; don't worry, we will cover that too. So this was all about your upsert.
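The merge described above can be sketched as a small helper. This is a hedged sketch rather than the code from the video: the function name upsert_into_delta, its parameters, and both paths are hypothetical, and it assumes the delta-spark package is installed and the Spark session is already configured for Delta Lake.

```python
def upsert_into_delta(spark, target_path, source_path, key="id"):
    """Merge new Parquet data into an existing Delta table (paths are hypothetical)."""
    # Imported here so the helper can be defined even where delta-spark is missing.
    from delta.tables import DeltaTable

    # New data: the source of the merge.
    new_df = spark.read.format("parquet").load(source_path)

    # Existing Delta table: the target of the merge.
    target = DeltaTable.forPath(spark, target_path)

    (
        target.alias("trg")
        .merge(new_df.alias("source"), f"source.{key} = trg.{key}")
        .whenMatchedUpdateAll()      # key already exists: update the row
        .whenNotMatchedInsertAll()   # key is new: insert the row
        .execute()
    )
```

A call would look like upsert_into_delta(spark, "/path/to/target", "/path/to/new_data"). Two practical caveats: a merge fails if several source rows match the same target row, so the new data usually needs to be de-duplicated on the key first, and Delta Lake uses optimistic concurrency control, so a commit that conflicts with another writer can raise a concurrency exception and may need a retry.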
            • 107:30 - 108:00 It is also called the merge condition. Okay, so what do we have in our next question? It says: you need to process a large data set in Parquet format and ensure that all columns have the right schema, almost. Why almost? You will get to know. How would you do this? First of all, we know that we can manually assign the schema. But if we do not want to manually assign any schema, if we want Spark to assign
            • 108:00 - 108:30 a schema based on the best possible types for the columns, then we can use an option called inferSchema, with which we do not need to do the hard work of defining the schema. Keep in mind that Parquet files already store their schema, so Spark reads it from the files; inferSchema is the option that matters for formats like CSV. So how can we do that? It is very simple. I can load the data: df equals spark.read.format, and the format is parquet. Then I will simply say .option, and then I will
            • 108:30 - 109:00 say inferSchema, and I will set it to true. Then obviously the location from where I need to read the data, and it will be a path variable, or just a location. That's it. Oops, I just need to add a comma, perfect. So this particular option, inferSchema, will make sure that Spark provides the best possible schema for your data set, and you do not need to define it yourself.
            • 109:00 - 109:30 You do not need to define any schema manually. Okay, got it. So here we have our next question, which says: you are reading a CSV file and need to handle corrupt records gracefully by skipping them. Let's say you are ingesting some data from a website, or from APIs, and you are getting the data in the form of CSVs. Then you figure out that some of the records are corrupted, maybe a schema mismatch or, let's
            • 109:30 - 110:00 say, some additional columns were added. But according to the stakeholders' guidance, you need to drop those corrupted records, because they can cause data discrepancies, and the data is already available in the staging layer, so you do not need to worry that it will be lost. This is the specific condition where you need to drop them. So how can you do that? Let me show you. I will simply read this data.
            • 110:00 - 110:30 df equals spark.read.format, and the format is csv. Then I simply need to say .option, and I will pick the reading mode. Yes, we do have reading modes as well. So I will say mode, and I think we can simply pick DROPMALFORMED. Do I need to put it in double quotes? Yeah, I think we need to. So mode equals DROPMALFORMED.
            • 110:30 - 111:00 It will drop those records, and then I can simply define the location: .load, and then the staging location. Simple. We have two more modes for this as well, PERMISSIVE (the default) and FAILFAST, but this mode will make sure that all the records which are corrupted are dropped. The stakeholders do not want to keep those records, and yes, they just want to drop them.
            • 111:00 - 111:30 For that, you can use the DROPMALFORMED mode; it is very handy. So let's see the next question: what is the difference between RDDs, DataFrames, and Datasets? Trust me, this is one of the most basic questions that we have, and it is also a question where you can feel stuck. So what is the difference between RDD, DataFrame, and Dataset, and what is the best way to answer it? Keep it very simple and
            • 111:30 - 112:00 keep it detailed as well. Let's first talk about RDDs. We were working with RDDs before DataFrames and Datasets, and it's not that we are not working with RDDs now; we have just added some layers on top of RDDs so that we can work efficiently. Don't worry, I will explain everything. So basically, RDDs are the lowest-level data structures available in
            • 112:00 - 112:30 Spark. An RDD doesn't have any tabular structure, it doesn't have any schema; it's like a collection of objects, like a list, and that list is distributed among a number of machines, because we know that we distribute our data to multiple machines. And it is comparatively slow, because Spark cannot perform any query planning on it. What is query planning? Don't worry, I will discuss that question as well.
            • 112:30 - 113:00 So Spark cannot do any kind of automatic optimization, and it is slow, because it doesn't have any structure or schema, so we cannot do much. Then we have something called the DataFrame, which solved lots of the problems that came with working with RDDs. What did the DataFrame do? First of all, it has a structure, a tabular structure; it has a schema; you
            • 113:00 - 113:30 can perform query planning; you can do lots of stuff. Plus it is very easy to use, and it is available in almost all the APIs: Spark SQL, PySpark, Spark Scala, and R, so there are many options available. The DataFrame is very popular, and I think all data professionals have worked with data frames, maybe with pandas, maybe with
            • 113:30 - 114:00 machine learning models, and obviously in PySpark we work with DataFrames most of the time. Then why do we have the Dataset, and what is the difference between a Dataset and a DataFrame? Basically, a Dataset is like a DataFrame but with a few more features, related to type safety and functional programming. So basically the
            • 114:00 - 114:30 Dataset combines both worlds, RDDs and DataFrames. When you are working with a DataFrame you do not have the feature called type safety: you cannot find type errors at compile time, you will only get the error when you run the code. These features were missing from the DataFrame, so the Dataset solved these problems, and now you have type safety and functional programming on top of the DataFrame. But then why do we not use Datasets? Why do we use DataFrames?
            • 114:30 - 115:00 The thing is, the Dataset is not available in the Python API, that is PySpark; we can only use Datasets with Spark Scala or Java. And the DataFrame is easy to use and easy to code, you can use SQL-like APIs, and if you are already familiar with the pandas library, it is much easier to work with PySpark than with Spark Scala. Obviously, if you are a developer, you need some leverage while writing the code,
            • 115:00 - 115:30 right? How easily can you write the code, how easily can you make changes? Python is the OG, and the Dataset is not available in Python, so that is one of the major reasons that we do not use Datasets. It's not a big difference between a DataFrame and a Dataset, but if you want type safety then you can definitely use the Dataset. So this was the difference between RDDs, DataFrames, and Datasets. Not a big deal, but you should know this.
            • 115:30 - 116:00 Let's discuss the next question. Our next question is: what is query optimization? This can be asked in many ways, because in query optimization we actually want to discuss the entire flow. We should know how the code that we are writing is actually executed on multiple machines, and what the flow is in between. I wrote the code, and then it is
            • 116:00 - 116:30 being executed on multiple machines; I know this, but what exactly happens between those two things? This can be asked in many ways, and trust me, after covering this part of the video you can answer not only this question but any question revolving around this topic. So now let's discuss what query optimization is. Let's say you write the code in PySpark, in your notebook. This
            • 116:30 - 117:00 is your code, and this is your cluster with two worker nodes. But I want to know what actually happens between the two areas. This is the input and this is, not exactly the output, but the code going to multiple machines; what actually happens between these two stages? Let's talk about this, and this is also what query optimization is about: how Spark
            • 117:00 - 117:30 optimizes our query. That's why we use DataFrames, because we get query optimization, logical planning, physical planning. "Ansh, just tell us about those things instead of using the names again and again." Okay, sorry. So basically, this is your code. First of all, this code will be converted into a logical plan. That sounds amazing, but what is that logical
            • 117:30 - 118:00 plan? What is a logical plan and how do we get it? Let's say in your code you have mentioned many transformations: select, where, join, everything. Let's say you have mentioned 10 transformations. Obviously we know that we need to perform all the transformations, but we do not actually need to perform all
            • 118:00 - 118:30 those transformations in the order you have specified. For example, you perform a join, and after the join you use a where clause, or then you perform a select. What Spark will do is smartly decide the order: what should be performed before and what should be performed after. Let's say you are applying a join and then you are
            • 118:30 - 119:00 using select after that, and you are selecting just two columns. It will say, hey, I can select the columns first to reduce the size of the data, and then I can perform the join. So it will smartly decide the order of execution, and obviously that makes your query more efficient and faster to run. So this will create a logical plan, and it will
            • 119:00 - 119:30 be done by the Catalyst optimizer. Once your logical plan is ready, we have something else, and that is called the physical plan. What's inside the physical plan? Once your logical plan is ready, we know, for example, that we need to apply a join; that is fine, but now Spark will compare the cost of that join, or of any transformation. Really? Yeah, it will
            • 119:30 - 120:00 compare the cost against a cost model, and it will pick the least expensive option, like the type of join. For example, we have many join strategies, like sort-merge join and shuffle hash join, so it will pick the best one according to the situation. Do you need to take care of it? No, we have hired Spark for it, and that's why we are paying
            • 120:00 - 120:30 for the clusters. So the logical plan will be converted into a physical plan, and once Spark decides the best plan, the least expensive one, your physical plan will be converted into RDDs, resilient distributed datasets, which we talked about in the previous question. Yes, all your transformations will be converted into RDD operations, because these will go to the nodes and will be executed in
            • 120:30 - 121:00 parallel. So at the end, all your DataFrame or Dataset code is converted into RDDs so that it can be distributed among the nodes. I hope now you know the exact query optimization flow: how it starts and how it goes to the cluster. It can come as any question, but now you know what is inside the logical plan.
            • 121:00 - 121:30 You know what is inside the logical plan, what is inside the physical plan, and what happens after the physical plan. The question can take any form: hey, what comes before the physical plan? What comes after the physical plan? What is the first plan Spark creates, physical or logical? Now you know how to answer all of those. Perfect, now let's discuss the next question. The next question is: tell us about Spark
            • 121:30 - 122:00 session. Why have I included this question? You will get to know. If I want to answer this question, I can simply say, hey, Spark session is the entry point for Spark. "But hold on, Ansh Lamba, in one of the previous questions you said that Spark context is the starting point for Spark, and now in the new question you are
            • 122:00 - 122:30 using the same answer. Why are you tricking us?" No. Yes, Spark session is the entry point for Spark, and yes, Spark context is also an entry point for Spark, but let's clear up the confusion: Spark session is the newer entry point, and Spark context was the older one that we were using before Spark 2.0. With Spark 2.0 they
            • 122:30 - 123:00 introduced a new entry point called Spark session, and it is available in all the versions released after Spark 2.0. Why did they launch it, and what's new in this entry point? The thing is, before Spark 2.0 we were using three different entry points: one was Spark context, the second was SQL context, and the third was Hive context. Now in the
            • 123:00 - 123:30 Spark session we have the functionality of all those entry points. Spark session includes everything that Spark context does, plus the support for SQL operations. So we do not need to explicitly create the Spark context, because Spark session manages it internally, along with the other two, SQL context and Hive context. Simple and sorted. We have also seen people still talking about using Spark context.
            • 123:30 - 124:00 so the answer is yes we still can use spark context in order to support Legacy applications that people or companies have created but it is recommended after spark 2.0 that we should use spark session it is compatible and it is very very very good in order to just work with spark because you do not need to manage three different entry points right right you can simply use spk session and it's done it's done it's done got it so yeah answer is same but you should know the reason like that was
            • 124:00 - 124:30 the entry point before Spark 2.0, but this is the entry point from Spark 2.0 onwards, because it brings all the entry points together under one hood. Simple. Okay, so let's discuss the next question, one of the most important ones: wide transformations and narrow transformations. Why is it so important? Because Spark is mainly used for transformations. I'm not saying that it is only used for transformations,
            • 124:30 - 125:00 but it is mainly used for transformations, so we should know this. First of all, do you know that we have two kinds of transformations? If yes, good. If no, that's also fine, now you know: we have two kinds of transformations, wide and narrow. What is the difference between the two? Just simplify this topic, man. We need to understand it instead of just having the definition, because we can look up the definition on Google. That's why
            • 125:00 - 125:30 we are watching a video: just explain this to us, briefly of course, because we need to cover all the other questions as well. Okay, let's take this. First of all, let's say we have two machines, two nodes. Then let's take wide transformation first [the speaker means narrow and corrects this a moment later]. So in this transformation we do not need to shuffle data among nodes, data will not be shuffled at all. Can you give an example? Yes. So if,
            • 125:30 - 126:00 let's say, in your node one you have data. You had a DataFrame, then it was distributed among a cluster of machines, and these are your two nodes. Let's say you have an ID column, and on one node you have IDs such as 1, 2, 5, 9, and on the other you have 3, 6, 7. Simple. You applied a basic transformation, and let's
            • 126:00 - 126:30 pick an example of narrow transformation. Oh, am I discussing wide transformation? No, no, let's discuss narrow transformation. I had just labeled the box as wide transformation, so we are actually discussing narrow transformation. Let me write it here: narrow transformations. Okay, so in a narrow transformation, what will happen? This machine doesn't need to shuffle data with a different machine. It is independent. It doesn't need to bother about
            • 126:30 - 127:00 data sitting on the other machine. Why and how? Let's talk about a basic and very popular transformation that we use, which is filter. Let's say I want to filter the data for ID greater than three. This is my transformation, a basic transformation, man, come on. You have just applied this transformation on your DataFrame. Now, in order to return the result, does this machine need to shuffle data with this
            • 127:00 - 127:30 machine? Obviously no, bro. This one will simply give results such as 5 and 9, and this machine will simply say 6 and 7. This machine doesn't need to shuffle the data with the other machine, with the other node. So there's no data shuffling happening between the nodes. So this is a very simple, you can
            • 127:30 - 128:00 say, transformation that is a good example of a narrow transformation. Okay, sorted. Now if we talk about wide transformation, what will happen? Let me create an example for wide transformation so that you can understand everything. So now we're discussing wide transformation. Let's say we have IDs such as 1, 2, then 3, then 4, then
            • 128:00 - 128:30 1. That's not a primary key, don't worry, it is just an ID column. Then on the other machine we have 1, then 2, then 6, then 7. A very good example of a wide transformation is groupBy. If you want to group by on your data, if you want to group by on the ID column, then in that particular scenario just tell
            • 128:30 - 129:00 me one thing: you have ID 1 here, you have ID 1 here, but you have ID 1 here as well, in this machine. So in order to give you the final output, you need to shuffle the data with this machine as well. Only then will you be able to give the right output. Let's say you have ID and then you have a column such as price, and you are applying an aggregation function on price. So in order to give the right output,
            • 129:00 - 129:30 you need to apply the group by on ID equal to 1, which will pick the data from this machine, from this machine, and from this machine as well. So it needs to shuffle the data. In that particular scenario, shuffling will be there between the machines, between the nodes, so it will be a wide transformation. Simple, sorted. So this was all about wide transformation versus narrow
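The two-node whiteboard example can be mimicked in plain Python. This is a toy model, not Spark itself: each list stands for the data on one node, the price values are made up for illustration, and routing rows by `key % 2` is only a stand-in for Spark's hash partitioning.

```python
from collections import defaultdict

# Narrow: each node filters its own rows, nothing moves between nodes.
node_1 = [1, 2, 5, 9]
node_2 = [3, 6, 7]

def narrow_filter(partition, threshold):
    return [x for x in partition if x > threshold]

filtered = [narrow_filter(p, 3) for p in (node_1, node_2)]
print(filtered)   # each node answers independently

# Wide: the same id lives on both nodes, so rows must be moved (shuffled)
# until all rows for a key sit together, and only then aggregated.
node_a = [(1, 10), (2, 20), (3, 30), (4, 40), (1, 5)]   # (id, price), hypothetical
node_b = [(1, 15), (2, 25), (6, 60), (7, 70)]

def shuffle_by_key(partitions, num_targets):
    targets = [[] for _ in range(num_targets)]
    for part in partitions:
        for key, value in part:
            targets[key % num_targets].append((key, value))
    return targets

def sum_by_key(partition):
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)

shuffled = shuffle_by_key([node_a, node_b], 2)
grouped = [sum_by_key(p) for p in shuffled]
print(grouped)
```

The filter never looks outside its own list, while the group by cannot produce the total for ID 1 until the rows from both nodes have been brought together.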
            • 129:30 - 130:00 transformation. Simple and sorted. Okay, so let's discuss our next question, and this one is very simple: what is the use of coalesce and repartition, and why do we need these functions? These are very handy. So let's quickly discuss coalesce. The coalesce function is used to always reduce the number of partitions, always reduce. And coalesce is closely linked to the OPTIMIZE command that
            • 130:00 - 130:30 we will be covering questions on as well, don't worry. That is why coalesce is very important, because it is the rule of thumb that whenever we are dealing with big data, it is always easier to read data from fewer, bigger partitions rather than from many small partitions. It is a rule of thumb. And who has created this rule, Ansh Lamba? No, no, this is genuine, you can just Google it. Don't trust me, just
            • 130:30 - 131:00 Google it. Or do not even Google, just deal with that data, deal with so many small partitions, and then you will come here again and comment, hey Ansh, you were right. Okay, so when we say coalesce, we actually want to combine our partitions. Let's say I have partitions one, two, three, four, five. I don't want five partitions. I can simply say coalesce to one or two; I can simply convert, or let's say
            • 131:00 - 131:30 merge, those partitions into fewer partitions. Then one follow-up question, which is very important: does it require shuffling? The answer is no, it doesn't require a full shuffle. Okay, simple. Now if we talk about repartition: repartition can be used to increase or decrease the number of partitions. Let's say you have two big partitions, but you do not actually want to deal with two big partitions because of
            • 131:30 - 132:00 memory issues; you actually want to distribute the data among machines, among nodes. In that scenario you can simply say df.repartition and give any number, 3, 4, 5, 6, and it will create that many partitions. Simple. One follow-up question: does it require shuffling? The answer is yes, it will require shuffling, because obviously it needs to break this data into so many partitions, so it shuffles the data. Okay, all simple, right? Okay, let's
            • 132:00 - 132:30 discuss the next question: what is the difference between cache and persist, and when do we use cache and persist? First of all, tell us when to use cache and persist, then you can talk about the difference. Makes sense. So we have these two operations, these two functions, so that we can deal with data storage. Just give us the right answer, an answer on point. Okay, so
            • 132:30 - 133:00 basically, let's say we have a DF. This is my DF. Oh man, look at my handwriting, wait, let me draw it again. I have such beautiful handwriting. Okay, so DF: let's say we have a DataFrame, and I want to store the intermediate results of this DataFrame so that I can use them multiple times. Let's say this DF is being used many times, again and again,
            • 133:00 - 133:30 so instead of creating this DataFrame again and again from scratch, we can store the result of this DataFrame in memory or on disk. We will talk about all the options available, but we actually need to store it because we do not want to recompute it again and again. So, answering this question first: when we need to use the result, or let's say the intermediate results, multiple times, then we can store
            • 133:30 - 134:00 that result in memory or on disk, and we have functions such as cache and persist. Now, Ansh Lamba, just tell us the difference between cache and persist. Okay, let's talk about that. Basically, cache is also a persist function. Really? Then why do we have a different name for it, what is the difference? So basically, with persist, when we simply write, let's say, df.persist,
            • 134:00 - 134:30 and then we pass the parameter, or let's say argument, called the storage level. Let me type it in blue so that it is highlighted: StorageLevel.MEMORY_AND_DISK. When you write this, it becomes cache. Okay, this is equivalent to
            • 134:30 - 135:00 cache. When you simply write df.cache, it is equivalent to this one. This particular argument is so commonly used that a dedicated function was created for it. So instead of writing this again and again, we can simply write df.cache. Simple, no arguments, nothing. But in some scenarios we have to
            • 135:00 - 135:30 use some other options as well, such as MEMORY_ONLY, then we have DISK_ONLY, then we have MEMORY_AND_DISK, just ignore my handwriting, then we have memory and disk serialized. So we have all these options
            • 135:30 - 136:00 available if we want to store our intermediate results, and all these options can be accessed using the persist function. But as I just mentioned, this particular argument is commonly used, so there is a dedicated function for it called cache. That doesn't mean we do not work with the other options. We do, but rarely. Okay, so what does memory and disk actually mean? It will first try
            • 136:00 - 136:30 to store the intermediate results in memory. If your DataFrame is really big and cannot fit in memory, it will write the rest of the data to disk as well. So it will first try to write the data to memory, then it will write the data to disk. That is what cache does. And memory only, as the name suggests, will simply try to fit the data into memory, and disk only will simply write all the
            • 136:30 - 137:00 data to disk. And with memory and disk serialized, it will first try to fit the data in memory, then it will go to disk. So these are some of the options available within persist. So basically cache is a special kind of shorthand for the persist function, with an argument that we do not need to pass again and again. We have just created a special function, which is nothing but persist with the memory and disk argument.
            • 137:00 - 137:30 That's it, simple. Okay, so let's discuss our next question: what is the importance of partitions in PySpark? This is a very basic question. I know you know the answer, but I still want to cover it so that you can revise this concept as well. So basically, whenever we say partitions in PySpark, that means we are asking Spark to perform parallelism. It's also called MPP, massively parallel
            • 137:30 - 138:00 processing. So we are distributing, let's say this is the data, we are distributing this massive, huge dataset into pieces, and we are distributing this data to machines, to nodes, to process it in parallel and in
            • 138:00 - 138:30 pieces, in chunks. So this term is really important, MPP, just use it whenever you're answering this question. I just wanted to highlight this technical term: MPP, massively parallel processing. That is the major advantage of partitions in PySpark. Sorted. So now we are discussing question number 22. The question says: you have a dataset containing the names of employees and their departments. You need to find the department with the most
            • 138:30 - 139:00 employees. Makes sense, and this is the DataFrame. So how can we do that? Just a hint: you can simply perform an aggregation based on department, find the number of employees in each department, and then sort the data. That is one way. The second way is to perform the aggregation on department, count the employees in each department, and then apply dense
            • 139:00 - 139:30 rank, because that way you will get a ranking based on the number of employees. It's up to you. But if we go by the solution according to the question, then we do not actually need to use a window function, because it is just saying we need to find the department with the most employees. So it's fine. Let's quickly do that. How can we do this? It is very simple, I think. You will simply say df = df.groupBy, because obviously
            • 139:30 - 140:00 we need to perform a group by, on which column? On department. And then the aggregation will be a count of the employee name. So this will basically give me the count, and if I want to provide an alias, then I will simply say alias and call it total_employees. Okay, so once it is done, then I
            • 140:00 - 140:30 can simply say sort, sorting on the column total_employees. Perfect. And then we can say ascending equals False, and then let me say df.display. Perfect. So it will just give me, perfect,
            • 140:30 - 141:00 this is the output that I was expecting. HR and Finance have the same number of employees, so they will get the same ranking, and then Engineering will be third. So we can say, hey, this is your order, and these are all your departments based on the total number of employees in each department. Simple and sorted. Let's see our next question. So here is our next
            • 141:00 - 141:30 question, which says: while processing sales data, you need to classify each transaction as either high or low based on its amount. How would you achieve this using a when condition? In this question they have already given you the hint, but even if they do not give you the hint of using a when condition, how can you actually do that? If you have worked with SQL, you will already be familiar with CASE WHEN statements. So we need to
            • 141:30 - 142:00 use CASE WHEN, and in PySpark we call it a when/otherwise statement. So you need to use that approach. It is very simple, let me show you exactly what you need to do. The structure is the same: you need to create the cases, write the condition within when, and then describe what you need to see as output if that condition is met. So it is very simple.
            • 142:00 - 142:30 What I will do: I will simply say df = df.withColumn, and I will name this new column, let's say, to classify the transaction, I will simply say price_cat, price category, high or low, based on the condition. And we can pick any condition. Let's say if the price is greater
            • 142:30 - 143:00 than 50 then we need to say it is high, otherwise it is low. Okay, so this is my column name, price_cat, and now comes the main thing, how we can do it. I will simply say when, and then I will write the condition. So I'll pick the column sales, greater than 50, then I want to say it's high.
            • 143:00 - 143:30 Right. If that is not the case, then I will simply say .otherwise, and then I will say it's low. Simple. And then I can simply say df.display. So this is my DataFrame that is coming right now. Okay, perfect, that's what I wanted to see. We have created a new column in which we have categorized the prices based on the value and the condition, which we
            • 143:30 - 144:00 can actually see was provided in the question. So now, if you ever see any question related to categorization based on some conditions, you should always use the when/otherwise clause, and within it you can simply define the condition like this, let me zoom in. Just make sure that you are using the .otherwise, because it is similar to the CASE WHEN statement; in that as well we use CASE, WHEN, THEN, and we
            • 144:00 - 144:30 use ELSE. It is exactly similar to that. That's why we say fundamentals are really important, because when we change technologies we do not change the fundamentals, we actually change the implementation. So if you know the fundamentals, then you are good. Okay, so this was our 23rd question, I guess. So let's see what we have in the 24th question. So in this
            • 144:30 - 145:00 question, let me tell you, this is all about your IQ. It has very little to do with PySpark; it is all about your IQ. So what do we have in this question? While analyzing a large dataset, you need to create a new column that holds a timestamp of when the record was processed. So before continuing with this question, I
            • 145:00 - 145:30 would simply like to break this question down. Let's say you have a source, this one, this is your source. Now maybe there will be a column in which a timestamp is there. This timestamp can be from the source, in which you will have, let's say, when the record was actually added in that source.
            • 145:30 - 146:00 Right, but we as data engineers need to add a column in which we tell, hey, this record was processed at this time. Then, how would you implement this? And then, what can be the best use case? This question is important, and the second part is even more important: what is the best use case for this, what can be the real-time scenario? Or they can ask you, okay, you know how to
            • 146:00 - 146:30 implement this, but when did you implement it, under what scenario, what was the use case? So this is tricky. Let's solve the first part, then I will tell you the second part. So how can we do this? It is very simple, we have already worked with this kind of function. I will simply say df = df.withColumn, and I will name the column process_time. I will simply use the function called current_timestamp.
            • 146:30 - 147:00 Perfect, because they want a detailed time. And then I can simply say df.display. Let me run it. Okay, perfect, now you can see I have the detailed time. If I expand it, it is showing 2025, 11th of Jan, and then 7:23 and so on, with seconds and milliseconds. So now we know at what time these
            • 147:00 - 147:30 records were actually processed. What can be the best use case? That is the most important part of this question. Let's give the straight answer: whenever you are dealing with, let's say, slowly changing dimensions, you might have used this kind of function, because in that scenario you actually need to tell at what time this record was
            • 147:30 - 148:00 changed, at what time this record was modified, at what time this record was created. At that time you create two different columns, create time and modify time. So in that particular scenario you can make use of this function, and you can say, hey, I used this function and applied these steps when I was building a data warehouse, when I was dealing with slowly changing dimensions. You can just say it like this. Okay, so in this
            • 148:00 - 148:30 question, the 25th question, we have: you need to register this PySpark DataFrame, which one, this one, as a temporary SQL object and run a query on it. How would you achieve this? So basically they want to convert this DF into a SQL object because they want to run a SQL query on top of it. Let's say they do not want to use the Python API, or whatever the reason may be, they want to run a SQL query on it. So how
            • 148:30 - 149:00 can we do that? In order to achieve this, we can simply create something called a temp view in PySpark using this DF, and then we can run our SQL queries normally. Let me show you how. You can simply say df.createOrReplaceTemp, I got the popup, okay. Oh, by the way, the next question that just came to my mind is really important, I will share it, don't worry. df.createOrReplaceTempView, and then I can simply name it,
            • 149:00 - 149:30 let's say, temp_sql_df, and it should be done. Let me run this first. So now our SQL object is ready, you can see temp_sql_df. Now I can use this as my temporary table, temporary view, whatever you want to call it. And how can I do that? I have two ways. One way is using
            • 149:30 - 150:00 Python as my language: I can simply say spark.sql, and then I can simply say SELECT * FROM temp_sql_df. Simple. This is just one way. Okay, this is the output. Obviously this is not the output that you want to see, and if you want to print it, you can
            • 150:00 - 150:30 print it as a DataFrame. But if you want to apply pure SQL code and do not even want to use spark.sql, simply say %sql, and then you can write everything that you would write in a SQL workbench. So I will simply say SELECT * FROM temp_sql_df. Perfect, I didn't use any other syntax. I will simply run it, and then I will get the result that I want.
            • 150:30 - 151:00 See, everything is done. They can run anything. It's not just about the SELECT statement, you can also use WHERE along with it; you can use pure SQL. Let's say WHERE product_id equals product one, and let me run this. See, this is the normal output, like the normal SQL that you use in your SQL workbench. So you can convert any DataFrame into a temporary view, and
            • 151:00 - 151:30 then you can easily run any SQL code on top of your temporary object. Simple. Okay, so in the next question we have: you need to register this PySpark DataFrame as a temporary SQL object and run a query on it. Ansh, we have just solved this question! Wait, wait, we have a special requirement this time: from different notebooks as well. What does it mean? So
            • 151:30 - 152:00 it means that when I create this temp view, this one, and try to query it from different notebooks, I cannot. Remember, all the notebooks are attached to the same cluster. Even so, in that particular scenario I cannot use a temp view. Then what's the solution? The solution is you can simply say df
            • 152:00 - 152:30 .createOrReplace, and this time you will say GlobalTempView, so createOrReplaceGlobalTempView, and then you can name it global_view. Once I create this, you can actually access this global view from different notebooks attached to this cluster, whereas in the case of this one, the temp view, you cannot use it from different notebooks even
            • 152:30 - 153:00 if they are attached to the same cluster. That's why we have a global temp view, which can be used globally, and it will only be reset when this cluster is terminated. Before that, it can be accessed through any notebook attached to this cluster. So if I create this, okay, it is created. Now I can simply query it using the same thing, SELECT *. See, these are small things that can confuse you at
            • 153:00 - 153:30 that time, so you should be well prepared for this. Now if I want to query global_view, I cannot just write global_view like this. If I run this, it will give me an error that it cannot be found. In order to query the global view, we have to prefix it with a keyword called global_temp, as in global_temp.global_view. Only then can we query our global temp views. Just remember this
            • 153:30 - 154:00 thing: you have to use this global_temp keyword in order to query this global view. Okay, sorted. So this question is your go-to question if you want to deal with nested columns, and I think in today's world we get almost all the data in the form of nested structures. At least you will get one column in the form of a nested structure, just trust me. So it says: you need to query data from a
            • 154:00 - 154:30 PySpark DataFrame using SQL, okay, that's fine, but the data includes a nested structure. How would you flatten the data for easier querying? Do you know what it is saying? So this is the DataFrame and you need to use SQL. You can just create a DataFrame and then do it, but it is saying that it's your responsibility to flatten this data, so that if, let's say, a data analyst wants to query this data and
            • 154:30 - 155:00 that person doesn't know how to apply the flatten transformations, and they do not even want to mess with this data, it's your responsibility to apply the flatten transformation and make this data flat. Simple. How can you do that? In order to flatten this column, we will basically create two columns out of this one, because we have two sub-columns, price and quantity. So in order to achieve this we will simply say
            • 155:00 - 155:30 df = df.withColumn, or let's say we just need to select it, so we will simply say df.select. Then we can say product_info, and then I will use a dot, because now I will use the key within the column: price. Simple. Then I will simply say product_info dot
            • 155:30 - 156:00 quantity. Simple. And there's one more column called product_id, so let's include that as well. Then I can simply say .display. Perfect. Or if we want to create a view on top of it, we can simply say .createOrReplaceTempView and call it flat_view. Perfect. And we can just
            • 156:00 - 156:30 run this. Perfect. Then I can simply query this with SQL and SELECT * from this view, and the view name is flat_view. Perfect. So now I can query this, and as you can see, this data is now broken into different columns. Simple. It was simple, but yeah, it is very
            • 156:30 - 157:00 handy, and this is the best way to answer this question. Even if they are not asking you to create a view, you can say that you can also create a temp view, and it will show that you know how to deal with these scenarios. Okay, sorted. So in this question it is saying: while reading data from Parquet, you need to optimize performance by partitioning the data based on a column, and it can be any
            • 157:00 - 157:30 column, let's say a column called category. How would you implement this? So it is saying that you have data in Parquet format, and this is a big dataset, but at the time of writing it you should not write the whole data in one folder. You actually need to create partitions based on a column called category. Let's say this is category food, this is category clothes, this is category
            • 157:30 - 158:00 electronics. So you need to create the folders and then put the data in them. Does it mean that you need to give three different locations? No, bro. Let me remind you that we have a function called partitionBy. So I will simply say df.write.format, let's say I want to write the data in Parquet format, then I will say .mode, and the mode can be append. Then the main thing: it's
            • 158:00 - 158:30 partitionBy, and then I can define that column, let's say the column name is category. Simple. And obviously then .save and the location. So what will it do? It will create three different folders: food, clothes, and what was the third one? Electronics, yeah. So it will create different folders within this location. Let's say I have a folder
            • 158:30 - 159:00 called data. Within that data folder we will have three different folders, and within those folders we will have our Parquet files. It will create the partitions automatically. And why do we need to create partitions? Because it is much more efficient to read the data: a query that filters on category only has to read the matching folder, so it promotes query optimization and makes the data much easier to read. Okay, so this is all
            • 159:00 - 159:30 about partitioning. Okay, so we have this question, and this is the question in which you will be tested on skills that can save a lot of money for the company. How? Because it is saying: you are working with a large dataset in Parquet format and need to ensure that the data is written in an optimized manner with compression. Because obviously whenever
            • 159:30 - 160:00 we compress the data we are saving a lot of money, because we pay less for storage, and when we have big data we always compress the data. But how can we compress the data, and what is the best way to do that? So basically, in the data engineering world, when we are dealing with the Parquet file format, we use a compression called, let me show you. First of all I will simply say df.write.format, so I simply say
            • 160:00 - 160:30 parquet, oops, then I will simply say .option, and the option will be compression. And what will the compression be? The compression will be snappy. Snappy compression is one of the most popular compression codecs that we have, and it offers a good balance between query performance and storage. So this is very good whenever you are working with the Parquet file format. Okay, so we will be using Snappy compression, and this is the way to use it, and now you know it as well. Okay, just keep
            • 160:30 - 161:00 this small thing in your mind, because this can be asked. If you're a data engineer, you definitely need to save a lot of cost for the company. Okay, sorted. So this is the question that can really make your selection in the company, I'm not kidding, because this is the need of the hour, and this is something that every recruiter, or let's say almost all
            • 161:00 - 161:30 the companies, are looking for. Let's read the question first. This question will be discussed in a little bit of detail, because it has to be: this is the turning point, and trust me, this question will be in your interviews. Plus, if you are able to answer this question the way I am doing it, bro, you'll be coming back to this video and commenting with so many hearts, just trust me. Let's actually discuss it. So it is
            • 161:30 - 162:00 saying your company uses a large scale data pipeline that reads data from Delta tables okay simple and processes data using complex aggregation everyone does that okay however performance is becoming an issue due to Growing data set size okay we all know know because data is exponentially growing and every company is looking to solve this issue and that's why they are hiring data Engineers who can definitely work with the large data sets and who are
            • 162:00 - 162:30 knowledgeable to actually know the issue and how to resolve that issue okay how would you optimize the performance of the pipeline okay so let's understand this question they are saying that the data is growing in their Delta tables from where they are reading maybe they just connecting their Delta tables in powerbi all say anything anything we need to optimize the performance of Delta tables at anyhow okay what's the best way to do that if we have partitions in the data obviously like we
            • 162:30 - 163:00 are assuming that they have already applied all the partitions or everything still they are having issues with their Delta tables what we can do this like what we can do in this scenario what so I will simply use a equal syntax why because it is much easier and compatible not compatible like it's same uh if I just compare the performance it's easier to apply simple thing and we can simply write optimize okay table name let's say our table name is table Delta then I will
            • 163:00 - 163:30 simply say Z order by hey an what you are writing wait wait wait we'll tell you wait Z order by and let's say I want to pick column as order date okay as my Z order by date order date yeah so I wrote this command and this can immediately improve the performance what's so magical in this statement and is it all about the answer
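A rough way to see why this command can help: Delta Lake keeps minimum and maximum column values per data file, and a filter can skip any file whose range cannot contain a match. The sketch below is a plain-Python toy model of that idea, not Delta's real metadata format. The file names and value ranges are made up for illustration, and the SQL string simply reuses the talk's example table and column names.

```python
# The command dictated above, as it would be passed to spark.sql(...)
# on a Delta-enabled session (table and column names are the talk's examples).
OPTIMIZE_SQL = "OPTIMIZE table_delta ZORDER BY (order_date)"


def files_to_scan(file_stats, upper_bound):
    """Return the files that may hold rows with id <= upper_bound.

    file_stats maps a (hypothetical) file name to its (min_id, max_id).
    A file can be skipped when even its smallest id is above the bound.
    """
    return [name for name, (min_id, _max_id) in file_stats.items()
            if min_id <= upper_bound]


# Unclustered layout: every file spans most of the id range, nothing is skipped.
unclustered = {"file_1": (1, 10), "file_2": (2, 9)}

# Layout clustered on id (what ZORDER BY aims for): ranges do not overlap.
clustered = {"file_1": (1, 5), "file_2": (6, 10)}

print(files_to_scan(unclustered, 5))  # ['file_1', 'file_2']
print(files_to_scan(clustered, 5))    # ['file_1']
```

With the clustered layout, a query such as WHERE id <= 5 only has to open one of the two files, which is the effect the rest of this answer describes.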
            • 163:30 - 164:00 No, no, because the interviewer will ask you so many cross-questions when you just say, okay, this is the command. What can the possible questions be? Let's discuss that, and before that let me tell you why I used two keywords: one is OPTIMIZE, the second is ZORDER. Let's start from the questions. The first question is: what is the OPTIMIZE command doing, what is the ZORDER BY command doing, and what actually happens
            • 164:00 - 164:30 when we use both commands together? This question can be asked in so many ways, trust me. So let me tell you what actually happens. Let's say this is your dataset and it has partitions (strictly speaking, small data files) such as one, two, three, four, five, six. This is just an example; six is not many, just six partitions. So this OPTIMIZE command will coalesce the
            • 164:30 - 165:00 partitions. Remember I mentioned coalescing the partitions with OPTIMIZE? It's time to discuss this, because now you have a very strong base, since you have solved so many questions. So these are six partitions: 1, 2, 3, 4, 5, 6. Yeah, perfect. So this will coalesce the partitions and make fewer partitions of a bigger size. Let's say this creates two partitions, one and two. Don't worry, we do not need to decide how many partitions we need; it
            • 165:00 - 165:30 will pick the best partition size automatically. Okay, so let's say this has created two partitions out of six, so now we know that it is much easier to read these two partitions. But why did we add the ZORDER BY command, what is it actually doing, and how can it improve the query performance even more? Because when we use the ZORDER BY command, we are saying that data within the partitions should be sorted, should be
            • 165:30 - 166:00 sorted (more precisely, clustered) by that column, let's say in ascending or descending order. And what is the benefit of that? Let's discuss it. Let's say I have data such as 1, 2, 3, 4, 5 (these are IDs), then 6, 7, 8, 9, 10, because the data is sorted, right? Okay, now one thing to note: when we are working with Delta
            • 166:00 - 166:30 tables, we get something called column statistics for the first 32 columns by default, so just make sure that the column you use in ZORDER BY is among those first 32 columns. So when we are reading these partitions (not us; when Spark is trying to read the partitions), it has the statistics. Let's say the minimum value is one and the maximum value is five; these are two of the statistics it has, right? Now this
            • 166:30 - 167:00 partition has a minimum of six and a maximum value of 10. Okay, perfect. Now let's say you run the query SELECT * FROM this table WHERE id is less than or equal to 5. Okay, sorted. Earlier, if we were not using the ZORDER BY command, it would have gone to both partitions. Why? Because the data is not sorted, so it needs to go
            • 167:00 - 167:30 inside each partition and fetch the data. But now the data is sorted, because you're using the ZORDER BY command, so now it knows that this partition's maximum value is five and this partition's minimum value is six. That means we do not need to read this partition. That's called data skipping. That can be the other question; I told you this question can be asked in so many ways. They can ask you, what is data skipping? So this is data skipping: it will skip this partition because this data is not
            • 167:30 - 168:00 required; this partition is not required in the query. Sorted? If not sorted, just rewatch this part, it's fine to rewatch, and make sure you are well prepared for this question, bro. Well prepared means well prepared. Okay, sorted. We have our next question: what are broadcast variables and why are they used? So basically, first of all, these are
            • 168:00 - 168:30 variables, okay? So why do we call them broadcast variables? Obviously you would have used a messenger application such as WhatsApp, where you get a feature called broadcast; it is a little bit similar to that. Broadcast variables are a way of sending read-only variables. Let's say I have a variable in which I have a list of country codes or anything. Instead of sending that list along with every task, I can just
            • 168:30 - 169:00 broadcast it, and it will send that data once; the data will be read-only, and it can be accessed by all the nodes. So we do not need to send the data with every task; we can broadcast it, and then that particular broadcast variable can be accessed by all the nodes. So why are they used? If you want to cut the networking overhead, you should use them, because
            • 169:00 - 169:30 we do not need to worry about repeated network transfers, and obviously they are used to optimize joins and lookups when we are using that particular variable multiple times. We do not need to worry about shipping that list; we can simply broadcast it. So these are broadcast variables: basically, read-only variables that are shared across multiple nodes. Okay, simple. Then we have the next question: what is the difference between df.show and df.collect?
            • 169:30 - 170:00 Okay, so as we know, we use this function, df.show, frequently, and it prints a default number of rows so that we can actually look at the data: 20 rows by default, and we can pass a different number. It will not show all the records; it will give us an overview of the data, or you can say a preview of the data. Then this df.collect function: it will return all the rows as a list of
            • 170:00 - 170:30 Row objects, and it will return all the rows, trust me, all the rows. So it will create a list of objects and bring the data back to the driver. But you should not use df.collect if you have a big dataset or big DataFrame, because it can give you an error called driver out of memory. So use it cautiously; we rarely use df.collect. It depends upon the situation, but most of the time, if our goal is just to look at the data, then obviously
            • 170:30 - 171:00 we use df.show or display instead of using df.collect. But yeah, there are some scenarios where we use df.collect, where we actually need to return the result in the form of a list so that we can use that list. Okay, this was all about df.show and df.collect. That's a small difference, but you should know it. Okay, sorted. So, what is lazy evaluation in PySpark? This is one of
            • 171:00 - 171:30 the building blocks of PySpark, or let's say the reason for the importance of PySpark nowadays, because... first of all, Ansh, just tell us what lazy evaluation is, we are more interested in that answer. Okay, so lazy evaluation basically means PySpark lazily evaluates your code. What does that mean? Obviously, if you want to explain this
            • 171:30 - 172:00 process, you have to use an example. So you can say: let's say you have a DataFrame, and you have performed some transformation such as df.filter, then in the next code cell you said df.sort; there are so many transformations, right? What do you think, will these transformations be executed right away when you run the cell? No. I know that when you run the cell it shows completed, but that doesn't mean that
            • 172:00 - 172:30 your transformations are actually performed. No, it will not perform anything. So what will it do? It will store all that information and create a logical execution plan, and now you know what that is, right? So it will create the logical execution plan for all the transformations, and it will only execute it when you trigger something
            • 172:30 - 173:00 called an action. We have some action operations, and only when we use one of them will all the transformations actually be executed; before that, they are just stored in the logical plan. That's it. And what are those actions? Basically, when we write df.show, that is an action; when we say df.
            • 173:00 - 173:30 collect, that is an action; that's why I discussed that question first. So you can use hundreds of transformations, hundreds of functions, filter, join, where, everything; if you do not trigger any action, no transformation will take place. Simple. So the moment you trigger an action, it will perform all the
            • 173:30 - 174:00 transformations. Obviously in the same order? Really? No. As we discussed, it will create the logical plan, so it will work out which transformation should be executed first and what comes next, and it will save that information in the logical plan. Okay, sorted, I think. So this was all about lazy evaluation. Let's see what we have in our next question. This is the hot question, because it is
            • 174:00 - 174:30 strongly aligned with modern scenarios: Delta Lake. So what are the advantages of Delta Lake over traditional file formats? Just give me three advantages, just three. Okay, so number one, you should say ACID transactions. ACID transactions:
            • 174:30 - 175:00 basically, Delta Lake makes sure that all four properties, atomicity, consistency, isolation, and durability, are respected, so it is well suited for transactional data. How does it manage atomicity, do you know? Okay, let me tell you. We know that every change is recorded in the Delta log of the Delta Lake. By the way,
            • 175:00 - 175:30 if you want to learn Delta Lake in detail, I have a detailed video on Delta Lake. You can definitely check that video and you will learn almost everything. And trust me, if you are aiming to crack the interviews, you cannot afford to sit in the interview before watching that Delta Lake video, because Delta Lake is the backbone of all the data engineering solutions right now that are aligned towards PySpark,
            • 175:30 - 176:00 Databricks, everything. So after watching this video, just click on the video coming up on the screen, because that video is genuinely recommended to ace the interviews; it will teach you everything and make you a pro Delta Lake developer, so I highly recommend it. I have covered everything there, how it enables atomicity and so on. Okay, so one thing is ACID transactions; the second thing is schema
            • 176:00 - 176:30 enforcement and schema evolution. By default, a plain data lake applies, let's say, schema on read, so it will apply the schema only when it is reading the data. But Delta Lake applies schema on write: it
            • 176:30 - 177:00 will check whether the schema of the target location matches the source or not; if not, it will throw an error, and that's what we want, so that nothing can break, right? So that is schema on write. But we can also evolve our schema in some scenarios; it's very simple. Again, if you want to learn everything regarding this, click on the video coming up on the screen and it will guide you through everything. Then what's the third thing?
            • 177:00 - 177:30 The third thing is optimization techniques. We can optimize big data; one technique we have already seen, the OPTIMIZE and ZORDER BY commands, and there are some other techniques such as liquid clustering. Okay, so these are the major advantages of Delta Lake over any traditional file format,
            • 177:30 - 178:00 and yes, over the Parquet file format as well. Okay, sorted. So this is the next question: what happens when a PySpark job runs out of memory? It is also called OOM, and the two most common OOM errors are driver out of memory and executor out of memory. So basically,
            • 178:00 - 178:30 driver out of memory occurs when the driver program, or driver node, is collecting more data than its memory can hold. When can that happen? When we say df.show, we are just showing a very small sample, right, just a few rows; that's why I mentioned it in the previous question. But when we say .collect, and the size of the DataFrame is so big
            • 178:30 - 179:00 that the driver's memory cannot handle it, then it will definitely say, hey, the driver is out of memory, you need to do something. So that's why, if your driver node is not big enough and cannot handle bigger datasets or DataFrames, you should not use df.collect. Okay, what is executor out of memory? By the way, this is a very lengthy topic, but in order to answer this question we can simply say that executor out of memory can happen due to many reasons. One of the major reasons, you can
            • 179:00 - 179:30 say, is skewed data. Let's say the data is highly skewed, and while applying, let's say, a join, Spark needs to pull the rows with the same key onto the same executor, right, and only then can it apply the join. And if your data is so skewed that it cannot hold all the rows for one joining key on the same executor, then it
            • 179:30 - 180:00 will simply say the executor is out of memory. What are the steps that we can take in order to mitigate these things? First of all, increase the driver's memory or increase the executor's memory. And in the case of executor out of memory, we can also try to make our data less skewed, and if that is not possible, then obviously we will be using AQE. What is AQE? We'll be discussing that as well, don't worry. And the previous way, or let's say the
            • 180:00 - 180:30 older way of dealing with skewness was salting. We can define a salt key with which we can make our data less skewed. How? It will add a kind of suffix after our joining key and break that one single value into multiple ones, so that it does not need to fit entirely into only one executor. Makes sense? So this was all about salting and how we can mitigate data skewness. Okay, so
            • 180:30 - 181:00 this is all about what happens when a PySpark job runs out of memory. Remember: driver out of memory and executor out of memory. Simple. Let's cover a question which revolves around AQE. This question is really important, because AQE is a new topic; not really new, but not really old either. So yeah, as compared to the
            • 181:00 - 181:30 other stuff in Spark, it is really new. What is AQE in PySpark and why is it useful? First of all, AQE is adaptive query execution. By the way, it takes care of a lot of the overhead that you do not actually need to worry about anymore, including skewness; that's really good. So basically, AQE, or adaptive query execution, is
            • 181:30 - 182:00 a Spark optimization technique which optimizes query execution at runtime. This is important: it optimizes the query execution at runtime, and it has three
            • 182:00 - 182:30 major powerful features, and these are: first of all, dynamically coalescing shuffle partitions; then we have join strategy optimization; okay, then we have
            • 182:30 - 183:00 dynamically optimizing skewed joins. So what are these three features? Let me quickly give you a one-liner for each, so that you can understand, and revise if you already know. So, dynamically coalescing partitions, what's that? Basically, let's say you are applying any kind of wide transformation. By default you get 200 shuffle partitions, right, not after AQE, like, before AQE. So the data will be divided into 200 partitions. It
            • 183:00 - 183:30 totally depends. Let's say you are applying a group by, so it will, roughly, try to put each value in one partition. Let's say you are applying a group by on the ID column and you have a value called one, so it will create one partition for that one, and let's say you have just 10 distinct values, so you will have just 10 non-empty partitions, and the rest of the 190 partitions will be empty. And let's say you have skewed data as well, so out of those 10
            • 183:30 - 184:00 partitions you have one partition of a big size, and the remaining nine partitions are really small compared to the bigger one. In that particular scenario, it can coalesce the partitions and remove the unnecessary ones. What do I mean? Let's say we have 10 partitions for now; it will coalesce the partitions dynamically based on their size and make, say, only three
            • 184:00 - 184:30 partitions. Sorted. And the rest of the partitions, all the empty ones as well, will be eliminated. That is why it is really important: if you have so many partitions, say because you are applying a group by transformation, you do not need to worry about the small partitions problem, because AQE can coalesce partitions based on the size of the data. Simple. Then we have join strategy optimization. So let's
            • 184:30 - 185:00 say you are applying a join between two big DataFrames, df1 and df2. Both DataFrames are big, but while applying the join you are using a filter on df2 which makes it a really small DataFrame. In that particular scenario, AQE will handle it dynamically and say, hey, why do I need to perform a shuffle sort-merge join or a shuffle hash
            • 185:00 - 185:30 join? I can simply perform a broadcast join, because this is a really small dataset after applying the filter. So it will dynamically change the join strategy at runtime. When we are writing the code, we assume that a broadcast join is not possible because df2 is really big, but after applying the filter condition it is really small. Okay, the third one: it dynamically optimizes skewness. Let's say you have one big partition, so it will dynamically, at
            • 185:30 - 186:00 runtime, break that partition into multiple ones, and it will take care of the skewness automatically; you do not need to worry about skewness. Wow, does that mean we do not need to use salting? Yeah, you can say that, because you can just enable AQE. How to enable AQE? It is enabled by default now, since Spark 3.2 (it was introduced in Spark 3.0), so you do not need to worry about that. Sorted. Now, I know that after
            • 186:00 - 186:30 knowing all these points you would start loving AQE; I did too. AQE is amazing; it takes care of so much. So this is all about AQE. If you get any question like, hey, how will you do that, just say, I will not do that by hand, because I will just enable AQE, and I do not even need to do that because it is now enabled by default. So that was all about AQE.
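For cases where AQE has to be switched on explicitly (for example on a Spark version older than 3.2), the settings involved can be sketched as below. The three configuration keys are standard Spark SQL options; the helper name `apply_aqe_settings` is made up for this sketch, and `spark` is assumed to be an existing SparkSession.

```python
# Spark SQL options behind the AQE features discussed above.
AQE_SETTINGS = {
    # Master switch for adaptive query execution.
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions after a wide transformation.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split oversized (skewed) partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_aqe_settings(spark, settings=AQE_SETTINGS):
    """Apply each option on the given session via spark.conf.set."""
    for key, value in settings.items():
        spark.conf.set(key, value)


# Usage on a real session (not executed here):
#   apply_aqe_settings(spark)
#   spark.conf.get("spark.sql.adaptive.enabled")
```

In an interview it is usually enough to name the master switch and mention that the other two are on by default once it is enabled.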
            • 186:30 - 187:00 Let's see what we have in our next question. So here we have a question: how would you handle skewed data in PySpark? And you just mentioned that AQE handles skewed data. Yeah, I know. The thing is, the interviewer can still ask you, just tell me how you would handle skewed data using salting; he or she wants to hear something about salting. Maybe they simply ask you, hey, what is salting? I know it is an older method to
            • 187:00 - 187:30 deal with skewness, but still, it is very simple, not a big deal. So first of all, we can handle skewness using salting; that is our previous, older method. So what happens? Let's say we have a key column and it is skewed: let's say we have the values one, one, one, one, then two, and then one, one, one, one, one. So this is skewed, right, because there are so many ones. So what do we do? We create a salt
            • 187:30 - 188:00 key, in which we add a random suffix after the key. Let's say I create a function on this column that adds a random value, so I get 1a, then 1a, then 1b, then 1c, and for two, 2a; obviously two is still two. What will it do? It will reduce the skewness. It
            • 188:00 - 188:30 will now use this salted value as the joining key: 1a, 1a. Earlier we had the same key one four times; now the most frequent key, 1a, appears just twice, so our data is not as skewed. This is the technique that we were following before. So you can say salting is one of the methods, and you should definitely know about salting if they ask you, and then obviously AQE, the best one, which will take care of skewed data automatically on its own.
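The salting idea above can be illustrated with a small plain-Python model. This is a sketch of the key manipulation only, not a PySpark job: the function names are invented here, numeric suffixes stand in for the a/b/c suffixes from the talk, and the fixed seed is only there to make the sketch repeatable.

```python
import random
from collections import Counter


def salt_keys(keys, num_salts, seed=42):
    """Append a random salt in [0, num_salts) to every join key."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [f"{key}_{rng.randrange(num_salts)}" for key in keys]


def replicate_for_join(keys, num_salts):
    """The other side of the join needs one copy of each key per salt value."""
    return [f"{key}_{salt}" for key in keys for salt in range(num_salts)]


skewed = ["1"] * 9 + ["2"]          # key "1" dominates
salted = salt_keys(skewed, num_salts=3)

print(Counter(skewed))              # one hot key
print(Counter(salted))              # the hot key is spread over several salted keys
print(replicate_for_join(["1", "2"], num_salts=3))
```

In PySpark the same suffix is typically built as a column expression with functions such as `rand` and `concat_ws`, and the smaller table is expanded with every salt value so the salted keys still find their match.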
            • 188:30 - 189:00 You can simply say, hey, enable AQE if it is not enabled, but I assume that it is enabled. Okay, simple. So we have our next question: what is a broadcast join and when should you use it? Basically, a broadcast join is used when, let's say, we have two DataFrames, df1 and df2. df1 is really big, let's say 1
            • 189:00 - 189:30 GB (it's not that big, but as an example it's fine), and df2 is, let's say, 3 MB. Let's say we have three executors, so df1 is distributed among these executors, right? Okay, then df2 is here, on this executor. If we do not apply a broadcast
            • 189:30 - 190:00 join, data shuffling, which is an expensive operation, will be done: these partitions will be coming here to apply the join, then this one, then this one. Why does that happen? Because you need to have the data on the same executor, on the same machine, if you want to apply the join. So, to eliminate all of that, what will we do? We will say, hey, this data is really small;
            • 190:00 - 190:30 this DataFrame is really small, it's just 3 MB, it can easily fit in memory. We will simply say, hey, do not come to our machine; just take this data and apply the join sitting on your own machines. So this will broadcast this DataFrame to all the machines, and each one keeps that DataFrame in memory. Make sure the DataFrame is small enough to fit in memory. And how can we use this? You can simply write df1.join(df2), but
            • 190:30 - 191:00 remember, you need to wrap df2 in the broadcast function, so that Spark picks up that, okay, we need to apply a broadcast join. By the way, Spark automatically performs a broadcast join if the smaller DataFrame is smaller than the threshold configured for broadcast joins. Let's say the threshold is 10 MB and this DataFrame is 3 MB: it will automatically perform a broadcast join. But you can explicitly mention the broadcast keyword, so it will broadcast that
            • 191:00 - 191:30 particular DataFrame to all the machines. Simple. Sorted? Yep, sorted. Let's see what we have in our next question. So, Mr. and Ms., okay, whatever, XYZ: what is spill in Spark and why does it happen? Just tell me. By the way, this is the question that
            • 191:30 - 192:00 we have, I think, almost discussed already; we have just attached a fancy name to it, spill. So what is spill and why does it happen? Basically, it is a little bit similar to executor out of memory. Okay, let me complete: when our executors do not have enough memory to store intermediate results of transformations, then in that scenario Spark writes data to
            • 192:00 - 192:30 disk, and we call that spill; the data is spilled, and Spark uses the disk to store the intermediate result. So why does it happen? It is due to memory pressure on the executor, like we do not have enough memory, and why that can happen we already know: skewness, or expensive shuffle joins and so on. Skewness is the major
            • 192:30 - 193:00 concern. In order to mitigate this, again, we can increase the memory or try to eliminate the skewness. Simple, right? So this is all about spill, and now you know how to answer this, bro and sis. Sis doesn't sound good; bro is still fine. What are Delta Lake's time travel
            • 193:00 - 193:30 features and how do they work? Jokes apart, this is really important, and it can come up as a kind of scenario in your interviews. Let me give you a scenario: you are a developer, you have just joined the company, and you deleted some data in prod, and they are now saying, hey, bring back the data right now. How will you do that? Basically, you will use the same
            • 193:30 - 194:00 answer, the time travel feature within Delta Lake. This is one of the best features, and I love it; it is really good. So what do we do? Let's say we have a Delta table. We can actually see all the versions of the data. So let me break this question down into two parts: one is versions, and then we use time travel based
            • 194:00 - 194:30 on those versions. Okay, so let's say we have versions, right? Let me use blue, I like blue. Come on, Ansh. So I'll simply say DESCRIBE HISTORY, then the table name. When I run this command, it will give me a table with all the versions,
            • 194:30 - 195:00 let's say version zero, version one, version two, version three, and it will show me all the activities performed on this particular table. Okay, and yes, you need to answer like this, in detail. It's all up to you, bro. They are interviewing so many candidates, right? Let's say that out of hundreds or thousands of applicants they shortlist
            • 195:00 - 195:30 20, and they are interviewing those 20 applicants, and everyone is able to answer all the questions, let's imagine. How will you stand out among those 20 applicants? By showing that you have in-depth knowledge and the capability to explain things to your juniors or, let's say, to your teammates, right? So try to explain everything in detail. They should
            • 195:30 - 196:00 feel that this is the person; this person should be the CEO of our company. No, bro, just a developer. It's up to you, bro; if you want to go into the management line, you can do that, not a big deal. You are a human, you can do anything, trust me. Or don't trust me, it's up to you.
            • 196:00 - 196:30 Okay, so this will give me all the versions. Let's say these are all my versions, and up to version two everything was fine, and when you deleted the data, that created version three, so now our table is at version three. Now if I want to travel back, I can simply use
            • 196:30 - 197:00 RESTORE TABLE, the table name, TO VERSION AS OF 2. Simple. When I run this, it will take me back to version two. Simple and sorted. And just tell your manager, hey bro, your data is back. No, no, don't talk to him like this; I'm just explaining. You can say, hey, we have done that. So this way you can actually
            • 197:00 - 197:30 restore the data and perform the time travel. Okay, that is a really important one. That is why I have added so many new questions. We have already watched so much content regarding PySpark questions; it's good, and now it's time to add some new questions as well, because these are all part of PySpark.
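The two time-travel steps described above (inspect the history, then restore) could be wrapped as small helpers like the ones below. This is a sketch under assumptions: `spark` is taken to be a SparkSession with Delta Lake available, the helper names are invented here, and `table_delta` in the usage comment is just the example table name used earlier in the talk.

```python
def show_history(spark, table_name):
    """Return the table's version history (version, timestamp, operation, ...)."""
    return spark.sql(f"DESCRIBE HISTORY {table_name}")


def restore_version(spark, table_name, version):
    """Roll the Delta table back to an earlier version number."""
    return spark.sql(
        f"RESTORE TABLE {table_name} TO VERSION AS OF {int(version)}"
    )


# Usage on a Delta-enabled session (not executed here):
#   show_history(spark, "table_delta").show(truncate=False)
#   restore_version(spark, "table_delta", 2)
```

Checking the history first matters in practice, because the version to restore is whichever one precedes the accidental delete, not necessarily the one before the latest.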
            • 197:30 - 198:00 Really? Yes, because you use PySpark to actually run all these things: you use Spark SQL, you use PySpark, so that is part of your interview questions, bro. Okay, so this is all about Delta Lake. Okay, what do we have in our next question? Let's see. So in this question, hold on, let me just do
            • 198:00 - 198:30 it. So in this question, number 43, what do we have? We have performed so many aggregations using group by. Now it's time to perform an aggregation again, but we do not want to find any count, max value, min value, or sum; this time we need to aggregate the
            • 198:30 - 199:00 data itself. Really? Just read the question: you are processing sales data; group by product category and create a list of all product names in each category. You get the question? So what do we actually need to do? We have category, okay, then we have product. This time I don't want to just count the number of
            • 199:00 - 199:30 products in each category; I want to show which products we have in each category. How can we do that? Basically, we'll be using a function called collect_list. Let me show you how you can do that. We simply say df.groupBy, and the group by column is category, perfect. Then agg; this time I
            • 199:30 - 200:00 will use collect_list, as you can see, collect_list, then I will simply say product. Sorted. I will simply say alias, and I will say products; that will contain the list of products, right? Then I will simply say df.display. Let me run this and you will see the list this time.
            • 200:00 - 200:30 see let me just expand it so I have category I have furniture then I have list of all the products laptop smartphone chair table simp simple but yes really really really helpful so just keep this function in your mind because this is really great function and if you are familiar with SQL this is similar to group concat group concat function in SQL and we have
            • 200:30 - 201:00 Here we have collect_list. Okay, let's see what we have in our next question. This one is pretty similar to the previous one, but it can help you memorize which function you need to use in which scenario, and that's why I have put this question right after the previous one. The question says: you are analyzing orders; group by customer IDs and list all unique
            • 201:00 - 201:30 product IDs each customer purchased. Unique: this word is tricky. Unique product IDs, not all the product IDs. Okay, so in the previous scenario we grouped on, what was the thing, category. Here we have customer ID, but this time we do not need to put all the product
            • 201:30 - 202:00 IDs. Let's say under customer ID 101 we have two product IDs, P001 and P001, but this time we just need the unique product IDs. We have the collect_list function, but that doesn't do that. So how can we do it? We have a function called collect_set. If you are familiar with mathematics... being a mathematics student I know what a set is, and I'm not
            • 202:00 - 202:30 flexing my degree. Basically, a set is a collection that eliminates duplicates, so it gives you just the unique values, along with operations like union and all. I studied sets in class 11th, in the year 2017, I guess. Fine, so let's code now. So df.groupBy, or
            • 202:30 - 203:00 let's say df = df.groupBy, and we need to apply the group by on customer ID. Then I will simply say agg, and then collect_set instead of collect_list. See? Then I'll simply say product ID, and I will put
            • 203:00 - 203:30 an alias, and I will simply say unique_products, and then df.display. Simple, man. I should see each value only once for each customer: for 101 it's P002 and P001, and for 102 it's just P001. Let me just look at the data set. So yeah, perfect.
            • 203:30 - 204:00 See, for 101 we had P001, P002, P001, so it eliminated the duplicate, and for 102 we just have P001. Sorted. Okay, good. Let's see what we have in our next question. In question number 45 we have a combination of tasks. Really, bro. The question says: for customer
            • 204:00 - 204:30 records, combine first and last names only if the email address exists; otherwise don't do that. Let me have a look at my data frame. We have first name, last name, and email. So we need to concatenate first name and last name only if the email ID exists; otherwise don't do anything. Okay, here I will show you
            • 204:30 - 205:00 obviously all the tasks that we need to perform, but I will also show you how you can perform the concat using a different function. Really, we have a choice while performing concat? Yes, we have a choice. Life is all about choices, right? You're watching this video: it's your choice, or maybe you just love me. Okay, okay, as your brother.
            • 205:00 - 205:30 So how can we do that? Basically we need to apply a condition, and when that condition holds we will concat; otherwise we will not. So I will simply create a data frame: df = df. (let me just zoom in) withColumn, and we will call it full_name. Then I will say when:
            • 205:30 - 206:00 when column email .isNotNull(), then I want to perform the concat, and I will use concat_ws instead of concat. Hold on, bro, I will tell you why. With concat_ws we just need to define the separator up front. Let's say I want
            • 206:00 - 206:30 to use a hyphen. Then I will simply list all the things that I need to concat. This eliminates the need to put that delimiter again and again: we define the delimiter once at the beginning, which I have already done, and it is automatically added between all the values. So it is very handy when you have multiple
            • 206:30 - 207:00 columns. Let's say you need to concatenate four, five, or ten columns: instead of writing concat, this, then comma, then delimiter, then object, then comma, then delimiter, then that... no. Be a smart data engineer. So I will simply say column first_name, okay, perfect, and then column last_name.
            • 207:00 - 207:30 Why am I thinking that I will get an error? Because I can see so many braces here, but we will tackle that, no worries. Okay, so this when is done. Now we need the otherwise: if the email ID doesn't exist, then otherwise simply say None. Then let me say df.display.
            • 207:30 - 208:00 Hmm, df. ... that's why I don't like writing the whole code in a single line. See what happened with the syntax? Wait, let me indent it, because when I was hitting enter it was giving me this. We simply need to close the bracket, and then we can say df.display. Simple. Let's see if we have done it. Yes, we have.
            • 208:00 - 208:30 So we have first name, last name, and we can see full name only if the email ID exists. We used concat_ws, and we used .isNotNull(), which is basically the condition we applied: if the column is not null, only then perform the further step; otherwise just give me None. All sorted, man.
            • 208:30 - 209:00 Now, what do we have in the 46th question? Let me show you. You have a data frame containing customer IDs and a list of their purchased product IDs. Calculate the number of products each customer has purchased. So far we have only applied aggregations based on values; we didn't apply one based on a list. If you want to find the number
            • 209:00 - 209:30 of elements within the list, how can we do that? Let me show you; it's really easy. I will simply say df = df.withColumn, and I'll call it number_of_products. Then I will simply say size of column... by the way,
            • 209:30 - 210:00 if this question comes in the form of an aggregation, you need to apply the group by first and then perform this step; otherwise you can perform this step directly, because here the values are not being aggregated. Just wanted to highlight this. So, product_ids. Simple. Then I can simply say df.display. Perfect.
            • 210:00 - 210:30 Now you can see I have got the sizes 3, 1, 2, so I can see how many elements I have in each list. I just made use of the size function. Remember this function. Okay, so in the 47th question: you have employee IDs of varying length. Ensure all IDs are six characters
            • 210:30 - 211:00 long by padding with leading zeros. It means, as you can see in the column, in one row we have a length of only one, then length three, then length four. It says: fix this length. Let's say the maximum length is six; if one record's length is not six, say it's only one, then you add five
            • 211:00 - 211:30 zeros to make sure that the length is constant throughout the column. How can we do that? Tell me, you have the interview. Just kidding. You can simply say df = df.withColumn, and what's the column name? employee_id. So I'll modify this column instead of creating a new one: employee_id. Then I will use a function called
            • 211:30 - 212:00 lpad. lpad, because the question says that we need to add zeros at the beginning. If it said you need to add zeros after the value, you could use rpad: same thing, not a big deal. Then I will simply say column employee_id, then the length, which is six, and the character that I want to add, which is zero. And then I will simply say df.
            • 212:00 - 212:30 display. The pad character can be anything; it can be a hash, it can be anything. Oops, I forgot to mention P. So, as you can see, it has added zeros at the beginning: five zeros for the first record, three zeros for the second record, two zeros for the third record. Makes sense, because they have more characters. Okay, so this was all about this question, and I hope this was
            • 212:30 - 213:00 a tricky one, because obviously you cannot instantly think of the function available for padding. Here you do not actually need to apply logic; you just need to make sure you can recall these functions at that time. In the previous questions we saw scenarios where you actually need to find the architecture, how you will approach the question, but for these kinds of questions you just need to have that function in your mind. That's it.
            • 213:00 - 213:30 This question requires special attention, because it says you need to validate phone numbers by checking if they start with 91. By the way, do not try to dial these numbers; these are just random numbers I have picked. So we need to verify. If you look at the data, you can see two of the phone numbers start with
            • 213:30 - 214:00 91, but the second phone number starts with 81. So we need to filter only those rows, or let's say create a flag that says whether this number starts with 91 or not. How can we do that? Just to give you the approach: first of all we will fetch the first two characters of the phone number, just the first two. Then we will check whether it is
            • 214:00 - 214:30 starting with 91 or not. So how can we do that? I will simply say df.filter, and then column phone_number, and I will use substring. So I will first use substring: this is the column, I start at position one and take a length of
            • 214:30 - 215:00 two, so this will give me the first two characters. That's it. Then I will say this value == 91, then .display. Perfect. As you can see, I got just two phone numbers instead of three.
            • 215:00 - 215:30 I get the rows based on the condition that I used. First of all I used substring: the substring function helped me fetch the first two characters. This is the starting position, this is the length, and then I compared the result with == 91. Sorted. So, my bro, you have a question in which you have a data set with courses taken by students. Okay,
            • 215:30 - 216:00 calculate the average number of courses per student. If you look at the data frame, we have student ID 1, student ID 2, student ID 3. Okay, this student is very studious: Math and Science. Wow, man, such a great combination for destruction. Then we have History. Who studies history, man? Come on.
            • 216:00 - 216:30 Then we have Art. Wow. By the way, the formal name of art is maybe Humanities. PE is physical education, if you do not know. Biology: everyone knows biology. Biology is good till class 10th; then it becomes a headache. I didn't take biology; I just got this feedback from my friends who took
            • 216:30 - 217:00 biology. Yeah, just take Humanities and sorted, man. Okay, it's a personal choice, so I'm not the one who can comment on this. By the way, I can; I don't care. So what do we actually need to do? We need to find the average number of courses per student. This question requires multiple steps, so let's first perform the first step, in which we find the size of the courses
            • 217:00 - 217:30 list. Then we will find the average. First of all I will simply say df = df.withColumn, and I'll call it courses_size, or course_size. Then I simply use the function called size and pass the column called courses. Simple. This will give me the size of the list.
            • 217:30 - 218:00 Then what can I do? I can simply say groupBy and then .agg with the average of this new column, which is called... oh, I just wrote course. See, mathematics students: it's course_size, course not courses. Okay, then we can simply say df.display. Perfect.
            • 218:00 - 218:30 Perfect. Oh, so the average size is two. Nice, we found it. See, a mathematics student can find the average like this: mean, median, mode, average, everything. Okay, so what do we have next? Our 50th question. The 50th question should be special, so let me bring a special question for you. Let's discuss question number 50: you have a data set with primary and
            • 218:30 - 219:00 secondary contact numbers. Use the... from where are you getting these phone numbers, bro? Trust me, I don't know; do not even try to dial these numbers. Use the primary number if available; otherwise use the secondary number. Do you know what it actually needs us to do with these columns? Basically we have primary contact and secondary contact. It is saying that in some cases the primary contact is not available, and in some
            • 219:00 - 219:30 cases the secondary contact is not available. So we need to do something so that if the first contact is not available we use the second one, and if the first is available we just neglect the second one, because we just need one phone number. That's it. If you are familiar with SQL, this is similar to the COALESCE function in SQL. We have the coalesce function in PySpark as well, so I will use that. I will simply say df = df.
            • 219:30 - 220:00 withColumn, and then I will simply say contact. Simple; we just need one contact. So I will say coalesce, and first I will pick column primary_contact (oh man, I'm really, really hungry), then, if we have nulls in that column, we say: just pull the
            • 220:00 - 220:30 value from secondary_contact. Simple, man. Perfect. Let's display the data frame. Okay, we have the contact column, which has fulfilled our demand. Okay, it's time to discuss the last question: question number 51.
            • 220:30 - 221:00 I hope this notebook is helping you a lot, so that you can practice your scenario-based questions without paying for any expensive platforms. You can simply drop me a lovely message and I will feel so, so happy. Just kidding; just drop a lovely comment. Yes, because why am I asking you to drop a comment? It's not
            • 221:00 - 221:30 just for me; it's for others as well who are trying to find some resources to consider before their interviews. They can check the comments and feel confident that, okay, this content is reliable, so they can spend some time on it before the interviews. I know every single minute is important before an interview, so they should feel that if they invest two to three hours in this video, they can practice the questions, they will feel confident, and they will
            • 221:30 - 222:00 just ace the interviews. That's it. And yeah, you can give me a treat as well. Just kidding. Man, I'm so hungry right now; I'm just thinking about food. By the way, I just put a toast in the toaster, so I can smell it. So let's discuss the last question, and then I will eat. Finally. Let's discuss our last question. So it's time
            • 222:00 - 222:30 to discuss... Are you eating toast? No, I'm just finishing the rest of my carrot. So this is the last question, and you should feel like, yes, now we know all the stuff. Okay, simple. Just read the question while I'm chewing.
            • 222:30 - 223:00 Read it? Good. Let me also read it. The question says: you are categorizing product codes based on their length. If the length is five, label it as standard; otherwise label it as custom. So this is a very generic scenario where you actually need to
            • 223:00 - 223:30 figure out which product codes are system generated and which are custom. This can be a real-time scenario, trust me. So how can we figure it out? Obviously we need to apply a condition, like an if-else condition or a CASE WHEN statement, and on top of it we need to use a function that can find the length. Again, architecture plus function:
            • 223:30 - 224:00 it is a fusion of both. And how can you do that? You simply need to use a function called length to find the length of a text; then you use the when statement on top of it to compare, and then you say standard or custom. Ansh Lamba, very well explained; now just do it. Okay: df = df.withColumn, and I will simply say
            • 224:00 - 224:30 flag, because I'm treating this column as a flag, or let's say code_flag, just to be more specific. Then I will say when, and length (length is a function, see), length of column product_code. Oops: product_code. Length of column
            • 224:30 - 225:00 product_code, this one; this will give me the length. Then I will say == 5, then say standard; otherwise, .otherwise, simply say custom. Okay, and now I'm just running it: df.
            • 225:00 - 225:30 display. Let's see. Boom. So, as you can see, if the length is equal to five, it's standard (see this code: that's five); if it is not equal to five, then it's custom. And that is the scenario, I think. Right? Yeah, right.
            • 225:30 - 226:00 So let's have a one-on-one talk. Hey, Mr., Miss, Mrs., look into my eyes. Now you have completed all the questions: theoretical, conceptual, practical, everything. Now I have some, you can say, action steps for you. If you have very little
            • 226:00 - 226:30 time before your interviews, just revise all the concepts quickly and sit for the interviews. If you have some days left, even, let's say, a week, simply click on the video that is coming on the screen right now. It is purely based on Delta Lake, and trust me, it will definitely help you a lot to ace the interviews, because Delta Lake is the backbone of Databricks right now and
            • 226:30 - 227:00 everyone is looking for Databricks. Why can't you understand this? So just click on the video coming on the screen for Delta Lake. Plus, for those who want to learn PySpark in detail, I have already put that link in the description; they can click on that video. And for those who are willing to learn more and more, I have created an end-to-end, seven-to-eight-hour-long video on an Azure end-to-end data engineering project. Just click on the video
            • 227:00 - 227:30 coming on the screen. So it's your choice, and I will see you there. Love you a lot, and bye-bye. Let me just eat my toast now. Bye-bye.