Socket management and Kernel Data structures
Summary
Hussein Nasser breaks down how the operating system kernel manages socket connections using various data structures at the transport layer (layer 4). The video details what sockets and connections are, and explains the use of queues in managing these connections. In this first of two parts, Nasser walks through the mechanics of listening on specific IPs and ports, the handshake process for establishing connections, and the concept of socket sharding for load balancing. The discussion also covers data buffers, flow control, and the underlying complexity of socket management at the kernel level.
Highlights
- The video explains the concept of sockets and how they function at the kernel level using data structures.
- It details the mechanism of listening on specific IP addresses and ports and the associated security concerns.
- Nasser discusses the three-way handshake process essential for establishing robust connections.
- The role of queues, specifically the SYN and accept queues, in handling connections is highlighted.
- Socket sharding is discussed as a method for scaling acceptance and balancing load.
- The management of send and receive buffers in ensuring smooth data flows is covered.
- Nasser discusses the intricacies of managing the TCP/IP stack within the kernel, underscoring the complexity involved.
Key Takeaways
- Understand how the OS kernel manages sockets and connections using data structures at the transport layer (layer 4). 🧑‍💻
- Learn the importance of listening on specific IPs and ports, and the risks of listening on all interfaces. 🚨
- Discover the three-way handshake process for establishing connections and the role of queues in this process. 💡
- Explore socket sharding and how it helps in distributing load across multiple accept queues. ⚖️
- Gain insights into the management of send and receive buffers and how flow control helps maintain a balance. 🎶
- Learn why kernel-level socket management is complex yet crucial for ensuring efficient network operations. 🔍
Overview
In the video, Hussein Nasser offers a deep dive into socket management and kernel data structures, focusing on how operating systems handle connections at the transport layer (layer 4). He discusses the role of sockets, connections, and the socket management structures that streamline data flow on networks. Using a detailed, step-by-step approach, he unpacks how operating systems use these structures to keep network operations running smoothly.
Nasser explores the process of listening on specific IP addresses and ports, breaking down the three-way handshake mechanism fundamental for establishing connections. He emphasizes the importance of binding to specific IPs instead of all interfaces to avoid exposing a service unintentionally, and explains the queues used to manage incoming connections, namely the SYN queue and the accept queue.
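The listening and accepting steps described above can be sketched with Python's standard `socket` module. This is a minimal illustration under stated assumptions, not code from the video: the helper names `open_listener` and `accept_one` are hypothetical, and port 0 is used only so the kernel picks a free port. The sketch binds to the loopback address rather than all interfaces, passes an explicit backlog, and shows that an accepted connection is a separate file descriptor that carries both endpoints.

```python
import socket


def open_listener(host: str = "127.0.0.1", port: int = 0, backlog: int = 128) -> socket.socket:
    """Create a TCP listening socket bound to one specific address.

    Hypothetical helper for illustration. Port 0 asks the kernel to pick
    any free port; `backlog` is the size hint passed to listen().
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, port))
    listener.listen(backlog)
    return listener


def accept_one(listener: socket.socket):
    """Connect a local client, then accept it on the listener.

    The kernel completes the three-way handshake before accept() is called,
    so accept() only hands over an already established connection.
    Returns (client_socket, accepted_connection).
    """
    client = socket.create_connection(listener.getsockname())
    connection, _peer = listener.accept()
    return client, connection


if __name__ == "__main__":
    listener = open_listener()
    client, connection = accept_one(listener)
    # The listening socket only has a local address; the connection has both ends.
    print("listening on:", listener.getsockname())
    print("connection local/remote:", connection.getsockname(), connection.getpeername())
    for sock in (client, connection, listener):
        sock.close()
```

Passing `"0.0.0.0"` as the host instead would listen on every IPv4 interface, which is the behavior the video warns about.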
The video also covers advanced concepts like socket sharding, which allows multiple processes to listen on the same IP and port, distributing load and improving scalability. Nasser discusses the interplay between send and receive buffers and their role in keeping data flowing across the network, highlighting the kernel's part in orchestrating these operations.
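Socket sharding as described above can be sketched in Python using the `SO_REUSEPORT` socket option. This is a hedged illustration rather than the video's code: `open_shard` is a hypothetical helper, the option is not available on every platform (the sketch raises `OSError` where it is missing), and both sockets live in one process here only to keep the example short. In a real deployment each shard would typically belong to a separate worker process or thread.

```python
import socket


def open_shard(port: int = 0, host: str = "127.0.0.1", backlog: int = 128) -> socket.socket:
    """Open a listening socket that may share its IP and port with other shards.

    Hypothetical helper for illustration. Every socket that shares the port
    must set SO_REUSEPORT before calling bind().
    """
    if not hasattr(socket, "SO_REUSEPORT"):
        raise OSError("SO_REUSEPORT is not available on this platform")
    shard = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    shard.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    shard.bind((host, port))
    shard.listen(backlog)
    return shard


if __name__ == "__main__":
    first = open_shard()                 # kernel picks a free port
    port = first.getsockname()[1]
    second = open_shard(port)            # second distinct socket, same IP and port
    print("shards listening on:", first.getsockname(), second.getsockname())
    first.close()
    second.close()
```

Each shard is a distinct listening socket with its own accept queue, and the kernel decides which shard receives each new connection. On Linux, the documented safeguard against port hijacking is that all sockets sharing the address must set the option and belong to the same effective user ID.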
Chapters
- 00:00 - 01:30: Introduction to Kernel Socket Management The chapter provides an overview of how the operating system kernel manages sockets using various data structures. It discusses what a socket and a connection are, along with the different queues employed to manage connections, listeners, and data at layer 4 (the transport layer).
- 01:30 - 03:30: Sockets and Connections In this chapter, the focus is on connections, sockets, and queues, fundamental concepts in operating systems. The content is part of a fundamentals of operating systems course and addresses the technical aspects of these components. It's the first part of a two-lecture series on this topic, setting the stage for a deeper exploration in subsequent lectures.
- 03:30 - 05:30: Listening and Binding Sockets The chapter covers core kernel network data structures with a particular focus on sockets and connections. It promises to delve into the specifics of reading from connections and the kernel data structures involved in such processes in the following chapter. Readers are encouraged to understand and explore the concept of a socket without resorting to memorization.
- 05:30 - 11:00: The Three-Way Handshake and Connection Queues The chapter introduces the concept of a socket, explaining its role as a data structure and its representation as a file descriptor in Linux, and an object in Windows. It touches upon the Unix philosophy, where everything is treated as a file. The discussion moves to the idea of 'listening' on a network, which is a fundamental part of accepting connections on a specified IP address.
- 11:00 - 15:00: Connection Acceptance and File Descriptors This chapter introduces the concept of network interface cards (NICs), explaining that a computer can have multiple NICs, each with its own IP address. It notes that it's possible to create unlimited virtual NICs in addition to physical ones, broadening potential connectivity possibilities. This includes a discussion on the use of specific ports for communication, suggesting a structured approach to managing connections on a machine with multiple IPs.
- 19:00 - 23:00: Socket Sharding The chapter titled 'Socket Sharding' delves into networking concepts, focusing on how network settings are exposed and configured at an IP level. It introduces key networking parameters such as IP address, subnet mask, default gateway, and DNS resolver IP address. These elements are part of the network interface configuration, possibly in the context of a proxy or a similar setup where listening on a specific IP is involved.
- 23:00 - 27:00: Receive Buffers and Network Flow Control The chapter discusses the configuration and management of network cards and the importance of specifying IP addresses and ports for network communication. It emphasizes the common practice of using default ports such as 8888 or 8080 in backend development. Additionally, the chapter highlights the loopback address used as a default setting in certain network configurations.
- 27:00 - 30:00: Closing and Additional Insights In this chapter, the discussion revolves around technical aspects of network programming, specifically focusing on loopback interfaces for IPv4 and IPv6. It mentions the use of TCP/IP for local development, with the IPv4 loopback address being 127.0.0.1 and the IPv6 loopback address written as [::1]. The speaker references NodeJS, a JavaScript runtime, to show how a backend typically calls listen without specifying an address. This chapter aims to explain how these concepts apply in day-to-day development.
Socket management and Kernel Data structures Transcription
- 00:00 - 00:30 what's going on guys my name is Hussein and this is an overview of how the operating system kernel does socket management using the different data structure the socket management data structures that is built in at that layer 4 or the network layer essentially so I'll talk about what a socket is what a connection is and the different kind of queues that are used in order to manage connections and listeners and uh data
- 00:30 - 01:00 that are received for each of these connections this is a snippet of my fundamentals of operating systems course if you're interested check out the pinned comment below and now let's go to the video all right welcome to the nitty-gritty we'll talk about connections sockets and queues i'll split this into two lectures the first part we'll talk about the in both cases we'll
- 01:00 - 01:30 talk about the core kernel network data structures but in this particular thing I'm going to talk about the sockets and connections and then in the next one we'll talk about specifically how to read from connections and what kernel data structures does the kernel use to actually implement that let's jump into it this is going to be a beautiful let's just uh swim you guys here let's swim don't memorize anything let's just swim first concept the concept of a socket
- 01:30 - 02:00 and a socket really when it comes to actual data structure is nothing but a structure a struct really in C it's also a file descriptor at least in Linux in Windows becomes an object because in in in Linux everything is a is a file when I listen and we talked about what is the concept of listening right and the fundamentals so like hey I want to accept connections on a specific IP IP address
- 02:00 - 02:30 and a specific port Hussein you used the word specific can I have more than IP address in one machine of course in every computer you have multiple NICs or network interface cards and not only that you can technically create unlimited virtual network interface card each with its own IP address ideally you have a physical network and you have a mapped mount
- 02:30 - 03:00 which have essentially that network uh virtual IP not virtual IP sorry uh it will have a NIC exposed to you that you say hey this is my IP address this is my subnet mask this is my uh default gateway this is my DNS resolver IP address all of that is my proxy stuff like that so you can listen on an IP and a
- 03:00 - 03:30 port it has to be those two things so that means I have four network cards i'm listening on the IP address of this network card you have to specify it no and then you specify the port and if you if you've ever built a backend you'll always see that the default is I don't know 8888 or 8080 the port and the and the IP is the loopback default NIC
- 03:30 - 04:00 which points to myself for development purposes so that's the TCP IP 127.0.0.1 that's the IP version 4 for IPv6 loopback it's [::1] that's the loopback for IPv6 but that's basically the games we're playing here you might say I say I don't do anything when I listen I just call in NodeJS I could call listen and then I close the bracket
- 04:00 - 04:30 I don't do any of this stuff well this is where we need to talk you guys when you listen you can give the option to listen on on all interfaces which is absolutely horrible idea horrible and I'm really mad that this is a default apparently we have to make it a default because it's too much to look up an IP and IP can change this and and and to make it easy let's listen on all
- 04:30 - 05:00 interfaces and you do that by doing 0.0.0.0 four zeros or I think in IPv6 [::] that listens on all IPv6 versus all IPv4 interfaces because you can have both right watch out for this because you're building a small app that does I don't know opens a port and you accidentally deployed it in production and you wanted like literally to run it
- 05:00 - 05:30 locally you just and all of a sudden you you run it and was listening on all production NICs which happens to one of them happens to be a public IP address or a public or an IP address that is essentially mapped to some load balancer it's public eventually right so now you exposed your app to the public that's how every single MongoDB instance and Elasticsearch data got breached because
- 05:30 - 06:00 people just listen to all interfaces and deploy it on the cloud and and literally attackers will scan the whole web scan all the IPv4 uh you know ranges looking for the Elasticsearch DB ports and then connect and then brute force their way so that's really dangerous you really want to listen on a specific port that you actually know right it's hard it's
- 06:00 - 06:30 difficult it it's clum uh not clum it's it's a it it's requires some more work but it's secure now then we have listened on a port and IP Hussein when you do that you call a system call called listen and you get back a beautiful file descriptor called socket and this is I'm going to coin a name for it call it
- 06:30 - 07:00 listening socket file descriptor because a connection is also a socket but that's kind of a different kind of a socket it's the same data structure right and the kernel kind of stores them in the same file the same table data structure but you can see that the socket has only listening socket it doesn't have like a source IP or destination IP it doesn't have any of that but a connection actually is is a
- 07:00 - 07:30 is an is an entity that has two sides it has source a client and a server so a connection is different than a socket but I'm going to call it a connection socket for now we have a socket beautiful what do we get we get a file descriptor where does it live it lives in the process's PCB we know all these terms now guys we can talk safely with them so a socket is a file what happens if I fork guess what if you fork
- 07:30 - 08:00 uh the the the files are shared right because you get a copy of everything that the process had at that point including file descriptors so watch out maybe this is this is by design and this is good so when you fork careful where you forked you fork you get everything so we talked about let's let's go back to our single process we have a single socket in my process for every listening socket that we create you guys we get uh
- 08:00 - 08:30 something called two things you get two queues don't get hung up on the word queue because they are not actually implemented as a queue because a queue is a very it's not very efficient in this particular case actually one of my students caught that i was like saying like the queue seems to be very inefficient but I and I looked the source code it's not a queue at all it's a hash table turns out but we talk about it as a queue because it's easier to talk through it this way so we get something called a
- 08:30 - 09:00 SYN queue and this belongs to the socket S and we get another queue called the accept queue what are those for well if you're if you're here you probably know how to establish a TCP connection right we do a SYN and then the server sends back a SYN-ACK and then the client sends back an ACK and that's called a three-way handshake and using that we get a beautiful uh
- 09:00 - 09:30 connection but that's the handshake to handle the handshake we we need these SYNs which is the synchronization we use them to to synchronize the sequence numbers that we use to label every segment so that's the first thing the SYN queue so if you have a lot of listeners you get a lot of SYN queues essentially you also get an accept queue this stores completed connections because if I did if I am a client I want to connect to
- 09:30 - 10:00 you i first send you a SYN by just send you a SYN I don't have a connection yet we take your SYN and put it in the SYN queue we don't have a full-fledged connection we only have full-fledged connection if we successfully completed the three-way handshake right so then you can see that as I'm putting SYNs here from multiple clients right from one client in India one client in Germany one client coming
- 10:00 - 10:30 from an IP address from Lebanon one from Bahrain one you see you can see that this can fill up and then we complete the connection so we move this stuff to the accept queue and we have full-fledged connection from Germany from Russia we have these connections ready but how many can I keep in my queues before it fills up there's a size to those and this size is called the backlog and you can essentially when you
- 10:30 - 11:00 call listen you can specify the backlog and this is actually exposed all the way to the clients even in NodeJS you can specify how how much is your backlog really pretty cool if you think about it you might need a large backlog if you have so many connections coming your way and you want to essentially you're not accepting connections fast enough we'll talk about accepting in a minute now cuz connections that lives here still is not with us the back end this is still all
- 11:00 - 11:30 kernel so now let's talk about actually creating connection we don't we up to here we don't have a connection yet we only have a listening socket that is one file descriptor now let's talk about connections what happened in here is like completed connections are placed in the accept queue we talked about that after you finish when you put a SYN put here and then the SYN entry comes here until we we send back the SYN-ACK and
- 11:30 - 12:00 then the client will send back an ACK and then we match the ACK with the SYN here and then we remove it from the SYN queue and we add it to the accept queue and now we have a full-fledged connection but it still is not ready you have to accept it who who is you it's a backend application you might say I never called this function in my life well you don't but NodeJS does Bun
- 12:00 - 12:30 does Django does all your framework does do behind the scenes you must call accept which which creates the file descriptor for that connection yes there is a dedicated file descriptor identifier for each connection and when you do that then we create additional two queues for the
- 12:30 - 13:00 actual data that going to be sent right so when you do that you accept you're going to create a connection that connection is going to be a file descriptor which lives on whomver or whoever whom or who whomever the process that called accept if I I'm process A I called accept then the file descriptor that comes back will live in process A if I'm process B and happen to have access to the socket somehow and I called accept then the file descriptor
- 13:00 - 13:30 that gets created will live in my process B's PCB process control block and it's going to be a chain of file descriptors maybe I'll accept 100 connections with everyone you get two queues a send queue and a receive queue we'll go we'll come to that in the next lecture but each of them the send queue will go for your outgoing data so as a server as a backend if I send a response to my REST
- 13:30 - 14:00 API call that goes to the send queue if my client sends a request it goes to the receive queue does that make sense and vice versa the client also has a connection on its end as a client and it will have a send queue and it has a receive queue the client does not have a socket it does not have a listening socket it has a just a connection so it have a send queue and a receive queue so if a client sends a
- 14:00 - 14:30 request it goes to the send queue and then eventually the kernel pushes it out to the NIC and then uh it holds a receive queue where the response from the server will go into before we actually take the data and push it into the user space and do some stuff with it let's go through some animations so here's my client here's my process ID on the back end that is running and of course these two lives in the kernel which nothing but a mapped area in the process to the kernel state
- 14:30 - 15:00 uh uh to the kernel state right so accessing those queues right of course require kernel mode right to be accessed you cannot access them directly from the process these are forbidden right uh again io_uring kind of breaks this rule but for normal operation you cannot access them they are protected so the SYN comes in here and sits in the SYN queue right so the SYN will have a from IP
- 15:00 - 15:30 address a from port a destination IP address and a destination port and that's the four-tuple that uniquely identify this guy and you put it right here beautiful now we have a SYN entry you immediately as a kernel reply back by SYN-ACK where the from and to from IP and from port is the whatever listening socket IP we listen to right and actually whatever the client decided to
- 15:30 - 16:00 connect to let's say I'm listening on all IPs right so the the send the client cannot connect to all IPs it has to specify one and by knowing that one that becomes the source IP in this particular case if and only if there is a matching corresponding socket because I I I drew this as a line but there's so much stuff doing going going on here what is what is going on if I am doing a SYN and my
- 16:00 - 16:30 destination IP is 1.1.1.1 right that's Cloudflare's and the port is 80 then the kernel first looks up the destination IP in certain table that we're going to talk about and that table says all right do I have a matching uh kernel entry that says I'm listening on this particular socket on this IP yes
- 16:30 - 17:00 or no if yes let's do it I accept you if no we get an error that no listening socket or we get an all right so you see there's a matching that needs to happen we need to look up imagine yourself actually implementing this how we do do it there's like billions of way to do it a little exaggerating but the kernel does it certain way this lookup should be massively
- 17:00 - 17:30 fast pro possibly hopefully in the in the in the cache and the CPU all of these structures really need to be cached cuz you don't want to go look up that massive table with all the listeners you know you want to quickly look it up and from this you would know based on that entry you would know which process it belongs to absolutely beautiful so now we send this ACK let's say we have a behaving client that
- 17:30 - 18:00 replies back with an ACK now we have a full-fledged connection the kernel removes it and adds it to the accept queue now we have a free connection here now then the process calls accept which will copy that connection creates the content creates a file descriptor and copy it the to the user's file space and it will create an entry yeah that says all right now we have an actual
- 18:00 - 18:30 connection and it will create the two receive queue the the receive queue and the send queue for that particular connection right now we have a file descriptor right here so problems with accepting connections let's talk about some problems backend doesn't accept fast enough clients who don't ACK and we have a small backlog if you have a small backlog you might you might have a problem with um you
- 18:30 - 19:00 know you call accept by the way if you call accept and this is empty you're blocked and that's why accept is also a candidate for asynchronous IO which we're going to talk about in another lecture beautiful stuff you guys let's talk about socket sharding there are two different approaches to this you know I'll talk about this is the second one which is literally getting a
- 19:00 - 19:30 different distinct socket on the same port so we have two processes that are not talking to each other yet listening on the same socket so you have two sockets pointing to the same um IP port pair listener versus the other approach where you have a listening socket and you fork
- 19:30 - 20:00 the process to get a copy of the file descriptor which happened to point to the same logical physical sorry socket so if those two guys we're sharing one pair of SYN and accept queue here this is an approach where we're getting two distinct socket pointing to the same entry and you do this using this socket
- 20:00 - 20:30 option SO_REUSEPORT because normally if you listen on the same port and IP you get an error error remember like socket already listening that's a common error but with this you can actually get to another socket and you can do it any number of time now this poses it and and we'll we'll talk about this now in a minute does it Envoy does it HAProxy does
- 20:30 - 21:00 it and essentially what you do is you have the listener and the acceptors and the readers right the acceptors all of these are just different threads or different processes and this is what is called the socket sharding where this process will have a socket and this process will also have a socket and both represent the same IP port now this poses a problem because if a client came
- 21:00 - 21:30 in with a SYN what do we do in this particular case which socket gets this SYN and now I I'm imagining and I don't know the detail about this and I imagining either you can have two SYN queues or one SYN queue for both sockets but different accept queues for sure right so in this particular case process one will have an accept process two will have another accept queue for the
- 21:30 - 22:00 same port if I get a SYN on this port complete the connection ready the kernel will do a load balancing it will throw the accept queue right it will throw the first connection on the first accept queue the second one on the second the third one on the third and so on right and that's how it that's how they do it so it will distribute the connections on different
- 22:00 - 22:30 accept queues and if it happens to have on this accept queues this process calls accept on its own socket which will point to their own accept queue and in this particular case we call accept on my socket which happens to but both are on the same NIC on the same IP and port this is a common configuration that people do in order to scale acceptance right and this is
- 22:30 - 23:00 referred to as socket sharding we're sharding the socket but the Linux kernel in this particular case does the load balancing so there is uh when they first introduced this there was a lot of bugs with this because uh as connections close and open we lost the balance of uh when when you have a connection you're done the connection is like you have a nice file descriptor to go to the problem with
- 23:00 - 23:30 this is only when you accept connections the load balancing becomes really interesting as as sockets get destroyed and recreated again all right second example receive buffer so receive buffer so the data comes in and we know the connection we know the destination IP and we know the destination port the kernel identifies it right and uh we might say what if I have multiple connections well we use four-tuples and
- 23:30 - 24:00 this must be unique the source port and the source IP will give us beautiful uniqueness so even if the same client establishes a thousand connections you're going to get a thousand different source ports even if the IP the source IP is the same you're going to get a thousand different source ports so you're safe right so the source port changes the source IP doesn't change the destination IP doesn't change the destination port doesn't change the only thing that changes is the port if if the same
- 24:00 - 24:30 client sends a creates a thousand connection you put the data here nice and then you acknowledge right and you can choose like as a kernel you know what i'm going to wait a little bit more for more packets to arrive and then the process calls read which copies and that's what I talked about there's a copy from this kernel to the process so copy involves going to the processor the proc because the processor does a read from this
- 24:30 - 25:00 memory address copies to the cache lines and then from cache line does a write to the new memory location that's how a copy is memory copy is is not cheap it involves going to buses to different memory addresses so so many copies so imagine copying arrays right large arrays all of this stuff goes through the CPU right again unless you have like direct memory access from me one memory to memory just copy this stuff which is also expensive
- 25:00 - 25:30 just to set up as we talked about right all right once you read it it goes back to the process ID that's a copy and we talked about zero copy in general like how do you do take this uh it depends like what do you do right if you want to send data from one kernel structure to another kernel structure you can use zero copy right by actually telling the kernel hey just you have these two things do them right but it's very hard to do send buffers the client want to
- 25:30 - 26:00 send something to the to the server to the client that sorry the server wants to send back or even the client want to send and the send buffers exist on both right so you send something it goes to the buffer and it sits here and that's what the critical part here when you send something it doesn't go to the network exactly like the page cache receive buffer send buffer and the send buffer will just not immediately flush it to the network no no no no no no no that's a
- 26:00 - 26:30 bad idea we will wait we will write write write accumulate something from the user more then we're going to flush it and this is where something called Nagle's algorithm kicks in and like hey Nagle's algorithm actually extend that period how how long to do we wait before we flush to the network so yeah eventually we send that data and we get back an ACK and that's it problems with reading and sending you have clients that does not
- 26:30 - 27:00 read fast enough if the back end doesn't read fast enough you fill up your your receiver queue if your receiver queue gets filled the clients slow down how does this this is all determined by the something called a flow control flow control will tell the client hey my receive buffer is actually full slow down your writes it's a beautiful you know symphony this whole thing summary so
- 27:00 - 27:30 the kernel manages the TCP IP part of networking at least every socket represent a port and IP right you can with a specific socket options when you listen you can say hey I want to listen on the same port and the same IP despite someone else is using it but it's not straightforward you those two processes have to agree on certain you know on
- 27:30 - 28:00 certain cookie or certain key to allow them to listen on the same port and this is basically determined by the first process if you don't know that key then you can't listen because guess what what stops me from running a process in a server that happen to be listening on port 80 and I do hey socket reuse 80 I can hijack packets as a malicious
- 28:00 - 28:30 process that is a bad idea that was one of the bugs early early so to to fix that you need to know what was the key that was used in the listening and was socket option reuse port actually allowed or not can you listen on the same port or not so it's very very interesting to learn about all of this stuff so each connected client gets a connection we talked about that and kernel manages the all those beautiful data structures you know and there is an
- 28:30 - 29:00 associated cost all this you guys you know and it's not very clear from just the listening part cuz you don't really shove a lot of data but for the connection acceptance I mean not acceptance per se but but let just just receiving and actually dealing with packets and parsing those packets there's so much stuff here that I'm not going to go through right that's a course by itself that's a book you know because you're
- 29:00 - 29:30 talking about the actual packets how do and there's like when you receive a packet from the from receive data from the NIC the kernel creates a data structure for it it's called skb uh socket buffers right and then adds data to it and then that there is metadata associated with it adds overhead so the kernel does all sorts of trick to merge and to merge packets together and ha
- 29:30 - 30:00 have less headers do so much stuff so you need if you are like a beefy machine you need to assign a core or two just for the networking stack you know says all right I'm going to leave one core or two just for the kernel to do this all this complex networking stuff you know especially if you have like a beefy CPU
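The send and receive buffers, Nagle's algorithm, and flow control discussed in the transcript can be observed from user space with a short Python sketch. This is an illustration under stated assumptions, not code from the video: the helper names `connected_pair`, `disable_nagle`, and `fill_until_blocked` are hypothetical, and the number of bytes the kernel accepts before pushing back depends on the operating system and its buffer tuning, so no specific figure is claimed.

```python
import socket


def connected_pair():
    """Return (listener, client, server_side) connected over TCP loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the kernel pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server_side, _peer = listener.accept()
    return listener, client, server_side


def disable_nagle(sock: socket.socket) -> int:
    """Set TCP_NODELAY so small writes are not held back by Nagle's algorithm.

    Returns the option value read back from the kernel (nonzero when enabled).
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)


def fill_until_blocked(sender: socket.socket, chunk_size: int = 64 * 1024) -> int:
    """Keep writing while the peer never reads; stop when the kernel pushes back.

    With a non-blocking socket, send() raises BlockingIOError once the local
    send buffer is full, which happens after the peer's receive buffer fills
    and flow control stops the transfer. Returns the number of bytes queued.
    """
    sender.setblocking(False)
    chunk = b"x" * chunk_size
    queued = 0
    while True:
        try:
            queued += sender.send(chunk)
        except BlockingIOError:
            return queued


if __name__ == "__main__":
    listener, client, server_side = connected_pair()
    print("client send buffer size:", client.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    print("server receive buffer size:", server_side.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    print("TCP_NODELAY after disabling Nagle:", disable_nagle(client))
    print("bytes queued before the kernel pushed back:", fill_until_blocked(client))
    for sock in (client, server_side, listener):
        sock.close()
```

Once the server side starts calling `recv`, space frees up in its receive buffer and the client can write again, which is the slow-down and resume behavior that flow control provides.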