Essential Guide to Keeping Your Applications Running Smoothly
Application Troubleshooting Best Practices
Estimated read time: 1:20
Summary
In this video, Dr. Lach Henderson walks through the critical steps for troubleshooting an application that has gone down, focusing on the different server tiers and the error codes they produce. He explains the three-tier architecture, how requests flow through DNS and load balancers, and the role of security groups and auto-scaling. The video also covers how backups are managed and how master-slave database setups are configured, and it offers practical steps for tackling common error codes (3xx, 4xx, 5xx), with interview insights throughout the session.
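The video's rule of thumb for mapping an error code to the tier to inspect first can be written as a small lookup. This is a minimal Python sketch: the function name and tier labels are illustrative, and placing 3xx redirect rules in the load balancer or web server configuration is an assumption rather than something the video spells out.

```python
def tier_to_check(status_code: int) -> str:
    """Return the layer to inspect first for a given HTTP status code.

    Rule of thumb from the video: 3xx -> redirects, 4xx -> web servers,
    5xx -> backend (application servers and database). Labels are illustrative.
    """
    if not 100 <= status_code <= 599:
        raise ValueError(f"not an HTTP status code: {status_code}")
    status_class = status_code // 100
    if status_class == 3:
        # 301, 302, ...: a wrong redirect rule (assumed to live in the
        # load balancer or web server configuration).
        return "redirect rules"
    if status_class == 4:
        # e.g. 404: a file was misplaced or a path was modified.
        return "web servers"
    if status_class == 5:
        # The backend is failing or the database is not answering in time.
        return "application servers and database"
    # 1xx/2xx: the application is responding; look elsewhere.
    return "none"


for code in (301, 404, 502, 200):
    print(code, "->", tier_to_check(code))
```

Treat the result as a starting point for investigation, not a diagnosis.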
Highlights
The journey of a user's request starts with DNS resolving the IP address and leads to load balancers that manage traffic.
Error codes (3xx, 4xx, 5xx) provide clues: check logs and server status in sequence to diagnose and fix issues.
Maintain a backup strategy using tools like Lifecycle Manager for critical server components.
Ensure network security by setting up proper security groups; minimize direct access.
Database management often requires manual handling; use proxies and defined access routes for reliability.
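The security-group chaining the video describes (load balancer open to the public, each tier reachable only from the tier in front of it) can be modeled as plain data. The sketch below is a Python model, not the AWS API: group names SG1 to SG4 follow the video's labels, the rule format is an assumption, and port 3306 for the proxy tier is assumed. In EC2 itself, the equivalent is an inbound rule whose source is another security group's ID instead of a CIDR block.

```python
# Ingress rules modeled as {group: [(port, source), ...]}, where a source is
# either a CIDR block or the name of another security group.
# Group names follow the video's SG1..SG4 labels; the proxy port is an assumption.
INGRESS = {
    "SG1": [(80, "0.0.0.0/0"), (443, "0.0.0.0/0")],  # load balancer: public
    "SG2": [(80, "SG1"), (443, "SG1")],                # web servers: LB only
    "SG3": [(7001, "SG2")],                            # app servers: web tier only
    "SG4": [(3306, "SG3")],                            # proxy tier: app tier only
}


def is_allowed(target_group: str, port: int, source: str) -> bool:
    """True if `source` (a group name, or "internet") may reach `port`
    on instances that carry `target_group`."""
    for rule_port, rule_source in INGRESS.get(target_group, []):
        if rule_port != port:
            continue
        # A 0.0.0.0/0 rule admits everyone; otherwise the source group must match.
        if rule_source == "0.0.0.0/0" or rule_source == source:
            return True
    return False


print(is_allowed("SG2", 443, "SG1"))       # True: LB -> web
print(is_allowed("SG2", 443, "internet"))  # False: direct to web is blocked
print(is_allowed("SG3", 7001, "SG1"))      # False: LB cannot skip the web tier
```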
Key Takeaways
Master the 3-tier architecture: web servers, application servers, and databases are your main troubleshooting areas.
Understand error codes! Each class (3xx, 4xx, 5xx) indicates where to focus your troubleshooting.
Security is key! Use security groups wisely to protect your setup.
Auto-scaling helps maintain infrastructure resilience by managing server counts dynamically.
Maintain regular backups to prevent data loss. Use Lifecycle Manager for server backups.
Limit server exposure by configuring security so that only specific connections are allowed.
Understand the flow of requests through DNS, load balancers, and web servers to tackle problems efficiently.
Be cautious with database auto-scaling; manual scaling is often more reliable.
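Auto-scaling's role can be illustrated with a toy model: the group keeps a desired number of healthy servers, and every replacement is launched from the same template, so it automatically carries the security group the load balancer trusts. This is a simplified simulation with hypothetical class names, not the EC2 Auto Scaling API.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    security_groups: tuple
    healthy: bool = True


@dataclass(frozen=True)
class LaunchTemplate:
    security_groups: tuple


class AutoScalingGroup:
    """Toy model: keep `desired` healthy instances, each launched from the
    same template (so each carries the template's security groups)."""

    def __init__(self, template: LaunchTemplate, desired: int):
        self.template = template
        self.desired = desired
        self.instances = []
        self._ids = itertools.count(1)
        self.reconcile()

    def _launch(self) -> Instance:
        return Instance(f"i-{next(self._ids):04d}", self.template.security_groups)

    def reconcile(self) -> None:
        # Drop unhealthy instances, then launch replacements up to `desired`.
        self.instances = [i for i in self.instances if i.healthy]
        while len(self.instances) < self.desired:
            self.instances.append(self._launch())


asg = AutoScalingGroup(LaunchTemplate(security_groups=("SG2",)), desired=3)
asg.instances[0].healthy = False  # one web server goes down
asg.reconcile()
print([(i.instance_id, i.security_groups) for i in asg.instances])
```

This is also why the video keeps database masters out of auto scaling: a group that freely terminates and replaces instances knows nothing about which server is the master.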
Overview
Dr. Lach Henderson dives deep into application troubleshooting, unpacking the three-tier architecture you need to master. With web servers, application servers, and databases at play, having a clear strategy is paramount. He highlights DNS and the load balancer as the first touchpoints of every request, and therefore the first places to check when tracing connection failures and bottlenecks.
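Because every request passes DNS, then the load balancer, then the web, application, and database tiers, one practical habit is to check the tiers in that order and stop at the first failure. A minimal Python sketch follows; `first_failing_tier` and `dns_resolves` are hypothetical helpers, and every check except DNS is a placeholder you would replace with a real probe.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple


def dns_resolves(hostname: str) -> bool:
    """First hop: can the name be resolved to an address at all?"""
    try:
        return bool(socket.getaddrinfo(hostname, 443))
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def first_failing_tier(checks: Sequence[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run checks in the order a request travels and return the first tier
    whose check fails (or raises), or None if every tier looks healthy."""
    for tier, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            return tier
    return None


checks = [
    ("DNS", lambda: dns_resolves("localhost")),  # swap in your own domain
    ("load balancer", lambda: True),             # placeholder: probe port 443 on the LB
    ("web servers", lambda: False),              # placeholder: pretend this check failed
    ("application servers", lambda: True),       # placeholder
    ("database", lambda: True),                  # placeholder
]
print(first_failing_tier(checks))
```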
Error codes are the messengers: each status class points to the part of the architecture where the fault most likely lies. From redirection problems to unresponsive databases, understanding these codes lets you target your troubleshooting effort. Security setup matters just as much, with the emphasis on preventing issues, not just fixing them, through careful configuration of security groups and a clear access hierarchy.
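One concrete reason to reference security groups instead of whole address ranges: a VPC-wide CIDR such as 10.0.0.0/16 (the example used in the video) also covers testing and pre-prod machines. Python's standard `ipaddress` module makes this easy to see; the host names and addresses below are made up for illustration.

```python
import ipaddress

# The VPC range from the video; the host addresses below are hypothetical.
vpc = ipaddress.ip_network("10.0.0.0/16")
hosts = {
    "prod-app-1": ipaddress.ip_address("10.0.1.15"),
    "testing-1": ipaddress.ip_address("10.0.9.40"),
    "preprod-1": ipaddress.ip_address("10.0.20.7"),
}


def admitted_by(cidr, hosts: dict) -> list:
    """Names of hosts whose address falls inside `cidr`, i.e. hosts that an
    inbound rule with that CIDR as its source would let through."""
    return sorted(name for name, ip in hosts.items() if ip in cidr)


print(vpc.num_addresses)        # 65536 addresses in a /16
print(admitted_by(vpc, hosts))  # testing and pre-prod hosts get in too
# A single-host rule admits only that host.
print(admitted_by(ipaddress.ip_network("10.0.1.15/32"), hosts))
```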
The video wraps up by stressing the importance of backups and database management. Auto-scaling groups provide a safety net for the web and application tiers by replacing failed servers, while a carefully configured database setup (a master for writes, slaves for reads, and full backups taken from a slave) protects data integrity and performance. Knowing which tiers to scale automatically and which to scale manually, and when to open access versus lock it down, is key to robust infrastructure management. Enjoy this insightful journey with Server Gyan!
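The read/write split the video assigns to the proxy layer (reads to slaves, writes to the master) can be sketched as a tiny router. ProxySQL itself does this with configurable query rules that map regular expressions to hostgroups; the class below is a simplified stand-in under that assumption, with hypothetical host names, not ProxySQL configuration.

```python
import itertools
import re

# Simplified read/write split in the spirit of ProxySQL query rules:
# SELECT ... FOR UPDATE must go to the master, other reads to a slave.
FOR_UPDATE = re.compile(r"^\s*SELECT\b.*\bFOR\s+UPDATE\b", re.IGNORECASE | re.DOTALL)
READ_ONLY = re.compile(r"^\s*(SELECT|SHOW)\b", re.IGNORECASE)


class ReadWriteRouter:
    def __init__(self, master: str, slaves: list):
        if not slaves:
            raise ValueError("need at least one slave for reads")
        self.master = master
        self._slaves = itertools.cycle(slaves)  # round-robin across slaves

    def route(self, sql: str) -> str:
        if FOR_UPDATE.match(sql):
            return self.master  # locking read: must see the master's data
        if READ_ONLY.match(sql):
            return next(self._slaves)
        return self.master  # INSERT, UPDATE, DELETE, DDL, ...


router = ReadWriteRouter("db-master", ["db-slave-1", "db-slave-2"])
for q in ("SELECT * FROM orders",
          "UPDATE orders SET paid = 1 WHERE id = 7",
          "select id from orders for update",
          "SHOW TABLES"):
    print(router.route(q), "<-", q)
```

A real proxy also has to keep statements inside a transaction on the same server and cope with replication lag; this sketch ignores both.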
Application Troubleshooting Best Practices Transcription
00:00 - 00:30 And once again, welcome back to my channel, Server Gyan. My name is Dr. Lach Henderson, and today we are going to discuss a very important question: how will you troubleshoot if your application is down? This is a critical question because it involves multiple components, and you have to know where to start. That is what we are going to discuss in this video, but before starting, I would like to request you to please like, share, and subscribe to my channel if you are new to this. OK, so
00:30 - 01:00 when we say that the application is down, what does it mean? There are multiple error codes we usually see in browsers: the 300 series, the 400 series, or the 500 series. Codes like 301 and 302 mean there is some redirection problem, and a redirection problem will be based on these particular servers. So this is a three-tier architecture. First of all, we have web servers here,
01:00 - 01:30 then we have application servers here, and then we have databases. Fine, so this is the three-tier architecture. So how does the journey begin for a user? Whenever an actor, meaning any of these users, hits our website, let's say servergyan.com, the request first comes to Route 53. Route 53 is working for us as DNS, so whenever
01:30 - 02:00 any request comes in, Route 53 will resolve the IP address and forward the request to the load balancer. This load balancer will accept traffic on ports 80 and 443. There is one more interview question here: which ports would you open on this load balancer? Ports 80 and 443 will be opened to the public, because we do not know what the IP addresses of our clients are going to be. These two users are our clients, our customers, so we have no idea what the IP
02:00 - 02:30 addresses of these end users will be, so we open it to the public. Now, these are three web servers, though there can be any number of them. Ports 80 and 443 will be opened here too, but from where? We will open ports 80 and 443 from security group 1 only, because these servers sit in the production zone, which means there is no public IP address associated with them, and
02:30 - 03:00 we do not want anyone to be able to connect to these servers directly. Whoever has to connect to these servers will connect through this load balancer only. So we will allow security group 1 in this security group. As we know, we can use one security group as a reference within another security group, so what we are going to do here is whitelist security group 1 within security group 2. It means wherever this
03:00 - 03:30 security group is associated, those resources will be able to access ports 80 and 443 on these web servers. Security group 1 is associated with this load balancer, so the load balancer will be able to forward traffic to these servers, and the servers will accept it. Now, these servers will be managed by auto scaling; this icon shows the auto scaling group. This auto scaling group will manage the number of servers. If a new server is created, by default it comes from a template where we have associated this group, and this group has
03:30 - 04:00 a reference to security group 1. It means that whenever a new server is created, for example when one of these servers goes down and a new server is created in its place, that server will also accept traffic from this load balancer. Now we have these application servers. These application servers are behind another target group, and we know that port 7001, or whatever port, is open for our application. We do not want to accept traffic from
04:00 - 04:30 anywhere else. These are production servers, and traffic should come via the web servers only. So what we are going to do here is give a reference: only the resources where security group SG2 is associated are allowed to hit port 7001, because these are application servers, and application servers get traffic from these web
04:30 - 05:00 servers only. Fine, so this is how we are going to manage this. Moreover, there is one more interview question that usually comes up: if a new server comes in, what happens? The auto scaling group manages it, and the launch configuration or launch template, whichever we use, carries the security group reference. Whenever new servers are created or old servers are terminated, it ensures that newly launched servers have SG3, which references SG2. These servers will
05:00 - 05:30 not accept traffic directly from the load balancer, because that traffic has to come via this particular web server stack. Fine, now what next? There is one more interview question: how do you take backups of this? We can take either file-level backups or entire-server backups. Entire server means, for example, this web server is running on an EC2 machine, so what we can do here is use Lifecycle Manager. We have already created a
05:30 - 06:00 video on Lifecycle Manager, so you can check that out. Using Lifecycle Manager, we can back up these servers hourly or daily, based on the requirement or on the criticality of the servers. If a server can easily be reconfigured, maybe once a day is good enough, but if we really need a very aggressive backup policy, we can take a backup every two hours. The same thing goes for this server. We can take a backup of only one
06:00 - 06:30 server. It is not required to back up all the servers, because if we have a backup of one, we can create an AMI out of it and create multiple servers using this auto scaling group. We create an AMI of this server, and if we need to boot new servers from it, we associate that AMI with the launch configuration or launch template under this auto scaling group, and that will spin up new servers for us. Fine. Now what next? There are multiple database servers
06:30 - 07:00 as well. We have a database master, which could be running MySQL, PostgreSQL, MS SQL, Oracle, or whatever. We have a master as well as slaves. What we want here is to configure these servers so that whenever a request comes in, read requests go to the slave servers and write requests go to the master server. So we can have two ProxySQL servers, and we can place
07:00 - 07:30 them behind a target group, and that target group will be managed by an auto scaling group, so if these servers are terminated, new servers will be created in their place. That is how we can manage that. Moreover, these ProxySQL servers should accept traffic from the application servers only, because the web servers will not connect directly to these database servers. Moreover, if there is a requirement to hit this particular database from the application
07:30 - 08:00 servers directly, then what we are going to do here is open this particular SG4 for two security groups: SG3 as well as SG2. If there is a requirement to open connections to this database from the web servers as well as from the app servers, we can give both security groups as references for port 1515. The ProxySQL servers will then accept connections only from those servers. Fine, so this is how we can
08:00 - 08:30 manage it. Moreover, we will take a backup of only one server, and if anything goes wrong, we can create a new AMI out of it, hand the AMI over to this auto scaling group, and it will spin up new servers for us. Now let's come to the master. There is no auto scaling group here, because it is not recommended to put database servers in an auto scaling group: there is additional configuration where we need to define the master, as well as the binary log
08:30 - 09:00 position we need to define when we declare a slave. If this server is under an auto scaling group, the auto scaling group may terminate the master, and the infrastructure can go down. So rather than scaling horizontally, it is always recommended to scale vertically, or if horizontal scaling is required, we should do it manually. Fine, or if you really have that much scalable infrastructure
09:00 - 09:30 for the master, then you will have to manage how you are going to configure that. Now there is one more interview question: how do you back up this database? You can take a full backup, maybe hourly, maybe daily, or whatever you want. Whenever a backup is taken, it should be initiated from a slave server only, because while a backup is running, some tables may be locked for the time being. So full backups should be initiated from one of the slaves. Moreover, if
09:30 - 10:00 you want incremental backups, you can take them using the binary log or whatever mechanism you prefer, maybe every five, ten, or 30 minutes. Now, from where will these servers accept traffic? First of all, from the proxy servers, or if there is an additional requirement, you can open the security group to other sources as well. But we should not assume that we can open a complete range of IP
10:00 - 10:30 addresses. For example, we have a CIDR of 10.0.0.0/16, and there could be testing servers, staging servers, or pre-prod servers in that range, so we should not do that. If someone runs a DELETE query and that query is triggered from a testing server, it would hit production. So we need to ensure that testing servers do not have the same security group as the production servers. Fine. Second, if we need to provide
10:30 - 11:00 additional access to these databases, the slave databases should accept connections from some office IP addresses or other specific IP addresses, because sometimes we want to give production access to teams such as the analytics team so they can run queries on the live server. But we should give them read-only access, and only on these slaves, since they sometimes need to run real-time analytics. So we can provide them access, and this security group will
11:00 - 11:30 allow them to connect via a VPN or a jump server. This server will allow only those users to connect on port 3306 who come from this jump server or through this VPN server. So this security group will allow the VPN server as well as the jump server to access port 3306. Now there is one interview question: how do you ensure security? We have talked about application security; now if we talk about server security,
11:30 - 12:00 port 22 will be open on these servers. The load balancer does not allow us to log in, so there is no need to have port 22 open there. On the application servers, we will allow port 22 for configuration, checking logs, and managing other things. So the web servers, application servers, proxy servers, and MySQL servers will allow port 22 only from your office address, the VPN server, or the jump server. Port 22 should not be opened to the public; if it is open to the public, then
12:00 - 12:30 it can definitely be a threat to your organization. Moreover, these servers do not need a public IP address, so no one can log in to them directly when they have only private IP addresses. Either a jump server, which you can also call a bastion host, or a VPN server is required to access them. Fine, so this is how you set up your infrastructure. Now the question is, when we are talking about
12:30 - 13:00 these error codes: if they are 3xx, we need to check here whether a wrong redirection is in place. If it is 4xx, we need to check these web servers: someone might have misplaced a file, or some path may have been modified so that no data is available there. So first of all we will log in to these servers and check. What is the sequence after logging in? First, we check whether the service is running or not. For
13:00 - 13:30 checking that, we can use the netstat command. The netstat command tells us two things: first, whether the port that is supposed to be open here is actually open, and if it is, which service is running on it. Someone may have misconfigured it; maybe an Apache service is running, but previously our services were running on nginx. Both use ports 80 and 443, so if someone has misconfigured the server, that can cause this error. Moreover, we will check the disk, to see how much disk space is
13:30 - 14:00 available, then we will check the load on these servers, and then the free memory available on them. So first we check these things, and later we check the application logs to see what is being printed. If everything is fine at this level, we go to our application servers and check load, CPU, memory, and disk space. If everything seems fine but our server is still working slowly, we have to check whether our network is becoming a bottleneck
14:00 - 14:30 or whether the application has too many files open, and for that we will use the lsof command. If there are other errors, we will check the application logs and proceed based on them. Then we will check connectivity with the database. If the error code is 5xx, there is definitely some problem on the backend side, and backend here means the database side: the application is sending requests to the database, but the database is not responding in time. So we will log in to
14:30 - 15:00 the database and check its current status: how many connections it is accepting, how many open connections it has, how many processes are running, as well as load, free memory, and of course the error logs and other database logs. These are the things we will check on the application servers and on the database servers as well. Fine, so first we check load, then processes, then free memory, and finally we will check
15:00 - 15:30 the application logs. So: disk, load, CPU, memory, and finally logs. These are the things we need to check when working with these application servers, and if this comes up in an interview, now you definitely know it. So how do you answer the question of how you will troubleshoot if your application is down? Based on the error code: if it is 3xx, then these servers; if it is 4xx, then these servers; if it is 5xx, then these servers. Fine, so these servers
15:30 - 16:00 will be responsible for fixing our infrastructure. Have a good time and happy learning from Server Gyan. If you have any further questions, please write in the comment box; I shall be happy to assist you. Thank you very much, have a good time, happy learning.
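The on-server checklist from the transcript (service and port, disk, load, memory, then logs) maps onto commands such as netstat, lsof, df, uptime, and free. As a rough, scriptable sketch using only the Python standard library, the first four checks might look like this. The function names are hypothetical, `os.getloadavg()` is Unix-only, and the memory check reads `/proc/meminfo`, so it only reports a value on Linux. Reading the application logs still has to be done separately.

```python
import os
import shutil
import socket


def port_is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Similar in spirit to checking netstat: does anything accept connections here?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def disk_free_percent(path: str = "/") -> float:
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def mem_available_kb():
    """MemAvailable from /proc/meminfo (Linux only); None elsewhere."""
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1])  # value is reported in kB
    except OSError:
        pass
    return None


def quick_host_check(port: int, path: str = "/") -> dict:
    """Service/port first, then disk, memory, and load."""
    report = {
        "port_listening": port_is_listening(port),
        "disk_free_pct": round(disk_free_percent(path), 1),
        "mem_available_kb": mem_available_kb(),
    }
    if hasattr(os, "getloadavg"):  # Unix only
        report["load_1m"] = os.getloadavg()[0]
    return report


print(quick_host_check(80))
```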