Summary
In this detailed video by The CS Classroom, viewers are introduced to the intricate world of databases as covered in the IB Computer Science curriculum, including both SL and HL content for Option A. The creator breaks down fundamental concepts such as the difference between data and information, database management systems, and SQL commands, moving towards more complex topics like normalization, data integrity, and various data mining methods. The video aims to provide viewers with an understanding of these concepts, illustrated with examples and practical applications, helping students in preparing for their IB exams. Additionally, there's a focus on the more advanced HL topics, such as object-oriented databases and data warehouses.
Highlights
Introduction to both SL and HL database topics for the IB Computer Science curriculum 🎓.
Breakdown of essential database concepts such as tables, primary and foreign keys 🔑.
Hands-on examples of SQL commands for typical database operations 🖥️.
In-depth explanation of normalization and its importance in database structuring 📏.
Introduction to HL topics like object-oriented databases and the role of data warehouses 🌐.
Key Takeaways
Databases hold collections of data organized for easy access and manipulation 📚.
Normalization is crucial to reduce redundancy and improve data integrity 🌟.
SQL is utilized for managing database operations like storing, retrieving, updating, or deleting data 💻.
Advanced data concepts like data mining can reveal patterns and trends in large datasets 📊.
Understanding database management systems is essential for efficient data handling and security 🔐.
Overview
Welcome to a comprehensive journey into the realm of databases, tailored specifically for IB Computer Science students tackling Option A. This video, crafted by The CS Classroom, serves as a robust guide covering everything from fundamental terminologies to advanced concepts, ensuring both SL and HL students are equipped for their exams.
Dive deep into essential database components and understand the functionality of tables, keys, and database management systems. Discover the importance of SQL in managing database operations effectively. The video further illuminates the process of normalization, crucial for optimizing database efficiency.
As we step into the advanced HL topics, explore the nuances of object-oriented databases and the expansive world of data warehouses. Learn how these systems cater to large data volumes and facilitate complex data analyses, providing invaluable insights into real-world applications and industry practices.
Chapters
00:00 - 01:30: Introduction and Notes The chapter introduces 'Option A - Databases' which includes both Standard Level (SL) and Higher Level (HL) content pertinent to Paper 2 for the course. The speaker emphasizes that this resource will cover both SL and HL materials, and suggests checking the timestamps in the video description to skip to specific content. The video is expected to be lengthy due to the comprehensive nature of the topic.
01:30 - 04:00: Fundamentals of Data and Information This chapter starts with an emphasis on the flexibility of the video, allowing viewers to focus on the parts they find useful while skipping unnecessary details. Study guides for both SL and HL Option A are mentioned as part of the learning materials. The core of the chapter is dedicated to explaining the fundamental distinction between data and information.
04:00 - 06:00: Overview of Databases The chapter provides an introduction to databases, which are systems used to store and manage data. Data itself is described as raw facts and figures that need context to become useful information. An example given is a score: 'Jason got a 95 out of 100,' which, when given context, becomes information such as 'Jason scored 12 points higher than the class average.' This forms the basis of understanding how databases structure and interpret data to provide information.
09:00 - 12:00: Primary and Foreign Keys The chapter discusses the concept of primary and foreign keys in databases. It highlights the importance of these keys in organizing data and ensuring its integrity. Primary keys uniquely identify each record in a table, while foreign keys link records between tables. This distinction is crucial for managing data relationships and enforcing referential integrity, which ensures that relationships between tables remain consistent. The chapter emphasizes understanding these keys as fundamental to working with databases effectively.
12:00 - 16:00: Data Types in Databases This chapter discusses the concept of databases as organized collections of data related to a specific topic or context. It emphasizes the structure of tables within databases, noting that while not all databases rely on tables, the focus here is on relational databases, which do. The chapter also introduces the role of Database Management Systems (DBMS) in creating and storing databases, and mentions the use of queries within these systems.
16:00 - 20:00: SQL and Select Statements The chapter introduces SQL, known as Structured Query Language, used to store, retrieve, delete, and edit data in databases. It is described as a formal language with a specific syntax that allows for querying databases and sending commands to obtain information. An example of a database with tables is mentioned, suggesting a similarity in structure to familiar tabular formats.
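The select-from pattern this chapter describes can be sketched in Python with the standard-library sqlite3 module; the table name, fields, and rows below are illustrative, not taken from the video:

```python
import sqlite3

# A minimal sketch of querying a relational database with SQL.
# The products table and its rows are invented for illustration.
conn = sqlite3.connect(":memory:")  # in-memory database for demonstration
cur = conn.cursor()

cur.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, "
    "product_name TEXT, price REAL)"
)
cur.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Chais", 18.0), (2, "Chang", 19.0), (3, "Aniseed Syrup", 10.0)],
)

# Retrieve one field from the table with SELECT ... FROM.
cur.execute("SELECT product_name FROM products ORDER BY product_id")
names = [row[0] for row in cur.fetchall()]
print(names)  # ['Chais', 'Chang', 'Aniseed Syrup']

conn.close()
```

The same connect / execute / close rhythm appears later in the video when the CREATE TABLE example is discussed.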
20:00 - 21:00: Complex SQL Queries This chapter delves into the structure of Excel spreadsheets, specifically focusing on the concept of columns and their various names such as attributes or fields. It explains that these columns, which can include data such as order IDs, product names, and customer IDs, contain sets of values. The chapter highlights how each table is composed of these fields or columns, along with rows of data.
26:00 - 27:00: Secondary and Candidate Keys The chapter discusses the structure of a database table, focusing on how each row (record) is related through keys. It uses the example of an order table with fields like order id, number of items, item type, and customer id. The data shows the relationship between orders and customers, illustrating the concept of candidate keys.
31:00 - 32:00: Database Schema The chapter 'Database Schema' discusses the structure of a database, emphasizing the role and interrelation of tables. It explains how tables in a database are interconnected, using an example of an 'orders' table and a 'customers' table. The 'orders' table includes a 'customer ID' field, which acts as a foreign key linking to the 'customers' table, aligning IDs across tables to map and access customer data accurately. This illustrates the foundational concept of relational databases, where such keys help in maintaining relationships between different sets of data.
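The orders-to-customers link described in this chapter can be sketched as follows; the table layouts and sample rows are made up for illustration, but the foreign-key mechanics match the chapter's description:

```python
import sqlite3

# Sketch (with invented data) of how a foreign key in an orders table
# links each order back to a row in a customers table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, product TEXT, "
    "customer_id INTEGER REFERENCES customers(customer_id))"
)
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "Paper", 2), (2, "Pens", 1)])

# Follow the foreign key: for each order, look up the matching customer row.
cur.execute(
    "SELECT orders.order_id, customers.name "
    "FROM orders JOIN customers ON orders.customer_id = customers.customer_id "
    "ORDER BY orders.order_id"
)
linked = cur.fetchall()
print(linked)  # [(1, 'Ben'), (2, 'Asha')]
conn.close()
```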
32:00 - 33:00: Relational Databases The chapter titled 'Relational Databases' discusses the structure and function of databases, particularly focusing on tables composed of fields and records. These tables are interconnected through foreign keys, creating a cohesive database system. The example provided involves customer IDs and a customers table to retrieve respective customer information, illustrating how these linked tables form a database, which could be named according to its purpose, such as 'customer orders' or 'customer data'.
35:00 - 39:00: Database Integrity The chapter titled 'Database Integrity' explains the utilitarian role of databases. Databases are essential tools for processing, sorting, searching, and querying data. Furthermore, they facilitate the generation of reports and allow users to retrieve specific data elements according to different criteria.
43:00 - 46:00: DBMS and Features The chapter discusses the use of databases to search for information and present it in a specific format, allowing for access to certain pieces of information. It highlights the capabilities of databases to support data validation and verification, ensuring data integrity. The chapter emphasizes the standardized format of databases, which permits multiple systems to access and share the database using the same protocols. It also notes the ability of databases to store a substantial amount of information.
49:00 - 51:00: Database Transactions and ACID Properties This chapter discusses the advantages of using databases over Excel spreadsheets for handling large quantities of records. It also includes a personal anecdote from the narrator, who has Indian roots, reminiscing about India during the 90s.
56:00 - 59:00: Data Integrity and Redundancy The chapter discusses the concept of databases and how India has advanced in digitizing various sectors, especially government-related data, highlighting the progress made in societal digitization. Despite this, traditional methods may still persist in some areas. The chapter also distinguishes between 'data validation' and 'data verification,' emphasizing their purposes within databases.
60:00 - 62:00: Normalization Concepts The chapter titled 'Normalization Concepts' introduces the idea of data verification and validation in databases. It explains that data verification involves ensuring the input matches expected values in the database, such as checking login details like emails and passwords against the database records during login processes on platforms like Facebook or Instagram. Additionally, it briefly mentions that data validation occurs within the database, which entails checking that the input complies with rules for the expected type of input.
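The verification step described above can be pictured with a minimal sketch; the users table, its columns, and the login_ok helper are all hypothetical, and a real system would store salted password hashes rather than raw values:

```python
import sqlite3

# Illustrative sketch of data verification: checking that an entered
# email/password pair matches a stored record. (Hypothetical schema;
# real systems never store plain-text credentials.)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (email TEXT PRIMARY KEY, password_hash TEXT)")
cur.execute("INSERT INTO users VALUES (?, ?)", ("a@example.com", "hash123"))

def login_ok(email: str, password_hash: str) -> bool:
    """Return True if the email/password pair exists in the users table."""
    cur.execute(
        "SELECT 1 FROM users WHERE email = ? AND password_hash = ?",
        (email, password_hash),
    )
    return cur.fetchone() is not None

print(login_ok("a@example.com", "hash123"))  # True
print(login_ok("a@example.com", "wrong"))    # False
```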
67:00 - 82:00: Normalization Examples The chapter titled 'Normalization Examples' discusses the automation of data entry verification in databases, specifically using the example of entering a credit card number. It explains how databases can automatically check if the entered credit card number meets specified rules, such as length, integer format, and overall correct format, though the process of data verification might not fully be automated within the database itself. The role of databases in broader data verification processes is also highlighted.
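The credit-card length and format check described here could be sketched as a simple rule function; the 16-digit rule and the function name are assumptions for illustration (real card validation also applies checks such as the Luhn algorithm):

```python
# Hypothetical validation rule of the kind the video describes:
# enforce that the input has the expected length and contains only
# digits, the way a database constraint or input check might.
def valid_card_number(value: str) -> bool:
    """Return True if the input is exactly 16 digits."""
    return len(value) == 16 and value.isdigit()

print(valid_card_number("4111111111111111"))  # True
print(valid_card_number("4111-1111"))         # False
```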
82:00 - 90:00: Advantages of Normalization In this chapter, the concept of entities in database tables is discussed. An entity refers to a real-world object or person and is represented by a row in a database table. For example, a movie entity is represented in a table with each row representing a distinct movie. The chapter focuses on how normalization is advantageous when organizing and structuring these entities within a database.
90:00 - 96:00: Database Anomalies The chapter titled 'Database Anomalies' introduces the concept of entities within databases, specifically focusing on entities related to movie stars and characters. It clarifies how an entity is represented by a table in a database and describes the relationship between entities and rows. Each row in a table corresponds to a unique entity, such as a person with specific attributes. The narrative emphasizes understanding entities in the context of database management.
100:00 - 104:30: Database Administration and DDL The chapter discusses the importance of understanding entities in the context of database administration and Data Definition Language (DDL). It mentions that recognizing each row in a table as an entity is crucial, especially when dealing with broader concepts like normalization. The example given is a person's table where each row represents an entity of a person.
111:00 - 116:00: Data Modeling and ERD This chapter introduces one of the fundamental concepts in databases: primary keys. It clarifies that while 'customers database' is often mentioned, it typically refers to a 'customers table' within a database. The distinction is important because a database can have multiple tables, each requiring a primary key, also known as a surrogate key. A primary key is essentially a column with unique values in every row, crucial for identifying individual records.
120:00 - 125:00: HL Content: Object-Oriented Databases The chapter discusses the concept of primary keys in database tables. It emphasizes the importance of primary keys as unique identifiers for each row in a database table. Each value must be distinct to ensure proper identification and access. The example provided highlights how duplicate primary key values lead to ambiguity when retrieving data.
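The uniqueness requirement described here is something the DBMS enforces automatically; a minimal sketch with SQLite (table and data invented for illustration):

```python
import sqlite3

# Sketch showing why a primary key must be unique: the database rejects
# a second row that reuses an existing key value.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO customers VALUES (1, 'Asha')")

try:
    cur.execute("INSERT INTO customers VALUES (1, 'Ben')")  # duplicate key
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
conn.close()
```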
127:00 - 132:00: HL Content: Data Warehousing The chapter covers primary keys and foreign keys. It explains that primary keys are essential for uniquely identifying rows in a table: if two rows shared the same customer ID, a lookup by that ID would return two rows instead of one, defeating the purpose of unique identification. The chapter further elaborates on foreign keys, which are crucial for linking two tables in a database. Using an example, it describes how fields like employee number, employee name, and department number are organized, highlighting the importance of foreign keys in establishing relationships between tables.
136:00 - 176:00: HL Content: Data Mining Methods This chapter provides an overview of foreign keys in database design. It explains that a foreign key is essentially a primary key from another table: a unique identifier in its original table, found within a separate table. The example given involves an employee and department table relationship, where the employee table contains a department-number foreign key corresponding to the primary key of the department table, illustrating how these keys link relational data across tables.
183:00 - 190:00: HL Content: Spatial Databases and Segmentation This chapter continues the discussion of how foreign keys link records across various tables in a database. The example given highlights how a foreign key, such as 101, connects a record from one table to another, providing specific details like the department name 'HR' and the location 'Delhi'. These connections through foreign keys are crucial in managing multiple interconnected tables within a database.
208:00 - 210:30: Conclusion and Additional Resources The chapter discusses key concepts in databases, specifically focusing on primary keys, foreign keys, and data types. Primary keys are essential for uniquely identifying rows in every table, while foreign keys are used to connect different tables. Additionally, it highlights the different data types for fields or columns in a database, mentioning text for strings and character data for single symbols, numbers, or letters.
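The data types listed here map onto column declarations when a table is created; a sketch using SQLite, with field names and values invented for illustration:

```python
import sqlite3

# Sketch of declaring field data types when creating a table, along the
# lines of the name/age/address/salary example discussed in the video.
# SQLite maps these declarations to its own storage classes, but the
# TEXT/INTEGER/REAL keywords are standard SQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE employees ("
    "name TEXT, "     # string of characters
    "age INTEGER, "   # whole number
    "address TEXT, "  # series of characters
    "salary REAL)"    # decimal number
)
cur.execute("INSERT INTO employees VALUES (?, ?, ?, ?)",
            ("Asha", 30, "Delhi", 45000.50))
cur.execute("SELECT name, salary FROM employees")
row = cur.fetchone()
print(row)  # ('Asha', 45000.5)
conn.close()
```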
00:00 - 00:30 hi guys welcome to option A databases this contains both the SL and HL content for paper 2 option A um so just a few notes before we get started so as I just said this video is going to contain both SL and HL content if you're just interested in the HL content for whatever reason go ahead and check the timestamps in the description and you can go ahead and navigate over there this will be the longest video that's just the way it works out I think some people forget that this is an entire
00:30 - 01:00 paper's worth of content so that's kind of the way it goes again if there's a piece of information you don't need to know for the most part you can kind of skip around and just go to the parts that you do find useful additionally I will be releasing study guides for both SL and HL option A so look to the description for study guides let's get started so we're going to kind of start with the fundamentals here and that is the distinction between data and information
01:00 - 01:30 we are working with databases which hold data so you should probably know what it is now data is just raw facts it's unprocessed there's not really any context it's just text and numbers so an example of data would be Jason got a 95 out of 100 on the final exam I guess you have a little bit of context but really what we have is a number which is 95 out of 100. information on the other hand is data plus context so we go from Jason got a 95 out of 100 to Jason scored 12 points
01:30 - 02:00 higher than the class average on the exam or Jason got the highest score in the class now while we will be dealing in data this distinction is important to note or at least the IB thinks so so to conclude data is raw facts information is data given meaning or interpreted by the user it is processed and structured in a specific way it's not just a bunch of numbers or words or whatever it is
02:00 - 02:30 now a database itself is an organized collection of data connected to a specific topic or context it's made up of tables which are often related to each other now this part isn't strictly true there are different types of databases that don't include tables but just for the first part of this video I'm just going to pretend that those are the only types of databases we know about which are also called relational databases a database is created and stored by a piece of software called a database management system and we can use queries
02:30 - 03:00 or commands to store retrieve delete and edit data in databases written in SQL apologies for the typos now SQL is called Structured Query Language so that's an actual language with syntax that we can use to make queries to a database to send commands to a database and get information accordingly now this is an example of a database with tables so a table is a lot like a table you'd see on
03:00 - 03:30 an Excel spreadsheet we have different columns so we have an order ID column a product column a number-of-items column and a customer ID column and each of these columns holds a set of values now these columns are often called attributes they can also be called fields and each one of these tables so it has these fields these attributes or columns and it contains rows of data and each
03:30 - 04:00 row is related, the data in each row is related to each other, so for example right here we have order ID 1 that order corresponds to an order of paper and I guess there are 400 total pieces of paper being ordered it's being ordered by a customer ID and we have five of these rows each of which are called records so in this particular table we've got four fields and five rows now we're talking
04:00 - 04:30 about these tables because a database has many tables it basically consists of tables and those tables are often related to each other so right here we have a table called orders and it has a field called customer ID which is called a foreign key now the IDs in this customer ID column correspond to our customers table right here so if you have an ID of one in our customers table that corresponds to this right here and when we access these customer IDs we
04:30 - 05:00 take these customer IDs and we check our customers table we can get information for that respective customer so we've got tables with fields and Records and these tables are connected to each other by a foreign key and these tables collectively make up our database and this database may be called like customer orders or customer data or something like that so database tables no I wrote that wrong
05:00 - 05:30 my bad rows and fields not fails fields okay cool so let's move on so to give you a bit more data about uh databases and why we would use them databases are useful because they allow us to process data so sort and search data also query data so to go through data and get specific aspects of those data based on whatever we want we can also easily generate reports
05:30 - 06:00 using databases on searched information and in a specific format this allows us to access certain pieces of information and present them in a certain way we have ways to automatically support data validation and verification which we'll go into a bit more in the next few slides a database follows a standardized format so one system will have to access a database using the same protocols another system has to use and therefore a database can be shared by multiple systems and also we can store a greater number
06:00 - 06:30 of records than in an Excel spreadsheet for example because a database is built to accommodate a large number of records this meme right here I guess maybe it's a meme Chinese hackers Indian databases I kind of like so just to give you a bit of context maybe some of you guys have picked up on this but like my well okay I have an Indian name my family is originally from India um but I grew up in the US anyways I think that probably when I visited in like the 90s this is probably what Indian
06:30 - 07:00 quote-unquote databases look like but actually India's made a lot of strides um in digitizing um just in digitization in general particularly involving government related data just all over Society they made a lot of strides however I'm still sure in some schools somewhere this is what the database looks like okay so we talked about data validation versus verification this is one of the purposes of using a database now data verification checks that the
07:00 - 07:30 input matches expected values in a database so for example when we're logging into Facebook or Instagram that's an example of data verification we're putting in our login details and we're checking against the database to make sure that the email and password pair that we entered also exists in the database um now this is what a database is used for data validation actually does take place inside the database and data validation means checking that input follows rules for the type of input entered so for example
07:30 - 08:00 if you're entering a credit card number in a database we can automatically check whether the data entered well basically if we have a credit card number and we try to put that in a database based on the rules that we've specified it's going to check whether that credit card number is of a certain length whether it contains integers whether it follows the correct format and that can kind of be automated in a database now data verification I don't really see that as being automated like within the database but the database is definitely used for data verification in tables in general
08:00 - 08:30 like well in databases in general but specifically with regards to tables we make use of a concept called entities so an entity is a real world object or person and that's represented by a row in a database table so if we go back here right here the entity we're representing is a movie so we're representing a movie well we're representing movies in general this table represents movie entities let's say and each one of these is an
08:30 - 09:00 individual entity I guess those terms are kind of used interchangeably but this is an entity of stars or movie stars um and I guess this is an entity of movie characters the point is when we're talking about an entity we're talking about what is represented by a table so right here if we're representing a person um then an individual any row in this person's table is a person entity those entities have attributes which are
09:00 - 09:30 represented in the table now this may seem like kind of a trivial or kind of just like a a useless concept but when we're talking about tables on a kind of a broad level later on especially when working with Concepts like normalization does come in handy to know what an entity is The Entity is a person we have multiple um you know multiple people in our person's table um yeah so basically representing the entity of a person and you can say each row in a database is a person entity
09:30 - 10:00 now one of the most fundamental concepts to primary or sorry to databases and well really just databases I guess is primary keys now right here this says customers database but I would say this really represents a customers table so remember in a database we have multiple tables and in each table we need to have a primary key which is also called a surrogate key right here now a primary key is a field meaning a column with values that are all
10:00 - 10:30 different they're all distinctive so if you look right here each one of these is a different value and each one of these could be used to identify these people in this in this particular table and primary keys are important because they are unique values that are used to identify any given row in a table so they're used for identification which means they have to be all different because if these are the same for example then it doesn't make sense because when we go to access uh eight zero zero one zero zero zero zero zero
10:30 - 11:00 and customer ID we're going to end up with two values instead of one so it's not uniquely identifying a particular row now primary keys are a fundamental concept and foreign keys are another fundamental concept so right here um we kind of saw this we saw this set of tables before but a foreign key is used to link up two tables so right here we have employee number we have employee name and we have Department number department number is a foreign key
11:00 - 11:30 a foreign key basically just means it's a primary key from another table, it's the primary key of another table but located in a separate table so right here these are all primary keys in our employee table so this is our department table and for department number we have foreign keys each one of these foreign keys are actually primary keys in the department
11:30 - 12:00 table so for example 101 is actually right here and that gives us a department name of HR and a location of Delhi so foreign keys are used to connect records in one table to records in another table and again when we have a database with multiple tables this is how things are generally going to work they're all going to be connected to each other in some way or form maybe through intermediary tables but that's the way it works so foreign keys and primary keys are two
12:00 - 12:30 really important concepts primary keys are used to uniquely identify rows in every table foreign keys are used to connect one table to another now when we're working with tables in a database every field or every column is going to have a specific data type so you may have text which represents a string of values (actually I guess text is just a string) character which may represent one single symbol number or letter
12:30 - 13:00 you may have a Boolean which can be true or false yes or no you may have integers which are whole numbers reals which are decimals or date times which are a combination of a date and a time or just a date and a time now the reason I'm actually going over this is because you may have to identify what data type specific uh fields would have or would match up to given some prompt or given some information about uh a set of data so it's worth knowing
13:00 - 13:30 these different data types and what they represent a lot of them are actually quite similar to what you'd see in programming text is basically just a string real can be represented in multiple ways it can be represented as real though in languages like Java we more commonly call it a double um datetime that's not really like a data type in programming but the date time format is often used in programming in general now again like just to illustrate like if we
13:30 - 14:00 were to actually start working with tables what we'd have to do is we'd have to create a table and this right here is SQL which we're going to learn about in a second and we're using it to create a table and when we create that table we're going to specify the fields and their data types as well as possibly some additional characteristics so right here we've written a query using SQL the language we use to interact with the database and this creates a table with four fields for a name age address and salary
14:00 - 14:30 and the name is going to have a text data type age is just going to have an int data type address is going to be a series of characters salary is going to have a real data type because it's going to be a decimal now you will have to understand SQL on the exam you're never going to have to write this code in the exam but I just wanted to put this here so you have more real world understanding of what tables are and like how this whole table database thing works so generally if you're writing code we would connect to our companies.db database
14:30 - 15:00 we would send this query that query would be executed on the database and then we'd close our connection to the database now there are other types of SQL commands that we can run or we can send to our database to get specific values from the tables within a database now in the context of the IB exam we're going to be sending queries to get data from individual specific tables so right here I'm going to break down these queries these queries specifically you're going to have to know how these
15:00 - 15:30 work for the IB exam and you may even have to write some so one of the most common ways in which we can find or get values from a particular table is going to be using the select from format using the select and from keywords so right here we're selecting a specific field from the table products so this is our field and this is our table this means that our output is going to be all five of these things so we're
15:30 - 16:00 going to get Chais, Chang, Aniseed Syrup, Chef Anton's Cajun Seasoning, and Chef Anton's Gumbo Mix on the other hand right here if we select the product name and the price from products we're still working with the same table but we're not only going to get all the product names but we're also going to get all the prices so we get Chais 18, Chang 19, Aniseed Syrup 10, Chef Anton's Cajun Seasoning 22, Chef Anton's Gumbo Mix 21.35 so that's how select from works we're
16:00 - 16:30 selecting a field from a table I don't know why like my formatting went haywire here but this is basically just representing the concept that I just went over and these are generally just called select statements now there are a few other things we can do with select statements so let's take a look at this one right here we have a table name of products so we're going to select product name okay from so we're going to select everything in that field in the field product name
16:30 - 17:00 from the table products according to some criteria so where the price is greater than 20. so that means that we're going to for all of those rows or records in our table where the price is greater than 20 we're going to output the field we're going to output the product name so that means really this price here is greater than 20 and the price here is greater than 20. so we're just going to show this and this right here on the other hand right here we have
17:00 - 17:30 select star from products where supplier ID is not equal to one so we need to select star that means we're selecting all the fields so product ID product name supplier ID category ID unit and price but we're only going to show all five of those fields for every record where the supplier ID is not equal to one that's what this one here means not equal so that means for these for these two records we're going to show everything that's what the star means
17:30 - 18:00 So again, this is another select statement, but here we're using this where keyword in order to set some criteria for what's going to be displayed. Now, this next one is similar, but here we're checking for the existence of a specific string within a particular field. Right here, if we say select star from Products, we're actually going to display all of the fields from the table Products
18:00 - 18:30 where the product name is like 'a', meaning there's an 'a' somewhere in the product name. That's going to be this one, this one, this one, this one, and this one, so really we're going to display the whole table; not very useful. Right here, on the other hand, we're going to select product name, so we're just going to display the product name field from the table Products where the product name has 'Cha' in it. That's what the like keyword does: like checks for a specific string. It can actually do more complex
18:30 - 19:00 operations, like looking for patterns, but in the context of the IB exam, this is mainly what you're going to see. So these two, Chai and Chang, have 'Cha' in them, so we're just going to display Chai and Chang; we're just displaying product names where the product name contains 'Cha'.
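One note on syntax: exam-style questions often write the pattern bare, like LIKE "Cha", but in standard SQL the substring check needs % wildcards around the pattern. A minimal sketch, again with sqlite3 and assumed product rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (ProductName TEXT)")
conn.executemany("INSERT INTO Products VALUES (?)",
                 [("Chai",), ("Chang",), ("Aniseed Syrup",),
                  ("Chef Anton's Cajun Seasoning",), ("Chef Anton's Gumbo Mix",)])

# '%Cha%' matches 'Cha' anywhere in the field; % is a wildcard for any
# run of characters (LIKE is case-insensitive for ASCII in SQLite).
has_cha = [r[0] for r in conn.execute(
    "SELECT ProductName FROM Products WHERE ProductName LIKE '%Cha%'")]
print(has_cha)  # ['Chai', 'Chang']
```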
19:00 - 19:30 Now, when you look at the types of questions you're asked to answer on the IB exam, you're going to see a distinction between simple and complex queries. Right here we have what we could consider a simple query: select product name from Products where price is greater than 20. That's simple. This one is going to be complex, just because we've integrated some Boolean operators; AND, OR, and NOT are really the big ones. So, all of a sudden, once we start
19:30 - 20:00 working with more than one criterion, that's where we enter the realm of complex queries. Not that these are really scary in any way; it's a lot like working with if statements in programming. Just keep in mind that you might see the use of AND or OR to create more complex criteria, and accordingly complex queries, that you'll need to either understand or write on paper 2, option A.
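A sketch of the simple-versus-complex distinction in runnable form (same hypothetical Products data as before; the price and supplier values are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (ProductName TEXT, SupplierID INTEGER, Price REAL)")
conn.executemany("INSERT INTO Products VALUES (?, ?, ?)",
                 [("Chai", 1, 18), ("Chang", 1, 19), ("Aniseed Syrup", 1, 10),
                  ("Chef Anton's Cajun Seasoning", 2, 22),
                  ("Chef Anton's Gumbo Mix", 2, 21.35)])

# Simple query: a single criterion.
simple = [r[0] for r in conn.execute(
    "SELECT ProductName FROM Products WHERE Price > 20")]

# Complex query: two criteria joined with the Boolean operator AND.
complex_q = [r[0] for r in conn.execute(
    "SELECT ProductName FROM Products WHERE Price > 15 AND SupplierID = 1")]
print(simple, complex_q)
```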
20:00 - 20:30 Here's a quick little cheat sheet for everything we went over. Again, this is mainly what you'll need to know for the IB exam. I'm going to include this in the study guide that we'll be posting online, and you can take notes on it from there. So, now that we've gone over the basics of databases, we're going to start getting into some more complex concepts. The first one is a secondary key. Of course, we're going to have a primary key, or we should have a primary key, in every table,
20:30 - 21:00 but secondary keys are those values in a table that are also capable of functioning as a primary key; we're just not using them for that purpose. So we could also use an account number as a primary key, because all the account numbers are unique, and the same is true of email addresses, but they're not being used as primary keys, so for now they're just called secondary, or alternate, keys. The difference between a primary key and a secondary key is that, while they both uniquely identify records, primary keys cannot be null, because they are
21:00 - 21:30 essential to identifying records in the database, whereas secondary keys can be null. Also, we can only have one primary key at a time, versus multiple secondary key fields (this will have a caveat in the slides coming up). The next term I want to go over is a candidate key. A candidate key basically encompasses both primary and secondary keys: they can all identify unique records in a database. This isn't really a huge concept; you just need to know that both primary
21:30 - 22:00 and secondary keys are also referred to as candidate keys. Now, getting to composite primary keys: if we have a table and one field is not enough to uniquely identify records, but two fields can do it together, then together they can function as a primary key. So a name is not enough, because multiple people can have the same name, and they often do in a class, but anyone having the
22:00 - 22:30 same name and birthday as anyone else is extremely unlikely, so together we could use both of these fields as a composite primary key. The usage of two fields to identify a specific record in a table is therefore referred to as the use of a composite primary key. So we've gone over keys in general, and, as I said before, we've gone over a lot of the fundamental concepts involving databases and tables.
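The name-plus-birthday idea can be sketched as an actual composite primary key. The student names, dates, and GPAs here are hypothetical; the point is that the pair must be unique even though each field alone is not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Neither Name nor Birthday is unique on its own, but the pair is:
# the two fields together form a composite primary key.
conn.execute("""CREATE TABLE Students (
    Name TEXT,
    Birthday TEXT,
    GPA REAL,
    PRIMARY KEY (Name, Birthday))""")
conn.execute("INSERT INTO Students VALUES ('Jeff', '2005-03-14', 3.9)")
conn.execute("INSERT INTO Students VALUES ('Jeff', '2006-11-02', 3.1)")  # same name: fine

try:
    # Same name AND same birthday: the composite key rejects the duplicate.
    conn.execute("INSERT INTO Students VALUES ('Jeff', '2005-03-14', 2.8)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

count = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]
print(duplicate_rejected, count)  # True 2
```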
22:30 - 23:00 Now, when we're planning databases and getting into the nitty-gritty of actually working with or creating our own databases, we're going to represent them as database schemas. Here we're going to have different tables, and we're going to show how those tables are connected to each other via primary key and foreign key pairs, what their data types are, what the field names are, and what the table names are; you may even have the length of a specific field, so here it's going to be 100 characters, and here it's going to
23:00 - 23:30 be 12 characters. So a database schema is basically an overview of a database; it's essentially an organizational chart for a database, and we'll be referring to and looking at database schemas a lot in the coming slides. Here's an example of a larger-scale database schema. So now let's get into what a relational database is. In short, this is basically what you've seen for everything before this point in the video; it's just the term for the particular type of database
23:30 - 24:00 that we just went over: one that uses columns for attributes and rows for records. One term that we didn't really talk about before is tuples; records are also referred to as tuples, and you will often see this term on the IB exam in place of records or rows. Tables are related to each other using foreign keys, and each table has a primary key. So what we described at the beginning of this video is a relational database. Now, there
24:00 - 24:30 are other types of databases, particularly object-oriented databases, which, if you're doing HL, we'll talk about later on. Now, one hallmark of relational databases is the concept of referential integrity. What this means is that every row has an identifier, or primary key, and that every value in a foreign key column will be found in the primary key of the table from which it originated. So that means that, basically, right here we've got a
24:30 - 25:00 Departments table and we've got a Students table. This right here is the foreign key table, and this is the primary key table. What referential integrity means is that all of these department IDs will be found in our Departments table. As we can see right here, the foreign key ID of six is not in the Departments table, so we get an error, but the ID of one is in the Departments table, so that is valid, and that follows the principle of referential integrity. This ensures the relations between tables are consistent.
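The departments/students example can be sketched directly; note that SQLite only enforces foreign keys once the foreign_keys pragma is switched on. The IDs and names here are placeholders for the ones on the slide:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on
conn.execute("CREATE TABLE Departments (DeptID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("""CREATE TABLE Students (
    StudentID INTEGER PRIMARY KEY,
    Name TEXT,
    DeptID INTEGER REFERENCES Departments(DeptID))""")
conn.execute("INSERT INTO Departments VALUES (1, 'Economics')")

conn.execute("INSERT INTO Students VALUES (10, 'Alice', 1)")    # DeptID 1 exists: valid
try:
    conn.execute("INSERT INTO Students VALUES (11, 'Bob', 6)")  # DeptID 6 does not exist
    error_raised = False
except sqlite3.IntegrityError:
    error_raised = True
print(error_raised)  # True: referential integrity rejected the orphan row
```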
25:00 - 25:30 Basically, all the tables are connected in a logical manner. Now, when we're working with databases, the overarching piece of software that allows us to manipulate databases is called a DBMS, or database management system. In short, a DBMS allows us to read, store, change, and extract data in, or from, a database. Some examples of DBMSs you might have seen are SQLite, MySQL, or PostgreSQL,
25:30 - 26:00 and here again, as we were talking about before, we have examples; this kind of shows what a DBMS is, and then we also have some examples of other types of databases, including object databases, which we'll see later on. There are multiple components in a DBMS, or database management system. There's a data dictionary, which manages metadata; this metadata controls how the database is structured. There are features for data safety, including allowing for backing up and recovering a database that's
26:00 - 26:30 gone down, and also checking for data integrity, which is a term we'll go over later. There's a query processor, which allows us to accept queries in SQL and then return the appropriate output or conduct the correct operation. We have a storage engine, which is what actually handles the create, read, update, and delete operations; this is kind of the workhorse of the DBMS. A DBMS also allows for concurrency, which means we can have multiple users accessing databases and making changes at
26:30 - 27:00 the same time without colliding with each other. And then, finally, there are security features, which enforce user access policies; basically, we have features that allow us to control which users can access which parts of the database, and who can make changes to the database. Now, what we're going to do is go into some of the concepts presented here that involve database management systems in a bit more detail. The first one is a data dictionary. A data dictionary is also referred to as a metadata repository,
27:00 - 27:30 so this is a file, or set of files, that stores information about the database and the tables inside it. This is information about the structure and organization of the database itself, rather than the data. It will include, at a minimum, the names and descriptions of tables, the names of fields or columns, the data types and lengths of fields, and the
27:30 - 28:00 relationships between tables. So this data dictionary is what allows the data to be organized in a specific way, with certain tables and relationships between those tables. Now, here's an example. Right here we have data, just a table in our database, and here we have our data dictionary, and you can see there's a bunch of information about how this table should function: we have our various fields, we have the data types for those fields, we have descriptions of those particular fields for a developer, and right here we also have
28:00 - 28:30 lengths, the maximum length of any information that's going to be put in those fields. So that's an example of the metadata that would be used to specify the characteristics of a given table. Now, the next sort of functionality that a DBMS allows is concurrency. Concurrency is the process of managing simultaneous updates, or transactions, at the same time.
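Before moving on, the data dictionary just described can also be inspected programmatically. In SQLite the closest analogue is the sqlite_master table plus the table_info pragma (other DBMSs expose the same metadata differently, for example via INFORMATION_SCHEMA); the table here is a hypothetical one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Products (
    ProductID INTEGER PRIMARY KEY,
    ProductName VARCHAR(100),
    Price REAL)""")

# sqlite_master lists every table in the database (its names and the DDL
# that created them) ...
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# ... and PRAGMA table_info lists each field with its declared data type,
# including the declared length for VARCHAR fields.
fields = [(r[1], r[2]) for r in conn.execute("PRAGMA table_info(Products)")]
print(tables, fields)
```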
28:30 - 29:00 As for what transactions are, we're going to go over that in the next couple of slides, but regardless, concurrency means we can have multiple people accessing our database at the same time and making changes to the database, even to the same row. What concurrency control does is prevent access by more than one user to the same row, or record, at a time. So, for example, if you had two people who want to change the same row, you basically force them to get into a queue: first,
29:00 - 29:30 the first person's changes are made, and then, after all of those changes are made, the second person conducts their operations on that row. Basically, it makes sure that one row isn't being written by one user while at the same time it's being written by a different user, which would just be a mess. As I said before, it forces sequential updates. Now, this is probably the last concept that we're covering with regard to DBMSs: the DBMS also allows us to secure our database.
29:30 - 30:00 We can specify access rights, as we talked about earlier: we can say which users are able to conduct which operations in the database. We can mandate that there are audit trails; this is a record of any changes that are made to a database, so if someone messes up the database, or does something that deletes a large portion of the database, we can see who did it. You have data locking, and data locking corresponds to concurrency: this means that a row that's currently being changed is
30:00 - 30:30 locked and cannot be accessed. While that isn't security in the traditional sense, it means that data is being preserved; we're not necessarily securing it from a bad actor, but we're making sure that everything is the way it should be. The same goes for validation: validation makes sure that new data follows the rules set, such as the correct data type and length. Again, this isn't necessarily security from a hacker, but it makes sure that data is in the correct format. We can have encryption, which is encrypting the data in a database,
30:30 - 31:00 and then we can have backups, so that if something goes wrong with the database, or someone makes a change that they're not supposed to, we have an intact copy that we can load up and use in place of the corrupted database. Now, the next concept I want to cover is database transactions. Basically, a transaction is a group of tasks; in programming terms, it's a set of SQL statements that are executed in order as if they were one command. Now, this works differently from a
31:00 - 31:30 typical database operation. Usually, you just run a command, it changes the database immediately, and you're done. What happens in a transaction is that we have a set of operations, and we keep a record of everything that needs to be done for each of those operations. So for operation one, we keep track of everything that needs to happen for it to take place, and the same for operation two, and for any other operations we have, without
31:30 - 32:00 actually making permanent changes to the database. In some ways, you could say we're making temporary changes to the database; we're basically putting all the pieces in place to be able to make a permanent change, but we don't actually make the change until we commit all of the operations, all of the temporary changes that we've planned. For operation one we might make a list of things that need to happen, then for operation two, and then
32:00 - 32:30 for operation three, and so forth, but those changes aren't actually going to be permanently made to the database until we commit them. You can think of this either as a plan for changes or as temporary changes that are being made but are not permanent. So if all of these are done, meaning we've attempted temporary operation one, we've attempted temporary operation two, we've done all of those, and there are no errors, then we can commit those changes permanently. Basically, transactions allow us to
32:30 - 33:00 conduct a set of tasks in a way that doesn't actually alter the database; we're conducting operations on a database to see whether a task, or a set of tasks, is possible, without making permanent changes. Then, once you've gone through and done that and there are no errors, we can permanently commit our changes. And one of the advantages of transactions is that if anywhere here we do get an error, we can roll back any of the temporary
33:00 - 33:30 changes that have been made; it's like nothing ever happened, basically. Now, maybe that was a bit convoluted. The best way to summarize this is: these are all temporary changes, and nothing is really done until we actually commit those changes to our database, which makes them permanent. So this is how it works: we start in the active state and begin conducting our operations, which puts us in a partially committed state; that just means we've temporarily done some things.
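This commit-or-roll-back behaviour can be sketched with sqlite3, which groups statements into a transaction until commit() or rollback() is called. The bank-transfer scenario is a hypothetical example, not one from the video:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (Name TEXT PRIMARY KEY, Balance INTEGER)")
conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [("Alice", 100), ("Bob", 50)])
conn.commit()

# A transfer is two UPDATE statements that must succeed together or not at all.
try:
    conn.execute("UPDATE Accounts SET Balance = Balance - 30 WHERE Name = 'Alice'")
    cur = conn.execute("UPDATE Accounts SET Balance = Balance + 30 WHERE Name = 'Bobb'")  # typo!
    if cur.rowcount == 0:            # the credit step matched no account
        raise ValueError("credit step failed")
    conn.commit()                    # only reached if every step worked
except ValueError:
    conn.rollback()                  # undo the debit too, as if nothing ever happened

balances = dict(conn.execute("SELECT Name, Balance FROM Accounts"))
print(balances)  # {'Alice': 100, 'Bob': 50}
```

Because the rollback undoes the debit as well, the database ends up exactly as it started.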
33:30 - 34:00 But if there's a failure, we go to a failed state, we abort, and we terminate all operations. However, if everything is successful, then we commit our changes and we're good to go. Now, transactions are important because they follow a set of principles called the ACID principles. The first is atomicity, which means all tasks in a transaction are performed, or none are. Remember, a transaction is a collection of tasks that are done temporarily before being committed; basically, unless all of these tasks are
34:00 - 34:30 successful, we're not going to do any of them. Next is consistency. This means that all data must be valid according to existing rules: it must follow the correct data type, the correct length, etc. Next we have isolation. This means that no transactions are going to interfere with each other; if you have multiple transactions running at the same time, they're not going to affect each other, and if anything, they're going to be conducted in sequence, one transaction after another. But the
34:30 - 35:00 point is they're not going to run into each other while they're trying to commit their respective tasks or conduct their respective operations. And the last principle is durability: once a transaction is complete, the change to the database is permanent, even in the case of system failure. So: atomicity, consistency, isolation, durability. You need to know what all of these are for the IB exam. The purpose of a transaction is to make sure that there's never an
35:00 - 35:30 incomplete set of changes, and particularly to make sure that one set of changes doesn't collide with another set of changes. So really, at their heart, transactions are about making more complex changes to the database using multiple different SQL statements, and with transactions we can make sure that two different sets of tasks don't run into each other, and also that all tasks are committed, or none. Also, whenever we
35:30 - 36:00 make changes to a database, yes, they are permanent once committed, but we also log them; we log every single change that's made, and that makes it really easy if we want to roll back changes. So changes made using transactions, rather than just simple SQL statements, are actually more convenient, because if we want to change them, we have them logged and we can just roll them back, or revert them, really easily. Now, what we're going to do is we're
36:00 - 36:30 going to go through a few sets of concepts that lead us to normalization. Now, normalization is one of the biggest and, I would say, most complex concepts in option A, so it's important to understand certain language before we tackle it. The first is data integrity. Data integrity is a very broad term, and it's used a lot in option A basically to say that the data should be what the user means it to be.
36:30 - 37:00 We can break it into three things. The first is accuracy: we have the correct data in our database, that data is retained and preserved (it isn't changed for any unnecessary reason), and the relationships are also retained and preserved, so everything is as we've planned it. The second is completeness: all necessary data is available; again, no data is being deleted for no reason. And the third is validity: all data meets all predetermined rules, which could be for length, for data type, whatever.
37:00 - 37:30 Here we have a great graphic that talks about how we can preserve data integrity. Validating input leads to validity. Removing duplicate data connects to completeness; duplicate data is not really what we mean when we're talking about completeness, because we just don't need it. Access controls are necessary for preserving data integrity, because someone you don't want accessing the database might
37:30 - 38:00 violate the accuracy, completeness, or validity of the data, and audit trails also help with that, since you can see who's changing what. And backing up data is another way to preserve data integrity, because if something goes wrong, for example if we lose some data, we can restore that data and maintain the completeness of the data in our database. So you just need to know that data integrity basically means that everything is good in the database. The next concept is going to be data
38:00 - 38:30 redundancy, and this is a huge concept. Data redundancy is a situation where the same piece of data is stored in two or more different places. For example, if we look at this table right here, we have this row, a doctor ID, the corresponding doctor, and the room for that doctor, and it appears three times in the database. We don't need to have it in the database three times; we could just have a foreign key to a table with all the doctors, and that would be sufficient.
38:30 - 39:00 So, if we have the same piece of data replicated in multiple different places in the same database, or the same table, those are all examples of data redundancy, and they're not considered good practice in terms of building databases. That's one of the things that normalization, the concept we're going to tackle next, seeks to resolve. Now, while data redundancy is something we're trying to get rid of in the following slides,
39:00 - 39:30 there are both pros and cons. Obviously, the cons are that more storage space is required, because you're simply storing more information, and the other problem is data inconsistency: if you have the same piece of data in multiple locations, it may get updated in one location but not the others, which can be really confusing for a software system, or really confusing in general. As for the pros, and this really depends on how you store your data and where you store it, if you do have multiple
39:30 - 40:00 copies of that data, you may have faster data access speeds, because if your data is in more locations, you don't have to go as far in order to access it, whether in a database or just in RAM in general. Also, if you have data in more places and you lose the data in one place, you can just replace it from another location. But this all relies on you being able to keep track of that data and knowing exactly where it is at all times. So now let's get to normalization. Normalization is a process by which larger tables in a database are divided
40:00 - 40:30 into smaller tables while ensuring data integrity, which we talked about before, and reducing data redundancy. Some things you want to get rid of are attributes with multiple values. This right here is a great example of an attribute with multiple values; this one literally has a table inside of a table. Or, let's say we have a particular attribute
40:30 - 41:00 for weather, and we have both the temperature and the type of weather in the same field, in the same cell; that's an example of having multiple values in one attribute, in one field. You also don't want attributes that repeat the same type of information, or attributes that aren't really related to the information in the table; we just don't want redundant information, similar to what we saw in the last slide.
41:00 - 41:30 Normalization, again, is a process by which we divide up our database in order to prevent these particular characteristics. Generally, you want to use normalization to reduce data redundancy, as we touched on before, and to reduce table complexity: if we have a database divided into multiple smaller tables rather than one big table, overall we're going to have less complex tables, and this makes insertions, updates, and deletions less
41:30 - 42:00 error-prone. We also just want to make sure that data is stored logically, because this makes it easier for us to write SQL queries, but it's also a lot easier to understand for developers, or anyone who's working with our database, which makes it less likely that they're going to damage the database's data in some way, do something they're not supposed to involving the database, or structure it in a way that's illogical or that doesn't help the system function efficiently. Now, when we're talking about databases, there are three normal forms that we're going
42:00 - 42:30 to refer to. Each of these represents a set of standards for normalization. You can consider the first normal form to be the least stringent form of normalization and the third normal form to be the most stringent. The first normal form mandates that we eliminate duplicate columns and columns with multiple values. You also want to create separate tables for each group of related data, with unique primary keys. Now, the problem with normal forms is
42:30 - 43:00 that if you look online, or you look in a textbook, different sources explain them, and distinguish between the first, second, and third normal forms, in different ways. For example, some sources will say that the first normal form mandates unique primary keys, while others will say that the second normal form mandates them. I think the IB exam actually had primary keys in the second normal form, but
43:00 - 43:30 that doesn't matter as much, because for us to even have second normal form, we need to meet all the requirements of the first normal form, so regardless of where you put it, you're going to have to do that anyway. Anyway, that's the first normal form. The second normal form: meet all the requirements of the first normal form, then eliminate partial dependency, which we're going to talk about in an example. And in the third normal form, we need to meet all the requirements of the second normal form, and we also need to eliminate transitive dependency, which is another term that
43:30 - 44:00 we're going to go through in our examples. Now, what's important is that on the IB exam you're going to have to take tables that are unnormalized and put them into, for example, third normal form. Understanding these concepts is important, but you're just going to need a lot of practice, and it really comes down to seeking out certain errors, like multiple values in one field, and understanding how you can split up tables to get rid of those.
44:00 - 44:30 Before we get to actually normalizing, let's look at some examples, particularly of 1NF, 2NF with partial dependency, and 3NF with transitive dependency. Okay, so remember, with first normal form we don't want to have any duplicate columns, and we don't want to have any columns with multiple values; generally, we're trying to have unique columns, each holding one piece of data, and primary keys as well.
44:30 - 45:00 So this is our first table, and immediately we can see that there is a problem with this particular field: we have multiple values in it, so we need to rectify that problem. There's really no other problem with this table. So what we're going to do first is separate those values out into one table; we're going to create a new table, and this is just an intermediary step. You're
45:00 - 45:30 not actually going to do this on the IB exam, but I'm doing it to show you how we get from unnormalized to first normal form. In this table, we're going to repeat rows: we have Jeff right here, who had a 3.9 and was majoring in economics and math, so now we're going to have two rows that reflect that information, one row for economics and one row for math. Similarly, Alice is majoring in math and physics, so we have Alice with one row for math and
45:30 - 46:00 one row for physics. But that's only the first step. We don't really have a primary key here, right? We have student ID, but the IDs are repeated, so they don't really function as primary keys. So what we're going to do is split this up into two tables. We're going to have one table with just the student data, which is this table, and another table that combines our student IDs with majors, so we're going to have each student ID next to its respective major. Then,
46:00 - 46:30 wherever we have multiple majors, we'll have a student ID that corresponds to each one of those majors. However, again, we can't use the student ID as a primary key, because we have keys that are repeated, so we're just going to create a new primary key, and that's going to be a simple ID: this should be one, two, three, four; this should actually be five; and this should be six. And that's basically getting to first normal form:
46:30 - 47:00 we don't have anything that's repeated, we don't have any duplicate values, and we have primary keys on all of our tables; we basically had to create those two tables to rectify this problem right here. This table here would have had the problem of not having a primary key. I suppose we could have added another primary key, but I think this is definitely a more efficient solution, because here we're only repeating our student ID, versus here, where we'd be repeating all three of these attributes.
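The two-table split just described can be sketched end to end. The GPA values and the comma-separated majors column are assumptions standing in for the slide's data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unnormalized: the Majors field packs multiple values into one cell (violates 1NF).
unnormalized = [(1, "Jeff", 3.9, "Economics, Math"),
                (2, "Alice", 3.7, "Math, Physics")]

# 1NF: one table holds the student data, one row per student ...
conn.execute("CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, Name TEXT, GPA REAL)")
# ... and a second table holds one row per (student, major) pair,
# with its own simple primary key, since StudentID now repeats.
conn.execute("""CREATE TABLE StudentMajors (
    ID INTEGER PRIMARY KEY,
    StudentID INTEGER REFERENCES Students(StudentID),
    Major TEXT)""")

for student_id, name, gpa, majors in unnormalized:
    conn.execute("INSERT INTO Students VALUES (?, ?, ?)", (student_id, name, gpa))
    for major in majors.split(", "):
        conn.execute("INSERT INTO StudentMajors (StudentID, Major) VALUES (?, ?)",
                     (student_id, major))

jeff_majors = [r[0] for r in conn.execute(
    "SELECT Major FROM StudentMajors WHERE StudentID = 1")]
total_major_rows = conn.execute("SELECT COUNT(*) FROM StudentMajors").fetchone()[0]
print(jeff_majors, total_major_rows)
```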
47:00 - 47:30 So remember, the name of the game is to repeat as little as possible, to eliminate redundancy. Now, in second normal form we have a bigger problem, and that is partial dependency. Partial dependency occurs when we have a composite primary key. So right here we have two key fields, the employee ID and the field ID, which together are unique.
47:30 - 48:00 Now, the problem here is that the employee name has nothing to do with the field ID. We do have employee ID and field ID, but employee name is really just directly tied to employee ID, and field name is really just directly connected to field ID. This is an example of partial dependency, because field name doesn't really have a connection to the employee ID, and employee name doesn't really have a connection to the field ID; those are just very separate things. So you could
48:00 - 48:30 use an employee ID to get an employee name, but not a field name, and you could use a field ID to get a field name, but not an employee name; at least, that's what we see from this table. Now, there's really only one way to rectify this: splitting it up into different tables. We're going to have one table where the former primary keys, the employee ID and the field ID, go together, so those form one key. Then we're going to have two tables for the partially dependent fields: employee
48:30 - 49:00 name was dependent on employee ID, so that's going to be in one table, and field name was dependent on field ID, so that's going to be in another table. Now, we're making an assertion that field name is not connected to employee ID in any way; that's what we see from this table, but we're making a judgment call. Oftentimes, when you're normalizing your own databases, you're going to have to look at the context and decide whether there is only partial dependence or full dependence.
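A sketch of the 2NF decomposition: the composite key keeps its own link table, and each partially dependent field moves into the table of the key it actually depends on. The names and IDs here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 2NF: the partially dependent fields get their own tables.
conn.execute("CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, EmployeeName TEXT)")
conn.execute("CREATE TABLE Fields (FieldID INTEGER PRIMARY KEY, FieldName TEXT)")
# The link table keeps only the old composite key.
conn.execute("""CREATE TABLE EmployeeFields (
    EmployeeID INTEGER REFERENCES Employees(EmployeeID),
    FieldID INTEGER REFERENCES Fields(FieldID),
    PRIMARY KEY (EmployeeID, FieldID))""")

conn.execute("INSERT INTO Employees VALUES (1, 'Jeff')")
conn.execute("INSERT INTO Fields VALUES (10, 'Databases')")
conn.execute("INSERT INTO EmployeeFields VALUES (1, 10)")

# Each name now lives in exactly one place; joining recovers the pairing.
row = conn.execute("""SELECT e.EmployeeName, f.FieldName
                      FROM EmployeeFields ef
                      JOIN Employees e ON e.EmployeeID = ef.EmployeeID
                      JOIN Fields f ON f.FieldID = ef.FieldID""").fetchone()
print(row)  # ('Jeff', 'Databases')
```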
49:00 - 49:30 is partial dependence probably to illustrate the concept and this is how we Rectify it so third normal form basically encompasses the concept of transitive dependency now we can say right here that pass fail is transitively dependent on total marks and the reason for that is because this attribute right here depends wholly on this one right here with really no connection to name or seat number like these have no real impact on what pass fail is and total marks obviously is
49:30 - 50:00 going to depend is going to depend on the primary key which is going to be name well actually it's not going to be named this case it's going to be seat number but pass fail depends on total marks entirely and doesn't really have a connection to seat number again this is taking context into account and this is a problem in third normal form so in order to do that in order to like sort of Rectify that what we're going to do is we're actually just going to create a separate table with total marks and pass fail like so
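The 3NF split just described can be sketched with SQL via Python's sqlite3 module. This is a minimal sketch; the table names, column names, and sample rows are illustrative, not the exact data from the slide.

```python
import sqlite3

# 3NF fix: pass_fail depends on total_marks (not on the key, seat_number),
# so it moves into its own table keyed by total_marks.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE student (seat_number INTEGER PRIMARY KEY, name TEXT, total_marks INTEGER);
CREATE TABLE result  (total_marks INTEGER PRIMARY KEY, pass_fail TEXT);
""")
cur.executemany("INSERT INTO student VALUES (?, ?, ?)",
                [(1, 'Kim', 82), (2, 'Lee', 45)])
cur.executemany("INSERT INTO result VALUES (?, ?)",
                [(82, 'PASS'), (45, 'FAIL')])
# Joining the two tables recovers the original view of the data.
rows = cur.execute("""
    SELECT s.name, r.pass_fail
    FROM student s JOIN result r ON r.total_marks = s.total_marks
    ORDER BY s.seat_number
""").fetchall()
print(rows)  # [('Kim', 'PASS'), ('Lee', 'FAIL')]
```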
50:00 - 50:30 And we're going to have seat number, name, and total marks in one table. This comes down to having logically organized tables. Right here we can also use total marks as our primary key, because all the values are unique. So that's basically third normal form. Now, it's important to know what all these concepts are, but when we actually get to normalizing, the logic is going to look a bit different: we're going to be using these concepts of transitive dependency and
50:30 - 51:00 partial dependency, but it's really going to be more from the perspective of reducing repetition and logically organizing tables in as intuitive a way as possible. That being said, let's move on to actual normalization in the IB context. Okay, so now let's take a look at how to actually normalize tables, which is one of the most common tasks on Option A. Here we have a set of steps, and I'll caution you by saying that I
51:00 - 51:30 don't follow these steps in exactly this order. These are all things that need to happen at some point, and I think it'll become clearer in the examples I work through after this slide how I use these steps to approach these problems. The first thing I would say is that you need to split the big table you're given into separate logical tables by entity. For example, if you've got cars, mechanics, and mechanic shops, and you have attributes for all of those in the same
51:30 - 52:00 table, you need to split it up, and each one of those smaller tables should include only the data that corresponds to that given entity; for example, what corresponds only to cars goes in one table. You need to get rid of any multi-valued attributes by creating new tables: if a particular column has more than one value per row, that should not be the case, and you fix it by dividing your table into multiple tables. Make sure that there are no attributes
52:00 - 52:30 that are dependent on only part of a composite primary key; otherwise, new table: that's partial dependence. Make sure no attributes are dependent on another attribute that is not the primary key; otherwise, new table: that's transitive dependence, which falls under 3NF. Make sure that all tables are linked by relationships: every table does not need to be linked to every other table, but it needs to be linked to at least one other table through a primary key/foreign key relationship, because ultimately they are all in the same
52:30 - 53:00 database, and we need to be able to access values in other tables from whichever table we start with. And finally, you need to make sure that every table has a primary key. Now, we're going to go through four different examples. The reason for this is that I know these steps may not have been the most intuitive, and honestly, when solving these questions, sometimes I just have to intuit my way around them, so I'm not necessarily following these exact steps.
53:00 - 53:30 I'm going to use the brute-force approach to teaching and just try to solve as many examples as possible, walk you through how I do them, and hope that you understand what to do. So let's get started. Right here we've got our first table, and we actually need to put this into 3NF, third normal form. The first thing we're going to do is look at the different types of entities that we have. Obviously, right here we've got product,
53:30 - 54:00 we have salesperson, and we've got manager. Okay, so when we write our answers to these questions, we're not going to be drawing out new tables; we're just going to write the table names with their fields. So immediately I can say we're going to have a table called product, and that table is going to have a number, a unit price, and a product name for sure.
54:00 - 54:30 Now, I'm not too sure about date and time, because for each one of our salespeople we have two dates and times, and the fact that there are multiple values in this column tells us that we're going to need to split it into a new table anyway. So I'm going to leave that out for now; I know we're going to have to do something with that column, and it's probably not going to be in the product table. We're going to leave this open-ended, not closing the brace, so that we can add or change
54:30 - 55:00 stuff later on. The next thing we know we're going to need is a salesperson table, so let's say salesperson, and we know in that one we're going to have a salesperson number and a name. Next we're going to have a manager table, because we have two manager attributes; we'll just say manager, and then manager number. Well, I'm just going to go ahead and erase
55:00 - 55:30 that real quick, because we're just going to write number; I just don't want to write that much. So it'll have a number, and we're going to have a name. Okay, now that we have this, what we want to do is deal with what's remaining, which is the date and time column, which we haven't really figured out how to handle. In order to figure that out, we need to look at how this fits into the rest of the table.
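Before dealing with date and time, the three entity tables split out so far can be sketched with sqlite3. The column names and sample values here are illustrative (the unit price is invented; the drill and Benson come up later in the walkthrough), not a transcription of the actual exam table.

```python
import sqlite3

# One table per entity, each holding only the attributes that belong to it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE product (
    number     INTEGER PRIMARY KEY,
    unit_price REAL,
    name       TEXT
);
CREATE TABLE salesperson (
    number INTEGER PRIMARY KEY,
    name   TEXT
);
CREATE TABLE manager (
    number INTEGER PRIMARY KEY,
    name   TEXT
);
""")
cur.execute("INSERT INTO product VALUES (35647, 19.99, 'Drill')")
cur.execute("INSERT INTO salesperson VALUES (234, 'Jones')")
cur.execute("INSERT INTO manager VALUES (16, 'Benson')")
# Each entity's attributes now live only with that entity.
names = [t[0] for t in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(names)  # ['manager', 'product', 'salesperson']
```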
55:30 - 56:00 So right here, basically, we have a date and time for every single purchase: someone buys a product, that product is sold by a salesperson, and that sale by that salesperson takes place at a specific time. We know all of these dates and times are different. So what we're going to do, in order to split date and time out so that we have a table where each row has an individual date and time, is
56:00 - 56:30 create a new table called sales, because that's really what's represented by each one of these times. That table is going to have a product number, it's going to have a salesperson number, and it's also going to have our date and time, let's say DNT. Okay, so we've kind of accounted for everything by this point. We have a product with a number, unit price, and
56:30 - 57:00 product name, and this works because all of these attributes depend primarily on a product in this table. If we had all of these attributes, product name, salesperson, manager, all in the same table, that would basically violate our rule against attributes depending on non-key attributes; maybe not necessarily transitive dependence in every case, but you'd have all these different attributes that are
57:00 - 57:30 dependent on other attributes that are not the primary key: manager name is dependent on manager number, salesperson name is dependent on salesperson number, and unit price and product name are dependent only on the product number, not the salesperson number or the manager number. That's why we divide these into all these tables. Now we need to think about our primary keys. It's pretty clear that in our product table the product number is going to be our primary key, and the reason for that is
57:30 - 58:00 that every product number corresponds to something different; we don't really have the same product number for two different things. One number corresponds to a saw, and 35647 is a drill, so we can see that each product number is unique and has something individual that corresponds to it. The same kind of goes for unit price: pretty much every product number has a unit price, and even if it has the same price as something else, it's at least distinguished by product number
58:00 - 58:30 and product name, which could also make it a composite key if you really think about it. But anyway, next, with salesperson we have a number and a name, and every single salesperson has a unique number, so that's going to be our primary key there; we're just going to draw a star by primary keys. And every single manager also has a unique number; for example, 16
58:30 - 59:00 is Benson and 45 is Rogers, so we're going to have a primary key there. Now, with sales: do we have any product number and salesperson number combinations that overlap, or that correspond to different salespeople? Let's go down and check. We have 35647 right here, which corresponds to 234, and we have
59:00 - 59:30 35647 right here, which corresponds to 199. So what that's telling us is that we could use product number and salesperson number together as a composite key. Let's see if there are any other problems... yeah, I think that would actually work, so that would be a composite key. But if we wanted to
59:30 - 60:00 be careful, the worry is that the date and time wouldn't necessarily correspond to the product number and salesperson number alone, although that's dubious; I think these together could be a composite key, and that would be correct. However, if we wanted to be sure, we could also add a sales ID, which would guarantee a unique primary key
60:00 - 60:30 for each value in our sales table. So again, when in doubt, just create an ID; this is just going to be 1, 2, 3, 4, 5, each one corresponding to a unique sale. And that's basically how we normalize this. I know maybe this didn't follow the steps exactly as written, but I wanted to walk you through my thought process so that you have a good understanding of how to do these kinds of problems. That being said, we're going to do three more, and if you guys get tired
60:30 - 61:00 at any point, just go ahead and skip on. But it took me a while as a teacher to really understand how this worked, looking at previous exams and mark schemes and looking online, so I just want to give you as many opportunities as possible to understand how this works. And this is basically our answer right here. As you can see, it's all the same, except they had purchases where we had sales, and purchase ID instead of sales ID, but regardless, this
61:00 - 61:30 would still be correct. And notice we get four points for the correct tables and four points for the correct primary keys. Now, this is our next example. Right here we have courses that are being taken, and students taking them. So immediately we have a few things: first of all, we have students, so that's going to be one of our tables. Okay, so I know we have students, and then
61:30 - 62:00 we have subjects, and we know that we're probably going to need another subject table, because there are, for example, multiple subjects associated with the same student, and whenever that's the case, we probably need another table. What compounds that, and makes it even more the case, is that we also have exam grades: exams that are associated with subjects that are also associated with students. So we could say that exam grades are
62:00 - 62:30 connected in some way to our subjects as well. That being said, I'm going to stick with student first and then go from there. With student, I think we know that we can have an ID, which will be a key, with name and gender attached. And with grades, I know for sure that every grade is determined by a unique combination of a student ID,
62:30 - 63:00 a subject, and the grade itself, because if you have the student ID, you already have the name and gender. So every exam grade is going to be a combination of a student ID, a subject, and the grade, and each of these is probably going to need its own ID. So in that case I would say that,
63:00 - 63:30 yeah, the problem right here is that we can have multiple student IDs in different subjects and with different grades, and the fact that we can have multiple student IDs in the same table means that we probably also need a primary key for this, so we could just say exam grade ID. But there's something missing: you don't really have the subject here, and I think this is kind of a problem,
63:30 - 64:00 because we need students to be associated with subjects in some way, right? We have exam grades, where we're just showing what the exam grade is, and we have students, but there's no real association; there's no way to tell, based on a student, what subjects they are taking. So what's the best way to do this? Actually, subject is kind of on its own; there's nothing else that really depends
64:00 - 64:30 on subject, or is there? So I think what I'd put right here is just subject, because subjects are kind of just on their own. That means every student is going to be... actually, that's not going to work, right? Because students have multiple subjects. So I guess we're back to square one. You've got a table with students: student ID, name, and gender. We've got exam grades: student ID, subject, grade, and exam grade ID. We do have
64:30 - 65:00 subjects included in here; I just don't really see any reason why we need a separate table for subjects, since there's no other information associated with subjects besides the exam grade. So I think that's it; let's go ahead and see what the answers tell us. And I guess we're right: you have student, and they call the other table subject student, but that was pretty much it. I know that was kind of roundabout, but I do feel like explaining my somewhat weird thought process is the best way to look at it.
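The two-table answer reasoned out here can be sketched with sqlite3. All names and sample rows are illustrative; subject stays as a plain column in the grades table because, as noted, no other data depends on it.

```python
import sqlite3

# student(student_id, name, gender) plus exam_grade(exam_grade_id,
# student_id, subject, grade): the repeating student/subject pairs live
# in exam_grade, keyed by their own surrogate ID.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT,
    gender     TEXT
);
CREATE TABLE exam_grade (
    exam_grade_id INTEGER PRIMARY KEY,
    student_id    INTEGER REFERENCES student(student_id),
    subject       TEXT,
    grade         INTEGER
);
""")
cur.execute("INSERT INTO student VALUES (1, 'Anna', 'F')")
cur.executemany("INSERT INTO exam_grade VALUES (?, ?, ?, ?)",
                [(1, 1, 'History', 6), (2, 1, 'Biology', 7)])
# One student, many subjects, without repeating the student's details.
subjects = [r[0] for r in cur.execute(
    "SELECT subject FROM exam_grade WHERE student_id = 1 ORDER BY subject")]
print(subjects)  # ['Biology', 'History']
```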
65:00 - 65:30 And I guess this one actually asked us to normalize to 2NF, but I think that, the way we did it, there isn't really any transitive dependence anyway, so I'm pretty sure 3NF would look the same. I can't think of any real examples here where something is dependent on one attribute but not on the ID, or where there's anything weird going on. Okay, so let's move on. Right here
65:30 - 66:00 we clearly have a few entities: we have a truck, we have a driver, and we have trailers. So let's start with the truck. For truck, we're going to have an ID, and it looks like that's unique for every truck; we're going to have a make; and we're going to have an energy source. For a driver, we're going to have a name
66:00 - 66:30 and a telephone number, which seems to be unique for all of them and can therefore be used as a key for a driver. Next you've got trailers, and it looks like these trailers are all making different journeys, because we have coupled-from and coupled-to dates. So it seems like we've got a few things going on. I guess the other issue is also the fact that in trailer ID and in
66:30 - 67:00 trailer space we have multiple values in the same attribute, which is an immediate red flag, and those values also seem to go together: we have TR2 and 60, TR9 and 30, TR2 and 60. So we're seeing some patterns here, and I want to separate this out. If we look at our data for trailers, we've got a trailer ID, we have the space in the trailer, and then we have dates that correspond to that trailer ID and trailer space,
67:00 - 67:30 but those dates can differ as well: here we have TR2 with 60 from 1/1/18 to 18/6/18, and then we have TR2 with 60 from 1/8/18 to 31/12/18. So I think these dates represent journeys. It seems like we've got trailer attributes, so we have specific data connected to trailers, and specific data connected to
67:30 - 68:00 the journeys of these trailers. And I think it's kind of clear, because anytime a trailer ID is repeated, like TR2, it has the same trailer space; if you look at TR9, for example, that has a trailer space of 30 everywhere it appears, but on different dates. So we're just going to create a table called trailer, with trailer ID and trailer space, and then we're going to get to journeys. So
68:00 - 68:30 we'll create a table called journey, and each one of those journeys I think should have an ID, a journey ID, but it should also have a trailer ID, because each one of these corresponds to a trailer ID. So: ID, trailer ID, and then a coupled-from and a coupled-to. So all of these tables have primary keys, and
68:30 - 69:00 we've resolved that issue where we had multiple values for the same attribute. So let's look at the mark scheme and see how we did. Okay, I think we did pretty well. Right here they used the driver instead of the telephone number, which I guess you could have done either way. Each truck had a truck ID, so this one seemed to be correct, and this one right here seemed to be correct.
69:00 - 69:30 And I think we got this one as well. Ah, except we're missing the driver, aren't we? Let's go back. Yeah, on the truck we didn't have a driver, which is my mistake; every truck obviously has its own driver, so each truck should have a driver. So right here we're
69:30 - 70:00 going to have driver, or driver name, and we can just use the name as a primary key, because if you look, all of the names are significantly different. So that solves that error. The other error we had was with journey: they had truck ID instead of just a generic ID, and I think that makes sense. The reason is that not every trailer
70:00 - 70:30 necessarily corresponds to a specific truck ID; for example, we have TR9 in one row and TR9 in another. But it does seem like every single journey corresponds to a specific truck ID, because every single one of these rows is unique: they all have unique dates, and they all correspond to a truck ID. So right here I'll just add it at the end:
70:30 - 71:00 we'll have truck ID, and together these can actually form a composite primary key, because you're always going to have unique combinations of the trailer ID and the truck ID. So, a little mistake there; let's put stars right here. But overall, that's the logic of how we do this. That being said, let's move on to our final example.
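The trailer/journey split, with the mark scheme's composite key of trailer ID and truck ID, can be sketched with sqlite3. The truck and driver tables are omitted here for brevity, and the truck IDs and dates are illustrative placeholders, not the exam's actual values.

```python
import sqlite3

# Trailer attributes live once in trailer; each journey row pairs a
# trailer with a truck and a date range, so repeated trailer IDs are fine.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE trailer (
    trailer_id    TEXT PRIMARY KEY,
    trailer_space INTEGER
);
CREATE TABLE journey (
    trailer_id   TEXT REFERENCES trailer(trailer_id),
    truck_id     TEXT,
    coupled_from TEXT,
    coupled_to   TEXT,
    PRIMARY KEY (trailer_id, truck_id)   -- composite key, as discussed
);
""")
cur.execute("INSERT INTO trailer VALUES ('TR2', 60)")
cur.executemany("INSERT INTO journey VALUES (?, ?, ?, ?)",
                [('TR2', 'T1', '2018-01-01', '2018-06-18'),
                 ('TR2', 'T7', '2018-08-01', '2018-12-31')])
# The same trailer appears in two journeys without repeating its space.
count = cur.execute(
    "SELECT COUNT(*) FROM journey WHERE trailer_id = 'TR2'").fetchone()[0]
print(count)  # 2
```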
71:00 - 71:30 And that's wine. So right here, again, we're just going to do what we did before and look at the different entities we have. Right here we obviously have wine, and we also have a problem, because our description attribute has three different types of values in it, and for it to be normalized there should only be one value per attribute.
71:30 - 72:00 So I think we can safely say that every wine has a name; it comes from a vineyard, so it has a vineyard; it has a year; dry is, I guess, a flavor; and there's alcohol content, so we'll say AC. As for unit price: we obviously have wines, and
72:00 - 72:30 we have prices that correspond to them, but we want to consider this carefully, because we also have store IDs, meaning multiple stores that are in some cases selling the same thing. For example, Sauvignon Blanc from Stormy Bay, 2017, dry, and 12: these rows are basically the same up until you get to the store ID, where we have a store ID of 2 in one row and a store ID of 1 in the other, and then the stock quantity, which is also different. So those are
72:30 - 73:00 different for that one; however, the unit price is the same. Let's see if there are any other examples like that. Right here we have a Shiraz, and it seems like everything is the same, including the unit price, while the stores and the stock quantities are different. Okay, so I guess we can safely say that the unit price is also something that corresponds to each particular wine.
73:00 - 73:30 Yep. So: name, vineyard, year, flavor, alcohol content, and unit price. And let's see: some of these have the same name, so you have Sauvignon Blanc here and Sauvignon Blanc there, but they have different characteristics; for example, this one is 2017, dry, and 12, and this one is dry and 13. So I'm going to go out on a ledge and say that we also need an ID here for each of these combinations. Okay, so
73:30 - 74:00 that's wine. Now, the next is going to be vineyard. It seems like, if you look at our store IDs, you have store IDs 1, 2, and 1 in the same vineyard, so I'm going to guess that store ID isn't something specifically tied to vineyard. The vineyard here is Stormy Bay, but then right here we have a store ID of 1 that's tied to James Tree as well, so I don't think there's a relationship there. But there does seem to
74:00 - 74:30 be a relationship between vineyard and region: everything that's Stormy Bay is in Gisborne, and everything that's James Tree is in Hawks Bay. So I think we can say that for vineyard we're going to have a table, and the relationship there is going to be between the vineyard name and the region, and that's going to be a composite primary key, because every single combination is unique: we have Stormy Bay and Gisborne, James Tree and Hawks Bay, and we have Gladstone. So I guess we don't
74:30 - 75:00 really need to worry about keys much there. Next we need to deal with the stores. I'm trying to see what the store ID is actually going to be tied to. Store ID 1 seems like it's in Stormy Bay, but then we also see store ID 1 with James Tree, so it's not attached to the vineyard, as we said. But there does seem to be a relationship
75:00 - 75:30 between wines, store IDs, and stock quantities. So what I want to do is use the wine ID: we're just going to create a store-wine table, and in it we're going to have a store ID, we're going to have a wine ID, and then we're going to have our stock quantity. And in this case the
75:30 - 76:00 combination of store ID and wine ID should be able to function as a primary key, because we don't really have a case in which we have the same wine in the same store twice. Even these two right here, our Sauvignon Blancs, have all the same characteristics except for the store ID. So I think these two together can function as the primary key, and we're just
76:00 - 76:30 going to have an ID right there. I'm sure there's something else that could function as a primary key there, but I think we're going to stick with that, and I think we've accounted for all the different attributes. So let's look at the mark scheme and see how we did. Yeah, I think we did pretty well: wine ID, wine, vineyard... okay, that's actually perfect. Awesome. So those are four great examples of different types of
76:30 - 77:00 normalization problems. My logic didn't exactly follow the steps that I gave you, but as you can see, you really just need to reason through them. For example, if you go back through the problems, the first thing was always separating the data into entities, and then, once we separated it into entities, we had to make sure we had primary keys for each table, but we
77:00 - 77:30 also needed to make sure that we never had tables where a set of information was repeating, so that every single table had only unique rows; that involves a little bit of negotiating between this attribute and that attribute. So I think a combination of dividing into entities and that uniqueness, coupled with having a primary key on every single table, is really what helps you solve these problems. So if I were to
77:30 - 78:00 summarize, I would say: entities, then unique rows of data, then primary keys, and that will get you ninety percent of the way there. Okay, one last thing I do want to say: all of the questions that we went through here I normalized to third normal form.
78:00 - 78:30 There may be some questions that ask you to normalize to second normal form, but if you normalize to the third normal form, that by definition means you've already normalized to the second normal form. So just doing what we did on any of these problems is most likely going to work for you; I can't think of a situation in which it wouldn't, and that's the caveat I want to leave you with. Now we're going to go on to talking about the advantages of normalization. So we
78:30 - 79:00 spent all this time working on normalizing databases, and now I want to talk specifically about why we would do that. First, if you remember, when normalizing we were trying to make sure that in every single table every row was a unique row of information. That means we're essentially eliminating as much redundancy as possible, which means less data storage is required to hold the information in our new, smaller tables.
79:00 - 79:30 That's probably one of the biggest advantages. The second is that data is more likely to be consistent, meaning it's not going to be outdated, and there's not going to be duplicate information where one copy has been updated and the other hasn't. Because there is less redundancy, the same information isn't scattered around different places in your database, and therefore all those places don't need to be updated: you're
79:30 - 80:00 most likely only going to have a specific piece of data once in your database, and that's all that needs to be updated. Next, we can also say that we would have increased data security, and this really comes down to the redundancy question as well: we know where every piece of data is, we know it's generally located in only one place, and knowing the location of those pieces of data makes them easier to protect. There's not going to be a piece of data somewhere we didn't know about that isn't adequately protected, whether by encryption or by access rights or
80:00 - 80:30 whatever else. Next, updates and complex operations can be conducted more quickly and efficiently due to the table structure. This has a lot to do with the tables being smaller and information being easier to find, which directly leads to being able to write simpler queries and to conduct operations more quickly and in a more targeted way. And finally, the last advantage is the
80:30 - 81:00 fact that tables are more logically organized, so a database is easier to understand. This is important when you're working with multiple developers, or if you're trying to explain your database schema to a team of business analysts or non-technical staff: having tables logically organized makes a database overall easier to understand, and the easiest way to get to that level of logical organization is through the normalization process that
81:00 - 81:30 we just went over. Now, one last connected concept we need to talk about is anomalies. Anomalies are directly connected to normalization to the second normal form: when working to second normal form, we want to make sure there are primary keys on every table and that we don't have any partial dependence. Now, if we don't normalize according to
81:30 - 82:00 the second normal form, we have three possible anomalies, which are basically problems that can arise. The first is the insertion anomaly: this is when a row can't be inserted due to a missing foreign key, so you've got all the data you care about, but you don't have the value for a foreign key that connects it to another table, and you just can't insert that data. The second is the deletion anomaly: this is when you want to delete one row,
82:00 - 82:30 but all the attributes for a specific entity are lost when you do that. And the third is the update anomaly: this is when data is only partially updated in a database. Let's look at some examples to actually highlight what these mean, because those are just the dictionary definitions, and I want to show you in more depth what each of these anomalies looks like. Right here we have a database of teachers and the departments in which they work, along with their phone numbers.
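The update anomaly on a denormalized table like this one can be sketched with sqlite3. The teacher names (other than Flood) and the phone numbers are illustrative stand-ins for the table on screen.

```python
import sqlite3

# One denormalized table: the department's phone number is stored once
# per teacher, so a department-wide change must touch every copy.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE teacher_dept (
    teacher    TEXT PRIMARY KEY,
    department TEXT,
    dept_phone TEXT
)""")
cur.executemany("INSERT INTO teacher_dept VALUES (?, ?, ?)",
                [('Smith', 'English', '301'),
                 ('Patel', 'English', '301'),
                 ('Flood', 'Geography', '305')])
# Update only one of the English rows -- a partial update:
cur.execute("UPDATE teacher_dept SET dept_phone = '307' WHERE teacher = 'Smith'")
phones = {r[0] for r in cur.execute(
    "SELECT dept_phone FROM teacher_dept WHERE department = 'English'")}
print(phones)  # {'301', '307'}: two conflicting phone numbers for one department
# Likewise, deleting Flood erases all trace of Geography (deletion anomaly),
# and no department can be recorded without a teacher (insertion anomaly).
```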
82:30 - 83:00 an insertion anomaly in this case is connected to the fact that we can't add a new Department without also adding a member of Staff right and this isn't necessarily connected to foreign key specifically but if you go back to our definition we basically say that we can't um basically an insert anomaly is when a girl cannot be inserted due to missing data or without some piece of data that may not otherwise be necessary and here there is literally no way for
83:00 - 83:30 us to add a new Department without adding a new teacher even if we have no desire to add a new teacher that department at all I mean not that we're going to have empty departments but every time we want to create a new Department I don't know that it makes sense that to add a new teacher especially when we could be adding existing teachers so that's an example of insertion anomaly I think it just kind of helps better understand what an insertion anomaly is going to go ahead and check that now next a deletion anomaly so for example if you want to delete a Betty
83:30 - 84:00 Flood from the table um if we delete that record with Betty Flood then we're also getting rid of any information connected to the geography department and part of the reason for this is because there's no separate table for geography which we would probably create if we were doing normalization this is an example of a deletion anomaly we really just want to delete Betty Flood but in doing so we also delete the entire geography department and any information associated with it like the department ID or the phone number
84:00 - 84:30 and finally um an update or not what the update anomaly is related to the fact that if the phone number for The English Department changed to 307 instead of 301 we have to change it for three different records so the English Department right here you've got 301 right here and 301 right here so if you want to change the phone number we'd have to we'd have to change it not only here but also right here versus if we had another table for phone numbers which was connected to this table via a foreign key then we
84:30 - 85:00 would really just need to update it once in the table for phone numbers and then that's what we would access when we try to access our foreign key in this column so these explanations are maybe in the previous slide maybe a bit theoretical but I hope that these examples really made it clear what we're talking about when referring to each of these anomalies and that is pretty relevant to option A because you will have questions where it sort of asks you to Define what these anomalies are or to talk about where and how these could occur in a given table
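The normalized fix being described — departments in their own table, referenced by a foreign key — can be sketched with SQLite. All table and column names here are hypothetical, invented for illustration:

```python
import sqlite3

# Hypothetical normalized schema: department data lives in its own table,
# and each teacher references it through a foreign key.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
cur.execute("CREATE TABLE teacher (teacher_id INTEGER PRIMARY KEY, name TEXT, "
            "dept_id INTEGER REFERENCES department(dept_id))")
cur.execute("INSERT INTO department VALUES (1, 'English', '301')")
cur.executemany("INSERT INTO teacher VALUES (?, ?, ?)",
                [(1, 'Smith', 1), (2, 'Jones', 1), (3, 'Flood', 1)])

# No update anomaly: the phone number is stored exactly once, so a single
# UPDATE suffices no matter how many teachers belong to the department.
cur.execute("UPDATE department SET phone = '307' WHERE name = 'English'")

# No deletion anomaly: removing a teacher leaves the department row intact.
cur.execute("DELETE FROM teacher WHERE name = 'Flood'")

print(cur.execute("SELECT phone FROM department WHERE name = 'English'").fetchone()[0])
# -> 307
```

In the unnormalized single-table version, that same phone-number change would have to touch three rows, and deleting the last teacher in a department would erase the department's information along with them.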
85:00 - 85:30 Now the next thing I want to talk about is database administrators. A database administrator (DBA) is a person — database administration is a job — and the role of a database administrator is to ensure that the data in a database is performant, meaning it can be accessed efficiently; secure, meaning only users with the right access rights can reach specific data; and recoverable, so
85:30 - 86:00 that if anything goes wrong with the database due to external factors, there is a way to recover all of its data. To be more specific, the database administrator is responsible for updating the database: adding or updating data, relations, fields, and tables. They may not personally update the data — that's usually up to whatever application uses the database — but they may create new relations, fields, and tables, and if that's done in an automated
86:00 - 86:30 way, they will at least test that those relations, fields, and tables are functional and operate efficiently. The next role, as mentioned before, is maintaining security: assigning access levels and passwords to users, making sure certain users can only access certain parts of the database. Next is managing backup procedures. This might be as simple as setting up a
86:30 - 87:00 system that backs up the database every night at midnight, so that if something were to go wrong, there would be a relatively up-to-date copy we could slide right in to keep the system functioning. Although backing up only once a day is probably a problem — once every hour, or even once every minute, may be needed depending on how many transactions there are. This is closely connected to establishing a recovery plan for the database in case of a disaster
87:00 - 87:30 or malfunction. The recovery plan might include swapping out the current database for the most recently backed-up version; there needs to be a plan or protocol for that to happen, and that's the responsibility of the database administrator. Now, one thing the database administrator has access to is commands in a data definition language (DDL). When we looked at SQL commands before,
87:30 - 88:00 they were to update specific values in a table, to create new rows, to delete rows — basically to change the data itself. DDL commands are things like CREATE, ALTER, DROP, TRUNCATE, and RENAME. These are all commands in SQL, the same language we use to update our tables or add information, but they're generally only available to DBAs, and rather than directly affecting the data in the database, they allow us to define and
88:00 - 88:30 modify the structure of the database itself — its structure and metadata — defining what the database looks like and how it functions at a more fundamental level. The data definition language is also used to generate the data dictionary. If you remember, the data dictionary is the set of metadata that tells us which tables exist in the database, what data types the attributes of those tables hold, and how the different tables in the
88:30 - 89:00 database are related to each other, among many other things. To conclude: the data definition language lets us generate a data dictionary, and gives us specific commands — part of SQL — to define or modify the structure of a database and the tables within it. Okay, the next concept we're going to talk about is data modeling. Data modeling is the process of creating a visual representation of all or part of an information system, and
89:00 - 89:30 what we're really talking about is planning out what a database is going to look like and how pieces of information are going to relate to each other in a database-oriented system — in the context of computer science, that's obviously going to be a software system. We do this so that all stakeholders — programmers, anyone on the business side, pretty much anyone working on the system — have an understanding of how things work. When we're doing data modeling, what we illustrate at a minimum is the types of data used,
89:30 - 90:00 the relationships between those types of data, how data is grouped and organized within the system, and the data attributes. More or less, we're creating a database schema. The three types of models we're going to look at are conceptual, logical, and physical. This goes from just thinking about which entities we have to work with and how they relate to each other, all the way to actually creating a schema with relations, foreign keys, and the
90:00 - 90:30 data types for the attributes. Let's look at an example. Here I've got three pictures — with terrible spelling, but great illustrations. The first one is conceptual: as you can see, we basically just have our different entities, and we can see how they're going to be related to each other. When we get to logical, we still have those four entities, but we also have the
90:30 - 91:00 attributes each will hold, and maybe some foreign keys, like right here, to make it clear exactly what the relationship is. Then we get to physical data modeling, where we have the most detail: the data types, all of our attributes spelled out clearly, the exact database names we want to use, and the relationships illustrated in what is maybe an even easier-to-understand way.
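A physical model at this level of detail is essentially ready to be written down as DDL — the CREATE and ALTER commands discussed a moment ago. Here is a minimal sketch using SQLite and the driver's-license entities from the earlier ERD example (table and column names are invented); SQLite's built-in sqlite_master table plays a role analogous to the data dictionary that DDL generates:

```python
import sqlite3

# A hypothetical physical data model expressed as DDL: exact table names,
# column data types, and a foreign key, as in the physical diagram.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE individual (individual_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE vehicle (
    vehicle_id INTEGER PRIMARY KEY,
    plate      TEXT NOT NULL,
    owner_id   INTEGER REFERENCES individual(individual_id))""")

# ALTER is also DDL: it changes the structure, not the row data.
cur.execute("ALTER TABLE vehicle ADD COLUMN colour TEXT")

# sqlite_master acts like a data dictionary: it records which tables exist
# and the DDL that defined them.
print([r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")])
# -> ['individual', 'vehicle']
```

Note that none of these statements touched any row of data — they only defined and modified structure, which is exactly the DBA-level role DDL plays.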
91:00 - 91:30 So in terms of sophistication, we start with conceptual and end with physical, where we basically have a full database schema. Some of the advantages of data modeling are similar to the goals of normalization in many ways, except that data modeling happens in the planning stage, while normalization is an actual process — an operation that
91:30 - 92:00 is carried out on the database itself. To come back to what data modeling accomplishes: it helps us avoid redundancy; a lack of integrity, meaning situations where our data doesn't fit the rules we set for it; and a lack of consistency, which ties back to redundancy — it means we won't have copies of the same data in different places that aren't all being updated.
92:00 - 92:30 And once we've modeled the data, it is much easier for the database administrator or a developer to actually create the database — to write the code that generates it. To summarize: a lack of modeling may lead to a structurally deficient database; you might run into a lot of problems down the line that you didn't expect, and at the end of the day you may have to make massive and really inconvenient changes to your database to rectify, or
92:30 - 93:00 rather to deal with the consequences of, a lack of data modeling. Now, one way to model data, particularly at the conceptual stage, is to use entity-relationship diagrams, or ERDs for short. There are many different ways to draw them; the method right here is the one I prefer, and it comes directly from the IB exam. Here we want to draw a diagram to illustrate the following situation: each individual may
93:00 - 93:30 only hold one driver's license, so there's a single arrow between license and individual, with some helpful text. Each individual may own more than one vehicle, hence this arrow right here with multiple arrowheads pointing to vehicles. And each vehicle may be owned by one individual only — if multiple individuals could hold multiple vehicles, we would probably have multiple arrows there as well. That being said, let's look at an example
93:30 - 94:00 to give you a better idea of how this works and how you draw these diagrams. The question reads: taxonomy is the part of science that focuses on naming and classifying or grouping organisms — I definitely trust the IB staff to come up with an example like this. The Swedish naturalist Carl Linnaeus developed the current hierarchical system of taxonomic classification in the 1700s. The classification is hierarchical, so one class has many
94:00 - 94:30 orders and one order has many families. Birds are classified in this hierarchical taxonomy as the class Aves; in one system of classification, Aves has 23 orders, one of which is the order Ciconiiformes, with six families within it. Oh my god. Construct an entity-relationship diagram that shows the relationship between class, order, and family. The details here don't
94:30 - 95:00 really matter — it's just an example. What you need to pay attention to is: one class has many orders, so we just draw a box to say one class has many orders, and one order has many families. Rather than "orders" we can just write "order" there, and honestly I would also write "has many" on the arrow to make it really clear what the relationship is. That would be a valid entity-relationship diagram. Now what we're going to do is
95:00 - 95:30 look at the mark scheme, which uses another style of entity-relationship diagram. It says: award one mark for each correct relationship with labels, and award one mark for a top-down diagram showing the relationships — basically, you have to illustrate this. Looking at it, it says one class contains n
95:30 - 96:00 orders — n could be any number, so that means many — and one order contains n families. This looks different from the ERD we just drew, but it's just as valid, and it illustrates the concepts in a perfectly acceptable way. The one thing I'd say is: don't get tripped up by this ridiculously — not complex, but pretentious — question. I know there are a lot of big words and it seems super complicated, but this is just a classic
96:00 - 96:30 reading-comprehension exercise; you just need to drill down to the relationship you're looking for. So now we're going to take a look at the HL curriculum. Luckily, the HL curriculum doesn't involve the same level of problem solving — no complex problems like normalization or even ERDs; a lot of it is just memorizing a collection of disparate facts. That being said, let's go ahead and
96:30 - 97:00 get started. The first topic we're going to cover is object-oriented databases. These are another type of database, similar to relational databases, except that here data is stored as objects. An object is basically a construct: a group of data, together with functions that can be used to access or manipulate that data, all in one self-contained unit. This concept of objects
97:00 - 97:30 comes directly from object-oriented programming. The point is, there are no tables, no rows, no columns: every piece of data is an object, and in that object we have the information corresponding to a given entity. For example, if we're trying to store data on students, each student object might hold their current grade, their name, their ID,
97:30 - 98:00 and attached to the object we may also have functions, like a getFinalGrade method — for which, I suppose, we'd also need a final-grade attribute. The point is that data and functionality are combined into one package. So for storing students,
98:00 - 98:30 we're going to have a bunch of different objects, stored in memory or on disk, each corresponding to a specific student, and those can be related to other types of objects. For example, we could have a student object related to a classroom object — or rather, to one specific class object.
98:30 - 99:00 So one group of students may be related to one class object, another group to another class object, and so forth. There's no SQL: data is manipulated through methods, and those methods generally correspond to an object-oriented programming language like Java, C++, etc. The template for each of these objects in an object-oriented database is called a class. This is the equivalent of a schema —
99:00 - 99:30 like a table schema in relational databases. Basically, a class specifies what data can be stored in an object, what data types it will have, and anything else that defines what each object looks like. Moreover, we can combine classes: if we have a class for students and a class for IB candidates, we can
99:30 - 100:00 inherit the functionality from students into IB candidates, giving us a class that encompasses the functionality of both — we could call it IBStudents. So not only can we use classes to create objects, we can also combine classes to get more complex objects merging the functionality of more than one class: that's called inheritance. We can
100:00 - 100:30 also use a concept called encapsulation, and other object-oriented concepts, to specify what functionality is inherited from one class to the next. I'll stop there before getting too far into object orientation — that was just a high-level overview of what object-oriented databases are. This is heavily related to object-oriented programming, which could take up a whole other video by itself; in fact, it's the focus of topic D, which is why I don't
100:30 - 101:00 understand why it's here at all — but anyway, you should know it. So let's go over some advantages and disadvantages of object-oriented databases versus relational databases. The first is that we can store a larger range of data types in objects than we can in relational tables: text, numbers, pictures, audio, video — pretty much whatever we want, within reason — whereas, as we saw earlier on the SL side,
101:00 - 101:30 relational databases restrict us to specific data types. Moreover, for complex data structures and relationships — for example, an entity that might take four or five tables to describe in a relational database — we can represent everything in one object using a suitable class, and therefore we tend to get better overall database performance, meaning we can access
101:30 - 102:00 complex data structures more easily, quickly, and efficiently with object-oriented databases than with relational databases. We also have the concept of reusability. Remember that we can inherit functionality from one class into another to create a totally different object: once we create classes, we can use their code over and over again in different contexts, so
102:00 - 102:30 we can define objects and build complex data structures far more quickly than in a relational database, where we would need to add columns or maybe split a table into several tables — a lot more legwork to create new, complex structures in an existing database. And finally, object-oriented databases tend to represent real-world entities more accurately,
102:30 - 103:00 because, for example, if we have a person, we can easily specify a whole set of attributes corresponding to that person — eye color, weight, height, hair color, any number of characteristics — and we can also have functions that take whatever characteristics are there and combine them to derive some further insight. So objects in
103:00 - 103:30 general can represent real-world entities more flexibly than relational databases can, where — it's obviously not an Excel spreadsheet, but it's very much a case of "you have these fixed fields and that's all you get" — whereas with objects we can also inherit from other classes and build something really complex out of a combination of classes quite
103:30 - 104:00 quickly. There are also disadvantages to object-oriented databases. When we're working with simple data and relationships that aren't too complex, object-oriented databases actually tend to be less efficient than relational ones. Relational databases can also be simpler, because the row-and-column model is quite easy to wrap your head around — far easier to understand than object orientation, which can be difficult, particularly for non-programmers. We also have more tools for working with
104:00 - 104:30 relational databases: there's a larger community surrounding them, which means more options for using them and building on them. Standards and support for relational databases are also more stable, so changes to a database are less likely to be required. This ties back to the earlier point — we won't face any dramatic updates, and there's plenty of support, so we know
104:30 - 105:00 what we'll be working with when we start a project, and we don't need to worry about changing things later to meet a shifting standard or a quirk of the particular DBMS we're using. Object-oriented databases are also less secure: we don't have the formal access-rights capability, or even the backup capability, that's built into the DBMSs we use with relational
105:00 - 105:30 databases. Finally, there's the double-edged sword of object-oriented databases: they can be whatever you want. You can have multiple classes and different attributes — a lot of flexibility — but it also means you don't have a concrete data model in the way a relational database with its rows and columns does, and without a universally agreed-upon data model you can just end up with a mess.
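The class-and-inheritance idea can be sketched in plain Python — the attribute and method names here are invented for illustration, and a real object-oriented database would add persistence (saving these objects to disk) on top of classes like these:

```python
# Minimal sketch of data and behaviour bundled into one unit, plus inheritance.
class Student:
    def __init__(self, student_id, name, grades):
        self.student_id = student_id   # data ...
        self.name = name
        self.grades = grades

    def final_grade(self):             # ... and a method, in the same package
        return round(sum(self.grades) / len(self.grades))

class IBStudent(Student):              # inheritance: reuses Student wholesale
    def __init__(self, student_id, name, grades, candidate_number):
        super().__init__(student_id, name, grades)
        self.candidate_number = candidate_number

s = IBStudent(1, "Ada", [6, 7, 7], "hjk123")
print(s.final_grade())
# -> 7
```

Nothing here looks like a row or a column: the "schema" is the class definition itself, and IBStudent extends Student without any table-splitting legwork.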
105:30 - 106:00 So relational databases tend to be more rigid, but also more standardized and easier to understand. Now, the next concept we're going to talk about is the data warehouse. A data warehouse is similar to a database but operates on a much larger scale: while still relational, it contains much larger quantities of data, and it's often used by large organizations and businesses for data mining and analysis. So a data warehouse is a much larger
106:00 - 106:30 version of a database, used by large organizations and holding far larger quantities of data. Its main function is to take data from many different sources, combine it using a process called ETL, and put it into a central data-storage repository — the data warehouse — where it can then be analyzed to provide insight. That insight can be useful for marketing, for political campaigning, for fraud detection, or for anything that
106:30 - 107:00 requires very large data sets. Let's go ahead and look at some of the advantages, to get a better idea of why we'd use one and how it differs from a database. First, large amounts of data: we get better performance than a database because the system is optimized for quick retrieval and efficient analysis of large data sets. The ETL system, which we'll cover in the next few slides, is quite quick at consolidating
107:00 - 107:30 data from multiple sources and transforming it into a standardized format for easy and fast querying. This means we can take data from many different places — going back to the previous slide, that could be a CRM system, an ERP system (these acronyms refer to different business software systems), a billing system, or just flat files — and feed them all into our ETL system, which takes
107:30 - 108:00 everything and puts it into one standardized data format that's easy for the organization to access in one place, rather than going to each of those systems individually, each with its own format for displaying and storing data. And we can do that quickly. Standardization of data goes along with this: an ETL system is constantly consolidating and transforming the data, which is why we talk about timely access.
108:00 - 108:30 Finally, we have access to historical data. A database generally focuses on the data used day to day, and historical data is less important; but the whole object of data warehousing is to store data from as far back as possible, so you can use it to gain insights and make predictions about the future. Now, one of the big concepts we mentioned was ETL — the process by which data from multiple sources is collected, processed, and sent
108:30 - 109:00 to the data warehouse. It's generally a software system in its own right, with three parts. Extraction is when we actually get the data from those different software systems. This could be structured data — data organized in a particular way — or unstructured data, perhaps just a pile of numbers. The whole process of importing data, pulling it into the ETL system,
109:00 - 109:30 is extraction. Transformation is when we take all the data we've gathered from those systems, standardize it, and make sure it has an appropriate level of quality and accessibility. Examples of what happens during transformation: deduplication, removing duplicate data; cleansing, removing inconsistent or missing values that would throw off any analysis; and standardizing, making sure all values follow the same rules and
109:30 - 110:00 share the same structure. The final part of the ETL system is loading: actually transmitting the data we've extracted and cleaned up to the data warehouse. ETL is the middleman between the multiple software systems and the data warehouse. Now, finally, the main differences between a data warehouse and a database. In a data warehouse we have mainly
110:00 - 110:30 historical data; in a database, current data — data relevant to the day-to-day operations of a software system. In a data warehouse we're typically dealing with terabytes of data; in a database, gigabytes, although it could be larger, and you could have multiple database systems working together — though this difference isn't as significant as you might think. In a data warehouse, data is maintained over time, and timeliness is not as important, because
110:30 - 111:00 we don't have users continually hitting it — it's mainly used for data analysis every now and then. In a database, however, the data is kept up to date, because if we're talking about, say, Instagram's database, we're continually hitting it to fetch and update data. With a data warehouse, users mostly have read-only access, whereas with a database they can both read and write; the writing to a data warehouse is done through ETL — we're
111:00 - 111:30 not really going to have users writing directly to it. (Also, this slide should actually be flipped right here.) So: a data warehouse is used for making decisions, while a database is used for everyday transactions. A data warehouse is optimized for complex queries across large data sets, whereas a database is faster for simple queries like create, read, update, and delete. For example, if we're just creating an
111:30 - 112:00 account on Facebook, a database is the answer; but if we want to look at all Facebook users and divide them into specific groups based on their behavior, we want a data warehouse. And finally, a data warehouse has a relatively small number of users — more data, but fewer people accessing it to analyze it — while a database generally has many more users; again, think of a system like Instagram or Facebook or
112:00 - 112:30 something like that. Now, the main thing we're going to do with a data warehouse is something called data mining. Data mining is a somewhat nebulous term; it really just means analyzing a large amount of data to find patterns or relationships within it, and we're going to look at five different data mining methods you should be aware of. As for the pros and cons of data mining: it helps us
112:30 - 113:00 understand behaviors, trends, and hidden patterns; it lets us detect fraud or risk; and data mining operations like the five just mentioned can analyze large amounts of data very quickly — much more quickly than trying to use an Excel spreadsheet. The cons: insights gleaned through data mining use personal information and can reveal things that compromise privacy or anonymity; data mining can be very expensive due to the amount of computing
113:00 - 113:30 power and data required, which leads to the next point: data mining requires large databases or large data sets, and storing, maintaining, and even paying for that amount of data can present its own problems. Now, the first type of data mining is cluster analysis.
113:30 - 114:00 Cluster analysis looks at data and puts it into groups. It isn't given any categories to look for, so it's unsupervised learning: it just takes a data set and, based on particular attributes of the data points — which could be, for example, Netflix users or Tinder users — sorts them into groups. As you can see here, because it uses
114:00 - 114:30 unsupervised learning, no prior input or training of the cluster analysis algorithm is required; it simply does its thing. If you haven't seen the Paper 3 video yet, I'd recommend watching it: this isn't normally the case, but a lot of the concepts in Paper 3 match up with the HL concepts in Option A. This option has had the same concepts for the last six or seven years, so it's lucky that the overlap exists. Some usage examples include finding
114:30 - 115:00 a demographic of customers with similar spending habits, or finding Netflix viewers with similar watching habits and recommending content to them. So cluster analysis simply divides our data points into groups.
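To make that concrete, here is a minimal k-means-style clustering sketch in Python. This is only an illustration of the unsupervised idea, not an algorithm from the video, and the "user activity" data points are invented:

```python
# Minimal k-means clustering sketch on made-up 2D "user activity" data.
# There are two obvious groups: low-activity and high-activity users.
points = [(1, 2), (1, 1), (2, 2), (2, 1),      # low-activity users
          (9, 9), (9, 10), (10, 9), (10, 10)]  # high-activity users

def kmeans(points, k, iters=10):
    centroids = list(points[:k])  # naive init: use the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to the nearest centroid (squared distance)
            nearest = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                                + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                     if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return clusters

groups = kmeans(points, k=2)
```

No labels are supplied, yet the algorithm still separates the two groups — that is the unsupervised part.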
115:00 - 115:30 Okay, the next type of data mining we're going to cover is classification. Classification is similar to cluster analysis, but it uses supervised learning. The point of classification is to let us predict some factor from given data. In the example here, based on some data about a professor, we want to predict whether or not they are tenured. With classification, we get some training data, and that data
115:30 - 116:00 includes both the information about each professor and whether they're tenured. We train our algorithm on that data, and afterwards the algorithm should be able to take only a professor's data and predict whether they have been tenured, effectively classifying them into two groups: tenured and not tenured. In short, new data is added to the model and compared against the outcomes predicted from the training data. The training data is like teaching a
116:00 - 116:30 student how to solve problems; then we give the student problems they haven't seen before and see if they can solve them. Again, this is something we covered in Paper 3, and my explanations there are probably better than this one, since it was the focus there and here I'm just trying to be as concise as possible. That's classification.
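As a sketch of the supervised idea, here is a toy 1-nearest-neighbour classifier in Python. The professor data (years at the university, papers published) and the tenure labels are invented for illustration — this is not the model from the video:

```python
# Invented, labeled training data: (years_at_university, papers_published) -> tenured?
training = [((2, 3), False), ((3, 1), False), ((4, 2), False),
            ((8, 12), True), ((10, 9), True), ((12, 15), True)]

def classify(sample):
    """Predict a label for `sample` by copying its nearest training example."""
    _, label = min(training, key=lambda item: (item[0][0] - sample[0]) ** 2
                                            + (item[0][1] - sample[1]) ** 2)
    return label
```

A new professor's record is compared against the labeled training data and classified as tenured or not — the "problems the student hasn't seen before" from the analogy.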
116:30 - 117:00 Now, comparing cluster analysis and classification: cluster analysis uses unsupervised learning, while classification uses supervised learning. With cluster analysis we feed in data with no information attached, so it's unlabeled; with classification the data is labeled — in our example, each record carries whether the professor is tenured — so we're telling the algorithm exactly what we want it to predict, whereas with cluster analysis we just hand over the data and see what it comes up with. The third point speaks to that: with cluster analysis we have no
117:00 - 117:30 prior knowledge of relationships in the data, while with classification we understand the relationship. In fact, that's exactly what we're asking the algorithm to exploit: to learn the relationship between the professors' data and their tenure status, and then use it to make future predictions. Both methods output groups of data points: cluster analysis gives us groups without telling us their significance, while with
117:30 - 118:00 classification we split the data into predetermined groups, as set out in the training data. The next type of data mining is association analysis. Association analysis breaks data sets up by variables, and we want to find relationships between those particular variables in the data. For example, here we have an association rules network, and this is
118:00 - 118:30 basically the output of an association analysis. Looking at the data here, we can say that someone who plays football almost always plays basketball, and the strength of each association is shown by its line. Most of these links aren't very strong, except for the black one here: based on the data we have, someone who plays baseball probably plays football as well. So with association analysis we are
118:30 - 119:00 finding relationships between different variables — or, in a database context, different fields. We're specifically looking for the dependency of one item on another. This is unsupervised learning: we give the algorithm all the data for those fields, but no training data and no information about previous relationships between particular factors, and we just see what it finds.
119:00 - 119:30 Examples of using association analysis would be doctors seeing which symptoms often lead to which diseases, or retailers seeing which factors lead to purchases: given complete data on our customers, we can see what associations exist between specific variables and the type or amount of product particular customers have bought.
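The rule-strength idea can be sketched as a confidence calculation. The survey data below is hypothetical, not the chart from the video:

```python
# Hypothetical survey data: the set of sports each respondent plays.
records = [{"football", "basketball"},
           {"football", "baseball"},
           {"baseball", "football"},
           {"baseball"},
           {"basketball"}]

def confidence(a, b):
    """Confidence of the rule a -> b: the fraction of records with a that also have b."""
    has_a = [r for r in records if a in r]
    return sum(1 for r in has_a if b in r) / len(has_a)
```

Here `confidence("baseball", "football")` comes out at 2/3 — a fairly strong dependency of football on baseball, like the thick black link in the network diagram.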
119:30 - 120:00 Next we have link analysis. I've most commonly seen link analysis in a criminal or national-security context: if you've seen films with webs of suspects who are all somehow connected, that's link analysis. Link analysis establishes relationships between different entities in the same data set. The algorithm begins by deciding what constitutes a link — what
120:00 - 120:30 commonality between any two entities counts as a connection. Once it has done that, it decides which entities meet the criteria for those links, which do not, and to what extent: the links themselves carry weights showing whether each connection is strong or weak. Each of these entities — we've been calling
120:30 - 121:00 them entities, but in the link analysis world they're referred to as nodes — has data attached to it, and we analyze that data to see how the nodes are connected and what constitutes a connection. Once we've worked out a set of connections between the nodes, we check whether each one meets the threshold for a sufficiently strong relationship.
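A minimal sketch of that weighting-and-threshold step, using invented call-log data (the suspects and call counts are made up):

```python
from collections import Counter

# Hypothetical call logs between suspects: (caller, callee) pairs.
calls = [("A", "B"), ("A", "B"), ("A", "B"),
         ("B", "C"), ("A", "C"),
         ("C", "D"), ("C", "D"), ("C", "D"), ("C", "D")]

def strong_links(calls, threshold):
    """Weight each undirected pair by call count; keep links at or above threshold."""
    weights = Counter(frozenset(pair) for pair in calls)
    return {tuple(sorted(pair)): w for pair, w in weights.items() if w >= threshold}

links = strong_links(calls, threshold=3)
```

With a threshold of 3, only the A–B and C–D links survive, suggesting those nodes sit close together in the network's hierarchy.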
121:00 - 121:30 A great example is using cell phone data to determine the hierarchy in a criminal network. The last method we're going to go through is deviation detection. This one is fairly easy: it's unsupervised learning in which we analyze a body of data and look for whatever stands out — anomalous patterns or significant deviations from the norm. The graph here of Russia is a good example: it plots
121:30 - 122:00 the percentage of votes for the winner against voter turnout. There's an outlier cluster — I'd guess this is broken down by district, though I'm not certain how the data is organized — where turnout is one hundred percent and every single vote goes to the winner: wherever there's extremely high turnout, everyone is voting for the winner. Presumably those votes were bought, and it's not only in Russia that you see this, but anyway.
122:00 - 122:30 The point of this graph is that we're specifically looking for anomalies, and that's the point of deviation detection. Some examples of using it might be spotting an unusual switch in pre-electoral voting opinions, detecting anomalies in voting data, or catching a sudden pro- or anti-candidate swing in a particular state: when something deviates from the norm, whoever is campaigning can take a look and try to figure out what happened.
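A tiny sketch of deviation detection using z-scores. The per-district numbers are invented, not taken from the graph shown:

```python
from statistics import mean, pstdev

# Invented per-district election data: (turnout %, winner's vote share %).
districts = [(62, 54), (58, 51), (65, 57), (60, 53), (63, 55), (100, 100)]

def outliers(data, cutoff=2.0):
    """Flag points more than `cutoff` standard deviations from the mean on either axis."""
    mx, my = mean(d[0] for d in data), mean(d[1] for d in data)
    sx, sy = pstdev(d[0] for d in data), pstdev(d[1] for d in data)
    return [d for d in data
            if abs(d[0] - mx) / sx > cutoff or abs(d[1] - my) / sy > cutoff]
```

Only the suspicious 100%-turnout, 100%-for-the-winner district stands out — the kind of anomaly deviation detection is meant to surface.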
122:30 - 123:00 At this point we're done with data mining, and there are two small concepts left that somewhat stand on their own but still appear on the HL exam. The first is spatial databases. Spatial databases are relational databases, but they specifically handle data related to locations, geometric figures, or space in some way. That means they hold points or coordinates, particular polygons,
123:00 - 123:30 3D shapes, or geographic coordinates, and we can use the data in a spatial database to model the structure of geometric or 3D objects, or something on a map. The database shown here is a spatial database because it stores latitude and longitude. There are also specific RDBMSs — relational database management systems — with
123:30 - 124:00 functionality for spatial analysis: they provide functions that let us work with coordinates or groups of coordinates, and they may be tailored in other ways to store data relating to space, geometry, or location more optimally.
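To give a flavor of what those spatial functions do, here is a pure-Python sketch of a "find everything within N km" query. The place names and coordinates are made up (they're roughly London landmarks), and a real spatial RDBMS would provide this as a built-in:

```python
from math import radians, sin, cos, asin, sqrt

# Made-up points of interest: name -> (latitude, longitude).
places = {"cafe": (51.5007, -0.1246),
          "museum": (51.4967, -0.1764),
          "airport": (51.4700, -0.4543)}

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))  # 6371 km: mean Earth radius

def within_km(origin, limit):
    """Return the names of all places within `limit` km of `origin`."""
    return [name for name, loc in places.items()
            if haversine_km(origin, loc) <= limit]

nearby = within_km((51.5007, -0.1246), 5)
```

Querying within 5 km of the cafe returns the cafe and the museum but not the distant airport.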
124:00 - 124:30 The last concept we're going to cover is data segmentation, which is a slightly odd fit here — we're jumping from machine learning to simply dividing data — but in any case, data segmentation is the process of taking your data and dividing it by certain parameters. For example, you might take the data on everyone who has bought something in your store and divide it by age, to see what people from each age group are buying and which groups to target.
124:30 - 125:00 We might segment our data demographically, by attributes of the people; psychographically, by their psychology; behaviorally, by what they're doing; or geographically, by where they're located. These are some of the things you can do with data segmentation: create ideal buyer personas, improve your product, target your marketing to a specific group, and analyze your reviews.
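The age-group example can be sketched directly. The purchase records and bracket boundaries below are invented:

```python
# Invented purchase records: (customer_age, amount_spent).
purchases = [(17, 20.0), (19, 35.5), (24, 12.0), (31, 80.0), (45, 55.0), (16, 5.0)]

def segment_by_age(purchases):
    """Bucket each record into a demographic segment and total the spend per segment."""
    segments = {}
    for age, amount in purchases:
        bracket = "under 18" if age < 18 else "18-29" if age < 30 else "30+"
        segments.setdefault(bracket, []).append(amount)
    return {bracket: sum(amounts) for bracket, amounts in segments.items()}

totals = segment_by_age(purchases)
```

The totals per bracket immediately show which age group spends the most — the kind of view a marketer would use to decide which segment to target.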
125:00 - 125:30 For example, you might analyze reviews by demographic — by age group or by where the reviewers are from — and then use that, again, to target those customers more effectively. Data segmentation doesn't just apply to marketing; it applies to all sorts of use cases, but marketing is an ideal way to illustrate it. That said, this is the end of Option A. I will say this has been one of the more difficult videos I have created,
125:30 - 126:00 just because of the extent and complexity of some of the concepts involved. I hope you found value in it. There is a study guide linked in the description that you can purchase: it takes all of this material and puts it into a lean, easy-to-understand package that's great for revising, and it also helps support the channel and the work I'm doing. Beyond that, if you liked this video, please like it on YouTube, subscribe to my channel,
126:00 - 126:30 and leave a comment as well — I love hearing your feedback. You can also join the Discord if you have any questions. Other than that, have a nice day.