#1 How to Pass Exam DP-203 Azure Data Engineer Associate in 18 hours | Part 01 Storage Accounts
Summary
In this comprehensive guide on Azure Data Engineer Associate Exam DP-203, the E Learning Free Channel dives into the intricacies of Azure Blob Storage, explaining its suitability for unstructured data, access methods, and benefits such as high availability and encryption. Different components of Blob Storage, like storage accounts, containers, and blob types, are discussed alongside performance tiers, data encryption methods, and access controls including Azure Active Directory. Furthermore, redundancy solutions for disaster recovery are explored, detailing geo-redundancy strategies and failover processes.
Highlights
- Learn the essentials of Azure Blob Storage and its role in data engineering 📚.
- Discover how to access and utilize Blob Storage effectively 🌟.
- Explore the benefits of using Azure's cloud services for massive data storage and disaster recovery 🛠️.
- Understand the nuances of encryption and access management in Azure Blob Storage protocols 🔐.
- Delve into the performance advantages of the premium tier and how it benefits computing workloads ⚡.
- Uncover how Blob Storage supports redundancy across multiple regions for fail-safe data management 🌐.
- Get insights on handling big data solutions within Azure's infrastructure efficiently 📊.
- Master the art of designing and managing data lakes to optimize data processing and accessibility in Azure 🌊.
Key Takeaways
- Azure Blob Storage is ideal for unstructured data, offering versatility in data storage and access 🌐.
- Security is top-notch with built-in encryption and support for Azure Active Directory 🔐.
- Blob Storage can be accessed via HTTP/HTTPS and supports Azure-specific integrations 📡.
- The service can handle massive data loads, providing options for geo-redundancy and disaster recovery 🌍.
- Different performance tiers allow for cost-effective data storage solutions 🏷️.
- User authorization can be managed carefully using shared access signatures and role-based controls 🛡️.
- Planning a data lake within Azure requires strategic thinking and knowledge of various Azure services 💡.
- Big data processing is a breeze with Azure's scalable and flexible tools ⚙️.
Overview
Azure Blob Storage stands out as an indispensable tool for handling unstructured data types. Because of its flexibility and performance characteristics, it offers optimized solutions for storing data at varying access frequencies and transactional speeds. This includes encryption capabilities that ensure data remains secure, both at rest and in transit, aligning with compliance requirements for diverse business needs.
Geo-redundancy forms the backbone of disaster recovery planning within Azure Blob Storage. By replicating data to a secondary region, Azure keeps a copy available if the primary region suffers an outage. Because this replication is asynchronous, the secondary copy can lag behind the primary, and failover to the secondary region is a manual process initiated by the customer.
Planning and implementing big data solutions in Azure Data Lake Storage requires a strategic look at architecture and data governance. Leveraging Azure's suite of tools for data ingestion, processing, and analysis allows businesses to harness the full potential of their data assets, ensuring high performance and scalability. Through role-based access controls and advanced security measures, Azure maintains the integrity and accessibility of critical data.
Chapters
- 00:00 - 06:00: Introduction to Azure Blob Storage This chapter provides an introduction to Azure Blob Storage. Azure Blob Storage is a service from Microsoft's Azure cloud platform designed for object storage. It is ideal for storing unstructured data and does not have specific requirements for how data should be organized or typed. Additionally, Blob Storage is versatile in its functionality and can be accessed natively via HTTP.
- 07:00 - 19:00: Blob Storage Performance and Scalability Blob storage is accessible over HTTP and HTTPS, with support for NFS version 3.0 added as well. Being accessible over HTTP allows blob storage to serve files directly to browsers or media streams. It can host static websites and perform simple file transfers, making it versatile for uses such as backups, archiving, and logging.
- 21:00 - 28:30: Azure Storage Geo-Redundancy Azure Storage Geo-Redundancy focuses on the benefits and features of using Azure Blob Storage over a traditional file server. Key benefits include high availability options, which involve storing multiple replicas of all data within the same data center or across different regions. These features enhance file sharing, data analysis, and data service efficiency with applications like Azure Analysis Services.
- 35:00 - 46:00: Storage Account Disaster Recovery The chapter covers the topic of 'Storage Account Disaster Recovery'. It explains that storage is encrypted at rest and can be optionally encrypted during transit. The service is highly scalable, allowing for storage of up to five petabytes per account on-demand, and up to 190 petabytes upon request. As a fully managed service by Azure, it requires no additional user intervention, and Microsoft provides client access libraries for popular programming languages.
- 46:00 - 57:00: Azure Data Lake Storage Generation 2 This chapter discusses Azure Data Lake Storage Generation 2, focusing on client access and controls using various programming languages such as .NET, Java, Node.js, Python, Go, Ruby, and PHP. It explains that blob storage is divided into three components, emphasizing the importance of understanding these components for a well-designed storage infrastructure. The storage account serves as the unique namespace for accessing data, underlining the necessity of a distinct account name.
- 59:00 - 72:00: Big Data Processing with Data Lake This chapter discusses the integration of Azure Blob Storage with data processing in a Data Lake environment. It explains the organizational structure of Azure Blob Storage, where a container acts like a folder in a file system. The chapter also highlights that there is no limit to the number of containers in a storage account and outlines the three types of blobs supported by Azure Blob Storage: Block Blobs, which store text and binary data, Append Blobs, and Page Blobs.
- 72:00 - 86:00: Planning a Data Lake This chapter covers the types of data storage in a data lake, specifically focusing on block blobs, append blobs, and page blobs. Block blobs are used for storing large amounts of unstructured data, append blobs are suitable for operations like continuously writing logs since they allow appending data, and page blobs are used for storing files that require random access, such as VM disks or databases. The chapter also briefly mentions the need for authorization to access data stored in blob storage accounts.
- 86:00 - 95:00: Data Lake Storage Best Practices The chapter on 'Data Lake Storage Best Practices' discusses multiple methods for enforcing role-based access controls using Azure Active Directory for authentication. It highlights the use of Shared Access Signatures (SAS) as a means to provide delegated access to the contents of the storage account. Additionally, it explains that SAS tokens are user-created, specifying the permissions granted and their validity period.
- 96:00 - 113:00: Creating a Data Lake Generation 2 Account The chapter discusses how to create a Data Lake Generation 2 account. It explains that a token can be signed using either a shared key or an admin's active directory credentials. Shared keys are particularly used to build a connection string, which is essential for programmatically accessing the contents of the storage account through an authorization header in the request. Moreover, it is possible to set up anonymous access to enable read access to resources within a blob storage account without requiring authorization.
- 113:00 - 116:30: Course Conclusion The chapter concludes the course by discussing the encryption of data in an Azure Blob Storage account. It highlights that data is encrypted at rest automatically with 256-bit AES, which complies with FIPS 140-2 standards. This encryption is a default feature that cannot be disabled and requires no manual intervention or configuration. Additionally, the option to use a customer-provided key for encryption is mentioned for scenarios where security policies require it.
#1 How to Pass Exam DP-203 Azure Data Engineer Associate in 18 hours | Part 01 Storage Accounts Transcription
- 00:00 - 00:30 in this video we're going to talk about azure blob storage so microsoft azure blob storage is the service available through the azure cloud platform for object storage blob storage is ideally suited to storing unstructured data and does not have specific requirements for data organization or type blob storage can function in a multitude of ways by design it is accessible natively over http
- 00:30 - 01:00 and https with the addition of nfs version 3.0 support being accessible over http means blob storage can be used to serve files directly to the browser or to a media stream static websites can be hosted or simple file transfers can be performed now any file or blob can be hosted in a blob storage account so they can be used for backup archiving logging files
- 01:00 - 01:30 file sharing or data analysis by data services like azure analysis services so using azure blob storage provides many benefits over a traditional file server the service offers high availability options with multiple replicas of all data being stored within the same data center within the same availability zone or replicated across regions data stored within blob storage
- 01:30 - 02:00 is encrypted at rest and can be configured to be encrypted in transit the service is scalable to store large quantities of data with up to five petabytes per storage account available on demand and up to 190 petabytes available upon request the service is fully managed by azure requiring no additional intervention from users client access libraries are provided by microsoft for most popular programming
- 02:00 - 02:30 languages to integrate client access and controls to the storage accounts into your applications libraries are provided for .net java node.js python go ruby and php now blob storage is broken up into three components and understanding them is critical to a well-designed storage infrastructure the storage account is the unique namespace that your data will be accessed through the unique account name
- 02:30 - 03:00 is combined with the azure blob storage endpoint to create the addressing for the data the container is an organizational unit for blobs it is a relational and functions in the same way as a folder in a file system there is no limit to the number of containers present within a storage account azure blob storage supports three types of blob so you have a block blob which stores text and binary data and you have append blobs
- 03:00 - 03:30 which store block data as well however they can accept append operations to the end of the blob making them suitable for operations like continuously writing logs and the third are page blobs which are used to store random access files like virtual machine disks or databases now access to data stored in blob storage accounts requires authorization so this can be provided through
- 03:30 - 04:00 multiple supported methods so role-based access controls can be enforced using azure active directory as an authentication mechanism shared access signatures or sas provide a token that provides delegated access to the contents of the storage account a shared access signature is user created and specifies the permissions granted to the user as well as a period of validity before the sas
- 04:00 - 04:30 expires the token can be signed with either a shared key or using the admins active directory credentials now shared keys are used to build a connection string used programmatically by an application to access the contents of the storage account using an authorization header in the request anonymous access can be provided for read access to resources within a blob storage account bypassing the need for authorization
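As a minimal sketch of how the account name, the blob endpoint, and a shared-key connection string fit together, the snippet below parses a connection string and builds the address of a blob from it. The account name `examplestore`, the key value, the container, and the blob path are all made-up placeholders, and the helper names are hypothetical rather than part of any Azure library.

```python
from typing import Dict


def parse_connection_string(conn: str) -> Dict[str, str]:
    """Split a 'Key=Value;Key=Value' connection string into a dict."""
    settings: Dict[str, str] = {}
    for part in conn.split(";"):
        if not part:
            continue
        # Split on the first '=' only: base64 account keys may end in '='.
        key, _, value = part.partition("=")
        settings[key] = value
    return settings


def blob_url(settings: Dict[str, str], container: str, blob_name: str) -> str:
    """Combine account name and blob endpoint into the address of one blob."""
    protocol = settings.get("DefaultEndpointsProtocol", "https")
    suffix = settings.get("EndpointSuffix", "core.windows.net")
    account = settings["AccountName"]
    return f"{protocol}://{account}.blob.{suffix}/{container}/{blob_name}"


# Placeholder values only; a real key must never be hard-coded like this.
EXAMPLE_CONNECTION_STRING = (
    "DefaultEndpointsProtocol=https;AccountName=examplestore;"
    "AccountKey=ZmFrZS1rZXk=;EndpointSuffix=core.windows.net"
)

example_settings = parse_connection_string(EXAMPLE_CONNECTION_STRING)
print(blob_url(example_settings, "logs", "2024/app.log"))
```

The account key in a connection string grants full access to the account, which is why the transcript's other options (role-based access or a time-limited shared access signature) are usually preferred when only delegated access is needed.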
- 04:30 - 05:00 now encrypting the contents of an azure blob storage account can happen in two operations data written to the account is automatically encrypted at rest using 256 bit aes and is fips 140-2 compliant this cannot be disabled and requires no manual intervention or configuration by the user encryption at rest can be performed using a customer provided key if security requirements do not support using the
- 05:00 - 05:30 default platform provided keys now granular storage encryption can occur within the client by including the encryption key using client-side encryption and decryption operations this is limited to whole blobs only and cannot be used piecemeal microsoft provides multiple methods for performing seed loading of large quantities of data into azure blob storage accounts so you can send your own hard drives via
- 05:30 - 06:00 courier to the azure receiving location who will then extract the data and load it into a blob account on your behalf now there are cli tools like azcopy for performing data operations against and within storage accounts they are available for download on both windows and linux platforms microsoft includes the azcopy utility in their azure storage data movement .net library
- 06:00 - 06:30 for interacting with blob storage blobfuse is used within the linux file system to access data in a blob storage account and is delivered as a virtual file system driver now for organizations who already have data in the cloud but are looking to consolidate it or move it into a blob storage account azure data factory provides a platform for data transformation and copying operations and can use shared access signatures account keys
- 06:30 - 07:00 service principals or managed identities for authentication and authorization now finally azure data box is a service offering where microsoft will ship you a secured container with data drives included you can copy your data to the box and ship it back and they will upload the contents to a blob storage account before securely wiping the devices
- 07:00 - 07:30 in this video we're going to talk about blob storage performance and scalability now microsoft azure blob storage is a service for storing any form of unstructured data on highly scalable infrastructure the blob storage service supports two performance tiers of underlying physical hardware the standard tier is the most cost effective option and it's optimized for high capacity it uses spinning disks for the hardware so offers less throughput and slightly higher latency
- 07:30 - 08:00 this makes it suitable for storage options for backup data sets less used for file sharing batch data processing and media sharing the premium tier uses solid state drives for the hardware offering higher throughput and less latency than the standard tier so the premium tier should be used for workloads that will basically be read many times or have a high rate of small transactions that can benefit from the improved response times so examples of workloads that would gain
- 08:00 - 08:30 from using the premium tier are things like data streams interactive workloads high performance computing and data transformation processes so access tiers are available on blob storage accounts regardless of whether they are standard or premium tier and offer cost-effective methods for storing data based on its frequency of access so the hot tier is used for storing data that is frequently accessed or written to so this tier is set at the
- 08:30 - 09:00 storage account level and has the highest storage cost now the cool storage tier is used for data that is not frequently accessed so it offers a reduced storage cost but higher cost for data access and is set at the account level so example workloads are short-term backups or archives and accumulated data sets for batch processing later the archive tier is set at the blob
- 09:00 - 09:30 level and offers the lowest storage cost but the highest access cost and data is not immediately available it is used to store data that is infrequently or never accessed like backups data in the archived tier requires rehydrating prior to access so may be unavailable for hours and there's a cost penalty for removing it within 180 days of storage now policies can be applied to the storage account to manage data
- 09:30 - 10:00 the life cycle management policy is a rules-based policy that sets governance over the data residing within a storage account so the rules are applied daily and execute at the account level the container level or against a subset of blobs defined through name prefixes or index tags the policy manages the full life cycle of the data so they can identify aged blobs and delete them
- 10:00 - 10:30 version blobs to mark them as archival when their access time has basically met its threshold or take a snapshot of them defined by their stage in the life cycle so life cycle management can also automate data moving through the access tiers data in the hot tier can be automatically moved to cool if it has not been accessed in a defined period or can be moved from cool to hot if it has been accessed data can also be marked as archival and
- 10:30 - 11:00 moved into the archived tier for access requirements that have been met there so data access performance can be enhanced through efficient data partitioning partitioning is performed automatically based on the name of the blobs so the partitioning scheme in blob storage uses a range-based approach to load balance data and try to limit hot spots as well as to scale the data up without reducing performance
- 11:00 - 11:30 the partition key is based on the full blob name which is constructed from the account name the container name and the blob name so to improve performance of load balancing operations which balances the partition key range you should follow the best practices for blob and block sizes so for data stored in the standard performance tier accounts try to ensure blobs are larger than 4 megabytes to force the system to use high
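The lifecycle rules described above (moving aged data to cool, then to archive, then deleting it) could be written as a policy document along the following lines. This is a sketch only: the rule name, the `logs/app` prefix, and the day thresholds are illustrative assumptions, and the helper `build_lifecycle_policy` is a hypothetical name. The JSON layout follows the lifecycle management policy format used by Azure Storage as I understand it, so check it against the current documentation before applying it.

```python
import json


def build_lifecycle_policy(rule_name: str, prefix: str, cool_after_days: int,
                           archive_after_days: int, delete_after_days: int) -> dict:
    """Build one lifecycle rule that ages block blobs through the tiers."""
    return {
        "rules": [
            {
                "enabled": True,
                "name": rule_name,
                "type": "Lifecycle",
                "definition": {
                    # Limit the rule to a subset of blobs by name prefix.
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": [prefix],
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": cool_after_days
                            },
                            "tierToArchive": {
                                "daysAfterModificationGreaterThan": archive_after_days
                            },
                            "delete": {
                                "daysAfterModificationGreaterThan": delete_after_days
                            },
                        }
                    },
                },
            }
        ]
    }


# Hypothetical thresholds: cool after 30 days, archive after 90, delete after 365.
example_policy = build_lifecycle_policy("ageoutapplogs", "logs/app", 30, 90, 365)
print(json.dumps(example_policy, indent=2))
```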
- 11:30 - 12:00 throughput block blobs for data in the premium performance tier try to keep the block size above 256 kilobytes now appropriate naming conventions will reduce latency as the partitions are read and balanced as well so microsoft recommends prefixing the name of either the account the container or the blobs with a three-digit hash and try to avoid append only if you're implementing time stamping or numerical identifiers on
- 12:00 - 12:30 your data so blob storage accounts support basically three types of blob and the type is specified on creation and cannot be changed afterwards so block blobs are best used for large amounts of text or binary data the blob is comprised of multiple blocks that are assigned a unique id that can be managed individually append blobs are optimized for continually appending data
- 12:30 - 13:00 they function very well in situations like constant log writing where a new block is only written to the end of the blob append blobs cannot update or delete the already written blocks and a block can be set up to 4 megabytes in size now page blobs are used for random access storage blocks are 512 bytes in size and when the blob is written to pages it can be modified from a single page up to four megabytes in individual page increments
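The naming advice above, prefixing names with a three-digit hash so that sequential names spread across partition ranges, might look like the sketch below. The choice of SHA-256, the use of hexadecimal digits, and the `hashed_blob_name` helper are assumptions made for illustration; the transcript does not say which hash to use.

```python
import hashlib


def hashed_blob_name(blob_name: str, digits: int = 3) -> str:
    """Prefix a blob name with the first few hex digits of its hash.

    Timestamped or numbered names sort next to each other and can land in
    the same partition range; a short hash prefix scatters them instead.
    """
    digest = hashlib.sha256(blob_name.encode("utf-8")).hexdigest()
    return f"{digest[:digits]}-{blob_name}"


# Sequential, timestamp-style names (hypothetical examples).
for name in ("2024-01-01T00-00.log", "2024-01-01T00-01.log", "2024-01-01T00-02.log"):
    print(hashed_blob_name(name))
```

The prefix is derived from the name itself, so a reader who knows the original name can always recompute where the blob lives.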
- 13:00 - 13:30 so latency is the time taken to deliver information to and from a blob in a single request in storage operations the latency is related to the number of input output operations per second also called the request rate so to determine the throughput required for your application to perform within acceptable tolerances multiply the request rate by the request size this helps to determine how much network bandwidth is required between
- 13:30 - 14:00 the application and the storage account azure portal provides two metrics for latency in the storage account so we have end to end latency which is the time from when storage receives the request from the client to when the client acknowledgement is received after it has received the last packet from the storage account and server latency is the time taken when azure storage received the final packet in the request until the response is returned from storage to the client
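The throughput rule from this part (request rate multiplied by request size) is simple enough to work through directly. The figures below, 500 requests per second at 64 KiB each, are made-up inputs for illustration.

```python
def required_throughput(requests_per_second: float, request_size_bytes: int) -> float:
    """Return the throughput in bytes per second: request rate x request size."""
    return requests_per_second * request_size_bytes


# Hypothetical workload: 500 requests per second, 64 KiB per request.
bytes_per_second = required_throughput(500, 64 * 1024)

print(f"{bytes_per_second / 1024**2:.2f} MiB/s")   # 31.25 MiB/s
print(f"{bytes_per_second * 8 / 1e6:.1f} Mbit/s")  # 262.1 Mbit/s
```

The second figure is the one to compare against the network bandwidth available between the application and the storage account.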
- 14:00 - 14:30 in this video we're going to discuss azure storage geo-redundancy now replicating objects between blob storage accounts in azure can be used for achieving multiple goals if the organization is serving data to geographically dispersed clients the data can be replicated between storage accounts to be closer to them to reduce latency or data can be replicated to a storage account that is only used for archiving
- 14:30 - 15:00 or from an archive to a cool or hot tier after rehydrating it so object replication is applied through a policy that defines the rules to apply the rules specify both the source and the destination storage accounts the source and destination container and the blobs to be replicated object replications are performed asynchronously and do not come with an sla for replication completion so source and destination can be
- 15:00 - 15:30 frequently out of sync blob versioning must be enabled on both storage accounts and triggers the replication process when a blob is modified when blobs are deleted only the current version is deleted previous versions will remain now there are constraints and limitations imposed on the object replication process so things like the following snapshots on data are not supported blobs in the archived
- 15:30 - 16:00 tier cannot be replicated blobs can be replicated from hot to cool and cool to hot tiers and immutable blobs cannot be replicated so data in azure storage accounts is always replicated for resiliency but there are optional tiers for replication to expand on that base replication for additional cost so regional
- 16:00 - 16:30 replication is not enabled by default but will replicate data from the source region to another region within the azure platform a region is a set of data centers connected locally but geographically distant from another azure region availability zones are independent data centers within an azure region while they are connected to the other data centers within the region they have their own cooling power and
- 16:30 - 17:00 networking to ensure continuity if one becomes unavailable data replicated to an availability zone will fail over automatically if there is an outage however data replicated to a secondary region will require a manual failover if there is an outage replications between regions happen asynchronously so there is a possibility that data in the secondary region is not up to date with the primary
- 17:00 - 17:30 region at the time of an outage the last sync time is a property attached to the blobs within a storage account that has replication enabled that basically has a date stamp with the last successful replication to the secondary region so this can be referenced to determine which is the most up-to-date information when services are restored and a fail back is being considered locally redundant storage is
- 17:30 - 18:00 automatically enabled for all storage accounts so it replicates the data three times within a single data center in the primary region which means that data is replicated across different storage arrays in different aisles to keep a single component failure from causing an outage locally redundant storage offers eleven nines of durability durability is a measure of the likelihood of loss of data through corruption or component
- 18:00 - 18:30 failure but is not reflective of availability so the data in a locally redundant storage configuration is written synchronously to all three locations so a component or rack failure will not affect data consistency now zone redundant storage expands on the locally redundant storage pattern so data is replicated three times across
- 18:30 - 19:00 three availability zones within the region recall that an availability zone is an independent data center with its own power supply cooling equipment and networking so zone redundant storage options offer twelve nines of durability over the course of a year if a zone fails azure will automatically fail over to the next zone re-pointing dns entries your application should support retries and back off
- 19:00 - 19:30 as the failover process is happening to enable resuming operations after the process is complete zone redundant storage is not available in all regions microsoft publishes a list of currently supported regions that should be referenced before an account is created archive tier is not supported in zone redundant storage only the hot and cool tiers can be replicated geo-redundant storage is a service for
- 19:30 - 20:00 applications that require high availability or are globally distributed data is written in the standard locally redundant form writing to three different locations within a single data center then an asynchronous process replicates the data from one of these locally redundant copies to a secondary region the data that lands in the secondary region is written locally three times the net
- 20:00 - 20:30 result is locally redundant storage in two different regions geo-zone redundant storage is an extension of the geo-redundant storage and offers the highest level of data redundancy on azure storage so data is written as zone redundant in the primary region spanning three independent data centers the data is then replicated to a secondary region where it is written as locally redundant in the zone it lands in
- 20:30 - 21:00 the result is six copies of the data across four different data centers in two different regions this method of storage offers 16 nines of durability this is somewhere in the region of 10 bytes of data lost to corruption or other factors per exabyte of data per year this method of storage is only available on general purpose version 2 accounts and is limited in the regions that currently support it
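To compare the redundancy options covered in this part, the sketch below records the copy counts and durability figures quoted in the transcript and turns a number of nines into an expected yearly loss for a given number of stored objects. Durability is treated here as a per-object annual figure, which is an assumption of this sketch. The 16 nines for plain geo-redundant storage comes from Azure's published figures rather than from the transcript, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class RedundancyOption:
    name: str
    copies: int   # total copies of the data
    regions: int  # regions those copies are spread across
    nines: int    # durability expressed as a count of nines


OPTIONS = {
    "LRS": RedundancyOption("locally redundant", 3, 1, 11),
    "ZRS": RedundancyOption("zone redundant", 3, 1, 12),
    # 16 nines for GRS is taken from Azure's documentation, not the transcript.
    "GRS": RedundancyOption("geo-redundant", 6, 2, 16),
    "GZRS": RedundancyOption("geo-zone redundant", 6, 2, 16),
}


def annual_loss_probability(nines: int) -> Fraction:
    """Eleven nines of durability means a 1 in 10**11 chance of loss per year."""
    return Fraction(1, 10 ** nines)


def expected_objects_lost(option: RedundancyOption, object_count: int) -> float:
    """Expected number of objects lost in a year for a given object count."""
    return float(object_count * annual_loss_probability(option.nines))


for key, option in OPTIONS.items():
    lost = expected_objects_lost(option, 10 ** 12)
    print(f"{key}: {option.copies} copies, {option.regions} region(s), "
          f"~{lost:g} of 10^12 objects lost per year")
```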
- 21:00 - 21:30 in this video we're going to discuss storage account disaster recovery now ensuring service continuity and data loss prevention is a concern of all businesses using cloud services like azure blob storage can help to mitigate the potential for data loss through a combination of functions built into the service as well as incorporating design patterns and applications that utilize the storage
- 21:30 - 22:00 azure blob storage provides high availability high durability storage for text and media blobs it offers multiple options for replication to ensure the data resides in multiple places so geo-redundant data storage replicates the data three times within the same data center but on different storage in the primary region it is written to then it replicates the data out to a secondary region so geo-zone redundant storage takes this
- 22:00 - 22:30 concept a step further replicating the data across three availability zones which are individual independent data centers in the primary region before replicating it out to a secondary region applications utilizing geo-redundancy for continued operations can use azure geo-redundant storage for their data storage however this carries with it some design concerns
- 22:30 - 23:00 so data in the secondary region of a geo-redundant storage deployment is in a read-only state until a failover is initiated this ensures temporary outages do not result in split data situations and applications can be designed to take advantage of this by flipping to a read-only mode to continue serving the data without compromising consistency so data is replicated asynchronously
- 23:00 - 23:30 between regions so caching and data fetches may be affected and should be accounted for the last sync time attribute is written to blobs and indicates when the last replication took place this will help to ensure data consistency when services resume after an outage the storage client library provides functions for redirecting requests to the secondary region automatically if access to the primary region times out
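The redirect-on-timeout behaviour described above can be illustrated with a hand-rolled sketch. This is not the storage client library's own API: `read_with_fallback` and the injected `fetch` callable are hypothetical, and the sketch assumes read access to the secondary region is enabled on the account. The `-secondary` suffix on the account name is how Azure names the secondary blob endpoint.

```python
from typing import Callable, Tuple


def blob_endpoints(account: str) -> Tuple[str, str]:
    """Return the primary and secondary blob endpoints for an account."""
    primary = f"https://{account}.blob.core.windows.net"
    secondary = f"https://{account}-secondary.blob.core.windows.net"
    return primary, secondary


def read_with_fallback(account: str, path: str,
                       fetch: Callable[[str], bytes]) -> Tuple[bytes, str]:
    """Read from the primary region, falling back to the secondary on timeout.

    `fetch` is any function that takes a URL and returns the body, raising
    TimeoutError when the endpoint does not answer in time.
    """
    primary, secondary = blob_endpoints(account)
    try:
        return fetch(f"{primary}/{path}"), "primary"
    except TimeoutError:
        # The secondary is read-only and may lag behind the primary,
        # so callers should treat this data as possibly stale.
        return fetch(f"{secondary}/{path}"), "secondary"
```

Returning which region served the read lets the application switch to a read-only mode, as suggested earlier in this part, instead of silently treating stale data as current.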
- 23:30 - 24:00 initiating an account failover repoints the dns entries to the secondary region and provides write access to the data so should be used with caution especially if the app is built to automatically switch to write only with data in the secondary region failing back can require manual data merging and cause consistency issues the circuit breaker pattern is a design
- 24:00 - 24:30 pattern used to prevent applications from continuously repeating operations that are failing and then providing a mechanism for automatic detection of resolution and resuming operations so when the application is in a closed state the request is routed normally to the operation the system has a proxy function that maintains records of recent failures if a threshold is exceeded by the number of failures then the application moves the proxy to
- 24:30 - 25:00 an open state which initiates a timer now when the proxy is in an open state the request will fail and the proxy will return an exception to the application the proxy can assume a half open state which is initiated after the timer from the open state expires and which attempts to resume a closed state so this half open state will restrict the number of requests that can be passed to the service
- 25:00 - 25:30 that went down to protect it from getting accidentally overwhelmed and triggering another failure so there are some items to consider when the developers are looking to implement the circuit breaker pattern so when the proxy enters the open state and starts failing requests it will automatically send exceptions back to the application which must be architected to respond to the exception the application may stop specific
- 25:30 - 26:00 operations to that resource or perform alternate operations to continue functioning without complete failure the application should log all failed requests and the exceptions thrown for both ongoing health monitoring and post failure analysis the application can be built to perform its own testing against the failed service to determine its availability rather than using a timer before attempting to resume a closed state this could be pinging the service to
- 26:00 - 26:30 determine if it is up or something more advanced a manual override or reset option could be made available to application administrators to reset the failure and continue operating when they are made aware of the service becoming available again so failover for a geo-replicated azure blob storage account is a manual process initiated by the customer by default data is only ever written to the primary region
- 26:30 - 27:00 by your application and all replications to the secondary region are read-only so initiating the failover process updates the dns entries in azure to point the primary domain name to the secondary storage account and resumes write access to the secondary data now along with the dns update and enabling writing the failover process will configure the storage accounts to be locally redundant accounts discontinuing geo-replication processing
- 27:00 - 27:30 the customer must re-enable geo-redundancy after a failover event has occurred now geo replication storage accounts will replicate archive blobs however they will require rehydrating prior to re-enabling geo-redundancy storage account failovers are not supported by azure file sync enabling the azure data lake generation 2 hierarchical namespace
- 27:30 - 28:00 will disable geo-replication capabilities premium and immutable block blobs do not support geo-redundancy failover operations for situations where performing the failover operation is not desirable copying data out of the secondary region into a third region for application access can help to remove the overhead of having to re-enable geo-redundancy and fail back data to the primary region
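The circuit breaker pattern described earlier in this part, with its closed, open, and half open states, can be sketched in a few lines. This is a minimal single-threaded illustration, not a production implementation: the class name, the default threshold and timeout, and the injectable `clock` are all assumptions, and the half open state here simply lets one trial request through.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the operation."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self.state = "closed"

    def call(self, operation: Callable[[], object]) -> object:
        if self.state == "open":
            if self._clock() - self._opened_at >= self.reset_timeout:
                # Timer expired: allow a trial request through.
                self.state = "half_open"
            else:
                raise CircuitOpenError("circuit is open; request not attempted")
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial in half open, or too many failures while closed,
        # opens the circuit and restarts the timer.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

An application would wrap each storage request in `breaker.call(...)` and handle `CircuitOpenError` by switching to an alternate operation, logging the rejection for the health monitoring the transcript recommends.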
- 28:00 - 28:30 in this video we're going to discuss azure data lake storage generation 2. so azure data lake generation 2 is a data lake storage solution for centralizing the storage of structured and unstructured data now microsoft has built azure data lake storage gen 2 on the azure storage platform so it benefits from most of the capabilities of azure storage
- 28:30 - 29:00 without requiring additional configuration it provides a hierarchical namespace on top of the blob storage this namespace is used to order blobs in a folder hierarchy to maximize efficient data operations and enforce atomic operations against the directory accessing the data lake instance can be done through the hadoop distributed file system command line interface or for short hdfs cli
- 29:00 - 29:30 and through remote services like ambari web hcat and ssh alternatively you can use the azure data lake storage generation 2 rest apis to interact with the file system the interface provides operations for creating and deleting file system folders and files the most powerful method for interacting with the data lake is using the azure blob file system driver also called abfs
- 29:30 - 30:00 which is a dedicated driver for hadoop running on azure storage making it compatible across data lake gen 2 azure hdinsight azure databricks and azure synapse analytics the driver presents the hierarchical file system view and enables a user to perform any action on the data as well as acting as a source or destination for mapreduce apache hive
- 30:00 - 30:30 and apache spark the hierarchical namespace is an overlay that provides various features on top of the file system by implementing directories the namespace can perform functions against subsets of data without having to walk through directories to get there this also enables atomic operations where everything succeeds or everything fails without
- 30:30 - 31:00 having to first process all the data contained in the data lake so access control lists for managing user authentication and authorization against the data can be implemented at the directory and file level throttling and bottleneck management can be performed by the abfs driver to remove as many errors as possible ensuring the maximum throughput of valid data the namespace improves performance through the way
- 31:00 - 31:30 data is read and manipulated but also through removing the need for data transformation operations now including the hierarchical namespace opens the door for various performance enhancements so although azure data lake gen 2 is built on object storage and object storage traditionally processes folders as a virtual entity referenced by a uri the hierarchical
- 31:30 - 32:00 namespace offers a true folder hierarchy with distinct folders this translates to various changes in how data is handled so renaming and relocating files within directories is an operation against metadata so it is only performed once compared to traditional object stores that would copy and then delete the original copy requiring two operations for every move queries can use
- 32:00 - 32:30 partition scans to prune the data through a predicate pushdown to significantly reduce the amount of data that is processed operations are atomic meaning they either succeed completely or fail completely the system will never get stuck in a partially implemented state or result in hung operations now there are best practices for working with azure data lake gen 2 that should be followed to ensure
- 32:30 - 33:00 optimal performance so use azure active directory security groups rather than individual users for setting permissions within the data lake there is a maximum of 32 entries per access control list or acl so this helps ensure that you do not exceed the threshold and will significantly reduce the number of operations performed against the data when adding or removing user access the azure provided firewall can be
- 33:00 - 33:30 enabled on storage accounts providing security against external activity hadoop distributed copy is a hadoop tool for moving data between locations using mapreduce jobs against the cluster it is optimized for hadoop clusters so should be the tool of choice for moving big data over network connections apache oozie is a scheduling system that can initiate copy jobs on a timer or when certain data
- 33:30 - 34:00 operations have occurred azure data factory can also schedule jobs but has limitations to its throughput for large data security and processing speeds rely heavily on the data layout when it is in the data lake so partitioning and directory structures should be planned in advance of data ingestion now controlling access to the data lake and the data within it can be managed through multiple
- 34:00 - 34:30 mechanisms a shared key is a programmatic administrative key to the resource used within your application to perform any operation against the azure storage layer and the contents of it a shared access signature is a uri generated manually and provided to partners to delegate restricted time controlled rights to the storage account role-based access controls provide high-level user authentication
- 34:30 - 35:00 and authorization to the storage account using azure active directory identities while access control lists provide granular permissions to files and folders within the storage account now the azure platform offers native tools for securing the azure data lake gen 2 instances azure defender is a security intelligence tool that monitors for unusual behaviors and access attempts
- 35:00 - 35:30 against the storage accounts azure storage encryption encrypts data in the cloud using 256-bit aes and is on by default requiring no user intervention private endpoints are a networking feature that securely connects your clients on a vnet to the storage account using a dedicated secured link rather than going over the internet
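To illustrate why security groups are recommended over individual users, here is a small sketch that models an ACL as a set of entries and enforces the 32-entry limit quoted in this video. The `Acl` class and the principal names are hypothetical; a real deployment would set ACLs through the portal, a CLI, or an SDK.

```python
from dataclasses import dataclass, field

MAX_ACL_ENTRIES = 32  # limit quoted in the transcript


@dataclass
class Acl:
    """Toy model of a directory ACL: principal name -> permission string."""
    entries: dict = field(default_factory=dict)

    def grant(self, principal: str, permissions: str) -> None:
        if principal not in self.entries and len(self.entries) >= MAX_ACL_ENTRIES:
            raise ValueError("ACL entry limit reached")
        self.entries[principal] = permissions


def can_read(acl: Acl, user: str, groups: dict) -> bool:
    """A user may read if they, or any group they belong to, hold 'r'."""
    principals = [user] + [g for g, members in groups.items() if user in members]
    return any("r" in acl.entries.get(p, "") for p in principals)


# One group entry covers any number of users; membership changes
# happen in the directory service and never touch the ACL itself.
groups = {"analysts": {f"user{i}" for i in range(100)}}
acl = Acl()
acl.grant("analysts", "r-x")
```

Here a hundred users gain read access through a single ACL entry, whereas granting them individually would hit the limit at the thirty-third user.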
- 35:30 - 36:00 in this video we're going to discuss big data processing with data lake azure data lake storage gen 2 is a data store for centralizing the storage of both structured and unstructured data it is primarily a landing zone for big data operations and analysis to take place all architectures in a big data solution follow the same four principles in terms of processing stages
- 36:00 - 36:30 step one ingesting data is the process of acquiring it from the source data in big data architectures may come from sensors application logs generated files or a multitude of other sources two data that is ingested from multiple sources requires a centrally accessible storage platform the data lake offers storage capabilities for mixed data types
- 36:30 - 37:00 from many sources three data preparation is the process of removing bad data transforming the data to a usable state and pulling samples into usable blocks this is also the stage where training models for data science efforts would take place and fourth the purpose of the data analysis solution is to determine the answer to questions
- 37:00 - 37:30 posed by the business the presentation layer abstracts the answers from the querying process and generates reports and visualizations for businesses to consume a big data architecture follows the same principles but can vary by organization and which tools are used so data is processed in two ways either in a batch or in a stream a batch will queue up a quantity of data
- 37:30 - 38:00 and then execute on it a stream processes the data as it is generated in near real time data is collected from various sources and is either stored in a data lake to be prepared for batch or pushed through it for streaming in either instance the data once processed is either stored in a new data store for analytical processes to execute against or it is passed directly to the
- 38:00 - 38:30 presentation layer where reports and visualizations can be created all big data architectures share a common set of components these components can be distributed or tightly integrated the source is where the data is being generated like application logs and data stores or databases or real-time streams like hardware sensors and data storage requires a solution
- 38:30 - 39:00 capable of ingesting large amounts of high velocity data in multiple formats data lakes are more frequently deployed for big data architectures now batch processing operations must filter and aggregate data to make it common for use in reporting this processing involves an application that can read analyze and write data in large quantities hadoop hdinsight and azure data lake
- 39:00 - 39:30 analytics can fill this role once data has been processed it is moved into the analytical data store for structured data this is frequently a data warehouse like azure synapse analytics or a no sql service like azure cosmos db for unstructured data this can be a blob storage account reporting services are used to define and visualize the outcomes derived from
- 39:30 - 40:00 the data visualization tool sets like power bi can be leveraged directly from azure or standalone technologies can also be used notebooks using python can be used externally as well now utilizing the big data architectures described in this video on a service like azure can bring many benefits microsoft have introduced an assortment of different technologies onto the azure platform ranging from
- 40:00 - 40:30 microsoft built offerings to microsoft supported open source hosted solutions like databricks this offers a lot of flexibility and choice for ensuring data can come from any source in any type now using a cloud platform offers elasticity and growth opportunities for ensuring resourcing is there when it is required rather than being limited to the available hardware of an on-prem solution performance can be scaled up through
- 40:30 - 41:00 parallelism and resourcing available through the infrastructure of the cloud companies with existing solutions can leverage their investments to still gain the benefits of the big data architecture without having to retool or migrate their existing landscape along with the benefits that come through the big data architectures there are still challenges that organizations may face so big data solutions by their nature are dispersed
- 41:00 - 41:30 landscapes and can be complex with multiple data ingestion or storage or processing and visualization efforts taking place simultaneously performance and application modifications may have unintended consequences and can cause backlogs and queuing for resources many of the tools require specialized skills and knowledge making resourcing them difficult for management big data is still an emerging field and
- 41:30 - 42:00 tools are continuing to be added and removed as their value peaks and drops expecting change in the big data landscape should be part of the organizational culture so organizations looking to implement the big data architecture should adhere to some best practices so distribute the workload by leveraging parallelism distributed file systems and splittable file formats can help to increase the performance of
- 42:00 - 42:30 data operations planning data partitions and structures ahead of adoption can return significant performance gains and failures become easier to troubleshoot the schema can be built using schema on write or schema on read so schema on read adds flexibility to the architecture and improves performance by reducing the bottlenecks caused by checks and validation operations on write
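As a rough illustration of schema on read, the sketch below keeps raw JSON lines untouched and applies a schema only when the data is queried. The field names, the sample records and the `read_with_schema` helper are invented for the example.

```python
import json

# Raw records land in the lake exactly as they arrive; nothing is validated on write.
RAW_LINES = [
    '{"device": "a1", "temp": "21.5", "ts": "2024-01-01T00:00:00"}',
    '{"device": "b2", "temp": "bad", "ts": "2024-01-01T00:01:00"}',
    '{"device": "c3", "temp": "19.0", "extra": "ignored"}',
]

# The schema lives with the reader: field name -> converter.
SCHEMA = {"device": str, "temp": float}


def read_with_schema(lines, schema):
    """Yield rows that fit the schema; skip rows that cannot be converted."""
    for line in lines:
        record = json.loads(line)
        try:
            yield {name: convert(record[name]) for name, convert in schema.items()}
        except (KeyError, ValueError):
            continue  # the bad row stays in the raw data for later inspection


rows = list(read_with_schema(RAW_LINES, SCHEMA))
```

A different consumer could read the same raw lines with a different `SCHEMA` without anything being rewritten, which is the flexibility the video refers to.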
- 42:30 - 43:00 historically business intelligence operations required etl processes as they moved data into the data warehouse big data operations against a data lake can perform transformation in place before moving to the analytical data store accelerating processing and enabling flexibility in the operations in this video we're going to discuss
- 43:00 - 43:30 planning a data lake now when it comes to planning and building a data lake deployment in the cloud there are various architectural and design decisions that should be considered so planning an enterprise data lake due to the potential scale may require the ability to sustain high throughput and store significant quantities of data so platform quotas may impact the ability to meet requirements
- 43:30 - 44:00 where storage limits may be reached or bandwidth costs push the cost out of range distributed organizations with data collection happening in multiple geographic locations may require different regulatory compliance in different data zones likewise organizations with data collection from different regions may have multiple data lakes deployed across the globe to reduce latency but have a requirement for a central
- 44:00 - 44:30 data collection for company-wide statistics and business intelligence depending on how the business is structured there may be multiple data lakes operating independently of each other based on business unit requirements and funding although zoning data lakes is part of the process sometimes it is optimal to deploy multiple lakes to keep zones or stages of data modification independent of each other
- 44:30 - 45:00 so zoning data lakes is the process of separating data into different states each state is a transformation of the data not of the contents it represents the processes taken to make the data usable through removing errors and taking only relevant data to the next stage the model proposed here is just an example and each organization may have different requirements for their data transformation
- 45:00 - 45:30 so the raw zone is the landing area for data arriving from a source it is unedited unfiltered and in its raw state data should be immutable and made available only as read-only a data life cycle management system should be implemented to prune old data to an archive when it exceeds its useful life the cleanse zone performs data cleanup exercises like removing irrelevant columns data
- 45:30 - 46:00 validation and data is stored into relevant business areas instead of by application the curated zone is where data is staged for consumption by reporting and business intelligence applications data is dimensionally modeled and presented through a self-serve portal but it should be noted that a data lake typically is not a store for dynamic visualizations or interactive reporting
- 46:00 - 46:30 data for these operations should be extracted and stored in a data warehouse or other faster data stores the exploratory zone offers a testing area for data scientists and analysts to perform operations without fear of affecting production and for testing new models and concepts so the performance of a data lake can be heavily affected by the folder hierarchy so it's
- 46:30 - 47:00 crucial to ensure this is properly developed the naming convention for the folders and the files inside them should be human readable consistent throughout and should be able to be deciphered by unfamiliar users who are new to the environment permissions should be applied at a level that is granular enough to secure the data but without being so low level that it affects performance whenever a change is
- 47:00 - 47:30 made partitioning strategies should be designed to reduce hot spots within the data while still serving sufficient file sizes high cardinality keys can lead to oversized partitions and unusable file sizes now data within a folder should follow the same schema and be of the same format now access to data lakes should be through a multi-layered approach
- 47:30 - 48:00 combining management controls to manage permissions to data so role-based access control provides coarse-grained permissions to the data lake or to folders inside it these are used to allow or deny permissions to the folder structure but typically do not dictate the ability of the user to perform actions against the data access control lists are used to define the fine-grained permissions to the data this is where the ability of the user to
- 48:00 - 48:30 read write modify or delete the data is set security groups are preferred over directly adding and removing user principals to the data lake contents this will reduce the overhead operations when a user is added or removed and provides a single source for management with larger quantities of data come larger impacts on the speed of the query data scientists must optimize the data
- 48:30 - 49:00 to meet the query requirements and to achieve optimization there are many different file formats that can provide benefits and drawbacks to different situations some of the popular ones are parquet orc and avro so parquet is an open source storage format for hadoop produced by the apache foundation it is column oriented and optimized for bulk complex data and supports data compression and multiple encodings optimized row columnar
- 49:00 - 49:30 or orc converts row data into columns and stores multiple rows per file bringing with it the ability to perform parallel processing across the cluster orc has some of the highest compressibility available through data and column skipping avro uses json to define data protocols and types it is a data serialization framework that serializes data
- 49:30 - 50:00 in a binary fashion rather than a text file and uses remote procedure calls for data movement good governance and planning will add significant value to the organization's growth and the data lake's usability in the future a data catalog should be created which contains the metadata for sorting and cataloging the data along with management tools and a search function for analysts to find the data sets they require
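The benefit of a column-oriented layout such as parquet or orc, mentioned a little earlier, can be sketched without any library: pivoting rows into columns lets a query touch only the columns it needs. This is a conceptual toy with made-up data, not how either format is actually encoded on disk.

```python
rows = [
    {"id": 1, "region": "east", "amount": 10.0},
    {"id": 2, "region": "west", "amount": 7.5},
    {"id": 3, "region": "east", "amount": 2.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is read even though one field is needed.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: only the "amount" column is read; "id" and "region" are skipped.
total_from_columns = sum(columns["amount"])
```

Storing similar values next to each other is also what tends to make columnar files compress well.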
- 50:00 - 50:30 data quality is a standard the organization can apply to ensure the data is complete consistent and standardized it also dictates where masking should be applied to secure personally identifying or other sensitive data data compliance covering hipaa as well as gdpr which depends on where the data originates and is stored and pci dss which depends on the type of data being handled are all common concerns
- 50:30 - 51:00 for any organization managing data especially global organizations that have to adhere to multiple regulations these should be accounted for and planned into the design of the data lake self-service tools to control access but allow users to access data without having to go through data engineers for each request will improve relations and accelerate delivery to the business and planning for growth and data life
- 51:00 - 51:30 cycle management will stop concerns before they become issues and the expectation should always be for the environment to scale as the potential growth we have seen in recent years has been an indicator of that [Music] in this video we're going to discuss data lake storage best practices now data lakes are a consolidated repository for an organization's data in either a
- 51:30 - 52:00 structured or unstructured form because of the nature of the lake to be a central repository it must be able to do several things first ingest data from multiple sources the organization's application landscape should store data within the lake regardless of data type structure and velocity the lake should have sufficient throughput and writing speed
- 52:00 - 52:30 to ingest the data from all sources as it is generated the lake must be securely accessible to anybody who needs access with access controls and data segregation in place to limit users to only the data they require and data must be accessible when it is needed and be kept updated to ensure its useful life span life cycle management should be implemented to archive data
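A lifecycle rule of the kind described here boils down to comparing each item's age with a retention period. The sketch below is plain Python with made-up blob names and a hypothetical 90-day threshold; in azure this would normally be configured as a storage lifecycle management policy rather than written by hand.

```python
from datetime import datetime, timedelta


def select_for_archive(blobs, now, max_age_days=90):
    """Return names of blobs whose last modification is older than the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, modified in blobs.items() if modified < cutoff)


now = datetime(2024, 6, 1)
blobs = {
    "raw/2023/12/events.json": datetime(2023, 12, 15),
    "raw/2024/05/events.json": datetime(2024, 5, 20),
}
to_archive = select_for_archive(blobs, now)
```

Only the december file is older than the cutoff, so it is the only one selected for archiving.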
- 52:30 - 53:00 self service portals should be implemented to enable users to access data for modeling and reporting against without requiring data sets data lake architectures are enablers of the microservices application design and should be designed in the same way so processes should be decoupled to avoid cascading failures in the event of one component or pipeline encountering an outage the data store and zones within
- 53:00 - 53:30 the data lake should be agile able to adapt to change and should be delivered in a minimum viable product delivery method so data pipelines in the lake should be built on highly available components removing single points of failure and providing service continuity even when one fails auditing data access and events within the data lake
- 53:30 - 54:00 is critical to both security and to troubleshooting issues so the lake and all data interfaces should provide logging capabilities data storage is the fundamental purpose of a data lake but with many different data pipelines and different tools and users accessing it it will require careful planning to meet all requirements so the storage platform should be scalable and elastic
- 54:00 - 54:30 able to expand on demand while not hitting limits that could cause the lake to stop functioning durability refers to the data's ability to tolerate events causing data loss not an inability to access it which is availability data loss can be caused by failed hardware corrupt data writes and human error the data lake should provide a platform that offers high durability to avoid losing data
- 54:30 - 55:00 now because the data lake will be the repository of company data it must be secure and adhere to any related regulation like hipaa gdpr and pci dss now processing data stored in a data lake can be a complex operation with multiple stages determining how to perform these functions may affect the data lake's structure and deployment
- 55:00 - 55:30 the framework should enable big data processing at high speeds usually using parallel processing against multiple files in a data set the azure data lake gen 2 product is an apache hadoop overlay on an azure storage account with azure databricks providing apache spark functionality to work in concert to provide big data processing at scale tools like azure data factory can be used to perform etl actions against data in a lake
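The idea of processing many files of a data set in parallel can be sketched with Python's standard library. Real workloads would run on spark in azure databricks or a similar engine; here in-memory strings stand in for files, and `count_errors` is an invented per-file task.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for files in one data set; the names and contents are made up.
FILES = {
    "part-0000.log": "ok\nerror\nok\n",
    "part-0001.log": "error\nerror\n",
    "part-0002.log": "ok\n",
}


def count_errors(item):
    """Per-file work: count the lines equal to 'error'."""
    name, text = item
    return name, sum(1 for line in text.splitlines() if line == "error")


# Each file is processed independently, then the partial results are combined.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_file = dict(pool.map(count_errors, FILES.items()))

total_errors = sum(per_file.values())
```

Because no file depends on another, adding workers (or cluster nodes, in a real engine) scales the job without changing the per-file logic.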
- 55:30 - 56:00 the processing compute should be scalable able to add resources like cpu and ram as the data load increases azure databricks offers scalable compute groups to bring resources when they are required some organizations require constant computing so on-prem deployments may make financial sense however for organizations that deploy compute resources
- 56:00 - 56:30 as it is required for operations the operational model delivered through azure databricks may be more cost effective clusters can be deployed on demand and torn down or stopped during office closure times now data stored in the data lake will often be accessed by different consumers each looking to achieve a different goal the service should be designed to enable all consumers which can determine its hierarchical structure and the file formats
- 56:30 - 57:00 so data warehouses are typically stores of structured modeled and curated data the business users will extract data from the lake to insert into the warehouse for querying by other processes azure data factory can be used to perform etl processes from azure data lake gen 2 to azure data warehouse or azure synapse analytics apache hive and other big data querying
- 57:00 - 57:30 tools can be leveraged to execute queries against the data stored in the data lake directly for ad hoc and interactive querying against structured and unstructured data machine learning is used for predictive analysis and forecasting so the data lake users may run mining models against the data artificial intelligence like chat bots and interactive utilities can use the data if it is suitably structured for them to
- 57:30 - 58:00 access and read from now governance in the data lake controls how data is stored and accessed to ensure security integrity and usability a data catalog should be automated maintained and presented from a central point accessible to all frameworks accessing the data lake it contains the metadata and location information for data stored in the lake that analysts can use to determine which data set to query against so data quality should
- 58:00 - 58:30 be assured through controls and checking the quality ensures irrelevant and noisy data is reduced compliance ensures the data meets regulation and organizational policy for data storage and masking practices self-service is going to be utilized by most organizations but there must be controls and auditing in place to monitor its usage for attacks and abuse
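Masking of sensitive data, mentioned under compliance, might look like the following sketch. The field list and the `mask_record` helper are assumptions for illustration; a production system would rely on a vetted masking or tokenization service.

```python
SENSITIVE_FIELDS = {"card_number", "ssn"}  # hypothetical masking policy


def mask_value(value: str) -> str:
    """Replace all but the last four characters with asterisks."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def mask_record(record: dict) -> dict:
    """Return a copy of the record with sensitive fields masked."""
    return {
        key: mask_value(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


masked = mask_record({"name": "Ada", "card_number": "4111111111111111"})
```

The original record is left untouched, so a masked copy can be served to self-service users while the raw data stays behind tighter access controls.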
- 58:30 - 59:00 security should be a priority for all organizational practices but for the data lake acting as a store for all company data it's critical so security should be approached the same whether the data lake is in the cloud or on-prem and should be assessed for all data pipelines data should be encrypted at rest and in transit if the data is especially sensitive encrypting using hardware security modules can be implemented the network feeding
- 59:00 - 59:30 data into and retrieving data from the data lake must be secure for deployments in the cloud there should be a vpn service or a dedicated link like azure express route to ensure data does not travel over the internet access controls should be implemented for managing access and permissions to the data through identity management utilities and policy and because applications feed data into the data lake they must be secured and their data must
- 59:30 - 60:00 be treated as untrusted unable to be executed once it is inside the data lake [Music] in this demo we're going to show you how to create a data lake generation 2 account so i'm now in portal.azure.com logged in and this is where you need to be to do this demo azure data lake gen 2 is an extension for the azure blob storage service
- 60:00 - 60:30 gaining all the benefits from the underlying storage service like replication scaling and throughput while adding a hierarchical file system the hierarchical file system offers more granular security higher performance for data operations and atomic operations deploying the data lake gen 2 product is not a separate sku but instead is performed as part of a storage account instance so what we need to do first is create a new resource so i'm going to go ahead
- 60:30 - 61:00 and click the button link create a new resource now i'm going to search for storage accounts so storage account go ahead and click on that and what we need to do here is select the storage account that is done by microsoft right so there are several here but the primary one already selected is the one i'm looking for at the very top just simple storage account by microsoft go ahead and click on create now i have to do a little bit of
- 61:00 - 61:30 verification here so i need to specify a resource group i'll go ahead and create a new one right now dp203dlg2rg there we go all right and i'm going to just change this before i specify the storage name i just want to change this to east us so that's at the very top of my list okay now storage accounts they have to
- 61:30 - 62:00 be lower case so i'm going to call this dp203dlg2 there we go now when it comes to the account type here we need to keep this general purpose v2 for the data lake for what we want to accomplish here there is right after the location though there's performance and there's standard and premium and it's a radio button so the performance tier determines the type of drives the storage account resides on standard tier uses
- 62:00 - 62:30 traditional spinning disk hard drives and provides higher latency retrieval at a lower cost this tier is fine for mass storage and data archiving the premium tier offers solid-state drives for consistent low latency access at a slightly higher cost any applications or time-sensitive requirements probably should use premium tier note that selecting the premium tier will limit the type of blob that can be used in the storage account
- 62:30 - 63:00 to page blobs now data lake gen 2 can only run as i mentioned on the storage v2 account type i'm just going to put this back to standard here so because the hierarchical data namespace is an overlay of the blob storage all replication types are supported with the data lake gen 2. so for my use i will require zone redundant storage here so let's pick zone redundant storage which is the second in the list
- 63:00 - 63:30 zrs and now i'm in good shape to click on the next networking button so in the networking section you have two configuration items the connectivity method determines if the storage account will be accessible to the internet or kept internal to your organization only the networking routing method will determine if your traffic will go out over the internet or if it can be routed internally within the microsoft network go ahead
- 63:30 - 64:00 and click on data protection because we leave everything default there storage accounts offer various methods of protecting data for different scenarios point in time restores allow you to take a periodic snapshot of your containers to restore the contents quickly to a recent state soft delete is effectively a recycle bin on a timer it allows you to retrieve accidentally deleted files without having to restore them from backup it is available at the blob and file share level
- 64:00 - 64:30 the ability to activate soft delete on containers is in preview at the time of this recording now i'm going to look at the bottom section here under the tracking so versioning can be enabled to track changes to files and revert to previous versions if data is modified and the blob change feed is not currently supported on data lake gen 2 so enabling it will remove our ability to enable the hierarchical namespace
- 64:30 - 65:00 but it provides transaction logging capability to track changes made to blobs and their metadata so that's an overview of that gonna go ahead and move into the next section here under advanced so the advanced tab for storage accounts is where we will find the configuration items for modifying the storage account to perform specialized activities and for additional security items so enabling the secure transfer option will force all incoming connections to
- 65:00 - 65:30 use a secure protocol connections to the rest api will force https and smb 2.1 3.0 and the linux smb client will all require encryption to be enabled so disabling the allow shared key access will disable connections from apps using the shared key or shared access signature methods for authentication the minimum tls level can be set to enable
- 65:30 - 66:00 more compatibility but comes with the risk of less secure protocols in use the infrastructure encryption is in preview at the time of this video and is unavailable unless requested but it adds an additional layer of encryption using different keys and different algorithms so under the blob here we can see we can allow access to public blobs within the account if this is disabled then the ability to set the
- 66:00 - 66:30 acl to anonymous access is also disabled and the default blob tier can be set to either write data to the hot tier which has slightly higher storage costs but lower access costs or to the cool tier which has lower storage costs but higher access costs now this nfs version 3 is also in preview at the time of this recording so it's going to be disabled however when it is available clients will be able to connect directly to the storage account
- 66:30 - 67:00 using the nfs v3 protocol so next section basically handles the types of specialty storage options available to azure storage the storage account can be turned into a data lake gen 2 deployment or an azure file instance now additionally the ability to use customer provided keys for securing tables and queues will be available soon but is also in preview at this time so i'm going to go ahead and click on
- 67:00 - 67:30 enable and the radio button here for the data lake gen 2 storage and then i'm going to go ahead and click on next to move to the tag section and basically tags can be set on the account for tracking and management purposes not needed for here i'm just going to go ahead and click on review and create and do a quick scan here to make sure everything looks good i have the right resource group i'm in the right location i have my storage account name
- 67:30 - 68:00 i have zone redundant storage zrs selected i have my general purpose v2 selected my performance is standard and as i come down here i can see my blob access tier here is hot and my hierarchical namespace which is what this demo is all about is enabled this was when i selected that data lake option it's inferred as the hierarchical namespace in the review section at this time i can now go ahead and click on create and azure
- 68:00 - 68:30 will go ahead and create the storage account for the data lakes here so my deployment is in progress this will just take a moment okay it's now done i can click the go to resource button and quick verification that yes indeed all the options and settings have copied over and it is now set up and ready for use [Music]
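The portal steps in this demo can also be scripted. The sketch below collects the same choices (general purpose v2, standard performance, zone redundant storage, hot access tier, hierarchical namespace enabled) and passes them to the azure sdk for Python. It assumes the `azure-identity` and `azure-mgmt-storage` packages are installed, that the resource group already exists, and that the resource group and account names are spelled as shown; the subscription id is a placeholder you would supply yourself.

```python
import re

SETTINGS = {
    "resource_group": "dp203dlg2rg",   # assumed spelling of the demo's group
    "account_name": "dp203dlg2",       # assumed spelling of the demo's account
    "location": "eastus",
    "sku": "Standard_ZRS",             # standard performance, zone redundant
    "kind": "StorageV2",               # general purpose v2
    "access_tier": "Hot",
    "hierarchical_namespace": True,    # this is what makes it data lake gen 2
}


def is_valid_account_name(name: str) -> bool:
    """Storage account names: 3-24 characters, lowercase letters and digits."""
    return re.fullmatch(r"[a-z0-9]{3,24}", name) is not None


def create_data_lake_account(subscription_id: str, settings: dict = SETTINGS):
    """Create the account; imports are local so the sketch loads without the SDK."""
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient
    from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

    if not is_valid_account_name(settings["account_name"]):
        raise ValueError("invalid storage account name")

    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
    poller = client.storage_accounts.begin_create(
        settings["resource_group"],
        settings["account_name"],
        StorageAccountCreateParameters(
            sku=Sku(name=settings["sku"]),
            kind=settings["kind"],
            location=settings["location"],
            access_tier=settings["access_tier"],
            is_hns_enabled=settings["hierarchical_namespace"],
        ),
    )
    return poller.result()  # blocks until the deployment finishes
```

Calling `create_data_lake_account("<subscription-id>")` would start the deployment; the name check mirrors the lowercase rule mentioned in the demo.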
- 68:30 - 69:00 so in this course we've examined how to plan for an effective deployment for data and big data storage we did this by exploring data storage capabilities of azure blob storage blob storage performance scalability geo-redundancy and disaster recovery we looked at azure data lake storage gen 2 and big data processing and looked at planning and azure data lake storage best practices
- 69:00 - 69:30 and creating a storage account so in the next course we'll move on to examine how to plan a data structure for efficient storage and transactions