
Best Practices to Build a Data Lake

A Data Lake is a vast pool of raw data, both structured and unstructured, that can be processed and analyzed later. Data Lakes eliminate the need to implement traditional database architectures up front.


Presentation Transcript


  1. Best Practices to Build a Data Lake https://fibonalabs.com/

  2. The need for big data is inevitable. Data is the new currency: it is estimated that 90% of the data in the world today was created in the last two years alone, with 2.5 quintillion bytes generated every day. With this volume of data, companies face growing challenges in putting their data to the best possible use, and building a Data Lake is one way to meet them. A Data Lake is a vast pool of raw data, both structured and unstructured, that can be processed and analyzed later. Data Lakes eliminate the need to implement traditional database architectures up front. This blog post discusses the best practices for building a data lake. So, without further ado, let’s get started.

  3. BEST PRACTICES TO BUILD A DATA LAKE 1. REGULATION OF DATA INGESTION Data ingestion is “the flow of data from its origin to data stores such as data lakes, databases and search engines”. As new data is added to the data lake, it is important to preserve it in its native form; doing so lets analyses and predictions be generated with greater accuracy. This includes preserving even the null values, from which skilled data scientists can extract analytical value when needed. WHEN SHOULD WE PERFORM DATA AGGREGATION? Aggregation can be carried out when PII (Personally Identifiable Information) is present in the data source.

  4. The PII can be replaced with a Unique ID before the sources are saved to the data lake. This bridges the gap between protecting user privacy and keeping the data available for analytical purposes, and it also ensures compliance with data regulations such as GDPR, CCPA, and HIPAA. 2. DESIGNING THE RIGHT DATA TRANSFORMATION STRATEGY The main purpose of collecting data in a Data Lake is to perform operations like inspection, exploration, and analysis. If the data is not transformed and cataloged correctly, the workload on the analytical engines increases: they must scan the entire data set across multiple files, which often results in query overheads.
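The PII-to-Unique-ID replacement described above can be sketched with a keyed hash, so the same input always maps to the same pseudonym without being reversible. This is a minimal sketch, not a full tokenization service; the key constant and record fields are hypothetical, and in practice the key would live in a managed secret store.

```python
import hashlib
import hmac

# Hypothetical secret; in production it would come from a key vault,
# never be stored alongside the data lake itself.
PSEUDONYM_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace a PII value with a stable, non-reversible unique ID."""
    return hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Example record: the email is swapped for a unique ID before landing
# in the lake, while the analytical fields stay intact.
record = {"email": "jane@example.com", "purchase_total": 42.5}
safe_record = {**record, "email": pseudonymize(record["email"])}
```

Because the mapping is deterministic, the same customer still groups together in later analyses (joins, counts) even though the raw identifier never enters the lake.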

  5. MEASURES THAT HELP IN DESIGNING THE RIGHT DATA TRANSFORMATION STRATEGY: • Store the data in a columnar format such as Apache Parquet or ORC. These formats offer optimized reads and are open source, which increases the availability of the data to various analytical services. • Partitioning the data by timestamp can have a great impact on search performance. • Small files can be compacted into bigger ones asynchronously, which helps reduce network overheads. • Z-order indexed materialized views help serve queries that span data stored in multiple columns. • Collect data set statistics such as file size, row counts, and histograms of values to optimize queries with join reordering.
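The timestamp-based partitioning mentioned above is commonly laid out as Hive-style `key=value` directories, which lets analytical engines prune whole partitions instead of scanning every file. A minimal sketch of building such a path (the bucket and table names are hypothetical):

```python
from datetime import datetime, timezone

def partition_path(base: str, table: str, event_time: datetime) -> str:
    """Build a Hive-style partition path (year=/month=/day=) for one event."""
    return (
        f"{base}/{table}"
        f"/year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}"
    )

ts = datetime(2024, 3, 7, 15, 30, tzinfo=timezone.utc)
print(partition_path("s3://lake/raw", "orders", ts))
# s3://lake/raw/orders/year=2024/month=03/day=07
```

A query filtered on a date range then only touches the matching `year=/month=/day=` directories, which is where the search-performance gain comes from.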

  6. Collect column and table statistics to estimate predicate selectivity and the cost of query plans; they also enable certain advanced rewrites in the Data Lake. 3. PRIORITISING SECURITY IN A DATA LAKE The RSA Data Privacy and Security survey conducted in 2019 revealed that 64% of its US respondents and 72% of its UK respondents blamed the company, not the hacker, for the loss of personal data. A common root cause is the lack of fine-grained access control mechanisms in the data lake. As data, tools, and users multiply, the risk of security breaches grows with them. Hence, curating a security strategy even before building a data lake is important: it lets teams benefit from the agility a data lake brings without compromising on safety.

  7. The data lake security protocols must account for compliance with major security policies. POINTS TO REMEMBER WHILE CURATING AN EFFICIENT SECURITY STRATEGY: • Authentication and authorization of the users who access the data lake must be enforced. For instance, person A might have access to edit the data lake whereas person B might have permission only to view it. Users must be authenticated with usernames, passwords, multi-factor authentication, etc. Integrating a strong identity-management tool with the underlying cloud solutions provider helps achieve this. • The data should be encrypted at all levels, i.e., both in transit and at rest, so that only the intended users can understand and use it.

  8. Access should be granted only to skilled and well-experienced administrators, thus minimizing the risk of breaches. • The data lake platform must be hardened so that its functions are isolated from the other existing cloud services. • Host security methods such as host intrusion detection, file integrity monitoring, and log management should be in place. • Redundant copies of critical data must be stored as a backup in another data lake so that they come in handy in cases of data corruption or accidental deletion. 4. IMPLEMENTING WELL-FORMULATED DATA GOVERNANCE STRATEGIES A good data governance strategy ensures data quality and consistency.

  9. It prevents the data lake from becoming an unmanageable data swamp. KEY POINTS TO REMEMBER WHILE CRAFTING A GOVERNANCE STRATEGY FOR A DATA LAKE: • Data should be identified and cataloged, and sensitive data must be clearly labeled. This helps users achieve better search results. • Metadata acts as a tagging system that organizes data and helps people find the different types of data they need without confusion. • No data should be stored beyond the time specified in the compliance protocols: retaining data longer than allowed inflates storage costs and violates those protocols. Defining proper retention policies for the data is therefore necessary.
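The retention policies described above boil down to an age check per data category: anything older than its allowed retention period is a candidate for purging. A minimal sketch, with hypothetical categories and retention periods (real values come from your compliance protocols):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data category, in days;
# actual values must come from the applicable compliance protocols.
RETENTION_DAYS = {"clickstream": 90, "transactions": 2555, "pii_audit": 365}

def is_expired(category: str, created_at: datetime, now: datetime) -> bool:
    """True when an object has outlived its retention period and should be purged."""
    limit = timedelta(days=RETENTION_DAYS[category])
    return now - created_at > limit

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(is_expired("clickstream", old, now))   # True: older than 90 days
print(is_expired("transactions", old, now))  # False: well within 7 years
```

A scheduled job could run a check like this over the catalog and delete (or archive) expired objects, keeping storage costs and compliance exposure in check.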

  10. THANK YOU
