Core Data Pipeline
The Business Problem
- Large organizations, such as Goldman Sachs, generate very large amounts of data. Traditionally, most of this data has been managed in functional silos, making it difficult and expensive to discover, query, and analyze data across those silos.
- Overlapping and sometimes redundant repositories, inconsistent meanings of data, and a lack of transparency around ownership create challenges, inefficiencies, and inconsistencies.
GS Data Lake
The Data Lake is a data-centric ecosystem that will allow transactional, operational, and reference data to be:
- Registered, ingested, validated, stored, and archived in its native form.
- Secured and entitled to authorized users.
- Modeled and made queryable using a unified query service.
- Cleansed, enriched, transformed, and analyzed through hosted compute engines.
- Managed as a first class asset with transparency on ownership, lineage, and provenance.
The GS Data Lake is being built to bring all of the firm's data into an accessible registry, where it can be rapidly analyzed for business purposes using hosted query services.
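The lifecycle described above (register, ingest and validate, secure and entitle, query, with ownership and lineage tracked throughout) can be sketched as a minimal in-memory registry. This is purely illustrative; every class, method, and field name here is a hypothetical stand-in, not the actual Data Lake API.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    # Hypothetical registry entry: a dataset managed as a first-class asset,
    # with ownership and lineage recorded alongside the data itself.
    name: str
    owner: str
    schema: set                                   # expected column names
    lineage: list = field(default_factory=list)   # names of upstream datasets
    rows: list = field(default_factory=list)      # data kept in its native form

class DataLakeRegistry:
    # Illustrative in-memory registry with entitlement checks on query.
    def __init__(self):
        self.datasets = {}
        self.entitlements = {}        # dataset name -> set of authorized users

    def register(self, record, users):
        self.datasets[record.name] = record
        self.entitlements[record.name] = set(users)

    def ingest(self, name, rows):
        record = self.datasets[name]
        for row in rows:              # validate against the registered schema
            if set(row) != record.schema:
                raise ValueError(f"row does not match schema of {name}")
        record.rows.extend(rows)

    def query(self, name, user):
        # Data is secured and entitled: only authorized users may read it.
        if user not in self.entitlements[name]:
            raise PermissionError(f"{user} is not entitled to {name}")
        return list(self.datasets[name].rows)
```

A producer might then register and publish a dataset like this:

```python
registry = DataLakeRegistry()
trades = DatasetRecord("trades", owner="desk-a", schema={"id", "price"})
registry.register(trades, users=["alice"])
registry.ingest("trades", [{"id": 1, "price": 101.5}])
registry.query("trades", "alice")   # entitled read succeeds
```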
Data Lake Users
There are three main actors in the data lake.
- Producers – responsible for registering and publishing their data to the data lake, and for ensuring it meets data validation standards and SLAs.
- Refiners – will cleanse, enrich, and transform the data and re-publish the curated version to the data lake.
- Consumers – will browse available data and run reports, queries, and analytics.
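The hand-off between the three actors can be sketched as a short pipeline. The function names, dataset names, and cleansing rules below are hypothetical illustrations, not part of any actual Data Lake interface:

```python
def produce(raw_prices):
    # Producer: publish data in its native form; it may contain bad records.
    return {"dataset": "prices.raw", "rows": raw_prices}

def refine(published):
    # Refiner: cleanse (drop malformed rows), enrich, and re-publish the
    # curated version to the lake under a new dataset name.
    curated = [r for r in published["rows"] if r.get("price") is not None]
    for r in curated:
        r.setdefault("currency", "USD")   # enrich with an assumed default
    return {"dataset": "prices.curated",
            "rows": curated,
            "source": published["dataset"]}   # lineage back to the raw data

def consume(curated):
    # Consumer: run a simple report (average price) over the curated dataset.
    rows = curated["rows"]
    return sum(r["price"] for r in rows) / len(rows)

raw = [{"price": 100.0}, {"price": None}, {"price": 102.0}]
consume(refine(produce(raw)))   # → 101.0 (malformed middle row dropped)
```

Keeping the curated dataset's `source` field pointing at the raw dataset mirrors the lineage and provenance tracking described earlier.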
We are building this infrastructure using open-source components such as Hadoop, Spark, Flink, and Hive, as well as commercial offerings and custom-developed software.
Technology Stack
- Apache Ecosystem - HDFS, MapReduce, Flink, Spark, Hive, HBase, Kafka, Avro, Parquet, Tomcat, ZooKeeper
- AWS - S3, Snowflake, Redshift
- Java EE
- Jersey RESTful Web Services
- Elasticsearch
- SAP Sybase IQ
- Bash shell scripting
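As one illustration of how several of these components fit together, a refiner might land curated Parquet files on HDFS and expose them to consumers through a Hive external table. The helper, table name, columns, and HDFS path below are hypothetical examples, not taken from any real schema registry:

```python
def hive_external_table_ddl(table, columns, hdfs_path):
    # Build a Hive DDL statement that exposes Parquet files on HDFS as a
    # queryable table; a sketch of one common pattern, nothing more.
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{hdfs_path}';"
    )

print(hive_external_table_ddl(
    "prices_curated",
    [("id", "BIGINT"), ("price", "DOUBLE"), ("currency", "STRING")],
    "hdfs:///datalake/curated/prices",
))
```

Because the table is external, dropping it in Hive leaves the underlying Parquet files on HDFS intact, which suits a lake where the data itself is the first-class asset.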