Core Data Pipeline
The Business Problem
- Large organizations, such as Goldman Sachs, generate very large amounts of data. Traditionally, most of this data has been managed in functional silos, making it difficult and expensive to discover, query, and analyze data across those silos.
- Overlapping and sometimes redundant repositories, inconsistent meanings of data, and a lack of transparency around ownership create challenges, inefficiencies, and inconsistencies.
GS Data Lake
The Data Lake is a data-centric ecosystem that will allow transactional, operational, and reference data to be:
- Registered, ingested, validated, stored, and archived in its native form.
- Secured and entitled to authorized users.
- Modeled and made queryable using a unified query service.
- Cleansed, enriched, transformed, and analyzed through hosted compute engines.
- Managed as a first class asset with transparency on ownership, lineage, and provenance.
The GS Data Lake is being built to bring all of the firm's data into an accessible registry, where it can be rapidly analyzed for business purposes using hosted query services.
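The lifecycle described above (register, ingest and validate, secure and entitle, query, with ownership and lineage tracked throughout) can be sketched as a minimal in-memory registry. This is purely illustrative; every class, method, and field name here is a hypothetical stand-in, not the actual Data Lake API.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    # Hypothetical registry entry: a dataset managed as a first-class asset,
    # with ownership and lineage recorded alongside the data itself.
    name: str
    owner: str
    schema: set                                   # expected column names
    lineage: list = field(default_factory=list)   # names of upstream datasets
    rows: list = field(default_factory=list)      # data kept in its native form

class DataLakeRegistry:
    # Illustrative in-memory registry with entitlement checks on query.
    def __init__(self):
        self.datasets = {}
        self.entitlements = {}        # dataset name -> set of authorized users

    def register(self, record, users):
        self.datasets[record.name] = record
        self.entitlements[record.name] = set(users)

    def ingest(self, name, rows):
        record = self.datasets[name]
        for row in rows:              # validate against the registered schema
            if set(row) != record.schema:
                raise ValueError(f"row does not match schema of {name}")
        record.rows.extend(rows)

    def query(self, name, user):
        # Data is secured and entitled: only authorized users may read it.
        if user not in self.entitlements[name]:
            raise PermissionError(f"{user} is not entitled to {name}")
        return list(self.datasets[name].rows)
```

A producer might then register and publish a dataset like this:

```python
registry = DataLakeRegistry()
trades = DatasetRecord("trades", owner="desk-a", schema={"id", "price"})
registry.register(trades, users=["alice"])
registry.ingest("trades", [{"id": 1, "price": 101.5}])
registry.query("trades", "alice")   # entitled read succeeds
```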
Data Lake Users
There are three main actors in the data lake.
- Producers – responsible for registering and publishing their data to the data lake, and for ensuring it meets data validation standards and SLAs.
- Refiners – will cleanse, enrich, and transform the data and re-publish the curated version to the data lake.
- Consumers – will browse available data and run reports, queries, and analytics.
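The hand-off between the three actors can be sketched as a short pipeline. The function names, dataset names, and cleansing rules below are hypothetical illustrations, not part of any actual Data Lake interface:

```python
def produce(raw_prices):
    # Producer: publish data in its native form; it may contain bad records.
    return {"dataset": "prices.raw", "rows": raw_prices}

def refine(published):
    # Refiner: cleanse (drop malformed rows), enrich, and re-publish the
    # curated version to the lake under a new dataset name.
    curated = [r for r in published["rows"] if r.get("price") is not None]
    for r in curated:
        r.setdefault("currency", "USD")   # enrich with an assumed default
    return {"dataset": "prices.curated",
            "rows": curated,
            "source": published["dataset"]}   # lineage back to the raw data

def consume(curated):
    # Consumer: run a simple report (average price) over the curated dataset.
    rows = curated["rows"]
    return sum(r["price"] for r in rows) / len(rows)

raw = [{"price": 100.0}, {"price": None}, {"price": 102.0}]
consume(refine(produce(raw)))   # → 101.0 (malformed middle row dropped)
```

Keeping the curated dataset's `source` field pointing at the raw dataset mirrors the lineage and provenance tracking described earlier.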
We are building this infrastructure using open-source components such as Hadoop, Spark, Flink, and Hive, as well as commercial offerings and custom-developed software.
Technology Stack
- Apache Ecosystem - HDFS, MapReduce, Flink, Spark, Hive, HBase, Kafka, Avro, Parquet, Tomcat, ZooKeeper
- AWS - S3, Snowflake, Redshift
- Java EE
- Jersey RESTful Web Services
- Elasticsearch
- SAP Sybase IQ
- Bash shell scripting
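As one illustration of how several of these components fit together, a refiner might land curated Parquet files on HDFS and expose them to consumers through a Hive external table. The helper, table name, columns, and HDFS path below are hypothetical examples, not taken from any real schema registry:

```python
def hive_external_table_ddl(table, columns, hdfs_path):
    # Build a Hive DDL statement that exposes Parquet files on HDFS as a
    # queryable table; a sketch of one common pattern, nothing more.
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{hdfs_path}';"
    )

print(hive_external_table_ddl(
    "prices_curated",
    [("id", "BIGINT"), ("price", "DOUBLE"), ("currency", "STRING")],
    "hdfs:///datalake/curated/prices",
))
```

Because the table is external, dropping it in Hive leaves the underlying Parquet files on HDFS intact, which suits a lake where the data itself is the first-class asset.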