This repository contains a Proof of Concept (PoC) for setting up and using DataHub, an open-source metadata platform for data discovery, management, and governance. The PoC demonstrates how to install and configure DataHub using Docker, load sample data, and optionally integrate with local Kafka and Airflow instances for data ingestion and processing.
Before you start, ensure that Docker is installed and that the Docker daemon is running. You can verify the installation by running:

```bash
docker --version
```

If Docker is not installed, follow the instructions on the Docker website to install it for your operating system.
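Note that `docker --version` only confirms that the Docker CLI is installed. To check that the daemon itself is reachable, you can also run the standard Docker status command, which fails if the daemon is not running:

```bash
docker info
```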
Once Docker is installed and running, you can proceed with the steps below to set up and use DataHub.
To run the PoC you will need:

- Docker
- Python 3
To install the required Python packages, run:

```bash
python3 -m pip install --upgrade -r requirements.txt
```

To start DataHub locally with Docker, run:

```bash
datahub docker quickstart [--version TEXT (e.g. "v0.9.2")]
```

The `--version` flag is optional and pins a specific DataHub release; omit it to use the default.

To load the sample data into DataHub, run:

```bash
datahub docker ingest-sample-data
```
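At this point DataHub should be running locally; by default the quickstart serves the web UI at http://localhost:9002. Recent versions of the DataHub CLI also include a health-check subcommand that you can use as an optional sanity check on the quickstart containers:

```bash
datahub docker check
```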
To start a local Airflow instance, go to the local_airflow directory and run:

```bash
docker-compose -f docker-compose.yml up -d
```
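If you want to confirm that the Airflow containers came up, a plain docker-compose status check (run from the same local_airflow directory) is enough:

```bash
docker-compose -f docker-compose.yml ps
```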
If you want to test Kafka ingestion to DataHub, go to the local_kafka directory and start the local Kafka broker:

```bash
docker-compose -f docker-compose.yml up -d
```
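Before continuing, you can check that the broker container is running (the container name kafka_test_broker is the one used by the docker exec commands later in this README):

```bash
docker ps --filter "name=kafka_test_broker"
```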
Next, ingest the Kafka metadata into DataHub: go to the local_datahub/recipes directory and run:

```bash
datahub ingest -c kafka_test_recipe.dhub.yaml
```
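The recipe file tells the DataHub CLI what to ingest and where to send it; a Kafka recipe typically pairs a kafka source (pointing at the local broker) with a datahub-rest sink (pointing at the local DataHub instance). If you want to validate the recipe without writing any metadata, recent DataHub CLI versions support a dry-run flag (an optional step, not required for the PoC):

```bash
datahub ingest -c kafka_test_recipe.dhub.yaml --dry-run
```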
To generate test data for the Kafka broker, go to the scripts directory and run:

```bash
python3 eth_tx.py
```
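If you prefer to publish a quick test message by hand instead of running the script, the console producer bundled in the broker image can write to the same topic (a sketch reusing the broker address and topic from the commands below; depending on your Kafka version you may need --broker-list instead of --bootstrap-server; type a message, press Enter, then Ctrl+C to exit):

```bash
docker exec --interactive --tty kafka_test_broker \
  kafka-console-producer --bootstrap-server kafka_test_broker:49816 \
  --topic transaction
```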
To list the topics on the local broker, run:

```bash
docker exec kafka_test_broker \
  kafka-topics --bootstrap-server kafka_test_broker:49816 \
  --list
```

To read the messages on the transaction topic, run:

```bash
docker exec --interactive --tty kafka_test_broker \
  kafka-console-consumer --bootstrap-server kafka_test_broker:49816 \
  --topic transaction \
  --from-beginning
```