Data Warehouse and ETL Pipelines
Our Data Warehouse, powered by PostgreSQL and SQLAlchemy, provides a centralized repository of structured data collected from various public and private sources. Its objective is to furnish the due diligence and third-party risk management (TPRM) processes with accurate, up-to-date, and contextually relevant data, enabling comprehensive and in-depth analysis.
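As a rough illustration, the warehouse is accessed through a standard SQLAlchemy engine and session factory. The connection URL, credentials, and pool settings below are placeholders rather than the production configuration.

```python
# Minimal sketch of the warehouse connection; the URL and settings are
# illustrative only. In practice credentials come from the environment.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(
    "postgresql+psycopg2://warehouse_user:password@localhost:5432/warehouse",
    pool_pre_ping=True,  # check connections before use to avoid stale ones
)

# Session factory used by ETL loaders and downstream analysis code.
Session = sessionmaker(bind=engine)
```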
Data Warehouse
The data warehouse is organized into schemas by data category, namely litigation_sources, regulator_sources, and mca_master. Each schema contains multiple tables, each storing a distinct set of data points described by its fields.
Litigation Sources
The litigation_sources schema comprises tables such as delhi_hc_records, ecourts_hc_records, itat_records, ncdrc_records, nclat_records, and nclt_records, which store litigation records from various courts and tribunals.
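The exact column layout of these tables is not reproduced here; the sketch below shows how one such table (nclt_records) might be declared with SQLAlchemy, with hypothetical fields, to illustrate how tables sit inside the schema.

```python
# Hypothetical declarative model for one litigation table; the actual
# columns in nclt_records may differ.
from sqlalchemy import Column, Date, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class NcltRecord(Base):
    __tablename__ = "nclt_records"
    __table_args__ = {"schema": "litigation_sources"}  # table lives in litigation_sources

    id = Column(Integer, primary_key=True)
    case_number = Column(String, index=True)  # assumed field
    bench = Column(String)                    # assumed field
    petitioner = Column(Text)                 # assumed field
    respondent = Column(Text)                 # assumed field
    filing_date = Column(Date)                # assumed field
    status = Column(String)                   # assumed field
```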
Regulator Sources
The regulator_sources schema contains the cibil_records table, which stores key financial data fetched from the Credit Information Bureau (India) Limited (CIBIL).
MCA Master
The mca_master schema contains tables such as mca_company_records, mca_signatory_records, mca_company_signatory_appointment_records, and mca_company_charge_records. Together, these hold data on companies, their signatories, signatory appointments, and charges registered against companies, obtained from the Ministry of Corporate Affairs (MCA).
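The sketch below illustrates, with assumed column names, how the company, signatory, and appointment tables could relate to one another; the real models may define additional fields and constraints.

```python
# Hypothetical models for the mca_master schema; column names and
# relationships are assumptions for illustration only.
from sqlalchemy import Column, Date, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class McaCompanyRecord(Base):
    __tablename__ = "mca_company_records"
    __table_args__ = {"schema": "mca_master"}

    id = Column(Integer, primary_key=True)
    cin = Column(String, unique=True, index=True)  # Corporate Identification Number (assumed)
    name = Column(String)
    appointments = relationship("McaCompanySignatoryAppointmentRecord", back_populates="company")

class McaSignatoryRecord(Base):
    __tablename__ = "mca_signatory_records"
    __table_args__ = {"schema": "mca_master"}

    id = Column(Integer, primary_key=True)
    din = Column(String, unique=True, index=True)  # Director Identification Number (assumed)
    name = Column(String)
    appointments = relationship("McaCompanySignatoryAppointmentRecord", back_populates="signatory")

class McaCompanySignatoryAppointmentRecord(Base):
    __tablename__ = "mca_company_signatory_appointment_records"
    __table_args__ = {"schema": "mca_master"}

    id = Column(Integer, primary_key=True)
    company_id = Column(Integer, ForeignKey("mca_master.mca_company_records.id"))
    signatory_id = Column(Integer, ForeignKey("mca_master.mca_signatory_records.id"))
    designation = Column(String)  # assumed field
    appointed_on = Column(Date)   # assumed field
    company = relationship("McaCompanyRecord", back_populates="appointments")
    signatory = relationship("McaSignatoryRecord", back_populates="appointments")
```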
ETL Pipelines
Extract-Transform-Load (ETL) pipelines play a vital role in our data infrastructure. They move data from extraction, where it is pulled from public data sources or third-party data providers, through transformation, where raw data is cleaned, normalized, and given a structured format, to loading, where it is written into the data warehouse.
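At a high level, a single pipeline run can be thought of as three functions chained together. The outline below is a simplified sketch with hypothetical function names, not the actual pipeline code; each phase is sketched in more detail in the sections that follow.

```python
# Simplified outline of one ETL run; function names are hypothetical.
def extract(source_url: str) -> list[dict]:
    """Pull raw records from a public source or third-party provider."""
    ...

def transform(raw_records: list[dict]) -> list[dict]:
    """Clean, deduplicate, and normalize the raw records."""
    ...

def load(records: list[dict]) -> None:
    """Write the structured records into the warehouse tables."""
    ...

def run_pipeline(source_url: str) -> None:
    records = extract(source_url)
    records = transform(records)
    load(records)
```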
Extraction
Data scrapers perform the extraction process, pulling data from various public data sources or third-party data providers. The extracted data is typically raw and unstructured.
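A minimal sketch of what a scraper-style extraction step might look like follows; the URL and response shape are placeholders and do not describe a real endpoint.

```python
# Hypothetical extraction step: fetch raw records from a placeholder
# endpoint. Real scrapers also handle pagination, retries, and
# source-specific markup.
import requests

def extract(source_url: str) -> list[dict]:
    response = requests.get(source_url, timeout=30)
    response.raise_for_status()
    # Assume the source returns JSON; HTML sources would be parsed into
    # a similar list-of-dicts shape before this point.
    return response.json()
```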
Transformation
The raw data is then transformed into a structured format that is usable for analysis and reporting. This involves data cleaning (removing duplicates, correcting errors), data normalization (standardizing values and formats for consistency), and data aggregation (grouping related data).
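For illustration, a transformation step over raw litigation records might deduplicate on a case number, trim stray whitespace, and parse dates into a consistent type; the field names and date format below are assumptions.

```python
# Hypothetical transformation step; field names are illustrative.
from datetime import datetime

def transform(raw_records: list[dict]) -> list[dict]:
    cleaned, seen = [], set()
    for record in raw_records:
        case_number = (record.get("case_number") or "").strip().upper()
        if not case_number or case_number in seen:
            continue  # drop empty or duplicate records
        seen.add(case_number)
        cleaned.append({
            "case_number": case_number,
            "petitioner": (record.get("petitioner") or "").strip(),
            "filing_date": datetime.strptime(record["filing_date"], "%d-%m-%Y").date()
            if record.get("filing_date") else None,
        })
    return cleaned
```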
Loading
After transformation, the processed data is loaded into the respective tables in the data warehouse. The loading process ensures each record is mapped to the correct table and fields, maintaining data integrity.
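A simplified loading step might look like the following, using SQLAlchemy's PostgreSQL insert with an ON CONFLICT clause as one way to avoid duplicate rows on re-runs; the model and conflict key are the hypothetical ones from the earlier sketches, not necessarily the actual implementation.

```python
# Hypothetical loading step: upsert transformed records into
# litigation_sources.nclt_records. NcltRecord is the model sketched
# under Litigation Sources; "case_number" is an assumed unique key.
from sqlalchemy.dialects.postgresql import insert

def load(session, records: list[dict]) -> None:
    for record in records:
        stmt = insert(NcltRecord).values(**record)
        stmt = stmt.on_conflict_do_update(
            index_elements=["case_number"],
            set_={k: v for k, v in record.items() if k != "case_number"},
        )
        session.execute(stmt)
    session.commit()
```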
Summary
The Data Warehouse and ETL pipelines are pivotal to the overall functioning of our platform. The warehouse provides a structured and scalable solution for data storage, while the ETL pipelines ensure the quality and usability of data by processing and loading it into the warehouse. Together, they establish a robust data infrastructure that fuels accurate and comprehensive due diligence and TPRM processes.