Data Cleaning Script

· 4 min read

Overview

This Python script is designed to clean, validate, and process real estate broker data from an Excel file. The script ensures data integrity by retaining all rows, handling missing values, formatting phone numbers, validating emails, and standardizing text fields. It outputs the cleaned data to an Excel file with flags for missing or invalid data, providing a comprehensive and traceable cleaning process.

Features

  1. Data Loading: Reads data from an Excel file while preserving original formats, such as leading zeros.
  2. Column Name Cleaning: Strips whitespace from column headers for consistency.
  3. Missing Data Handling: Fills missing critical values and flags rows with incomplete data.
  4. Duplicate Handling: Flags duplicates instead of removing them.
  5. Phone Number Formatting:
    • Formats numbers to the E.164 standard for UAE.
    • Retains original phone numbers alongside formatted ones.
  6. Email Validation: Ensures emails are valid and replaces invalid entries with a placeholder.
  7. Date Conversion: Converts date fields to a standard format and flags invalid dates.
  8. Text Standardization: Standardizes text fields, ensuring consistent formatting.
  9. Logging: Tracks each step of the cleaning process, providing detailed insights into row retention and transformations.
  10. Comprehensive Output: Outputs the cleaned data to an Excel file, retaining original values and adding flags for review.

Approach

The script processes data sequentially using the following steps (a minimal sketch of the pipeline follows the list):

  1. Load Data

    • Reads the input Excel file into a Pandas DataFrame.
    • Ensures all columns are treated as strings to preserve data integrity.
  2. Clean Column Names

    • Strips leading and trailing whitespace from column headers.
  3. Handle Missing Values

    • Fills missing critical fields (e.g., email, broker number) with placeholders.
    • Adds a Missing Data flag to identify incomplete rows.
  4. Flag Duplicates

    • Flags duplicate rows based on critical columns without removing them.
  5. Clean Phone Numbers

    • Formats phone numbers to the UAE E.164 standard.
    • Retains the original phone number for reference.
    • Flags invalid phone numbers.
  6. Validate Emails

    • Uses a regex pattern to validate email addresses.
    • Replaces invalid entries with a placeholder (invalid@domain.com).
  7. Convert Dates

    • Converts date columns to a standard datetime format.
    • Flags invalid dates for review.
  8. Standardize Text Fields

    • Converts text fields to uppercase and removes extra whitespace for consistency.
  9. Output Data

    • Outputs the cleaned data to an Excel file, including flags for missing or invalid data.
    • Saves original and processed values for transparency.
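
For orientation, here is a minimal sketch of this pipeline. It uses the column names from the example further below (CardHolderNameEn, CardHolderMobile, RealEstateNumber, CardIssueDate); the Email column and the file name input.xlsx are assumptions, and the actual cleanup_data.py may differ in details:

import re
import pandas as pd

# 1. Load data: read everything as strings so original formats (e.g. leading zeros) survive.
input_file = "input.xlsx"  # update to your Excel file name
df = pd.read_excel(input_file, dtype=str)

# 2. Clean column names.
df.columns = df.columns.str.strip()

# 3. Handle missing values: fill critical fields and flag incomplete rows.
critical_cols = ["CardHolderMobile", "RealEstateNumber"]
df["Missing Data"] = df[critical_cols].isna().any(axis=1)
df[critical_cols] = df[critical_cols].fillna("MISSING")

# 4. Flag duplicates on critical columns without dropping them.
df["Duplicate"] = df.duplicated(subset=critical_cols, keep="first")

# 5. Clean phone numbers: normalize UAE mobiles to E.164 and keep the original.
def format_uae_phone(raw):
    digits = re.sub(r"\D", "", str(raw))
    if digits.startswith("0"):
        digits = "971" + digits[1:]
    if digits.startswith("971") and len(digits) == 12:
        return "+" + digits
    return None  # invalid

df["Original Phone"] = df["CardHolderMobile"]
df["Formatted Phone"] = df["CardHolderMobile"].apply(format_uae_phone)
df["Phone Valid"] = df["Formatted Phone"].notna()

# 6. Validate emails; replace invalid entries with the placeholder.
if "Email" in df.columns:
    valid = df["Email"].fillna("").str.match(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
    df.loc[~valid, "Email"] = "invalid@domain.com"

# 7. Convert dates; unparseable values become NaT and are flagged.
df["CardIssueDate"] = pd.to_datetime(df["CardIssueDate"], errors="coerce")
df["CardIssueDate_Valid"] = df["CardIssueDate"].notna()

# 8. Standardize text fields: uppercase and collapse extra whitespace.
df["CardHolderNameEn"] = (
    df["CardHolderNameEn"].str.upper().str.replace(r"\s+", " ", regex=True).str.strip()
)

# 9. Output with flags for review.
df.to_excel("cleaned_output.xlsx", index=False)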

Usage

Prerequisites

  • Python 3.8 or later
  • Required Python libraries:
    • pandas
    • numpy
    • openpyxl

Installation

  1. Clone the repository or download the script.
  2. Install dependencies using:
    pip install pandas numpy openpyxl

Running the Script

  1. Place your input Excel file in the same directory as the script.
  2. Update the input_file variable in the script with the name of your Excel file.
  3. Run the script:
    python cleanup_data.py

Output

  • Cleaned Data: Saved as cleaned_output.xlsx in the same directory.
  • Log File: The script logs detailed processing steps and errors to the console.

Key Files

  • cleanup_data.py: Main script for data cleaning and validation.
  • cleaned_output.xlsx: Output file with cleaned data.
  • README.md: Documentation for the script.

Example

Given the following input data:

| CardHolderNameEn | CardHolderMobile | RealEstateNumber | CardIssueDate |
| --- | --- | --- | --- |
| John Doe | 0501234567 | 123 | 2020-01-01 |
| Jane Smith | 971505432198 | 456 | |

The script will produce the following output:

| Name English | Formatted Phone | Broker Number | CardIssueDate | Missing Data | Phone Valid | CardIssueDate_Valid |
| --- | --- | --- | --- | --- | --- | --- |
| John Doe | +971501234567 | 123 | 2020-01-01 | False | True | True |
| Jane Smith | +971505432198 | 456 | NaT | True | True | False |

Notes

  • Ensure input files are in the correct format before running the script.
  • For large datasets, processing time may vary based on system performance.

Contact

For any issues or feature requests, please create an issue in the repository or contact the maintainer.


Enjoy using the Data Cleaning Script!

PostgreSQL Backup, Restore, and Django Connection Setup

· 3 min read

This document provides a comprehensive guide on backing up PostgreSQL from AWS RDS, restoring it into a Docker container, and connecting it to a Django application.


1. Take PostgreSQL Database Dump

Step 1: Install PostgreSQL Tools (if needed)

brew install postgresql

Step 2: Create Database Backup Using pg_dump

pg_dump -h <AWS_RDS_ENDPOINT> -U <DB_USERNAME> -d <DB_NAME> -p 5432 -F c -b -v -f db_dump.backup

Example:

pg_dump -h rezio-staging.cfwuacqgqc0d.me-central-1.rds.amazonaws.com -U aketai -d rezio_staging -p 5432 -F c -b -v -f rezio_staging.backup

Options Explained:

  • -h: AWS RDS endpoint
  • -U: Database username
  • -d: Database name
  • -p: Database port (usually 5432)
  • -F c: Format as custom backup
  • -b: Include blobs
  • -v: Verbose mode
  • -f: Output file name

2. Restore the Database in Docker

Step 1: Run PostgreSQL in Docker

docker run --name postgres_local -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=rezio_staging_local -p 5432:5432 -d postgres:16.6

Step 2: Copy the Backup to the Docker Container

docker cp db_dump.backup postgres_local:/db_dump.backup

Step 3: Restore the Backup

docker exec -it postgres_local bash
pg_restore -U postgres -d rezio_staging_local -v /db_dump.backup

Step 4: Verify the Restore

psql -U postgres -d rezio_staging_local
\dt
SELECT * FROM user_user LIMIT 5;
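
The same sanity check can be scripted from the host with psycopg2 (a minimal sketch, assuming pip install psycopg2-binary and the port mapping from Step 1; user_user is the table queried above):

# Quick check that the restore into the Docker container worked.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="rezio_staging_local",
    user="postgres",
    password="postgres",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM user_user;")
    print("user_user rows:", cur.fetchone()[0])
conn.close()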

3. Connect PostgreSQL to Django App

Step 1: Update Django settings.py

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'rezio_staging_local',  # Docker DB name
        'USER': 'postgres',             # Docker username
        'PASSWORD': 'postgres',         # Docker password
        'HOST': 'localhost',            # Docker host
        'PORT': '5432',                 # Docker port
    }
}

Step 2: Apply Django Migrations

python manage.py makemigrations
python manage.py migrate

Step 3: Test the Connection

python manage.py runserver

Access: http://localhost:8000
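
To confirm the database connection without opening the browser, a quick check from the Django shell also works; this is a minimal sketch using the user_user table verified during the restore:

# Paste into `python manage.py shell` to confirm Django reaches the restored database.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("SELECT count(*) FROM user_user;")
    print("user_user rows:", cursor.fetchone()[0])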


4. Troubleshooting & Additional Commands

Check Docker Container Status

docker ps

Check PostgreSQL Logs

docker logs postgres_local

Manually Test PostgreSQL Connection

psql -h localhost -U postgres -d rezio_staging_local

Inspect Docker Network

docker network inspect bridge

Check Django Database Shell

python manage.py dbshell

Recreate Docker Container (if needed)

docker stop postgres_local && docker rm postgres_local

By following these steps, you can successfully take a database dump from AWS RDS, restore it in Docker, and connect it to a Django application for local development and testing.

Setting Up Grafana with AWS RDS PostgreSQL Database

· 3 min read

This guide walks through the end-to-end process of setting up Grafana on an AWS EC2 instance and connecting it securely to an AWS RDS PostgreSQL database.


1. Setting Up Grafana on EC2

Step 1: Connect to Your EC2 Instance

SSH into your EC2 instance:

ssh -i /path/to/your-key.pem ec2-user@<your-ec2-public-dns>

Step 2: Install Grafana

  1. Install the Grafana package directly from the official RPM:

    sudo yum install -y https://dl.grafana.com/oss/release/grafana-9.4.7-1.x86_64.rpm
  2. Install Grafana:

    sudo yum install grafana -y
  3. Start the Grafana service:

    sudo systemctl start grafana-server
    sudo systemctl enable grafana-server
  4. Verify Grafana is running:

    sudo systemctl status grafana-server

2. Configuring Security Groups

Step 1: Update Security Group for EC2

  1. Go to the AWS Management Console.
  2. Navigate to the Security Groups section under EC2.
  3. Edit the security group attached to your EC2 instance to allow inbound traffic:
    • Protocol: TCP
    • Port Range: 3000
    • Source: Your IP (or 0.0.0.0/0 for testing).

Step 2: Update Security Group for RDS

  1. Navigate to RDS → Databases.
  2. Select your database instance.
  3. Add an inbound rule to the associated security group:
    • Protocol: TCP
    • Port Range: 5433
    • Source: EC2 instance’s private IP or its security group.

3. Download AWS RDS Root Certificate

  1. SSH into your EC2 instance:

    ssh -i /path/to/your-key.pem ec2-user@<your-ec2-public-dns>
  2. Download the certificate:

    sudo curl -o /etc/grafana/me-central-1-bundle.pem https://truststore.pki.rds.amazonaws.com/global-bundle.pem
  3. Verify the certificate:

    cat /etc/grafana/me-central-1-bundle.pem
  4. Set appropriate permissions:

    sudo chmod 644 /etc/grafana/me-central-1-bundle.pem

4. Test Database Connectivity

Step 1: Using psql Command

Test the connection to your RDS instance with SSL:

psql "host=<rds-endpoint> port=5433 dbname=<db-name> user=<db-user> sslmode=verify-full sslrootcert=/etc/grafana/me-central-1-bundle.pem"

If successful, you should see:

SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
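
If you prefer to script this test, psycopg2 passes sslmode and sslrootcert straight through to libpq (a minimal sketch, assuming pip install psycopg2-binary on the EC2 instance; the placeholders match the psql command above):

# Verify SSL connectivity to RDS using the downloaded certificate bundle.
import psycopg2

conn = psycopg2.connect(
    host="<rds-endpoint>",
    port=5433,
    dbname="<db-name>",
    user="<db-user>",
    password="<db-password>",
    sslmode="verify-full",
    sslrootcert="/etc/grafana/me-central-1-bundle.pem",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()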

5. Configure Grafana Data Source

Step 1: Log in to Grafana

  • Open your browser and navigate to:
    http://<your-ec2-public-dns>:3000
  • Log in using the default credentials:
    • Username: admin
    • Password: admin (change this immediately).

Step 2: Add PostgreSQL Data Source

  1. Go to Configuration → Data Sources → Add Data Source.
  2. Select PostgreSQL.

Step 3: Configure Connection Settings

Fill in the following details:

| Field | Value |
| --- | --- |
| Host | <rds-endpoint>:5433 |
| Database | <db-name> |
| User | <db-user> |
| Password | <db-password> |
| SSL Mode | verify-full |
| TLS/SSL Certificate | /etc/grafana/me-central-1-bundle.pem |

Then click Save & Test to verify the connection.

6. Troubleshooting

Issue: pq: couldn't parse pem in sslrootcert

  • Ensure the certificate file is correctly formatted:
    cat /etc/grafana/me-central-1-bundle.pem
  • Re-download the certificate:
    sudo curl -o /etc/grafana/me-central-1-bundle.pem https://truststore.pki.rds.amazonaws.com/global-bundle.pem
  • Remove extraneous content (if any):
    sudo nano /etc/grafana/me-central-1-bundle.pem

Issue: no pg_hba.conf entry

  • Update the RDS security group to allow access from your EC2 instance’s IP or security group.

7. Final Security Recommendations

  1. Restrict security group rules to trusted IP addresses.
  2. Regularly update Grafana and PostgreSQL to patch security vulnerabilities.
  3. Use strong passwords and enable monitoring for RDS and EC2 instances.

Conclusion

By following these steps, you’ve set up Grafana on your EC2 instance and connected it securely to an AWS RDS PostgreSQL database with SSL encryption.

Rezio Tech Vision Statement

· 2 min read

Mission Statement

To revolutionize the real estate experience through technology-driven innovation, leveraging AI-powered insights, seamless customer journeys, and a scalable, resilient cloud infrastructure.

Tech Stack Direction

Frontend:

  • Framework: Flutter
  • Target Platforms: iOS, Android, Web
  • State Management: Riverpod, Provider
  • UI Frameworks & Libraries: Material UI, Custom Widgets

Backend:

  • Framework: Django (Python)
  • Database: PostgreSQL (Managed by Amazon RDS)
  • APIs: REST, GraphQL (planned integration)

Cloud Infrastructure (AWS):

  • Compute: Amazon EC2 (Auto-scaling Groups)
  • Storage: Amazon S3 for static assets, file storage
  • Database: Amazon RDS (PostgreSQL), Amazon ElastiCache for caching
  • Networking: Amazon VPC, Route 53 for DNS management
  • Monitoring & Logging: Amazon CloudWatch, AWS X-Ray, ELK Stack
  • Security: AWS WAF, AWS Secrets Manager, IAM Policies

DevOps and CI/CD:

  • Version Control: GitHub
  • CI/CD Pipeline: GitHub Actions, AWS CodePipeline
  • Infrastructure as Code: Terraform, AWS CloudFormation
  • Containerization & Orchestration: Docker, Amazon ECS

AI Ambitions

AI-Powered Insights:

  • Recommendation Engine: Property recommendations based on user preferences and browsing history.
  • Market Trends Forecasting: Predictive models using historical property data.

Intelligent Automation:

  • Lead Scoring: AI-driven lead qualification and property matching.
  • Chatbots & Virtual Assistants: Customer support and property inquiries.

Data Analytics & Insights:

  • Data Lake Integration: Amazon S3 and AWS Glue for big data processing.
  • AI/ML Models: Amazon SageMaker for training custom ML models.

DevOps Philosophy

  1. Automation First: Emphasizing Infrastructure as Code (IaC), automated testing, and CI/CD pipelines.
  2. Scalability & Resilience: Auto-scaling infrastructure with multi-AZ deployments.
  3. Security by Design: IAM policies, role-based access, and automated vulnerability scans.
  4. Observability & Incident Management: Real-time monitoring, proactive alerts, and performance optimization through CloudWatch and Datadog.
  5. Agile & Iterative Development: Adopting Scrum/Kanban practices, with bi-weekly sprints and continuous feedback loops.

Together, we build the future of real estate with innovation, precision, and intelligence.