After acquiring and staging the raw data, the next critical step involves transforming and cleaning it to ensure consistency, accuracy, and usability. Data from external sources often contains inconsistencies, missing values, or irregular formats that must be addressed before the data can be used for analysis and visualization. The transformation process includes standardizing data formats, removing duplicates, and applying necessary calculations or aggregations. Data cleaning involves resolving errors, filling in gaps, and ensuring that the data aligns with predefined quality standards. By applying these techniques, we ensure that the final dataset is reliable, comprehensive, and ready for use in the subsequent analysis stages.
3.1 Data Cleaning
The data cleaning process is crucial to prepare raw datasets for analysis by ensuring they meet the necessary quality standards. A primary focus is on handling null values, which are common in datasets from external sources. Depending on the nature of the missing data, we either remove records with null values or apply imputation techniques to fill in gaps, using methods such as mean, median, or domain-specific estimations. In cases where null values represent meaningful absence, they are retained but marked explicitly to avoid misinterpretation.
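To make the handling of null values concrete, the sketch below shows a minimal pandas-based version of this logic. The table and column names are hypothetical, and the choice between removal, imputation, and explicit flagging depends on the variable in question.

```python
import pandas as pd

# Hypothetical staging extract; column names are illustrative only.
records = pd.DataFrame({
    "site_id":     [101, 102, 103, None],
    "measurement": [4.2, None, 5.1, 3.8],
    "reported_at": ["2023-01-05", "2023-01-06", None, "2023-01-08"],
})

# Numeric gaps: impute with the column median (the mean or a domain-specific
# estimate could be substituted, depending on the variable).
records["measurement"] = records["measurement"].fillna(records["measurement"].median())

# Meaningful absence: keep the null but mark it explicitly so it is not misread.
records["reported_at_missing"] = records["reported_at"].isna()

# Records missing an identifying key cannot be salvaged and are removed.
records = records.dropna(subset=["site_id"])
```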
Another key aspect of data cleaning is handling duplicate values. Duplicates can arise from repeated data entries or errors during data acquisition. We systematically identify and remove duplicates using predefined rules that compare key attributes across records to ensure that only unique and accurate entries are retained in the final dataset. This step is critical to avoid skewed analysis results and ensure data integrity.
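A minimal sketch of this rule-based de-duplication is shown below, assuming a hypothetical set of key attributes; in practice the comparison keys are defined per data source.

```python
import pandas as pd

# Hypothetical key attributes that define a unique logical record for one source.
KEY_COLUMNS = ["site_id", "reported_at", "metric_name"]

def drop_duplicate_records(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep the first occurrence of each logical record, matched on the key attributes."""
    return frame.drop_duplicates(subset=KEY_COLUMNS, keep="first")
```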
Finally, we normalize the data by standardizing its formats and structures. This involves converting values to consistent representations (e.g., dates to a common format, measurements to standard units) and designing the database schema to satisfy relational normalization rules, typically up to the third normal form (3NF). Normalization reduces data redundancy and improves query performance, making the dataset more efficient and easier to analyze. By applying these techniques, we create a clean, reliable dataset that serves as the foundation for accurate research and analysis.
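The format-standardization half of this step can be sketched as follows (normalization to 3NF is a schema-design decision rather than a transformation script). The column names and the pound-to-kilogram conversion are illustrative assumptions rather than actual source fields.

```python
import pandas as pd

def standardize_formats(frame: pd.DataFrame) -> pd.DataFrame:
    """Rewrite dates as ISO 8601 and convert measurements to a common unit."""
    out = frame.copy()
    # Parse dates and rewrite them in a single ISO 8601 representation.
    out["reported_at"] = pd.to_datetime(out["reported_at"], errors="coerce").dt.strftime("%Y-%m-%d")
    # Illustrative unit conversion: values reported in pounds converted to kilograms.
    in_pounds = out["unit"].str.lower().eq("lb")
    out.loc[in_pounds, "value"] = out.loc[in_pounds, "value"] * 0.453592
    out.loc[in_pounds, "unit"] = "kg"
    return out
```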
3.2 ETL Tools and Frameworks
To perform data cleaning and transformation, we utilize a Windows-based platform with SQL Server as our primary database management system. SQL Server provides a robust and scalable environment for handling large datasets, offering powerful tools for data manipulation, storage, and query performance optimization. Our data cleaning and transformation workflows leverage a combination of T-SQL, Python, and PowerShell to automate and streamline the processing of raw data.
T-SQL (Transact-SQL) is used extensively within SQL Server to manage, transform, and query the data stored in our staging and production tables. We use T-SQL scripts to handle operations such as removing duplicates, filtering invalid data, updating null values, and enforcing data integrity constraints. Because T-SQL is SQL Server’s native query language, these tasks run directly inside the database engine, ensuring high performance and efficient resource usage.
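As an illustration of this kind of in-database step, the sketch below issues a common ROW_NUMBER-based de-duplication and a null-replacement update from a Python script via pyodbc. The connection string, schema, table, and column names are placeholders, not our actual objects.

```python
import pyodbc

# Placeholder connection to a local SQL Server instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=StagingDB;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Delete duplicate staging rows, keeping the earliest-loaded row per logical key.
cursor.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY site_id, reported_at, metric_name
                   ORDER BY load_id
               ) AS rn
        FROM stg.measurements
    )
    DELETE FROM ranked WHERE rn > 1;
""")

# Replace nulls that represent a known default rather than a meaningful absence.
cursor.execute("UPDATE stg.measurements SET unit = 'kg' WHERE unit IS NULL;")

conn.commit()
conn.close()
```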
Python plays a vital role in more complex data cleaning and transformation tasks that require external libraries or advanced data processing capabilities. Python scripts are used to apply statistical methods for handling missing data, perform data validation, and transform raw data into structured formats ready for analysis. Python’s extensive ecosystem of data manipulation libraries (e.g., Pandas, NumPy) provides flexibility when working with a wide variety of data formats, making it indispensable for tasks requiring custom transformations.
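For example, a lightweight validation pass with pandas and NumPy might look like the following; the column names and plausibility thresholds are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def validate_measurements(frame: pd.DataFrame) -> pd.DataFrame:
    """Coerce types and flag suspect values for review rather than dropping them silently."""
    out = frame.copy()
    # Enforce numeric types; anything unparseable becomes NaN for later inspection.
    out["value"] = pd.to_numeric(out["value"], errors="coerce")
    # Plausibility check against an assumed valid range.
    out["out_of_range"] = ~out["value"].between(0, 1_000)
    # Simple statistical screen: flag values more than three standard deviations from the mean.
    z_scores = np.abs((out["value"] - out["value"].mean()) / out["value"].std())
    out["outlier"] = z_scores > 3
    return out
```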
PowerShell is used to automate repetitive tasks related to data import, scheduling, and file management. PowerShell scripts enable the automated retrieval of raw data files, trigger scheduled database operations, and move processed files to appropriate directories. By integrating PowerShell into our workflow, we ensure that data pipelines remain consistent and that processes such as importing new datasets or backing up data are executed on time and without manual intervention.
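In our pipeline this automation is implemented in PowerShell; purely for consistency with the other sketches in this section, the file-staging portion is outlined below in Python, with a hypothetical directory layout.

```python
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical directory layout for incoming and processed raw files.
INBOX = Path(r"D:\pipeline\inbox")
PROCESSED = Path(r"D:\pipeline\processed")

def archive_processed_files(pattern: str = "*.csv") -> None:
    """Move imported raw files into a dated archive folder so the inbox stays clean."""
    target = PROCESSED / datetime.now().strftime("%Y-%m-%d")
    target.mkdir(parents=True, exist_ok=True)
    for source_file in INBOX.glob(pattern):
        shutil.move(str(source_file), str(target / source_file.name))
```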
The combination of T-SQL, Python, and PowerShell enables us to build a comprehensive and efficient data cleaning and transformation pipeline. Each step in the process, from initial data ingestion to normalization and validation, is managed with these tools to ensure data quality and readiness for analysis. However, the specifics of each process can vary based on the data source, format, and transformation requirements.
For those seeking to replicate or adapt our methodology, we provide detailed technical process documents on our website. These documents outline the exact scripts, configurations, and procedures used for each data source and transformation step. Each document includes code samples, troubleshooting tips, and customization options so that the processes can be tailored to the needs of other projects. These resources are intended to assist other research institutions in implementing similar workflows, ensuring that data is cleaned, transformed, and stored efficiently and securely.
3.3 Data Quality Assurance
To ensure the reliability and accuracy of the data after it has been imported and transformed, we implement a comprehensive Data Quality Assurance (QA) process. This involves both automated and manual checks using T-SQL and visualization tools like Power BI to verify and validate the data. T-SQL scripts are employed to run a series of validation checks, such as ensuring data integrity by verifying relationships between tables, checking for missing or duplicate values, and validating data types and ranges. These scripts help identify potential issues early in the pipeline, allowing us to correct errors before the data is used for analysis.
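A sketch of how such checks can be scripted and reported is shown below, using pyodbc against SQL Server; the database, table, and column names, as well as the valid range, are illustrative assumptions rather than our actual validation suite.

```python
import pyodbc

# Illustrative post-load checks; object names and thresholds are placeholders.
CHECKS = {
    "orphaned_measurements": """
        SELECT COUNT(*) FROM dbo.measurements m
        LEFT JOIN dbo.sites s ON s.site_id = m.site_id
        WHERE s.site_id IS NULL;""",
    "duplicate_keys": """
        SELECT COUNT(*) FROM (
            SELECT site_id, reported_at, metric_name
            FROM dbo.measurements
            GROUP BY site_id, reported_at, metric_name
            HAVING COUNT(*) > 1
        ) AS d;""",
    "values_out_of_range": "SELECT COUNT(*) FROM dbo.measurements WHERE value NOT BETWEEN 0 AND 1000;",
}

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=ResearchDB;Trusted_Connection=yes;"
)
cursor = conn.cursor()
for name, query in CHECKS.items():
    # A non-zero count means the check failed and the batch needs review before release.
    failing_rows = cursor.execute(query).fetchval()
    print(f"{name}: {failing_rows} offending rows")
conn.close()
```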
In addition to T-SQL, we use Power BI to create visualizations that provide an intuitive view of the dataset’s completeness and correctness. By generating charts, tables, and dashboards, we can easily spot trends, anomalies, or inconsistencies that might not be apparent through raw data queries alone. These visual inspections complement the automated checks, offering a more holistic approach to verifying that the data is accurate, properly transformed, and ready for further analysis. This dual approach, combining structured SQL validation with visual data exploration, ensures that data quality issues are addressed promptly and that the final dataset meets our research standards.
3.4 Final Data Storage
Once the data has been thoroughly cleaned, validated, and transformed, it is moved from the staging environment to the final production tables. These tables are structured differently from the staging tables, with each column assigned the most appropriate data type to ensure optimal storage and query performance. Unlike the raw format in the staging tables, the production tables are designed to align with the specific requirements of analysis and reporting, with normalized structures that reduce redundancy and improve data integrity.
To further enhance performance, we implement database optimizations such as indexes, partitions, and other strategies where appropriate. Indexes are applied to frequently queried columns to speed up retrieval times, while partitions are used to segment large datasets based on logical divisions, such as date ranges or geographic regions. These optimizations ensure that the production environment is scalable and capable of handling complex queries efficiently, even as the dataset grows over time. By designing the production tables with both data integrity and performance in mind, we provide a solid foundation for downstream analysis and reporting.
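The sketch below shows what a typed, partitioned, and indexed production table of this kind might look like, issued from Python via pyodbc. Every object name, data type, and partition boundary here is an illustrative assumption rather than our actual schema.

```python
import pyodbc

# Each statement is executed separately; all names, types, and boundaries are illustrative.
DDL_STATEMENTS = [
    # Partition large fact data by year so date-bounded queries scan fewer rows.
    """CREATE PARTITION FUNCTION pf_measurement_year (date)
           AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01', '2024-01-01');""",
    """CREATE PARTITION SCHEME ps_measurement_year
           AS PARTITION pf_measurement_year ALL TO ([PRIMARY]);""",
    # Typed production table, in contrast to the loosely typed staging copy.
    """CREATE TABLE dbo.measurements (
           measurement_id bigint IDENTITY(1,1) NOT NULL,
           site_id        int            NOT NULL,
           metric_name    varchar(50)    NOT NULL,
           reported_at    date           NOT NULL,
           value          decimal(12, 4) NOT NULL,
           CONSTRAINT pk_measurements PRIMARY KEY CLUSTERED (measurement_id, reported_at)
               ON ps_measurement_year (reported_at)
       );""",
    # Nonclustered index on the columns most frequently filtered in reporting queries.
    """CREATE INDEX ix_measurements_site_metric
           ON dbo.measurements (site_id, metric_name)
           INCLUDE (value);""",
]

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=ResearchDB;Trusted_Connection=yes;"
)
for statement in DDL_STATEMENTS:
    conn.execute(statement)
conn.commit()
conn.close()
```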