Data Engineering

Data Collection and Acquisition

  • What You Need to Know

Data Preprocessing and Cleaning

  • What You Need to Know
    • Data Cleaning Techniques

    • Data Transformation and Normalization

    • Categorical Data Encoding

Feature Engineering and Selection

Data Pipeline Architecture

  • What You Need to Know
    • ETL Pipeline Design

      • Extract, Transform, Load (ETL) vs ELT architectures
      • Data pipeline orchestration and scheduling
      • Error handling and data quality monitoring
      • Resources:
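
The ETL topics above (extract, transform, load, plus error handling and quality monitoring) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a production design: the `payments` table, the `RAW_CSV` sample, and all function names are hypothetical. Bad rows are collected rather than aborting the run, and the returned counts stand in for a data quality metric.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; the second row has an invalid amount on purpose.
RAW_CSV = """user_id,amount,currency
1,10.50,usd
2,not_a_number,usd
3,7.25,eur
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast types and normalize values; collect rejects instead of failing."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append((int(row["user_id"]), float(row["amount"]), row["currency"].upper()))
        except (KeyError, ValueError) as exc:
            rejected.append((row, str(exc)))
    return clean, rejected

def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    conn.commit()

def run_pipeline(text, conn):
    """Run all three stages and return simple counts for quality monitoring."""
    rows = extract(text)
    clean, rejected = transform(rows)
    load(conn, clean)
    return {"extracted": len(rows), "loaded": len(clean), "rejected": len(rejected)}

conn = sqlite3.connect(":memory:")
stats = run_pipeline(RAW_CSV, conn)
print(stats)
```

In an ELT variant, the raw rows would be loaded first and the casting and normalization would run inside the database. An orchestrator would call something like `run_pipeline` on a schedule and alert when the rejected count crosses a threshold.
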
    • Data Storage and Retrieval

    • Batch vs Stream Processing

Big Data Technologies

  • What You Need to Know
    • Distributed Computing Frameworks

      • Apache Spark for large-scale data processing
      • Hadoop ecosystem and HDFS storage
      • Dask for parallel computing in Python
      • Resources:
    • NoSQL Databases for ML

    • Cloud Data Services

      • AWS S3, RDS, and Redshift for data storage
      • Google BigQuery for analytics
      • Azure Data Factory for ETL processes
      • Resources:

Data Validation and Quality Assurance

  • What You Need to Know
    • Data Schema Validation

      • Schema definition and enforcement
      • Data type validation and constraints
      • Automated data quality checks
      • Resources:
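
To make the schema validation topics above concrete, here is a small sketch of schema definition, type checks, and range constraints in plain Python. The `SCHEMA` layout and the `validate_record` helper are assumptions made for illustration, not the format of any particular validation library.

```python
# Hypothetical schema: field name -> expected type and optional constraints.
SCHEMA = {
    "user_id": {"type": int, "required": True, "min": 1},
    "age": {"type": int, "required": False, "min": 0, "max": 130},
    "email": {"type": str, "required": True},
}

def validate_record(record, schema=SCHEMA):
    """Return a list of human-readable errors; an empty list means the record is valid."""
    errors = []
    for field, rule in schema.items():
        if record.get(field) is None:
            if rule.get("required"):
                errors.append(f"{field}: missing required field")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly
        # (this sketch has no boolean fields).
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            errors.append(
                f"{field}: expected {rule['type'].__name__}, got {type(value).__name__}"
            )
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: {value} is below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: {value} is above maximum {rule['max']}")
    # Enforce the schema strictly: unknown fields are reported too.
    for field in record:
        if field not in schema:
            errors.append(f"{field}: unexpected field")
    return errors

print(validate_record({"user_id": 0, "age": 200, "email": "a@example.com"}))
```

An automated quality check could run `validate_record` over every incoming batch and fail the pipeline step when the share of invalid records exceeds an agreed threshold.
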
    • Statistical Data Profiling

Time Series Data Processing

  • What You Need to Know
    • Time Series Preprocessing

      • Handling irregular time intervals and missing timestamps
      • Resampling and interpolation techniques
      • Seasonal decomposition and trend analysis
      • Resources:
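
As an illustration of handling irregular intervals, the sketch below resamples unevenly spaced readings onto a regular grid and fills the gaps by linear interpolation. It uses only the standard library; `resample_linear` and the sample `readings` are hypothetical, and linear interpolation is just one of several possible gap-filling choices.

```python
from datetime import datetime, timedelta

def resample_linear(points, step):
    """Resample (timestamp, value) pairs onto a regular grid.

    The grid runs from the first to the last timestamp in steps of `step`;
    values between observations are linearly interpolated.
    """
    if not points:
        return []
    points = sorted(points)
    out = []
    t, end = points[0][0], points[-1][0]
    i = 0
    while t <= end:
        # Advance to the last observation at or before the grid time t.
        while i + 1 < len(points) and points[i + 1][0] <= t:
            i += 1
        t0, v0 = points[i]
        if t0 == t or i + 1 == len(points):
            out.append((t, v0))
        else:
            t1, v1 = points[i + 1]
            frac = (t - t0) / (t1 - t0)  # timedelta / timedelta gives a float
            out.append((t, v0 + frac * (v1 - v0)))
        t += step
    return out

# Readings at 00:00, 00:10 and 00:40: the 00:20 and 00:30 slots are missing.
readings = [
    (datetime(2024, 1, 1, 0, 0), 10.0),
    (datetime(2024, 1, 1, 0, 10), 20.0),
    (datetime(2024, 1, 1, 0, 40), 50.0),
]
grid = resample_linear(readings, timedelta(minutes=10))
for stamp, value in grid:
    print(stamp.isoformat(), round(value, 2))
```
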
    • Feature Engineering for Time Series

      • Lag features and rolling window statistics
      • Fourier transforms for frequency domain analysis
      • Time-based aggregations and seasonality features
      • Resources:
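
Lag features and rolling window statistics, listed above, can be sketched without any external library. The helper names below are hypothetical. Positions that lack enough history are filled with `None` rather than a guessed value.

```python
def lag_feature(values, k):
    """Value observed k steps earlier; None where there is not enough history."""
    return [values[i - k] if i >= k else None for i in range(len(values))]

def rolling_mean(values, window):
    """Trailing mean over the last `window` values, including the current one."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        out.append(running / window if i >= window - 1 else None)
    return out

series = [1, 2, 3, 4, 5]
features = {
    "lag_1": lag_feature(series, 1),
    "lag_2": lag_feature(series, 2),
    "roll_mean_3": rolling_mean(series, 3),
}
for name, column in features.items():
    print(name, column)
```

When the target is the current value of the same series, a trailing window that includes the current point leaks the target. A common remedy is to compute the rolling statistic on the lagged series instead, for example `rolling_mean` applied to the non-`None` part of `lag_feature(series, 1)`.
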

Text Data Processing and NLP

  • What You Need to Know
    • Text Preprocessing Pipeline

      • Tokenization, stemming, and lemmatization
      • Stop word removal and text normalization
      • Handling different languages and encodings
      • Resources:
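
The preprocessing steps above can be sketched as a short pipeline: decode bytes, normalize Unicode and case, tokenize, and drop stop words. This example uses only the standard library, so stemming and lemmatization (which normally need a dedicated NLP library) are left out. The tiny `STOP_WORDS` set and the function names are assumptions for illustration.

```python
import re
import unicodedata

# Tiny illustrative stop word list; real lists are much longer and language-specific.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}
TOKEN_RE = re.compile(r"\w+")

def decode_bytes(raw):
    """Decode as UTF-8, falling back to Latin-1 (which accepts any byte sequence)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

def normalize(text):
    """Unify equivalent Unicode forms (NFKC) and fold case."""
    return unicodedata.normalize("NFKC", text).casefold()

def preprocess(text):
    """Normalize, tokenize on word characters, and remove stop words."""
    tokens = TOKEN_RE.findall(normalize(text))
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("The Caf\u00e9 is in the CENTER of town."))
```

The Latin-1 fallback never raises, so it can silently produce wrong characters for other encodings. When the source encoding is known, passing it explicitly is the safer choice.
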
    • Text Feature Extraction

      • Bag of Words and TF-IDF vectorization
      • N-grams and character-level features
      • Word embeddings (Word2Vec, GloVe, FastText)
      • Resources:
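
To show what TF-IDF and n-grams compute, here is a from-scratch sketch. It assumes one specific, unsmoothed variant: term frequency is the relative count within a document, and inverse document frequency is ln(N / df). Libraries often use smoothed or normalized variants, so the numbers here will not match them exactly. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Contiguous word n-grams, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = [["data", "pipeline"], ["data", "model"]]
vectors = tfidf(docs)
print(vectors[0])
print(ngrams(["clean", "the", "data"], 2))
```

A term that appears in every document, such as `data` here, gets weight zero under this variant, which is the intended effect: it carries no information for telling the documents apart.
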

Image Data Processing

  • What You Need to Know
    • Image Preprocessing Pipeline

    • Feature Extraction from Images

Data Version Control and Experiment Tracking

  • What You Need to Know
    • Data Versioning Systems

      • DVC for data and model versioning
      • Git-based workflows for data science
      • Reproducible data pipelines and lineage
      • Resources:
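
The core idea behind data versioning tools is to keep large files out of Git and commit a small pointer file that records a hash of the content. The sketch below illustrates that idea with the standard library only. It is not DVC's file format or API: the `.pointer.json` layout, the choice of SHA-256, and the helper names are all hypothetical.

```python
import hashlib
import json
import os
import tempfile

def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_pointer(data_path, pointer_path):
    """Record the data file's hash and size in a small, Git-friendly JSON file."""
    pointer = {
        "path": os.path.basename(data_path),
        "sha256": file_digest(data_path),
        "size": os.path.getsize(data_path),
    }
    with open(pointer_path, "w", encoding="utf-8") as handle:
        json.dump(pointer, handle, indent=2)
    return pointer

def data_changed(data_path, pointer_path):
    """True if the data no longer matches the recorded hash."""
    with open(pointer_path, encoding="utf-8") as handle:
        pointer = json.load(handle)
    return file_digest(data_path) != pointer["sha256"]

with tempfile.TemporaryDirectory() as workdir:
    data_path = os.path.join(workdir, "train.csv")
    pointer_path = data_path + ".pointer.json"
    with open(data_path, "w", encoding="utf-8") as handle:
        handle.write("x,y\n1,2\n")
    write_pointer(data_path, pointer_path)
    unchanged = data_changed(data_path, pointer_path)
    with open(data_path, "a", encoding="utf-8") as handle:
        handle.write("3,4\n")
    changed = data_changed(data_path, pointer_path)

print(unchanged, changed)
```

Committing the pointer file alongside the code ties each Git revision to an exact dataset state, which is what makes a pipeline run reproducible and its lineage traceable.
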
    • Experiment Management

      • MLflow for experiment tracking and model registry
      • Weights & Biases for advanced experiment management
      • Neptune for collaborative ML development
      • Resources:

Ready to Develop Models? Continue to Module 3: Model Development to master advanced model building, neural network architectures, and algorithm optimization.