Passion Data Science: Data Science

Data Science, by Djibril Chimère DIAW

Data Science is a comprehensive, structured academic work presenting the theoretical foundations, mathematical principles, computational techniques, and applied methodologies of contemporary data science. Designed as both a pedagogical reference and a long-term scholarly resource, the book integrates statistics, probability theory, machine learning, and large-scale data systems into a coherent intellectual framework.

The text begins with the mathematical and statistical foundations essential to scientific inference, including probability distributions, statistical estimation, hypothesis testing, regression analysis, and multivariate methods. These elements provide the rigorous basis necessary for understanding predictive modeling and quantitative decision-making.

Building upon these foundations, the book develops a systematic exposition of machine learning, covering supervised and unsupervised learning, model evaluation, feature engineering, and algorithmic optimization. Core techniques such as regression models, classification methods, clustering algorithms, ensemble approaches, and neural networks are examined from both theoretical and applied perspectives. Special attention is given to deep learning architectures and reinforcement learning as extensions of statistical learning theory.

The work further explores large-scale data environments and modern computational infrastructures, including big data ecosystems, distributed processing concepts, and data engineering principles. The relationship between data pipelines, storage architectures, scalability, and analytical performance is treated as an integral component of contemporary data science practice.

Specialized domains receive dedicated analytical treatment, including natural language processing, time series analysis, social network analysis, and business intelligence. The book emphasizes the translation of mathematical models into actionable insights, bridging theory and real-world decision systems.

Beyond technical instruction, the volume addresses issues of data governance, methodological rigor, reproducibility, and the ethical implications of algorithmic systems. It situates data science within a broader epistemological and societal context, recognizing its transformative role in economics, technology, public policy, and scientific research.

Its chapters build progressively yet remain self-contained. Data Science may serve multiple audiences: students in advanced secondary or university education, independent researchers, professionals seeking methodological consolidation, and readers interested in the scientific foundations of artificial intelligence and quantitative analysis.

This edition reflects an integrated vision of data science as a unified discipline grounded in mathematics, computation, and critical reasoning. It is conceived not merely as a technical manual, but as a long-term intellectual contribution to the scientific understanding of data-driven systems.

Edition information

First published: 23/03/2023
Current version: v1.0
Last updated: 23/03/2023

Author: Djibril Chimère DIAW
Original language: English
Digital publication: Archive.org
Collection: Data Science

https://archive.org/details/data-science_202602

Contents
Copyright    2
Author’s Note    3
About The Author    4
Dedication    6
To all mothers,    7
Data science    13
1. Statistics and Probability    14
1.1 Descriptive Statistics    15
1.1.1 Measures of central tendency    16
1.1.2 Measures of variability    17
1.1.3 Measures of distribution    18
1.1.4 Quartiles and percentiles    19
1.1.5 Frequency distributions    20
1.1.6 Graphical representation    21
1.1.7 Correlation and regression analysis    22
Correlation analysis    23
Regression analysis    24
1.2 Inferential Statistics    25
1.2.1 Estimation    26
1.2.2 Hypothesis testing    27
1.2.2.1 One-sample hypothesis tests    28
1.2.2.2 Two-sample hypothesis tests    29
1.2.2.3 Paired-sample hypothesis tests    30
1.2.2.4 Goodness-of-fit tests    31
1.2.3 Regression analysis    32
1.2.3.1 Simple linear regression    33
1.2.3.2 Multiple regression    34
1.2.3.3 Logistic regression    35
1.2.3.4 Nonlinear regression    36
1.2.4 Analysis of variance (ANOVA)    37
1.2.4.1 One-way ANOVA    38
1.2.4.2 Factorial ANOVA    39
1.2.4.3 Repeated measures ANOVA    40
1.2.4.4 Mixed ANOVA    41
1.2.5 Nonparametric statistics    42
1.3 Bayesian Statistics    44
1.4 Probability Theory    45
2. Data Exploration and Visualization    46
2.1 Exploratory Data Analysis    47
2.1.1 Dimensionality Reduction    48
2.1.2 Clustering Analysis    49
2.1.3 Correlation Analysis    50
2.1.3.1 Pearson correlation    51
2.1.3.2 Spearman correlation    52
2.1.3.3 Kendall correlation    53
2.1.4 Descriptive Statistics    54
2.1.5 Data Visualization    55
2.2 Data Cleaning    56
2.2.1 Removing Duplicates    57
2.2.2 Handling Missing Values    58
2.2.3 Handling Outliers    59
2.2.4 Standardizing Data    60
2.2.5 Correcting Typos and Inconsistencies    61
2.2.6 Handling Inconsistent Data    62
2.3 Data Wrangling    63
2.4 Data Visualization    64
2.5 Interactive Data Visualization    65
2.6 Geospatial Data Visualization    66
3. Machine Learning    67
3.1 Supervised Learning    68
3.1.1 Regression    69
3.1.1.1 Linear Regression    70
3.1.1.2 Polynomial Regression    71
3.1.1.3 Multiple Regression    72
3.1.1.4 Ridge Regression    73
3.1.1.5 Lasso Regression    74
3.1.1.6 Elastic Net Regression    75
3.1.2 Classification    76
3.1.2.1 Logistic Regression    77
3.1.2.2 k-Nearest Neighbors (k-NN)    78
3.1.2.3 Decision Trees    79
3.1.2.4 Random Forest    80
3.1.2.5 Support Vector Machines (SVM)    81
3.1.2.6 Naive Bayes    82
3.1.2.7 Neural Networks    83
3.2 Unsupervised Learning    84
3.2.1 Clustering    85
3.2.1.1 K-Means Clustering    86
3.2.1.2 Hierarchical Clustering    87
3.2.1.3 Density-Based Clustering    88
3.2.1.4 Expectation-Maximization (EM) Clustering    89
3.2.1.5 Spectral Clustering    90
3.2.1.6 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)    91
3.2.2 Dimensionality Reduction    92
3.2.2.1 Principal Component Analysis (PCA)    93
3.2.2.2 t-Distributed Stochastic Neighbor Embedding (t-SNE)    94
3.2.2.3 Non-negative Matrix Factorization (NMF)    95
3.2.2.4 Independent Component Analysis (ICA)    96
3.2.2.5 Factor Analysis    97
3.2.2.6 Autoencoder    99
3.2.3 Anomaly Detection    100
3.2.3.1 Clustering-based methods    101
3.2.3.2 Local Outlier Factor (LOF)    102
3.2.3.3 Isolation Forest    103
3.2.3.4 One-Class Support Vector Machine (OCSVM)    104
3.2.4 Association Rules Mining    105
3.2.4.1 Apriori algorithm    106
3.2.4.2 FP-Growth algorithm    107
3.2.4.3 ECLAT algorithm    108
3.2.5 Self-Organizing Maps (SOM)    109
3.3 Reinforcement Learning    110
3.3.1 Policy Gradient methods    111
3.3.1.1 Vanilla policy gradient (VPG)    112
3.3.1.2 Trust Region Policy Optimization (TRPO)    113
3.3.1.3 Proximal Policy Optimization (PPO)    114
3.3.1.4 Deep Deterministic Policy Gradient (DDPG)    115
3.3.1.5 Natural Policy Gradient (NPG)    116
3.3.2 Actor-Critic methods    117
3.3.2.1 Advantage Actor-Critic (A2C)    118
3.3.2.2 Asynchronous Advantage Actor-Critic (A3C)    119
3.3.3 Monte Carlo methods    120
3.3.3.1 First-Visit Monte Carlo    121
3.3.3.2 Every-Visit Monte Carlo    122
3.3.3.3 Monte Carlo Exploring Starts    123
3.3.3.4 Monte Carlo Tree Search    124
3.3.3.5 UCT (Upper Confidence Bound applied to Trees)    125
3.3.3.6 Monte Carlo Policy Evaluation    126
3.3.4 Temporal Difference methods    127
3.3.4.1 Deep Q-Networks (DQN)    128
3.3.4.2 Q-learning    129
3.3.4.3 SARSA (State-Action-Reward-State-Action)    130
3.3.4.4 Expected SARSA    131
3.3.4.5 TD(0)    132
3.3.4.6 TD(lambda)    133
3.3.5 Multi-Armed Bandit algorithms    134
3.3.5.1 Epsilon-Greedy Algorithm    135
3.3.5.2 Upper Confidence Bound (UCB)    136
3.3.5.3 Thompson Sampling    137
3.3.5.4 Gradient Bandit Algorithms    138
3.3.5.5 Pursuit Algorithms    139
3.3.5.5.1 Single-agent pursuit    140
3.3.5.5.2 Coordinated pursuit    141
3.3.5.5.3 Competitive pursuit    142
3.3.5.5.4 Multi-objective pursuit    143
3.3.5.6 Exp3 (Exponential Weighted Exploration and Exploitation)    144
3.3.6 Model-based methods    145
3.3.6.1 Dynamic Programming    146
3.3.6.2 Model Predictive Control    147
3.3.6.3 Cross-Entropy Method    148
3.3.7 Hierarchical Reinforcement Learning    149
3.3.7.1 Hierarchical Q-learning    150
3.3.7.2 MAXQ    151
3.4 Semi-Supervised Learning    152
3.4.1 Self-training    153
3.4.2 Co-training    154
3.4.3 Tri-training    155
3.4.4 Graph-based methods    156
3.4.4.1 Label Propagation    157
3.4.4.2 Graph Laplacian Regularization    158
3.4.4.3 Deep Graph Infomax    159
3.4.4.4 Graph Convolutional Networks (GCNs)    160
3.4.4.5 Random Walks    161
3.4.4.5.1 Discrete-time random walks    162
3.4.4.5.2 Continuous-time random walks    163
3.4.4.5.3 Markov Chain Monte Carlo (MCMC)    164
3.4.4.6 Graph neural networks (GNNs)    165
3.4.4.7 Spectral methods    166
3.4.4.8 Kernel methods    167
3.4.4.9 Graph-based clustering    168
3.5 Deep Learning    169
3.5.1 Convolutional Neural Networks (CNNs)    170
3.5.2 Recurrent Neural Networks (RNNs)    171
3.5.3 Generative Adversarial Networks (GANs)    172
3.5.4 Autoencoders    173
3.5.5 Long Short-Term Memory (LSTM) Networks    174
3.5.6 Deep Belief Networks (DBNs)    175
3.5.7 Deep Reinforcement Learning (DRL)    176
3.5.8 Transfer Learning    177
3.5.9 Batch Normalization    178
3.5.10 Dropout Regularization    179
3.5.11 Gradient Descent Optimization    180
3.5.12 Adam Optimization    181
3.5.13 Convolutional Autoencoders    182
3.5.14 Restricted Boltzmann Machines (RBMs)    183
3.5.15 Siamese Networks    184
3.5.16 Capsule Networks    185
4. Big Data Technologies    186
4.1 Hadoop    187
4.1.1 Hadoop Distributed File System (HDFS)    188
4.1.2 MapReduce    189
4.1.3 HBase    190
4.1.4 YARN (Yet Another Resource Negotiator)    191
4.2 Spark    192
Spark and Hadoop    193
4.2.1 Spark Core    194
4.2.2 Spark SQL    195
4.2.3 Spark Streaming    196
4.2.4 MLlib (machine learning library)    197
4.3 NoSQL Databases    198
4.3.1 Apache Cassandra    199
4.3.2 MongoDB    200
4.3.3 Redis    201
4.3.4 Neo4j    202
4.3.5 Apache Flink    203
4.3.6 Apache Pig    204
4.4 Apache Kafka    205
4.5 Cloud Computing    206
Cloud Computing and Big Data Technologies    207
4.6 Distributed Computing    209
Distributed Computing and Big Data Technologies    210
5. Natural Language Processing    211
5.1 Sentiment Analysis    212
5.2 Text Classification    213
5.3 Named Entity Recognition    214
5.4 Machine Translation    215
5.5 Speech Recognition    216
5.6 Topic Modeling    217
5.7 Text summarization    218
5.8 Language modeling    219
5.9 Tokenization    220
5.10 Part-of-speech tagging    221
6. Time Series Analysis    222
6.1 Trend analysis    223
6.2 Seasonality analysis    224
6.3 Autocorrelation analysis    225
6.4 Time series decomposition    226
6.5 Forecasting    227
6.5.1 Autoregressive Integrated Moving Average (ARIMA) models    228
6.5.2 Exponential smoothing methods    229
6.5.3 Seasonal decomposition methods    230
6.5.4 Machine learning algorithms    231
6.5.5 Deep learning models    232
7. Data Engineering    233
7.1 Data Warehousing    234
7.2 Data Integration    235
7.3 Data Pipeline Development    236
7.4 ETL (Extract, Transform, Load)    237
7.5 Data Modeling    238
7.6 Database Management    239
8. Business Intelligence    240
8.1 Dashboards and Scorecards    241
8.2 Reporting and Analytics    242
8.3 Data Mining    243
8.4 OLAP (Online Analytical Processing)    244
8.5 KPI (Key Performance Indicators)    245
8.6 Data Governance    246
9. Optimization and Decision Science    247
9.1 Linear Programming    248
9.2 Nonlinear Programming    249
9.3 Integer Programming    250
9.4 Multi-objective Optimization    251
9.5 Decision Analysis    252
9.6 Game Theory    253
10. Social Network Analysis    254
10.1 Community Detection    255
10.2 Centrality Measures    256
10.3 Link Prediction    257
10.4 Network Visualization    258
10.5 Graph Algorithms    259
Bibliography    260
Books    260


Tuesday, 10 March 2026
