Data Science
Author: Djibril Chimère DIAW
Data Science is a comprehensive and structured academic work that presents the theoretical foundations, mathematical principles, computational techniques, and applied methodologies that define contemporary data science. Designed as both a pedagogical reference and a long-term scholarly resource, the book integrates statistics, probability theory, machine learning, and large-scale data systems into a coherent intellectual framework.
The text begins with the mathematical and statistical foundations essential to scientific inference, including probability distributions, statistical estimation, hypothesis testing, regression analysis, and multivariate methods. These elements provide the rigorous basis necessary for understanding predictive modeling and quantitative decision-making.
Building upon these foundations, the book develops a systematic exposition of machine learning, covering supervised and unsupervised learning, model evaluation, feature engineering, and algorithmic optimization. Core techniques such as regression models, classification methods, clustering algorithms, ensemble approaches, and neural networks are examined from both theoretical and applied perspectives. Special attention is given to deep learning architectures and reinforcement learning as extensions of statistical learning theory.
The work further explores large-scale data environments and modern computational infrastructures, including big data ecosystems, distributed processing concepts, and data engineering principles. The relationship between data pipelines, storage architectures, scalability, and analytical performance is treated as an integral component of contemporary data science practice.
Specialized domains receive dedicated analytical treatment, including natural language processing, time series analysis, social network analysis, and business intelligence. The book emphasizes the translation of mathematical models into actionable insights, bridging theory and real-world decision systems.
Beyond technical instruction, the volume addresses issues of data governance, methodological rigor, reproducibility, and the ethical implications of algorithmic systems. It situates data science within a broader epistemological and societal context, recognizing its transformative role in economics, technology, public policy, and scientific research.
Structured progressively yet self-contained in its chapters, Data Science may serve multiple audiences: students in advanced secondary or university education, independent researchers, professionals seeking methodological consolidation, and readers interested in the scientific foundations of artificial intelligence and quantitative analysis.
This edition reflects an integrated vision of data science as a unified discipline grounded in mathematics, computation, and critical reasoning. It is conceived not merely as a technical manual, but as a long-term intellectual contribution to the scientific understanding of data-driven systems.
Edition information
First published: 23/03/2023
Current version: v1.0
Last updated: 23/03/2023
Author: Djibril Chimère DIAW
Original language: English
Digital publication: Archive.org
Collection: Data Science
https://archive.org/details/data-science_202602
Contents
Copyright 2
Author’s Note 3
About The Author 4
Dedication 6
To all mothers, 7
Data science 13
1. Statistics and Probability 14
1.1 Descriptive Statistics 15
1.1.1 Measures of central tendency 16
1.1.2 Measures of variability 17
1.1.3 Measures of distribution 18
1.1.4 Quartiles and percentiles 19
1.1.5 Frequency distributions 20
1.1.6 Graphical representation 21
1.1.7 Correlation and regression analysis 22
Correlation analysis 23
Regression analysis 24
1.2 Inferential Statistics 25
1.2.1 Estimation 26
1.2.2 Hypothesis testing 27
1.2.2.1 One-sample hypothesis tests 28
1.2.2.2 Two-sample hypothesis tests 29
1.2.2.3 Paired-sample hypothesis tests 30
1.2.2.4 Goodness-of-fit tests 31
1.2.3 Regression analysis 32
1.2.3.1 Simple linear regression 33
1.2.3.2 Multiple regression 34
1.2.3.3 Logistic regression 35
1.2.3.4 Nonlinear regression 36
1.2.4 Analysis of variance (ANOVA) 37
1.2.4.1 One-way ANOVA 38
1.2.4.2 Factorial ANOVA 39
1.2.4.3 Repeated measures ANOVA 40
1.2.4.4 Mixed ANOVA 41
1.2.5 Nonparametric statistics 42
1.3 Bayesian Statistics 44
1.4 Probability Theory 45
2. Data Exploration and Visualization 46
2.1 Exploratory Data Analysis 47
2.1.1 Dimensionality Reduction 48
2.1.2 Clustering Analysis 49
2.1.3 Correlation Analysis 50
2.1.3.1 Pearson correlation 51
2.1.3.2 Spearman correlation 52
2.1.3.3 Kendall correlation 53
2.1.4 Descriptive Statistics 54
2.1.5 Data Visualization 55
2.2 Data Cleaning 56
2.2.1 Removing Duplicates 57
2.2.2 Handling Missing Values 58
2.2.3 Handling Outliers 59
2.2.4 Standardizing Data 60
2.2.5 Correcting Typos and Inconsistencies 61
2.2.6 Handling Inconsistent Data 62
2.3 Data Wrangling 63
2.4 Data Visualization 64
2.5 Interactive Data Visualization 65
2.6 Geospatial Data Visualization 66
3. Machine Learning 67
3.1 Supervised Learning 68
3.1.1 Regression 69
3.1.1.1 Linear Regression 70
3.1.1.2 Polynomial Regression 71
3.1.1.3 Multiple Regression 72
3.1.1.4 Ridge Regression 73
3.1.1.5 Lasso Regression 74
3.1.1.6 Elastic Net Regression 75
3.1.2 Classification 76
3.1.2.1 Logistic Regression 77
3.1.2.2 k-Nearest Neighbors (k-NN) 78
3.1.2.3 Decision Trees 79
3.1.2.4 Random Forest 80
3.1.2.5 Support Vector Machines (SVM) 81
3.1.2.6 Naive Bayes 82
3.1.2.7 Neural Networks 83
3.2 Unsupervised Learning 84
3.2.1 Clustering 85
3.2.1.1 K-Means Clustering 86
3.2.1.2 Hierarchical Clustering 87
3.2.1.3 Density-Based Clustering 88
3.2.1.4 Expectation-Maximization (EM) Clustering 89
3.2.1.5 Spectral Clustering 90
3.2.1.6 DBSCAN (Density-Based Spatial Clustering of Applications with Noise) 91
3.2.2 Dimensionality Reduction 92
3.2.2.1 Principal Component Analysis (PCA) 93
3.2.2.2 t-Distributed Stochastic Neighbor Embedding (t-SNE) 94
3.2.2.3 Non-negative Matrix Factorization (NMF) 95
3.2.2.4 Independent Component Analysis (ICA) 96
3.2.2.5 Factor Analysis 97
3.2.2.6 Autoencoder 99
3.2.3 Anomaly Detection 100
3.2.3.1 Clustering-based methods 101
3.2.3.2 Local Outlier Factor (LOF) 102
3.2.3.3 Isolation Forest 103
3.2.3.4 One-Class Support Vector Machine (OCSVM) 104
3.2.4 Association Rules Mining 105
3.2.4.1 Apriori algorithm 106
3.2.4.2 FP-Growth algorithm 107
3.2.4.3 ECLAT algorithm 108
3.2.5 Self-Organizing Maps (SOM) 109
3.3 Reinforcement Learning 110
3.3.1 Policy Gradient methods 111
3.3.1.1 Vanilla policy gradient (VPG) 112
3.3.1.2 Trust Region Policy Optimization (TRPO) 113
3.3.1.3 Proximal Policy Optimization (PPO) 114
3.3.1.4 Deep Deterministic Policy Gradient (DDPG) 115
3.3.1.5 Natural Policy Gradient (NPG) 116
3.3.2 Actor-Critic methods 117
3.3.2.1 Advantage Actor-Critic (A2C) 118
3.3.2.2 Asynchronous Advantage Actor-Critic (A3C) 119
3.3.3 Monte Carlo methods 120
3.3.3.1 First-Visit Monte Carlo 121
3.3.3.2 Every-Visit Monte Carlo 122
3.3.3.3 Monte Carlo Exploring Starts 123
3.3.3.4 Monte Carlo Tree Search 124
3.3.3.5 UCT (Upper Confidence Bound applied to Trees) 125
3.3.3.6 Monte Carlo Policy Evaluation 126
3.3.4 Temporal Difference methods 127
3.3.4.1 Deep Q-Networks (DQN) 128
3.3.4.2 Q-learning 129
3.3.4.3 SARSA (State-Action-Reward-State-Action) 130
3.3.4.4 Expected SARSA 131
3.3.4.5 TD(0) 132
3.3.4.6 TD(lambda) 133
3.3.5 Multi-Armed Bandit algorithms 134
3.3.5.1 Epsilon-Greedy Algorithm 135
3.3.5.2 Upper Confidence Bound (UCB) 136
3.3.5.3 Thompson Sampling 137
3.3.5.4 Gradient Bandit Algorithms 138
3.3.5.5 Pursuit Algorithms 139
3.3.5.5.1 Single-agent pursuit 140
3.3.5.5.2 Coordinated pursuit 141
3.3.5.5.3 Competitive pursuit 142
3.3.5.5.4 Multi-objective pursuit 143
3.3.5.6 Exp3 (Exponential Weighted Exploration and Exploitation) 144
3.3.6 Model-based methods 145
3.3.6.1 Dynamic Programming 146
3.3.6.2 Model Predictive Control 147
3.3.6.3 Cross-Entropy Method 148
3.3.7 Hierarchical Reinforcement Learning 149
3.3.7.1 Hierarchical Q-learning 150
3.3.7.2 MAXQ 151
3.4 Semi-Supervised Learning 152
3.4.1 Self-training 153
3.4.2 Co-training 154
3.4.3 Tri-training 155
3.4.4 Graph-based methods 156
3.4.4.1 Label Propagation 157
3.4.4.2 Graph Laplacian Regularization 158
3.4.4.3 Deep Graph Infomax 159
3.4.4.4 Graph Convolutional Networks (GCNs) 160
3.4.4.5 Random Walks 161
3.4.4.5.1 Discrete-time random walks 162
3.4.4.5.2 Continuous-time random walks 163
3.4.4.5.3 Markov Chain Monte Carlo (MCMC) 164
3.4.4.6 Graph neural networks (GNNs) 165
3.4.4.7 Spectral methods 166
3.4.4.8 Kernel methods 167
3.4.4.9 Graph-based clustering 168
3.5 Deep Learning 169
3.5.1 Convolutional Neural Networks (CNNs) 170
3.5.2 Recurrent Neural Networks (RNNs) 171
3.5.3 Generative Adversarial Networks (GANs) 172
3.5.4 Autoencoders 173
3.5.5 Long Short-Term Memory (LSTM) Networks 174
3.5.6 Deep Belief Networks (DBNs) 175
3.5.7 Deep Reinforcement Learning (DRL) 176
3.5.8 Transfer Learning 177
3.5.9 Batch Normalization 178
3.5.10 Dropout Regularization 179
3.5.11 Gradient Descent Optimization 180
3.5.12 Adam Optimization 181
3.5.13 Convolutional Autoencoders 182
3.5.14 Restricted Boltzmann Machines (RBMs) 183
3.5.15 Siamese Networks 184
3.5.16 Capsule Networks 185
4. Big Data Technologies 186
4.1 Hadoop 187
4.1.1 Hadoop Distributed File System (HDFS) 188
4.1.2 MapReduce 189
4.1.3 HBase 190
4.1.4 YARN (Yet Another Resource Negotiator) 191
4.2 Spark 192
Spark and Hadoop 193
4.2.1 Spark Core 194
4.2.2 Spark SQL 195
4.2.3 Spark Streaming 196
4.2.4 MLlib (machine learning library) 197
4.3 NoSQL Databases 198
4.3.1 Apache Cassandra 199
4.3.2 MongoDB 200
4.3.3 Redis 201
4.3.4 Neo4j 202
4.3.5 Apache Flink 203
4.3.6 Apache Pig 204
4.4 Apache Kafka 205
4.5 Cloud Computing 206
Cloud Computing and Big Data Technologies 207
4.6 Distributed Computing 209
Distributed Computing and Big Data Technologies 210
5. Natural Language Processing 211
5.1 Sentiment Analysis 212
5.2 Text Classification 213
5.3 Named Entity Recognition 214
5.4 Machine Translation 215
5.5 Speech Recognition 216
5.6 Topic Modeling 217
5.7 Text summarization 218
5.8 Language modeling 219
5.9 Tokenization 220
5.10 Part-of-speech tagging 221
6. Time Series Analysis 222
6.1 Trend analysis 223
6.2 Seasonality analysis 224
6.3 Autocorrelation analysis 225
6.4 Time series decomposition 226
6.5 Forecasting 227
6.5.1 Autoregressive Integrated Moving Average (ARIMA) models 228
6.5.2 Exponential smoothing methods 229
6.5.3 Seasonal decomposition methods 230
6.5.4 Machine learning algorithms 231
6.5.5 Deep learning models 232
7. Data Engineering 233
7.1 Data Warehousing 234
7.2 Data Integration 235
7.3 Data Pipeline Development 236
7.4 ETL (Extract, Transform, Load) 237
7.5 Data Modeling 238
7.6 Database Management 239
8. Business Intelligence 240
8.1 Dashboards and Scorecards 241
8.2 Reporting and Analytics 242
8.3 Data Mining 243
8.4 OLAP (Online Analytical Processing) 244
8.5 KPI (Key Performance Indicators) 245
8.6 Data Governance 246
9. Optimization and Decision Science 247
9.1 Linear Programming 248
9.2 Nonlinear Programming 249
9.3 Integer Programming 250
9.4 Multi-objective Optimization 251
9.5 Decision Analysis 252
9.6 Game Theory 253
10. Social Network Analysis 254
10.1 Community Detection 255
10.2 Centrality Measures 256
10.3 Link Prediction 257
10.4 Network Visualization 258
10.5 Graph Algorithms 259
Bibliography 260
Books 260
Passion Data Science
Tuesday, 10 March 2026