Regression is one of the fundamental building blocks of data analysis: a powerful statistical method that mathematically models the relationship between variables. It is an essential tool for anyone who needs to understand, estimate, and model how variables relate to one another. Long established in statistics, regression has also become a fundamental element of artificial intelligence and machine learning. Regression analysis is applied across a wide range of areas, from sales forecasts, climate change models, and financial statements to campaign optimization by marketing professionals.
Regression analysis is a statistical method that describes the relationship between a dependent variable and one or more independent variables. Its main objective is to estimate the value of the dependent variable from the values of the independent variables. It allows us to understand the relationship between variables in the data set by expressing this relationship as a mathematical model.
The term regression was first used in the 19th century by Sir Francis Galton. Galton observed that the children of tall parents also tended to be tall, but were on average closer to the population mean than their parents. He called this tendency "regression toward the mean," and the term "regression" has been used in this context ever since.
The regression analysis we use today was developed with contributions from statisticians such as Karl Pearson and Udny Yule and became an important part of statistical science in the early 20th century. The development of modern computing made it possible to fit complex regression models quickly, which contributed to the popularization of regression analysis.
There are several types of regression analysis suited to different data types and relationship structures. The type is chosen according to the structure of the data to be analyzed and the nature of the relationship between the dependent and independent variables. The most commonly used types of regression are:
Simple linear regression is the most basic type of regression that models the linear relationship between a single independent variable (X) and a dependent variable (Y). Mathematically it is expressed as:
Y = β₀ + β₁X + ε
Here:
- Y is the dependent variable,
- X is the independent variable,
- β₀ is the intercept (the value of Y when X = 0),
- β₁ is the slope coefficient,
- ε is the error term.
Simple linear regression can be used, for example, to model the relationship between the price of a product and the quantity of sales, or to analyze the relationship between hours worked and the quantity of production.
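As a minimal sketch, the closed-form least-squares formulas for simple linear regression can be written in a few lines of plain Python. The hours-worked/production example below uses invented numbers chosen to be exactly linear, so the fitted coefficients are easy to check by hand.

```python
# Illustrative sketch: fitting Y = b0 + b1*X by ordinary least squares
# using the closed-form formulas. The data values are made up.
def simple_linear_regression(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of (x, y) divided by the variance of x
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
         / sum((xi - mean_x) ** 2 for xi in x)
    b0 = mean_y - b1 * mean_x  # the fitted line passes through the means
    return b0, b1

# Hypothetical data: hours worked vs. units produced (units = 1 + 2*hours)
hours = [1, 2, 3, 4, 5]
units = [3, 5, 7, 9, 11]
b0, b1 = simple_linear_regression(hours, units)
print(b0, b1)  # -> 1.0 2.0
```

Because the toy data lie exactly on a line, the residuals are zero and the recovered intercept and slope match the generating values.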
Multiple linear regression is a type of regression that models the effect of multiple independent variables on the dependent variable. Its formula is as follows:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε
Here:
- Y is the dependent variable,
- X₁, X₂, ..., Xₙ are the independent variables,
- β₀ is the intercept,
- β₁, β₂, ..., βₙ are the coefficients of the independent variables,
- ε is the error term.
Multiple linear regression is used to model more complex relationships. For example, to estimate the price of a house, one can take into account multiple factors such as location, size, number of rooms, and age.
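The house-price idea above can be sketched with NumPy's least-squares solver. The size/rooms/price numbers are invented and generated from an exact linear relation, so the estimated coefficients should recover the generating values.

```python
import numpy as np

# Sketch of multiple linear regression via least squares (np.linalg.lstsq).
# The house-price numbers below are invented purely for illustration.
size  = np.array([50.0, 80.0, 100.0, 120.0])   # m^2
rooms = np.array([1.0, 2.0, 3.0, 4.0])
price = 10.0 + 2.0 * size + 5.0 * rooms        # exact linear relation

X = np.column_stack([np.ones_like(size), size, rooms])  # intercept column first
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
b0, b1, b2 = coef
print(b0, b1, b2)  # close to 10.0, 2.0, 5.0
```

Prepending a column of ones is the standard trick that lets the intercept β₀ be estimated by the same least-squares solve as the other coefficients.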
Polynomial regression is a type of regression used when the relationship between the independent and dependent variables is not linear. It includes different powers of the independent variable:
Y = β₀ + β₁X + β₂X² + β₃X³ + ... + βₙXⁿ + ε
Polynomial regression is effective in modeling nonlinear complex relationships. For example, it can be used to model sales trends in the life cycle of a product or the effect of temperature on plant growth.
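A useful point worth making concrete: polynomial regression is still linear least squares, fitted on powers of X. The sketch below uses synthetic, noise-free data generated from a known quadratic, so the fitted coefficients are easy to verify.

```python
import numpy as np

# Sketch: quadratic polynomial regression on synthetic data.
# The data are generated from y = 1 + 2x + 3x^2 with no noise.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

coef = np.polyfit(x, y, deg=2)  # returns [b2, b1, b0], highest power first
print(coef)  # close to [3.0, 2.0, 1.0]
```

With real, noisy data, the degree of the polynomial is a modeling choice; too high a degree invites the overfitting problem discussed later in this article.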
Logistic regression, despite its name, is actually a classification method and is used when the dependent variable is categorical. Although its most common use is to predict one of two outcome categories (success/failure, yes/no, 1/0), it can also be extended to multi-category classification.
Logistic regression uses the logistic (sigmoid) function, which maps any real-valued input to a probability between 0 and 1:
P(Y = 1) = 1 / (1 + e^(-(β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ)))
Logistic regression is widely used in areas such as credit risk assessment, medical diagnosis, and predicting customer behavior.
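As an illustrative sketch (not a production implementation), the sigmoid formula above can be combined with simple gradient descent on the log-loss to fit a tiny two-class problem. The 1-D data and the learning rate here are invented and untuned.

```python
import math

# Minimal logistic-regression sketch: gradient descent on toy 1-D data.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [-2.0, -1.0, 1.0, 2.0]   # feature values
y = [0, 0, 1, 1]             # binary labels

b0, b1 = 0.0, 0.0
for _ in range(2000):        # stochastic gradient ascent on the log-likelihood
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        b0 += 0.1 * (yi - p)
        b1 += 0.1 * (yi - p) * xi

print(sigmoid(b0 + b1 * 2.0) > 0.5)   # class-1 side of the boundary -> True
print(sigmoid(b0 + b1 * -2.0) < 0.5)  # class-0 side -> True
```

Unlike linear regression, logistic regression has no closed-form solution, which is why iterative optimization (gradient descent here; Newton-type methods in statistical software) is used.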
Regression analysis works using an optimization technique called the "least squares method". This method aims to find the parameters that minimize the sum of the squared differences (residuals) between the actual values and the estimated values.
The basic steps of regression analysis are:
1. Define the problem and select the dependent and independent variables.
2. Collect and prepare the data.
3. Choose an appropriate type of regression model.
4. Estimate the model parameters (for example, by least squares).
5. Evaluate the model's fit and check its assumptions.
6. Use the model for interpretation and prediction.
According to IDC's 2023 report, regression analysis is used as a basic modeling technique in 68% of data science projects. This shows that regression analysis is still one of the most reliable and widely used analytical methods today (IDC Worldwide Data & Analytics Survey, 2023).
Various metrics are used to evaluate the performance of a regression model. These metrics help measure how well the model fits the data set and how strong its predictive power is.
The R-square (R²) is a metric that shows how much of the variation in the dependent variable is explained by the independent variables. It takes values between 0 and 1:
Formula: R² = 1 - (SS_res / SS_tot)
Here:
- SS_res is the residual sum of squares (the sum of squared differences between actual and predicted values),
- SS_tot is the total sum of squares (the sum of squared differences between actual values and their mean).
The R² value is commonly used to assess the overall fit of the model, but it is not sufficient on its own and must be evaluated in conjunction with other metrics.
The adjusted R-square is a variation of the standard R-square metric that takes into account the number of independent variables in the model. This is especially important in multiple regression models, because the standard R-square value tends to increase whenever a variable is added to the model.
Formula: Adjusted R² = 1 - [(1 - R²) × (n - 1) / (n - p - 1)]
Here:
- n is the number of observations,
- p is the number of independent variables.
The adjusted R-square penalizes the addition of unnecessary variables to the model and is therefore a more suitable metric for model comparison.
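The two formulas above can be computed directly from a list of actual and predicted values. The numbers below are invented for illustration; with n = 5 observations and p = 1 predictor, the adjusted R² comes out slightly below the plain R², as expected.

```python
# Sketch: computing R-square and adjusted R-square from toy predictions.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.1, 7.0, 9.2, 10.9]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)

r2 = 1 - ss_res / ss_tot                       # R² = 1 - SS_res / SS_tot
n, p = len(y_true), 1                           # observations, predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # adjusted for model size
print(round(r2, 4), round(adj_r2, 4))
```

Note that adj_r2 < r2 whenever p > 0, which is exactly the penalty for model complexity described above.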
Mean Squared Error (MSE) is the average of squares of the differences between predicted values and actual values. The lower the MSE, the better the model performs.
Formula: MSE = (1/n) × Σ(yᵢ - ŷᵢ)²
Here:
- n is the number of observations,
- yᵢ is the actual value,
- ŷᵢ is the predicted value.
MSE penalizes large errors more heavily, because the errors are squared.
Root Mean Squared Error (RMSE) is the square root of the MSE and expresses the estimation error on the scale of the original dependent variable, which makes the results easier to interpret.
Formula: RMSE = √MSE = √[(1/n) × Σ(yᵢ - ŷᵢ)²]
RMSE is a widely used metric for assessing model performance and is often preferred in model comparisons.
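The MSE and RMSE formulas above are straightforward to compute; the toy errors below are all ±0.5, so the expected values can be checked mentally.

```python
import math

# Sketch: MSE and RMSE from toy actual/predicted pairs (every error is 0.5).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]

mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(mse)
print(mse, rmse)  # -> 0.25 0.5
```

RMSE being on the same scale as the dependent variable is the practical reason it is often reported instead of MSE: "the model is off by about 0.5 units on average" is immediately interpretable.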
Although regression analysis is a powerful tool, there are some important points to consider for its correct application and interpretation:
1. Linearity Assumption: Linear regression models assume a linear relationship between dependent and independent variables. When this assumption is not met, nonlinear regression models should be considered.
2. Independence: Observations must be independent of each other. For example, in time series data, independence is often violated and special methods are required.
3. Homoscedasticity: The variance of the error terms must be constant. When this assumption is violated, alternative methods such as weighted least squares can be used.
4. Normality: The error terms must be normally distributed. This assumption is usually satisfied in large samples thanks to the central limit theorem.
5. Multicollinearity: Occurs when there is high correlation between the independent variables, making the prediction of model parameters difficult. Metrics such as the variance inflation factor (VIF) can be used to detect this.
6. Outliers: Outliers can significantly affect the regression model. Therefore, it must be detected before the analysis and handled appropriately.
7. Variable Selection: Which variables to include in the model is a critical decision. Criteria such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) can help with optimal model selection.
8. Overfitting: A very complex model can adapt perfectly to training data but perform poorly on new data. Techniques such as cross-validation help reduce this problem.
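The multicollinearity check from point 5 can be sketched with the variance inflation factor: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The data below are synthetic, with x2 deliberately constructed to be nearly a multiple of x1 so that both VIF values are large.

```python
import numpy as np

# Sketch: variance inflation factor (VIF) for detecting multicollinearity.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
X = np.column_stack([x1, x2])

def vif(X, j):
    y = X[:, j]                                  # regress column j ...
    others = np.delete(X, j, axis=1)             # ... on the other columns
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

print([round(vif(X, j), 1) for j in range(X.shape[1])])  # both well above 10
```

A common rule of thumb treats VIF values above 5-10 as a sign of problematic multicollinearity; here both predictors far exceed that threshold by construction.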
Regression analysis has a wide range of uses, from scientific research to commercial applications, and can be applied to both simple and complex data sets. Integrate regression methods into your business processes with Komtas data analysis and modeling solutions. You can always contact us for more information!