Linearna regresija¶

ZADATAK: Kreirati model linearne regresije koji predviđa vidljivost u km na osnovu dostupnih podataka o drugim parametrima vremenske prognoze za grad Segedin. Podaci u bazi prikupljani su u periodu 2006-2016.

Napomena: Za sva postavljena pitanja, napisati odgovore u novim tekstualnim ćelijama ispod poziva komandi.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn import datasets
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Upoznavanje baze podataka¶

In [2]:
# preuzimanje baze
!gdown 1NpFP8jxoW-CwL4v73wigrg7hI7DfK3nN
Downloading...
From: https://drive.google.com/uc?id=1NpFP8jxoW-CwL4v73wigrg7hI7DfK3nN
To: /content/weatherHistory.csv
100% 16.3M/16.3M [00:00<00:00, 70.7MB/s]

Učitati bazu podataka u Pandas DataFrame i prikazati prvih 5 uzoraka.

In [3]:
# weatherHistory.csv
df = pd.read_csv('weatherHistory.csv')
df.head()
Out[3]:
Formatted Date Summary Precip Type Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Wind Bearing (degrees) Visibility (km) Loud Cover Pressure (millibars) Daily Summary
0 2006-04-01 00:00:00.000 +0200 Partly Cloudy rain 9.472222 7.388889 0.89 14.1197 251.0 15.8263 0.0 1015.13 Partly cloudy throughout the day.
1 2006-04-01 01:00:00.000 +0200 Partly Cloudy rain 9.355556 7.227778 0.86 14.2646 259.0 15.8263 0.0 1015.63 Partly cloudy throughout the day.
2 2006-04-01 02:00:00.000 +0200 Mostly Cloudy rain 9.377778 9.377778 0.89 3.9284 204.0 14.9569 0.0 1015.94 Partly cloudy throughout the day.
3 2006-04-01 03:00:00.000 +0200 Partly Cloudy rain 8.288889 5.944444 0.83 14.1036 269.0 15.8263 0.0 1016.41 Partly cloudy throughout the day.
4 2006-04-01 04:00:00.000 +0200 Mostly Cloudy rain 8.755556 6.977778 0.83 11.0446 259.0 15.8263 0.0 1016.51 Partly cloudy throughout the day.
  • Koliko ima uzoraka?
  • Koliko ima obeležja?
  • Kog su tipa obeležja?
In [4]:
print(df.shape)  ##df.shape[0] je broj uzoraka a df.shape[1] je broj obelezja
print(df.dtypes)  ## su tipovi obelezja, NOTE: moras sam reci da je NPR: fromatted date datum a ne tip 'object'
(96453, 12)
Formatted Date               object
Summary                      object
Precip Type                  object
Temperature (C)             float64
Apparent Temperature (C)    float64
Humidity                    float64
Wind Speed (km/h)           float64
Wind Bearing (degrees)      float64
Visibility (km)             float64
Loud Cover                  float64
Pressure (millibars)        float64
Daily Summary                object
dtype: object

Analizirati kategorička obeležja.

  • Koje su im jedinstvene vrednosti?
In [5]:
## kategoricka obelezja su  'Summary', 'Precip Type','Daily Summary'
df['Summary'].unique()
Out[5]:
array(['Partly Cloudy', 'Mostly Cloudy', 'Overcast', 'Foggy',
       'Breezy and Mostly Cloudy', 'Clear', 'Breezy and Partly Cloudy',
       'Breezy and Overcast', 'Humid and Mostly Cloudy',
       'Humid and Partly Cloudy', 'Windy and Foggy', 'Windy and Overcast',
       'Breezy and Foggy', 'Windy and Partly Cloudy', 'Breezy',
       'Dry and Partly Cloudy', 'Windy and Mostly Cloudy',
       'Dangerously Windy and Partly Cloudy', 'Dry', 'Windy',
       'Humid and Overcast', 'Light Rain', 'Drizzle', 'Windy and Dry',
       'Dry and Mostly Cloudy', 'Breezy and Dry', 'Rain'], dtype=object)
In [6]:
df['Daily Summary'].unique()  ## 214 unique values sa jako puno teksta, summary type nam vise znaci
Out[6]:
array(['Partly cloudy throughout the day.',
       'Mostly cloudy throughout the day.', 'Foggy in the evening.',
       'Foggy overnight and breezy in the morning.',
       'Overcast throughout the day.', 'Partly cloudy until night.',
       'Mostly cloudy until night.',
       'Foggy starting overnight continuing until morning.',
       'Foggy in the morning.', 'Partly cloudy until evening.',
       'Partly cloudy starting in the morning.',
       'Mostly cloudy starting overnight continuing until night.',
       'Mostly cloudy until evening.',
       'Partly cloudy starting in the morning continuing until evening.',
       'Partly cloudy starting in the afternoon.',
       'Partly cloudy starting overnight.',
       'Partly cloudy until morning.',
       'Partly cloudy starting overnight continuing until night.',
       'Partly cloudy starting in the afternoon continuing until night.',
       'Mostly cloudy starting overnight.',
       'Partly cloudy until afternoon.',
       'Mostly cloudy until night and breezy in the afternoon.',
       'Foggy starting in the evening.', 'Foggy throughout the day.',
       'Foggy starting in the evening continuing until night.',
       'Mostly cloudy until morning.',
       'Foggy starting in the morning continuing until evening.',
       'Foggy starting overnight continuing until afternoon.',
       'Partly cloudy starting in the morning continuing until afternoon.',
       'Foggy starting overnight.', 'Foggy until morning.',
       'Foggy starting overnight continuing until evening.',
       'Foggy starting in the afternoon.',
       'Partly cloudy starting overnight continuing until afternoon.',
       'Partly cloudy starting in the morning continuing until night.',
       'Overcast until night.',
       'Mostly cloudy starting overnight continuing until evening.',
       'Foggy overnight.', 'Partly cloudy in the morning.',
       'Mostly cloudy starting in the morning.',
       'Foggy starting in the afternoon continuing until evening.',
       'Mostly cloudy until afternoon.',
       'Foggy starting overnight continuing until night.',
       'Mostly cloudy throughout the day and breezy in the evening.',
       'Foggy starting in the morning continuing until afternoon.',
       'Partly cloudy in the afternoon.', 'Clear throughout the day.',
       'Partly cloudy starting in the afternoon continuing until evening.',
       'Partly cloudy overnight.', 'Overcast until evening.',
       'Foggy in the morning and breezy starting in the afternoon continuing until night.',
       'Breezy starting overnight continuing until afternoon and foggy starting in the morning continuing until evening.',
       'Partly cloudy starting overnight continuing until morning.',
       'Mostly cloudy throughout the day and breezy in the afternoon.',
       'Mostly cloudy starting overnight and breezy in the afternoon.',
       'Partly cloudy throughout the day and breezy starting in the morning continuing until night.',
       'Mostly cloudy throughout the day and breezy in the morning.',
       'Partly cloudy starting in the evening continuing until night.',
       'Mostly cloudy until night and breezy starting in the morning continuing until afternoon.',
       'Partly cloudy starting in the morning continuing until evening and breezy starting in the morning continuing until afternoon.',
       'Partly cloudy throughout the day and breezy starting in the morning continuing until afternoon.',
       'Partly cloudy throughout the day and breezy starting in the morning continuing until evening.',
       'Foggy until afternoon.',
       'Overcast until night and breezy overnight.',
       'Breezy until morning and mostly cloudy throughout the day.',
       'Mostly cloudy starting in the morning continuing until night.',
       'Breezy starting overnight continuing until morning and partly cloudy starting overnight continuing until evening.',
       'Partly cloudy in the evening.',
       'Mostly cloudy starting overnight continuing until afternoon.',
       'Mostly cloudy starting in the morning continuing until afternoon.',
       'Mostly cloudy starting in the afternoon.',
       'Mostly cloudy starting in the morning continuing until evening.',
       'Partly cloudy starting overnight continuing until afternoon and breezy in the afternoon.',
       'Partly cloudy starting overnight and breezy in the afternoon.',
       'Mostly cloudy starting in the morning and breezy in the evening.',
       'Foggy starting in the afternoon continuing until night.',
       'Foggy until night.',
       'Foggy starting in the morning continuing until night.',
       'Foggy until evening.', 'Foggy starting in the morning.',
       'Partly cloudy starting overnight continuing until evening.',
       'Partly cloudy starting overnight continuing until evening and breezy starting in the morning continuing until evening.',
       'Breezy starting overnight continuing until morning and foggy in the evening.',
       'Mostly cloudy throughout the day and breezy starting in the morning continuing until evening.',
       'Partly cloudy until evening and breezy starting in the morning continuing until afternoon.',
       'Mostly cloudy starting in the afternoon continuing until night.',
       'Breezy starting overnight continuing until afternoon and mostly cloudy starting overnight continuing until evening.',
       'Mostly cloudy throughout the day and windy starting in the morning continuing until evening.',
       'Breezy and partly cloudy in the afternoon.',
       'Mostly cloudy starting overnight and breezy starting in the morning continuing until afternoon.',
       'Partly cloudy until night and breezy starting in the morning continuing until afternoon.',
       'Breezy and mostly cloudy overnight.',
       'Mostly cloudy throughout the day and breezy overnight.',
       'Mostly cloudy throughout the day and breezy starting in the morning continuing until afternoon.',
       'Partly cloudy throughout the day and breezy in the morning.',
       'Partly cloudy starting in the morning continuing until evening and breezy starting in the afternoon continuing until evening.',
       'Partly cloudy throughout the day and breezy starting in the afternoon continuing until evening.',
       'Mostly cloudy starting overnight and breezy in the morning.',
       'Partly cloudy starting in the afternoon and breezy in the afternoon.',
       'Partly cloudy starting in the morning and breezy in the evening.',
       'Partly cloudy until evening and breezy in the morning.',
       'Partly cloudy starting overnight continuing until evening and breezy starting in the morning continuing until afternoon.',
       'Partly cloudy starting overnight continuing until evening and breezy in the evening.',
       'Mostly cloudy throughout the day and breezy starting in the evening.',
       'Mostly cloudy throughout the day and windy starting in the morning continuing until night.',
       'Breezy starting overnight continuing until morning and partly cloudy starting in the morning.',
       'Mostly cloudy starting in the morning and breezy overnight.',
       'Overcast throughout the day and breezy starting overnight continuing until morning.',
       'Partly cloudy throughout the day and breezy in the evening.',
       'Mostly cloudy until evening and breezy starting in the morning continuing until afternoon.',
       'Mostly cloudy until night and breezy in the evening.',
       'Partly cloudy starting in the evening.',
       'Overcast starting in the morning.',
       'Mostly cloudy starting overnight continuing until evening and breezy starting overnight continuing until morning.',
       'Partly cloudy starting overnight continuing until morning and breezy starting in the morning continuing until afternoon.',
       'Partly cloudy until evening and breezy starting in the morning continuing until evening.',
       'Breezy starting in the morning continuing until afternoon and partly cloudy starting in the morning.',
       'Partly cloudy starting in the morning and breezy starting in the afternoon continuing until evening.',
       'Mostly cloudy starting overnight continuing until morning.',
       'Mostly cloudy throughout the day and breezy starting overnight continuing until afternoon.',
       'Breezy starting overnight continuing until morning and foggy overnight.',
       'Mostly cloudy throughout the day and breezy starting overnight continuing until morning.',
       'Overcast throughout the day and breezy in the morning.',
       'Overcast throughout the day and breezy in the evening.',
       'Mostly cloudy starting in the morning continuing until night and breezy in the afternoon.',
       'Mostly cloudy until night and breezy starting in the evening continuing until night.',
       'Partly cloudy until night and breezy in the morning.',
       'Partly cloudy until evening and breezy overnight.',
       'Partly cloudy starting overnight continuing until night and windy starting in the morning continuing until afternoon.',
       'Breezy starting in the morning continuing until afternoon and mostly cloudy starting in the morning.',
       'Foggy starting overnight continuing until morning and breezy starting in the evening.',
       'Mostly cloudy until night and breezy starting in the afternoon.',
       'Foggy in the afternoon.',
       'Mostly cloudy until night and breezy starting in the afternoon continuing until night.',
       'Foggy starting overnight continuing until morning and breezy starting in the evening continuing until night.',
       'Breezy until afternoon and mostly cloudy throughout the day.',
       'Mostly cloudy throughout the day and breezy starting in the morning continuing until night.',
       'Partly cloudy starting overnight continuing until evening and breezy in the morning.',
       'Mostly cloudy starting in the morning and breezy in the afternoon.',
       'Mostly cloudy starting overnight continuing until night and breezy starting in the morning continuing until evening.',
       'Foggy starting overnight continuing until morning and breezy starting in the morning continuing until afternoon.',
       'Mostly cloudy until evening and windy starting in the morning continuing until afternoon.',
       'Foggy starting overnight continuing until afternoon and breezy in the morning.',
       'Foggy starting in the morning continuing until afternoon and breezy starting in the evening.',
       'Partly cloudy starting overnight and breezy starting in the morning continuing until afternoon.',
       'Foggy starting overnight continuing until morning and breezy in the afternoon.',
       'Mostly cloudy starting overnight and breezy starting in the afternoon continuing until evening.',
       'Overcast throughout the day and breezy starting overnight continuing until afternoon.',
       'Partly cloudy starting in the morning continuing until evening and breezy in the afternoon.',
       'Partly cloudy starting in the morning continuing until night and breezy starting in the afternoon continuing until evening.',
       'Mostly cloudy until night and breezy starting in the evening.',
       'Breezy in the morning and mostly cloudy starting in the morning.',
       'Mostly cloudy until night and breezy starting in the morning continuing until evening.',
       'Partly cloudy starting overnight continuing until evening and windy starting in the morning continuing until evening.',
       'Breezy in the morning and partly cloudy starting in the evening continuing until night.',
       'Partly cloudy overnight and breezy starting in the morning continuing until afternoon.',
       'Light rain in the morning.', 'Light rain until morning.',
       'Light rain in the morning and afternoon.',
       'Partly cloudy starting in the morning continuing until night and breezy starting in the morning continuing until afternoon.',
       'Breezy starting in the afternoon continuing until night and mostly cloudy starting in the evening.',
       'Mostly cloudy throughout the day and breezy starting in the evening continuing until night.',
       'Foggy starting in the afternoon and breezy starting in the afternoon continuing until evening.',
       'Breezy and foggy until morning.',
       'Mostly cloudy until night and breezy starting overnight continuing until morning.',
       'Partly cloudy starting overnight continuing until night and breezy in the morning.',
       'Partly cloudy starting overnight continuing until night and breezy in the afternoon.',
       'Mostly cloudy starting in the morning and breezy starting in the afternoon continuing until evening.',
       'Partly cloudy starting overnight and breezy starting in the evening.',
       'Breezy overnight and overcast throughout the day.',
       'Partly cloudy until night and breezy in the afternoon.',
       'Mostly cloudy starting overnight and breezy starting in the evening.',
       'Breezy overnight and partly cloudy until evening.',
       'Mostly cloudy starting in the evening.',
       'Mostly cloudy throughout the day and breezy starting in the afternoon.',
       'Mostly cloudy throughout the day and breezy starting in the afternoon continuing until evening.',
       'Mostly cloudy until night and windy starting in the morning continuing until afternoon.',
       'Breezy and foggy starting in the evening.',
       'Breezy overnight and partly cloudy throughout the day.',
       'Overcast throughout the day and breezy starting in the evening.',
       'Breezy until evening and foggy in the morning.',
       'Breezy overnight and mostly cloudy throughout the day.',
       'Partly cloudy until evening and breezy in the afternoon.',
       'Partly cloudy starting in the morning and breezy starting in the morning continuing until afternoon.',
       'Mostly cloudy until evening and breezy in the evening.',
       'Windy in the afternoon.', 'Overcast until morning.',
       'Mostly cloudy overnight.',
       'Foggy starting in the morning continuing until evening and breezy in the evening.',
       'Breezy starting overnight continuing until morning.',
       'Breezy starting in the afternoon continuing until evening and foggy starting in the evening.',
       'Mostly cloudy until night and breezy overnight.',
       'Mostly cloudy starting in the morning and windy in the evening.',
       'Partly cloudy throughout the day and windy starting in the morning continuing until afternoon.',
       'Breezy until afternoon and overcast throughout the day.',
       'Breezy in the morning and foggy in the evening.',
       'Breezy starting in the afternoon continuing until evening and foggy in the evening.',
       'Breezy starting in the morning continuing until night.',
       'Breezy in the morning and mostly cloudy starting in the evening.',
       'Mostly cloudy until evening and breezy in the afternoon.',
       'Mostly cloudy until night and breezy starting in the afternoon continuing until evening.',
       'Mostly cloudy until evening and breezy starting overnight continuing until morning.',
       'Overcast throughout the day and breezy in the afternoon.',
       'Overcast throughout the day and breezy starting in the morning continuing until evening.',
       'Overcast throughout the day and breezy overnight.',
       'Overcast starting in the afternoon.',
       'Partly cloudy throughout the day and breezy in the afternoon.',
       'Light rain starting overnight.',
       'Drizzle starting in the evening.', 'Drizzle until morning.',
       'Rain throughout the day.', 'Rain until morning.',
       'Light rain overnight.', 'Rain until afternoon.'], dtype=object)
In [7]:
df['Wind Bearing (degrees)'].unique() ## 360 unqiue values... why tf would this be a categorical when it can be a numercial one
Out[7]:
array([251., 259., 204., 269., 258., 260., 279., 290., 316., 281., 289.,
       262., 288., 230., 163., 139., 147., 160., 152., 150., 149., 180.,
       161., 135., 141., 151., 169., 170., 187., 179., 162., 159., 168.,
        32., 140., 103., 113., 129., 207., 153.,   4., 341.,  15., 348.,
       321., 311., 339., 340., 330.,  19., 277.,   9.,   0., 350., 349.,
       338., 320., 310., 328.,  20.,  28.,  11., 326., 309., 193., 273.,
       300., 307., 319., 318., 243., 177., 172., 142., 130., 359., 166.,
       145., 178., 223., 240., 231., 214., 222., 241., 235., 238., 211.,
       221., 215., 224., 358.,   8.,  59.,  63.,  65., 146., 305., 327.,
       271., 297., 301., 308., 272., 351., 175., 138., 158., 132., 209.,
       250., 295., 280., 270., 239., 242., 266., 278., 325., 282., 274.,
       255.,  46., 284., 283., 313., 345.,  16., 332.,  12.,  39.,   3.,
        33.,  24.,  25.,  31.,  47.,  67.,  60., 144.,  57.,   2.,  18.,
        48.,  29., 335., 228., 315.,  40., 143., 133., 136., 355., 123.,
       199., 227., 156., 261., 357., 190., 298.,  10., 127.,  81., 120.,
       134., 115., 347.,  72.,   7.,   1.,  30., 352., 353.,  85.,  49.,
        21.,  23.,  35.,  56., 337., 285., 306.,   6., 116., 197., 275.,
       100., 333., 304.,  90.,  58., 157., 329., 268., 317., 210., 226.,
       188., 292., 154., 205., 181., 185., 148.,  68., 155., 137.,  95.,
        50.,  80.,  83.,  73.,  70., 106.,  94., 117.,  53.,  82., 101.,
       102.,  54.,  52.,  38., 128., 164.,  71., 108., 121., 122., 189.,
       246., 302., 245., 291., 312., 296., 264.,  41.,  62., 119.,  98.,
       124., 299., 173., 176., 194., 171., 183., 167., 229., 303., 212.,
       322., 331., 342., 324., 265., 267., 165., 236., 110., 216., 218.,
       354.,  75., 186., 249.,  55.,  64.,  17.,   5., 263., 256.,  69.,
        89., 200., 112., 192., 287., 237., 213., 220., 217., 202., 208.,
       198., 201., 111., 104., 126., 191., 254., 107., 118., 131., 232.,
       196., 174., 248., 182., 294., 252., 247., 233., 244., 276., 225.,
       234., 219.,  84.,  96., 336., 253., 346., 314., 203.,  13., 343.,
       125.,  61.,  79.,  93., 286.,  27., 323., 195.,  87., 105., 206.,
        44.,  51.,  36.,  43.,  26.,  74., 109., 356., 334.,  99.,  88.,
       257., 293.,  45., 344., 184.,  22.,  42.,  37.,  14.,  34.,  76.,
        91.,  86.,  97.,  66., 114.,  77.,  92.,  78.])
In [8]:
df['Precip Type'].unique()    ## 'rain', 'snow', nan
Out[8]:
array(['rain', 'snow', nan], dtype=object)

Izračunati statističke vrednosti za sve numeričke varijable.

In [9]:
df.describe()
Out[9]:
Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Wind Bearing (degrees) Visibility (km) Loud Cover Pressure (millibars)
count 96453.000000 96453.000000 96453.000000 96453.000000 96453.000000 96453.000000 96453.0 96453.000000
mean 11.932678 10.855029 0.734899 10.810640 187.509232 10.347325 0.0 1003.235956
std 9.551546 10.696847 0.195473 6.913571 107.383428 4.192123 0.0 116.969906
min -21.822222 -27.716667 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
25% 4.688889 2.311111 0.600000 5.828200 116.000000 8.339800 0.0 1011.900000
50% 12.000000 12.000000 0.780000 9.965900 180.000000 10.046400 0.0 1016.450000
75% 18.838889 18.838889 0.890000 14.135800 290.000000 14.812000 0.0 1021.090000
max 39.905556 39.344444 1.000000 63.852600 359.000000 16.100000 0.0 1046.380000

Za sada, izbaciti iz baze sva kategorička obeležja osim padavina Precip Type i pravca vetra Wind Bearing (degrees).

  • Šta raditi obeležjem datum?

  • Šta raditi sa obeležjem Cloud cover (koje su vrednosti tog obeležja)?

Moja misljenja: Formatted Date mozemo prebaciti u datum NPR:

df['date_column'] = pd.to_datetime(df['date_column'])

I onda mozemo to iskoristiti kao index u dataframe-u i po tome sortirati i dobiti vremensku seriju. Moramo paziti na nedostajuce datume i NaN vrijednosti. NPR: Imamo 1-13.7 pa nemamo 14-17.7 pa imamo 18-25.7. Moracemo nekako popuniti 14-17.7 sa nekim podacima

Cloud cover (ovde se zove Loud Cover) je beskorisno jer ima samo nule vrijednosti pa mozemo izbaciti po mom misljenju

In [10]:
df=df.drop(['Summary','Daily Summary', 'Formatted Date', 'Loud Cover'], axis=1)

U promenljivu X staviti sva obeležja na osnovu kojih će se vršiti predviđanje, a u promenljivu y staviti onu promenljivu čije će se vrednosti predviđati regresionim modelom (Visibility (km)).

  • Koliko ima uzoraka i obeležja?

Prikazati prvih 5 uzoraka.

U nastavku koristiti promenljive X i y.

In [11]:
# TODO
X=df.drop(['Visibility (km)'], axis=1)
y=df['Visibility (km)']
print(X.shape)
print(X.columns)
X.head()
(96453, 7)
Index(['Precip Type', 'Temperature (C)', 'Apparent Temperature (C)',
       'Humidity', 'Wind Speed (km/h)', 'Wind Bearing (degrees)',
       'Pressure (millibars)'],
      dtype='object')
Out[11]:
Precip Type Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Wind Bearing (degrees) Pressure (millibars)
0 rain 9.472222 7.388889 0.89 14.1197 251.0 1015.13
1 rain 9.355556 7.227778 0.86 14.2646 259.0 1015.63
2 rain 9.377778 9.377778 0.89 3.9284 204.0 1015.94
3 rain 8.288889 5.944444 0.83 14.1036 269.0 1016.41
4 rain 8.755556 6.977778 0.83 11.0446 259.0 1016.51
  • Koje su moguće vrednosti za padavine iz obeležja Percip Type?
  • Šta predstavljaju nan vrednosti ovde?

Moje misljenje NaN predstavlja da taj dan nije bilo padavina. To mozemo da kategorisemo kao NPR: rain=1, snow=2, NaN = 0.

In [12]:
X['Precip Type'].unique()
Out[12]:
array(['rain', 'snow', nan], dtype=object)

Prebaciti vrednosti obeležja Percip Type u numeričke vrednosti kreiranjem dummy varijabli i rešiti problem nedostajućih vrednosti.

Napomena: pogledati dokumentaciju za funkciju pandas.get_dummies

In [13]:
# print(pd.get_dummies(X, columns=['Precip Type']))
X = pd.get_dummies(X, columns=['Precip Type'])
X
Out[13]:
Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Wind Bearing (degrees) Pressure (millibars) Precip Type_rain Precip Type_snow
0 9.472222 7.388889 0.89 14.1197 251.0 1015.13 True False
1 9.355556 7.227778 0.86 14.2646 259.0 1015.63 True False
2 9.377778 9.377778 0.89 3.9284 204.0 1015.94 True False
3 8.288889 5.944444 0.83 14.1036 269.0 1016.41 True False
4 8.755556 6.977778 0.83 11.0446 259.0 1016.51 True False
... ... ... ... ... ... ... ... ...
96448 26.016667 26.016667 0.43 10.9963 31.0 1014.36 True False
96449 24.583333 24.583333 0.48 10.0947 20.0 1015.16 True False
96450 22.038889 22.038889 0.56 8.9838 30.0 1015.66 True False
96451 21.522222 21.522222 0.60 10.5294 20.0 1015.95 True False
96452 20.438889 20.438889 0.61 5.8765 39.0 1016.16 True False

96453 rows × 8 columns

Proveriti da li u bazi ima nedostajućih vrednosti.

In [14]:
X.isna().sum()
Out[14]:
0
Temperature (C) 0
Apparent Temperature (C) 0
Humidity 0
Wind Speed (km/h) 0
Wind Bearing (degrees) 0
Pressure (millibars) 0
Precip Type_rain 0
Precip Type_snow 0

Izvršiti statističku analizu baze.

  • Da li postoje nemoguće/pogrešne vrednosti u bazi?
In [15]:
X.describe()
Out[15]:
Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Wind Bearing (degrees) Pressure (millibars)
count 96453.000000 96453.000000 96453.000000 96453.000000 96453.000000 96453.000000
mean 11.932678 10.855029 0.734899 10.810640 187.509232 1003.235956
std 9.551546 10.696847 0.195473 6.913571 107.383428 116.969906
min -21.822222 -27.716667 0.000000 0.000000 0.000000 0.000000
25% 4.688889 2.311111 0.600000 5.828200 116.000000 1011.900000
50% 12.000000 12.000000 0.780000 9.965900 180.000000 1016.450000
75% 18.838889 18.838889 0.890000 14.135800 290.000000 1021.090000
max 39.905556 39.344444 1.000000 63.852600 359.000000 1046.380000

Prikazati raspodelu za obeležje pritiska

  • Može li pritisak biti 0? Kako rešiti problem?
In [16]:
plt.hist(X['Pressure (millibars)'], bins=30, label='pressure')
plt.legend()
plt.show()
No description has been provided for this image
In [17]:
X['Pressure (millibars)'] = X['Pressure (millibars)'].replace(0.0, method='ffill') ## ja bi zamjenio sa mean/median jer je numericko al oni kazu da je ffill bolje
X.describe()
/tmp/ipython-input-17-2157937382.py:1: FutureWarning: The 'method' keyword in Series.replace is deprecated and will be removed in a future version.
  X['Pressure (millibars)'] = X['Pressure (millibars)'].replace(0.0, method='ffill') ## ja bi zamjenio sa mean/median jer je numericko al oni kazu da je ffill bolje
Out[17]:
Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Wind Bearing (degrees) Pressure (millibars)
count 96453.000000 96453.000000 96453.000000 96453.000000 96453.000000 96453.000000
mean 11.932678 10.855029 0.734899 10.810640 187.509232 1016.817525
std 9.551546 10.696847 0.195473 6.913571 107.383428 7.779863
min -21.822222 -27.716667 0.000000 0.000000 0.000000 973.780000
25% 4.688889 2.311111 0.600000 5.828200 116.000000 1012.120000
50% 12.000000 12.000000 0.780000 9.965900 180.000000 1016.560000
75% 18.838889 18.838889 0.890000 14.135800 290.000000 1021.170000
max 39.905556 39.344444 1.000000 63.852600 359.000000 1046.380000

Nacrtati raspodelu vrednosti varijable Visibility (km) koja će se predvidjati modelom linearne regresije i uraditi statističku analizu te promenljive.

In [18]:
plt.figure(figsize=(10,5))
plt.hist(y, density=True, bins=500)
plt.show()
No description has been provided for this image

Moje misljenje: Ne znam da li bas moze postojati vidjlivost sa 0.0 ali posto postoje vidljivosti sa 0.01 itd. mozda i ima.

In [19]:
y.describe()
# sorted(y.unique(),reverse=True) ## 0.0 pa 0.0161 0.322...
Out[19]:
Visibility (km)
count 96453.000000
mean 10.347325
std 4.192123
min 0.000000
25% 8.339800
50% 10.046400
75% 14.812000
max 16.100000

Kreirati funkciju get_wind_direction koja prima numerički parametar direction i kao rezultat vraća karakter za pravac vetra prema sledećim pravilima:

  • za direction u intervalu [45, 135) $\rightarrow$ 'N'
  • za direction u intervalu [135, 225) $\rightarrow$ 'W'
  • za direction u intervalu [225, 315) $\rightarrow$ 'S'
  • za sve ostale direction vrednosti $\rightarrow$ 'E'

Primeniti funkciju za pravljenje kategoričke varijable sa navedenim kategorijama (metoda apply) iz obeležja Wind Bearing (degrees), a potom je pretvoriti u dummy varijable.

Napomena: koristiti minimalni broj potrebnih obeležja pri kreiranju dummy varijabli.

In [20]:
def get_wind_direction(direction):
  if 45 <= direction < 135:
    return 'N'
  elif 135 <= direction < 225:
    return 'W'
  elif 225 <= direction < 315:
    return 'S'
  else:
    return 'E'
In [21]:
X['Wind bearing cat'] = X['Wind Bearing (degrees)'].apply(get_wind_direction)
X.head()
Out[21]:
Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Wind Bearing (degrees) Pressure (millibars) Precip Type_rain Precip Type_snow Wind bearing cat
0 9.472222 7.388889 0.89 14.1197 251.0 1015.13 True False S
1 9.355556 7.227778 0.86 14.2646 259.0 1015.63 True False S
2 9.377778 9.377778 0.89 3.9284 204.0 1015.94 True False W
3 8.288889 5.944444 0.83 14.1036 269.0 1016.41 True False S
4 8.755556 6.977778 0.83 11.0446 259.0 1016.51 True False S
In [22]:
X.drop('Wind Bearing (degrees)', axis=1, inplace=True)
X = pd.get_dummies(X, columns=['Wind bearing cat'], drop_first=True)
X.head()
Out[22]:
Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Pressure (millibars) Precip Type_rain Precip Type_snow Wind bearing cat_N Wind bearing cat_S Wind bearing cat_W
0 9.472222 7.388889 0.89 14.1197 1015.13 True False False True False
1 9.355556 7.227778 0.86 14.2646 1015.63 True False False True False
2 9.377778 9.377778 0.89 3.9284 1015.94 True False False False True
3 8.288889 5.944444 0.83 14.1036 1016.41 True False False True False
4 8.755556 6.977778 0.83 11.0446 1016.51 True False False True False

Normalizacija obeležja¶

Postoji nekoliko vrsta normalizacije. Samostalno implementirati funkcije za

  1. Normalizaciju na opseg [0,1] \begin{equation}y_i^{norm} = \frac{y_i-y_{min}}{y_{max}-y_{min}}\end{equation}
  2. Z-normalizaciju ili standardizaciju \begin{equation}y_i^{std} = \frac{y_i-\mu}{\sigma}\end{equation}

i izvršiti ih nad promenljivom y kao odvojene rezultate y_normalizovano, y_standardizovano.

Proveriti statističke vrednosti pre i posle pozivanja funkcija i nacrtati njihove raspodele.

  • Da li ima razlike?
In [23]:
y_normalizovano = (y-min(y))/(max(y)-min(y))
y_normalizovano.describe()
Out[23]:
Visibility (km)
count 96453.000000
mean 0.642691
std 0.260380
min 0.000000
25% 0.518000
50% 0.624000
75% 0.920000
max 1.000000

In [24]:
plt.figure(figsize=(10,5))
plt.hist(y_normalizovano, density=True, bins=50)
plt.title('normalizovano')
plt.show()
No description has been provided for this image
In [25]:
y_standardizovano = (y-np.mean(y))/np.std(y)
y_standardizovano.describe()
Out[25]:
Visibility (km)
count 9.645300e+04
mean -2.357352e-18
std 1.000005e+00
min -2.468291e+00
25% -4.788827e-01
50% -7.178379e-02
75% 1.065021e+00
max 1.372265e+00

In [26]:
plt.figure(figsize=(10,5))
plt.hist(y_standardizovano, density=True, bins=50)
plt.title('standardizovano')
plt.show()
No description has been provided for this image

MOJE MISLJENJE: Standardizacija (Z-score scaling) $$z = (x - 𝜇) ÷ σ $$ x: originalna vrednost

μ: srednja vrednost kolone

σ: standardna devijacija kolone

Kada koristiti:

  1. Kad podaci nisu ograničeni na neki fiksni opseg (npr. nisu između 0 i 1).
  2. Kada pretpostavljaš normalnu raspodelu podataka (npr. kod linearnog regresora, logističke regresije, SVM, PCA).
  3. Za algoritme koji pretpostavljaju da su podaci centrirani (oko nule).

NPR: Ako imaš visinu ljudi u centimetrima i raspodela je zvonasta, standardizacija je dobra ideja.

Normalizacija (Min-Max scaling) $$ x_{scaled} = (x - x_{min}) \div (x_{max} - x_{min}) $$

Kada koristiti:

  1. Kada su podaci ograničeni ili imaju prirodne granice (npr. pikselske vrednosti 0–255).
  2. Kod neuronskih mreža (npr. za ulaze u ReLU, sigmoid, tanh).
  3. Kada je opseg bitan (npr. kod algoritama osetljivih na apsolutne razlike: k-NN, distance-based modeli).

NPR: Ako imaš intenzitete svetlosti, temperature, ili vrednosti piksela, normalizacija ih dovodi u isti opseg.

Karakteristika Standardizacija Normalizacija
Opseg Može biti negativan / neograničen [0, 1] (ili [a, b])
Srednja vrednost 0 Ne mora biti 0
Standardna devijacija 1 Nije relevantna
Pogodno za Linearne modele, PCA, SVM k-NN, neuronske mreže, slike
Osetljivost na outliere Manje osetljiva Više osetljiva

Napomena:

  • Standardizacija je robusnija ako imaš outliere.
  • Normalizacija može biti loša ako imaš ekstremne vrednosti jer one "sabiju" ostatak opsega.

Ponoviti proceduru normalizacije, sada korišćenjem sklearn.preprocessing.MinMaxScaler klase. Uporediti rezultate.

In [27]:
from sklearn.preprocessing import MinMaxScaler

s = MinMaxScaler()
s.fit(y.array.reshape(-1,1))
y_norm = s.transform(y.array.reshape(-1,1))
y_norm = pd.DataFrame(y_norm)
print(y_norm.describe())
print(y_norm.head())
                  0
count  96453.000000
mean       0.642691
std        0.260380
min        0.000000
25%        0.518000
50%        0.624000
75%        0.920000
max        1.000000
       0
0  0.983
1  0.983
2  0.929
3  0.983
4  0.983

Ponoviti proceduru standardizacije, sada korišćenjem sklearn.preprocessing.StandardScaler klase. Uporediti rezultate.

In [28]:
from sklearn.preprocessing import StandardScaler

s = StandardScaler()
s.fit(y.array.reshape(-1,1))
y_std = s.transform(y.array.reshape(-1,1))
y_std = pd.DataFrame(y_std)
y_std.describe()
Out[28]:
0
count 9.645300e+04
mean -4.243234e-17
std 1.000005e+00
min -2.468291e+00
25% -4.788827e-01
50% -7.178379e-02
75% 1.065021e+00
max 1.372265e+00

Mere uspešnosti regresora¶

Kreirati funkciju model_evaluation koja za ulazne parametre y_test (kao prave vrednosti $y_i$), y_predicted (kao prediktovane vrednosti $\hat{y}_i$, mean(y_predicted) => y_i sa gornjom crtom), N i d vraća različite mere uspešnosti regresora:

  • MSE (mean squared error) - srednja kvadratna greška \begin{equation}MSE = \frac{1}{N}\sum_i(y_i-\hat{y_i})^2\end{equation}
  • MAE (mean absolute error) - srednja apsolutna greška \begin{equation}MAE = \frac{1}{N}\sum_i|y_i-\hat{y_i}|\end{equation}
  • RMSE (root mean squared error) - koren srednje kvadratne greške \begin{equation}RMSE = \sqrt{MSE}\end{equation}
  • $R^2$ skor \begin{equation}R^2=1-\frac{\sum_i(y_i-\hat{y_i})^2}{\sum_i(y_i-\overline{y_i})^2}\end{equation}
  • $R^2$ skor prilagođen (adjusted) \begin{equation}R^2_{adj} =1 - \frac{(1-R^2)*(N-1)}{(N-d-1)}\end{equation} N je broj uzoraka, a d broj obeležja

Funkcija dodatno treba da vrši ispis vrednosti opisanih mera, kao i da prikaže uporedni prikaz nekoliko pravih i prediktovanih vrednosti.

In [29]:
# TODO
def model_evaluation(y_test, y_predicted, N, d):
    mse = np.mean((y_test-y_predicted)**2)
    # mse = mean_squared_error(y_test,y_predicted)
    mae = np.mean(np.abs(y_test-y_predicted))
    # mae = mean_absolute_error(y_test,y_predicted)
    rmse = np.sqrt(mse)
    r2 = 1-np.sum((y_test-y_predicted)**2)/np.sum((y_test-np.mean(y_test))**2)
    # r2 = r2_score(y_test, y_predicted)
    r2_adj = 1-((1-r2)*(N-1))/(N-d-1)

    # printing values
    print('Mean squared error: ', mse)
    print('Mean absolute error: ', mae)
    print('Root mean squared error: ', rmse)
    print('R2 score: ', r2)
    print('R2 adjusted score: ', r2_adj)

    # Uporedni prikaz nekoliko pravih i predvidjenih vrednosti
    res=pd.concat([pd.DataFrame(y_test.values), pd.DataFrame(y_predicted)], axis=1)
    res.columns = ['y', 'y_pred']
    print(res.head(20))
    return mse,mae,rmse,r2,r2_adj

Testirati funkciju za neka 2 proizvoljna niza i potom uporediti rezultate odgovarajućih mera sa ugrađenim funkcijama

  • sklearn.metrics.mean_squared_error
  • sklearn.metrics.mean_absolute_error
  • sklearn.metrics.r2_score
In [30]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# TODO
# tmp_in = np.random.normal(size=(10,))
tmp_true = df.loc[:9,'Wind Speed (km/h)']
tmp_predicted = np.arange(0,10)
# print(tmp_in)
# print(tmp_out)

tmp_res = model_evaluation(y_test=tmp_true, y_predicted=tmp_predicted, N=10, d=1)
print(tmp_res)
print('mean_squared_error:',mean_squared_error(tmp_true,tmp_predicted))
print('mean_absolute_error:',mean_absolute_error(tmp_true,tmp_predicted))
sk_learn_r2 = r2_score(tmp_true,tmp_predicted)
print('r2_score:',sk_learn_r2)
custom_r2_adjusted = 1 - ((1 - sk_learn_r2) * (10-1)) / (10 - 1 - 1)
print('r2_score custom adjusted:', custom_r2_adjusted )
del tmp_true, tmp_predicted, tmp_res
Mean squared error:  74.73116288000001
Mean absolute error:  7.678040000000001
Root mean squared error:  8.644718785478219
R2 score:  -7.384075767581049
R2 adjusted score:  -8.43208523852868
         y  y_pred
0  14.1197       0
1  14.2646       1
2   3.9284       2
3  14.1036       3
4  11.0446       4
5  13.9587       5
6  12.3648       6
7  14.1519       7
8  11.3183       8
9  12.5258       9
(np.float64(74.73116288000001), np.float64(7.678040000000001), np.float64(8.644718785478219), np.float64(-7.384075767581049), np.float64(-8.43208523852868))
mean_squared_error: 74.73116288000001
mean_absolute_error: 7.678040000000001
r2_score: -7.384075767581049
r2_score custom adjusted: -8.43208523852868

Obuka modela linearne regresije¶

I model¶

  • Podela skupa na trening i test
  • Obuka modela sa osnovnom hipotezom \begin{equation} y=b_0+b_1x_1+b_2x_2+...b_dx_d\end{equation}
  • Primena modela na test uzorke
  • Evaluacija modela

Koristeći funkciju sklearn.model_selection.train_test_split izvršiti podelu skupa X i y na train i test celine. Zahtevati da test skup sadrži 10% podataka. Fiksirati generator slučajnih vrednosti na 42.

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Koristeći klasu sklearn.linear_model.LinearRegression kreirati inicijalni model linearne regresije za parametar fit_intercept=True.

  • Šta predstavlja ovaj parametar?

Ovaj paramatar predstavlja slobodni clan b: $$y = w_1*x_1 + w_2*x_2 + ... + w_i*x_i + b$$

Ako je fit_intercept=True a bez b ako je False.

Ako si unapred standardizovao podatke (npr. pomoću StandardScaler), onda intercept može biti bespotreban, jer su sve promenljive već srednjene — tada možeš koristiti fit_intercept=False.

In [32]:
# Inicijalizacija modela (podesavanje hiperparametara)
model = LinearRegression(fit_intercept=True)

Obučiti kreirani model koristeći metodu fit i trening podatke a zatim izvršiti predviđanje nad test podacima koristeći metodu predict. Rezultate predikcije staviti u promenljivu y_predicted.

In [33]:
# Obuka modela
model.fit(X_train, y_train)

# Testiranje obucenog modela (primena modela)
y_predicted = model.predict(X_test)

Izvršiti evaluaciju modela koristeći implementiranu funkciju model_evaluation.

In [34]:
# Evaluacija (racunanje mera uspesnosti)
model_evaluation(y_test, y_predicted, X_train.shape[0], X_train.shape[1])
Mean squared error:  14.020892363319918
Mean absolute error:  3.0559024248680258
Root mean squared error:  3.7444482054529633
R2 score:  0.21754464246307192
R2 adjusted score:  0.21745449368230585
          y     y_pred
0   15.5526  12.945771
1    9.9820   9.740866
2    9.6278  10.706595
3    8.0500  10.253302
4    7.5509   6.345266
5   13.7977   9.030456
6    9.9820  13.091935
7   11.2700  10.720846
8    4.5080   6.901841
9    8.4203   9.517001
10   8.0983  10.340767
11  10.3523  11.267772
12   3.4937   5.366907
13   9.9820  10.033562
14  16.1000  12.207898
15   0.0000   7.770094
16   8.0500   9.694605
17   9.9820  11.428571
18   0.5313   9.255958
19  11.2700   9.727986
Out[34]:
(np.float64(14.020892363319918),
 np.float64(3.0559024248680258),
 np.float64(3.7444482054529633),
 np.float64(0.21754464246307192),
 np.float64(0.21745449368230585))

Koristeći matplotlib.pyplot.bar grafički prikazati koeficijente modela, sadržane u polju coef_.

In [35]:
# Ilustracija koeficijenata
plt.figure(figsize=(10,5))
plt.bar(model.feature_names_in_,model.coef_)
plt.xticks(rotation=45, ha="right")
print(f'coef: {model.coef_}')           ## koef. modela
print(f'intercept: {model.intercept_}') ## slobodni clan
plt.show()
coef: [ 9.16231002e-02 -3.56597522e-02 -5.29981407e+00  2.77470063e-03
 -3.73660177e-02  2.86953981e+00  6.22868232e-01 -3.09246313e-01
  1.67222542e-01 -3.88999643e-01]
intercept: 49.0224861405965
No description has been provided for this image

I model - standardizacija¶

Obučiti model sa istom hipotezom ali sa standardizovanim obeležjima. Koristiti podelu na trening i test.

  • Podela skupa na trening i test
  • Standardizacija
  • Obuka modela sa osnovnom hipotezom \begin{equation} y=b_0+b_1x_1+b_2x_2+...b_dx_d\end{equation}
  • Primena modela na test uzorke
  • Evaluacija modela
In [36]:
# podela skupa na trening (90%) i test (10%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Definisati listu numeričkih i kategoričkih obeležja.

In [37]:
cat_feats = X_train.columns[['_' in column for column in X_train.columns]]
print(f'cat feats: {cat_feats}')
numeric_feats = X_train.columns[['_' not in column for column in X_train.columns]]
print(f'numeric feats: {numeric_feats}')
cat feats: Index(['Precip Type_rain', 'Precip Type_snow', 'Wind bearing cat_N',
       'Wind bearing cat_S', 'Wind bearing cat_W'],
      dtype='object')
numeric feats: Index(['Temperature (C)', 'Apparent Temperature (C)', 'Humidity',
       'Wind Speed (km/h)', 'Pressure (millibars)'],
      dtype='object')

U okviru numeričkih obeležja, fit-ovati klasu standardizacije sa obeležjima trening podataka i potom izvršiti standardizaciju transform-isanjem obeležja trening i test podataka.

  • Da li smo mogli uraditi nezavisno fit i transform za trening i test podatke i zašto?
In [38]:
# print(X_train[numeric_feats].head(5))
# print('----------------')
# print(X_test[numeric_feats].head(5))
# print('----------------')
s = StandardScaler()
s.fit(X_train[numeric_feats])
X_train[numeric_feats] = s.transform(X_train[numeric_feats])
X_test[numeric_feats] = s.transform(X_test[numeric_feats])
# print(X_train[numeric_feats].head(5))
# print('----------------')
# print(X_test[numeric_feats].head(5))
# print('----------------')

Ponoviti proces obuke, primene i evaluacije modela sa novim podacimа.

  • Da li ima razlike u rezultatima?

Grafički prikazati koeficijente modela.

  • Da li možemo interpretirati vrednosti koeficijenata na neki način?
In [39]:
# Inicijalizacija modela (podesavanje hiperparametara)
model = LinearRegression(fit_intercept=True)

# Obuka modela
model.fit(X_train, y_train)

# Testiranje obucenog modela (primena modela)
y_predicted = model.predict(X_test)

# Evaluacija (racunanje mera uspesnosti)
model_evaluation(y_test, y_predicted, X_train.shape[0], X_train.shape[1])
Mean squared error:  14.020892363319915
Mean absolute error:  3.0559024248680258
Root mean squared error:  3.744448205452963
R2 score:  0.21754464246307215
R2 adjusted score:  0.21745449368230618
          y     y_pred
0   15.5526  12.945771
1    9.9820   9.740866
2    9.6278  10.706595
3    8.0500  10.253302
4    7.5509   6.345266
5   13.7977   9.030456
6    9.9820  13.091935
7   11.2700  10.720846
8    4.5080   6.901841
9    8.4203   9.517001
10   8.0983  10.340767
11  10.3523  11.267772
12   3.4937   5.366907
13   9.9820  10.033562
14  16.1000  12.207898
15   0.0000   7.770094
16   8.0500   9.694605
17   9.9820  11.428571
18   0.5313   9.255958
19  11.2700   9.727986
Out[39]:
(np.float64(14.020892363319915),
 np.float64(3.0559024248680258),
 np.float64(3.744448205452963),
 np.float64(0.21754464246307215),
 np.float64(0.21745449368230618))
In [40]:
# Ilustracija koeficijenata
plt.figure(figsize=(10,5))
plt.bar(model.feature_names_in_,model.coef_)
plt.xticks(rotation=45, ha="right")
print(f'coef: {model.coef_}')
print(f'intercept: {model.intercept_}')
plt.show()
coef: [ 0.87431593 -0.38103252 -1.03544497  0.01917541 -0.2904449   2.86953981
  0.62286823 -0.30924631  0.16722254 -0.38899964]
intercept: 7.870049301791442
No description has been provided for this image

I model - regularizacija¶

  • Podela skupa na trening i test
  • Standardizacija
  • Obuka modela sa osnovnom hipotezom i regularizacijom \begin{equation} y=b_0+b_1x_1+b_2x_2+...b_dx_d\end{equation}
  • Primena modela na test uzorke
  • Evaluacija modela
In [41]:
# podela skupa na trening (90%) i test (10%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
In [42]:
# standardizacija spram trening seta
s = StandardScaler()
s.fit(X_train[numeric_feats])
X_train[numeric_feats] = s.transform(X_train[numeric_feats])
X_test[numeric_feats] = s.transform(X_test[numeric_feats])

Ridge¶

Koristeći klasu sklearn.linear_model.Ridge, obučiti model sa uključenom ridge regularizacijom.

Ridge regularizacija podrazumeva uvođenje regularizacije u funkciju cene \begin{equation} J=\frac{1}{N}\sum_i(y_i-\hat{y_i})^2 + α\sum_jθ_j^2 \end{equation}

Koristiti parametar alpha=5.

Ridge regresija:

Kod regularizovane regresije, dodajemo kaznu (penal) za velika težinska koeficijente ww, pa ciljna funkcija postaje: $$ MSE + α ∑_{i=1}^p ω_j^2$$

  • α je hiperparametar koji kontroliše jačinu regularizacije.
  • Ne postavlja koeficijente na nulu, već ih smanjuje.
  • Koristi se kada imaš mnogo korelisanih karakteristika.
  • α obicno ide od 1 do 100 sto utice na koliko se koeficijenti smanjuju

Lasso regresija:

Koristi apsolutnu vrednost težinskih koeficijenata:

$$ MSE + α ∑_{i=1}^p |ω_j|$$

  • Može naterati koeficijente da postanu tačno 0.
  • Time vrši selekciju karakteristika automatski.
  • Korisna kada imaš mnogo nebitnih ulaznih promenljivih.
  • α obicno ide od 0.001 do 1. Kako raste tako vise koeficijenata postaje 0
In [43]:
# Inicijalizacija modela (podesavanje hiperparametara)
from sklearn.linear_model import Ridge
model_ridge = Ridge(alpha=5, fit_intercept=True)

# Obuka modela
model_ridge.fit(X_train, y_train)

# Testiranje obucenog modela (primena modela)
y_predicted = model_ridge.predict(X_test)

# Evaluacija (racunanje mera uspesnosti)
model_evaluation(y_test, y_predicted, X_train.shape[0], X_train.shape[1])
Mean squared error:  14.020800010360787
Mean absolute error:  3.055845320516795
Root mean squared error:  3.7444358734475327
R2 score:  0.21754979634813032
R2 adjusted score:  0.21745964816115726
          y     y_pred
0   15.5526  12.945527
1    9.9820   9.740490
2    9.6278  10.706926
3    8.0500  10.252578
4    7.5509   6.344797
5   13.7977   9.030272
6    9.9820  13.091452
7   11.2700  10.721259
8    4.5080   6.938428
9    8.4203   9.517242
10   8.0983  10.340601
11  10.3523  11.267687
12   3.4937   5.366064
13   9.9820  10.034060
14  16.1000  12.207746
15   0.0000   7.806896
16   8.0500   9.695141
17   9.9820  11.428267
18   0.5313   9.256618
19  11.2700   9.727803
Out[43]:
(np.float64(14.020800010360787),
 np.float64(3.055845320516795),
 np.float64(3.7444358734475327),
 np.float64(0.21754979634813032),
 np.float64(0.21745964816115726))

Lasso¶

Koristeći klasu sklearn.linear_model.Lasso, obučiti model sa uključenom lasso regularizacijom.

Lasso regularizacija podrazumeva uvođenje regularizacije u funkciju cene \begin{equation} J=\frac{1}{N}\sum_i(y_i-\hat{y_i})^2 + α\sum_j|θ_j| \end{equation}

Koristiti parametar alpha=0.01.

In [44]:
# Inicijalizacija modela (podesavanje hiperparametara)
model_lasso = Lasso(alpha=0.01, fit_intercept=True)

# Obuka modela
model_lasso.fit(X_train, y_train)

# Testiranje obucenog modela (primena modela)
y_predicted = model_lasso.predict(X_test)

# Evaluacija (racunanje mera uspesnosti)
model_evaluation(y_test, y_predicted, X_train.shape[0], X_train.shape[1])
Mean squared error:  14.029134958291259
Mean absolute error:  3.055760798733383
Root mean squared error:  3.745548685879181
R2 score:  0.21708465301103064
R2 adjusted score:  0.2169944512336458
          y     y_pred
0   15.5526  12.907872
1    9.9820   9.748854
2    9.6278  10.759736
3    8.0500  10.145137
4    7.5509   6.470670
5   13.7977   9.066796
6    9.9820  13.031737
7   11.2700  10.783161
8    4.5080   7.640638
9    8.4203   9.557600
10   8.0983  10.259546
11  10.3523  11.247211
12   3.4937   5.460913
13   9.9820  10.105863
14  16.1000  12.276990
15   0.0000   8.459230
16   8.0500   9.782073
17   9.9820  11.398009
18   0.5313   9.251604
19  11.2700   9.754239
Out[44]:
(np.float64(14.029134958291259),
 np.float64(3.055760798733383),
 np.float64(3.745548685879181),
 np.float64(0.21708465301103064),
 np.float64(0.2169944512336458))

Odrediti mere performansi oba modela spram test seta.

  • Čemu služi regularizacija? Šta reguliše parametar alhpa? Šta uočavamo iz grafičkih prikaza koeficijenata ovih modela?
In [45]:
# Ilustracija koeficijenata
plt.figure(figsize=(10,5))
plt.bar(model_ridge.feature_names_in_,model_ridge.coef_, label='ridge')
plt.legend()
plt.xticks(rotation=45, ha="right")
print(f'coef: {model_ridge.coef_}')
print(f'intercept: {model_ridge.intercept_}')
plt.show()
coef: [ 0.86798416 -0.37476099 -1.03554388  0.01951315 -0.29055701  2.83286984
  0.58626602 -0.3090594   0.16718629 -0.38880959]
intercept: 7.906437747983643
No description has been provided for this image
In [46]:
# Ilustracija koeficijenata
plt.figure(figsize=(10,5))
plt.bar(model_lasso.feature_names_in_,model_lasso.coef_, label='lasso')
plt.legend()
plt.xticks(rotation=45, ha="right")
print(f'coef: {model_lasso.coef_}')
print(f'intercept: {model_lasso.intercept_}')
plt.show()
coef: [ 0.50646735  0.         -1.03066045  0.04108523 -0.28764133  2.15662212
 -0.         -0.20699429  0.15832752 -0.31984528]
intercept: 8.53517625666293
No description has been provided for this image
  • Uporediti performanse ova dva modela. Da li je korektno poređenje na ovaj način i zašto?

I model - trostruka podela podataka¶

  • Podela skupa podataka na trening, validacioni i test skup
  • Standardizacija obeležja
  • Obuka modela sa osnovnom hipotezom i regularizacijom \begin{equation} y=b_0+b_1x_1+b_2x_2+...b_dx_d\end{equation}
  • Primena modela na uzorke iz validacionog skupa
  • Evaluacija modela spram validacije (donošenje odluke o vrednostima hiperparametara)
  • Obuka konačnog modela
  • Primena modela na test uzorke
  • Evaluacija modela

Koristeći funkciju sklearn.model_selection.train_test_split izvršiti podelu skupa X i y na train, validation i test celine. Zahtevano je da validacioni i test skup sadrže po 10% podataka. Fiksirati generator slučajnih vrednosti na 42.

In [47]:
# podela skupa na trening (80%), val(10%) i test (10%)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

Izvršiti standardizaciju svih skupova obeležja spram trening seta.

In [48]:
# standardizacija spram trening seta
s = StandardScaler()
s.fit(X_train[numeric_feats])
X_train[numeric_feats] = s.transform(X_train[numeric_feats])
X_val[numeric_feats] = s.transform(X_val[numeric_feats])
X_test[numeric_feats] = s.transform(X_test[numeric_feats])

Kreirati i obučiti modele sa lasso i ridge regularizacijom sa datom hipotezom.

In [49]:
# Inicijalizacija modela (podesavanje hiperparametara)
model_ridge = Ridge(alpha=5)
# Obuka modela
model_ridge.fit(X_train, y_train)
Out[49]:
Ridge(alpha=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Ridge(alpha=5)
In [50]:
# Inicijalizacija modela (podesavanje hiperparametara)
model_lasso = Lasso(alpha=0.01)
# Obuka modela
model_lasso.fit(X_train, y_train)
Out[50]:
Lasso(alpha=0.01)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso(alpha=0.01)

Primeniti model nad uzorcima validacionog skupa i izvršiti evaluaciju modela.

In [51]:
# Testiranje obucenog modela (primena modela)
y_predicted_ridge = model_ridge.predict(X_val)
# Evaluacija (racunanje mera uspesnosti)
model_evaluation(y_val, y_predicted_ridge, X_train.shape[0], X_train.shape[1])
Mean squared error:  14.07552718783539
Mean absolute error:  3.0583353090592866
Root mean squared error:  3.75173655629435
R2 score:  0.21580370876882204
R2 adjusted score:  0.2157020644231582
          y     y_pred
0    9.9820  10.171718
1   16.1000  11.144381
2   15.1823  10.756456
3    6.0697   6.469956
4   10.2557  11.468892
5   15.8263  12.054888
6    9.9820  14.211538
7   14.9569  10.060617
8   11.2700  10.123561
9    4.4436   8.859296
10  10.4006  11.423057
11  10.3523  12.729105
12   9.9820  10.209543
13  15.8263   7.033093
14  16.1000  11.093510
15   0.1610   8.568039
16   9.9820  10.523047
17  10.3523  12.513890
18   2.0769   6.139311
19  15.8263  12.946812
Out[51]:
(np.float64(14.07552718783539),
 np.float64(3.0583353090592866),
 np.float64(3.75173655629435),
 np.float64(0.21580370876882204),
 np.float64(0.2157020644231582))
In [52]:
# Testiranje obucenog modela (primena modela)
y_predicted_lasso = model_lasso.predict(X_val)
# Evaluacija (racunanje mera uspesnosti)
model_evaluation(y_val, y_predicted_lasso, X_train.shape[0], X_train.shape[1])
Mean squared error:  14.074672381479967
Mean absolute error:  3.057906587294417
Root mean squared error:  3.7516226331388887
R2 score:  0.21585133298670733
R2 adjusted score:  0.2157496948139016
          y     y_pred
0    9.9820  10.144456
1   16.1000  11.098217
2   15.1823  10.673193
3    6.0697   6.463707
4   10.2557  11.349296
5   15.8263  12.020563
6    9.9820  14.180232
7   14.9569  10.139356
8   11.2700  10.171205
9    4.4436   8.915285
10  10.4006  11.435534
11  10.3523  12.726415
12   9.9820  10.202198
13  15.8263   7.012660
14  16.1000  11.151853
15   0.1610   8.601278
16   9.9820  10.611018
17  10.3523  12.453106
18   2.0769   6.248733
19  15.8263  12.948846
Out[52]:
(np.float64(14.074672381479967),
 np.float64(3.057906587294417),
 np.float64(3.7516226331388887),
 np.float64(0.21585133298670733),
 np.float64(0.2157496948139016))
  • Uporediti performanse ova dva modela. Da li je korektno poređenje na ovaj način i zašto?

Za odabrani model, izvršiti ponovo treniranje koristeći spojen trening i validacioni skup.

In [53]:
X_train_full = pd.concat((X_train, X_val))
y_train_full = pd.concat((y_train, y_val))
X_train_full
Out[53]:
Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Pressure (millibars) Precip Type_rain Precip Type_snow Wind bearing cat_N Wind bearing cat_S Wind bearing cat_W
43345 0.366321 0.427717 1.153114 0.007331 -0.234906 True False False False False
66832 -0.620241 -0.911529 -0.331279 3.275592 -0.865304 True False False True False
92142 0.659843 0.689845 1.204300 -0.734188 -0.843433 True False False True False
24092 -0.179957 -0.060133 -0.382464 -1.096811 0.215378 True False False True False
35372 -0.418735 -0.499094 0.129395 0.460610 0.010820 True False False True False
... ... ... ... ... ... ... ... ... ... ...
13309 2.002242 1.860059 -1.815671 0.507101 -0.959220 True False False False True
9459 -0.384957 -0.323301 0.129395 -0.652830 0.317013 True False False True False
18571 1.875282 1.645260 -2.327531 -0.160034 -0.556537 True False False False False
7862 -0.310412 -0.414838 -1.713299 0.885996 0.365901 True False False False False
86561 -0.736136 -0.866281 1.050742 0.653545 -0.039354 True False False False False

86807 rows × 10 columns

In [54]:
# Inicijalizacija modela (podesavanje hiperparametara)
model_lasso = Lasso(alpha=0.1)
# Obuka modela
model_lasso.fit(X_train_full, y_train_full)
Out[54]:
Lasso(alpha=0.1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso(alpha=0.1)

Primeniti model nad test uzorcima i izvršiti evaluaciju modela.

In [55]:
# Testiranje obucenog modela (primena modela)
y_predicted_ridge = model_lasso.predict(X_test)
# Evaluacija (racunanje mera uspesnosti)
model_evaluation(y_test, y_predicted_ridge, X_train.shape[0], X_train.shape[1])
Mean squared error:  13.971763071250301
Mean absolute error:  3.037403500670552
Root mean squared error:  3.737882163906495
R2 score:  0.19822978355000698
R2 adjusted score:  0.19812586134336674
          y     y_pred
0   15.5526  11.636798
1    4.1216   7.814812
2    9.9820  11.965885
3   15.8263  10.039523
4   15.1340   9.420337
5    7.9695   8.147696
6    9.9820  11.826974
7    9.9820  14.021446
8   15.8263  11.452480
9   10.3523  13.645432
10  10.2557   8.847953
11   4.0250   7.288054
12   8.9677   8.742788
13   4.9105   6.801529
14   1.9642   8.888358
15   9.9820  10.148058
16  16.1000   8.516962
17  14.9569  10.328438
18  14.0231   8.714493
19   3.0268   7.170125
Out[55]:
(np.float64(13.971763071250301),
 np.float64(3.037403500670552),
 np.float64(3.737882163906495),
 np.float64(0.19822978355000698),
 np.float64(0.19812586134336674))

I model - unakrsna validacija¶

  • Podela skupa podataka na trening i test, a potom podela trening skupa na 5 disjunktnih podskupova, te realizacija unakrsne validacije kroz petlju u kojoj se radi:
    • Standardizacija obeležja
    • Obuka modela sa osnovnom hipotezom i regularizacijom sa (pod)trening podacima \begin{equation} y=b_0+b_1x_1+b_2x_2+...b_dx_d\end{equation}
    • Primena modela na uzorke iz (pod)test skupa
    • Evaluacija modela (čuvanje mera iz svake iteracije)
  • Računanje konačnih mera uspešnosti modela
  • Obuka konačnog modela
  • Primena modela na uzorke iz test skupa
  • Evaluacija modela
In [56]:
# podela skupa na trening (90%) i test (10%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Koristeći klasu sklearn.model_selection.KFold izvršiti podelu trening skupa na 5 celina. U okviru for petlje, koristeći pristup dataframe-u putem indeksa, izdvajati (pod)trening i (pod)test celine i izvršiti zahtevane korake. Rezultate individualnih mera evaluacije čuvati u izdvojenim listama. Ispisati na kraju prosek dobijenih mera evaluacije. Koristiti samo model sa ridge regularizacijom.

In [57]:
# TODO
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

kf_ridge = KFold(n_splits=5,shuffle=False) ## ,random_state=42
kf_ridge.get_n_splits(X_train)
# print(kf_ridge)
# print(kf_ridge.split(X_train))

# print(numeric_feats)

ridge_results = []
for split_id,(train_index,test_index) in enumerate(kf_ridge.split(X_train)):
  # print('train_index:',train_index, len(train_index))
  # print('test_index:',test_index, len(test_index))

  ## can not find the indexes in 'X_train' even though we split it. LIKE WHAT THE FUCK
  # train_df = X_train.loc[train_index,:]
  # test_df = y_train.loc[test_index,:]

  train_df_x = X.loc[train_index]
  test_df_x = X.loc[test_index]
  train_df_y = y.loc[train_index]
  test_df_y = y.loc[test_index]
  # print(train_index.shape)
  # print(train_df)
  # print(test_df)

  tmp_sc = StandardScaler()
  tmp_sc.fit(train_df_x[numeric_feats])
  # print(tmp_sc.mean_)

  train_df_x[numeric_feats] = tmp_sc.transform(train_df_x[numeric_feats])
  test_df_x[numeric_feats] = tmp_sc.transform(test_df_x[numeric_feats])
  # print(train_df_x[numeric_feats])

  my_ridge_model = Ridge(alpha=5, fit_intercept=True)
  my_ridge_model.fit(train_df_x, train_df_y)
  # print(my_ridge_model.coef_)

  my_y_predicted = my_ridge_model.predict(train_df_x)
  # print(my_y_predicted.shape)
  # print("--------------")
  # print(train_df_y.shape)
  # print("--------------")

  mean_squared_error(y_true=train_df_y,y_pred=my_y_predicted)
  mean_absolute_error(y_true=train_df_y,y_pred=my_y_predicted)
  r2_score(y_true=train_df_y,y_pred=my_y_predicted)
  my_res = model_evaluation(train_df_y,my_y_predicted,train_df_y.shape[0],train_df_x.shape[1])
  # print(train_df_y.shape[0],train_df_x.shape[1])
  # print("--------------")
  # print(my_res)
  # print("--------------")
  ridge_results.append(my_res)

mean_mse = 0
mean_mAe = 0
mean_root_square_error = 0
mean_r2 = 0
mean_r2_adj = 0
for result in ridge_results:
  mean_mse += result[0]
  mean_mAe += result[1]
  mean_root_square_error = result[2]
  mean_r2 += result[3]
  mean_r2_adj += result[4]

mean_mse /= len(ridge_results)
mean_mAe /= len(ridge_results)
mean_root_square_error /= len(ridge_results)
mean_r2 /= len(ridge_results)
mean_r2_adj /= len(ridge_results)

print("MSE:",mean_mse)
print("Mean AbsoluteE:",mean_mAe)
print("mean_root_square_error: ",mean_root_square_error)
print("r2:",mean_r2)
print("r2_adj:",mean_r2_adj)

del tmp_sc, train_df_x, train_df_y, test_df_x,  test_df_y , my_ridge_model, my_y_predicted, my_res, mean_mse, mean_mAe,mean_r2,mean_r2_adj, mean_root_square_error
Mean squared error:  13.993916688217425
Mean absolute error:  3.0454294957806023
Root mean squared error:  3.740844381716169
R2 score:  0.21445290689183205
R2 adjusted score:  0.2143397710948005
          y     y_pred
0    9.9820  13.182870
1   10.3523  13.447281
2   11.2056  13.172528
3    9.9820  13.839584
4   10.3523  14.197236
5    9.9820  13.659735
6    9.9820  14.245082
7   10.3523  14.071979
8    9.9820  13.833627
9    9.9820  13.775057
10  10.3523  12.882075
11   9.9820  11.185861
12  15.8263  11.333326
13  14.9569  10.941102
14   9.9820   9.316291
15   6.2951   9.332095
16  14.1680   9.418226
17   6.2951   9.165078
18   6.2951   9.254108
19   6.8425   9.158915
Mean squared error:  13.943623197415658
Mean absolute error:  3.05111515961069
Root mean squared error:  3.7341161199694444
R2 score:  0.21862038210729162
R2 adjusted score:  0.2185078465169623
          y     y_pred
0   15.8263  10.040441
1   15.8263  10.184886
2   14.9569   9.262247
3   15.8263  10.273044
4   15.8263  10.261926
5   14.9569  10.193982
6    9.9820   9.552946
7    9.9820   9.946408
8    9.9820  10.313834
9    9.9820  11.046320
10  11.2056  11.454064
11  11.4471  11.979353
12  11.2700  12.224600
13  11.2700  12.422178
14  11.4471  12.735427
15  11.2700  12.810080
16  11.2700  11.848270
17  11.4471  11.684506
18  11.2056  10.665760
19  11.2056  10.544664
Mean squared error:  14.120209333556948
Mean absolute error:  3.0684599495694
Root mean squared error:  3.7576866997604985
R2 score:  0.18244066753023125
R2 adjusted score:  0.18232292297309582
          y     y_pred
0   15.8263  10.076561
1   15.8263  10.209325
2   14.9569   9.471395
3   15.8263  10.291836
4   15.8263  10.294370
5   14.9569  10.211679
6    9.9820   9.617736
7    9.9820   9.976242
8    9.9820  10.387983
9    9.9820  11.042999
10  11.2056  11.403906
11  11.4471  11.999587
12  11.2700  12.114609
13  11.2700  12.307495
14  11.4471  12.575736
15  11.2700  12.643415
16  11.2700  11.791877
17  11.4471  11.628643
18  11.2056  10.740144
19  11.2056  10.639857
Mean squared error:  13.55406866484194
Mean absolute error:  2.969827670230717
Root mean squared error:  3.6815850750514976
R2 score:  0.216961088578021
R2 adjusted score:  0.21684831563765627
          y     y_pred
0   15.8263   9.899432
1   15.8263  10.041198
2   14.9569   9.127865
3   15.8263  10.123315
4   15.8263  10.095732
5   14.9569  10.045953
6    9.9820   9.402599
7    9.9820   9.801979
8    9.9820  10.154153
9    9.9820  10.887186
10  11.2056  11.321937
11  11.4471  11.864421
12  11.2700  12.109351
13  11.2700  12.295882
14  11.4471  12.574952
15  11.2700  12.619854
16  11.2700  11.692831
17  11.4471  11.512042
18  11.2056  10.544776
19  11.2056  10.418596
Mean squared error:  12.269043996179049
Mean absolute error:  2.700273180177254
Root mean squared error:  3.5027195143458245
R2 score:  0.19615032709738522
R2 adjusted score:  0.19603455699975392
          y     y_pred
0   15.8263  10.122521
1   15.8263  10.224259
2   14.9569   9.221741
3   15.8263  10.285211
4   15.8263  10.226435
5   14.9569  10.226191
6    9.9820   9.765225
7    9.9820  10.061082
8    9.9820  10.239218
9    9.9820  10.760370
10  11.2056  11.111724
11  11.4471  11.487560
12  11.2700  11.702019
13  11.2700  11.819483
14  11.4471  11.961616
15  11.2700  11.943505
16  11.2700  11.336992
17  11.4471  11.179818
18  11.2056  10.251249
19  11.2056  10.155006
MSE: 13.576172376042203
Mean AbsoluteE: 2.967021091073733
mean_root_square_error:  0.7005439028691649
r2: 0.20572507444095223
r2_adj: 0.20561068264445376

Ponoviti postupak, koristeći sada model sa lasso regularizacijom.

In [58]:
# TODO
# TODO
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

kf_ridge = KFold(n_splits=5,shuffle=False) ## ,random_state=42
kf_ridge.get_n_splits(X_train)
# print(kf_ridge)
# print(kf_ridge.split(X_train))

# print(numeric_feats)

lasso_results = []
for split_id,(train_index,test_index) in enumerate(kf_ridge.split(X_train)):
  # print('train_index:',train_index, len(train_index))
  # print('test_index:',test_index, len(test_index))

  ## can not find the indexes in 'X_train' even though we split it. LIKE WHAT THE FUCK
  # train_df = X_train.loc[train_index,:]
  # test_df = y_train.loc[test_index,:]

  train_df_x = X.loc[train_index]
  test_df_x = X.loc[test_index]
  train_df_y = y.loc[train_index]
  test_df_y = y.loc[test_index]
  # print(train_index.shape)
  # print(train_df)
  # print(test_df)

  tmp_sc = StandardScaler()
  tmp_sc.fit(train_df_x[numeric_feats])
  # print(tmp_sc.mean_)

  train_df_x[numeric_feats] = tmp_sc.transform(train_df_x[numeric_feats])
  test_df_x[numeric_feats] = tmp_sc.transform(test_df_x[numeric_feats])
  # print(train_df_x[numeric_feats])

  my_lasso_model = Lasso(alpha=5, fit_intercept=True)
  my_lasso_model.fit(train_df_x, train_df_y)
  # print(my_lasso_model.coef_)

  my_y_predicted = my_lasso_model.predict(train_df_x)
  # print(my_y_predicted.shape)
  # print("--------------")
  # print(train_df_y.shape)
  # print("--------------")

  mean_squared_error(y_true=train_df_y,y_pred=my_y_predicted)
  mean_absolute_error(y_true=train_df_y,y_pred=my_y_predicted)
  r2_score(y_true=train_df_y,y_pred=my_y_predicted)
  my_res = model_evaluation(train_df_y,my_y_predicted,train_df_y.shape[0],train_df_x.shape[1])
  # print(train_df_y.shape[0],train_df_x.shape[1])
  # print("--------------")
  # print(my_res)
  # print("--------------")
  lasso_results.append(my_res)

mean_mse = 0
mean_mAe = 0
mean_root_square_error = 0
mean_r2 = 0
mean_r2_adj = 0
for result in lasso_results:
  mean_mse += result[0]
  mean_mAe += result[1]
  mean_root_square_error = result[2]
  mean_r2 += result[3]
  mean_r2_adj += result[4]

mean_mse /= len(ridge_results)
mean_mAe /= len(ridge_results)
mean_root_square_error /= len(ridge_results)
mean_r2 /= len(ridge_results)
mean_r2_adj /= len(ridge_results)

print("MSE:",mean_mse)
print("Mean AbsoluteE:",mean_mAe)
print("mean_root_square_error: ",mean_root_square_error)
print("r2:",mean_r2)
print("r2_adj:",mean_r2_adj)

del tmp_sc, train_df_x, train_df_y, test_df_x,  test_df_y , my_lasso_model, my_y_predicted, my_res, mean_mse, mean_mAe,mean_r2,mean_r2_adj
Mean squared error:  17.814230121898618
Mean absolute error:  3.1370521780775253
Root mean squared error:  4.220690716209685
R2 score:  0.0
R2 adjusted score:  -0.00014402166085769608
          y     y_pred
0    9.9820  10.280538
1   10.3523  10.280538
2   11.2056  10.280538
3    9.9820  10.280538
4   10.3523  10.280538
5    9.9820  10.280538
6    9.9820  10.280538
7   10.3523  10.280538
8    9.9820  10.280538
9    9.9820  10.280538
10  10.3523  10.280538
11   9.9820  10.280538
12  15.8263  10.280538
13  14.9569  10.280538
14   9.9820  10.280538
15   6.2951  10.280538
16  14.1680  10.280538
17   6.2951  10.280538
18   6.2951  10.280538
19   6.8425  10.280538
Mean squared error:  17.844877033035512
Mean absolute error:  3.1556347938495377
Root mean squared error:  4.224319712454955
R2 score:  0.0
R2 adjusted score:  -0.00014402166085769608
          y     y_pred
0   15.8263  10.276835
1   15.8263  10.276835
2   14.9569  10.276835
3   15.8263  10.276835
4   15.8263  10.276835
5   14.9569  10.276835
6    9.9820  10.276835
7    9.9820  10.276835
8    9.9820  10.276835
9    9.9820  10.276835
10  11.2056  10.276835
11  11.4471  10.276835
12  11.2700  10.276835
13  11.2700  10.276835
14  11.4471  10.276835
15  11.2700  10.276835
16  11.2700  10.276835
17  11.4471  10.276835
18  11.2056  10.276835
19  11.2056  10.276835
Mean squared error:  17.271173813038306
Mean absolute error:  3.1006709860106625
Root mean squared error:  4.155860177272366
R2 score:  0.0
R2 adjusted score:  -0.00014401958666376835
          y     y_pred
0   15.8263  10.474373
1   15.8263  10.474373
2   14.9569  10.474373
3   15.8263  10.474373
4   15.8263  10.474373
5   14.9569  10.474373
6    9.9820  10.474373
7    9.9820  10.474373
8    9.9820  10.474373
9    9.9820  10.474373
10  11.2056  10.474373
11  11.4471  10.474373
12  11.2700  10.474373
13  11.2700  10.474373
14  11.4471  10.474373
15  11.2700  10.474373
16  11.2700  10.474373
17  11.4471  10.474373
18  11.2056  10.474373
19  11.2056  10.474373
Mean squared error:  17.309572317712913
Mean absolute error:  3.0630281759525784
Root mean squared error:  4.160477414638002
R2 score:  0.0
R2 adjusted score:  -0.00014401958666376835
          y     y_pred
0   15.8263  10.166142
1   15.8263  10.166142
2   14.9569  10.166142
3   15.8263  10.166142
4   15.8263  10.166142
5   14.9569  10.166142
6    9.9820  10.166142
7    9.9820  10.166142
8    9.9820  10.166142
9    9.9820  10.166142
10  11.2056  10.166142
11  11.4471  10.166142
12  11.2700  10.166142
13  11.2700  10.166142
14  11.4471  10.166142
15  11.2700  10.166142
16  11.2700  10.166142
17  11.4471  10.166142
18  11.2056  10.166142
19  11.2056  10.166142
Mean squared error:  15.262858728146083
Mean absolute error:  2.7558326699385938
Root mean squared error:  3.906770882473924
R2 score:  0.0
R2 adjusted score:  -0.00014401958666376835
          y     y_pred
0   15.8263  10.009671
1   15.8263  10.009671
2   14.9569  10.009671
3   15.8263  10.009671
4   15.8263  10.009671
5   14.9569  10.009671
6    9.9820  10.009671
7    9.9820  10.009671
8    9.9820  10.009671
9    9.9820  10.009671
10  11.2056  10.009671
11  11.4471  10.009671
12  11.2700  10.009671
13  11.2700  10.009671
14  11.4471  10.009671
15  11.2700  10.009671
16  11.2700  10.009671
17  11.4471  10.009671
18  11.2056  10.009671
19  11.2056  10.009671
MSE: 17.100542402766287
Mean AbsoluteE: 3.0424437607657793
mean_root_square_error:  0.7813541764947848
r2: 0.0
r2_adj: -0.00014402041634133944
  • Uporediti performanse ova dva modela. Da li je korektno poređenje na ovaj način i zašto?

Poređenje je korektno pod uslovom da:

  • Testni skup je isti za oba modela.
  • Podaci su prethodno standardizovani (posebno važno za Lasso i Ridge).
  • Koristiš iste metrike i mernu skalu.

R² i R²_adj:

  • Ridge: 0.2057 (znači da model objašnjava ~20.6% varijanse u podacima)
  • Lasso: ≈ 0, što implicira da model nije bolji od proste srednje vrednosti. Sve ostale vrijednosti greske su manje od Lasso-a pa uzimamo Ridge

Za odabrani model, izvršiti ponovo treniranje koristeći inicijalni trening skup.

In [59]:
# TODO
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error


# print(numeric_feats)

train_df_x = X_train
test_df_x = X_test
train_df_y = y_train
test_df_y = y_test
# print(train_index.shape)
# print(train_df)
# print(test_df)

tmp_sc = StandardScaler()
tmp_sc.fit(train_df_x[numeric_feats])
# print(tmp_sc.mean_)

train_df_x[numeric_feats] = tmp_sc.transform(train_df_x[numeric_feats])
test_df_x[numeric_feats] = tmp_sc.transform(test_df_x[numeric_feats])
# print(train_df_x[numeric_feats])

my_ridge_model = Ridge(alpha=5, fit_intercept=True)
my_ridge_model.fit(train_df_x, train_df_y)
# print(my_ridge_model.coef_)

my_y_predicted = my_ridge_model.predict(train_df_x)
# print(my_y_predicted.shape)
# print("--------------")
# print(train_df_y.shape)
# print("--------------")

my_res = model_evaluation(train_df_y,my_y_predicted,train_df_y.shape[0],train_df_x.shape[1])
# print(train_df_y.shape[0],train_df_x.shape[1])
# print("--------------")
print(my_res)
print("--------------")

my_y_predicted = my_ridge_model.predict(test_df_x)
# print(my_y_predicted.shape)
# print("--------------")
# print(train_df_y.shape)
# print("--------------")

my_res = model_evaluation(test_df_y,my_y_predicted,test_df_y.shape[0],test_df_x.shape[1])
# print(train_df_y.shape[0],train_df_x.shape[1])
# print("--------------")
print(my_res)
print("--------------")

del tmp_sc, train_df_x, train_df_y, test_df_x,  test_df_y , my_ridge_model, my_y_predicted, my_res
Mean squared error:  13.878578644356596
Mean absolute error:  3.0340208667273254
Root mean squared error:  3.7253964412336837
R2 score:  0.20853077845901935
R2 adjusted score:  0.20843959116680066
          y     y_pred
0   10.2557  11.696899
1   15.8263  11.042516
2   11.2700  13.786124
3   11.2700  10.471661
4    9.9820   9.913415
5    3.3327   9.250926
6   15.8263  11.733761
7   15.5526  10.730600
8    1.4812   6.762791
9    9.9820   9.882093
10   6.7781   9.104298
11   7.9051   9.225040
12  15.5526  14.171902
13   5.5706   8.885414
14  16.1000  12.227292
15   0.1610   5.931609
16   0.1610   5.967857
17   6.1180   7.114873
18  11.2700  11.389195
19  16.1000  10.850203
(np.float64(13.878578644356596), np.float64(3.0340208667273254), np.float64(3.7253964412336837), np.float64(0.20853077845901935), np.float64(0.20843959116680066))
--------------
Mean squared error:  14.020800010360787
Mean absolute error:  3.055845320516795
Root mean squared error:  3.7444358734475327
R2 score:  0.21754979634813032
R2 adjusted score:  0.2167377048030843
          y     y_pred
0   15.5526  12.945527
1    9.9820   9.740490
2    9.6278  10.706926
3    8.0500  10.252578
4    7.5509   6.344797
5   13.7977   9.030272
6    9.9820  13.091452
7   11.2700  10.721259
8    4.5080   6.938428
9    8.4203   9.517242
10   8.0983  10.340601
11  10.3523  11.267687
12   3.4937   5.366064
13   9.9820  10.034060
14  16.1000  12.207746
15   0.0000   7.806896
16   8.0500   9.695141
17   9.9820  11.428267
18   0.5313   9.256618
19  11.2700   9.727803
(np.float64(14.020800010360787), np.float64(3.055845320516795), np.float64(3.7444358734475327), np.float64(0.21754979634813032), np.float64(0.2167377048030843))
--------------

Primeniti model nad test uzorcima i izvršiti evaluaciju modela.

In [60]:
# TODO
## done in the above one

Dodatak: Uraditi postupak krosvalidacije koristeći cross_val_score. Koristeći klasu Pipeline obezbediti standardizaciju nad trening i test foldovima.

In [61]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline

s = StandardScaler()
model_ridge = Ridge(alpha=5)
cv = KFold(n_splits = 5, random_state = 42, shuffle=True)

pipeline = Pipeline([('scaler', s), ('LinReg', model_ridge)])
scores = cross_val_score(pipeline, X_train, y_train, cv = cv, scoring='r2')
In [62]:
print(f'r2: {scores}')
r2: [0.21452231 0.21133761 0.20963071 0.20078304 0.20420529]

Selekcija obeležja¶

Najjednostavniji način selekcije obeležja baziran je na iterativnom procesu dodavanja najboljeg (selekcija unapred) ili oduzimanja najmanje dobrog (selekcija unazad) obeležja dok se ne zadrži ciljni broj obeležja koja će se koristiti za predikciju. U tom iterativnom procesu pravi se mnoštvo modela koji se međusobno porede prema zahtevanom skoru dobijenom unakrsnom validacijom na k podskupova.

II model¶

  • Podela skupa podataka na trening i test skup
  • Standardizacija obeležja
  • Evaluacija modela spram unakrsne validacije sa selekcijom obeležja za obuku modela sa osnovnom hipotezom \begin{equation} y=b_0+b_1x_1+b_2x_2+...b_dx_d\end{equation}
  • Obuka konačnog modela
  • Primena modela na test uzorke
  • Evaluacija modela
In [63]:
# podela skupa na trening (90%) i test (10%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Koristeći klasu sklearn.feature_selection.SequentialFeatureSelector kreirati model za selekciju obeležja inicijalizovanog modela linearne regresije sa osnovnom hipotezom. Odrediti vrednost $R^2$ skora za zahtevani broj od 2 selektovana obeležja koristeći unakrsnu validaciju sa 5 foldova.

Uporediti rezultate sa selekcijom unapred i unazad.

In [ ]:
# TODO
from sklearn.feature_selection import SequentialFeatureSelector

pipeline = Pipeline([('scaler', s), ('LinReg', LinearRegression())])
cv = KFold(n_splits = 5, random_state = 42, shuffle=True)

sfs = SequentialFeatureSelector(pipeline, n_features_to_select = 2, direction='forward', scoring='r2', cv=cv)
sfs.fit(X_train, y_train)
forward_features = sfs.get_feature_names_out()
print(f"odabrana obelezja, forward: {forward_features}, score: {cross_val_score(pipeline, X_train[forward_features], y_train, cv = cv, scoring='r2').mean()}")

sfs = SequentialFeatureSelector(pipeline, n_features_to_select = 2, direction='backward', scoring='r2', cv=cv)
sfs.fit(X_train, y_train)
backward_features = sfs.get_feature_names_out()
print(f"odabrana obelezja, backward: {backward_features}, score: {cross_val_score(pipeline, X_train[backward_features], y_train, cv = cv, scoring='r2').mean()}")
odabrana obelezja, forward: ['Temperature (C)' 'Humidity'], score: 0.17749488981647626

Za odabrana obeležja obučiti finalni model i izvršiti evaluaciju nad test skupom.

In [ ]:
# standardizacija spram trening seta
s = StandardScaler()
s.fit(X_train[numeric_feats])
X_train[numeric_feats] = s.transform(X_train[numeric_feats])
X_test[numeric_feats] = s.transform(X_test[numeric_feats])
In [ ]:
model.fit(X_train[backward_features], y_train)
y_predicted = model.predict(X_test[backward_features])
model_evaluation(y_test, y_predicted, X_train[backward_features].shape[0], X_train[backward_features].shape[1])

plt.figure(figsize=(10,5))
plt.bar(model.feature_names_in_,model.coef_, label='osnovna hipoteza')
plt.legend()
plt.xticks(rotation=45, ha="right")
print(f'coef: {model.coef_}')
print(f'intercept: {model.intercept_}')
plt.show()

Postoje i mnogi drugi načini selekcije obeležja, računanjem važnosti obeležja/međuinformacije, oslanjanjem na predviđene koeficijente modela, i dr. Svaka metoda može se samostalno implementirati, a za mnoge postoje gotove funkcije u okviru biblioteke sklearn.

Isprobavanje različitih hipoteza¶

  • Ima li potrebe za korišćenjem drugačije hipoteze? Kako dolazimo do ovog odgovora? Kada se uvode interakcije između obeležja u hipotezu?

Prikazati heatmapu korelacije između numeričkih obeležja. Za koje ima smisla da se vrši kombinovanje?

In [ ]:
corr = X_train[numeric_feats].corr()
f = plt.figure(figsize=(12, 9))
sb.heatmap(corr.abs(), annot=True);
  • Podela skupa podataka na trening, validacioni i test skup
  • Standardizacija obeležja
  • Obuka modela sa novom hipotezom i regularizacijom
  • Primena modela na uzorke iz validacionog skupa
  • Evaluacija modela spram validacije
In [ ]:
# podela skupa na trening (80%), val(10%) i test (10%)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

# standardizacija spram trening seta
s = StandardScaler()
s.fit(X_train[numeric_feats])
X_train[numeric_feats] = s.transform(X_train[numeric_feats])
X_val[numeric_feats] = s.transform(X_val[numeric_feats])
X_test[numeric_feats] = s.transform(X_test[numeric_feats])

III model - hipoteza sa interakcijama drugog stepena (bez kvadrata)¶

\begin{equation} y=b_0+b_1x_1+b_2x_2+b_3x_3+...+b_dx_d+c_1x_1x_2+c_2x_1x_3+c_3x_2x_3+...\end{equation}

Koristeći klasu sklearn.preprocessing.PolynomialFeatures kreirati novi set obeležja koji posmatra interakcije drugog stepena (bez kvadrata).

Komentarisati ulogu degree, interaction_only i include_bias parametara.

In [ ]:
# TODO
## PolynomialFeatures(degree=2,interaction_only=True)
## interaction_only=True ne generise x_i^2 nego samo x_i*x_j
## include_bias=True dodaj bias tj. slobodni clan

from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_feat1 = PolynomialFeatures(degree=2,interaction_only=True, include_bias=True)
X_train_poly_feat1 = poly_feat1.fit_transform(X_train)
print(X_train_poly_feat1.shape)

Koristeći metodu get_feature_names_out utvrditi koja su novodobijena obeležja i njihov redosled.

In [ ]:
# TODO
poly_feat1.get_feature_names_out()

Sa novim obeležjima izvršiti obuku a potom i evaluaciju spram uzoraka iz validacionog skupa.

In [ ]:
# TODO
## should I put the scaler?
my_pipeline1 = Pipeline([
    # ('scaler',s),
    ('polyFeat',poly_feat1),
    ('LinReg', LinearRegression()) ])
my_pipeline1.fit(X_train, y_train)
y_pred1 = my_pipeline1.predict(X_val)
print(my_pipeline1['LinReg'].coef_.shape)
print(y_pred1)
# del y_pred1
# X_train[numeric_feats] = s.transform(X_train[numeric_feats])
# X_val[numeric_feats] = s.transform(X_val[numeric_feats])
# X_test[numeric_feats] = s.transform(X_test[numeric_feats])

III model - hipoteza interakcijama drugog stepena¶

\begin{equation} y=b_0+b_1x_1+b_2x_2+b_3x_3+...+b_dx_d+c_1x_1x_2+c_2x_1x_3+c_3x_2x_3+...+d_1x_1^2+d_2x_2^2+d_3x_3^2+...+d_nx_n^2\end{equation}

Ponoviti prethodne korake, koristeći novu hipotezu.

In [ ]:
# TODO
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_feat2 = PolynomialFeatures(degree=2,interaction_only=False, include_bias=True)
X_train_poly_feat2 = poly_feat2.fit_transform(X_train)
print(X_train_poly_feat2.shape)

# poly_feat2.get_feature_names_out()

## should I put the scaler?
my_pipeline2 = Pipeline([
    # ('scaler',s),
    ('polyFeat',poly_feat2 ),
    ('LinReg', LinearRegression()) ])

my_pipeline2.fit(X_train, y_train)
y_pred2 = my_pipeline2.predict(X_val)
print(my_pipeline2['LinReg'].coef_.shape)
print(y_pred2)
# del y_pred2
# X_train[numeric_feats] = s.transform(X_train[numeric_feats])
# X_val[numeric_feats] = s.transform(X_val[numeric_feats])
# X_test[numeric_feats] = s.transform(X_test[numeric_feats])

III model - hipoteza sa interakcijama trećeg stepena¶

\begin{equation} y=b_0+b_1x_1+b_2x_2+b_3x_3+...+b_dx_d+c_1x_1x_2+c_2x_1x_3+c_3x_2x_3+...+d_1x_1^2+d_2x_2^2+d_3x_3^2+...+d_nx_n^2+e_1x_1^3+e_2x_2^3+e_3x_3^3+...+d_nx_n^3\end{equation}

In [ ]:
# TODO
# TODO
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_feat3 = PolynomialFeatures(degree=3,interaction_only=False, include_bias=True)
X_train_poly_feat3 = poly_feat3.fit_transform(X_train)
print(X_train_poly_feat3.shape)

# poly_feat3.get_feature_names_out()

## should I put the scaler?
my_pipeline3 = Pipeline([
    # ('scaler',s),
    ('polyFeat',poly_feat3 ),
    ('LinReg', LinearRegression()) ])

my_pipeline3.fit(X_train, y_train)
y_pred3 = my_pipeline3.predict(X_val)
print(my_pipeline3['LinReg'].coef_.shape)
print(y_pred3)
# del y_pred
# X_train[numeric_feats] = s.transform(X_train[numeric_feats])
# X_val[numeric_feats] = s.transform(X_val[numeric_feats])
# X_test[numeric_feats] = s.transform(X_test[numeric_feats])

Koja mera evaluacije bi bila najadekvatnija za poređenje i zašto?

Za odabranu hipotezu, izvršiti

  • Obuka konačnog modela
  • Primena modela na test uzorke
  • Evaluacija modela

Za mjeru bi bila najbolja R2_adjusted jer on kazni modele sa previse feature-a

In [ ]:
# TODO
# TODO
from sklearn.feature_selection import SequentialFeatureSelector

# cv = KFold(n_splits = 5, random_state = 42, shuffle=True)
# # print(f"model1 r2_score: {cross_val_score(pipeline, X_train[forward_features], y_train, cv = cv, scoring='r2').mean()}")
# print(f"model1 r2_score: {cross_val_score(my_pipeline1, X_train_poly_feat1, y_train, cv = cv, scoring='r2').mean()}")
# print(f"model2 r2_score: {cross_val_score(my_pipeline2, X_train_poly_feat2, y_train, cv = cv, scoring='r2').mean()}")
# print(f"model3 r2_score: {cross_val_score(my_pipeline3, X_train_poly_feat3, y_train, cv = cv, scoring='r2').mean()}")
In [ ]:
# print(f"model3 r2_score: {cross_val_score(my_pipeline3, X_train_poly_feat3, y_train, cv = cv, scoring='r2').mean()}")

##
X_val_poly_feat1 = poly_feat1.transform(X_val)
X_val_poly_feat2 = poly_feat2.transform(X_val)
X_val_poly_feat3 = poly_feat3.transform(X_val)

y_val_poly_feat1_predicted = my_pipeline1.predict(X_val)
y_val_poly_feat2_predicted = my_pipeline2.predict(X_val)
y_val_poly_feat3_predicted = my_pipeline3.predict(X_val)
# y_val_poly_feat3_predicted = my_pipeline3.predict(X_val_poly_feat3)

print(X_val_poly_feat1.shape)
print(X_val_poly_feat2.shape)
print(X_val_poly_feat3.shape)
# print(my_pipeline1['linReg'].coef_.shape[0])
model_evaluation(y_test=y_val, y_predicted=y_val_poly_feat1_predicted, N=X_val_poly_feat1.shape[0],  d=X_val_poly_feat1.shape[1])
# Mean squared error:  11.629556853227895
# Mean absolute error:  2.796682534744267
# Root mean squared error:  3.410213608152412
# R2 score:  0.3520771740013281
# R2 adjusted score:  0.34829289383279183
print("--------------------------")
model_evaluation(y_test=y_val, y_predicted=y_val_poly_feat2_predicted, N=X_val_poly_feat2.shape[0],  d=X_val_poly_feat2.shape[1])
# Mean squared error:  10.718202047577192
# Mean absolute error:  2.7256805994361852
# Root mean squared error:  3.273866528674801
# R2 score:  0.40285190158699524
# R2 adjusted score:  0.3987370786077451
print("--------------------------")
model_evaluation(y_test=y_val, y_predicted=y_val_poly_feat3_predicted, N=X_val_poly_feat3.shape[0],  d=X_val_poly_feat3.shape[1])
# Mean squared error:  9.830574762996207
# Mean absolute error:  2.594140518624325
# Root mean squared error:  3.135374740441118
# R2 score:  0.4523046869267574
# R2 adjusted score:  0.4355659757129353
print("--------------------------")
# X_train[numeric_feats] = s.transform(X_train[numeric_feats])
# X_val[numeric_feats] = s.transform(X_val[numeric_feats])
# X_test[numeric_feats] = s.transform(X_test[numeric_feats])

III model - naša hipoteza¶

Kreirati model posmatranjem interakcija obeležja koji imaju međkorelaciju $|corr| > 0.3$. Kako se ponaša model spram prethodnih.

In [ ]:
# TODO
# print(X_train.corr())
sb.heatmap(X_train.corr(),annot=True,fmt=".2f")
plt.show()
# print(X_train.columns)
# 'Temperature (C)','Humidity','Precip Type_snow','Precip Type_rain','Pressure (millibars)','Apparent Temperature (C)'
# kolone bez |corr| > 0.3: 'Wind bearing cat_N', 'Wind Speed (km/h)',

# sb.heatmap(X.corr(),annot=True,fmt=".2f")
# plt.show()
# print(X.columns)
  • Međukorelacija $|corr|>0.3$ znači da su osobine zavisne, tj. postoji multikolinearnost.
  • Ove interakcije nisu nezavisne, što može praviti probleme u linearnim modelima.

Kako se ponaša model?

  1. Linearni regresioni model bez regularizacije (npr. LinearRegression)
  • Multikolinearnost → koeficijenti mogu postati nestabilni i neinterpretabilni.
  • Male promene u podacima mogu drastično promeniti vrednosti koeficijenata.
  • Model može imati visoku varijansu i lošu generalizaciju.
  • Iako R² može biti visok, model loše generalizuje.
  1. Model sa regularizacijom (npr. Ridge ili Lasso)
  • Regularizacija (posebno Ridge) smanjuje uticaj međukorelisanih osobina tako što smanjuje njihove koeficijente.
  • Ridge pomaže kada su osobine korelirane – deli "teret" među njima.
  • Lasso može isključiti (naterati na 0) neke interakcije ako nisu korisne.
  • Model ostaje stabilan čak i ako su osobine korelisane.
In [ ]:
# TODO
# from sklearn.feature_selection import SequentialFeatureSelector

# pipeline = Pipeline([('scaler', s), ('LinReg', LinearRegression())])
# cv = KFold(n_splits = 5, random_state = 42, shuffle=True)

# sfs = SequentialFeatureSelector(pipeline, n_features_to_select = 2, direction='forward', scoring='r2', cv=cv)
# sfs.fit(X_train, y_train)
# forward_features = sfs.get_feature_names_out()
# print(f"odabrana obelezja, forward: {forward_features}, score: {cross_val_score(pipeline, X_train[forward_features], y_train, cv = cv, scoring='r2').mean()}")

# sfs = SequentialFeatureSelector(pipeline, n_features_to_select = 2, direction='backward', scoring='r2', cv=cv)
# sfs.fit(X_train, y_train)
# backward_features = sfs.get_feature_names_out()
# print(f"odabrana obelezja, backward: {backward_features}, score: {cross_val_score(pipeline, X_train[backward_features], y_train, cv = cv, scoring='r2').mean()}")
  • Možemo li sada da primenimo selekciju obeležja? Da li to treba uraditi?

Da, apsolutno možeš i treba da primeniš selekciju obeležja (feature selection)

  • Smanjuje multikolinearnost → stabilniji i robusniji model
  • Smanjuje kompleksnost modela → brže treniranje, bolja generalizacija
  • Poboljšava interpretabilnost → lakše razumevanje odnosa ulaz–izlaz
  • Može poboljšati performanse na test skupu
  • Posebno važno ako koristiš nelinearne transformacije ili interakcije

Isprobati selekciju obeležja spram postavljenih modela sa različitim hipotezama. Uvesti ograničenje od 5 obeležja. Koji model ima najbolje performanse.

In [ ]:
# TODO
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

# s = StandardScaler()
# s.fit(X_train[numeric_feats])
# X_train[numeric_feats] = s.transform(X_train[numeric_feats])
# X_val[numeric_feats] = s.transform(X_val[numeric_feats])
# X_test[numeric_feats] = s.transform(X_test[numeric_feats])

# pipeline = Pipeline([
#     ('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)),
#     ('select', SelectKBest(score_func=f_regression, k=10)),
#     ('ridge', Ridge(alpha=1.0))
# ])
my_tmp_pipeline = Pipeline([('scaler', s), ('LinReg', LinearRegression())])
my_tmp_sfs = SequentialFeatureSelector(pipeline, n_features_to_select = 9, direction='forward', scoring='r2') ## , cv=cv
my_tmp_sfs.fit(X_train, y_train)

## n_features_to_select = 9: 'Wind Speed (km/h)' was the 1st to go

print(my_tmp_sfs.get_feature_names_out())
# print(X_train.columns)  ## 'Temperature (C)', 'Apparent Temperature (C)', 'Humidity','Wind Speed (km/h)', 'Pressure (millibars)', 'Precip Type_rain','Precip Type_snow', 'Wind bearing cat_N', 'Wind bearing cat_S','Wind bearing cat_W'
# print(len(X_train.columns)) ## 10

del my_tmp_pipeline, my_tmp_sfs
In [ ]:
!jupyter nbconvert --to html "/content/drive/MyDrive/Colab Notebooks/Linearna_Regresija_moja_rjesenja.ipynb"