Data Science And Machine Learning Project
Let's get started
Project short introduction - In India, yield prediction is a very important issue in agriculture. Every farmer wants to know how much yield to expect. In the past, yield prediction relied on the farmer's experience with particular fields and crops. Predicting yield from the available data remains an open problem, and machine learning and data science are well suited to it. Different data techniques are also used and evaluated in agriculture for estimating the coming year's crop production. In this project we propose and implement a system that predicts crop yield from previous agricultural data.
Our first step is data preprocessing. We need to understand, clean, and balance the data, and we will go through this step by step.
1) Load the real-life data. For this we need to import the pandas library.
import pandas as pd
raw_data = pd.read_csv('yield_df.csv')
| | Unnamed: 0 | Area | Item | Year | hg/ha_yield | average_rain_fall_mm_per_year | pesticides_tonnes | avg_temp |
|---|---|---|---|---|---|---|---|---|
| 3579 | 3579 | Brazil | Sorghum | 2000 | 15013 | 1761.0 | 140423.00 | 18.01 |
| 13673 | 13673 | India | Cassava | 2009 | 343433 | 1083.0 | 28707.01 | 24.87 |
| 21356 | 21356 | Nicaragua | Sorghum | 1995 | 20630 | 2280.0 | 876.00 | 27.20 |
*) It's a good habit to make a copy of the data.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28242 entries, 0 to 28241
Data columns (total 8 columns):
Unnamed: 0 28242 non-null int64
Area 28242 non-null object
Item 28242 non-null object
Year 28242 non-null int64
hg/ha_yield 28242 non-null int64
average_rain_fall_mm_per_year 28242 non-null float64
pesticides_tonnes 28242 non-null float64
avg_temp 28242 non-null float64
dtypes: float64(3), int64(3), object(2)
memory usage: 1.7+ MB
Item 0
Year 0
hg/ha_yield 0
average_rain_fall_mm_per_year 0
pesticides_tonnes 0
avg_temp 0
dtype: int64
No, there are no null values in any column, so we move on to the next step.
| | Item | Year | hg/ha_yield | average_rain_fall_mm_per_year | pesticides_tonnes | avg_temp | Area |
|---|---|---|---|---|---|---|---|
| count | 28242 | 28242.000000 | 28242.000000 | 28242.00000 | 28242.000000 | 28242.000000 | 28242.000000 |
| unique | 10 | NaN | NaN | NaN | NaN | NaN | NaN |
| top | Potatoes | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 4276 | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 2001.544296 | 77053.332094 | 1149.05598 | 37076.909344 | 20.542627 | 46.494724 |
| std | NaN | 7.051905 | 84956.612897 | 709.81215 | 59958.784665 | 6.312051 | 26.813405 |
| min | NaN | 1990.000000 | 50.000000 | 51.00000 | 0.040000 | 1.300000 | 0.000000 |
| 25% | NaN | 1995.000000 | 19919.250000 | 593.00000 | 1702.000000 | 16.702500 | 24.000000 |
| 50% | NaN | 2001.000000 | 38295.000000 | 1083.00000 | 17529.440000 | 21.510000 | 42.000000 |
| 75% | NaN | 2008.000000 | 104676.750000 | 1668.00000 | 48687.880000 | 26.000000 | 68.000000 |
| max | NaN | 2013.000000 | 501412.000000 | 3240.00000 | 367778.000000 | 30.650000 | 100.000000 |
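The inspection steps above (copy, type check, null check, summary statistics) can be sketched on a tiny hand-made frame. This is only a sketch: the three rows are the sample rows shown earlier, and using them in place of yield_df.csv is an assumption so that the snippet runs without the file. The name data for the copy is also just illustrative.

```python
import pandas as pd

# Tiny stand-in for yield_df.csv (assumption: same column names as the real file).
raw_data = pd.DataFrame({
    "Area": ["Brazil", "India", "Nicaragua"],
    "Item": ["Sorghum", "Cassava", "Sorghum"],
    "Year": [2000, 2009, 1995],
    "hg/ha_yield": [15013, 343433, 20630],
    "average_rain_fall_mm_per_year": [1761.0, 1083.0, 2280.0],
    "pesticides_tonnes": [140423.00, 28707.01, 876.00],
    "avg_temp": [18.01, 24.87, 27.20],
})

# Work on a copy so the loaded data stays untouched.
data = raw_data.copy()

# Column types and non-null counts.
data.info()

# Number of missing values per column.
null_counts = data.isnull().sum()
print(null_counts)

# Summary statistics for numeric and text columns together.
summary = data.describe(include="all")
print(summary)
```

Because copy() makes an independent copy by default, later changes to data do not alter raw_data.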
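*) Now we will check how strongly the features are correlated with one another (multicollinearity).
For this we use statsmodels and compute the variance_inflation_factor (VIF) of each feature. A VIF close to 1 means the feature is nearly independent of the others, and values above about 10 are commonly read as strong multicollinearity.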
| | VIF | Features |
|---|---|---|
| 0 | 1.907311 | hg/ha_yield |
| 1 | 13.650550 | Year |
| 2 | 1.491602 | pesticides_tonnes |
| 3 | 4.069444 | average_rain_fall_mm_per_year |
| 4 | 13.157713 | avg_temp |
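To make the numbers in the VIF table easier to read, here is a minimal sketch of what a VIF is: each feature is regressed on all the other features, and VIF = 1 / (1 - R^2). The sketch uses plain NumPy rather than statsmodels so that it runs without extra packages; the helper name vif and the synthetic columns a, b, c are assumptions for illustration and are not the code used for the table. One difference to keep in mind: this sketch always adds an intercept, whereas statsmodels' variance_inflation_factor, as far as I know, uses the columns exactly as passed, so the two can give different values.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

In this example the two nearly identical columns get large VIFs, while the independent column stays close to 1.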
array([[-0.47156037, 0.47905342, -0.66263568, -0.67562489, -1.75232436],
[-0.08594389, 0.47905342, -0.66263568, -0.67562489, -1.75232436],
[-0.64195323, 0.47905342, -0.66263568, -0.67562489, -1.75232436],
...,
[-0.77271178, -0.68151298, -0.61482275, -0.13560958, 2.00046897],
[-0.65620823, -0.68151298, -0.61482275, -0.13560958, 2.00046897],
[-0.64766293, -0.68151298, -0.61482275, -0.13560958, 2.00046897]])
*) Now, split the data.
from sklearn.model_selection import train_test_split
# declare 4 variables for the split
x_train, x_test, y_train, y_test = train_test_split(input_scaled, targets, test_size = 0.2, random_state = 20)
test_size - the fraction of the data held out for testing (0.2 means 20%). random_state - the seed for the shuffle, so the same shuffled split is reproduced on every run.
Now we have train and test data.
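The effect of the two parameters can be checked on a small made-up array. This is a sketch only: X and y below are synthetic stand-ins, not input_scaled and targets from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 samples, 2 features, binary labels.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# test_size=0.2 holds out 20% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20)
print(x_train.shape, x_test.shape)

# Calling it again with the same random_state reproduces the same split.
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=20)
same_split = np.array_equal(x_test, x_test2)
print(same_split)
```

Without a fixed random_state the rows are still shuffled, but the split changes from run to run, which makes scores harder to compare.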
*) Now everything is ready. It is time to apply an algorithm to our training data.
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier()
clf.fit(x_train,y_train)
Here we use a classification algorithm. I chose DecisionTreeClassifier after comparing it with other algorithms, because it gave the highest accuracy on our data. The accuracy on the training data is very high:
clf.score(x_train,y_train)
0.9996836301184128
Testing on the test data:
prediction = clf.predict(x_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, prediction)
0.8550253073029646
The test accuracy is clearly lower than the training accuracy, which is a sign of overfitting.
Now we are at the last but very important step.
Post pruning decision trees with cost complexity pruning
The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.
path = clf.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
Next, we train one decision tree for each effective alpha. The last value in ccp_alphas prunes the whole tree and leaves a trivial tree with only one node, so for the remainder of this example we remove the last element in clfs and ccp_alphas. The number of nodes and the tree depth decrease as alpha increases.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(x_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
# Drop the trivial one-node tree and its alpha, as described above.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
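The claim that the number of nodes and the depth shrink as alpha grows can be checked directly. The sketch below is self-contained and uses the breast cancer dataset bundled with scikit-learn as a stand-in, which is an assumption: it is not the yield data, and the names X_tr, trees, node_counts and depths are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption: used only so the sketch runs on its own).
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Effective alphas at which subtrees get pruned.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
# Clip in case rounding produces a tiny negative alpha.
alphas = np.clip(path.ccp_alphas, 0, None)

# One tree per alpha, all grown from the same seed.
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
         for a in alphas]

node_counts = [t.tree_.node_count for t in trees]
depths = [t.tree_.max_depth for t in trees]
for a, n, d in zip(alphas, node_counts, depths):
    print(f"alpha={a:.5f}  nodes={n}  depth={d}")
```

The last row corresponds to the largest alpha, where the tree is pruned back to its root.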
Accuracy vs alpha for training and testing sets
When ccp_alpha is set to zero and the other parameters of DecisionTreeClassifier keep their default values, the tree overfits, leading to a 100% training accuracy and 82% testing accuracy. As alpha increases, more of the tree is pruned, thus creating a decision tree that generalizes better. In this example, setting ccp_alpha=0.000026 maximizes the testing accuracy.
import matplotlib.pyplot as plt
# accuracy of every pruned tree on the training and the test data
train_scores = [clf.score(x_train, y_train) for clf in clfs]
test_scores = [clf.score(x_test, y_test) for clf in clfs]
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
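Instead of reading the best alpha off the plot, it can also be picked in code. The sketch below is one possible variant, not the procedure used above: it holds out a separate validation set for choosing alpha, so that the test set is only used once at the end. It again uses the bundled breast cancer dataset as a stand-in, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Three-way split: 60% train, 20% validation, 20% test.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)

# Candidate alphas from the pruning path; drop the trivial one-node tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
alphas = np.clip(path.ccp_alphas, 0, None)[:-1]

# Score every candidate on the validation set only.
val_scores = []
for a in alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
    val_scores.append(tree.score(X_val, y_val))

# Keep the alpha with the best validation accuracy, then test once.
best_alpha = alphas[int(np.argmax(val_scores))]
final = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(
    X_train, y_train)
test_acc = final.score(X_test, y_test)
print(best_alpha, test_acc)
```

Choosing alpha on the test set itself tends to make the reported test accuracy look better than it would be on new data, which is why the sketch keeps the two apart.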
clf = DecisionTreeClassifier(random_state=0, ccp_alpha=0.000026)
clf.fit(x_train,y_train)
Now, check the accuracy again.
from sklearn.metrics import accuracy_score
pred = clf.predict(x_test)
accuracy_score(y_test, pred)
That is how we reduce overfitting in our model.
Now, I have made a web application with Django using the given data.
GitHub Link - https://github.com/ArpitaG10/ML_-_DS_project/tree/master
Four pages:
1) Login
2) Registration
3) Input field page
4) Predicted output page
Thank you

