Advanced and Effective Application for Farmers to Predict Yield of Crop

Data Science And Machine Learning Project

Let's get started

Project short introduction - In India, yield prediction is a very important issue in agriculture. Every farmer wants to know how much yield to expect. In the past, yield prediction relied on the farmer's experience with particular fields and crops. Predicting yield from the available data remains an open problem, and machine learning and data science are well suited to it. Various data techniques are used and evaluated in agriculture for estimating the coming year's crop production. This project proposes and implements a system that predicts crop yield from historical agricultural data.

So, our first step is data preprocessing. We need to understand, clean, and balance the data, and we will go through this step by step.

1) Load the real-life data. For this we need to import the pandas library.

import pandas as pd

raw_data = pd.read_csv('yield_df.csv')

Unnamed: 0	Area	Item	Year	hg/ha_yield	average_rain_fall_mm_per_year	pesticides_tonnes	avg_temp
3579	3579	Brazil	Sorghum	2000	15013	1761.0	140423.00	18.01
13673	13673	India	Cassava	2009	343433	1083.0	28707.01	24.87
21356	21356	Nicaragua	Sorghum	1995	20630	2280.0	876.00	27.20

*) It's a good habit to make a copy of the data.

data = raw_data.copy()

*) Let's check the info inside the data

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28242 entries, 0 to 28241
Data columns (total 8 columns):
Unnamed: 0                       28242 non-null int64
Area                             28242 non-null object
Item                             28242 non-null object
Year                             28242 non-null int64
hg/ha_yield                      28242 non-null int64
average_rain_fall_mm_per_year    28242 non-null float64
pesticides_tonnes                28242 non-null float64
avg_temp                         28242 non-null float64
dtypes: float64(3), int64(3), object(2)
memory usage: 1.7+ MB

Now we know how many entries the data has. We can also see that the 'Unnamed: 0' feature is clearly not required, so we will drop it together with Area.

data_drop_id = data.drop(['Unnamed: 0', 'Area'], axis=1)

*) Then we check whether there are any null values in the data.

Item                             0
Year                             0
hg/ha_yield                      0
average_rain_fall_mm_per_year    0
pesticides_tonnes                0
avg_temp                         0
dtype: int64

No, there are no null values, so we move on to the next step.
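The per-column null counts listed above are the kind of output that pandas produces with isnull().sum(). A minimal sketch on a small made-up frame (toy values, not the rows of yield_df.csv; the names toy, null_counts and toy_clean are only illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror two columns of yield_df.csv.
toy = pd.DataFrame({
    "Item": ["Maize", "Potatoes", None, "Sorghum"],
    "avg_temp": [16.4, np.nan, 24.9, 27.2],
})

# isnull() marks missing cells as True; sum() counts them per column.
null_counts = toy.isnull().sum()
print(null_counts)

# If any nulls were present, the affected rows could be dropped like this.
toy_clean = toy.dropna()
```

When every count is 0, as in our data, no rows need to be dropped.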

*) We can use the describe() method for a better understanding of the data

data_drop_id.describe(include='all')

From this method we can see something strange in the hg/ha_yield feature: the 'max' row shows a value far above the mean and the 75th percentile, which tells us that we have outliers in the data.

	Item	Year	hg/ha_yield	average_rain_fall_mm_per_year	pesticides_tonnes	avg_temp	Area
count	28242	28242.000000	28242.000000	28242.00000	28242.000000	28242.000000	28242.000000
unique	10	NaN	NaN	NaN	NaN	NaN	NaN
top	Potatoes	NaN	NaN	NaN	NaN	NaN	NaN
freq	4276	NaN	NaN	NaN	NaN	NaN	NaN
mean	NaN	2001.544296	77053.332094	1149.05598	37076.909344	20.542627	46.494724
std	NaN	7.051905	84956.612897	709.81215	59958.784665	6.312051	26.813405
min	NaN	1990.000000	50.000000	51.00000	0.040000	1.300000	0.000000
25%	NaN	1995.000000	19919.250000	593.00000	1702.000000	16.702500	24.000000
50%	NaN	2001.000000	38295.000000	1083.00000	17529.440000	21.510000	42.000000
75%	NaN	2008.000000	104676.750000	1668.00000	48687.880000	26.000000	68.000000
max	NaN	2013.000000	501412.000000	3240.00000	367778.000000	30.650000	100.000000

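We can see this in a graph using the seaborn library (which is built on matplotlib).

import seaborn as sns

# distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the newer replacement
sns.distplot(data_drop_id['hg/ha_yield'])

[Figure: distribution plot of hg/ha_yield]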

We can see the outliers on the right-hand side.

So, what do we do now?

# Let's declare a variable that will be equal to the 99th percentile of the 'hg/ha_yield' variable

q = data_drop_id['hg/ha_yield'].quantile(0.99)

# Then we can create a new df, with the condition that all yields must be below the 99th percentile of 'hg/ha_yield'

data_removed_outliers = data_drop_id[data_drop_id['hg/ha_yield']<q]

# In this way we have essentially removed the top 1% of the data about 'hg/ha_yield'

data_removed_outliers.describe(include='all')

We can see that 'max' now has a lower value than before. In the same way, we will remove the outliers of the other features too.
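Repeating the 99th-percentile filter for several features can be wrapped in a small helper. The sketch below is only an illustration: drop_top_percentile is a hypothetical function name, the toy values are made up, and applying the cutoffs one column after another is an assumption about how the step is repeated.

```python
import pandas as pd

def drop_top_percentile(df, columns, quantile=0.99):
    """Keep only rows strictly below the given quantile in each listed column.

    Hypothetical helper: cutoffs are applied one column after another,
    so each cutoff is computed on the rows that survived the previous one.
    """
    result = df
    for col in columns:
        cutoff = result[col].quantile(quantile)
        result = result[result[col] < cutoff]
    return result.reset_index(drop=True)

# Toy data (made-up values): 99 ordinary rows plus one extreme yield.
toy = pd.DataFrame({
    "hg/ha_yield": list(range(1, 100)) + [500000],
    "pesticides_tonnes": [float(i % 10) for i in range(100)],
})
trimmed = drop_top_percentile(toy, ["hg/ha_yield"])
print(toy["hg/ha_yield"].max(), trimmed["hg/ha_yield"].max())
```

Passing more column names in the list would trim each of those features in turn.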

*) Now we will check how strongly the features in our data are related to each other.

For this we use statsmodels and compute the variance inflation factor (VIF) of each feature.

VIF	Features
0	1.907311	hg/ha_yield
1	13.650550	Year
2	1.491602	pesticides_tonnes
3	4.069444	average_rain_fall_mm_per_year
4	13.157713	avg_temp

From this table we can see that Year and avg_temp have VIF values of about 13, while a VIF value should stay between 1 and 5. Of these two features we drop only Year, then check the VIF again and finalize the data.

With the VIF we are checking for multicollinearity between the features.
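A VIF table of this shape can be built with variance_inflation_factor from statsmodels. This is a sketch on synthetic data, not the exact code used for the table above: the columns x1, x2, x3 are made up, and no constant column is added, which is an assumption.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in for the numeric features (made-up data):
# x2 is almost a copy of x1, x3 is independent of both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
variables = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.05 * rng.normal(size=500),
    "x3": rng.normal(size=500),
})

# One VIF per column: how well that column is explained by all the others.
vif = pd.DataFrame({
    "VIF": [variance_inflation_factor(variables.values, i)
            for i in range(variables.shape[1])],
    "Features": variables.columns,
})
print(vif)
```

Here the two nearly identical columns get a large VIF while the independent one stays close to 1, which is the pattern used above to decide which feature to drop.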

Hence our data preprocessing is finished here. Now we will move to the model.

MODEL

*) Select the inputs and the target from the preprocessed data.

inputs = data_preprocessed.iloc[:,1:]

targets = data_preprocessed.iloc[:,:1]
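To make the two iloc slices concrete, here is a small sketch on a made-up frame. The column names and their order are assumptions for illustration only, not the layout of data_preprocessed.

```python
import pandas as pd

# Toy frame (made-up values); the column order here is an assumption.
toy = pd.DataFrame({
    "target": [0, 1, 0],
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10.0, 20.0, 30.0],
})

toy_inputs = toy.iloc[:, 1:]   # every row, all columns from the second one onward
toy_targets = toy.iloc[:, :1]  # every row, first column only (still a DataFrame)

print(list(toy_inputs.columns))
print(list(toy_targets.columns))
```

So whichever column comes first in the frame becomes the target, and all remaining columns become the inputs.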

*) Standardize the input data

from sklearn.preprocessing import StandardScaler

input_scaler = StandardScaler()

input_scaler.fit(inputs)

input_scaled = input_scaler.transform(inputs)

array([[-0.47156037,  0.47905342, -0.66263568, -0.67562489, -1.75232436],
       [-0.08594389,  0.47905342, -0.66263568, -0.67562489, -1.75232436],
       [-0.64195323,  0.47905342, -0.66263568, -0.67562489, -1.75232436],
       ...,
       [-0.77271178, -0.68151298, -0.61482275, -0.13560958,  2.00046897],
       [-0.65620823, -0.68151298, -0.61482275, -0.13560958,  2.00046897],
       [-0.64766293, -0.68151298, -0.61482275, -0.13560958,  2.00046897]])
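What standardization does to each column can be checked on a tiny made-up array. The values below are illustrative only; after scaling, every column should have mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up inputs with very different scales per column.
toy_inputs = np.array([[1.0, 1000.0],
                       [2.0, 3000.0],
                       [3.0, 5000.0]])

scaler = StandardScaler()
scaler.fit(toy_inputs)                 # learns each column's mean and standard deviation
scaled = scaler.transform(toy_inputs)  # (value - mean) / std, column by column

print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```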

*) Now, split the data

from sklearn.model_selection import train_test_split
# declare 4 variables for the split
x_train, x_test, y_train, y_test = train_test_split(input_scaled, targets, test_size = 0.2, random_state = 20)

test_size - the fraction of the data held out for testing (0.2 means 20%), and

random_state - the seed used to shuffle the data before splitting, so the same split is reproduced on every run.

Now we have train and test data.
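The effect of the two parameters can be seen on ten made-up samples. This is a small sketch with illustrative names (X, y, a_train, b_test), not part of the project code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with one feature each.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Same random_state -> identical split on every call.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=20)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=20)

print(len(a_train), len(a_test))
print(np.array_equal(a_test, b_test))
```

With test_size = 0.2, two of the ten samples land in the test set, and fixing random_state makes both calls return the same rows.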

*) Now everything is ready, so it is time to apply an algorithm to our training data.

from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier()
clf.fit(x_train,y_train)

Here we are using a classification algorithm. I chose DecisionTreeClassifier after comparing it with other algorithms, because it gave the highest accuracy on our data. The accuracy on the training data is very high:

clf.score(x_train,y_train)
0.9996836301184128

Now we test the model on the test data.

prediction = clf.predict(x_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, prediction)

0.8550253073029646

Now we come to the last, but very important, step.


Post pruning decision trees with cost complexity pruning
DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

path = clf.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

Next, we train one decision tree for each effective alpha. The last value in ccp_alphas is the alpha that prunes the whole tree, leaving the last tree, clfs[-1], with only one node.

clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(x_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
      clfs[-1].tree_.node_count, ccp_alphas[-1]))

Accuracy vs alpha for training and testing sets
When ccp_alpha is set to zero and the other parameters of DecisionTreeClassifier are kept at their defaults, the tree overfits, leading to a 100% training accuracy and 82% testing accuracy. As alpha increases, more of the tree is pruned, thus creating a decision tree that generalizes better. In this example, setting ccp_alpha=0.000026 maximizes the testing accuracy.

import matplotlib.pyplot as plt

train_scores = [clf.score(x_train, y_train) for clf in clfs]
test_scores = [clf.score(x_test, y_test) for clf in clfs]

fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
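Instead of reading the best alpha off the plot, it can also be picked programmatically. The sketch below runs on synthetic data from make_classification and scores on a separate validation split; both are assumptions for illustration, since the post itself scores on x_test. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for the real inputs and targets.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Effective alphas of the unpruned tree; drop the last one (single-node tree).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
# Clip guards against tiny negative alphas caused by floating-point error.
alphas = np.clip(path.ccp_alphas[:-1], 0, None)

# Fit one tree per alpha and score it on the validation split.
val_scores = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr).score(X_val, y_val)
    for a in alphas
]

# The alpha with the highest validation accuracy is the one to keep.
best_alpha = alphas[int(np.argmax(val_scores))]
print(best_alpha, max(val_scores))
```

Scoring on a validation split rather than the test set keeps the final test accuracy an honest estimate of how the pruned tree generalizes.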


clf = DecisionTreeClassifier(random_state=0, ccp_alpha=0.000026)
clf.fit(x_train,y_train)

Now, check the accuracy again
pred=clf.predict(x_test)
from sklearn.metrics import accuracy_score 
accuracy_score(y_test, pred)

That's how we keep our model from overfitting.



Now, I made a web application with Django using the given data.

GitHub Link - https://github.com/ArpitaG10/ML_-_DS_project/tree/master

Four pages:

Login
Registration
Input field page
Predicted output page









Thank you
