
Advanced and Effective Application for Farmers to Predict Yield of Crop

 Data Science And Machine Learning Project

Let's get started


Project short introduction - In India, yield prediction is a very important issue in agriculture. Every farmer wants to know how much yield to expect. In the past, yield prediction relied on a farmer's experience with particular fields and crops. Predicting yield from available data remains an open problem, and machine learning and data science are well suited to it. Different data techniques are used and evaluated in agriculture for estimating the coming year's crop production. In this project, we propose and implement a system that uses ML and data science to predict crop yield from previous agricultural data.

So, our first step is data preprocessing. We need to understand, clean, and balance the data, and we will go through this step by step.

1) Load the real-life data. For this we need to import the pandas library.

     import pandas as pd
     raw_data = pd.read_csv('yield_df.csv')

       Unnamed: 0       Area     Item  Year  hg/ha_yield  average_rain_fall_mm_per_year  pesticides_tonnes  avg_temp
3579         3579     Brazil  Sorghum  2000        15013                         1761.0          140423.00     18.01
13673       13673      India  Cassava  2009       343433                         1083.0           28707.01     24.87
21356       21356  Nicaragua  Sorghum  1995        20630                         2280.0             876.00     27.20


*) It's a good habit to make a copy of the data.
data = raw_data.copy()

*) Let's check the info inside the data

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28242 entries, 0 to 28241
Data columns (total 8 columns):
Unnamed: 0                       28242 non-null int64
Area                             28242 non-null object
Item                             28242 non-null object
Year                             28242 non-null int64
hg/ha_yield                      28242 non-null int64
average_rain_fall_mm_per_year    28242 non-null float64
pesticides_tonnes                28242 non-null float64
avg_temp                         28242 non-null float64
dtypes: float64(3), int64(3), object(2)
memory usage: 1.7+ MB


Now we know how many entries the data has. We can also see that the 'Unnamed: 0' feature is clearly not required, so we will drop it. The code below drops 'Area' as well.

data_drop_id = data.drop(['Unnamed: 0', 'Area'], axis=1)

*) Then we check whether there are any null values in the data.

Item                             0
Year                             0
hg/ha_yield                      0
average_rain_fall_mm_per_year    0
pesticides_tonnes                0
avg_temp                         0
dtype: int64
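
The per-column counts above are the kind of output pandas returns from isnull().sum(). The call itself is not shown in this post, so the following is only a minimal sketch on a tiny made-up frame (the values are illustrative and do not come from yield_df.csv):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "Item": ["Maize", "Potatoes", None],
    "Year": [1990, 1991, 1992],
    "avg_temp": [16.4, np.nan, 17.1],
})

# isnull() marks each missing cell as True; sum() then counts them per column
null_counts = demo.isnull().sum()
print(null_counts)
```

On the project data, the same call on data_drop_id would presumably give the all-zero result shown above.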

No, there are no null values, so we move on to the next step.

*) We can use the describe() method for a better understanding of the data

data_drop_id.describe(include='all')

From this output, we can see something strange in the hg/ha_yield feature: the 'max' row shows a value far above the 75% row, which suggests that we have outliers in the data.
            Item          Year    hg/ha_yield  average_rain_fall_mm_per_year  pesticides_tonnes      avg_temp          Area
count      28242  28242.000000   28242.000000                    28242.00000       28242.000000  28242.000000  28242.000000
unique        10           NaN            NaN                            NaN                NaN           NaN           NaN
top     Potatoes           NaN            NaN                            NaN                NaN           NaN           NaN
freq        4276           NaN            NaN                            NaN                NaN           NaN           NaN
mean         NaN   2001.544296   77053.332094                     1149.05598       37076.909344     20.542627     46.494724
std          NaN      7.051905   84956.612897                      709.81215       59958.784665      6.312051     26.813405
min          NaN   1990.000000      50.000000                       51.00000           0.040000      1.300000      0.000000
25%          NaN   1995.000000   19919.250000                      593.00000        1702.000000     16.702500     24.000000
50%          NaN   2001.000000   38295.000000                     1083.00000       17529.440000     21.510000     42.000000
75%          NaN   2008.000000  104676.750000                     1668.00000       48687.880000     26.000000     68.000000
max          NaN   2013.000000  501412.000000                     3240.00000      367778.000000     30.650000    100.000000


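We can see this in a graph using the seaborn library, which is built on matplotlib.

import seaborn as sns
# Note: distplot is deprecated in recent seaborn releases; histplot(..., kde=True) is the newer equivalent
sns.distplot(data_drop_id['hg/ha_yield'])

<matplotlib.axes._subplots.AxesSubplot at 0x1eaf80f6588>
We can see the outliers on the right-hand side.

So, what do we do now?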

# Let's declare a variable that will be equal to the 99th percentile of the 'hg/ha_yield' variable
q = data_drop_id['hg/ha_yield'].quantile(0.99)
# Then we can create a new df, keeping only the rows whose yield is below the 99th percentile of 'hg/ha_yield'
data_removed_outliers = data_drop_id[data_drop_id['hg/ha_yield']<q]
# In this way we have essentially removed the top 1% of the data about 'hg/ha_yield'
data_removed_outliers.describe(include='all')
We can see that 'max' now has a lower value than before. In the same way, we will remove outliers from the other features too.
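
The same 99th-percentile cut can be applied to several features in turn. The helper below is a hypothetical sketch (drop_top_percent is not part of the original project), run on made-up numbers. It filters one column after another, so each cutoff is computed on the rows that survived the previous step:

```python
import pandas as pd

def drop_top_percent(df, columns, quantile=0.99):
    """Keep rows whose value is below the given quantile in each listed column."""
    result = df
    for col in columns:
        # Cutoff is recomputed on the rows still remaining
        cutoff = result[col].quantile(quantile)
        result = result[result[col] < cutoff]
    return result.reset_index(drop=True)

# Made-up demo data: 100 rows with two numeric features
demo = pd.DataFrame({
    "hg/ha_yield": list(range(1, 101)),
    "pesticides_tonnes": [float(v) for v in range(100, 0, -1)],
})

trimmed = drop_top_percent(demo, ["hg/ha_yield", "pesticides_tonnes"])
print(len(demo), "->", len(trimmed))
```

Computing all cutoffs on the original frame first and then filtering once is an equally reasonable design; the two approaches can remove slightly different rows.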

*) Now we will check which features carry redundant information in our data.

To do this, we use statsmodels and compute the variance_inflation_factor (VIF) for each feature.

         VIF                       Features
0   1.907311                    hg/ha_yield
1  13.650550                           Year
2   1.491602              pesticides_tonnes
3   4.069444  average_rain_fall_mm_per_year
4  13.157713                       avg_temp

From this table we can see that Year and avg_temp each have a VIF of about 13. As a rule of thumb, a VIF between 1 and 5 is acceptable. Of the two, we drop only Year, then compute the VIF again and finalize the data.
The VIF check is how we detect multicollinearity between features.
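
To show what the VIF number means, here is a minimal sketch that computes the textbook definition, VIF_i = 1 / (1 - R_i^2), by hand with NumPy. It is not the project's code, which uses statsmodels and is not shown in this post. The vif helper and the synthetic columns are assumptions for illustration, and a library's values can differ depending on whether an intercept is included:

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of a 2-D array, using 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = []
    for i in range(n_cols):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        # Regress column i on the other columns plus an intercept
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        scores.append(1.0 / (1.0 - r_squared))
    return np.array(scores)

# Synthetic data: column c is almost a copy of column a, column b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

scores = vif(np.column_stack([a, b, c]))
print(scores.round(2))
```

The two nearly identical columns get a large VIF while the independent one stays close to 1, which is the pattern seen for Year and avg_temp in the table above.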

Hence our data preprocessing is finished. Now we will move on to the model.

MODEL


*) Select the input and target data from the preprocessed data.

# every column after the first is an input feature
inputs = data_preprocessed.iloc[:,1:]
# the first column is the target
targets = data_preprocessed.iloc[:,:1]

*) Standardize the input data

from sklearn.preprocessing import StandardScaler
input_scaler = StandardScaler()
input_scaler.fit(inputs)
input_scaled = input_scaler.transform(inputs)

array([[-0.47156037,  0.47905342, -0.66263568, -0.67562489, -1.75232436],
       [-0.08594389,  0.47905342, -0.66263568, -0.67562489, -1.75232436],
       [-0.64195323,  0.47905342, -0.66263568, -0.67562489, -1.75232436],
       ...,
       [-0.77271178, -0.68151298, -0.61482275, -0.13560958,  2.00046897],
       [-0.65620823, -0.68151298, -0.61482275, -0.13560958,  2.00046897],
       [-0.64766293, -0.68151298, -0.61482275, -0.13560958,  2.00046897]])
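
To make clear what standardization does, the sketch below checks on a small made-up array that StandardScaler subtracts each column's mean and divides by its standard deviation. The demo values are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up demo data: two columns on very different scales
demo = np.array([[1990.0, 50.0],
                 [2000.0, 150.0],
                 [2010.0, 250.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(demo)

# The same transformation written out by hand
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates merely because of its units.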

*) Now, split the data

from sklearn.model_selection import train_test_split
# declare 4 variables for the split
x_train, x_test, y_train, y_test = train_test_split(input_scaled, targets, test_size = 0.2, random_state = 20)

test_size - the fraction of the data held out for testing (0.2 means 20%).
random_state - the seed for the shuffle, so the same split is produced on every run.

Now we have the train and test data.
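
The effect of the two parameters can be checked on a toy array. This is a small sketch with made-up data, not the yield data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the rows
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=20)

# Using the same random_state again reproduces exactly the same split
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=20)

print(len(x_tr), len(x_te))
print(np.array_equal(y_te, y_te2))
```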

*) Now everything is ready, so it is time to fit an algorithm on our training data.

from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier()
clf.fit(x_train,y_train)

Here we use a classification algorithm, which treats each distinct yield value as a class label. I chose DecisionTreeClassifier after comparing it with other algorithms, because it gave the highest accuracy on our data.
The accuracy on the training data is very high:

clf.score(x_train,y_train)
0.9996836301184128

Now we evaluate on the test data.

from sklearn.metrics import accuracy_score
prediction = clf.predict(x_test)
# accuracy_score expects the true labels first, then the predictions
accuracy_score(y_test, prediction)

0.8550253073029646

Now we come to the last but very important step.


Post pruning decision trees with cost complexity pruning


DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.


path = clf.cost_complexity_pruning_path(x_train, y_train)

ccp_alphas, impurities = path.ccp_alphas, path.impurities

Next, we train one decision tree for each effective alpha. The last value in ccp_alphas prunes the whole tree, leaving the last tree, clfs[-1], as the trivial tree with only one node. The number of nodes and the tree depth decrease as alpha increases.


clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(x_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
      clfs[-1].tree_.node_count, ccp_alphas[-1]))


Accuracy vs alpha for training and testing sets

When ccp_alpha is set to zero and the other parameters of DecisionTreeClassifier are kept at their defaults, the tree overfits, leading to a 100% training accuracy and 82% testing accuracy. As alpha increases, more of the tree is pruned, which creates a decision tree that generalizes better. In this example, setting ccp_alpha=0.000026 maximizes the testing accuracy.


import matplotlib.pyplot as plt

# accuracy of every pruned tree on the training and test sets
train_scores = [clf.score(x_train, y_train) for clf in clfs]
test_scores = [clf.score(x_test, y_test) for clf in clfs]

fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
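
The alpha with the highest test accuracy can also be picked programmatically instead of being read off the plot. The following is a minimal sketch under stated assumptions: it uses synthetic data from make_classification in place of the yield data, and best_index and best_alpha are illustrative names that do not appear in the project:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project data (an assumption for this sketch)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20)

# Effective alphas of the pruning path; drop the last one (single-node tree)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    x_train, y_train)
# Clip guards against tiny negative values caused by floating-point error
ccp_alphas = np.clip(path.ccp_alphas[:-1], 0, None)

# Fit one tree per alpha and record its test accuracy
test_scores = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    clf.fit(x_train, y_train)
    test_scores.append(clf.score(x_test, y_test))

# argmax returns the first alpha that reaches the best test accuracy
best_index = int(np.argmax(test_scores))
best_alpha = ccp_alphas[best_index]
print("best ccp_alpha:", best_alpha, "test accuracy:", test_scores[best_index])
```

Choosing alpha on the test set makes the reported test accuracy somewhat optimistic. A separate validation split or cross-validation would give a fairer estimate.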



clf = DecisionTreeClassifier(random_state=0, ccp_alpha=0.000026)

clf.fit(x_train,y_train)


Now, check the accuracy again.

from sklearn.metrics import accuracy_score
pred = clf.predict(x_test)
accuracy_score(y_test, pred)


That is how we keep our model from overfitting.




Now, I made a web application with Django using the given data.

GitHub Link - https://github.com/ArpitaG10/ML_-_DS_project/tree/master

The application has four pages:

login page
registration page
input field page
predicted output page









Thank you


