Random Data

Random data: weather and potato production in Idaho

The potato is one of the most important agricultural products (officially a vegetable, but sometimes hard to think of that way) in the world. With fewer growing requirements than most other staple foods, it was instrumental in fueling the European agricultural revolution and rapid population growth of the 18th and 19th centuries. It arguably helped European nations dominate the world for two hundred years. More importantly, potatoes, especially the Russet Burbank variety, are the key component of French fries.

I don’t really know much about them, so I decided to do some digging. Quickly, I learned that they are typically planted in late April and early May, while most of the harvesting is done in late September and early October, at least in North America. Obviously, optimal potato production depends on much more than just weather conditions. You also have to look at soil quality, precipitation, pest control, diseases, and other things (which themselves may depend on the weather). I don’t really know much about those things, but I do know about land area and its importance to agricultural production, so that’s going to be included in the model.

I decided to find weather data for Idaho Falls, using the extremely scientific method of looking for the city with the biggest text on Google Maps in eastern Idaho, where most of the potatoes are grown. In the interest of keeping things as simple as possible, I will use the mean temperatures for four months: April, May, September, and October. I pulled some data from the Department of Agriculture and National Oceanic and Atmospheric Administration into four CSV files. They include 11 years of data from 2006 to 2016, and the relevant items for my regression are harvested land acreage, average highs in planting and harvesting months, and production in hundredweight. I also have the planted acreage, but it is by necessity weakly greater than the harvested acreage and the two don’t really differ too much. Also, I have all the temperatures in Celsius, just because.

Data files: idahofallsfanningfield, idahopotatoarea, idahopotatoplanted, idahopotatoproduction
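To make the target layout concrete, here is a small sketch of the month filter and reshape on made-up numbers. The data frame `toy`, its `DATE` and `TMAX` columns, and the `YYYY-MM` date format are all assumptions for illustration, not the actual NOAA column names.

```r
# Toy monthly data for two hypothetical years; TMAX values are made up (1..24).
toy <- data.frame(
  DATE = sprintf("%d-%02d", rep(2006:2007, each = 12), rep(1:12, times = 2)),
  TMAX = 1:24
)

# Keep only April, May, September and October.
keep <- grepl("-(04|05|09|10)$", toy$DATE)
toy.sub <- toy[keep, ]

# One row per year, one column per month (values are in date order, so fill by row).
toy.mat <- matrix(toy.sub$TMAX, nrow = 2, ncol = 4, byrow = TRUE,
                  dimnames = list(as.character(2006:2007),
                                  c("Apr", "May", "Sep", "Oct")))
print(toy.mat)
```

Filling by row only works because the rows are already sorted by date; if the file were sorted differently, the months would land in the wrong columns.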

# Import the weather data from Idaho Falls, Fanning Field station.
if.weather=read.csv("idahofallsfanningfield.csv")

# Get only the date by month, high and low temperatures. I'm not using low
# for this analysis but this might come in handy for a future post.
if.growharv=if.weather[,c(3,26,27)]

# Pull only April, May, September and October data, then refresh the row number.
if.growharv=data.frame(if.growharv[grep("-04|-05|-09|-10",if.growharv$DATE),])
rownames(if.growharv)=seq(length=nrow(if.growharv))

# We only need the first 44 entries (11 years x 4 months) since 2017 is still
# in progress. But if we wanted to really split training and test sets this
# would be the way.
if.train=if.growharv[1:44,]
# Reshape into an 11 x 4 matrix: one row per year, one column per month.
if.mat=matrix(if.train[,2],nrow=11,ncol=4,byrow=TRUE)

# Get the production numbers in hundredweight, have them represented in millions.
id.production=read.csv("idahopotatoproduction.csv")
id.production=matrix(as.numeric(gsub(",","",rev(id.production[-6,20]))))/1000000

# Put everything in easy to read/write X/Y format.
Y=id.production
X=if.mat

# I forgot to get the harvested area data, so do it now and give everything a nice
# Xi name, with X1...X4 being monthly temperatures (April, May, September, October)
# and X5 being harvested land area in units of 10,000 acres.
id.area=read.csv("idahopotatoarea.csv")
X1=X[,1];X2=X[,2];X3=X[,3];X4=X[,4]
X5=matrix(as.numeric(gsub(",","",rev(id.area[,20]))))/10000

# Fit a no-intercept linear model (the -1 in the formula drops the intercept).
lm.fit=lm(Y~X1+X2+X3+X4+X5-1)
summary(lm.fit)

This gives the output:

Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5 - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8514 -2.2445 -0.9718  3.2498  6.7506 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
X1   1.9140     1.2286   1.558    0.170
X2   0.9234     1.2611   0.732    0.492
X3   1.1731     0.9647   1.216    0.270
X4  -0.6026     0.9592  -0.628    0.553
X5   1.9993     1.2225   1.635    0.153

Residual standard error: 5.947 on 6 degrees of freedom
Multiple R-squared:  0.9989,    Adjusted R-squared:  0.9979 
F-statistic:  1047 on 5 and 6 DF,  p-value: 9.83e-09

That’s a really high adjusted R-squared, but have I really learned anything here? Part of it is an artifact of dropping the intercept: without one, R computes R-squared against zero instead of against the mean of Y, which inflates it. None of the individual effects have achieved a decent significance level. Why is the X4 coefficient negative? If I add in an X4^2 term the X4 coefficient becomes positive. Does a really hot October have a major detrimental effect on yields? There may also be some correlation between weather and land usage: do farmers plant different amounts depending on April and May weather? What about the weather in the in-between months?
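How much a missing intercept can do to R-squared is easy to see on simulated numbers. In the rough sketch below, nothing is the Idaho data: the means, standard deviations and sample size are made up, and the response is deliberately unrelated to the predictor. For a no-intercept fit, `summary()` reports 1 - RSS / sum(y^2), so a response that sits far from zero scores high almost for free.

```r
set.seed(1)
n <- 11
x <- rnorm(n, mean = 15, sd = 1)   # a made-up "temperature"
y <- 130 + rnorm(n, sd = 5)        # a made-up "production", unrelated to x

fit.noint <- lm(y ~ x - 1)         # no intercept, as in the model above
fit.int   <- lm(y ~ x)             # same predictor, with an intercept

r2.noint <- summary(fit.noint)$r.squared
r2.int   <- summary(fit.int)$r.squared

# The no-intercept R-squared measured by hand: against zero, not the mean of y.
r2.manual <- 1 - sum(resid(fit.noint)^2) / sum(y^2)

print(c(no.intercept = r2.noint, intercept = r2.int, manual = r2.manual))
```

The two R-squared values answer different questions, so they should not be compared directly. Refitting the potato model with an intercept would give a number that is easier to interpret, though with 11 observations and 5 predictors it still would not say much.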

Intuitively, above 10 degrees, temperature should have a non-monotonic, concave effect on yield, with the downside not readily apparent until it gets closer to 30 or above. Hot weather is very harmful to newly planted potatoes, but how hot does it need to get to have a detrimental effect once a plant has been established? Low temperatures would also have a negative effect, and a simple monthly mean does not take into account more granular fluctuations in temperature.
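A concave effect like this is commonly approximated with a squared term. Here is a sketch on simulated data, where the peak at 20 degrees, the curvature and the noise level are all invented for illustration. One R detail matters: inside a formula the square has to be wrapped in `I()`, because a bare `temp^2` is read as formula syntax and collapses back to `temp`.

```r
set.seed(42)
temp  <- seq(5, 35, by = 1)
# Made-up concave response that peaks at 20 degrees, plus a little noise.
yield <- 100 - 0.5 * (temp - 20)^2 + rnorm(length(temp), sd = 1)

# Quadratic fit: I() makes ^ mean arithmetic squaring inside the formula.
fit.quad <- lm(yield ~ temp + I(temp^2))
b <- coef(fit.quad)

# The fitted parabola peaks where its derivative is zero: -b1 / (2 * b2).
peak <- unname(-b["temp"] / (2 * b["I(temp^2)"]))
print(peak)

# Without I(), temp^2 is formula crossing and adds no new term.
fit.bare <- lm(yield ~ temp + temp^2)
n.coef.bare <- length(coef(fit.bare))
print(n.coef.bare)
```

A negative coefficient on the squared term is what concavity looks like in this setup. With only 11 yearly observations, adding squared terms for several months would use up most of the remaining degrees of freedom.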

Obviously I’m not going to find some magical model in a single caffeine-fueled early morning data bender. This is serious stuff that takes a lifetime of research and requires some degree of knowledge of economics, climatology, biology and organic chemistry. I was just a little interested in potatoes.