Skip to content

R, e.g.: Grid of aligned time series plots in R

May 29, 2012
tags: ,

Maybe this is a counter example.  The outcome of this activity illustrates one of my points for  my introductory “R, For Example” post. Specifically, the format of the data is nearly everything.  Here’s the backstory of the data.  I’ll wait.  Okay, here’s the R problem…

The goal was to create a stacked chart of aligned time series plots like this:

The plots have different y-scales–and this is where the trouble starts. ggplot does a great job of generating every appropriate y-tick marks and giving everything that ggplot prettiness. But if you do this plot naively, you get a plot where the x-ranges are identical, but not aligned.  Here was my original graph,

The arrow shows the alignment problem. This is because printing the scales takes up different space, depending on the magnitude (decade, really) of the tick labels.

I imagined the easy way to fix this was a custom formatter, so I write a quick bit of code to format the y-tick labels with the right number of leading spaces.

# Fix it by manually formatting the y-tick labels
lalb <- function(x) {
lx <- sprintf('%04d', x) # Format the strings as HH:MM:SS
lx <- gsub('^000', ' ', lx) # Remove leading 00: if present
lx <- gsub('^00', ' ', lx) # Remove leading 00: if present
lx <- gsub('^0', ' ', lx) # Remove leading 0 if present
}

Quick and easy, right? Nope. The tick labels use proportionally spaced fonts, so this gets hack gets us closer, but not quite there.  In the end, I had to add some artificial padding to the tick label. I ended up adding “8” to the plot with the largest scale and this pushed everything far enough right to get it pretty close.  The final results is shown in the first graph of the post.

Lesson: A better way to do this is to load the data in column form and make each publisher a type (in a single column) instead of making each publisher its own column. R likes tall data frames. Then a ggplot-generated grid of plots would be aligned perfectly.

Here’s the code:

#!/usr/bin/env Rscript
 library(ggplot2)
 library(gridExtra)
 library(scales)

# data in columns, joined on date field
 x<-read.table("./joined.csv", header=TRUE, sep=",")
 x$date = as.POSIXct(x$time)
 summary(x)

# Make some times series plots
 p1 <- ggplot(data=x) +
 geom_line(aes(date, twcount), color="red") +
 opts(title="Twitter", axis.title.x = theme_blank())
 p3 <- ggplot(data=x) +
 geom_line(aes(date, wpcccount + wpoccount), color="red") +
 opts(title="Wordpress Comments", axis.title.x = theme_blank())
 p4 <- ggplot(data=x) +
 geom_line(aes(date, ngcount), color="red") +
 opts(title="Newsgator")+ xlab("Time (UTC)")

png(filename = "./timelines_noalign.png", width = 550, height = 750, units = 'px', res = 100)
 print(
 grid.arrange(p1, p3, p4 ,ncol = 1,
 main=textGrob("Earthquake - Mexico - 20 March 2012", gp=gpar(fontsize=14, lineheight=18)) )
 )

# Fix it by manually formatting the y-tick labels
 lalb <- function(x) {
 lx <- sprintf('%04d', x) # Format the strings as HH:MM:SS
 lx <- gsub('^000', ' ', lx) # Remove leading 00: if present
 lx <- gsub('^00', ' ', lx) # Remove leading 00: if present
 lx <- gsub('^0', ' ', lx) # Remove leading 0 if present
 }

# Unfortunately, do to proportional spaced fonts?, spaces dont' quite do ti. Add a manual adjustment.
 p1 <- ggplot(data=x) +
 geom_line(aes(date, twcount), color="red") +
 opts(title="Twitter", axis.title.x = theme_blank(), axis.title.y = theme_text(hjust=8)) +
 ylab("") + scale_y_continuous(labels = lalb)
 p3 <- ggplot(data=x) +
 geom_line(aes(date, wpcccount + wpoccount), color="red") +
 opts(title="Wordpress Comments", axis.title.x = theme_blank()) +
 ylab("") + scale_y_continuous(labels = lalb)
 p4 <- ggplot(data=x) +
 geom_line(aes(date, ngcount), color="red") +
 opts(title="Newsgator")+ xlab("Time (UTC)") +
 ylab("") + scale_y_continuous(labels = lalb)

png(filename = "./timelines_align.png", width = 550, height = 750, units = 'px', res = 100)
 print(
 grid.arrange(p1, p3, p4 ,ncol = 1,
 main=textGrob("Earthquake - Mexico - 20 March 2012", gp=gpar(fontsize=14, lineheight=18)) )
 )
 dev.off()

Data and code on github.

Social Media Pulse

May 25, 2012
tags: ,

Social media stories evolve according to a particular pattern when surprise events are experience by many social media users at once.  An example from Twitter was the earthquake in Oaxaca, MX on 20 March 2012.  The Twitter volume for terms “quake” and “terremoto” are shown in the figure.

The Social Media Pulse for simultaneously observed events can be modeled as a Stationary Poisson Point Process.  The model yields a Gamma distribution,

f_S(t)=\frac{\beta^{-\alpha}(t-t_0)^{\alpha-1}\exp(\frac{-(t-t_0)}{\beta}) }{\Gamma(\alpha)}

and the rate,

 

r(t)=N_{activities}f_S(t)

 

(Another, slightly more tractable model analytically for the The Social Media Pulse is to fit with a double exponential, r(t) = r_0(1-\exp(-\alpha (t - t_0))\exp(-\beta (t-t_0)))

Doing this enables consistent comparison of a story across social media publishers as well as comparisons of various stories.

A detailed writeup of the pulse parameters and fitting is available at Social Media Pulse and additional discussion of social media firehoses can be found in my Gnip blog series Taming the Social Media Firehose.

Python code on Github.

Dimension reduction for machine learning – simple example of SVD, PCA, pathology

May 14, 2012

Of the fast and loose practices in machine learning that I cringe at, throwing one’s favorite tool at high-dimensional data and expecting algorithms to learn to generalize well.

Part of the problem is we don’t always know if the features we chose are meaningful to the learning problem. Another part of the problem is that a useful regression requires that our sample fills the space adequately to regress.  (An n-sphere has a volume that scales as  V \approx R^n .  This means that if you feel comfortable with 10 samples for a 1-d linear regression, you should have in your mind a million samples for a 6-d regression.)

One strategy for dealing with high-dimensionality is to rotate and scale the data set along axes of high sample variation. Two closely related mathematical tools are available to help with this: Singular Value Decomposition (SVD) and Principal Component Analysis). I won’t add anything to the mathematics described in these articles.  The purpose here is

  1. Explore some questions with a simple example
  2. Explore the link between SVD and PCA with some simple data to understand how it works, and build a little bit of intuition on what linear transformation can do for machine learning.
  3. Demonstrate how this strategy could work to reduce the dimentionality of a problem (This really turns out to be reducing the dimentionality of the representation of the problem.)
  4. Show how this strategy can easily go wrong in order to build intuition about when and how this might work for a real machine learning problem.
There is a link to the all of the code at the bottom.  To get this output, install IPython, the matplotlib, scipy, numpy, scikits.learn packages, then paste code into a session.  Or just edit the script to include the output you want and run from the command line.
Here we go…first SVD.

 In [23]: ########################
 In [24]: # Demo Part 1
 In [25]: ########################
 In [26]: # SVD: decompose a matrix X into product of unitary matrix (U), diagonal
 In [27]: # matrix (S) and vector V, a rotation, (if it helps, think:
 In [28]: # basis vectors, scaling, rotation) such that X.T = W x S x V_t.
 In [29]: #
 In [30]: # In machine learning, X will be a matrix of samples (4, rows)
 In [31]: # and features (3, columns) for our learning examples.
 In [32]: X_t = mat([ [ 1, 3, -10 ],
 ....: [ 0, 2, 1 ],
 ....: [ -1, -3, 9 ],
 ....: [ 0, -2, 0 ] ])
 In [33]: X = X_t.T
 In [34]: # The scipy library makes this step easy
 In [35]: W, s, V_t = linalg.svd( X )
 In [36]: S = linalg.diagsvd(s, len(X), len(V_t))
 In [37]: recon = dot( dot( W, S), V_t)
 In [38]: # Are these equal (to within rounding)?
 In [39]: abs(X - recon)
 Out[39]:
 matrix([[ 4.44089210e-16, 1.73472348e-16, 6.66133815e-16,
 1.59594560e-16],
 [ 8.88178420e-16, 2.22044605e-16, 4.44089210e-16,
 1.33226763e-15],
 [ 1.77635684e-15, 4.44089210e-16, 5.32907052e-15,
 6.73940070e-16]])
 In [40]: # maximum error
 In [41]: np.max(abs(X - recon))
 Out[41]: 5.3290705182007514e-15
 In [42]: # One key to understanding the link to PCA is
 In [43]: # understanding the diagonal matrix, S
 In [44]: S
 Out[44]:
 array([[ 14.19266482, 0. , 0. , 0. ],
 [ 0. , 2.9255743 , 0. , 0. ],
 [ 0. , 0. , 0.09633518, 0. ]])
 In [45]: # Given that the features have zero-mean:
 In [46]: [np.mean(i) for i in X]
 Out[46]: [0.0, 0.0, 0.0]
 In [47]: # s is an ordered vector that tells us the "significance" of each dimension
 In [48]: # in the rotated space. (Yes, I arranged it that way. To be clear,
 In [49]: # you can do SVD on a matrix without zero-mean features, but the dimension
 In [50]: # reduction part we are about to do requires it.)
 In [52]: # We can selectively set lower numbers to
 In [53]: # zero to reduce dimension.
 In [54]: s_red = s
 In [55]: s_red[2] = 0
 In [56]: # Approximately reconstruct our original matrix, but with
 In [57]: # a reduced-dimension representation
 In [58]: S_red = linalg.diagsvd(s_red, len(X), len(V_t))
 In [59]: S_red
 Out[59]:
 array([[ 14.19266482, 0. , 0. , 0. ],
 [ 0. , 2.9255743 , 0. , 0. ],
 [ 0. , 0. , 0. , 0. ]])
 In [60]: recon_red = dot( dot( W, S_red), V_t)
 In [61]: abs(X - recon_red)
 Out[61]:
 matrix([[ 0.04297068, 0.04061026, 0.05215936, 0.05451977],
 [ 0.00118308, 0.00111809, 0.00143606, 0.00150105],
 [ 0.00412864, 0.00390185, 0.00501149, 0.00523828]])
 In [62]: # maximum error
 In [63]: np.max(abs(X - recon_red))
 Out[63]: 0.054519772460470559
 In [64]: # ratio of errors
 In [65]: np.max(abs(X - recon))/np.max(abs(X - recon_red))
 Out[65]: 9.7745648554652029e-14
 In [66]:
 In [66]: # We "lost" 14 orders of magnitude in precision of the reconstruction, but this turns
 In [67]: # out to be okay for some machine learning problems.

So that is a very simple SVD example.  Now use the same representation for a simple machine learning problem.   Classify to groups of data in a two dimensional space using Logistic Regression.   (Logistic regression is very good a solving problems just like this without SVD or PCA, so this is merely to get at how it all works together.)

 In [68]: ########################
 In [69]: # Demo Part 2
 In [70]: ########################
 In [71]: from scikits.learn import linear_model
 In [72]: import scipy
 In [73]: from copy import copy
 In [74]: # Classification problem where points are linearly classifiable
 In [75]: # in 2 dim.
 In [76]: N=200
 In [77]: x = np.random.normal(0,4,N)
 In [78]: y1 = 3*x + np.random.normal(3,1,N)
 In [79]: y2 = 3*x + np.random.normal(-3,1,N)
 In [80]: y_avg = np.mean(np.append(y1, y2, 1))
 In [81]: figure()
 In [82]: plot(x,y1 - y_avg, 'bo');
 In [83]: plot(x,y2 - y_avg, 'ro');
 In [84]: title("Original data sets (0-average)")

Image

Boundary between groups at  y(x) \approx 3x

In [85]: # features x and y are rows in this matrix
In [86]: X = np.append([x,y1 - y_avg],[x,y2 - y_avg],1)
In [87]: X_t = X.T
In [88]: # y1 group 0; y2 group 1
In [89]: truth = np.append(scipy.ones([1, N]), scipy.zeros([1, N]), 1)
In [90]: # 2d model works very well
In [91]: lr = linear_model.LogisticRegression()
In [92]: lr.fit(np.asarray(X_t),truth[0])
Out[92]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
 penalty='l2', tol=0.0001)
In [93]: lr.score(np.asarray(X_t),truth[0])
Out[93]: 1.0
In [94]: lr.predict(np.asarray(X_t))
Out[94]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
In [95]: # Now try dimension reduciton with SVD
In [96]: W, s, V_t = linalg.svd( X )
In [97]: s
Out[97]: array([ 273.98689602, 19.47579928])
In [98]: # both transformed representations are the same before trimming:
In [99]: S = linalg.diagsvd(s, len(X), len(V_t))
In [100]: np.max(abs(X.T*matrix(W) - matrix(V_t).T*matrix(S).T))
Out[100]: 6.7501559897209518e-14
In [101]: # Now work with the transformed coordinates. It might not have been clear
In [102]: # from above what the transformed coordinate system was. We can get there
In [103]: # by either the product of the first two or last two terms.
In [104]: X_prime = matrix(V_t).T*matrix(S).T
In [105]: x_prime = np.asarray(X_prime.T[0])
In [106]: y_prime = np.asarray(X_prime.T[1])
In [107]: figure()
In [108]: plot(x_prime, y_prime, 'go');
In [109]: title("Features after SVD Transformation")
Out[109]: <matplotlib.text.Text at 0x11b956890>

Image

Boundary between groups at  y(x) \approx 0

In [110]: # Linearly classifiable in 1-d? Try all new basis directions (extremes of variation)
In [111]: # Min variation - Training along y-dim nearly perfect
In [112]: ypt = np.asarray(y_prime.T)
In [113]: lr.fit(ypt, truth[0])
Out[113]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
 penalty='l2', tol=0.0001)
In [114]: lr.score(ypt, truth[0])
Out[114]: 0.99750000000000005
In [115]: lr.predict(ypt)
Out[115]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
In [116]: # Max variation - Nothing here
In [117]: xpt = np.asarray(x_prime.T)
In [118]: lr.fit(xpt, truth[0])
Out[118]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
 penalty='l2', tol=0.0001)
In [119]: lr.score(xpt, truth[0])
Out[119]: 0.58250000000000002
In [120]: lr.predict(xpt)
Out[120]:
array([1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0,
 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1,
 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1,
 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1,
 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0,
 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1,
 1, 0, 0, 0, 1, 1, 0, 1, 0], dtype=int32)

Notice that the transformation made the problem “easier” in that it was solved with 1-d instead of 2-d machine learning and that the two groups appear even more separated after the transformation than before.

Lesson 1: Look at all of the dimensions–in this case the smallest variation axis rather than the largest variation axis solves the problem.  This is going to catch anyone who blindly applies PCA for machine learning.  See Part 3.


In [121]: ########################
In [122]: # Demo Part 3
In [123]: ########################
In [124]: # Use PCA idea to reduce to 1-D
In [125]: s_red = copy(s)
In [126]: s_red[1] = 0
In [127]: S_red = linalg.diagsvd(s_red, len(X), len(V_t))
In [128]: X_prime = matrix(V_t).T*matrix(S_red).T
In [129]: x_prime = np.asarray(X_prime.T[0])
In [130]: y_prime = np.asarray(X_prime.T[1])
In [131]: figure()
Out[131]: <matplotlib.figure.Figure at 0x1193e7450>
In [132]: plot(x_prime, y_prime, 'yo');
In [133]: title("Reduce S by removing s[1] = %2.5f"%s[1])

All original group information lost.

In [134]: # Try all new basis directions (not just greatest variations)
In [135]: # 1-D: Max variation - Training along x-dim performs poorly
In [136]: ypt = np.asarray(y_prime.T)
In [137]: lr.fit(ypt, truth[0])
Out[137]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
 penalty='l2', tol=0.0001)
In [138]: lr.score(ypt, truth[0])
Out[138]: 0.5
In [139]: lr.predict(ypt)
Out[139]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [140]: # This is the other extrema to "principal" components
In [141]: s_red = copy(s)
In [142]: s_red[0] = 0
In [143]: S_red = linalg.diagsvd(s_red, len(X), len(V_t))
In [144]: X_prime = matrix(V_t).T*matrix(S_red).T
In [145]: x_prime = np.asarray(X_prime.T[0])
In [146]: y_prime = np.asarray(X_prime.T[1])
In [147]: figure()
Out[147]: <matplotlib.figure.Figure at 0x11c74f6d0>
In [148]: plot(x_prime, y_prime, 'mo');
In [149]: title("Reduce S by removing value s[0] = %2.5f"%s[0])

Image

Trivial classification, now in 1-d.

In [150]: # 1-D: Min variation - Training along y-dim nearly perfect
In [151]: ypt = np.asarray(y_prime.T)
In [152]: lr.fit(ypt, truth[0])
Out[152]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
 penalty='l2', tol=0.0001)
In [153]: lr.score(ypt, truth[0])
Out[153]: 0.99750000000000005
In [154]: lr.predict(ypt)
Out[154]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

So our 1-d model performs great. Quoting from Wikipedia,

“PCA essentially rotates the set of points around their mean in order to align with the principal components. This moves as much of the variance as possible (using an orthogonal transformation) into the first few dimensions. The values in the remaining dimensions, therefore, tend to be small and may be dropped with minimal loss of information.”

The problem with following this without skepticism is that, according to the guidelines of PCA, our problem depends on using the “wrong” dimension.  (The properties are not quite as arbitrary as they may seem. For example, race data for long distances that includes both men and women have this quality because the variation within gender of total race times can be greater than the variation between gender times.)

Code on Github

R, e.g.: Bokal’s cheese balls

May 9, 2012

Recently, someone in the office (Bokal?) was eating cheese balls from a barrel.

My first reaction was “Is this food?”  My second was slightly less judgmental and more fertile for R exploration “In what context does this food live?” This sent me off to the web and R.  I found a fairly comprehensive data set for various food and plotted Bokal’s Cheese ball in context for fat, carbs and protein–mostly what food is (fiber?). I didn’t get mass-density information, so I plotted by energy density.

And the scatter plots show more clearly that Cheese balls are out on the edge–only possible with heroic technological means.

#!/usr/bin/env Rscript
#
# Scott Hendrickson
#   2012-01-27
#

library(ggplot2)
library(gridExtra)

# read in nutritional data
cnt<-read.table("./foodNutritionalData.csv", header=TRUE, sep=",")
cnt$edensity <- cnt$Energy/cnt$Weight
cnt$feratio <- cnt$Fat/cnt$Energy
cnt$peratio <- cnt$Protein/cnt$Energy
cnt$ceratio <- cnt$Carbohydrate/cnt$Energy

summary(cnt)

# single out Bokal's cheese balls
cnt$Type <- "Everything Else"
# this record was entered first in the file
cnt$Type[c(1)] <- "Bokal's Cheese Balls"
# cnt[1,]

#order and context by energy density
cntbyedensity <- cnt[order(cnt$edensity),]
ordr <- match("Bokal's Cheese Balls",  cntbyedensity$Type)
print(cntbyedensity[ (ordr - 5): (ordr + 5), ], width = ,digits = 3)

# arrow - horizontal arrow pointing to the x value of point offset by aoffset
# point(x,y); point$dx, point$dy size; ad direction; xoffset arrow point offset; xhf, xhf arrow size fractions
arrow <- function(point, ad=1, xoffset=0, xhf=0.3, yhf=0.3) {
	xhead <- xhf * point$dx
	yhead <- (1. + yhf) * point$dy
	data.frame(
    x = c(	point$x-ad*xoffset,
	 	point$x-ad*xhead, point$x-ad*xhead,
		point$x-ad*point$dx, point$x-ad*point$dx,
		point$x-ad*xhead, point$x-ad*xhead,
		point$x-ad*xoffset),
    y = c(	point$y,
		point$y+yhead,
		point$y-point$dy/2., point$y-point$dy/2.,
		point$y+point$dy/2., point$y+point$dy/2.,
		point$y-yhead,
		point$y)
    )
}

## Plots

pnt_o <- data.frame( x <- c(cnt[1,"edensity"]), y <- c(80), dx <- 4, dy <- 20 )
arrow_o <- arrow( pnt_o, -1 )
ptext_o <- data.frame(label="Bokal's Cheese Balls", x=pnt_o$x + 2,  y=pnt_o$y)
lx_o <- c(cnt$edensity[1], cnt$edensity[1])
ly_o <- c(10, 90)

o <-qplot(edensity, data=cnt, geom="histogram",
	binwidth = 0.2,
	xlab="Energy Density (cal/g)",
	ylab="Number of Foods") +
	geom_line(aes(lx_o,ly_o), color="red") +
  	geom_polygon(aes(x,y), data=arrow_o, fill="yellow") +
	geom_text(aes(x, y, label=label), ptext_o)
#
pnt_a <- data.frame( x <- c(cnt[1,"feratio"]), y <- c(150), dx <- 0.07, dy <- 60 )
arrow_a <- arrow( pnt_a , -1)
ptext_a <- data.frame(label="Bokal's Cheese Balls", x=pnt_a$x + 0.035,  y=pnt_a$y)
lx_a <- c(cnt$feratio[1], cnt$feratio[1])
ly_a <- c(50, 200)

a <- qplot(feratio, data=cnt, geom="histogram",
	binwidth = 0.004,
	xlab="Fat/Energy (g/cal)",
	ylab="Number of Foods") +
  	geom_line(aes(lx_a,ly_a), color="red") +
  	geom_polygon(aes(x,y), data=arrow_a, fill="yellow") +
	geom_text(aes(x, y, label=label), ptext_a)
#
pnt_b <- data.frame( x <- c(cnt[1,"peratio"]), y <- c(110), dx <- 0.14, dy <- 30 )
arrow_b <- arrow( pnt_b ,-1 )
ptext_b <- data.frame(label="Bokal's Cheese Balls", x=pnt_b$x + 0.07,  y=pnt_b$y)
lx_b <- c(cnt$peratio[1], cnt$peratio[1])
ly_b <- c(80, 140)

b <- qplot(peratio, data=cnt, geom="histogram",
	binwidth = 0.005,
	xlab="Protein/Energy (g/cal)",
	ylab="Number of Foods") +
  	geom_line(aes(lx_b,ly_b), color="red") +
  	geom_polygon(aes(x,y), data=arrow_b, fill="yellow") +
	geom_text(aes(x, y, label=label), ptext_b)
#
pnt_c <- data.frame( x <- c(cnt[1,"ceratio"]), y <- c(100), dx <- 0.24, dy <- 30 )
arrow_c <- arrow( pnt_c ,-1 )
ptext_c <- data.frame(label="Bokal's Cheese Balls", x=pnt_c$x + 0.12,  y=pnt_c$y)
lx_c <- c(cnt$ceratio[1], cnt$ceratio[1])
ly_c <- c(20, 120)

c <- qplot(ceratio, data=cnt, geom="histogram",
	binwidth = 0.005,
	xlab="Carb/Energy (g/cal)",
	ylab="Number of Foods") +
  	geom_line(aes(lx_c,ly_c), color="red") +
  	geom_polygon(aes(x,y), data=arrow_c, fill="yellow") +
	geom_text(aes(x, y, label=label), ptext_c)

png(filename = "./denrat.png", width = 520, height = 900, units = 'px')
print(
   grid.arrange(o, a, b, c, ncol=1)
)
dev.off()

# this would work better with volume density
#p <- ggplot(cnt, aes(Energy, Fat),
#	xlab="Energy (cal)",
#	ylab="Fat (g)")

# try some fat-energy relationships
#p + geom_point(aes(color = Type )) +
#	scale_x_log2() +
#	scale_y_log2()

#p + geom_point(aes(color = Type, size = edensity )) +
#	scale_x_log2() +
#	scale_y_log2()

# context for Bokal's cheese balls
q <- ggplot(cnt, aes(x = edensity))
q1 <- q +
	geom_point(aes(edensity, feratio, color = Type)) +
	xlab("Energy Density (cal/g)") +
	ylab("Fat/Energy (g/cal)") +
	geom_point(aes(edensity, feratio), data=cnt[1,], size = 4, color="red")
q2 <- q +
	geom_point(aes(edensity, peratio, color = Type)) +
	xlab("Energy Density (cal/g)") +
	ylab("Protein/Energy (g/cal)") +
	geom_point(aes(edensity, peratio), data=cnt[1,], size = 4, color="red")
q3 <- q +
	geom_point(aes(edensity, ceratio, color = Type)) +
	xlab("Energy Density (cal/g)") +
	ylab("Carb/Energy (g/cal)") +
	geom_point(aes(edensity, ceratio), data=cnt[1,], size = 4, color="red")

png(filename = "./relrat.png", width = 520, height = 900, units = 'px')
print(
   grid.arrange(q1, q2, q3, ncol=1)
)
dev.off()

(Bonus here would be some R-ish way to define and see “near” foods”)

Data and source on Github.

R, e.g.: New Years Eve Tweets

May 5, 2012

First, simple plotting with ggplot.

New Year’s eve is an interesting time to look at twitter volumes because the event moves from time zone to time zone over a 24 hour period.  Below are hourly NY-related tweets by UTC.

The three major humps in volume correspond to the Americas, Europe and Asia.

Not too surprisingly, comparing 2010 and 2011 shows twitter usage growing in most timezones. In the 2010 vs. 2011 scatter plot there are three outliers for growth.

(Basic skill to add here is to exclude the outliers from the linear regression in some R-ish way, rather than manually. Use this to estimate volumes based on the other points and output the difference.)

#!/usr/bin/env Rscript
#
# Plot time series volume data

# install.packages("ggplot2", dependencies = TRUE )
# install.packages("gridExtra", dependencies = TRUE )

library(ggplot2)
library(gridExtra)

X <-read.delim("./twitter.timeline.csv.byhour.csv", sep=",", header=TRUE)
# parse datetime strings
X$date <- as.POSIXct(X$time)

summary(X)

# ts is unix style timestamp
sub0 <- X[ (X$ts > 1293814803 & X$ts  < 1293901203)  , ]
sub1 <- X[ (X$ts > 1325350803 & X$ts  < 1325437203)  , ]

summary(sub0)
summary(sub1)

p0 <- qplot(sub0$date, sub0$count, geom="bar", stat="identity", xlab="2010:  12/31 - 1/1 GMT", ylab="tweets/hr") + scale_y_continuous(limits = c(0, 1100000))
p1 <- qplot(sub1$date, sub1$count, geom="bar", stat="identity", xlab="2011:  12/31 - 1/1 GMT", ylab="tweets/hr") + scale_y_continuous(limits = c(0, 1100000))

png(filename = "./NYTweets-World-2011.png", width = 600, height = 600, units = 'px')
print(
    p1
)
dev.off()

png(filename = "./NYTweets-World-2011v2010.png", width = 600, height = 600, units = 'px')
print(
    grid.arrange(p0, p1, ncol = 1, main="Compare 2010 vs 2011")
)
dev.off()

png(filename = "./NYTweets-Growth-2011v2010.png", width = 600, height = 600, units = 'px')
print(
qplot(sub0$count, sub1$count, xlab="2010 tweets/hr", ylab="2011 tweets/hr") +
     geom_smooth(method=lm,se=FALSE) +
     scale_y_continuous(limits = c(0, 1100000)) +
     scale_x_continuous(limits = c(0, 1100000))
)
dev.off()

Data and Source Gist.

R, for example… Moving here

May 3, 2012

R is a statistical computation language.  As a programming language, I am not too fond of it, but for many quick analysis and visualization problems, R is just the tool.  The ggplot package from Hadley Wickham produces very good looking plots quickly.

3 + \epsilon observations about R:

  • I thought I would master the basics in about 20% of the time it has actually taken. It seems R has a steep learning curve as they say.  May this is only true when you are coming from other object/procedural languages (my experience is with C, Java, Python, C++…)?
  • Nearly every frustration I have experience with R comes from not understanding the required form the data and not knowing the tools for transform data. When you get the data in the right structure, many powerful packages will give sophisticated answers in a line or two.
  • When I start writing functions and loop in R, it is type to move to Python/NumPy/SciPy/Matplotlib.
  • (There is an art to searching for things related to “R” on the web–almost as bad as searching for the game of “go”.)

I started posting R examples on Tumblr. The idea behind “R, for example…” is to provide input data, code and output to quickly walk through a short analysis illustrating new (to me) or useful R. But the Tumblr platform really isn’t very friendly to math, data and plots.  Going forward, R examples will appear here on Skippy Records.

Text-formatted tables of numbers

November 2, 2011

This is a simple Python text formatting project.  After searching for this capability for a little while, I decided to write my own as the following requirements seemed to be unique:

  • Create a table of numbers from a CSV file
  • Control the number of significant figures (significant digits) displayed
  • Format numbers with “,” and ” ” like this 123,456.789 123
  • Align columns on decimal points–this is best for readability
  • Fit the columns automatically based on data and formatting
There are many code snippets that do one part or another of this, but I didn’t find anything that fulfilled all the requirements.   So I created  couple of simple classes to accomplish the task.  One caveat is that the output needs to be displayed in a constant-space font.  Some example output:
      483,000. |       471.7 |     0.460 6 | 0.000 449 8
      966,000. |       943.4 |     0.921 3 | 0.000 899 7
    1,449,000. |     1,415.  |     1.382   | 0.001 349  
    1,932,000. |     1,887.  |     1.843   | 0.001 799  
    2,415,000. |     2,358.  |     2.303   | 0.002 249  
    2,898,000. |     2,830.  |     2.764   | 0.002 699  
    3,381,000. |     3,302.  |     3.224   | 0.003 149  
    3,864,000. |     3,773.  |     3.685   | 0.003 599  
    5,796,000. |     5,660.  |     5.528   | 0.005 398  
    7,728,000. |     7,547.  |     7.370   | 0.007 197  
    9,660,000. |     9,434.  |     9.213   | 0.008 997  
   19,320,000. |    18,870.  |    18.43    | 0.017 99   
   38,640,000. |    37,730.  |    36.85    | 0.035 99   
   57,960,000. |    56,600.  |    55.28    | 0.053 98   
   77,280,000. |    75,470.  |    73.70    | 0.071 97   
   96,600,000. |    94,340.  |    92.13    | 0.089 97   
  144,900,000. |   141,500.  |   138.2     | 0.134 9    
  193,200,000. |   188,700.  |   184.3     | 0.179 9    
  289,800,000. |   283,000.  |   276.4     | 0.269 9    
  386,400,000. |   377,300.  |   368.5     | 0.359 9    
  579,600,000. |   566,000.  |   552.8     | 0.539 8    
  772,800,000. |   754,700.  |   737.0     | 0.719 7    
  966,000,000. |   943,400.  |   921.3     | 0.899 7    
1,159,000,000. | 1,132,000.  | 1,106.      | 1.080      

You can download the code at GitHub: https://github.com/DrSkippy27/Text-Format-Table
Follow

Get every new post delivered to your Inbox.

Join 253 other followers

%d bloggers like this: