Of the fast and loose practices in machine learning that I cringe at, throwing one’s favorite tool at high-dimensional data and expecting algorithms to learn to generalize well.
Part of the problem is we don’t always know if the features we chose are meaningful to the learning problem. Another part of the problem is that a useful regression requires that our sample fills the space adequately to regress. (An n-sphere has a volume that scales as . This means that if you feel comfortable with 10 samples for a 1-d linear regression, you should have in your mind a million samples for a 6-d regression.)
One strategy for dealing with high-dimensionality is to rotate and scale the data set along axes of high sample variation.
Two closely related mathematical tools are available to help with this: Singular Value Decomposition (SVD) and Principal Component Analysis). I won’t add anything to the mathematics described in these articles. The purpose here is
- Explore some questions with a simple example
- Explore the link between SVD and PCA with some simple data to understand how it works, and build a little bit of intuition on what linear transformation can do for machine learning.
- Demonstrate how this strategy could work to reduce the dimentionality of a problem (This really turns out to be reducing the dimentionality of the representation of the problem.)
- Show how this strategy can easily go wrong in order to build intuition about when and how this might work for a real machine learning problem.
In [23]: ######################## In [24]: # Demo Part 1 In [25]: ######################## In [26]: # SVD: decompose a matrix X into product of unitary matrix (U), diagonal In [27]: # matrix (S) and vector V, a rotation, (if it helps, think: In [28]: # basis vectors, scaling, rotation) such that X.T = W x S x V_t. In [29]: # In [30]: # In machine learning, X will be a matrix of samples (4, rows) In [31]: # and features (3, columns) for our learning examples. In [32]: X_t = mat([ [ 1, 3, -10 ], ....: [ 0, 2, 1 ], ....: [ -1, -3, 9 ], ....: [ 0, -2, 0 ] ]) In [33]: X = X_t.T In [34]: # The scipy library makes this step easy In [35]: W, s, V_t = linalg.svd( X ) In [36]: S = linalg.diagsvd(s, len(X), len(V_t)) In [37]: recon = dot( dot( W, S), V_t) In [38]: # Are these equal (to within rounding)? In [39]: abs(X - recon) Out[39]: matrix([[ 4.44089210e-16, 1.73472348e-16, 6.66133815e-16, 1.59594560e-16], [ 8.88178420e-16, 2.22044605e-16, 4.44089210e-16, 1.33226763e-15], [ 1.77635684e-15, 4.44089210e-16, 5.32907052e-15, 6.73940070e-16]]) In [40]: # maximum error In [41]: np.max(abs(X - recon)) Out[41]: 5.3290705182007514e-15 In [42]: # One key to understanding the link to PCA is In [43]: # understanding the diagonal matrix, S In [44]: S Out[44]: array([[ 14.19266482, 0. , 0. , 0. ], [ 0. , 2.9255743 , 0. , 0. ], [ 0. , 0. , 0.09633518, 0. ]]) In [45]: # Given that the features have zero-mean: In [46]: [np.mean(i) for i in X] Out[46]: [0.0, 0.0, 0.0] In [47]: # s is an ordered vector that tells us the "significance" of each dimension In [48]: # in the rotated space. (Yes, I arranged it that way. To be clear, In [49]: # you can do SVD on a matrix without zero-mean features, but the dimension In [50]: # reduction part we are about to do requires it.) In [52]: # We can selectively set lower numbers to In [53]: # zero to reduce dimension. In [54]: s_red = s In [55]: s_red[2] = 0 In [56]: # Approximately reconstruct our original matrix, but with In [57]: # a reduced-dimension representation In [58]: S_red = linalg.diagsvd(s_red, len(X), len(V_t)) In [59]: S_red Out[59]: array([[ 14.19266482, 0. , 0. , 0. ], [ 0. , 2.9255743 , 0. , 0. ], [ 0. , 0. , 0. , 0. ]]) In [60]: recon_red = dot( dot( W, S_red), V_t) In [61]: abs(X - recon_red) Out[61]: matrix([[ 0.04297068, 0.04061026, 0.05215936, 0.05451977], [ 0.00118308, 0.00111809, 0.00143606, 0.00150105], [ 0.00412864, 0.00390185, 0.00501149, 0.00523828]]) In [62]: # maximum error In [63]: np.max(abs(X - recon_red)) Out[63]: 0.054519772460470559 In [64]: # ratio of errors In [65]: np.max(abs(X - recon))/np.max(abs(X - recon_red)) Out[65]: 9.7745648554652029e-14 In [66]: In [66]: # We "lost" 14 orders of magnitude in precision of the reconstruction, but this turns In [67]: # out to be okay for some machine learning problems.
So that is a very simple SVD example. Now use the same representation for a simple machine learning problem. Classify to groups of data in a two dimensional space using Logistic Regression. (Logistic regression is very good a solving problems just like this without SVD or PCA, so this is merely to get at how it all works together.)
In [68]: ########################
In [69]: # Demo Part 2
In [70]: ########################
In [71]: from scikits.learn import linear_model
In [72]: import scipy
In [73]: from copy import copy
In [74]: # Classification problem where points are linearly classifiable
In [75]: # in 2 dim.
In [76]: N=200
In [77]: x = np.random.normal(0,4,N)
In [78]: y1 = 3*x + np.random.normal(3,1,N)
In [79]: y2 = 3*x + np.random.normal(-3,1,N)
In [80]: y_avg = np.mean(np.append(y1, y2, 1))
In [81]: figure()
In [82]: plot(x,y1 - y_avg, 'bo');
In [83]: plot(x,y2 - y_avg, 'ro');
In [84]: title("Original data sets (0-average)")
Boundary between groups at
In [85]: # features x and y are rows in this matrix
In [86]: X = np.append([x,y1 - y_avg],[x,y2 - y_avg],1)
In [87]: X_t = X.T
In [88]: # y1 group 0; y2 group 1
In [89]: truth = np.append(scipy.ones([1, N]), scipy.zeros([1, N]), 1)
In [90]: # 2d model works very well
In [91]: lr = linear_model.LogisticRegression()
In [92]: lr.fit(np.asarray(X_t),truth[0])
Out[92]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [93]: lr.score(np.asarray(X_t),truth[0])
Out[93]: 1.0
In [94]: lr.predict(np.asarray(X_t))
Out[94]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
In [95]: # Now try dimension reduciton with SVD
In [96]: W, s, V_t = linalg.svd( X )
In [97]: s
Out[97]: array([ 273.98689602, 19.47579928])
In [98]: # both transformed representations are the same before trimming:
In [99]: S = linalg.diagsvd(s, len(X), len(V_t))
In [100]: np.max(abs(X.T*matrix(W) - matrix(V_t).T*matrix(S).T))
Out[100]: 6.7501559897209518e-14
In [101]: # Now work with the transformed coordinates. It might not have been clear
In [102]: # from above what the transformed coordinate system was. We can get there
In [103]: # by either the product of the first two or last two terms.
In [104]: X_prime = matrix(V_t).T*matrix(S).T
In [105]: x_prime = np.asarray(X_prime.T[0])
In [106]: y_prime = np.asarray(X_prime.T[1])
In [107]: figure()
In [108]: plot(x_prime, y_prime, 'go');
In [109]: title("Freatures after SVD Transformation")
Out[109]: <matplotlib.text.Text at 0x11b956890>
Boundary between groups at
In [110]: # Linearly classifiable in 1-d? Try all new basis directions (extremes of variation) In [111]: # Min variation - Training along y-dim nearly perfect In [112]: ypt = np.asarray(y_prime.T) In [113]: lr.fit(ypt, truth[0]) Out[113]: LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True, penalty='l2', tol=0.0001) In [114]: lr.score(ypt, truth[0]) Out[114]: 0.99750000000000005 In [115]: lr.predict(ypt) Out[115]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32) In [116]: # Max variation - Nothing here In [117]: xpt = np.asarray(x_prime.T) In [118]: lr.fit(xpt, truth[0]) Out[118]: LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True, penalty='l2', tol=0.0001) In [119]: lr.score(xpt, truth[0]) Out[119]: 0.58250000000000002 In [120]: lr.predict(xpt) Out[120]: array([1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0], dtype=int32)
Notice that the transformation made the problem “easier” in that it was solved with 1-d instead of 2-d machine learning and that the two groups appear even more separated after the transformation than before.
Lesson 1: Look at all of the dimensions–in this case the smallest variation axis rather than the largest variation axis solves the problem. This is going to catch anyone who blindly applies PCA for machine learning. See Part 3.
In [121]: ########################
In [122]: # Demo Part 3
In [123]: ########################
In [124]: # Use PCA idea to reduce to 1-D
In [125]: s_red = copy(s)
In [126]: s_red[1] = 0
In [127]: S_red = linalg.diagsvd(s_red, len(X), len(V_t))
In [128]: X_prime = matrix(V_t).T*matrix(S_red).T
In [129]: x_prime = np.asarray(X_prime.T[0])
In [130]: y_prime = np.asarray(X_prime.T[1])
In [131]: figure()
Out[131]: <matplotlib.figure.Figure at 0x1193e7450>
In [132]: plot(x_prime, y_prime, 'yo');
In [133]: title("Reduce S by removing s[1] = %2.5f"%s[1])
All original group information lost.
In [134]: # Try all new basis directions (not just greatest variations)
In [135]: # 1-D: Max variation - Training along x-dim performs poorly
In [136]: ypt = np.asarray(y_prime.T)
In [137]: lr.fit(ypt, truth[0])
Out[137]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [138]: lr.score(ypt, truth[0])
Out[138]: 0.5
In [139]: lr.predict(ypt)
Out[139]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [140]: # This is the other extrema to "principal" components
In [141]: s_red = copy(s)
In [142]: s_red[0] = 0
In [143]: S_red = linalg.diagsvd(s_red, len(X), len(V_t))
In [144]: X_prime = matrix(V_t).T*matrix(S_red).T
In [145]: x_prime = np.asarray(X_prime.T[0])
In [146]: y_prime = np.asarray(X_prime.T[1])
In [147]: figure()
Out[147]: <matplotlib.figure.Figure at 0x11c74f6d0>
In [148]: plot(x_prime, y_prime, 'mo');
In [149]: title("Reduce S by removing value s[0] = %2.5f"%s[0])
Trivial classification, now in 1-d.
In [150]: # 1-D: Min variation - Training along y-dim nearly perfect In [151]: ypt = np.asarray(y_prime.T) In [152]: lr.fit(ypt, truth[0]) Out[152]: LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True, penalty='l2', tol=0.0001) In [153]: lr.score(ypt, truth[0]) Out[153]: 0.99750000000000005 In [154]: lr.predict(ypt) Out[154]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
So our 1-d model performs great. Quoting from Wikipedia,
“PCA essentially rotates the set of points around their mean in order to align with the principal components. This moves as much of the variance as possible (using an orthogonal transformation) into the first few dimensions. The values in the remaining dimensions, therefore, tend to be small and may be dropped with minimal loss of information.”
The problem with following this without skepticism is that, according to the guidelines of PCA, our problem depends on using the “wrong” dimension. (The properties are not quite as arbitrary as they may seem. For example, race data for long distances that includes both men and women have this quality because the variation within gender of total race times can be greater than the variation between gender times.)
R, e.g.: Bokal’s cheese balls
Recently, someone in the office (Bokal?) was eating cheese balls from a barrel.
My first reaction was “Is this food?” My second was slightly less judgmental and more fertile for R exploration “In what context does this food live?” This sent me off to the web and R. I found a fairly comprehensive data set for various food and plotted Bokal’s Cheese ball in context for fat, carbs and protein–mostly what food is (fiber?). I didn’t get mass-density information, so I plotted by energy density.
And the scatter plots show more clearly that Cheese balls are out on the edge–only possible with heroic technological means.
#!/usr/bin/env Rscript
#
# Scott Hendrickson
# 2012-01-27
#
library(ggplot2)
library(gridExtra)
# read in nutritional data
cnt<-read.table("./foodNutritionalData.csv", header=TRUE, sep=",")
cnt$edensity <- cnt$Energy/cnt$Weight
cnt$feratio <- cnt$Fat/cnt$Energy
cnt$peratio <- cnt$Protein/cnt$Energy
cnt$ceratio <- cnt$Carbohydrate/cnt$Energy
summary(cnt)
# single out Bokal's cheese balls
cnt$Type <- "Everything Else"
# this record was entered first in the file
cnt$Type[c(1)] <- "Bokal's Cheese Balls"
# cnt[1,]
#order and context by energy density
cntbyedensity <- cnt[order(cnt$edensity),]
ordr <- match("Bokal's Cheese Balls", cntbyedensity$Type)
print(cntbyedensity[ (ordr - 5): (ordr + 5), ], width = ,digits = 3)
# arrow - horizontal arrow pointing to the x value of point offset by aoffset
# point(x,y); point$dx, point$dy size; ad direction; xoffset arrow point offset; xhf, xhf arrow size fractions
arrow <- function(point, ad=1, xoffset=0, xhf=0.3, yhf=0.3) {
xhead <- xhf * point$dx
yhead <- (1. + yhf) * point$dy
data.frame(
x = c( point$x-ad*xoffset,
point$x-ad*xhead, point$x-ad*xhead,
point$x-ad*point$dx, point$x-ad*point$dx,
point$x-ad*xhead, point$x-ad*xhead,
point$x-ad*xoffset),
y = c( point$y,
point$y+yhead,
point$y-point$dy/2., point$y-point$dy/2.,
point$y+point$dy/2., point$y+point$dy/2.,
point$y-yhead,
point$y)
)
}
## Plots
pnt_o <- data.frame( x <- c(cnt[1,"edensity"]), y <- c(80), dx <- 4, dy <- 20 )
arrow_o <- arrow( pnt_o, -1 )
ptext_o <- data.frame(label="Bokal's Cheese Balls", x=pnt_o$x + 2, y=pnt_o$y)
lx_o <- c(cnt$edensity[1], cnt$edensity[1])
ly_o <- c(10, 90)
o <-qplot(edensity, data=cnt, geom="histogram",
binwidth = 0.2,
xlab="Energy Density (cal/g)",
ylab="Number of Foods") +
geom_line(aes(lx_o,ly_o), color="red") +
geom_polygon(aes(x,y), data=arrow_o, fill="yellow") +
geom_text(aes(x, y, label=label), ptext_o)
#
pnt_a <- data.frame( x <- c(cnt[1,"feratio"]), y <- c(150), dx <- 0.07, dy <- 60 )
arrow_a <- arrow( pnt_a , -1)
ptext_a <- data.frame(label="Bokal's Cheese Balls", x=pnt_a$x + 0.035, y=pnt_a$y)
lx_a <- c(cnt$feratio[1], cnt$feratio[1])
ly_a <- c(50, 200)
a <- qplot(feratio, data=cnt, geom="histogram",
binwidth = 0.004,
xlab="Fat/Energy (g/cal)",
ylab="Number of Foods") +
geom_line(aes(lx_a,ly_a), color="red") +
geom_polygon(aes(x,y), data=arrow_a, fill="yellow") +
geom_text(aes(x, y, label=label), ptext_a)
#
pnt_b <- data.frame( x <- c(cnt[1,"peratio"]), y <- c(110), dx <- 0.14, dy <- 30 )
arrow_b <- arrow( pnt_b ,-1 )
ptext_b <- data.frame(label="Bokal's Cheese Balls", x=pnt_b$x + 0.07, y=pnt_b$y)
lx_b <- c(cnt$peratio[1], cnt$peratio[1])
ly_b <- c(80, 140)
b <- qplot(peratio, data=cnt, geom="histogram",
binwidth = 0.005,
xlab="Protein/Energy (g/cal)",
ylab="Number of Foods") +
geom_line(aes(lx_b,ly_b), color="red") +
geom_polygon(aes(x,y), data=arrow_b, fill="yellow") +
geom_text(aes(x, y, label=label), ptext_b)
#
pnt_c <- data.frame( x <- c(cnt[1,"ceratio"]), y <- c(100), dx <- 0.24, dy <- 30 )
arrow_c <- arrow( pnt_c ,-1 )
ptext_c <- data.frame(label="Bokal's Cheese Balls", x=pnt_c$x + 0.12, y=pnt_c$y)
lx_c <- c(cnt$ceratio[1], cnt$ceratio[1])
ly_c <- c(20, 120)
c <- qplot(ceratio, data=cnt, geom="histogram",
binwidth = 0.005,
xlab="Carb/Energy (g/cal)",
ylab="Number of Foods") +
geom_line(aes(lx_c,ly_c), color="red") +
geom_polygon(aes(x,y), data=arrow_c, fill="yellow") +
geom_text(aes(x, y, label=label), ptext_c)
png(filename = "./denrat.png", width = 520, height = 900, units = 'px')
print(
grid.arrange(o, a, b, c, ncol=1)
)
dev.off()
# this would work better with volume density
#p <- ggplot(cnt, aes(Energy, Fat),
# xlab="Energy (cal)",
# ylab="Fat (g)")
# try some fat-energy relationships
#p + geom_point(aes(color = Type )) +
# scale_x_log2() +
# scale_y_log2()
#p + geom_point(aes(color = Type, size = edensity )) +
# scale_x_log2() +
# scale_y_log2()
# context for Bokal's cheese balls
q <- ggplot(cnt, aes(x = edensity))
q1 <- q +
geom_point(aes(edensity, feratio, color = Type)) +
xlab("Energy Density (cal/g)") +
ylab("Fat/Energy (g/cal)") +
geom_point(aes(edensity, feratio), data=cnt[1,], size = 4, color="red")
q2 <- q +
geom_point(aes(edensity, peratio, color = Type)) +
xlab("Energy Density (cal/g)") +
ylab("Protein/Energy (g/cal)") +
geom_point(aes(edensity, peratio), data=cnt[1,], size = 4, color="red")
q3 <- q +
geom_point(aes(edensity, ceratio, color = Type)) +
xlab("Energy Density (cal/g)") +
ylab("Carb/Energy (g/cal)") +
geom_point(aes(edensity, ceratio), data=cnt[1,], size = 4, color="red")
png(filename = "./relrat.png", width = 520, height = 900, units = 'px')
print(
grid.arrange(q1, q2, q3, ncol=1)
)
dev.off()
(Bonus here would be some R-ish way to define and see “near” foods”)
R, e.g.: New Years Eve Tweets
First, simple plotting with ggplot.
New Year’s eve is an interesting time to look at twitter volumes because the event moves from time zone to time zone over a 24 hour period. Below are hourly NY-related tweets by UTC.

The three major humps in volume correspond to the Americas, Europe and Asia.
Not too surprisingly, comparing 2010 and 2011 shows twitter usage growing in most timezones. In the 2010 vs. 2011 scatter plot there are three outliers for growth.

(Basic skill to add here is to exclude the outliers from the linear regression in some R-ish way, rather than manually. Use this to estimate volumes based on the other points and output the difference.)
#!/usr/bin/env Rscript
#
# Plot time series volume data
# install.packages("ggplot2", dependencies = TRUE )
# install.packages("gridExtra", dependencies = TRUE )
library(ggplot2)
library(gridExtra)
X <-read.delim("./twitter.timeline.csv.byhour.csv", sep=",", header=TRUE)
# parse datetime strings
X$date <- as.POSIXct(X$time)
summary(X)
# ts is unix style timestamp
sub0 <- X[ (X$ts > 1293814803 & X$ts < 1293901203) , ]
sub1 <- X[ (X$ts > 1325350803 & X$ts < 1325437203) , ]
summary(sub0)
summary(sub1)
p0 <- qplot(sub0$date, sub0$count, geom="bar", stat="identity", xlab="2010: 12/31 - 1/1 GMT", ylab="tweets/hr") + scale_y_continuous(limits = c(0, 1100000))
p1 <- qplot(sub1$date, sub1$count, geom="bar", stat="identity", xlab="2011: 12/31 - 1/1 GMT", ylab="tweets/hr") + scale_y_continuous(limits = c(0, 1100000))
png(filename = "./NYTweets-World-2011.png", width = 600, height = 600, units = 'px')
print(
p1
)
dev.off()
png(filename = "./NYTweets-World-2011v2010.png", width = 600, height = 600, units = 'px')
print(
grid.arrange(p0, p1, ncol = 1, main="Compare 2010 vs 2011")
)
dev.off()
png(filename = "./NYTweets-Growth-2011v2010.png", width = 600, height = 600, units = 'px')
print(
qplot(sub0$count, sub1$count, xlab="2010 tweets/hr", ylab="2011 tweets/hr") +
geom_smooth(method=lm,se=FALSE) +
scale_y_continuous(limits = c(0, 1100000)) +
scale_x_continuous(limits = c(0, 1100000))
)
dev.off()
R, for example… Moving here
R is a statistical computation language. As a programming language, I am not too fond of it, but for many quick analysis and visualization problems, R is just the tool. The ggplot package from Hadley Wickham produces very good looking plots quickly.
observations about R:
- I thought I would master the basics in about 20% of the time it has actually taken. It seems R has a steep learning curve as they say. May this is only true when you are coming from other object/procedural languages (my experience is with C, Java, Python, C++…)?
- Nearly every frustration I have experience with R comes from not understanding the required form the data and not knowing the tools for transform data. When you get the data in the right structure, many powerful packages will give sophisticated answers in a line or two.
- When I start writing functions and loop in R, it is type to move to Python/NumPy/SciPy/Matplotlib.
- (There is an art to searching for things related to “R” on the web–almost as bad as searching for the game of “go”.)
I started posting R examples on Tumblr. The idea behind “R, for example…” is to provide input data, code and output to quickly walk through a short analysis illustrating new (to me) or useful R. But the Tumblr platform really isn’t very friendly to math, data and plots. Going forward, R examples will appear here on Skippy Records.
Text-formatted tables of numbers
This is a simple Python text formatting project. After searching for this capability for a little while, I decided to write my own as the following requirements seemed to be unique:
- Create a table of numbers from a CSV file
- Control the number of significant figures (significant digits) displayed
- Format numbers with “,” and ” ” like this 123,456.789 123
- Align columns on decimal points–this is best for readability
- Fit the columns automatically based on data and formatting
483,000. | 471.7 | 0.460 6 | 0.000 449 8
966,000. | 943.4 | 0.921 3 | 0.000 899 7
1,449,000. | 1,415. | 1.382 | 0.001 349
1,932,000. | 1,887. | 1.843 | 0.001 799
2,415,000. | 2,358. | 2.303 | 0.002 249
2,898,000. | 2,830. | 2.764 | 0.002 699
3,381,000. | 3,302. | 3.224 | 0.003 149
3,864,000. | 3,773. | 3.685 | 0.003 599
5,796,000. | 5,660. | 5.528 | 0.005 398
7,728,000. | 7,547. | 7.370 | 0.007 197
9,660,000. | 9,434. | 9.213 | 0.008 997
19,320,000. | 18,870. | 18.43 | 0.017 99
38,640,000. | 37,730. | 36.85 | 0.035 99
57,960,000. | 56,600. | 55.28 | 0.053 98
77,280,000. | 75,470. | 73.70 | 0.071 97
96,600,000. | 94,340. | 92.13 | 0.089 97
144,900,000. | 141,500. | 138.2 | 0.134 9
193,200,000. | 188,700. | 184.3 | 0.179 9
289,800,000. | 283,000. | 276.4 | 0.269 9
386,400,000. | 377,300. | 368.5 | 0.359 9
579,600,000. | 566,000. | 552.8 | 0.539 8
772,800,000. | 754,700. | 737.0 | 0.719 7
966,000,000. | 943,400. | 921.3 | 0.899 7
1,159,000,000. | 1,132,000. | 1,106. | 1.080
Investment lifetime…Three follow-up thoughts
(1) I think on the whole my life has been much easier than either of my parent’s lives. At least easier in the sense that the context of stability, motivation, rich experiences and economic support they provided was at a higher level than their parents provided them. One of the points of my post was to illustrate how there may be some differences in how I think of luck vs. direct outcomes of my actions. My parents don’t acknowledge luck much while I identify it as key to most of what I have accomplished. Luck as to where I was born, that I was born to my parents, that I didn’t die of cholera as an infant, …
(2) This one seemed obvious, but in case it didn’t quite come out: The charts shown are of much of the same data–it is the window on the data that may be an explanation for the difference in perceptions.
(3) Something I did not point out, but also probably contributes to a difference in perceptions is that, during the time period represented in my window of investing, the very wealthy got decidedly wealthier. Clearly, we can be sure that the wealthy were not relying on the stock market to grow their wealth in the way that I was to grow my investments.
Where were the very wealthy investing? I don’t know.
But many of my contemporaries have been encouraged to and have practiced investing their new-style pensions (401Ks) in the stock market, which isn’t likely to go anywhere for at least a few more years. Among other things, this makes Bush’s proposal to privatize Social Security by “letting” us invest in the public markets seem rather cynical.
Investment-lifetime comparison: me vs. my father
Like many of my contemporaries, I often find myself at political odds with my parents. I always find this strange, because I don’t feel that I rejected the values of my parents. My concerns for the workings of the economy and the country, my concerns that we practice responsible stewardship of the Earth, my feelings of what kinds of business behavior is moral and good for society and what isn’t, my concerns for security and what we trade for it, etc. came fairly directly from my parents. I learned my ethical patterns and moral sensitivities from Mom and Dad. However, I am pretty sure we reliably vote for different politicians (and a different set of ideas!). I have only anecdotes, but this pattern seems to be repeated with many of my contemporaries.
What experiences were different for my parent’s generation and mine? Can any of these differences help account for similar values leading to opposing politics? I am not sure this explains much, but below is a pair of charts that illustrate one way in which my parent’s experience and my experience lead us to see things differently. I did the analysis below out of curiosity, but I decided to write it up after I realized a similar analysis could be done with health care costs or income gains, neither of which have evolved in our favor in the last 15 years. (For broader perspective and context on where the S&P is now relative to the past, check out advisorperspectives.)
The scenario I modeled was investing for retirement: Imagine that at 30 years old, both my father and I have resources and get serious about investing. We invest the same amount of money every year until age 65, for my father, this is the period from 1966-2001, for me, the period from 1996-2031.
During the 30 years my father invested, the stock market did this:
He ends up with 7.23 times the money he put into the market. (Simple model: invest and compound gains (losses) on the first day of each year.)
During my investment lifetime, the stock market has done this:
And I am faced with the realisation that I currently have 1.18 times what I have put into the market–I have gained 18% in the last 15 years.
One clear perspective difference is that my father is looking back on the last 20 years of gains as the model for what his and his peer’s individual investments got him. I, on the other had am looking forward to 20 years of investments with uncertainty. My 15% gains are more appropriately compared to my father’s 41% during his first 15 years of investing. Clearly, the bulk of my father’s gains come in the second half of the cycle, the last 20 years. But at the point I am in my life, he was well ahead.
I doubt the important difference is so much how things happened in the past as how things appear to be going forward. What sort of action do I believe will significantly and positively influence the outcome of prosperity for myself and my family and my community? Is this very different from the sorts of actions my father believed were effective at the same point in his life?
Real Time Social Search: It’s coming around again?
I worked at a company called me.dium a few years ago. They have since turned into OneRiot and gone on to another target market and business model. Recently, some me.dium alums circulated an email noting a new company with a very similar idea to the old me.dium. The new company, Whoislive, is jumping in with a web browser sidebar that you install when you start using their service to surf with your friends (“Surf with Friends”(TM) me.dium?).
From their website:
For the first time ever, see who is browsing on any Web page. Chat, get tips, share links, and meet new people with similar interests. Use the Web like you always do, only now, see what happens when you can see Who is Live!
Well…sort of.
One major difference between Whoislive and me.dium, again, from their website:
Can I also see people who are browsing on other pages of the same website?
No. You can only see who is live with you on one page at a time.
Me.dium had a map of the local neighborhood of sites your friends were surfing. More or less creepy? You decide.
Anyway, this reminded me that, while the business model of me.dium didn’t get anywhere, there was some interesting technology under the covers. It was just getting off the ground, but there were hints at some possibly fertile directions for research and development. It also reminded me that I started to write this up a few years ago and still had a draft around. Most of the ideas are covered under patent applications made by me.dium (and so are public information already, but nearly unreadable) and some are fairly “obvious” (though that seems to be mostly neglected as criteria for patentability (see, e.g. this patent for a wooden dog toy you throw)). The explorations of how the system might work and attempts to create some toy examples to assist with explanations, as well as, the section on proposed new signals is my work. If you are at all interested in the details of how we were trying to create surf-with-friends, take a peak. (Fair warning, this draft hasn’t been edited much, so some of it might be a little cumbersome.)
Smart tear-down and build-up of trust from Sinek
Sinek talks about what is common in us and how it affects social interactions–and business speak. Is trust simply pattern recognition? I think this is an interesting evolutionary and social question. I hope it is no more cynical than necessary.
Certainty and possibilities
I don’t know that I love the name–”possibilians” seems a bit clumsy to me–but I value the idea.
I find too many skeptics who assume that we know a lot more about what isn’t than it seems possible to know. Vigilant skepticism adds to the quality of conversation. As a vehicle for combating dangerous silliness like parents not vaccinating children against smallpox or polio, sorting through the facts carefully and re-evaluating the explanation, insisting on coherent, repeatable experiments makes sense. But knowing what isn’t is much more difficult. Certainty on the right and the left has left us less able to solve difficult problems.
Neuroscientist David Eagleman’s Pop Tech! presentation is worth a watch.
David Eagleman on Possibilianism from PopTech on Vimeo.









