# Apply LDA to Generate User Interest Topics

We generate the training dataset in the following steps:
1) Select "UserID, Query" pairs from one year of image-search logs, then group the rows by UserID and merge all DISTINCT queries per user.
2) Filter the dataset to keep only heavy users.
This leaves ~1.5M users.

For the model, we use LDA with num_topics = 200 and iterations = 200, which yields a 200-dimensional topic vector for each user. The model also returns representative keywords for each latent topic.
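As a rough illustration of how per-user topic vectors and topic keywords come out of LDA, here is a minimal collapsed Gibbs sampler in plain Python. This is only a sketch: the real job (200 topics, ~1.5M users) would use a scalable library such as gensim or Mallet, and the documents below are toy data, not the actual query logs.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    # count tables: doc-topic, topic-word, topic totals
    ndk = [[0] * K for _ in docs]
    nkw = [[0] * V for _ in range(K)]
    nk = [0] * K
    assign = []
    for d, doc in enumerate(docs):          # random initial topic assignments
        zs = []
        for w in doc:
            z = rng.randrange(K)
            zs.append(z)
            ndk[d][z] += 1; nkw[z][w2i[w]] += 1; nk[z] += 1
        assign.append(zs)

    for _ in range(iterations):             # collapsed Gibbs sweeps
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                z, wi = assign[d][n], w2i[w]
                ndk[d][z] -= 1; nkw[z][wi] -= 1; nk[z] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][wi] + beta)
                           / (nk[k] + V * beta) for k in range(K)]
                z = rng.choices(range(K), weights=weights)[0]
                assign[d][n] = z
                ndk[d][z] += 1; nkw[z][wi] += 1; nk[z] += 1

    # per-document (per-user) topic vector, normalized
    theta = [[(c + alpha) / (sum(row) + K * alpha) for c in row] for row in ndk]
    # top keywords of each latent topic
    topics = [[vocab[i] for i in sorted(range(V), key=lambda i: -nkw[k][i])[:3]]
              for k in range(K)]
    return theta, topics

# toy "users": each document is one user's distinct queries
docs = [["cat", "dog", "pet"], ["gpu", "cuda", "kernel"], ["dog", "pet", "cat"]]
theta, topics = lda_gibbs(docs, num_topics=2)
```

Each row of `theta` is one user's topic distribution (it sums to 1), matching the 200-dimensional vectors described above when `num_topics = 200`.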

# SVD++ Model in Recommender System

I'd like to share my recent reading of Koren's paper "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model", which is well known from the Netflix Prize.
My own industry experience with recommender systems was app recommendation for the Chinese Android market at Wandou Lab. Since we had to handle 30M+ users and 300K+ apps on a very cheap machine, only ItemCF and content-based k-nearest-neighbor algorithms were used in our model at the very beginning; later, LDA and topic-clustering methods were applied to improve some categories.
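For reference, the rating rule SVD++ uses in Koren's paper combines the global mean, user/item biases, the user factor, and an implicit-feedback term. Below is a plain-Python sketch of that prediction formula; all parameter values are made-up toy numbers, not trained ones.

```python
import math

def svdpp_predict(mu, b_u, b_i, q_i, p_u, y_items):
    """r_hat = mu + b_u + b_i + q_i . (p_u + |N(u)|^(-1/2) * sum_j y_j)"""
    f = len(q_i)
    implicit = [0.0] * f
    for y in y_items:                 # one y_j vector per item the user rated
        for k in range(f):
            implicit[k] += y[k]
    if y_items:
        norm = 1.0 / math.sqrt(len(y_items))
        implicit = [v * norm for v in implicit]
    latent = [p_u[k] + implicit[k] for k in range(f)]
    return mu + b_u + b_i + sum(q_i[k] * latent[k] for k in range(f))

r = svdpp_predict(mu=3.6, b_u=0.2, b_i=-0.1,
                  q_i=[0.5, -0.3], p_u=[0.4, 0.1],
                  y_items=[[0.2, 0.0], [0.0, 0.2]])
```

The implicit term is what distinguishes SVD++ from plain SVD-style factorization: which items a user rated carries signal even before the rating values are used.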

# Stochastic Gradient Descent (SGD) - Singular Value Decomposition (SVD) Algorithm Notes

In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It has many useful applications in signal processing and statistics. Numerical linear algebra provides several methods to compute the SVD accurately, such as Householder reduction, Golub–Reinsch SVD, and high relative accuracy bidiagonal SVD. Details can be found in the paper at http://www.cs.utexas.edu/users/inderjit/public_papers/HLA_SVD.pdf .

In machine learning, SVD is widely used for prediction. Consider the KDD Cup Yahoo! Music recommendation task: the user-item rating matrix can be decomposed using only the non-zero ratings and then reconstructed to predict the ratings at the empty positions. From this angle, exact SVD algorithms are no longer suitable, because they treat empty positions as real zero ratings, which leads to bad prediction results.
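A minimal sketch of the SGD-style alternative: factorize the rating matrix by descending only on the observed entries, so missing ratings are never treated as zeros. The rank, learning rate, and regularization below are illustrative, not tuned values.

```python
import random

def sgd_mf(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.02,
           epochs=500, seed=0):
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:             # loop over observed entries only
            pred = sum(P[u][k] * Q[i][k] for k in range(rank))
            err = r - pred
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q

# toy (user, item, rating) triples; unrated pairs simply never appear
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
P, Q = sgd_mf(ratings, n_users=3, n_items=2)
pred = sum(P[0][k] * Q[0][k] for k in range(2))  # reconstruct a known rating
```

Any empty position can then be predicted the same way, as the dot product of the corresponding user and item factor vectors.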

# 湾区找工作小记

I joined Microsoft Bing through campus recruiting in July 2014, originally planning to look for a transfer to the headquarters after a full year there. But that did not go as smoothly as expected, and since Dora works in California and strongly dislikes Seattle weather, I started looking for Bay Area jobs at the beginning of 2015.

# An Incremental Sampling Method

Sampling K points out of a total of N is trivial; we simply use random numbers to decide whether to keep each point. However, in our daily work the total number of points grows day by day, so sampling K points incrementally becomes a challenge. The following algorithm introduces a simple and efficient sampling method; a proof is given after it.

Incremental Sampling Method
- step 1: Add the first $k$ points to the selected set $\mathcal{S}$;
- step 2: For index $N = k+1, k+2, \cdots$, generate a uniform random number $r \in (0,1)$; if $r > \frac{k}{N}$, skip the point; otherwise replace a point in $\mathcal{S}$ chosen uniformly at random (each with probability $\frac{1}{k}$).

We now prove that for any $N > k$, each point is selected with probability $\frac{k}{N}$:

The index starts from 1, and the points divide into two cases relative to the selected number $k$:
1. For index $1\le i \le k$, the probability that the $i$-th point remains in $\mathcal{S}$ is
$p_i = (1-\frac{k}{k+1}\cdot \frac{1}{k})\cdot(1-\frac{k}{k+2}\cdot \frac{1}{k})\cdots(1-\frac{k}{N}\cdot \frac{1}{k}) = \frac{k}{k+1}\cdot\frac{k+1}{k+2}\cdots\frac{N-1}{N} = \frac{k}{N}$,
since each factor $1-\frac{k}{n}\cdot\frac{1}{k} = \frac{n-1}{n}$ and the product telescopes.

2. For index $i > k$, the probability that it is selected into the final set is
$p_i = \frac{k}{i}\cdot(1-\frac{k}{i+1}\cdot\frac{1}{k})\cdots(1-\frac{k}{N}\cdot \frac{1}{k}) = \frac{k}{i}\cdot\frac{i}{i+1}\cdots\frac{N-1}{N} = \frac{k}{N}$.

This method can, of course, be used for one-pass loading and sampling of large-scale data.
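The two steps above (better known as reservoir sampling) can be sketched in a few lines:

```python
import random

def reservoir_sample(stream, k, seed=None):
    rng = random.Random(seed)
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            sample.append(x)              # step 1: keep the first k points
        elif rng.random() <= k / n:       # step 2: keep x with probability k/n
            sample[rng.randrange(k)] = x  # evict a uniformly chosen point
    return sample

sample = reservoir_sample(range(1000), 10)
```

Because the stream is consumed one element at a time and only $k$ points are ever held in memory, this is exactly the one-pass setting described above.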

# Clustering Methods Review

Reference: Zhou Z H. Ensemble methods: foundations and algorithms[M]. CRC Press, 2012.

Clustering aims to find the inherent structure of unlabeled data by grouping it into clusters of objects. A satisfactory clustering method should increase intra-cluster similarity while decreasing inter-cluster similarity. Define the sample set $D = \{\bf{x}_1, \bf{x}_2, \cdots, \bf{x}_m\}$; then the clusters can be defined as $\{C_j \mid j = 1, \cdots, k\}$ with $\cup_{j=1}^k C_j = D$ and $C_i\cap C_j = \varnothing$ for $i \neq j$.
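As a concrete instance of this objective, here is a minimal k-means sketch on toy 1-D data (real use would call a library such as scikit-learn). K-means greedily increases intra-cluster similarity by assigning each point to its nearest center and recomputing the centers; the initialization below is deliberately naive.

```python
def kmeans(points, k, iters=20):
    centers = list(points[:k])                  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:                        # assign to nearest center
            j = min(range(k), key=lambda ci: (x - centers[ci]) ** 2)
            clusters[j].append(x)
        centers = [sum(c) / len(c) if c else centers[j]   # recompute means
                   for j, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centers, clusters = kmeans(points, k=2)
```

The returned clusters form a partition of the sample set, matching the definition above: every point belongs to exactly one $C_j$.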

# Logistic Regression - Mathematics Related

Reference: http://www.cnblogs.com/sparkwen/p/3441197.html

$h(z) = \frac{1}{1+e^{-z}}$

Let $\bf{x}=\{x_1, x_2, \cdots, x_n\}$ be the sample data and $\bf{y}=\{y_1, y_2, \cdots, y_n\}$ the corresponding class labels. For a binary classification problem, we can build a logistic regression model as follows.

$P\{y=1|x,\theta\} = \frac{1}{1+e^{-\theta^Tx}}$

$P\{y=0|x,\theta\} = \frac{1}{1+e^{\theta^Tx}} = 1-P\{y=1|x,\theta\}$

Let $h_{\theta}(x) = \frac{1}{1+e^{-\theta^Tx}}$; then the posterior probability of a single sample is:
$p(y|x,\theta) = h_{\theta}(x)^y(1-h_{\theta}(x))^{1-y}$,

$L(\theta|x,y) = \Pi_{i=1}^n h_{\theta}(x_i)^{y_i}(1-h_{\theta}(x_i))^{1-y_i}$

$l(\theta) = \text{log}L(\theta|x,y) = \sum_{i=1}^n y_i\text{log}h_{\theta}(x_i) + (1-y_i)\text{log}(1-h_{\theta}(x_i))$

$\frac{\partial}{\partial\theta}(l(\theta)) = \frac{\partial}{\partial\theta}\sum_{i=1}^n\Big(y_i\text{log}h_{\theta}(x_i) + (1-y_i)\text{log}(1-h_{\theta}(x_i))\Big)$
$\qquad = \sum_i\bigg( \frac{y_i}{h_{\theta}(x_i)} - \frac{1-y_i}{1-h_{\theta}(x_i)}\bigg)\frac{\partial}{\partial\theta}(\frac{1}{1+e^{-\theta^Tx_i}})$
$\qquad = \sum_i\bigg( \frac{y_i}{h_{\theta}(x_i)} - \frac{1-y_i}{1-h_{\theta}(x_i)}\bigg)\frac{e^{-\theta^Tx_i}}{(1+e^{-\theta^Tx_i})^2}x_i$
$\qquad = \sum_i\bigg( \frac{y_i}{h_{\theta}(x_i)} - \frac{1-y_i}{1-h_{\theta}(x_i)}\bigg) h_{\theta}(x_i)(1-h_{\theta}(x_i))x_i$
$\qquad = \sum_i(y_i-h_{\theta}(x_i))x_i$
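The final gradient $\sum_i (y_i - h_{\theta}(x_i))x_i$ plugs directly into gradient ascent on the log-likelihood. Below is a minimal sketch; the data, learning rate, and epoch count are made-up toy values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg(X, y, lr=0.1, epochs=2000):
    theta = [0.0] * len(X[0])
    for _ in range(epochs):                 # full-batch gradient ascent
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            h = sigmoid(sum(t * x for t, x in zip(theta, xi)))
            for j in range(len(theta)):
                grad[j] += (yi - h) * xi[j]     # (y_i - h_theta(x_i)) * x_i
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

# bias folded in as a constant 1.0 feature
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta = fit_logreg(X, y)
p = sigmoid(theta[0] + theta[1] * 3.0)      # P(y=1) at x = 3
```

Note the sign convention: we ascend on $l(\theta)$; minimizing the negative log-likelihood with gradient descent gives the identical update.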

# Quick Start of Gnuplot

A quick guide to using Gnuplot

Step 1: Prepare the data

speedup_1:
4 4
8 8.44
16 9.65
32 13.6
64 26.2
speedup_2:
4 4
8 5.909417545
16 7.055328586
32 7.102585194
64 7.373120673
(the other 2 data files are omitted)

Step 2: vi Speedup.gnu and write the following content into the file:
plot "speedup_1" t "Mesh Refinement" with linespoint lw 2, "speedup_2" t "Modify Region Marks" with linespoint lw 2, "speedup_3" t "Overall" with linespoint lw 2, "speedup_4" t "Ideal" with lines ls 0
set xlabel "Number of Processors" font "Times-Roman,20"
set ylabel "Speed Up" font "Times-Roman,20"
set xrange [0:64]
set xtics 16
set yrange [0:64]
set ytics 16
#set size 0.65,0.65

set key box
set key left top

set terminal png
set out 'speedup.png'
#set terminal postscript eps
#set terminal postscript color
#set terminal postscript solid
#set output 'speedup.eps'
replot

Step 3: gnuplot Speedup.gnu

plot "speedup_1" t "Mesh Refinement" with linespoint lw 2, "speedup_2" t "Modify Region Marks" with linespoint lw 2, "speedup_3" t "Overall" with linespoint lw 2, "speedup_4" t "Ideal" with lines ls 0
# This line reads the data files and labels them: linespoint plots with both lines and points, lines plots lines only; lw sets the line width and ls the line style/color. I found that ls 6 gives a solid yellow line and ls 0 a black dashed line.

set xlabel "Number of Processors" font "Times-Roman,20"
set ylabel "Speed Up" font "Times-Roman,20"
# Set the x- and y-axis labels; the part after font specifies the font

set xrange [0:64]
set xtics 16
set yrange [0:64]
set ytics 16
# Set the x- and y-axis ranges and tick spacing

set size 0.65,0.65
# Set the plot size relative to the default. For png output this line can be commented out; for eps it is better to keep it, since it makes the fonts relatively larger

set key box
# Draw a box around the legend
set key left top
# Place the legend at the top left; left bottom means the bottom left

set terminal png
set out 'speedup.png'
# Output as png

#set terminal postscript eps
#set terminal postscript color
#set terminal postscript solid
#set output 'speedup.eps'
# The lines above produce eps output on Linux; on Windows, drop the postscript keyword (untested -- you can try it)

replot
# Draw the plot

# 找工作总结

1. Companies that went silent after the written test:

2. Companies I could not afford to wait for:

JP MORGAN -- I applied for a QUANT position, mostly just for fun; I had heard about the global pay, though it's actually not that astonishing. I had a phone interview in October: questions on ordinary differential equations and on C++ object orientation. It felt like there are a few fixed directions, and they ask about whatever you claim to know. I think their programming assessment has a different emphasis from the big IT companies: they apparently don't ask algorithm questions. Someone explained that they care more about whether your code is written to standard than about creativity. Others I know who did phone interviews also hadn't received onsite invitations, so presumably they wait until all phone screens finish and then schedule onsite interviews in a batch. I personally felt the interviewer had a good impression of me, but even if I passed the phone screen, I would probably have declined the onsite interview anyway.

3. Companies that gave verbal offers:

4. Companies I didn't interview with but want to thank:

Dora is always encouraging me, and sometimes I really hate hearing that this will be fine and that will work out, because it feels worse when things fail. There may be more days of job hunting ahead, and this sense of insecurity stays with me all the time. Sometimes I think it would be nice to do research or teach in a stable environment, but if it really came to that, I couldn't stand it either. As a bundle of contradictions, all I can do is take it one step at a time; the future is completely out of control.