## An Incremental Sampling Method

Sampling K points out of N is trivial when N is known in advance: for each point we can draw a random number to decide whether to keep it. During daily work, however, the total number of points grows day by day, so sampling K points incrementally is a challenge. The following algorithm introduces a simple and efficient incremental sampling method; a proof is given after it.

Incremental Sampling Method
- step 1: Add the first $k$ points to the selected set $\mathcal{S}$;
- step 2: For index $N = k+1, \cdots$, generate a random number $r$ uniformly in $(0,1)$; if $r > \frac{k}{N}$, skip the point; else replace one of the points in $\mathcal{S}$, each chosen with equal probability $\frac{1}{k}$.

Proof that for any $N > k$, each point is selected with probability $\frac{k}{N}$:

The index starts from 1, and the points can be divided into two groups relative to the selected number $k$:
1. For index $1\le i \le k$, the probability that the $i$-th point remains in $\mathcal{S}$ is
$p_i = (1-\frac{k}{k+1}\cdot \frac{1}{k})\cdot(1-\frac{k}{k+2}\cdot \frac{1}{k})\cdots(1-\frac{k}{N}\cdot \frac{1}{k}) = \frac{k}{k+1}\cdot\frac{k+1}{k+2}\cdots\frac{N-1}{N} = \frac{k}{N}$.

2. For index $i > k$, the probability that it survives into the final set is
$p_i = \frac{k}{i}\cdot(1-\frac{k}{i+1}\cdot\frac{1}{k})\cdots(1-\frac{k}{N}\cdot \frac{1}{k}) = \frac{k}{i}\cdot\frac{i}{i+1}\cdots\frac{N-1}{N} = \frac{k}{N}$.

This method can of course be used for one-pass loading and sampling of large-scale data; it is the classic reservoir sampling algorithm.
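The two steps above can be sketched in C++ as follows. This is a minimal sketch: the function name `reservoirSample` and the use of `rand()` are illustrative choices, not part of the original note (a `<random>` engine would be preferred in modern C++).

```cpp
#include <cstdlib>
#include <vector>
using namespace std;

// Keeps k points out of a stream of unknown length; after N points have
// been seen, each one is in the reservoir with probability k/N.
vector<int> reservoirSample(const vector<int> &stream, int k) {
    vector<int> S;
    for (int N = 0; N < (int)stream.size(); N++) {
        if (N < k) {
            S.push_back(stream[N]);      // step 1: first k points go in
        } else {
            // step 2: keep the new point with probability k/(N+1);
            // if kept, it replaces a uniformly random slot of S.
            int r = rand() % (N + 1);    // uniform in [0, N]
            if (r < k) S[r] = stream[N];
        }
    }
    return S;
}
```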

## Clustering Methods Review

Reference: Zhou Z H. Ensemble methods: foundations and algorithms[M]. CRC Press, 2012.

Clustering aims to find the inherent structure of unlabeled data by grouping it into clusters of objects. A satisfactory clustering method should increase intra-cluster similarity while decreasing inter-cluster similarity. Define the sample set $D = \{\bf{x_1}, \bf{x_2}, \cdots, \bf{x_m}\}$; then the clusters can be defined as $\{C_j \mid j = 1, \cdots, k\}$ with $\cup_{j=1}^k C_j = D$ and $C_i\cap C_j = \varnothing$ for $i \neq j$.

Methods
1. Partitioning Methods
A partitioning method organizes $D$ into $k$ partitions by optimizing an objective partitioning criterion. $\bf{\text{K-Means}}$ is one of the most well-known partitioning methods. It optimizes the error function $\text{err} = \sum_{j=1}^k\sum_{x\in C_j} \text{dist}(x, \hat{x_j})^2$, where $\hat{x_j} = \frac{1}{|C_j|}\sum_{x \in C_j} x$ is the center of cluster $C_j$.

Since minimizing this error function exactly is computationally infeasible, k-means adopts an iterative relocation technique to find a good solution heuristically.
- step 1: select $k$ instances from $D$ as initial centers;
- step 2: assign each instance to its nearest center;
- step 3: update each center $\hat{x_j}$ as the mean of its cluster;
- step 4: repeat steps 2-3 until convergence.

(Figure omitted: illustration of how k-means works.)
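The four steps can be sketched as below. This is a minimal 1-D sketch with $k = 2$; the function name `kMeans1D` is mine, and taking the first and last data point as initial centers is a simplification of step 1 (which picks $k$ instances from $D$).

```cpp
#include <cmath>
#include <vector>
using namespace std;

// Minimal 1-D k-means with k = 2; returns the two cluster centers.
vector<double> kMeans1D(const vector<double> &D, int iters) {
    vector<double> centers;
    centers.push_back(D.front());    // simplified step 1: pick two
    centers.push_back(D.back());     // instances as initial centers
    vector<int> label(D.size(), 0);
    for (int it = 0; it < iters; it++) {
        // step 2: assign each instance to its nearest center
        for (size_t i = 0; i < D.size(); i++)
            label[i] = fabs(D[i]-centers[0]) <= fabs(D[i]-centers[1]) ? 0 : 1;
        // step 3: recompute each center as the mean of its cluster
        double sum[2] = {0, 0}; int cnt[2] = {0, 0};
        for (size_t i = 0; i < D.size(); i++) {
            sum[label[i]] += D[i];
            cnt[label[i]]++;
        }
        for (int j = 0; j < 2; j++)
            if (cnt[j]) centers[j] = sum[j] / cnt[j];
        // step 4: a fixed iteration count stands in for a convergence test
    }
    return centers;
}
```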

Another well-known clustering algorithm is $\bf{\text{Affinity Propagation (AP)}}$, which introduces "message passing" between data points. AP takes as input a collection of real-valued similarities between data points, where the similarity $s(i,k)$ indicates how well the data point with index $k$ is suited to be the class center for data point $i$. When the goal is to minimize the squared error, each similarity is set to a negative squared Euclidean distance: for points $\bf{x_i}$ and $\bf{x_k}$, $s(i,k) = -\|\bf{x_i-x_k}\|^2$.
Rather than requiring that the number of clusters be prespecified, AP takes as input a real number $s(k,k)$ for each data point $k$, such that data points with larger values of $s(k,k)$ are more likely to be chosen as class centers. These values are referred to as 'preferences', and the chosen class centers are called 'exemplars'. There are two kinds of messages exchanged between data points, 'responsibility' and 'availability':
- Responsibility: matrix $R$, where $r(i,k)$ denotes how well-suited $\bf{x_k}$ is to serve as the exemplar for point $i$, taking into account other potential exemplars for point $i$.
- Availability: matrix $A$, where $a(i,k)$ denotes how appropriate it would be for point $i$ to choose point $k$ as its exemplar, taking into account the support from other points that point $k$ should be an exemplar. Messages need only be exchanged between pairs of points with known similarities.
(Figures omitted: illustration of the AP message-passing procedure.)

The advantages of AP include stable convergence and no need to prespecify the number of clusters or choose initial points. However, due to its $\mathcal{O}(N^2\log N)$ time complexity, a tradeoff is required on large-scale data.
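One damped round of the two message updates can be sketched as follows. This is a sketch of the standard Frey-Dueck update rules with damping factor `lambda` (0.5 is the commonly suggested value); the helper names are mine, and `S` is assumed to already hold the preferences $s(k,k)$ on its diagonal.

```cpp
#include <algorithm>
#include <vector>
using namespace std;

typedef vector<vector<double> > Mat;

// One damped iteration of AP message passing on similarity matrix S,
// updating responsibilities R and availabilities A in place.
void apIterate(const Mat &S, Mat &R, Mat &A, double lambda) {
    int n = S.size();
    // Responsibility: r(i,k) = s(i,k) - max_{k'!=k} { a(i,k') + s(i,k') }
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double best = -1e300;
            for (int kp = 0; kp < n; kp++)
                if (kp != k) best = max(best, A[i][kp] + S[i][kp]);
            R[i][k] = lambda*R[i][k] + (1-lambda)*(S[i][k] - best);
        }
    // Availability: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
    //               a(k,k) = sum_{i'!=k} max(0, r(i',k))
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double s = 0;
            for (int ip = 0; ip < n; ip++)
                if (ip != i && ip != k) s += max(0.0, R[ip][k]);
            double v = (i == k) ? s : min(0.0, R[k][k] + s);
            A[i][k] = lambda*A[i][k] + (1-lambda)*v;
        }
}
```

After convergence, the exemplar for point $i$ is the $k$ maximizing $a(i,k)+r(i,k)$.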

2. Hierarchical Methods
Hierarchical methods create a hierarchy of clusterings on $D$ at various levels of granularity, where a specific clustering can be obtained by thresholding the hierarchy at a specific level. A typical method is the SAHN (sequential agglomerative hierarchical non-overlapping) method:
- step 1: each data point is placed into a cluster of its own, so the dissimilarity matrix $D_{m\times m}$ has components $D(i,j) = \text{dist}(x_i, x_j)$;
- step 2: the two closest clusters $C_i$ and $C_j$ are identified from $D$ and replaced by the agglomerated cluster $C_h$, with $D(h,k) = \alpha_i D(i,k) + \alpha_j D(j,k) + \beta D(i,j) + \gamma |D(i,k)-D(j,k)|$.
All coefficients are fixed before the calculation.
(Figure omitted: illustration of how SAHN works.)
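For concreteness, one common choice of coefficients is single linkage ($\alpha_i=\alpha_j=0.5$, $\beta=0$, $\gamma=-0.5$), under which the update reduces to the minimum of the two distances. A one-function sketch (the function name is mine):

```cpp
#include <cmath>

// Lance-Williams update D(h,k) for the merged cluster h = i U j,
// with the single-linkage coefficients, which give min(D(i,k), D(j,k)).
double mergedDistance(double Dik, double Djk, double Dij) {
    const double ai = 0.5, aj = 0.5, beta = 0.0, gamma = -0.5;
    return ai*Dik + aj*Djk + beta*Dij + gamma*fabs(Dik - Djk);
}
```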

3. Density-based Methods
A density-based method constructs clusters on $D$ based on the notion of density, where regions of instances with high density are regarded as clusters separated by regions of low density. DBSCAN is a representative density-based clustering method, which characterizes the density of the data space with a pair of parameters $(\epsilon, MinPts)$. This pair of parameters determines a unique $\epsilon$-neighborhood for any given point $\bf{x}$.
- step 1: DBSCAN identifies core objects which satisfy the requirement imposed by the $(\epsilon, MinPts)$ parameters.
- step 2: It forms clusters by iteratively connecting the directly density-reachable instances starting from those core objects.
- step 3: The connecting process terminates when no new data point can be added to any cluster.
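The three steps can be sketched as below. This is a minimal 1-D sketch under simplifying assumptions: the function name `dbscan1D` is mine, neighborhoods are found by brute force, and noise handling is simplified (unassigned points keep label -1).

```cpp
#include <cmath>
#include <queue>
#include <vector>
using namespace std;

// Minimal 1-D DBSCAN; returns a cluster label per point (-1 = noise).
vector<int> dbscan1D(const vector<double> &D, double eps, int minPts) {
    int n = D.size();
    vector<int> label(n, -1);
    int cluster = 0;
    for (int p = 0; p < n; p++) {
        if (label[p] != -1) continue;
        // step 1: is p a core object? (size of its eps-neighborhood, incl. p)
        vector<int> nbr;
        for (int q = 0; q < n; q++)
            if (fabs(D[p]-D[q]) <= eps) nbr.push_back(q);
        if ((int)nbr.size() < minPts) continue;     // not a core object
        // step 2: grow a cluster by connecting density-reachable points
        label[p] = cluster;
        queue<int> todo;
        for (size_t t = 0; t < nbr.size(); t++)
            if (label[nbr[t]] == -1) { label[nbr[t]] = cluster; todo.push(nbr[t]); }
        // step 3: stop when no new point can be added to the cluster
        while (!todo.empty()) {
            int x = todo.front(); todo.pop();
            vector<int> nb2;
            for (int q = 0; q < n; q++)
                if (fabs(D[x]-D[q]) <= eps) nb2.push_back(q);
            if ((int)nb2.size() < minPts) continue; // border point: don't expand
            for (size_t t = 0; t < nb2.size(); t++)
                if (label[nb2[t]] == -1) { label[nb2[t]] = cluster; todo.push(nb2[t]); }
        }
        cluster++;
    }
    return label;
}
```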

4. Grid-based Methods
A grid-based method quantizes $D$ into a finite number of cells forming a grid structure, where the quantization process is usually performed in a multi-resolution style. STING is a representative grid-based method, which divides the data space into a number of rectangular cells. Each cell stores statistical information about the instances falling into it. There are several levels of rectangular cells, each corresponding to a different level of resolution. Each cell at a higher level is partitioned into a number of cells at the next lower level, and the statistical information of a high-level cell can be easily inferred from its lower-level cells with simple operations such as elementary algebraic calculations.

5. Model-based Methods
It assumes a mathematical model characterizing the properties of $D$, where the clusters are formed to optimize the fit between the data and the underlying model. The most famous model-based method is GMM-based clustering, which works by utilizing the Gaussian Mixture Model (GMM)
$p(x|\theta) = \sum_{j = 1}^k \alpha_j\mathcal{N}(x|\mu_j, \Sigma_j)$,
where $\theta = \{\alpha_j, \mu_j, \Sigma_j \mid j = 1, \cdots, k\}$ and $\sum_{j=1}^k \alpha_j = 1$.
The cluster label $\lambda_i$ for each $\bf{x_i} \in D$ is specified according to the rule
$\lambda_i =\text{argmax}_{1\le l \le k} \frac{\alpha_l\mathcal{N}(x_i|\mu_l, \Sigma_l)}{\sum_{j=1}^k\alpha_j\mathcal{N}(x_i|\mu_j, \Sigma_j)}$.
$\theta$ is learned from $D$ by employing the popular EM procedure to maximize the log-likelihood function in an iterative manner.
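The EM fit can be sketched as below for the 1-D case with $k = 2$, where $\Sigma_j$ reduces to a scalar variance $\sigma_j^2$. This is a minimal sketch: the names `GMM2`/`fitGMM2` are mine, the initial means are assumed given, and a tiny variance floor is added to avoid component collapse.

```cpp
#include <cmath>
#include <vector>
using namespace std;

struct GMM2 { double alpha[2], mu[2], var[2]; };  // theta for k = 2

static double gauss(double x, double mu, double var) {
    const double PI = 3.141592653589793;
    return exp(-(x-mu)*(x-mu) / (2*var)) / sqrt(2*PI*var);
}

// Fit a 2-component 1-D GMM by EM, starting from given means.
GMM2 fitGMM2(const vector<double> &D, double mu0, double mu1, int iters) {
    GMM2 g = {{0.5, 0.5}, {mu0, mu1}, {1.0, 1.0}};
    int n = D.size();
    vector<double> w(n);                   // posterior of component 0
    for (int it = 0; it < iters; it++) {
        // E step: responsibility of component 0 for each x_i
        for (int i = 0; i < n; i++) {
            double p0 = g.alpha[0] * gauss(D[i], g.mu[0], g.var[0]);
            double p1 = g.alpha[1] * gauss(D[i], g.mu[1], g.var[1]);
            w[i] = p0 / (p0 + p1);
        }
        // M step: re-estimate alpha_j, mu_j, sigma_j^2
        for (int j = 0; j < 2; j++) {
            double W = 0, M = 0, V = 0;
            for (int i = 0; i < n; i++) {
                double wi = (j == 0) ? w[i] : 1 - w[i];
                W += wi; M += wi * D[i];
            }
            g.mu[j] = M / W;
            for (int i = 0; i < n; i++) {
                double wi = (j == 0) ? w[i] : 1 - w[i];
                V += wi * (D[i]-g.mu[j]) * (D[i]-g.mu[j]);
            }
            g.var[j] = V / W + 1e-6;       // floor avoids variance collapse
            g.alpha[j] = W / n;
        }
    }
    return g;
}
```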

## Logistic Regression - Mathematics Related

Reference: http://www.cnblogs.com/sparkwen/p/3441197.html

The logistic (sigmoid) function: $h(z) = \frac{1}{1+e^{-z}}$

Let $\bf{x}=\{x_1, x_2, \cdots, x_n\}$ be the sample data and $\bf{y}=\{y_1, y_2, \cdots, y_n\}$ the classification label of each sample. For a binary classification problem, the logistic regression model can be constructed as follows.

$P\{y=1|x,\theta\} = \frac{1}{1+e^{-\theta^Tx}}$

$P\{y=0|x,\theta\} = \frac{1}{1+e^{\theta^Tx}} = 1-P\{y=1|x,\theta\}$

Let $h_{\theta}(x) = \frac{1}{1+e^{-\theta^Tx}}$; then for a single sample, the posterior probability is
$p(y|x,\theta) = h_{\theta}(x)^y(1-h_{\theta}(x))^{1-y}$,

and the likelihood over all $n$ samples is
$L(\theta|x,y) = \prod_{i=1}^n h_{\theta}(x_i)^{y_i}(1-h_{\theta}(x_i))^{1-y_i}$

$l(\theta) = \log L(\theta|x,y) = \sum_{i=1}^n y_i\log h_{\theta}(x_i) + (1-y_i)\log(1-h_{\theta}(x_i))$

$\frac{\partial}{\partial\theta}l(\theta) = \sum_i\bigg( \frac{y_i}{h_{\theta}(x_i)} - \frac{1-y_i}{1-h_{\theta}(x_i)}\bigg)\frac{\partial}{\partial\theta}\bigg(\frac{1}{1+e^{-\theta^Tx_i}}\bigg)$
$\qquad = \sum_i\bigg( \frac{y_i}{h_{\theta}(x_i)} - \frac{1-y_i}{1-h_{\theta}(x_i)}\bigg)\frac{e^{-\theta^Tx_i}}{(1+e^{-\theta^Tx_i})^2}x_i$
$\qquad = \sum_i\bigg( \frac{y_i}{h_{\theta}(x_i)} - \frac{1-y_i}{1-h_{\theta}(x_i)}\bigg) h_{\theta}(x_i)(1-h_{\theta}(x_i))x_i$
$\qquad = \sum_i(y_i-h_{\theta}(x_i))x_i$
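The final gradient $\sum_i(y_i-h_{\theta}(x_i))x_i$ can be used directly for gradient ascent on the log-likelihood. A minimal sketch with one feature plus an intercept (the function name and parameters `lr`, `iters` are illustrative assumptions):

```cpp
#include <cmath>
#include <vector>
using namespace std;

// Fit theta = (t0, t1) for h(x) = sigmoid(t0 + t1*x) by gradient ascent,
// using the gradient sum_i (y_i - h(x_i)) x_i derived above.
void logisticFit(const vector<double> &x, const vector<int> &y,
                 double &t0, double &t1, double lr, int iters) {
    for (int it = 0; it < iters; it++) {
        double g0 = 0, g1 = 0;
        for (size_t i = 0; i < x.size(); i++) {
            double h = 1.0 / (1.0 + exp(-(t0 + t1*x[i])));  // h_theta(x_i)
            g0 += (y[i] - h);            // gradient w.r.t. the intercept
            g1 += (y[i] - h) * x[i];     // gradient w.r.t. t1
        }
        t0 += lr * g0;                   // ascent: move along the gradient
        t1 += lr * g1;
    }
}
```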

## Quick Start of Gnuplot

A quick-start guide to using Gnuplot.

Step 1: Prepare the data files

speedup_1:
4 4
8 8.44
16 9.65
32 13.6
64 26.2
speedup_2:
4 4
8 5.909417545
16 7.055328586
32 7.102585194
64 7.373120673
(the other 2 data files are omitted)

Step 2: vi Speedup.gnu, and write the following into the file:
plot "speedup_1" t "Mesh Refinement" with linespoint lw 2, "speedup_2" t "Modify Region Marks" with linespoint lw 2, "speedup_3" t "Overall" with linespoint lw 2, "speedup_4" t "Ideal" with lines ls 0
set xlabel "Number of Processors" font "Times-Roman,20"
set ylabel "Speed Up" font "Times-Roman,20"
set xrange [0:64]
set xtics 16
set yrange [0:64]
set ytics 16
#set size 0.65,0.65

set key box
set key left top

set terminal png
set out 'speedup.png'
#set terminal postscript eps
#set terminal postscript color
#set terminal postscript solid
#set output 'speedup.eps'
replot

Step 3: gnuplot Speedup.gnu

plot "speedup_1" t "Mesh Refinement" with linespoint lw 2, "speedup_2" t "Modify Region Marks" with linespoint lw 2, "speedup_3" t "Overall" with linespoint lw 2, "speedup_4" t "Ideal" with lines ls 0
# This line reads in the data files and names them. 'linespoint' draws points connected by lines, 'lines' draws lines only; 'lw' is the line width and 'ls' the line style (color). I found that 'ls 6' gives a yellow solid line and 'ls 0' a black dashed line.

set xlabel "Number of Processors" font "Times-Roman,20"
set ylabel "Speed Up" font "Times-Roman,20"
# Set the x and y axis labels; the part after 'font' specifies the font.

set xrange [0:64]
set xtics 16
set yrange [0:64]
set ytics 16
# Set the x and y axis ranges and tick spacing.

set size 0.65,0.65
# Set the plot size relative to the default. For png output this line can be commented out; for eps it is better to keep it, since it makes the fonts relatively larger.

set key box
# Draw a box around the legend.
set key left top
# Put the legend at the top left; 'left bottom' would be the bottom left.

set terminal png
set out 'speedup.png'
# Output in png format.

#set terminal postscript eps
#set terminal postscript color
#set terminal postscript solid
#set output 'speedup.eps'
# The lines above produce eps output on Linux; on Windows, remove 'postscript' (untested, you can try).

replot
# Draw the plot.

## Job Search Summary

1. Companies that went silent after the written test:

2. Companies I could not afford to wait for:

JP MORGAN -- I applied for a QUANT position, mostly just to join the fun; I had heard about their "global pay", though it is actually not that impressive. I had one phone interview in October, with questions on ordinary differential equations and C++ object-oriented topics; it felt like there are a few fixed directions, and they ask about whatever you say you know. I think the emphasis of their programming questions differs from the big IT companies: they apparently do not ask algorithm problems. Someone explained to me that they care more about whether your code is written cleanly than about creativity. Other people I know also had phone interviews, and none has received an onsite invitation yet; presumably they will schedule onsite interviews together after all the phone screens finish. I felt the interviewer had a good impression of me, but even if I passed the phone screen, I would probably decline the onsite interview anyway.

3. Companies with verbal offers:

4. Companies that did not interview me but deserve thanks:

Dora always encourages me; sometimes I really dislike hearing him say this will be fine and that will work out, because it hurts more when things fall through. There may be more days of job hunting ahead, and this sense of insecurity is always with me. Sometimes I think it would be nice to do research or teach in a stable environment, but if it really came to that, I probably could not stand it. As a bundle of contradictions, I can only take it one step at a time; the future is completely out of control.

## Notes 1: Active Learning

1. supervised learning / semi-supervised learning / active learning
supervised learning: the process of tuning a classifier's parameters with a set of labeled samples until it reaches the required performance; the given samples carry label information. Common supervised learning methods include decision trees and neural networks.
semi-supervised VS. active learning:

active learning chooses which samples should be labeled; it is an active learner.

2. Current active learning algorithms can be grouped into many categories by their strategy for selecting unlabeled data; this note lists two of them:
(1) committee-based methods
(2) certainty-based methods

Committee-based:
(1) For each sample e in a batch B of n unlabeled samples, label it with the k models previously trained on the labeled data set, obtaining k classification results {L1, L2, ...., Lk}; from these, measure for each e the disagreement De among the predicted labels.
(2) Select the m unlabeled samples with the largest De and hand them to human annotators.
(3) Add the newly labeled samples back into the training set and retrain; this again yields the m unlabeled samples with the largest De. Repeat until the required amount of data is reached.

Certainty-based:
(1) Train a model from the previously labeled samples.
(2) Label each of the N unlabeled samples with the model and evaluate the confidence of each prediction.
(3) Hand the m lowest-confidence samples to human annotators.
(4) Add the new samples, retrain, and repeat until the required amount of data is reached.
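Step (3) of the certainty-based strategy can be sketched as below: given each sample's model confidence, pick the m least confident. The function name and signature are illustrative assumptions.

```cpp
#include <algorithm>
#include <vector>
using namespace std;

// Return the indices of the m samples with the lowest model confidence;
// these are the samples to hand to human annotators.
vector<int> pickLeastConfident(const vector<double> &confidence, int m) {
    vector<int> idx(confidence.size());
    for (size_t i = 0; i < idx.size(); i++) idx[i] = (int)i;
    // Order indices so the m least confident samples come first, sorted.
    partial_sort(idx.begin(), idx.begin() + m, idx.end(),
                 [&](int a, int b){ return confidence[a] < confidence[b]; });
    idx.resize(m);
    return idx;
}
```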

## Usage of priority_queue

Example: merging k sorted linked lists (a classic LeetCode problem) using a priority_queue as a min-heap.

/**
* struct ListNode {
*     int val;
*     ListNode *next;
*     ListNode(int x) : val(x), next(NULL) {}
* };
*/

// Each heap entry stores a node's value and which list it came from.
struct data {
    int value;
    int index;
};

// Comparator for a min-heap: returns true when a should sit below b,
// i.e. when a.value is larger (ties broken by list index).
struct compare {
    bool operator()(const data &a, const data &b) {
        if (a.value != b.value) return a.value > b.value;
        return a.index > b.index;
    }
};

class Solution {
public:
    ListNode *mergeKLists(vector<ListNode *> &lists) {
        vector<ListNode *> tmpList = lists;
        priority_queue<data, vector<data>, compare> datas;
        ListNode dummy(0);            // dummy head simplifies list building
        ListNode *res = &dummy;
        data tmp;
        // Seed the heap with the head of each non-empty list.
        for (size_t i = 0; i < lists.size(); i++) {
            if (!tmpList[i]) continue;
            tmp.value = tmpList[i]->val;
            tmp.index = (int)i;
            datas.push(tmp);
        }
        // Repeatedly take the smallest head and advance that list.
        while (!datas.empty()) {
            data minV = datas.top();
            datas.pop();
            res->next = tmpList[minV.index];
            res = res->next;
            tmpList[minV.index] = tmpList[minV.index]->next;
            if (tmpList[minV.index]) {
                tmp.value = tmpList[minV.index]->val;
                tmp.index = minV.index;
                datas.push(tmp);
            }
        }
        return dummy.next;
    }
};


## InterviewStreet: Triplets

Source of the problem: https://www.hackerrank.com/challenges/triplets

My submission got a score of 32.22 and TLE on the other test cases. The complexity of my solution is O(N^2); it should be optimized to O(NlogN), but I don't know how.

General idea of my submission:
1) Record the first and last position of each integer.
2) Maintain two arrays, of the same length as the input, updated sequentially:
---- Array1[i]: how many smaller integers satisfy input[j] < input[i] and j < i;
---- Array2[i]: how many triplets there are if input[i] is the third number of the triplet. To update Array2[i], sum Array1[j] over all j with input[j] < input[i] and j < i. Repeated integers need special care.
3) Sum all Array2[i], again taking care of repeated integers.

P.S.: You have to use 'long long int' for the output, or else you will get WA. When N = 100000, the maximum result can be $C_{100000}^{3}$, which is beyond 'unsigned int'.

Codes as follows:

#include &lt;cstdio&gt;
#include &lt;vector&gt;
#include &lt;map&gt;
using namespace std;

typedef struct {
    unsigned int value;
    long long int two;     // Array1[i]: count of valid smaller predecessors
    long long int three;   // Array2[i]: count of triplets ending at i
} data;

typedef struct {
    int pre;               // first position of this value
    int post;              // last position of this value
} position;

map&lt;unsigned int, position&gt; dataPos;

long long int getResNum(vector&lt;data&gt; dataVec) {
    int i, j;
    for (i = 1; i &lt; (int)dataVec.size(); i++) {
        long long int count1 = 0;
        long long int count2 = 0;
        for (j = 0; j &lt; i; j++) {
            if (dataVec[j].value &lt; dataVec[i].value) {
                // Skip duplicate occurrences of the same value before i.
                if (dataPos[dataVec[j].value].post &lt; i &&
                    j != dataPos[dataVec[j].value].post) {
                    continue;
                } else {
                    count1++;
                    count2 += dataVec[j].two;
                }
            }
        }
        dataVec[i].two = count1;
        dataVec[i].three = count2;
    }
    long long int res = 0;
    for (i = 1; i &lt; (int)dataVec.size(); i++) {
        // Sum only at the last occurrence of each value to avoid duplicates.
        if (dataPos[dataVec[i].value].post != i) continue;
        res += dataVec[i].three;
    }
    return res;
}

int main() {
    int N, i;
    unsigned int a;
    vector&lt;data&gt; dataVec;
    while (scanf("%d", &N) == 1) {
        dataVec.clear();
        dataPos.clear();
        data tmp;
        position p;
        for (i = 0; i &lt; N; i++) {
            scanf("%u", &a);
            tmp.value = a;
            tmp.two = tmp.three = 0;
            dataVec.push_back(tmp);
            if (dataPos.find(a) == dataPos.end()) {
                p.pre = p.post = i;
                dataPos[a] = p;
            } else {
                dataPos[a].post = i;
            }
        }
        printf("%lld\n", getResNum(dataVec));
    }
    return 0;
}


## Installing rpm packages on Ubuntu

1. Make sure alien is installed; if not:
sudo apt-get install alien

2. Convert the rpm package to a deb package
sudo alien --scripts filename.rpm

3. Install the deb package
sudo dpkg -i filename.deb

## interview street challenge -- Even Tree

You are given a tree (a simple connected graph with no cycles). You have to remove as many edges from the tree as possible to obtain a forest, with the condition that each connected component of the forest contains an even number of vertices.

Your task is to calculate the number of removed edges in such a forest.

Input:
The first line of input contains two integers N and M. N is the number of vertices and M is the number of edges. 2 <= N <= 100.
The next M lines contain two integers ui and vi, each specifying an edge of the tree (1-based index).

Output:
Print a single integer which is the answer

Sample Input
10 9
2 1
3 1
4 3
5 2
6 1
7 2
8 6
9 8
10 8

Sample Output:
2

/* Enter your code here. Read input from STDIN. Print output to STDOUT */
#include &lt;stdio.h&gt;
#include &lt;vector&gt;
#include &lt;algorithm&gt;
using namespace std;

int a[101][101];                 // adjacency matrix

// DFS from 'start': marks visited vertices and accumulates subtree sizes
// into dist[], so dist[v] becomes the number of vertices in v's subtree.
void mark(int *dist, int *flag, int N, int start) {
    int i;
    flag[start] = 1;
    for (i = 1; i &lt;= N; i++) {
        if (a[start][i] == 1 && flag[i] == 0) break;
    }
    if (i &gt; N) {
        return;                  // leaf: no unvisited neighbour
    }
    for (i = 1; i &lt;= N; i++) {
        if (a[start][i] == 1 && flag[i] == 0) {
            mark(dist, flag, N, i);
            dist[start] += dist[i];
        }
    }
}

int main() {
    int N, M;
    int i, j, from, to;
    int dist[101], flag[101];
    int count;
    while (scanf("%d%d", &N, &M) == 2) {
        for (i = 1; i &lt;= N; i++) {
            for (j = 1; j &lt;= N; j++) {
                a[i][j] = 0;
            }
            flag[i] = 0;
            dist[i] = 1;         // every subtree contains at least its root
        }
        for (i = 0; i &lt; M; i++) {
            scanf("%d%d", &from, &to);
            a[from][to] = a[to][from] = 1;
        }
        mark(dist, flag, N, 1);  // root the tree at vertex 1
        // The edge above vertex i can be cut iff i's subtree size is even.
        count = 0;
        for (i = 2; i &lt;= N; i++) {
            if (dist[i] % 2 == 0) count++;
        }
        printf("%d\n", count);
    }
    return 0;
}