An Incremental Sampling Method

Sampling K points out of a total of N is trivial: we can simply draw a random number for each point to decide whether it is sampled or not. In daily work, however, the total number of points grows day by day, and sampling K points incrementally then becomes a challenge. The following algorithm is a simple and efficient incremental sampling method; a proof is given after it.

Incremental Sampling Method
- step 1: Add the first k points to the selected set \mathcal{S};
- step 2: For each arriving index n = k+1, k+2, \cdots, generate a uniform random number r \in [0,1); if r > \frac{k}{n}, skip the point; otherwise replace a point of \mathcal{S}, chosen uniformly at random, with it. (A code sketch follows the steps.)
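
As a concrete illustration, here is a minimal C++ sketch of the method (my own toy code, not from the original text; the class name ReservoirSampler is mine). It merges the two parts of step 2 into a single index draw: picking j uniformly from {0, ..., n-1} and replacing slot j when j < k is equivalent to accepting with probability k/n and then replacing a uniformly chosen member of \mathcal{S}.

#include <cstddef>
#include <random>
#include <vector>

// Minimal sketch of the incremental sampling method above.
template <typename T>
class ReservoirSampler {
public:
    explicit ReservoirSampler(std::size_t k)
        : k_(k), n_(0), rng_(std::random_device{}()) {}

    // Feed the next point of the stream (its index n is tracked internally).
    void add(const T& point) {
        ++n_;
        if (sample_.size() < k_) {
            sample_.push_back(point);        // step 1: keep the first k points
            return;
        }
        std::uniform_int_distribution<std::size_t> pick(0, n_ - 1);
        std::size_t j = pick(rng_);
        if (j < k_) sample_[j] = point;      // step 2: replace with probability k/n
    }

    const std::vector<T>& sample() const { return sample_; }

private:
    std::size_t k_, n_;
    std::mt19937 rng_;
    std::vector<T> sample_;
};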

Here is the proof that, for any N > k, each point is selected with probability \frac{k}{N}:

The indices start from 1, and the points can be divided into two parts at the selected number k:
1. For index 1\le i \le k, the probability that the i-th point is still in \mathcal{S} at the end is the probability that it survives every later round, which telescopes:
p_i = \left(1-\frac{k}{k+1}\cdot \frac{1}{k}\right)\cdot\left(1-\frac{k}{k+2}\cdot \frac{1}{k}\right)\cdots\left(1-\frac{k}{N}\cdot \frac{1}{k}\right) = \frac{k}{k+1}\cdot\frac{k+1}{k+2}\cdots\frac{N-1}{N} = \frac{k}{N}.

2. For index i > k, the probability that it is accepted on arrival and then survives every later round is
p_i = \frac{k}{i}\cdot\left(1-\frac{k}{i+1}\cdot\frac{1}{k}\right)\cdots\left(1-\frac{k}{N}\cdot \frac{1}{k}\right) = \frac{k}{i}\cdot\frac{i}{i+1}\cdots\frac{N-1}{N} = \frac{k}{N}.

This method, known as reservoir sampling, can of course also be used for one-pass sampling while loading large-scale data.


Clustering Methods Review

Reference: Zhou Z H. Ensemble methods: foundations and algorithms[M]. CRC Press, 2012.

Clustering aims to find the inherent structure of unlabeled data by grouping it into clusters of objects. A satisfactory clustering method should increase intra-cluster similarity while decreasing inter-cluster similarity. Define the sample set D = \{\bf{x}_1, \bf{x}_2, \cdots, \bf{x}_m\}; then the clusters can be defined as \{C_j \mid j = 1, \cdots, k\} with \cup_{j=1}^k C_j = D and C_i\cap C_j = \varnothing for i \neq j.

Methods
1. Partitioning Methods
A partitioning method organizes D into k partitions by optimizing an objective partitioning criterion. \bf{\text{K-Means}} is one of the best-known partitioning methods. It minimizes the error function \text{err} = \sum_{j=1}^k\sum_{x\in C_j} \text{dist}(x, \hat{x}_j)^2, where \hat{x}_j = \frac{1}{|C_j|}\sum_{x \in C_j} x.

Since finding the exact optimum of this error function is computationally intractable, k-means adopts an iterative relocation technique to find a desired solution heuristically (a toy implementation follows the steps):
- step 1: select k instances from D as initial centers;
- step 2: assign each instance to its nearest center;
- step 3: update each center \hat{x}_j as the mean of its cluster;
- step 4: repeat steps 2-3 until convergence.
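
For illustration, here is a minimal one-dimensional k-means sketch in C++ (my own toy code, assuming Euclidean distance on scalars and the first k instances as initial centers; the naming is mine):

#include <cmath>
#include <cstddef>
#include <vector>

// Toy 1-D k-means: returns the k cluster centers (assumes D.size() >= k).
std::vector<double> kmeans(const std::vector<double>& D, std::size_t k,
                           int maxIter = 100) {
    std::vector<double> centers(D.begin(), D.begin() + k);  // step 1: initial centers
    std::vector<std::size_t> assign(D.size(), 0);
    for (int it = 0; it < maxIter; ++it) {
        bool changed = false;
        // step 2: assign each instance to its nearest center
        for (std::size_t i = 0; i < D.size(); ++i) {
            std::size_t best = 0;
            for (std::size_t j = 1; j < k; ++j)
                if (std::fabs(D[i] - centers[j]) < std::fabs(D[i] - centers[best]))
                    best = j;
            if (best != assign[i]) { assign[i] = best; changed = true; }
        }
        if (!changed) break;                 // step 4: stop at convergence
        // step 3: move each center to the mean of its cluster
        for (std::size_t j = 0; j < k; ++j) {
            double sum = 0; std::size_t cnt = 0;
            for (std::size_t i = 0; i < D.size(); ++i)
                if (assign[i] == j) { sum += D[i]; ++cnt; }
            if (cnt) centers[j] = sum / cnt;
        }
    }
    return centers;
}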

[Figure: an illustration of how k-means works.]

Another well-known clustering algorithm is the \bf{\text{Affinity Propagation (AP)}} method, which introduces "message passing" between data points. AP takes as input a collection of real-valued similarities between data points, where the similarity s(i,k) indicates how well the data point with index k is suited to be the class center for data point i. When the goal is to minimize the squared error, each similarity is set to a negative squared Euclidean distance: for points \bf{x}_i and \bf{x}_k, s(i,k) = -\|\bf{x}_i-\bf{x}_k\|^2.
Rather than requiring that the number of clusters be prespecified, AP takes as input a real number s(k,k) for each data point k, such that data points with larger values of s(k,k) are more likely to be chosen as class centers. These values are referred to as 'preferences', and the chosen centers are called 'exemplars'. There are two kinds of messages exchanged between data points, namely 'responsibility' and 'availability'.
- Responsibility: matrix R, where r(i,k) denotes how well-suited \bf{x}_k is to serve as the exemplar for point i, taking into account other potential exemplars for point i.
- Availability: matrix A, where a(i,k) denotes how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points for k being an exemplar. Messages need only be exchanged between pairs of points with known similarities. The update rules are given below.
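
For reference, the message updates from Frey and Dueck's original AP paper (Science, 2007; not spelled out in the text above) are
r(i,k) \leftarrow s(i,k) - \max_{k' \neq k}\{a(i,k') + s(i,k')\},
a(i,k) \leftarrow \min\Big\{0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\}\Big\} \text{ for } i \neq k,
a(k,k) \leftarrow \sum_{i' \neq k} \max\{0, r(i',k)\},
and after convergence each point i takes \text{argmax}_k \{a(i,k) + r(i,k)\} as its exemplar; in practice the messages are damped to avoid oscillations.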



The advantages of AP include stable convergence and independence from both a prior number of clusters and the choice of initial points. However, due to its \mathcal{O}(N^2\log N) time complexity, trade-offs are required on large-scale data.

2. Hierarchical Methods
A hierarchical method creates a hierarchy of clusterings of D at various levels of granularity, where a specific clustering can be obtained by thresholding the hierarchy at a specific level. A typical method is the SAHN (sequential, agglomerative, hierarchic, non-overlapping) method:
- step 1: Each data point is placed into a cluster of its own, and the distance matrix D_{m\times m} has components D(i,j) = \text{dist}(x_i, x_j);
- step 2: The two closest clusters C_i and C_j are identified from D and replaced by the agglomerated cluster C_h; the matrix is then updated via D(h,k) = \alpha_i D(i,k) + \alpha_j D(j,k) + \beta D(i,j) + \gamma |D(i,k)-D(j,k)| (the Lance-Williams update), where all coefficients are given before the calculation.
[Figure: an illustration of how SAHN works.]

3. Density-based Methods
A density-based method constructs clusters on D based on the notion of density, where regions of instances with high density are regarded as clusters, separated by regions of low density. DBSCAN is a representative density-based clustering method, which characterizes the density of the data space with a pair of parameters (\epsilon, MinPts). This pair of parameters determines a unique \epsilon-neighborhood for any given point \bf{x}.
- step 1: DBSCAN identifies core objects, i.e. instances that satisfy the requirement imposed by the (\epsilon, MinPts) parameters (the precise definitions are given after the steps).
- step 2: It forms clusters by iteratively connecting the directly density-reachable instances, starting from those core objects.
- step 3: The connecting process terminates when no new data point can be added to any cluster.
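
For concreteness (these are the standard DBSCAN definitions, not stated explicitly above): \bf{x} is a core object iff |N_\epsilon(\bf{x})| \ge MinPts, where N_\epsilon(\bf{x}) = \{\bf{x}' \in D \mid \text{dist}(\bf{x}, \bf{x}') \le \epsilon\} is its \epsilon-neighborhood, and a point \bf{x}' is directly density-reachable from \bf{x} iff \bf{x} is a core object and \bf{x}' \in N_\epsilon(\bf{x}).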

4. Grid-based Methods
A grid-based method quantizes D into a finite number of cells forming a grid structure, where the quantization process is usually performed in a multi-resolution style. STING is a representative grid-based method, which divides the data space into a number of rectangular cells. Each cell stores statistical information about the instances falling into it. There are several levels of rectangular cells, each corresponding to a different level of resolution. Each cell at a higher level is partitioned into a number of cells at the next lower level, and the statistical information of a high-level cell can easily be inferred from its lower-level cells with simple operations such as elementary algebraic calculations.

5. Model-based Methods
A model-based method assumes a mathematical model characterizing the properties of D, where the clusters are formed to optimize the fit between the data and the underlying model. The most famous model-based method is GMM-based clustering, which works by utilizing the Gaussian Mixture Model (GMM)
p(x \mid \theta) = \sum_{j = 1}^k \alpha_j\mathcal{N}(x|\mu_j, \Sigma_j),
where \theta = \{(\alpha_j, \mu_j, \Sigma_j) \mid j = 1, \cdots, k\} and \sum_{j=1}^k \alpha_j = 1.
The cluster label \lambda_i for each \bf{x}_i \in D is specified according to the rule
\lambda_i =\text{argmax}_{1\le l \le k} \frac{\alpha_l\mathcal{N}(x_i|\mu_l, \Sigma_l)}{\sum_{j=1}^k\alpha_j\mathcal{N}(x_i|\mu_j, \Sigma_j)}.
\theta is learned from D by employing the popular EM procedure to maximize the log-likelihood in an iterative manner.
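
For reference, one EM iteration consists of the standard updates (not spelled out in the text above). The E-step computes the responsibilities
\gamma_{ij} = \frac{\alpha_j\mathcal{N}(x_i|\mu_j, \Sigma_j)}{\sum_{l=1}^k\alpha_l\mathcal{N}(x_i|\mu_l, \Sigma_l)},
and the M-step re-estimates the parameters
\alpha_j = \frac{1}{m}\sum_{i=1}^m\gamma_{ij}, \quad \mu_j = \frac{\sum_{i=1}^m\gamma_{ij}x_i}{\sum_{i=1}^m\gamma_{ij}}, \quad \Sigma_j = \frac{\sum_{i=1}^m\gamma_{ij}(x_i-\mu_j)(x_i-\mu_j)^T}{\sum_{i=1}^m\gamma_{ij}}.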


Logistic Regression - Mathematics Related

Reference: http://www.cnblogs.com/sparkwen/p/3441197.html

The logistic regression (LR) model simply applies a logistic function on top of linear regression. The motivation is that linear regression is equally sensitive over the whole real line, while for classification the output needs to lie in [0,1]. Logistic regression is a regression model that narrows the prediction range, limiting predicted values to [0,1]; its regression equation is
h(z) = \frac{1}{1+e^{-z}},
with the curve shown in the figure below. The logistic curve is very sensitive near z=0 and insensitive for z >> 0 or z << 0, which limits the predicted values to (0,1).

Let \bf{x}=\{x_1, x_2, \cdots, x_n\} be the sample data and \bf{y}=\{y_1, y_2, \cdots, y_n\} the classification results of the samples. For a binary classification problem, we can build the logistic regression model as follows.
First, the probability that a sample is classified as 1 can be written as
P\{y=1|x,\theta\} = \frac{1}{1+e^{-\theta^Tx}},
and the probability that it is classified as 0 is then
P\{y=0|x,\theta\} = \frac{1}{1+e^{\theta^Tx}} = 1-P\{y=1|x,\theta\}.

Let h_{\theta}(x) = \frac{1}{1+e^{-\theta^Tx}}; then for a single sample the posterior probability is
p(y|x,\theta) = h_{\theta}(x)^y(1-h_{\theta}(x))^{1-y},
where y = 1 or 0.

The likelihood of all samples is then
L(\theta|x,y) = \prod_{i=1}^n h_{\theta}(x_i)^{y_i}(1-h_{\theta}(x_i))^{1-y_i}.

To fit \theta we maximize this likelihood; it is equivalent and more convenient to maximize its logarithm (each factor is less than 1, so each log term is negative, and the product becomes a sum):
l(\theta) = \text{log}L(\theta|x,y).

Taking the partial derivative of l(\theta) with respect to \theta gives the gradient direction, so we can use gradient ascent as the iterative strategy to solve \theta^* = \text{arg}\max_{\theta}(l(\theta)).

\frac{\partial}{\partial\theta}l(\theta) = \frac{\partial}{\partial\theta}\sum_{i=1}^n \Big(y_i\text{log}h_{\theta}(x_i) + (1-y_i)\text{log}(1-h_{\theta}(x_i))\Big)
\qquad = \sum_i\bigg( \frac{y_i}{h_{\theta}(x_i)} - \frac{1-y_i}{1-h_{\theta}(x_i)}\bigg)\frac{\partial}{\partial\theta}\bigg(\frac{1}{1+e^{-\theta^Tx_i}}\bigg)
\qquad = \sum_i\bigg( \frac{y_i}{h_{\theta}(x_i)} - \frac{1-y_i}{1-h_{\theta}(x_i)}\bigg)\frac{e^{-\theta^Tx_i}}{(1+e^{-\theta^Tx_i})^2}x_i
\qquad = \sum_i\bigg( \frac{y_i}{h_{\theta}(x_i)} - \frac{1-y_i}{1-h_{\theta}(x_i)}\bigg) h_{\theta}(x_i)(1-h_{\theta}(x_i))x_i
\qquad = \sum_i(y_i-h_{\theta}(x_i))x_i

So the iteration can be written as: \theta := \theta + \alpha \sum_i(y_i-h_{\theta}(x_i))x_{i}.
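
As a toy illustration, here is a minimal C++ sketch of this batch gradient ascent (my own code, not from the referenced post; the fixed learning rate alpha, iteration count, and function names are assumptions):

#include <cmath>
#include <cstddef>
#include <vector>

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Batch gradient ascent on the log-likelihood:
// theta := theta + alpha * sum_i (y_i - h_theta(x_i)) * x_i
std::vector<double> fitLR(const std::vector<std::vector<double> >& X,
                          const std::vector<int>& y,
                          double alpha = 0.01, int iters = 1000) {
    std::size_t d = X[0].size();
    std::vector<double> theta(d, 0.0);
    for (int it = 0; it < iters; ++it) {
        std::vector<double> grad(d, 0.0);
        for (std::size_t i = 0; i < X.size(); ++i) {
            double z = 0.0;
            for (std::size_t j = 0; j < d; ++j) z += theta[j] * X[i][j];
            double err = y[i] - sigmoid(z);          // y_i - h_theta(x_i)
            for (std::size_t j = 0; j < d; ++j) grad[j] += err * X[i][j];
        }
        for (std::size_t j = 0; j < d; ++j) theta[j] += alpha * grad[j];
    }
    return theta;
}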


Quick Start of Gnuplot

A quick-start guide to Gnuplot

Among open-source software on Linux, Gnuplot is probably the best for plotting. Octave can run MATLAB's .m files to draw figures, but some syntax is incompatible and the resulting figures are ugly, as are those from scilab. Of the three, scilab comes with a GUI, while octave and gnuplot both plot from the command line.

Below is a figure drawn by gnuplot; after the figure I give the commands and the data used to draw it.

Step 1: Prepare the data
Prepare four text files named "speedup_1", "speedup_2", "speedup_3", "speedup_4", whose contents are:
speedup_1:
4 4
8 8.44
16 9.65
32 13.6
64 26.2
speedup_2:
4 4
8 5.909417545
16 7.055328586
32 7.102585194
64 7.373120673
(the other 2 files are omitted)

Step 2: vi Speedup.gnu and write the following into the file:
plot "speedup_1" t "Mesh Refinement" with linespoint lw 2, "speedup_2" t "Modify Region Marks" with linespoint lw 2, "speedup_3" t "Overall" with linespoint lw 2, "speedup_4" t "Ideal" with lines ls 0
set xlabel "Number of Processors" font "Times-Roman,20"
set ylabel "Speed Up" font "Times-Roman,20"
set xrange [0:64]
set xtics 16
set yrange [0:64]
set ytics 16
#set size 0.65,0.65

set key box
set key left top

set terminal png
set out 'speedup.png'
#set terminal postscript eps
#set terminal postscript color
#set terminal postscript solid
#set output 'speedup.eps'
replot

Step 3: gnuplot Speedup.gnu
Afterwards you will find speedup.png in the current directory.
If you comment out set terminal png and the .png line below it with #, and remove the # in front of the set terminal postscript * lines, you will get speedup.eps instead, a format more commonly used when inserting figures into LaTeX.

A more detailed explanation of Speedup.gnu follows:
plot "speedup_1" t "Mesh Refinement" with linespoint lw 2, "speedup_2" t "Modify Region Marks" with linespoint lw 2, "speedup_3" t "Overall" with linespoint lw 2, "speedup_4" t "Ideal" with lines ls 0
# This line reads in and names each data file; linespoint draws points connected by lines, lines draws lines only, lw sets the line width, and ls sets the line style/color; I found ls 6 gives a solid yellow line and ls 0 a dashed black line.

set xlabel "Number of Processors" font "Times-Roman,20"
set ylabel "Speed Up" font "Times-Roman,20"
# Set the x and y axis labels; the part after font sets the typeface

set xrange [0:64]
set xtics 16
set yrange [0:64]
set ytics 16
# Set the x and y axis ranges and tick spacing

set size 0.65,0.65
# Set the figure size relative to the default; for png output this line can be commented out, while for eps it is better to keep it, since it makes the fonts relatively larger

set key box
# Draw a box around the legend
set key left top
# Place the legend at the top left; left bottom means bottom left

set terminal png
set out 'speedup.png'
# Output in png format

#set terminal postscript eps
#set terminal postscript color
#set terminal postscript solid
#set output 'speedup.eps'
# The lines above output eps format on Linux; on Windows, drop the postscript keyword (untested, give it a try)

replot
# Draw the plot


Job Hunting: A Wrap-up

Today I decided which offer to sign, which brings my job hunt to a complete end, so here is a summary of the process.

The companies I interviewed with include:

1. No news after the written test:
Youdao -- even referrals have to take the written test; I guess my test wasn't great, and indeed I did poorly on the fundamentals.
Yahoo -- the written test was truly odd. I had been referred for a research engineer position, but at the test I was dumbfounded: I couldn't code the big problems at all. I decisively changed the position in the top-right corner of the paper to SDE (software development engineer), and to my surprise I later received an interview invitation; but the interview date was to be postponed, and I stopped waiting.

2. Couldn't afford to wait:
hulu -- I only tried because a friend said this company's bar is higher than Google's (I wanted to see how hard it was). I went through one phone screen, 3 onsite rounds, an extra added round, and then the final round. Since my background is not data mining or machine learning, HR persuaded me that research might not be a fit, so I applied for development. Development supposedly builds things on the product line, related to the website; I'm not sure exactly what. I finished the final round last Tuesday, and since I had to answer Microsoft by this Wednesday, I pressed hulu for a quick decision. The HR was very nice and seemed conscientious and responsible, but unfortunately every time I answered the phone full of hope, the message was "we can't make a decision yet; please wait a bit longer". It was the same today; I took it as a soft reject and told HR to forget it, I had to sign with Microsoft. I heard a lot of people were left pending; there should also be people who received outright rejections, but I'm not too clear on that. As for compensation, development reportedly pays less than research. Essentially no hukou (household registration) quota. And the ordinary offers don't seem to be extremely high either, definitely nothing like what Liang Bin exaggerated on weibo. P.S. Exceptionally strong people can perhaps get that number.

google -- one phone screen in October, then 2 onsite rounds. On the onsite day I was scheduled for the first round at 9:00 and the second at 9:45; my phone malfunctioned that night, the alarm never rang, I simply overslept, and rushed over just in time for the second round, then made up the first round at noon. Both interviewers were from Google US, one a foreigner and one Chinese; one asked an algorithm question and the other a design question. I don't think I answered either particularly well, and my coding was mediocre, though the chatting went fine (much good that does). Yet they still scheduled rounds 3 and 4 for me, which rather moved me, even though the feedback from the first 2 rounds was reportedly so-so. The interviews were originally set for last week but were postponed due to a sudden situation; I expect to write in the next day or two saying I've already decided where to sign and won't interview further. The more said, the more tears: because of the J-1 two-year rule I could only apply for positions within China, and with a crowd lined up for Google Beijing, my chance of a slot was remote, practically zero. Google's HR was very nice, and very conscientious and responsible.

JP MORGAN -- I applied for a QUANT position just to join the fun, having heard it's global pay, though actually it's not that staggering. I had one phone screen in October with questions on ordinary differential equations and C++ object orientation; the impression is that they have a few fixed directions and ask about whatever you say you know. I feel their coding assessment emphasizes different things than the big IT companies: apparently they don't ask algorithm puzzles. Someone's explanation was that they would rather you write standard, clean code than creative code. Other people I know also did the phone screen and none have received onsite invitations yet; presumably they will wait until all the phone screens end and then hold the onsite interviews together. My personal feeling is that the interviewer had a good impression of me, but even if I passed the phone screen, I would probably tell them I won't attend the onsite interview.

3. Verbal offers:
Huawei -- a special PhD recruitment program; I finished interviewing at the end of August. There were only 2 rounds, without very technical questions, mostly chatting about my past experience and some thoughts. How to put it: Huawei left a good impression on me; it feels like quite a good company. The interviewers were leaders around 40, all very modest in attitude. They said they especially hope to recruit the very best students from Tsinghua, Peking University, and the Chinese Academy of Sciences; in the past they couldn't offer such people an ideal package, but now the compensation and growth path of their special PhD program are very generous. (The pay is indeed quite good, not on the same level as bachelor's and master's hires.) I was assigned to the database direction, in which I have no background at all and which I don't really want to work on. To repeat: Huawei is a good company. Compared with the false prosperity of many domestic internet companies, it is after all a company that earnestly gets real work done; sometimes you feel its slogans are a bit grand, but actually delivering on them is real strength.

Alibaba -- to be precise, Alimama search advertising. My path was rather winding: first, over the summer, my BF and I went to the early batch; after 2 rounds I was interviewed by the eminent teacher ZWS, and the final-round evaluation was not good, but the final round doesn't reject people. (I later heard that those who passed the final round were qualified for the Ali-star interview.) Then intern offers were sent out, but I already had a place to intern, so that was that. Later, during campus recruiting, we were treated as interns converting to full-time and went through the final round again, this time with HR present; the interview felt fairly pleasant. On the way home afterwards I was called back by HR for an extra cross-team interview, conducted by someone from Aliyun; I had a lapse at the end, and the closing question I asked him made the atmosphere quite awkward, though luckily he didn't take it to heart. In the end I got a class-A verbal offer, but I wasn't very satisfied with the package; they felt the number was not low and told me to consider future raises. I think their reasoning is sound and going to Alibaba holds real promise, but the first-year package was far below my expectations.

Microsoft -- an intern conversion, which also dragged on until now before a formal offer took shape. I got the internship through a good friend's referral; I had missed the window for the large-scale summer intern recruitment, so I battled with head count for a long time (during which I also tried Intel's engineering institute). Only with everyone's joint effort, plus special approval from the head of STC, did I get that hard-won intern slot. I really worked very hard during the internship, because I wanted to convert, and I knew STC was hiring far fewer people this year, mostly in Suzhou. My mentor backed me strongly and gave me a lot of encouragement and support; I'm so grateful to her. Good deeds are rewarded: not long ago she was finally promoted, two levels at once. For the conversion I had the round-2 interview in late September and rounds 3 and 4 in mid-October, both conducted by principal managers. They asked algorithm questions, coding questions, and about my projects; no talk of life and ideals. In round 4 the interviewer was tough and didn't feel very friendly; I wasn't used to it at first, but the thick-skinned can always overcome their pride, and in the end we chatted decently. In any case, after the interview both my mentor and I thought I was probably done for, and I was sad for several days, only to discover that I had actually passed. I suddenly began to find the round-4 interviewer adorable: beneath his midwinter exterior lies a midsummer heart (how cheesy of me, I know). Microsoft starts PhD hires one level higher than master's hires; the package met my expectations, probably just a tiny bit below Google's PhD starting pay among IT companies, and I'm fairly satisfied.

4. Companies I didn't interview with but must thank:
Wandoujia -- the company where I interned in 2012, now 3-4 times the size it was then. A company full of warmth and taste: of all the companies that show off their snacks, lunches, and desserts, I have yet to see one that compares, and I'd even say its lavishness exceeds Google's. When my internship and job search weren't going smoothly this year, they were my strongest backing. I love my little friends there, and I always thought that if Microsoft didn't convert me, I would most likely go back to Wandoujia. (Not because I think Microsoft is better than Wandoujia, but for other reasons I won't spell out.)

Standard Chartered -- my BF, Laishen, and I once took part in a Standard Chartered competition and did quite well, after which one of their senior managers called me, suggesting I consider a position with them. Well, writing "Standard Chartered" may not be quite accurate; it should be Standard Chartered's technology and operations arm, whose organizational structure I really can't untangle. Whether or not I agree with some of his views on the industry, I'm grateful that he took precious weekend rest time to call and chat with me. But I preferred to stay in Beijing and didn't want to go to Tanggu, Tianjin, so I didn't consider it.

Finally, one more referral that went nowhere: NetEase Games data mining researcher. Not a single word back; but I heard the development candidates all got replies, so presumably this year they had already found suitable people for this direction in advance.

Having said all that, thinking it over carefully, I didn't interview with that many companies. Job hunting is an especially exhausting process; to the brothers and sisters who haven't got an offer yet, keep at it, big offers are waiting for you down the road.

Dora is always encouraging me, and sometimes I really don't like hearing him say this will be no problem for me and that will work out too, because it feels worse when things don't work out. There may still be days of job hunting ahead; this sense of instability is always with me. Sometimes I think it would be nice to do research or teach in a stable environment, but if it really came to that, I couldn't stand it either. Being a bundle of contradictions, I can only take one step and see the next; the future is completely out of control.


Notes 1: Active Learning

1. supervised learning / semi-supervised learning / active learning
supervised learning: the process of using a set of known samples to tune a classifier's parameters until it reaches the required performance; the given samples carry class labels. Common supervised learning methods include decision trees, neural networks, etc.
semi-supervised VS. active learning:
In common: both try to get good performance without demanding too many labeled samples.
Different: semi-supervised learning makes use of the unlabeled data itself,
while active learning chooses which samples should be labeled -- it is an active learner.

2. Current active learning algorithms can be divided into many classes according to their strategy for selecting unlabeled data; this post lists two of them:
(1) committee-based methods
(2) certainty-based methods

Workflow of the committee-based approach:
(1) For each sample e in a group B of n unlabeled samples, label it with k models previously trained on the labeled data, obtaining k classification results {L1, L2, ..., Lk}; from these, measure for each e the degree of disagreement De among its labels.
(2) Select the m unlabeled samples with the largest De and hand them to annotators for labeling.
(3) Add the freshly labeled samples back into the training set, retrain the models, obtain the next m unlabeled samples with the largest De, and repeat until the required amount of data is reached.

Workflow of the certainty-based approach (a small selection sketch follows the steps):
(1) Train a model from the previously labeled samples.
(2) Label each of the N unlabeled samples with the model and assess the confidence of each prediction.
(3) Hand the m lowest-confidence samples to annotators for labeling.
(4) Add the new samples, retrain, and repeat the above process until the required amount of data is reached.
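
As a toy illustration of steps (2)-(3), here is a least-confidence selection sketch in C++ (my own code, not from the notes; it assumes the model's predicted probabilities P(y=1|x) over the unlabeled pool have already been collected into a vector):

#include <algorithm>
#include <cstddef>
#include <vector>

// Given P(y=1|x) for each unlabeled sample, return the indices of the
// m samples the model is least confident about (probability nearest 0.5).
std::vector<std::size_t> leastConfident(const std::vector<double>& probs,
                                        std::size_t m) {
    std::vector<std::size_t> idx(probs.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    // confidence of a binary prediction = max(p, 1-p); sort ascending
    std::sort(idx.begin(), idx.end(), [&](std::size_t a, std::size_t b) {
        return std::max(probs[a], 1 - probs[a]) < std::max(probs[b], 1 - probs[b]);
    });
    if (m < idx.size()) idx.resize(m);
    return idx;
}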


How to use priority_queue

I practiced priority_queue by solving a LeetCode problem, Merge K Sorted Lists, and got familiar with this data structure's interface.
Building the initial min-heap of the K list heads costs O(KlogK), and each subsequent update of the heap top costs O(logK), so the whole merge costs O(NKlogK), where N is the average length of a list.

The code is as follows:

/**
 * Definition for singly-linked list.
 * struct ListNode {
 *     int val;
 *     ListNode *next;
 *     ListNode(int x) : val(x), next(NULL) {}
 * };
 */

 struct data{
     int value;   // node value
     int index;   // index of the list this node came from
 };

 // Comparator for the priority_queue: returning true means a has LOWER
 // priority than b, so the top of the queue is the smallest value (ties
 // broken by smaller index), i.e. the queue behaves as a min-heap.
 struct compare{
     bool operator()(data a, data b) {
         if (a.value < b.value) return false;
         else if (a.value == b.value) {
             if (a.index <= b.index) return false;
         }
         return true;
     }
 };

class Solution {
public:
    ListNode *mergeKLists(vector<ListNode *> &lists) {
        vector<ListNode *> tmpList = lists;
        priority_queue<data, vector<data>, compare> datas;
        ListNode *head = NULL;
        ListNode *res;
        int i;
        data tmp;
        // seed the heap with the head of every non-empty list
        for (i = 0; i < lists.size(); i++) {
            if (!tmpList[i]) continue;
            tmp.value = tmpList[i]->val;
            tmp.index = i;
            datas.push(tmp);
        }

        // repeatedly pop the smallest head, append it to the merged list,
        // then push that list's next node (if any)
        while (!datas.empty()) {
            data minV = datas.top();
            if (!head) {
                head = tmpList[minV.index];
                res = head;
            }else{
                res -> next = tmpList[minV.index];
                res = res->next;
            }
            datas.pop();
            tmpList[minV.index] = tmpList[minV.index]->next;
            if (tmpList[minV.index]) {
                tmp.value = tmpList[minV.index]->val;
                tmp.index = minV.index;
                datas.push(tmp);
            }
        }
        return head;
    }
};

InterviewStreet: Triplets

Source of the problem: https://www.hackerrank.com/challenges/triplets

My submission scored 32.22 and got TLE on the other test cases. The complexity of my solution is O(N^2), which should be optimized to O(NlogN), but I don't know how.

The general idea of my submission:
1) Record the first and last position of each integer.
2) Maintain two arrays of the same length as the input, updated sequentially:
---- Array1[i]: how many smaller integers satisfy input[j] < input[i] and j < i;
---- Array2[i]: how many triplets there are with input[i] as the third number;
---- to update Array2[i], just sum the Array1[j] that satisfy input[j] < input[i] and j < i. Repeated integers need special care.
3) Sum all the Array2[i], again taking care of repeated integers.

P.S.: You have to use 'long long int' in the output format, or else you will get WA. When N = 100000 the result can be as large as C_{100000}^{3}, which is beyond 'unsigned int'.

The code is as follows:

#include <cstdio>
#include <map>
#include <vector>
using namespace std;

typedef struct{
   unsigned int value;
   long long int two;    // number of ascending pairs ending here
   long long int three;  // number of ascending triplets ending here
} data;

typedef struct{
    int pre;     // first occurrence of the value in the input
    int post;    // last occurrence of the value in the input
} position;

map<unsigned int, position> dataPos;

// For every position i, compute dataVec[i].two (ascending pairs ending at i)
// and dataVec[i].three (ascending triplets ending at i), counting each
// duplicated value only at its last occurrence; return the triplet total.
long long int getResNum(vector<data> dataVec){
    int i, j;
    for (i = 1; i < dataVec.size(); i++){
        long long int count1 = 0;
        long long int count2 = 0;
        for (j = 0; j < i; j++){
            if (dataVec[j].value < dataVec[i].value){
                if (dataPos[dataVec[j].value].post < i &&  \
                           j != dataPos[dataVec[j].value].post) {
                    continue;
                }else{
                    count1++;
                    count2 += dataVec[j].two;
                }
            }
        }
        dataVec[i].two = count1;
        dataVec[i].three = count2;
    }
    long long int res = 0;
    for (i = 1; i < dataVec.size(); i++){
        if (dataPos[dataVec[i].value].post != i) continue;
        res += dataVec[i].three;
    }
    return res;
}

int main(){
    int N, i;
    unsigned int a;
    vector<data> dataVec;
    while (scanf("%d", &N) == 1){
        dataVec.clear();
        dataPos.clear();
        data tmp;
        position p;
        for (i = 0; i < N; i++){
            scanf("%u", &a);
            tmp.value = a;
            tmp.two = tmp.three = 0;
            dataVec.push_back(tmp);
            if (dataPos.find(a) == dataPos.end()) {
                p.pre = p.post = i;
                dataPos[a] = p;
            }else{
                dataPos[a].post = i;
            }
        }
        printf("%lld\n", getResNum(dataVec));
    }
    return 0;
}

Installing RPM packages on Ubuntu

Steps:
1. Make sure alien is installed; if it is not:
sudo apt-get install alien

2. Convert the rpm into a deb package:
sudo alien --scripts filename.rpm
You will find filename.deb in the same directory.

3. Install the deb package:
sudo dpkg -i filename.deb

That's it; the installation is done.


InterviewStreet Challenge -- Even Tree

Problem: https://www.interviewstreet.com/challenges/dashboard/#problem/4fffc24df25cd

You are given a tree (a simple connected graph with no cycles). You have to remove as many edges from the tree as possible to obtain a forest, with the condition that each connected component of the forest contains an even number of vertices.

Your task is to calculate the number of removed edges in such a forest.

Input:
The first line of input contains two integers N and M, where N is the number of vertices and M is the number of edges, with 2 <= N <= 100.
The next M lines contain two integers ui and vi, specifying an edge of the tree (1-based indices).

Output:
Print a single integer which is the answer.

Sample Input
10 9
2 1
3 1
4 3
5 2
6 1
7 2
8 6
9 8
10 8

Sample Output :
2

Algorithm:
Root the tree at vertex 1 and recursively compute, for each node i, the size dist[i] of the subtree rooted at it (the node itself included).
Then traverse all nodes i from 2 on and count those with even dist[i]; each such node's edge to its parent can be removed, so returning this count gives the answer.

Code:

/* Enter your code here. Read input from STDIN. Print output to STDOUT */
#include <stdio.h>
#include <vector>
#include <algorithm>
using namespace std;

int a[101][101];   // adjacency matrix of the tree

// Depth-first search from 'start': mark it visited and accumulate the
// sizes of its children's subtrees into dist[start].
void mark(int *dist, int *flag, int N, int start){
    int i;
    flag[start] = 1;
    for (i = 1; i <= N; i++){
        if (a[start][i] == 1 && flag[i] == 0) break;
    }
    if (i > N) {
        return;   // no unvisited neighbor: dist[start] stays 1
    }
    for (i = 1; i <= N; i++){
        if (a[start][i] == 1 && flag[i] == 0){
            mark(dist, flag, N, i);
            dist[start] += dist[i];
        }
    }
}

int main(){
	int N, M;
	int i, j, from, to;
	int dist[101], flag[101];
	int count;
	while (scanf("%d%d", &N, &M) == 2){
		for (i = 1; i <= N; i++){
			for (j = 1; j <= N; j++){
				a[i][j] = 0;
			}
			flag[i] = 0;
			dist[i] = 1;
		}
		for (i = 0; i < M; i++){
			scanf("%d%d", &from, &to);
			a[from][to] = a[to][from] = 1;
		}
		mark(dist, flag, N, 1);   // compute subtree sizes with root 1
		count = 0;
		// every non-root node with an even-sized subtree yields one removable edge
		for (i = 2; i <= N; i++){
		    if (dist[i] % 2 == 0) count++;
		}
		printf("%d\n", count);
	}
	return 0;
}