2016-04-12 UniCareer



What kinds of quants are there?

(1) Desk Quant

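A desk quant develops the pricing models that traders use directly. The upside is being close to the money and the opportunities that come up in trading; the downside is that the pressure is high.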

(2) Model Validating Quant

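A model validating quant independently builds pricing models, but does so to check that the models developed by the desk quants are correct. The upside is a more relaxed job with less pressure; the downside is that these teams tend to have less impact and sit far from the money.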

(3) Research Quant

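A research quant tries to invent new pricing formulas and models, and sometimes carries out blue-sky research (the author is not sure exactly what this is). The upside is that the work is interesting (for people who like this sort of thing) and you learn a lot. The downside is that it can be hard to justify your existence (as with scientists, nobody notices you without a major result).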

(4) Quant Developer


(5) Statistical Arbitrage Quant

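A statistical arbitrage quant looks for patterns in data that can drive automated trading systems (that is, arbitrage systems). The techniques differ considerably from those used in derivatives pricing. These roles are mainly found in hedge funds, and the pay is extremely volatile.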

(6) Capital Quant

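A capital quant builds the bank's credit and capital models. The work is less glamorous than derivatives pricing, but it is becoming more and more important with the arrival of the Basel II banking accord. You will earn a decent income (though not a large one), with less pressure and shorter hours.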



(1) FX


(2) Equities


(3) Fixed Income

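Fixed income here means interest-rate derivatives. By market value this is probably the largest market. The mathematics is more complex because the problem is fundamentally multi-dimensional, and a lot of technical skill is required. The pay is relatively high.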

(4) Credit Derivatives

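Credit derivatives are products built on whether companies pay back their debts. The area is growing very fast and demand is strong, so pay is also high. Even so, it shows some signs of a bubble in the current economy.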

(5) Commodities


(6) Hybrids



(1) Commercial banks (HSBC, RBS)


(2) Investment banks (Goldman Sachs, Lehman Brothers)


(3) Hedge funds (Citadel Group)


(4) Accounting firms


(5) Software companies


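Options, Futures, and Other Derivatives
John C. Hull

Useful whether you are job hunting or already a senior quant. John Hull himself is very accomplished, with pioneering results in many areas; he is now at the University of Toronto. A classic among classics, with fairly broad coverage, though not very mathematical. It is known as the bible of Wall Street, so naturally it is not very hard.

Stochastic Calculus for Finance II
Steven E. Shreve

Shreve's new book: very elegant, very careful and mathematically complete, and well suited to readers with a mathematics background. It is fairly thick, though, and for getting started 3 is still the better choice. The author is now a professor with CMU in New York and a leading figure in the field. Volume I covers discrete models; Volume II covers continuous models.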

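Liar's Poker
Michael Lewis

About the arb team at the old Salomon Brothers, at the time the strongest quant traders in the world. Everyone who works in trading reads this book.

C++ Design Patterns and Derivatives Pricing
Mark S. Joshi

Very important for anyone who knows the basics of C++; more importantly, it teaches you Monte Carlo.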

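Modeling Derivatives in C++ (Wiley Finance)
Justin London
The Concepts and Practice of Mathematical Finance
Mark S. Joshi
Interest Rate Models – Theory and Practice
Damiano Brigo / Fabio Mercurio
A very highly rated book. Its greatest strength is its treatment of the Libor market model. The authors lay out all the details, including a large number of numerical results, which makes it easy for readers to study on their own and verify the results.
Probability with Martingales
David Williams
Built mainly around martingales. The first part introduces the necessary measure theory, covering just enough for the basic probability theory that follows, so you can follow it even without prior knowledge of measure theory. No wonder it is used for undergraduates. Williams is the best writer among mathematicians in this area.
Monte Carlo Methods in Financial Engineering
Paul Glasserman
A very practical book that sticks to its title: it is all about applying Monte Carlo in financial engineering. Those who have actually read it may agree that it is not well suited as a first book; it is more rewarding, and more comfortable, to read after some exposure to both subjects.
My Life as a Quant: Reflections on Physics and Finance
Emanuel Derman
The author is a first-generation quant, formerly head of the quant research group at GS and now at Columbia. He is a leading figure in stochastic vol, and in fact in many other areas too.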
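Interviewers care more about whether you understand the fundamentals thoroughly than about how much you know. Showing your interest in the field is also important: you should read the Economist, the FT and the Wall Street Journal regularly. Interviews include basic calculus or analysis questions, for example: what is the integral of log x? Being asked how the Black-Scholes formula is derived is also quite normal, and they will ask about your thesis as well.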
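As a worked example of the kind of calculus question mentioned above, here is the integral of log x, assuming "log" means the natural logarithm (the original does not say). It follows from integration by parts with u = ln x and dv = dx:

```latex
\int \ln x \, dx
  = x \ln x - \int x \cdot \frac{1}{x} \, dx
  = x \ln x - x + C
```

Differentiating x ln x - x gives ln x + 1 - 1 = ln x, which confirms the result.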











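A quant's working hours vary a lot. At RBS we start at 8:30 and leave at 6 pm. Pressure also varies a lot, and some American banks expect longer hours. In London you get 5-6 weeks of holiday, whereas 2-3 is normal in the US.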


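These days, is there anyone with even a passing connection to education who has not heard of Khan Academy? A while ago his story was turned into feel-good inspirational articles, to the point that even my mother knows about him, because even Bill Gates sings his praises.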















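Khan Academy's arithmetic and pre-algebra course is the starting point for learning mathematics from scratch and the lead-in to the algebra course. It suits older kindergarteners and primary school pupils who want to start from the very basics or who plan to take the algebra course later.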



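[Episode 1] Addition 1

[Episode 2] Addition 2

[Episode 3] Subtraction 1

[Episode 4] Subtraction 2

[Episode 5] Subtraction 3

[Episode 6] Addition 3

[Episode 7] Addition 4

[Episode 8] Subtraction 3

[Episode 9] Why borrowing works

[Episode 10] Addition 5

[Episode 11] Subtraction 4

[Episode 12] Subtraction word problems

[Episode 13] Alternate mental subtraction method

[Episode 14] Introduction to negative numbers

[Episode 15] Adding negative numbers

[Episode 16] Adding numbers with different signs

[Episode 17] Adding and subtracting negative numbers

[Episode 18] Adding two-digit numbers

[Episode 19] Addition and subtraction word problems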



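[Episode 1] Basic multiplication

[Episode 2] Multiplication tables

[Episode 3] Multiplication tables for 10, 11 and 12

[Episode 4] Division 1

[Episode 5] Division 2

[Episode 6] Dividing a total into equal parts, with applications

[Episode 7] Multiplying a two-digit number by a one-digit number

[Episode 8] Multiplying a two-digit number by a two-digit number

[Episode 9] Multiplying numbers, with applications 2

[Episode 10] Lattice multiplication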




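[Episode 1] Commutative law of addition

[Episode 2] Commutative law of multiplication

[Episode 3] Associative law of addition

[Episode 4] Associative law of multiplication

[Episode 5] Associative property of addition

[Episode 6] Distributive property

[Episode 7] Distributive property 2

[Episode 8] Distributive property example 1

[Episode 9] Number properties and absolute value

[Episode 10] Identity property 1

[Episode 11] Identity property example 2

[Episode 12] Identity property of 0

[Episode 13] Inverse property of addition

[Episode 14] Inverse property of multiplication

[Episode 15] Why dividing by 0 is undefined

[Episode 16] Why 0 divided by 0 is undefined

[Episode 17] Undefined and indeterminate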





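[Episode 1] Numerator and denominator of a fraction

[Episode 2] Interpreting the meaning of fractions

[Episode 3] Equivalent fractions

[Episode 4] Equivalent fractions example

[Episode 5] Comparing fractions

[Episode 6] Fractions in lowest terms

[Episode 7] Comparing fractions part 2

[Episode 8] Ordering fractions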




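[Episode 1] Adding decimals

[Episode 2] Decimal place value

[Episode 3] Decimal place value 2

[Episode 4] Decimals on the number line

[Episode 5] Rounding decimals

[Episode 6] Estimating with decimals

[Episode 7] Comparing decimals

[Episode 8] Adding decimals

[Episode 9] Subtracting decimals

[Episode 10] Subtracting decimals word problem

[Episode 11] Multiplying decimals by powers of 10

[Episode 12] Multiplying decimals

[Episode 13] Multiplying decimals

[Episode 14] Dividing decimals by powers of 10

[Episode 15] Dividing decimals 1

[Episode 16] Dividing decimals 2

[Episode 17] Dividing decimals 3

[Episode 18] Multiplying decimals 3

[Episode 19] Converting between decimals and fractions

[Episode 20] Converting fractions to decimals, example

[Episode 21] Converting fractions to decimals

[Episode 22] Converting fractions to decimals, example 1

[Episode 23] Converting fractions to decimals, example 2

[Episode 24] Converting repeating decimals to fractions 1

[Episode 25] Converting repeating decimals to fractions 2

[Episode 26] Converting decimals to fractions 1, example 1

[Episode 27] Converting decimals to fractions 1, example 2

[Episode 28] Converting decimals to fractions 1, example 3

[Episode 29] Converting decimals to fractions 2, example 1

[Episode 30] Converting decimals to fractions 2, example 2

[Episode 31] Percentages and decimals

[Episode 32] Representing a number as a decimal, percentage and fraction

[Episode 33] Representing a number as a decimal, percentage and fraction 2

[Episode 34] A point on the number line

[Episode 35] Ordering numbers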





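[Episode 1] Introduction to negative numbers

[Episode 2] Ordering negative numbers

[Episode 3] Adding negative numbers

[Episode 4] Adding numbers with different signs

[Episode 5] Adding and subtracting negative numbers

[Episode 6] Multiplying positive and negative numbers

[Episode 7] Dividing positive and negative numbers

[Episode 8] Why a negative times a negative is positive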




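[Episode 1] Prime numbers

[Episode 2] Recognizing prime numbers

[Episode 3] Divisibility tests

[Episode 4] Common divisibility examples

[Episode 5] Finding the factors of a number

[Episode 6] Prime factorization

[Episode 7] Least common multiple

[Episode 8] Least common multiple (LCM)

[Episode 9] Greatest common divisor

[Episode 10] Least common multiple of algebraic expressions

[Episode 11] Divisibility by 3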




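[Episode 1] Understanding exponents

[Episode 2] Understanding exponents 2

[Episode 3] Exponents level 1

[Episode 4] Exponents level 2

[Episode 5] Exponents level 3

[Episode 6] Exponent rules 1

[Episode 7] Exponent rules 2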



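[Episode 1] Introduction to ratios

[Episode 2] Understanding proportions

[Episode 3] Ratios as fractions in simplest form

[Episode 4] Simplifying ratios

[Episode 5] Finding the unknown in a proportion

[Episode 6] Finding unit rates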




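[Episode 1] Introduction to order of operations

[Episode 2] A more complicated order of operations example

[Episode 3] Order of operations 1

[Episode 4] Order of operations 2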





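2016-04-12 Gu Xianfeng (顾险峰), 赛先生

Figure 1. A three-dimensional computer model for the Poincaré conjecture.

Gu Xianfeng (tenured professor at the State University of New York at Stony Brook, visiting professor at the Yau Mathematical Sciences Center of Tsinghua University, founder of computational conformal geometry)

Recently Matt Ridley, a member of the UK House of Lords, wrote an essay in the Wall Street Journal titled "The Myth of Basic Science". He argues that the claim "science drives innovation, and innovation drives commerce" is largely wrong: instead, commerce drives innovation and innovation drives science, just as scientists are driven by practical needs rather than practical needs being driven by scientists. In short, he holds that "scientific breakthroughs are the effect, not the cause, of technological progress".

Mr. Ridley's remarks reflect a serious misunderstanding of basic science that many people share. They may confuse young students and unsettle their values, so the record needs to be set straight. Admittedly, commercial needs and engineering practice do supply basic science with research material. Historically, for example, Optimal Mass Transportation Theory and the Monge-Ampère equation originated in the problem of moving earth and stone; the conjecture was eventually settled by Kantorovich, who received the Nobel Prize in Economics for it. A few years ago, in order to solve the problem of compressing medical images, Terence Tao proposed the theory of Compressive Sensing. Fundamentally, however, the driving force of basic science is scientists' curiosity about the truths of nature and their pursuit of aesthetic value. Because breakthroughs in basic science reveal objective truths about nature, they often trigger revolutions in applied science. Research in pure mathematics is obscure and abstract, and its practical value is not obvious or intuitive, so the general public has long tended to regard it as "useless". In fact, the guiding role of pure mathematics in applied science is irreplaceable.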



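1  The Poincaré conjecture

The French mathematician Jules Henri Poincaré is the founder of modern topology. Topology studies the properties of geometric objects, such as manifolds, that remain unchanged under continuous deformation. Imagine a surface made of a rubber membrane: we may stretch, compress, twist and curl it, but never tear it or glue parts together. All such deformations are continuous, or topological, deformations, and the quantities they leave unchanged are topological invariants. If one rubber surface can be turned into another by a topological deformation, the two surfaces share the same topological invariants and are topologically equivalent. As Figure 2 shows, if the rabbit surface is made of rubber, we can inflate it like a balloon into the standard unit sphere, so the rabbit surface and the unit sphere are topologically equivalent.

Figure 2. The rabbit surface can be continuously deformed into the unit sphere, so the two are topologically equivalent.


Figure 3. A closed surface of genus 2. Genus is the most important topological invariant of a surface.


Figure 4. How can an ant living on a surface detect the surface's topology?


Figure 5. On a genus-1 surface, closed loops that cannot be shrunk to a point.


Figure 6. A three-manifold with boundary, represented by a triangulation.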


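2  The surface uniformization theorem


Figure 7. A geodesic connecting two points on a face surface.


Figure 8. The surface uniformization theorem: every closed surface can be conformally deformed into a surface of constant curvature.


Figure 9. Conformal maps preserve local shapes.

3  Thurston's geometrization conjecture


Figure 10. Thurston's apple: the geometrization conjecture.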


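4  Hamilton's Ricci curvature

The essential breakthrough came from Hamilton's Ricci Flow. Hamilton's idea comes from the classical thermodynamic phenomenon of diffusion. Suppose we have a rabbit made of sheet iron whose surface temperature is initially non-uniform. As time passes, the temperature gradually evens out, and at thermal equilibrium it is constant. Hamilton's idea was this: if the Riemannian metric evolves in time, with the rate of change of the metric proportional to the curvature, then the curvature diffuses like temperature, gradually becoming uniform until it is constant. As Figure 11 shows, an initial dumbbell surface evolves under curvature flow so that its curvature becomes more and more uniform and finally constant, and the surface becomes a sphere.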
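The heat analogy above can be illustrated with a toy computation. This is only a minimal sketch of ordinary heat diffusion on a closed ring of sample points, not an implementation of Ricci flow; the function name `diffuse`, the step size `alpha` and the initial temperatures are all assumptions made for illustration.

```python
def diffuse(temps, alpha=0.25, steps=2000):
    """Explicit heat diffusion on a closed ring of sample points.

    Each step moves every value toward the average of its two neighbours.
    alpha must stay at or below 0.5 for this explicit scheme to be stable.
    """
    u = list(temps)
    n = len(u)
    for _ in range(steps):
        u = [
            u[i] + alpha * (u[i - 1] + u[(i + 1) % n] - 2.0 * u[i])
            for i in range(n)
        ]
    return u


# Hypothetical initial state: one hot patch on an otherwise cool ring.
initial = [100.0 if i < 8 else 20.0 for i in range(32)]
final = diffuse(initial)

spread = max(final) - min(final)          # how non-uniform the result is
mean_before = sum(initial) / len(initial)
mean_after = sum(final) / len(final)      # total heat is conserved
print(spread, mean_before, mean_after)
```

The temperature becomes nearly constant while its average is preserved, which is the behaviour the curvature is described as having under the flow.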

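Figure 11. Curvature flow makes the curvature more and more uniform until it is constant, and the surface becomes a sphere.

For two-dimensional surfaces, Hamilton and Ben Chow proved that curvature flow does deform any Riemannian metric into a metric of constant curvature, which gives a constructive proof of the surface uniformization theorem. For three-dimensional manifolds, however, Ricci flow faces a huge challenge. On a surface, the curvature at every point stays finite at every time during the flow; on a three-manifold, the curvature at some point may tend to infinity in finite time. This is called curvature blowup, and the point where it happens is called a singularity.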


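5 Computational techniques arising from the Poincaré conjecture



Hamilton's Ricci flow is defined on smooth manifolds, whereas in a computer every manifold is discretized. We therefore need a discrete Ricci flow theory in order to develop the corresponding computational methods. After many years of effort, the author and collaborators established a theory of Ricci curvature flow on discrete surfaces and proved the existence and uniqueness of the discrete solution. Almost every important problem in the differential geometry of surfaces runs into the uniformization theorem, so we believe that computational methods based on discrete curvature flow will play an increasingly important role in engineering practice [1].

Figure 12. Uniformization of surfaces with boundary, computed by discrete Ricci flow.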


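6 Precision medicine


Figure 13. Virtual colonoscopy.


Rectal cancer is the fourth leading killer of men, behind only cardiovascular and cerebrovascular diseases. After middle age, everyone naturally develops rectal polyps, which grow year by year. Once a polyp reaches a certain diameter, friction causes it to ulcerate, and long-term ulceration leads to cancer. Polyps grow very slowly, however: it usually takes seven or eight years from a polyp's appearance until it reaches the critical size, so monitoring polyps is crucial for preventing rectal cancer. Middle-aged people should have a colonoscopy every two years. The traditional procedure requires general anesthesia and the insertion of an optical endoscope into the rectum. Elderly patients have thinner intestinal walls and are prone to complications. In addition, the intestinal wall has many folds; a polyp hidden in a fold may not be visible to the doctor and can be missed.


Figure 14. Flattening the rectal surface with Ricci curvature flow.


Figure 15. Virtual cystoscopy.


Figure 16. Ricci curvature flow maps the cerebral cortex surface conformally onto the unit sphere to make comparison easier.


Neurological diseases such as Alzheimer's disease (commonly known as senile dementia), epilepsy and childhood autism seriously threaten human health, so preventing and diagnosing them is of great practical importance. With magnetic resonance imaging we can obtain the surface of the human cerebral cortex, as shown in Figure 16. The geometry of the cortical surface is very complex, with many folds, sulci and gyri, and these structures differ from person to person and change with age. Alzheimer's disease, for example, is often accompanied by atrophy of part of the cortex. To monitor the progress of the disease, we need to scan the patient's brain every few months and compare the cortical surfaces obtained at different times precisely. Comparing them directly in three-dimensional space is very hard: it is easy to align different sulci and gyri with each other by mistake, and the algorithm gets trapped in a local optimum. As Figure 16 shows, we map the cortical surface conformally onto a sphere and then build a smooth map between the spheres, which is simpler and more accurate. Mapping the cortex onto a sphere is equivalent to giving the cortical surface a Riemannian metric of curvature +1, which we can obtain by Ricci curvature flow.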

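Figure 17. Geometric analysis of the hippocampus.


Figure 18. Precise matching of face surfaces.



Figure 19. A three-dimensional face surface mapped conformally onto the two-dimensional plane; the method used is Ricci curvature flow.



7 Summary and outlook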






[1] W. Zeng and X. Gu, Ricci Flow for Shape Analysis and Surface Registration: Theories, Algorithms and Applications, Springer, 2013.



15 Must Read Books for Entrepreneurs in Data Science

Source: http://www.analyticsvidhya.com/blog/2016/04/15-read-books-entrepreneurs-data-science/



The roots of entrepreneurship are old, but its fruits have never been as lucrative as they are now. Until 2010, not many of us had heard the term ‘start-up’. Now not a day goes by without the business newspapers mentioning one. There has been a sudden surge in people's courage.

Today, I see one in five people talking about a new business idea. Some of them even succeed in establishing their dream company, but only the determined ones last. In data science, the story is a bit different.

Success in data science is driven mainly by knowledge of the subject. Entrepreneurs are not required to work at ground level, but they must have a sound knowledge of how the work is done: which algorithms, tools and techniques are used to create products and services.

To gain this knowledge, you have two options:

  1. You work for 5-6 years in data science, get to know your way around and then start your business.
  2. You start reading books along the way and become confident enough to start within the first few years.

I would opt for the second option.


Why read books?

Think of your brain as a library. And it’s a HUGE library.

What would an empty library look like? If I close my eyes and imagine it, I see dust, spider webs, the Brownian movement of dust particles and darkness. If this image horrifies you, then start reading books.

The books listed below give immense knowledge and motivation in the technology arena. Reading them will give you the chance to live many different entrepreneurial lives. Take them one by one, and don’t get overwhelmed. I’ve included a mix of technical and motivational books for entrepreneurs in data science. Happy reading!

List of Books

Data Science For Business

This book is written by Foster Provost and Tom Fawcett. It gives a great head start to anyone who is serious about doing business with big data analytics. It makes you believe that data is now business: no business in the world can sustain itself without leveraging the power of data. The book introduces the practical side of data analysis principles and algorithms without the technical detail. It gives you enough intuition and confidence to lead a team of data scientists and recommend what is required. More importantly, it teaches a winning approach to mastering business problem solving.

Big Data at Work

This book is written by Thomas H. Davenport. It reveals the increasing importance of big data in organizations, backed by interesting numbers, research and statistics. Until 2009, companies worked on data samples. With the advent of powerful devices and data storage capabilities, companies now work on whole data sets and do not want to miss a single bit of information. The book unveils the real side of big data and its influence on our daily lives, on companies and on our jobs. As an entrepreneur, it is extremely important for you to understand big data and its related terminology.

Lean Analytics

This book is written by Alistair Croll and Benjamin Yoskovitz. It is one of the most appreciated books on data startups. It consists of practical, detailed research, advice and guidance that can help you build your startup faster. It gives enough intuition to build data-driven products and market them. The language is simple to understand, and there are enough real-world examples to make you believe that a business needs data analytics like a human needs air. For an entrepreneur, it introduces the practical side of product development and what it takes to succeed in a red ocean market.


Moneyball

This book is written by Michael Lewis. It is a brilliant tale that delivers some serious inspiration. A man named Billy Beane does what most of the world failed to imagine, just by using data and statistics, and paves a path to victory when circumstances were not favorable. Running a business needs continuous motivation, and this is a good place to start. However, the book involves technical aspects of baseball, so if you don’t know baseball you might struggle in the initial chapters. A movie has also been made of this book. Do watch it!

Elon Musk

This book is written by Ashlee Vance. I’m sure none of us is fortunate enough to live the life of Elon Musk, but this book lets us dive into his life and experience the rise of a fantastic future. Elon is the face behind PayPal, Tesla and SpaceX. He has dreamed of making space travel easy and cheap. Recently, he was applauded by Barack Obama for the successful landing of his spaceship in an ocean. People admire him and want to know his secrets, and this is where you can look for them. As an entrepreneur, you will learn about the must-have ingredients you need to become successful in the technology space.

Keeping Up with the Quants

This book is written by Thomas H. Davenport and Jinho Kim. As we all know, data science is driven by numbers and maths (quants). Inspired by Moneyball, this book teaches methods of using quantitative analysis for decision making. An entrepreneur is constantly making decisions, and must learn to make them using numbers and analysis rather than intuition. The language is easy to understand and suits readers without a maths background too. The book will also make you comfortable with basic statistics and quantitative calculations in the world of business.

The Signal and the Noise

The author of this book is Nate Silver, the famous statistician who correctly predicted the US presidential election in 2012. The book shows the real art and science of making predictions from data. This art involves developing the ability to filter out noise and make correct predictions. It includes interesting examples that convey the ultimate reasons behind the success and failure of predictions. With more and more data, predictions have become prone to noise errors, so it is increasingly important to understand the science behind making predictions from big data. The chapters are interesting and intuitive.

When Genius Failed

This book is written by Roger Lowenstein. It is an epic story of the rise and failure of a hedge fund. For an entrepreneur, it has ample lessons on investing, market conditions and capital management. It is the story of a small fund that used quantitative techniques for bond pricing throughout the world and tried to ensure that every investment it made gave a profitable result. However, it did not last long: its quick rise was followed by failure. The impact of that failure was so devastating that the US Federal Reserve stepped in to rescue it, because the fund’s bankruptcy would have had a large negative influence on the world’s economy.

Lean Startup

This book is written by Eric Ries. In one line, it teaches how not to fail at the start of your business. It reveals proven strategies followed by startups around the world, and it has an abundance of stories to keep you on the right path. An entrepreneur should read it when he or she feels drained of motivation. It teaches you to learn quickly, implement new methods and act quickly if something doesn’t work out. The book applies to all industries and is not specific to data science.

Web Analytics 2.0

This book is written by Avinash Kaushik. It is one of the best books for learning about web analytics. The internet is the fastest mode of collecting data, and every entrepreneur must learn the art of internet accounting. Most businesses today face the challenge of a weak presence on social media and internet platforms. Using various proven strategies and actionable insights, this book helps you solve challenges that could get in your way. It also provides a winning template that can be applied in most situations. It focuses on choosing the right metrics and ways to keep them under control.

Predictive Analytics

This book is written by Eric Siegel. It is a good follow-up to Web Analytics 2.0: once you have understood the underlying concepts of internet data, metrics and key strategies, this book teaches you how to use that knowledge to make predictions. It is simple to understand and covers many interesting case studies showing how companies predict our behavior and sell us products. It doesn’t cover technical aspects, but explains the general workings of predictive analytics and its applications.


Freakonomics

This book is written by Steven D. Levitt and Stephen J. Dubner. It shows the importance of numbers, data and quantitative analysis through various interesting stories. It says there is a logic in everything that happens around us. Reading this book will make you aware of the unexplored depth at which data affects our real lives. It draws an interesting analogy between school teachers and sumo wrestlers. The bizarre stories featuring criminal acts, real estate and drug dealers will certainly add to your excitement.

Founders at Work

This book is written by Jessica Livingston. Again, this isn’t specific to data science, but it is a source of motivation to keep you moving forward. It’s a collection of interviews with the founders of various startups across the world. The focus is on the early days, i.e. how they acted when they started. This book will give you enough proven ideas, strategies and lessons to anticipate and avoid pitfalls in your initial days of business. It consists of stories by Steve Wozniak (Apple), Max Levchin (PayPal), Caterina Fake (Flickr) and many more. In total there are 32 interviews, which means you have the chance to learn from 32 mentors in a single book. A must-read for entrepreneurs.

Bootstrapping a Business

This book is written by Greg Gianforte and Marcus Gibson. It teaches what to do when you are running short of money and still don’t want to stop. This is a must-read for every entrepreneur. Considering the amount of investment required in data science startups, this book should have a special place in an entrepreneur’s heart. It reveals various eye-opening truths and strategies that can help you build a great company. Greg and Marcus show that money is not always the reason for startup failure; it’s all about the founder’s perspective. The book has stories of success and failure, again a great chance for you to live many lives by reading it.

Analytics at Work

This book is written by Thomas H. Davenport, Jeanne G. Harris and Robert Morison. It reveals the increased use of analytical tools and concepts by managers to make informed business decisions. The decision-making process has accelerated. For greater impact, it also includes examples from popular companies like Hotels.com, Best Buy and many more. It talks about recruiting, coordination with people and the use of data and analytics at an enterprise level. Many of us are aware of data and analytics, but only a few know how to use them together. This quick book has it all!

End Notes

This marks the end of this list. While compiling it, I realized that most of these books are about sharing experience and learning from the mistakes of others. It is also immensely important to possess quantitative ability to become good at data science. I would suggest that you make a reading list and stick to it throughout the year. You can take up any book to start, though I’d suggest starting with a motivational one.

Have you read any other books? What were your key takeaways? Did you like reading this article? Do share your knowledge and experiences in the comments below.

Essentials of Machine Learning Algorithms (with Python and R Codes)

Source: http://www.analyticsvidhya.com/blog/2015/08/common-machine-learning-algorithms/


Google’s self-driving cars and robots get a lot of press, but the company’s real future is in machine learning, the technology that enables computers to get smarter and more personal.

– Eric Schmidt (Google Chairman)

We are probably living in the most defining period of human history: the period when computing moved from large mainframes to PCs to the cloud. But what makes it defining is not what has happened, but what is coming our way in the years to come.

What makes this period exciting for someone like me is the democratization of the tools and techniques that followed the boost in computing. Today, as a data scientist, I can build data-crunching machines with complex algorithms for a few dollars per hour. But getting here wasn’t easy! I had my dark days and nights.


Who can benefit the most from this guide?

What I am giving out today is probably the most valuable guide I have ever created.

The idea behind this guide is to simplify the journey of aspiring data scientists and machine learning enthusiasts across the world. Through this guide, I will enable you to work on machine learning problems and gain from experience. I am providing a high-level understanding of various machine learning algorithms along with R and Python code to run them. These should be sufficient to get your hands dirty.

I have deliberately skipped the statistics behind these techniques, as you don’t need to understand them at the start. So, if you are looking for a statistical understanding of these algorithms, you should look elsewhere. But if you want to equip yourself to start building machine learning projects, you are in for a treat.


Broadly, there are 3 types of Machine Learning Algorithms.

1. Supervised Learning

How it works: This type of algorithm has a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression etc.


2. Unsupervised Learning

How it works: In this type of algorithm, we do not have any target or outcome variable to predict / estimate. It is used for clustering a population into different groups, which is widely used for segmenting customers into groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means.


3. Reinforcement Learning

How it works: Using this type of algorithm, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually using trial and error. The machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning: Markov Decision Process

List of Common Machine Learning Algorithms

Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

  1. Linear Regression
  2. Logistic Regression
  3. Decision Tree
  4. SVM
  5. Naive Bayes
  6. KNN
  7. K-Means
  8. Random Forest
  9. Dimensionality Reduction Algorithms
  10. Gradient Boost & Adaboost

1. Linear Regression

It is used to estimate real values (cost of houses, number of calls, total sales etc.) based on continuous variable(s). Here, we establish a relationship between the independent and dependent variables by fitting a best-fit line. This best-fit line is known as the regression line and is represented by the linear equation Y = a*X + b.

The best way to understand linear regression is to relive a childhood experience. Let us say you ask a child in fifth grade to arrange the people in his class in increasing order of weight, without asking them their weights! What do you think the child will do? He / she would likely look at (visually analyze) the height and build of each person and arrange them using a combination of these visible parameters. This is linear regression in real life! The child has actually figured out that height and build are correlated with weight by a relationship that looks like the equation above.

In this equation:

  • Y – Dependent Variable
  • a – Slope
  • X – Independent variable
  • b – Intercept

These coefficients a and b are derived by minimizing the sum of the squared distances between the data points and the regression line.
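For a single independent variable, that minimization has a well-known closed-form solution, which the following minimal sketch computes in plain Python. The function name `fit_line` and the height/weight numbers are hypothetical and chosen only so the result is easy to check by hand; they are not the data behind the example in this article.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for Y = a*X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept: the line passes through the means
    return a, b


# Hypothetical heights (cm) and weights (kg) lying exactly on a line
heights = [150.0, 160.0, 170.0, 180.0]
weights = [50.0, 56.0, 62.0, 68.0]
a, b = fit_line(heights, weights)
print(a, b)
```

Because the sample points lie exactly on a line, the fitted slope and intercept reproduce that line; with noisy data the same formulas give the line with the smallest sum of squared vertical errors.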

Look at the example below. Here we have identified the best-fit line, with linear equation y=0.2811x+13.9. Using this equation, we can find a person's weight if we know their height.


Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable. Multiple Linear Regression (as the name suggests) is characterized by multiple (more than 1) independent variables. While finding the best-fit line, you can also fit a polynomial or a curve; this is known as polynomial or curvilinear regression.

Python Code

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s) and values must be numeric and numpy arrays
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
#Predict Output
predicted= linear.predict(x_test)

R Code

#Load Train and Test datasets
#Identify feature and response variable(s); values must be numeric
#x_train, y_train and x_test are placeholders for your own data
x <- data.frame(x_train, y_train = y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
#Predict Output
predicted <- predict(linear, x_test)


2. Logistic Regression

Don’t get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. Hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected).

Again, let us try and understand this through a simple example.

Let’s say your friend gives you a puzzle to solve. There are only 2 outcome scenarios: either you solve it or you don’t. Now imagine that you are given a wide range of puzzles / quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this: if you are given a trigonometry-based tenth grade problem, you are 70% likely to solve it. On the other hand, if it is a fifth grade history question, the probability of getting the answer is only 30%. This is what Logistic Regression provides you.

Coming to the math, the log odds of the outcome is modeled as a linear combination of the predictor variables.

odds= p/ (1-p) = probability of event occurrence / probability of not event occurrence
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3....+bkXk

Above, p is the probability of presence of the characteristic of interest. The method chooses parameters that maximize the likelihood of observing the sample values, rather than parameters that minimize the sum of squared errors (as in ordinary regression).

Now, you may ask, why take a log? For the sake of simplicity, let’s just say that this is one of the best mathematical ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this article.
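The relationship between probability, odds and log odds can be checked numerically. The sketch below is illustrative only: the helper names (`odds`, `logit`, `sigmoid`, `predict_probability`) are assumptions introduced here, not part of any library, and the coefficients passed in are made up.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)


def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(odds(p))


def sigmoid(z):
    """Inverse of logit: maps any real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(coefficients, intercept, features):
    """Probability implied by logit(p) = b0 + b1*X1 + ... + bk*Xk."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


p = 0.7                      # e.g. a 70% chance of solving the puzzle
print(odds(p))               # about 2.33, i.e. 7 to 3
print(sigmoid(logit(p)))     # back to 0.7
print(predict_probability([0.0], 0.0, [1.0]))  # zero log odds means p = 0.5
```

The log odds can take any real value, which is why they can be modeled as a linear combination of predictors, while the sigmoid squeezes the result back into the range 0 to 1.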

Python Code

#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have X (predictors) and y (target) for the training data set and x_test (predictors) of the test data set
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
#Predict Output
predicted= model.predict(x_test)

R Code

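#x_train, y_train and x_test are placeholders for your own data
x <- data.frame(x_train, y_train = y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)
#Predict Output (type = "response" returns probabilities rather than log odds)
predicted <- predict(logistic, x_test, type = "response")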



There are many different steps that could be tried in order to improve the model:


3. Decision Tree

This is one of my favorite algorithms and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Perhaps surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. The split is based on the most significant attributes / independent variables, so as to make the groups as distinct as possible. For more details, you can read: Decision Tree Simplified.


source: statsexchange

In the image above, you can see that the population is classified into four different groups based on multiple attributes to identify ‘if they will play or not’. To split the population into different heterogeneous groups, it uses various techniques like Gini, Information Gain, Chi-square and entropy.

The best way to understand how a decision tree works is to play Jezzball, a classic game from Microsoft (image below). Essentially, you have a room with moving balls and you need to create walls such that the maximum area gets cleared off without the balls.


So, every time you split the room with a wall, you are trying to create 2 different populations within the same room. Decision trees work in a very similar fashion, by dividing a population into groups that are as different as possible.
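To make "as different as possible" concrete, here is a minimal sketch of the Gini impurity mentioned above, used to compare two candidate splits. The functions `gini` and `weighted_gini` and the "play" / "no" labels are hypothetical, written for illustration; a library such as scikit-learn computes this internally.

```python
def gini(labels):
    """Gini impurity: 0 for a pure group, larger for a more mixed group."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Impurity of a split: size-weighted average of the two children."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical population: 5 people who play and 5 who do not
parent = ["play"] * 5 + ["no"] * 5

# Split A separates the classes perfectly; split B barely helps
pure_split = weighted_gini(["play"] * 5, ["no"] * 5)
mixed_split = weighted_gini(
    ["play", "play", "play", "no", "no"],
    ["play", "play", "no", "no", "no"],
)
print(gini(parent), pure_split, mixed_split)
```

A tree-building algorithm would prefer the split with the lower weighted impurity, here the one that separates the two classes completely.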

More: Simplified Version of Decision Tree Algorithms

Python Code

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have X (predictors) and y (target) for the training data set and x_test (predictors) of the test data set
# Create tree object 
model = tree.DecisionTreeClassifier(criterion='gini') # for classification, here you can change the algorithm as gini or entropy (information gain) by default it is gini  
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)

R Code

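library(rpart)
#x_train, y_train and x_test are placeholders for your own data
x <- data.frame(x_train, y_train = y_train)
# grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
#Predict Output
predicted <- predict(fit, x_test, type = "class")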


4. SVM (Support Vector Machine)

It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.

For example, if we only had two features, such as the height and hair length of an individual, we’d first plot these two variables in two-dimensional space, where each point has two coordinates. (The points that end up closest to the separating line are known as support vectors.)


Now, we will find a line that splits the data between the two differently classified groups. This will be the line for which the distance to the closest point in each of the two groups is as large as possible.


In the example shown above, the line that splits the data into two differently classified groups is the black line, since the two closest points are farthest from it. This line is our classifier. Then, depending on which side of the line a test point lands, we assign it to the corresponding class.

More: Simplified Version of Support Vector Machine

Think of this algorithm as playing JezzBall in n-dimensional space. The tweaks in the game are:

  • You can draw lines / planes at any angle (rather than just horizontal or vertical as in the classic game)
  • The objective of the game is to segregate balls of different colors in different rooms.
  • And the balls are not moving.


Python Code

# Import library
from sklearn import svm
# Assumes you have X (predictors) and y (target) for the training data set and x_test (predictors) for the test data set
# Create SVM classification object
model = svm.SVC()  # many options are available; this is the simplest form for classification. See the scikit-learn documentation for more detail.
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)

R Code

library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- svm(y_train ~ ., data = x)
# Predict output
predicted <- predict(fit, x_test)


5. Naive Bayes

It is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can sometimes outperform far more sophisticated classification methods.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

P(c|x) = P(x|c) * P(c) / P(x)

  • P(c|x) is the posterior probability of class (target) given predictor (attribute).
  • P(c) is the prior probability of class.
  • P(x|c) is the likelihood which is the probability of predictor given class.
  • P(x) is the prior probability of predictor.

Example: Let’s understand it using an example. Below I have a training data set of weather and corresponding target variable ‘Play’. Now, we need to classify whether players will play or not based on weather condition. Let’s follow the below steps to perform it.

Step 1: Convert the data set to frequency table

Step 2: Create a likelihood table by finding the probabilities, e.g. the probability of Overcast is 0.29 and the probability of playing is 0.64.


Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the method discussed above, so P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60. This is higher than P(No | Sunny) = 1 - 0.60 = 0.40, so the prediction is that players will play.
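The arithmetic of the worked example above can be checked with exact fractions. The counts (14 days, 9 "Yes" days, 5 sunny days, 3 days that are both sunny and "Yes") are the ones quoted in the example; the variable names are just for illustration.

```python
from fractions import Fraction

# Counts quoted in the worked example.
total_days = 14
yes_days = 9
sunny_days = 5
sunny_and_yes = 3

p_yes = Fraction(yes_days, total_days)                 # P(Yes)
p_sunny = Fraction(sunny_days, total_days)             # P(Sunny)
p_sunny_given_yes = Fraction(sunny_and_yes, yes_days)  # P(Sunny | Yes)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny

print(float(p_yes_given_sunny), float(p_no_given_sunny))
```

Working with the rounded values 0.33, 0.64 and 0.36 gives roughly 0.59; the exact fractions give 3/5 = 0.60.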

Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and in problems with multiple classes.

Python Code

# Import library
from sklearn.naive_bayes import GaussianNB
# Assumes you have X (predictors) and y (target) for the training data set and x_test (predictors) for the test data set
# Create Gaussian Naive Bayes object; other variants exist for other feature types, e.g. MultinomialNB and BernoulliNB
model = GaussianNB()
# Train the model using the training set
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)

R Code

library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
# Predict output
predicted <- predict(fit, x_test)


6. KNN (K- Nearest Neighbors)

It can be used for both classification and regression problems, although in industry it is more widely used for classification. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k neighbors: a case is assigned to the class that is most common among its K nearest neighbors, as measured by a distance function.

These distance functions can be Euclidean, Manhattan, Minkowski or Hamming distance. The first three are used for continuous variables and the fourth (Hamming) for categorical variables. If K = 1, the case is simply assigned to the class of its nearest neighbor. Choosing K can turn out to be a challenge when performing KNN modeling.
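The majority-vote rule and the choice of distance function can be sketched without any library. The training points, labels and function names below are invented for the example; a production version would normally use scikit-learn as shown further down.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Return the majority label among the k training points closest to query."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: distance(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data with two well-separated classes.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["red", "red", "red", "blue", "blue", "blue"]

label_near_red = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
label_near_blue = knn_predict(train_points, train_labels, (6.0, 6.5), k=1, distance=manhattan)
print(label_near_red, label_near_blue)
```

With k=1 the second call simply returns the label of the single nearest neighbor, as described above.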

More: Introduction to k-nearest neighbors : Simplified.


KNN can easily be mapped to our real lives. If you want to learn about a person about whom you have no information, you might find out about their close friends and the circles they move in, and so gain access to information about them!

Things to consider before selecting KNN:

  • KNN is computationally expensive
  • Variables should be normalized, otherwise variables with larger ranges can bias it
  • Spend more effort on pre-processing (e.g. outlier and noise removal) before applying KNN
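The normalization point can be seen with a small experiment. In this sketch the income and age figures are invented, and min-max scaling is just one possible choice of normalization; the code finds the nearest neighbor of the first person before and after scaling.

```python
import math

def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

def nearest_index(rows, query_index):
    """Index of the row closest (Euclidean) to rows[query_index], excluding itself."""
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], query))

# Hypothetical people: income in dollars and age in years.
incomes = [30000.0, 31000.0, 33000.0, 90000.0]
ages = [25.0, 60.0, 26.0, 40.0]

raw_rows = list(zip(incomes, ages))
scaled_rows = list(zip(min_max_scale(incomes), min_max_scale(ages)))

raw_nearest = nearest_index(raw_rows, 0)
scaled_nearest = nearest_index(scaled_rows, 0)
print(raw_nearest, scaled_nearest)
```

On the raw data the income column dominates, so person 0 is matched with person 1 despite a 35-year age gap. After scaling, both features count, and person 2 (similar income and age) becomes the nearest neighbor.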

Python Code

# Import library
from sklearn.neighbors import KNeighborsClassifier
# Assumes you have X (predictors) and y (target) for the training data set and x_test (predictors) for the test data set
# Create KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)  # default value for n_neighbors is 5
# Train the model using the training set
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)

R Code

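library(class)
# knn() from the class package classifies the test set directly from the training data, with no separate fitting step
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)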


7. K-Means

It is a type of unsupervised algorithm which solves the clustering problem. Its procedure follows a simple and easy way to partition a given data set into a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to other clusters.

Remember figuring out shapes from ink blots? K-means is somewhat similar to this activity. You look at the shape and spread to decipher how many different clusters / populations are present!


How K-means forms clusters:

  1. K-means picks k points, one for each cluster, known as centroids.
  2. Each data point forms a cluster with its closest centroid, giving k clusters.
  3. The centroid of each cluster is recomputed from the existing cluster members. We now have new centroids.
  4. With the new centroids, repeat steps 2 and 3: find the closest new centroid for each data point and associate the point with that cluster. Repeat this process until convergence, i.e. until the centroids do not change.
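The four steps above can be sketched in plain Python. This is a minimal illustration under simplifying assumptions: the points are invented, the starting centroids are supplied by hand rather than chosen randomly, and the function names are hypothetical.

```python
import math

def closest_centroid(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)  # step 1: starting centroids
    labels = []
    for _ in range(max_iterations):
        # Step 2: assign every point to its closest centroid.
        labels = [closest_centroid(p, centroids) for p in points]
        # Step 3: recompute each centroid from its current members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            new_centroids.append(mean_point(members) if members else old)
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = k_means(points, initial_centroids=[(1.0, 1.0), (9.0, 8.0)])
print(centroids, labels)
```

Library implementations such as scikit-learn's KMeans add smarter initialization and several restarts, because the result of this basic loop depends on where the centroids start.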

How to determine value of K:

In K-means, we have clusters, and each cluster has its own centroid. The sum of the squared differences between the centroid and the data points within a cluster is the within-cluster sum of squares for that cluster. Adding these values over all the clusters gives the total within-cluster sum of squares for the cluster solution.

We know that this value keeps decreasing as the number of clusters increases, but if you plot the result you may see that the sum of squared distances decreases sharply up to some value of k, and then much more slowly after that. That value of k (the “elbow” of the plot) is a good choice for the number of clusters.
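To make the sharp-then-slow decrease concrete, the sketch below computes the total within-cluster sum of squares for one-dimensional data. The data and the cluster labels for k = 1, 2 and 3 are written by hand for illustration; in practice the labels would come from running K-means for each k.

```python
def within_sum_of_squares(values, labels):
    """Total within-cluster sum of squared distances for 1-D data."""
    clusters = {}
    for value, label in zip(values, labels):
        clusters.setdefault(label, []).append(value)
    total = 0.0
    for members in clusters.values():
        centroid = sum(members) / len(members)
        total += sum((value - centroid) ** 2 for value in members)
    return total

# Hypothetical data with two obvious groups, and hand-written labels for each k.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
labels_by_k = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
}
wss_by_k = {k: within_sum_of_squares(values, labels) for k, labels in labels_by_k.items()}
print(wss_by_k)
```

Here the total drops from 154.0 at k = 1 to 4.0 at k = 2, but only to 2.5 at k = 3, which points to k = 2 for this data.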


Python Code

# Import library
from sklearn.cluster import KMeans
# Assumes you have X (attributes) for the training data set and x_test (attributes) for the test data set
# Create KMeans clustering object
model = KMeans(n_clusters=3, random_state=0)
# Fit the model using the training set
model.fit(X)
# Predict the cluster of each test point
predicted = model.predict(x_test)

R Code

fit <- kmeans(X, centers = 3)  # 3-cluster solution, matching the Python example


8. Random Forest

Random Forest is a trademarked term for an ensemble of decision trees. In a Random Forest, we have a collection of decision trees (hence the name “Forest”). To classify a new object based on its attributes, each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification with the most votes (over all the trees in the forest).

Each tree is planted & grown as follows:

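  1. If the number of cases in the training set is N, then a sample of N cases is taken at random, but with replacement. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.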
  3. Each tree is grown to the largest extent possible. There is no pruning.
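The two sources of randomness in the steps above (sampling cases with replacement, and considering only m of the M input variables at a node), together with the final vote, can be sketched as follows. The tree growing itself is left out; the function names, the seed and the sample votes are assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) cases at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(num_features, m, rng):
    """Pick m distinct feature indices out of num_features to consider at a node."""
    return rng.sample(range(num_features), m)

def majority_vote(predictions):
    """The forest's answer is the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-ins for 10 training cases
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(num_features=8, m=3, rng=rng)

# Hypothetical votes from five trees for one new object.
forest_answer = majority_vote(["spam", "ham", "spam", "spam", "ham"])
print(sample, features, forest_answer)
```

Because sampling is done with replacement, the bootstrap sample usually contains some cases more than once and omits others, which is what makes the individual trees differ.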

For more details on this algorithm, a comparison with decision trees and advice on tuning model parameters, I suggest reading these articles:

  1. Introduction to Random forest – Simplified

  2. Comparing a CART model to Random Forest (Part 1)

  3. Comparing a Random Forest to a CART model (Part 2)

  4. Tuning the parameters of your Random Forest model


Python Code

# Import library
from sklearn.ensemble import RandomForestClassifier
# Assumes you have X (predictors) and y (target) for the training data set and x_test (predictors) for the test data set
# Create Random Forest object
model = RandomForestClassifier()
# Train the model using the training set
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)

R Code

library(randomForest)
x <- cbind(x_train, y_train)
# Fitting model
fit <- randomForest(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)


9. Dimensionality Reduction Algorithms

In the last 4-5 years, there has been an exponential increase in data capture at every possible stage. Corporations, government agencies and research organisations are not only coming up with new sources of data but also capturing data in great detail.

For example, e-commerce companies are capturing more details about customers, such as their demographics, web browsing history, likes and dislikes, purchase history, feedback and much more, in order to give them more personalized attention than their nearest grocery shopkeeper could.

As data scientists, the data we are offered also consists of many features. This sounds good for building a robust model, but there is a challenge: how would you identify the highly significant variable(s) out of 1000 or 2000? In such cases, dimensionality reduction algorithms help us, along with various other techniques such as Decision Tree, Random Forest, PCA, Factor Analysis, identification based on the correlation matrix, missing value ratio and others.

To learn more about these algorithms, you can read “Beginners Guide To Learn Dimension Reduction Techniques”.
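Of the techniques listed, the missing value ratio is the simplest to show: columns with too many missing entries are dropped before modeling. The sketch below is illustrative only; the table, the 0.5 threshold and the function names are assumptions, and missing entries are represented by None.

```python
def missing_ratio(column):
    """Fraction of entries in a column that are missing (None)."""
    return sum(value is None for value in column) / len(column)

def drop_sparse_columns(table, max_missing=0.5):
    """Keep only the columns whose missing-value ratio is at most max_missing."""
    return {name: column for name, column in table.items()
            if missing_ratio(column) <= max_missing}

# Hypothetical data set stored as column name -> list of values.
table = {
    "age": [25, 31, None, 40],
    "income": [30000, None, None, None],
    "visits": [3, 5, 2, 8],
}
reduced = drop_sparse_columns(table, max_missing=0.5)
print(list(reduced))
```

Here "income" is missing in three of four rows, so it is dropped, while "age" and "visits" are kept.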

Python Code

# Import library
from sklearn import decomposition
# Assumes you have training and test data sets as train and test
# Create PCA object; k is the number of components to keep (if n_components is not given, it defaults to min(n_samples, n_features))
pca = decomposition.PCA(n_components=k)
# For factor analysis
# fa = decomposition.FactorAnalysis()
# Reduce the dimension of the training data set using PCA
train_reduced = pca.fit_transform(train)
# Reduce the dimension of the test data set
test_reduced = pca.transform(test)

R Code

pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)


10. Gradient Boosting & AdaBoost

GBM and AdaBoost are boosting algorithms used when we deal with plenty of data and want predictions with high predictive power. Boosting is an ensemble learning technique which combines the predictions of several base estimators in order to improve robustness over a single estimator. It combines multiple weak or average predictors to build a strong predictor. These boosting algorithms often work well in data science competitions like Kaggle, AV Hackathon and CrowdAnalytix.
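How weak predictors combine into a strong one can be sketched with a small AdaBoost on one-dimensional data, using decision stumps (single-threshold rules) as the weak learners. This is a teaching sketch, not the scikit-learn implementation: the data, the function names and the choice of three rounds are assumptions for the example.

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump: predicts polarity when x < threshold, otherwise -polarity."""
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = max(error, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified points, lower the rest, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can classify perfectly.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [ensemble_predict(ensemble, x) for x in xs]
print(predictions)
```

On this data the best single stump gets at most four of the six points right, while the weighted vote of three stumps classifies all six correctly.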

More: Know about Gradient and AdaBoost in detail

Python Code

# Import library
from sklearn.ensemble import GradientBoostingClassifier
# Assumes you have X (predictors) and y (target) for the training data set and x_test (predictors) for the test data set
# Create Gradient Boosting Classifier object
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# Train the model using the training set
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)

R Code

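library(caret)
x <- cbind(x_train, y_train)
# Fitting model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)
predicted <- predict(fit, x_test, type = "prob")[, 2]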

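GradientBoostingClassifier and Random Forest are two different tree-based ensemble classifiers (gradient boosting builds its trees sequentially by boosting, whereas Random Forest builds them independently by bagging), and people often ask about the difference between these two algorithms.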

End Notes

By now, I am sure you have an idea of the commonly used machine learning algorithms. My sole intention in writing this article and providing the code in R and Python is to get you started right away. If you are keen to master machine learning, start right away. Take up problems, develop a physical understanding of the process, apply this code and see the fun!

Did you find this article useful? Share your views and opinions in the comments section below.


Some lesser-known truths about programming

Source: http://automagical.rationalmind.net/2010/08/17/some-lesser-known-truths-about-programming/

My experience as a programmer has taught me a few things about writing software. Here are some things that people might find surprising about writing code:

  • Averaging over the lifetime of the project, a programmer spends about 10-20% of his time writing code, and most programmers write about 10-12 lines of code per day that go into the final product, regardless of their skill level. Good programmers spend much of the other 80-90% thinking, researching, and experimenting to find the best design. Bad programmers spend much of that 80-90% debugging code by randomly making changes and seeing if they work.
  • A good programmer is ten times more productive than an average programmer. A great programmer is 20-100 times more productive than the average. This is not an exaggeration – studies since the 1960’s have consistently shown this. A bad programmer is not just unproductive – he will not only not get any work done, but create a lot of work and headaches for others to fix. “A great lathe operator commands several times the wage of an average lathe operator, but a great writer of software code is worth 10,000 times the price of an average software writer.” –Bill Gates
  • Great programmers spend little of their time writing code – at least code that ends up in the final product. Programmers who spend much of their time writing code are too lazy, too ignorant, or too arrogant to find existing solutions to old problems. Great programmers are masters at recognizing and reusing common patterns. Good programmers are not afraid to refactor (rewrite) their code to reach the ideal design. Bad programmers write code which lacks conceptual integrity, non-redundancy, hierarchy, and patterns, and so is very difficult to refactor. It’s easier to throw away bad code and start over than to change it.
  • Software development obeys the laws of entropy, like any other process. Continuous change leads to software rot, which erodes the conceptual integrity of the original design. Software rot is unavoidable, but programmers who fail to take conceptual integrity into consideration create software that rots so fast that it becomes worthless before it is even completed. Entropic failure of conceptual integrity is probably the most common reason for software project failure. (The second most common reason is delivering something other than what the customer wanted.) Software rot slows down progress exponentially, so many projects face exploding timelines and budgets before they are mercifully killed.
  • A 2004 study found that most software projects (51%) will fail in a critical aspect, and 15% will fail totally. This is an improvement since 1994, when 31% failed.
  • Although most software is made by teams, it is not a democratic activity. Usually, just one person is responsible for the design, and the rest of the team fills in the details.
  • Programming is hard work. It’s an intense mental activity. Good programmers think about their work 24/7. They write their most important code in the shower and in their dreams. Because the most important work is done away from a keyboard, software projects cannot be accelerated by spending more time in the office or adding more people to a project.