Accurately counting lines of code per GitHub contributor

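GitHub can report how many lines of code each contributor has committed to a repository. At our company's annual party there was a special "Code God" award for the engineer who contributed the most code in the previous year, and GitHub's statistics showed that the winner had committed 1.1 million lines. That figure is startling: nobody writes that much code in a year. Curious, I looked into it and found that the total included many third-party libraries he had committed, which GitHub counts like any other code, along with the code he had merged. Is there a way to strip out this invalid data and get the real contribution? After checking the GitHub API and combining it with git commands, it turns out there is. Here is the code: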

#copy this script to your target repo
#run python github-stats.py to collect data
import re
import json
import os
import sys
import requests

#get token from cmd line
tk = sys.argv[1]

user_stats={"dummy":{"additions":0,"deletions":0,"total":0}}
#query github api for last year's commits
payload = {'since':'2013-01-01T00:00:00Z','until':'2014-01-01T00:00:00Z','access_token':tk}
token = {'access_token':tk}

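def is_merge(commit_sha):
	# -s suppresses the diff, so only the one-line title is searched
	# and a "Merge" inside changed code cannot trigger a false positive
	cmd = "git show -s --oneline " + commit_sha
	output = os.popen(cmd)
	title = output.read()
	p_merge = re.compile("Merge")
	return p_merge.search(title) is not None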

def collect_stats(commit_list):
	for m in commit_list:
		# skip merge commits so merged work is not credited to the person who merged it
		if(is_merge(m['sha'])):
			continue
		# read the author name of this commit from the local clone
		git_show_command = "git show -s --format=%an " + m['sha']
		
		output = os.popen(git_show_command)

		user = output.read().strip(' \t\n\r')
		# diff from the parent to the commit, so that insertions and deletions
		# are reported from the commit's point of view rather than reversed
		git_diff_command = "git diff --shortstat " + m['sha'] + "^ " + m['sha']
		
		output = os.popen(git_diff_command)
		data = output.read()
		
		
		#print "data is:"
		#print data
		p_ins = re.compile(r"(\d+) insertion")

		r_ins = p_ins.search(data)

		ins_data = 0
		del_data = 0

		if(r_ins is not None):
		  ins_str = r_ins.group(1)
		  ins_data = int(ins_str)
		  #print ins_data

		p_del = re.compile(r"(\d+) deletion")

		r_del = p_del.search(data)

		if(r_del is not None):
		  del_str = r_del.group(1)
		  del_data = int(del_str)
		  #print del_data 

		# a commit touching more than 5000 lines is treated as noise
		# (for example an imported third-party library) and counted as zero
		if(ins_data + del_data > 5000):
		  print(user)
		  print('ins:'+str(ins_data))
		  print('del:'+str(del_data))
		  ins_data = 0
		  del_data = 0
		if(user in user_stats):
		  stats = user_stats[user]
		  stats['additions'] += ins_data
		  stats['deletions'] += del_data
		  stats['total'] += (ins_data + del_data)
		  user_stats[user] = stats
		else:
		  new_stat = {'additions':ins_data, 'deletions':del_data, 'total':ins_data+del_data}
		  user_stats[user] = new_stat

r = requests.get("https://api.github.com/repos/cocos2d/cocos2d-x/commits", params = payload)
collect_stats(r.json())

print(user_stats)

pattern = re.compile(r'<(\S+)>; rel="next"')
h = r.headers
print(r.headers['X-RateLimit-Remaining'])
# the Link header is absent when the result fits on a single page
result = pattern.search(h.get('link', ''))

while(result is not None):
	
	next_url = result.group(1)
	r = requests.get(next_url, params = token)

	collect_stats(r.json())

	h = r.headers
	# the last page has no rel="next" entry, which ends the loop
	print(h.get('link', ''))
	result = pattern.search(h.get('link', ''))

	print(r.headers['X-RateLimit-Remaining'])
	print(user_stats)
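The code is also available on GitHub: https://github.com/heliclei/githubtools/blob/master/github-stats.py
The script ignores any single commit that changes more than 5000 lines, and it skips merge commits. First clone the repository you want to measure, then copy the script into that local git repository. Remember to change this line to the URL of the corresponding repository: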

https://api.github.com/repos/cocos2d/cocos2d-x/commits
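
The filtering rule described above can be looked at in isolation. The following is only a minimal sketch, not code taken from the script: `parse_shortstat` and `counted_lines` are hypothetical helper names, and the sample strings imitate the summary line that `git diff --shortstat` prints.

```python
import re

# Hypothetical helpers for illustration; they are not part of github-stats.py.
INS_RE = re.compile(r"(\d+) insertion")
DEL_RE = re.compile(r"(\d+) deletion")

def parse_shortstat(text):
    """Return (insertions, deletions) parsed from a --shortstat summary line."""
    ins = INS_RE.search(text)
    dels = DEL_RE.search(text)
    return (int(ins.group(1)) if ins else 0,
            int(dels.group(1)) if dels else 0)

def counted_lines(text, limit=5000):
    """Apply the size filter: a commit changing more than `limit` lines counts as zero."""
    ins, dels = parse_shortstat(text)
    if ins + dels > limit:
        return (0, 0)
    return (ins, dels)

sample = " 3 files changed, 120 insertions(+), 45 deletions(-)"
print(parse_shortstat(sample))   # (120, 45)
print(counted_lines(" 900 files changed, 80000 insertions(+), 2 deletions(-)"))   # (0, 0)
```

The 5000-line limit is the same threshold the script uses; a different repository may need a different value.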
A GitHub token can be generated with the script from the previous post.
Run: python github-stats.py xxxxxxxxxxxxxgithub-oauth-tokenxxxxxxxxxxxxxxxxxxx
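
One caveat when running this today: the script sends the token as an `access_token` query parameter, while current versions of the GitHub API expect it in an `Authorization` header. The sketch below shows one way the header and the pagination step might be handled. `auth_headers` and `next_page_url` are hypothetical names, and the URLs in the example are made up.

```python
import re

# Hypothetical helpers for illustration; they are not part of github-stats.py.
NEXT_RE = re.compile(r'<([^>]+)>;\s*rel="next"')

def auth_headers(token):
    """Build the request headers that carry the token."""
    return {"Authorization": "token " + token}

def next_page_url(link_header):
    """Return the rel="next" URL from a Link header, or None on the last page."""
    match = NEXT_RE.search(link_header or "")
    return match.group(1) if match else None

# Made-up Link header in the shape GitHub uses for paginated results.
example = ('<https://api.github.com/repositories/1/commits?page=2>; rel="next", '
           '<https://api.github.com/repositories/1/commits?page=9>; rel="last"')
print(next_page_url(example))
print(next_page_url(None))
```

The dictionary returned by `auth_headers` can be passed to `requests.get` through its `headers=` argument in place of the `access_token` entry in `params`.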
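PS: After filtering, the top cocos2d-x contributor still committed more than 100,000 lines last year, which is very impressive, but the figure is no longer as surreal as 1.1 million.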
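
To read off a ranking like this, the `user_stats` dictionary printed by the script can be sorted by its `total` field. This is a small sketch under that assumption: `rank_contributors` is a hypothetical helper, and the names and numbers in the sample are invented, not real results.

```python
def rank_contributors(user_stats):
    """Return (name, total) pairs sorted by total, largest first.

    The "dummy" placeholder entry that the script seeds the dictionary with is skipped.
    """
    rows = [(name, stats["total"])
            for name, stats in user_stats.items()
            if name != "dummy"]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Invented sample data in the same shape as the script's user_stats.
sample_stats = {
    "dummy": {"additions": 0, "deletions": 0, "total": 0},
    "alice": {"additions": 200, "deletions": 100, "total": 300},
    "bob": {"additions": 1000, "deletions": 200, "total": 1200},
}
for name, total in rank_contributors(sample_stats):
    print(name, total)
```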




