1. Preparing the Data
import pandas as pd
pos_data = data[data['labels'] == 'positive'].reset_index(drop=True)
pos_data.iloc[:,0][:3]
['story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane violent mob by the crazy chantings of it s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly . ']
from pyecharts import Pie, Style
2. Validating the Idea
Looking at the data, positive reviews tend to contain words of praise while negative reviews contain words of criticism. Could we predict a review's sentiment from this feature alone?
Let's run a quick experiment to test this idea.
from collections import Counter
data.iloc[1][0]
'story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane violent mob by the crazy chantings of it s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly . '
t0 = time()
Processing complete.
Elapsed time: 5.25712513923645 seconds.
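The cell timed above is truncated; a minimal sketch of the counting it presumably performs (the counter names positive_counts, negative_counts and total_counts are taken from the cells that follow):

```python
from collections import Counter
from time import time

t0 = time()
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

# count every word occurrence, split by label
for review, label in zip(data.iloc[:, 0], data['labels']):
    for word in review.split(" "):
        if label == 'positive':
            positive_counts[word] += 1
        else:
            negative_counts[word] += 1
        total_counts[word] += 1

print("Processing complete.")
print("Elapsed time: {} seconds.".format(time() - t0))
```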
# positive_counts.most_common()
Looking at the counts, the most frequent words in both the positive and the negative samples are mostly common function words such as a, the, and, of, and so on. Next, let's combine the positive and negative counters and examine them together.
# Compute, for each word, the ratio of its occurrences in positive reviews to its occurrences in negative reviews, then take the logarithm.
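The exact formula is not shown in the post, but a reconstruction along the following lines reproduces the kind of values printed below (for instance, a word that never appears in a positive review gets -ln(100) ≈ -4.605):

```python
import numpy as np

pos_neg_ratios = Counter()
for word in total_counts:
    ratio = positive_counts[word] / float(negative_counts[word] + 1)
    if ratio > 1:
        # word leans positive: log of the ratio is positive
        pos_neg_ratios[word] = np.log(ratio)
    else:
        # word leans negative: mirror the scale below zero (0.01 avoids log of infinity)
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))
```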
# positive words
[('edie', 4.6913478822291435),
('antwone', 4.477336814478207),
('din', 4.406719247264253),
('gunga', 4.189654742026425),
('goldsworthy', 4.174387269895637)]
# negative words
[('whelk', -4.605170185988092),
('pressurized', -4.605170185988092),
('bellwood', -4.605170185988092),
('mwuhahahaa', -4.605170185988092),
('insulation', -4.605170185988092)]
These observations give preliminary support to the idea: some words appear with markedly different frequencies in positive reviews than in negative ones.
3. Building a Neural Network Prototype
from IPython.display import Image
review = 'The movie was excellent.'
4. Converting Each Review into an Input Vector
For simplicity, the input here is just a count of how often each word appears. Real data is usually messier: the presence of large numbers of irrelevant words badly distorts the representation. A simple way of dealing with this is introduced later; for a more refined weighting scheme, see TF-IDF.
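As a point of comparison only (it is not used in the rest of this post), a TF-IDF representation can be built with scikit-learn; here reviews is assumed to be the list of raw review strings used later:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF down-weights words that appear in almost every review (a, the, of, ...)
vectorizer = TfidfVectorizer()
tfidf_features = vectorizer.fit_transform(reviews)   # sparse matrix: n_reviews x n_vocab
```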
vocab = set(total_counts.keys())
Dimension of the input vector: 74074
# create a container; note that it is a 2-D array
array([[0., 0., 0., ..., 0., 0., 0.]])
# create a word-to-index dictionary
def update_input_layer(review):
array([[18., 0., 0., ..., 0., 0., 0.]])
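The three truncated cells above presumably build the container, the word-to-index map, and the count-based update_input_layer; a minimal reconstruction (assuming reviews is the list of raw review strings):

```python
import numpy as np

vocab_size = len(vocab)                     # 74074
layer_0 = np.zeros((1, vocab_size))         # 2-D container: one row, one column per word

# word -> column index
word2index = {word: i for i, word in enumerate(vocab)}

def update_input_layer(review):
    """Fill layer_0 with the word counts of a single review."""
    global layer_0
    layer_0 *= 0                            # clear the previous review's counts
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1

update_input_layer(reviews[0])              # presumably what produced the array shown above
```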
Creating the labels
def get_target_for_label(label):
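The cell is truncated; a plausible completion, assuming the labels are the 'positive'/'negative' strings seen in section 1:

```python
def get_target_for_label(label):
    # map the text label to the network's target output
    return 1 if label == 'positive' else 0
```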
5. Creating the Neural Network
import time
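The SentimentNetwork class itself is never shown in the post. The sketch below is a hypothetical reconstruction of a minimal version with the same constructor signature and the same train/test entry points used in the cells that follow; the real class also prints the Progress/Speed lines visible in the outputs.

```python
import numpy as np

class SentimentNetwork:
    def __init__(self, reviews, labels, hidden_nodes=10, learning_rate=0.1):
        np.random.seed(1)
        # vocabulary and word -> input-node index
        vocab = set(word for review in reviews for word in review.split(" "))
        self.word2index = {word: i for i, word in enumerate(vocab)}
        self.learning_rate = learning_rate
        self.input_nodes = len(vocab)
        self.hidden_nodes = hidden_nodes
        # input->hidden weights start at zero, hidden->output weights are small random values
        self.weights_0_1 = np.zeros((self.input_nodes, self.hidden_nodes))
        self.weights_1_2 = np.random.normal(0.0, self.hidden_nodes ** -0.5,
                                            (self.hidden_nodes, 1))
        self.layer_0 = np.zeros((1, self.input_nodes))

    def update_input_layer(self, review):
        self.layer_0 *= 0
        for word in review.split(" "):
            if word in self.word2index:
                self.layer_0[0][self.word2index[word]] += 1

    def get_target_for_label(self, label):
        return 1 if label == 'positive' else 0

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def train(self, reviews, labels):
        for review, label in zip(reviews, labels):
            # forward pass: linear hidden layer, sigmoid output
            self.update_input_layer(review)
            layer_1 = self.layer_0.dot(self.weights_0_1)
            layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
            # backward pass: plain stochastic gradient descent on one review
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * layer_2 * (1 - layer_2)
            layer_1_delta = layer_2_delta.dot(self.weights_1_2.T)
            self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate
            self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate

    def run(self, review):
        self.update_input_layer(review)
        layer_1 = self.layer_0.dot(self.weights_0_1)
        layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
        return 'positive' if layer_2[0, 0] >= 0.5 else 'negative'

    def test(self, reviews, labels):
        correct = sum(1 for review, label in zip(reviews, labels)
                      if self.run(review) == label)
        print("Testing Accuracy: {:.1f}%".format(100.0 * correct / len(reviews)))
```

With zero-initialized input-to-hidden weights the untrained output is always sigmoid(0) = 0.5, which is why the untrained test below sits at exactly 50%.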
6. Initial Tests
1. Before training, check that the untrained network scores about 50% on the test set.
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)
Progress:99.9% Speed(reviews/sec):1553. #Correct:500 #Tested:1000 Testing Accuracy:50.0%
mlp.train(reviews[:-1000],labels[:-1000])
Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):367.6 #Correct:1251 #Trained:2501 Training Accuracy:50.0%
Progress:20.8% Speed(reviews/sec):366.4 #Correct:2501 #Trained:5001 Training Accuracy:50.0%
Progress:31.2% Speed(reviews/sec):367.6 #Correct:3751 #Trained:7501 Training Accuracy:50.0%
Progress:41.6% Speed(reviews/sec):368.7 #Correct:5001 #Trained:10001 Training Accuracy:50.0%
Progress:52.0% Speed(reviews/sec):368.5 #Correct:6251 #Trained:12501 Training Accuracy:50.0%
Progress:62.5% Speed(reviews/sec):368.6 #Correct:7501 #Trained:15001 Training Accuracy:50.0%
Progress:72.9% Speed(reviews/sec):368.6 #Correct:8751 #Trained:17501 Training Accuracy:50.0%
Progress:83.3% Speed(reviews/sec):368.5 #Correct:10001 #Trained:20001 Training Accuracy:50.0%
Progress:93.7% Speed(reviews/sec):368.7 #Correct:11251 #Trained:22501 Training Accuracy:50.0%
Progress:99.9% Speed(reviews/sec):368.4 #Correct:12000 #Trained:24000 Training Accuracy:50.0%
2. Start training for real. Accuracy stays at 50% with no improvement, so we suspect the learning rate is too high for the model to converge.
Lower the learning rate and train again.
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.01)
Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):334.6 #Correct:1248 #Trained:2501 Training Accuracy:49.9%
Progress:20.8% Speed(reviews/sec):334.0 #Correct:2498 #Trained:5001 Training Accuracy:49.9%
Progress:31.2% Speed(reviews/sec):340.9 #Correct:3748 #Trained:7501 Training Accuracy:49.9%
Progress:41.6% Speed(reviews/sec):348.0 #Correct:4998 #Trained:10001 Training Accuracy:49.9%
Progress:52.0% Speed(reviews/sec):348.3 #Correct:6248 #Trained:12501 Training Accuracy:49.9%
Progress:62.5% Speed(reviews/sec):346.7 #Correct:7490 #Trained:15001 Training Accuracy:49.9%
Progress:72.9% Speed(reviews/sec):348.8 #Correct:8746 #Trained:17501 Training Accuracy:49.9%
Progress:83.3% Speed(reviews/sec):348.0 #Correct:9996 #Trained:20001 Training Accuracy:49.9%
Progress:93.7% Speed(reviews/sec):346.8 #Correct:11246 #Trained:22501 Training Accuracy:49.9%
Progress:99.9% Speed(reviews/sec):347.0 #Correct:11995 #Trained:24000 Training Accuracy:49.9%
3. Lower the learning rate further.
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.001)
Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):318.7 #Correct:1263 #Trained:2501 Training Accuracy:50.4%
Progress:20.8% Speed(reviews/sec):319.5 #Correct:2615 #Trained:5001 Training Accuracy:52.2%
Progress:31.2% Speed(reviews/sec):319.9 #Correct:4035 #Trained:7501 Training Accuracy:53.7%
Progress:41.6% Speed(reviews/sec):320.5 #Correct:5566 #Trained:10001 Training Accuracy:55.6%
Progress:52.0% Speed(reviews/sec):320.3 #Correct:7047 #Trained:12501 Training Accuracy:56.3%
Progress:62.5% Speed(reviews/sec):320.1 #Correct:8658 #Trained:15001 Training Accuracy:57.7%
Progress:72.9% Speed(reviews/sec):319.8 #Correct:10202 #Trained:17501 Training Accuracy:58.2%
Progress:83.3% Speed(reviews/sec):319.5 #Correct:11889 #Trained:20001 Training Accuracy:59.4%
Progress:93.7% Speed(reviews/sec):319.3 #Correct:13525 #Trained:22501 Training Accuracy:60.1%
Progress:99.9% Speed(reviews/sec):319.2 #Correct:14574 #Trained:24000 Training Accuracy:60.7%
4. Increase the number of hidden nodes.
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], hidden_nodes=15, learning_rate=0.0003)
Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):269.4 #Correct:1304 #Trained:2501 Training Accuracy:52.1%
Progress:20.8% Speed(reviews/sec):269.0 #Correct:2765 #Trained:5001 Training Accuracy:55.2%
Progress:31.2% Speed(reviews/sec):268.6 #Correct:4395 #Trained:7501 Training Accuracy:58.5%
Progress:41.6% Speed(reviews/sec):268.5 #Correct:6083 #Trained:10001 Training Accuracy:60.8%
Progress:52.0% Speed(reviews/sec):267.9 #Correct:7774 #Trained:12501 Training Accuracy:62.1%
Progress:62.5% Speed(reviews/sec):267.2 #Correct:9483 #Trained:15001 Training Accuracy:63.2%
Progress:72.9% Speed(reviews/sec):266.0 #Correct:11199 #Trained:17501 Training Accuracy:63.9%
Progress:83.3% Speed(reviews/sec):265.0 #Correct:13023 #Trained:20001 Training Accuracy:65.1%
Progress:93.7% Speed(reviews/sec):264.3 #Correct:14854 #Trained:22501 Training Accuracy:66.0%
Progress:99.9% Speed(reviews/sec):263.7 #Correct:15993 #Trained:24000 Training Accuracy:66.6%
Summary of the first training round:
The runs above show that when the learning rate is too large, the model cannot converge at all. At 0.001 the model starts to improve, albeit slowly. Normally a model improves quickly at first and then more and more slowly.
Here, however, improvement slows down once accuracy reaches about 60%, so simply running more iterations would not help much. Let's consider what might be causing this.
Problems to solve:
- Training is too slow
- Accuracy is too low
Possible causes:
- The hidden layer has too few nodes, making the model too simple
- The data itself is noisy, which strongly affects the model
In step 4 we found that increasing the hidden layer to 20 nodes did not help; the results actually got worse. Only after further lowering the learning rate to 0.0003 did accuracy improve.
Finding: when increasing the number of hidden nodes, the learning rate also needs to be reduced.
If we liken the neural network to an excavator, our goal is to dig valuable gold out of the data. When we fail to find gold at first, it is often not because the excavator isn't digging deep enough, but because we are digging in the wrong place or operating it the wrong way. So let's go back to the dataset and think about noise versus signal.
7. Analyzing the Noise
About noise
Looking at the data, words such as a, the, at and similar filler occupy most of every review, sometimes appearing dozens of times each. Feeding raw counts into the model means giving these sentiment-irrelevant words very large input values, while the genuinely informative sentiment words, which mostly appear only once, get drowned out. The simplest fix is, when converting a review into its input vector, not to accumulate word counts but simply to set each occurring word's entry to 1:

```python
def update_input_layer(self, review):
    self.layer_0 *= 0
    for word in review.split(" "):
        if word in self.word2index.keys():
            self.layer_0[0][self.word2index[word]] = 1
```
Let's re-run the experiment with this change:
import time
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], hidden_nodes=15, learning_rate=0.0003)
Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):206.7 #Correct:1909 #Trained:2501 Training Accuracy:76.3%
Progress:20.8% Speed(reviews/sec):207.2 #Correct:3917 #Trained:5001 Training Accuracy:78.3%
Progress:31.2% Speed(reviews/sec):206.2 #Correct:5979 #Trained:7501 Training Accuracy:79.7%
Progress:41.6% Speed(reviews/sec):204.5 #Correct:8068 #Trained:10001 Training Accuracy:80.6%
Progress:52.0% Speed(reviews/sec):203.6 #Correct:10165 #Trained:12501 Training Accuracy:81.3%
Progress:62.5% Speed(reviews/sec):203.1 #Correct:12236 #Trained:15001 Training Accuracy:81.5%
Progress:72.9% Speed(reviews/sec):202.6 #Correct:14312 #Trained:17501 Training Accuracy:81.7%
Progress:83.3% Speed(reviews/sec):202.2 #Correct:16455 #Trained:20001 Training Accuracy:82.2%
Progress:93.7% Speed(reviews/sec):202.0 #Correct:18598 #Trained:22501 Training Accuracy:82.6%
Progress:99.9% Speed(reviews/sec):201.8 #Correct:19894 #Trained:24000 Training Accuracy:82.8%
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], hidden_nodes=10, learning_rate=0.1)
Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):264.1 #Correct:1812 #Trained:2501 Training Accuracy:72.4%
Progress:20.8% Speed(reviews/sec):264.1 #Correct:3802 #Trained:5001 Training Accuracy:76.0%
Progress:31.2% Speed(reviews/sec):264.2 #Correct:5896 #Trained:7501 Training Accuracy:78.6%
Progress:41.6% Speed(reviews/sec):264.2 #Correct:8045 #Trained:10001 Training Accuracy:80.4%
Progress:52.0% Speed(reviews/sec):264.2 #Correct:10172 #Trained:12501 Training Accuracy:81.3%
Progress:62.5% Speed(reviews/sec):264.2 #Correct:12319 #Trained:15001 Training Accuracy:82.1%
Progress:72.9% Speed(reviews/sec):264.0 #Correct:14438 #Trained:17501 Training Accuracy:82.4%
Progress:83.3% Speed(reviews/sec):263.7 #Correct:16615 #Trained:20001 Training Accuracy:83.0%
Progress:93.7% Speed(reviews/sec):263.5 #Correct:18796 #Trained:22501 Training Accuracy:83.5%
Progress:99.9% Speed(reviews/sec):263.5 #Correct:20117 #Trained:24000 Training Accuracy:83.8%
- Check the results on the test set
mlp.test(reviews[-1000:],labels[-1000:])
Progress:99.9% Speed(reviews/sec):1620. #Correct:849 #Tested:1000 Testing Accuracy:84.9%
8. Analyzing the Network's Computational Inefficiency
The data tweak above greatly improved the model's accuracy, but training is still slow. So how can we speed it up?
Image(filename='sentiment_network.png')
def update_input_layer(review):
93.0
Consider the model again. The input layer has vocab_size (74074) nodes, yet reviews[0] contains only 93 words with non-zero entries, so the input is an extremely sparse vector. Most entries are 0, and multiplying a 0 input by its weight still gives 0; this contributes nothing while accounting for most of the computation. How can we avoid it?
Solution:
Record the indices of the non-zero elements, and when propagating from the input layer to the hidden layer, only take the weights at those indices and sum them. This saves a large amount of computation.
For example:
layer_0 = np.zeros(10)
-2.131818189044033
Going one step further, since every input value here is 1, even the multiplication by 1 can be skipped: it is enough to simply sum the weight rows at the recorded index positions.
Let's implement this idea (a sketch follows the list below):
- The hidden layer no longer performs the multiply-by-0 steps
- The hidden layer no longer multiplies weights by 1
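A minimal sketch of the optimized input-to-hidden pass, assuming review_indices holds the word2index positions of the words in the current review (the names and sizes here are illustrative):

```python
import numpy as np

# weights_0_1 is the input-to-hidden weight matrix
vocab_size, hidden_nodes = 74074, 15
weights_0_1 = np.random.randn(vocab_size, hidden_nodes) * 0.01
review_indices = [12, 305, 9871]           # word2index positions of the words in one review

# instead of layer_0.dot(weights_0_1) over 74074 mostly-zero inputs,
# sum only the weight rows of the words that actually occur (each input value is 1)
layer_1 = np.zeros((1, hidden_nodes))
for index in review_indices:
    layer_1 += weights_0_1[index]

# the backward pass likewise only needs to touch the same rows, e.g.:
#     weights_0_1[index] -= layer_1_delta[0] * learning_rate
```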
import time
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], hidden_nodes=15, learning_rate=0.003)
Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):1871. #Correct:1974 #Trained:2501 Training Accuracy:78.9%
Progress:20.8% Speed(reviews/sec):1753. #Correct:4037 #Trained:5001 Training Accuracy:80.7%
Progress:31.2% Speed(reviews/sec):1741. #Correct:6163 #Trained:7501 Training Accuracy:82.1%
Progress:41.6% Speed(reviews/sec):1696. #Correct:8320 #Trained:10001 Training Accuracy:83.1%
Progress:52.0% Speed(reviews/sec):1692. #Correct:10490 #Trained:12501 Training Accuracy:83.9%
Progress:62.5% Speed(reviews/sec):1695. #Correct:12636 #Trained:15001 Training Accuracy:84.2%
Progress:72.9% Speed(reviews/sec):1696. #Correct:14773 #Trained:17501 Training Accuracy:84.4%
Progress:83.3% Speed(reviews/sec):1675. #Correct:16959 #Trained:20001 Training Accuracy:84.7%
Progress:93.7% Speed(reviews/sec):1678. #Correct:19150 #Trained:22501 Training Accuracy:85.1%
Progress:99.9% Speed(reviews/sec):1674. #Correct:20467 #Trained:24000 Training Accuracy:85.2%
mlp.test(reviews[-1000:], labels[-1000:])
Progress:99.9% Speed(reviews/sec):2611. #Correct:857 #Tested:1000 Testing Accuracy:85.7%
9. Further Reducing the Noise
# words that frequently appear in positive reviews
[('edie', 4.6913478822291435),
('antwone', 4.477336814478207),
('din', 4.406719247264253),
('gunga', 4.189654742026425),
('goldsworthy', 4.174387269895637),
('gypo', 4.0943445622221),
('yokai', 4.0943445622221),
('paulie', 4.07753744390572),
('visconti', 3.9318256327243257),
('flavia', 3.9318256327243257),
('blandings', 3.871201010907891),
('kells', 3.871201010907891),
('brashear', 3.8501476017100584),
...]
# words that frequently appear in negative reviews
[('whelk', -4.605170185988092),
('pressurized', -4.605170185988092),
('bellwood', -4.605170185988092),
('mwuhahahaa', -4.605170185988092),
('insulation', -4.605170185988092),
('hoodies', -4.605170185988092),
('yaks', -4.605170185988092),
('deamon', -4.605170185988092),
('ziller', -4.605170185988092),
('lagomorph', -4.605170185988092),
('marinaro', -4.605170185988092),
('accelerant', -4.605170185988092),
('yez', -4.605170185988092),
('superhu', -4.605170185988092),
('fastidiously', -4.605170185988092),
('spotlessly', -4.605170185988092),
('dahlink', -4.605170185988092),
('rebanished', -4.605170185988092),
('unmated', -4.605170185988092),
('wushu', -4.605170185988092),
('nix', -4.605170185988092),
('echance', -4.605170185988092),
('vannet', -4.605170185988092),
('hodet', -4.605170185988092),
('francie', -4.605170185988092),
('vivisects', -4.605170185988092),
('degeneration', -4.605170185988092),
('lowlight', -4.605170185988092),
('slackly', -4.605170185988092),
('unrurly', -4.605170185988092)]
from bokeh.models import ColumnDataSource, LabelSet
hist, edges = np.histogram(list(map(lambda x:x[1],
frequency_frequency = Counter()
hist, edges = np.histogram(list(map(lambda x:x[1],frequency_frequency.most_common())), density=True, bins=100)
import time
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000],
mlp.train(reviews[:-1000],labels[:-1000])
mlp.test(reviews[-1000:], labels[-1000:])
Progress:99.9% Speed(reviews/sec):5572. #Correct:845 #Tested:1000 Testing Accuracy:84.5%
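The constructor call above is truncated; judging by the mlp_full call further down, the class now accepts min_count and polarity_cutoff arguments. A hedged sketch of how such vocabulary filtering might look (the default values here are illustrative, and the exact cutoffs used for the 84.5% run above are not shown):

```python
def build_vocab(total_counts, pos_neg_ratios, min_count=10, polarity_cutoff=0.1):
    """Keep only words that are frequent enough and polar enough."""
    review_vocab = set()
    for word, count in total_counts.items():
        if count > min_count and abs(pos_neg_ratios[word]) >= polarity_cutoff:
            review_vocab.add(word)
    return review_vocab
```

With min_count=0 and polarity_cutoff=0, as in the mlp_full run below, no words are filtered out.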
1 | def get_most_similar_words(focus = "horrible"): |
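The function body is not shown. A plausible completion, which is an assumption, ranks every word by the dot product between its input-to-hidden weight vector and that of the focus word, using the mlp_full network defined in the next cell:

```python
import numpy as np
from collections import Counter

def get_most_similar_words(focus="horrible"):
    most_similar = Counter()
    focus_vector = mlp_full.weights_0_1[mlp_full.word2index[focus]]
    for word, index in mlp_full.word2index.items():
        # similarity = dot product of the two words' learned weight vectors
        most_similar[word] = np.dot(mlp_full.weights_0_1[index], focus_vector)
    return most_similar.most_common()
```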
mlp_full = SentimentNetwork(reviews[:-1000],labels[:-1000],min_count=0,polarity_cutoff=0,learning_rate=0.01)
mlp_full.train(reviews[:-1000],labels[:-1000])
Progress:0.0% Speed(reviews/sec):0.0 #Correct:1 #Trained:1 Training Accuracy:100.%
Progress:10.4% Speed(reviews/sec):1712. #Correct:1962 #Trained:2501 Training Accuracy:78.4%
Progress:20.8% Speed(reviews/sec):1646. #Correct:4002 #Trained:5001 Training Accuracy:80.0%
Progress:31.2% Speed(reviews/sec):1545. #Correct:6120 #Trained:7501 Training Accuracy:81.5%
Progress:41.6% Speed(reviews/sec):1571. #Correct:8271 #Trained:10001 Training Accuracy:82.7%
Progress:52.0% Speed(reviews/sec):1560. #Correct:10431 #Trained:12501 Training Accuracy:83.4%
Progress:62.5% Speed(reviews/sec):1573. #Correct:12565 #Trained:15001 Training Accuracy:83.7%
Progress:72.9% Speed(reviews/sec):1567. #Correct:14670 #Trained:17501 Training Accuracy:83.8%
Progress:83.3% Speed(reviews/sec):1531. #Correct:16833 #Trained:20001 Training Accuracy:84.1%
Progress:93.7% Speed(reviews/sec):1512. #Correct:19015 #Trained:22501 Training Accuracy:84.5%
Progress:99.9% Speed(reviews/sec):1504. #Correct:20335 #Trained:24000 Training Accuracy:84.7%
1 | # get_most_similar_words("excellent") |
1 | import matplotlib.colors as colors |
1 | pos = 0 |
1 | from sklearn.manifold import TSNE |
1 | p = figure(tools="pan,wheel_zoom,reset,save", |
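The plotting cells above are truncated. A hedged sketch of the kind of t-SNE projection they presumably build, embedding the input-to-hidden weight vectors of the most polar words into 2-D for the Bokeh scatter plot:

```python
import numpy as np
from sklearn.manifold import TSNE

# pick the most strongly positive and negative words (illustrative selection)
words_to_plot = [word for word, ratio in pos_neg_ratios.most_common()
                 if abs(ratio) > 1 and word in mlp_full.word2index]
vectors = np.array([mlp_full.weights_0_1[mlp_full.word2index[word]]
                    for word in words_to_plot])

# project the learned word vectors to 2-D for plotting
tsne = TSNE(n_components=2, random_state=0)
words_2d = tsne.fit_transform(vectors)
```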