Artificial Intelligence in Programming (Part 1)
_Star_Universe_ · Algorithms & Theory
Preface
This article covers some artificial intelligence basics in Python, touching only the shallowest layer of Bayesian methods. If you already have a solid AI background, this article is probably not for you; it is just one beginner's study notes. (A figure illustrating the working principle of a Bayes classifier originally appeared here.) We will set the theory aside for now and start with the simplest material. If you haven't learned the basic syntax yet, see here first.
Friendly reminder: reproduction of this article under any pretext is prohibited!
Multinomial Naive Bayes Classification and Prediction
To briefly summarize the shallow Bayesian workflow I have encountered: we let the model repeatedly read, learn from, and analyze data (for example, data gathered from the web), accumulate statistical experience, and then make an accurate classification or trend prediction for an instance it has never seen before.
Importing the libraries
from pre import *                  # course-specific module for standardization and normalization
from sklearn.naive_bayes import *  # sklearn's naive Bayes classes
Multinomial naive Bayes
c = MultinomialNB()  # "NB" is short for naive_bayes
Training the model
c.fit(x_train, y_train)  # note the . method call, not an = assignment
Evaluating the model
print(c.score(x_test, y_test))  # again a . method call
Making predictions
res = c.predict(t_word)  # note: only one argument here (features, no labels)
show(res)                # display the prediction results (a helper from the pre module)
The five data variables
x_train = x_train_get()  # training features
y_train = y_train_get()  # training labels
x_test = x_test_get()    # test features
y_test = y_test_get()    # test labels
t_word = t_word_get()    # features to predict on
The full code:
from pre import *
from sklearn.naive_bayes import *
x_train = x_train_get()
y_train = y_train_get()
x_test = x_test_get()
y_test = y_test_get()
t_word = t_word_get()
c = MultinomialNB()
c.fit(x_train,y_train)
print(c.score(x_test,y_test))
res = c.predict(t_word)
show(res)
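The pre module above is course-specific, so readers without it cannot run this template directly. Below is a minimal self-contained sketch of the same workflow using only standard sklearn; the toy texts, labels, and the CountVectorizer step are my own stand-ins for whatever pre provides.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
train_texts = ["good game", "bad lag", "great fun", "bad bug"]  # made-up data
train_labels = ["pos", "neg", "pos", "neg"]
cv = CountVectorizer()                    # turn text into word-count features
x_train = cv.fit_transform(train_texts)
c = MultinomialNB()
c.fit(x_train, train_labels)
print(c.predict(cv.transform(["bad game"])))  # likely ['neg']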
Analyzing Data with Gaussian Naive Bayes
Classification
Naive Bayes is a machine learning algorithm that computes the probability of each possible outcome of the same event and selects the most probable one as its answer.
Features
A feature is the most representative part of a piece of information, such as its size, sign, count, or keywords.
A Bayes classifier analyzes the features of the input and computes its answer from previously observed outcomes.
Gaussian naive Bayes
c = GaussianNB()
Gaussian naive Bayes is suited to continuous numeric data, while multinomial naive Bayes is better suited to discrete data such as text (word counts).
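As a quick illustration of the first case, here is a minimal sketch (with made-up numbers) of GaussianNB fitted on a single continuous feature:
import numpy as np
from sklearn.naive_bayes import GaussianNB
X = np.array([[1.0], [1.2], [3.8], [4.1]])  # one continuous feature
y = np.array([0, 0, 1, 1])
g = GaussianNB().fit(X, y)
print(g.predict([[1.1], [4.0]]))            # [0 1]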
Reading the spreadsheet
df = pd.read_excel('data2.xlsx')
The data-splitting function
data = train_test_split(df[['长度', '宽度', '步长', '步宽', '边缘是否完整', '型号']], df['性别'])
# the Chinese column names come from data2.xlsx: length, width, stride length, stride width, edge intact, model; the label column 性别 is sex
This is a random splitting function. Note in particular that it comes from the sklearn.model_selection library.
Storing the split data
x_train = data[0]
x_test = data[1]
y_train = data[2]
y_test = data[3]
# features (x) come before labels (y); train comes before test
Displaying probabilities
real = [[27.0,10.1250,76.950,17.5,0,2.0]]
pro = c.predict_proba(real)  # per-class probabilities
print(pro)
The full code:
from sklearn.naive_bayes import *
from sklearn.model_selection import *
import pandas as pd
c = GaussianNB()
df = pd.read_excel('data2.xlsx')
data = train_test_split(df[['长度', '宽度', '步长', '步宽', '边缘是否完整', '型号']], df['性别'])
# feature columns first, then the label column
x_train = data[0]
x_test = data[1]
y_train = data[2]
y_test = data[3]
c.fit(x_train, y_train)
real = [[27.0,10.1250,76.950,17.5,0,2.0]]
res = c.predict(real)
print(res)
pro = c.predict_proba(real)
print(pro)
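One detail worth knowing when reading the probabilities: the columns of predict_proba follow the order of the classifier's classes_ attribute, which is standard sklearn behavior.
print(c.classes_)  # the i-th column of predict_proba is the probability of classes_[i]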
Data Preprocessing
A single Bayes model cannot consume numeric and text information in their raw mixed form: sklearn estimators expect a purely numeric array. That is what data preprocessing is for.
The \texttt{numpy} library
import numpy as np
Reading the data table
df = pd.read_csv("adult.csv", header = None)
# side note: see the Smart Visualization section for a detailed discussion of pivot tables
The \texttt{empty} function
data = np.empty((h,l))    # h rows and l columns; note the . between np and empty, and the two pairs of parentheses
data = np.empty(df.shape) # an equivalent way to get the same shape as the table
The preprocessing loop
LE = []  # one encoder per column (None for numeric columns)
for i in df.columns:
    encoder = None
    if df[i].dtype == object:  # string column
        encoder = LabelEncoder()
        data[:, i] = encoder.fit_transform(df[i])
    else:                      # numeric column
        data[:, i] = df[i]
    LE.append(encoder)
String columns are label-encoded into integers; numeric columns are copied over unchanged.
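If LabelEncoder is new to you, here is a tiny standalone sketch (with made-up color labels) showing what fit_transform produces and how to map the codes back:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
codes = encoder.fit_transform(["red", "green", "red", "blue"])
print(codes)                              # [2 1 2 0]; classes are sorted alphabetically
print(encoder.classes_)                   # ['blue' 'green' 'red']
print(encoder.inverse_transform([0, 2]))  # ['blue' 'red']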
Splitting the data
split_data = train_test_split(data[:, :-1], data[:, -1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]
train_test_split returns four pieces of data, in order: training features, test features, training labels, and test labels.
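The same split is more commonly written with tuple unpacking, which makes that order explicit; the test_size below just spells out sklearn's default of 0.25:
x_train, x_test, y_train, y_test = train_test_split(data[:, :-1], data[:, -1], test_size = 0.25)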
Input and assembly
info1 = int(input("Enter the person's age: "))
info2 = input("Enter the person's type of work: ")
info3 = input("Enter the person's education level: ")
info4 = input("Enter the person's sex: ")
t_word[0] = info1        # column 0 of adult.csv is age
t_word[1] = ' ' + info2  # column 1 is workclass; string values in adult.csv carry a leading space
t_word[3] = ' ' + info3  # column 3 is education
t_word[9] = ' ' + info4  # column 9 is sex
t_word = pd.DataFrame(t_word)
print(t_word.shape)
Pay attention to the pd.DataFrame call: the model needs tabular input, and printing t_word.shape lets you confirm the table has the expected number of columns.
Template code: predicting annual income
import pandas as pd
import numpy as np
from sklearn.naive_bayes import *
from sklearn.model_selection import *
from sklearn.preprocessing import *
df = pd.read_csv("adult.csv", header = None)
LE = []  # one encoder per column (None for numeric columns)
data = np.empty(df.shape)
for i in df.columns:
    encoder = None
    if df[i].dtype == object:
        encoder = LabelEncoder()
        data[:, i] = encoder.fit_transform(df[i])
    else:
        data[:, i] = df[i]
    LE.append(encoder)
split_data = train_test_split(data[:, :-1], data[:,-1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]
c = GaussianNB()
c.fit(x_train, y_train)
print(c.score(x_test, y_test))
t_word = pd.read_csv("pred.csv", header = None)
info1 = int(input("Enter the person's age: "))
info2 = input("Enter the person's type of work: ")
info3 = input("Enter the person's education level: ")
info4 = input("Enter the person's sex: ")
t_word[0] = info1        # column 0 of adult.csv is age
t_word[1] = ' ' + info2  # column 1 is workclass; values carry a leading space
t_word[3] = ' ' + info3  # column 3 is education
t_word[9] = ' ' + info4  # column 9 is sex
t_word = pd.DataFrame(t_word)
print(t_word.shape)
t_word2 = np.empty(t_word.shape)
for i in t_word.columns:
    encoder = None
    if t_word[i].dtype == object:
        encoder = LE[i]
        t_word2[:, i] = encoder.transform(t_word[i])
    else:
        t_word2[:, i] = t_word[i]
res = c.predict(t_word2)
print(res)
if res[0] == 1:  # with LabelEncoder's alphabetical order, 1 corresponds to ' >50K'
    print("annual income > 50K")
else:
    print("annual income <= 50K")
Now we have the full problem-solving workflow. (A flowchart of that workflow originally appeared here.)
The preprocessing stage can be subdivided further into labeling, merging the data into a table, encoding, and so on; we omit the details here.
Feature Extraction
The \texttt{jieba} library
# tokenize the two source texts
with open("libai.txt", "r", encoding = 'utf-8') as txt:
    text1 = txt.read()
list1 = jieba.lcut(text1)  # poems by Li Bai
with open("dufu.txt", "r", encoding = 'utf-8') as txt:
    text2 = txt.read()
list2 = jieba.lcut(text2)  # poems by Du Fu
jieba can also be used to generate word clouds. Its core job is cutting sentences into words; counting which words appear most often then surfaces the key points of a passage.
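To see what segmentation actually produces, here is a tiny sketch (the input sentence is made up) combining jieba.lcut with a frequency count:
import jieba
from collections import Counter
tokens = jieba.lcut("我爱自然语言处理,自然语言处理很有趣")
print(tokens)                          # the list of segmented words
print(Counter(tokens).most_common(3))  # the three most frequent tokens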
Labeling
# attach a label to every token from the poems
lb_features = []
for lb in list1:
    lb_features.append((lb, "lb"))  # lb = Li Bai
df_features = []
for df in list2:
    df_features.append((df, "df"))  # df = Du Fu
A list can hold all kinds of structures, tuples included; note the two pairs of parentheses: the inner pair builds the (token, label) tuple!
The result looks like this:
lb_features = [('token 1', 'lb'), ('token 2', 'lb'), ..., ('token n', 'lb')]
df_features = [('token 1', 'df'), ('token 2', 'df'), ..., ('token n', 'df')]
Merging the data
# concatenate the two lists
words = lb_features + df_features
# convert to a table
df = pd.DataFrame(words)
The \texttt{TfidfVectorizer} encoder
tv = TfidfVectorizer()               # create the encoder
x_train = tv.fit_transform(x_train)  # learn the vocabulary from the training set and encode it
x_test = tv.transform(x_test)        # reuse the same vocabulary for the test set
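The asymmetry above matters: fit_transform learns the vocabulary and IDF weights from the training set, while transform only reuses them, so both sets live in the same feature space. On sklearn 1.0 or later you can inspect the learned vocabulary:
print(len(tv.get_feature_names_out()))  # size of the learned vocabulary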
Template code: telling Lu You's and Su Shi's poems apart
import pandas as pd
import numpy as np
import jieba
from sklearn.naive_bayes import *
from sklearn.model_selection import *
from sklearn.feature_extraction.text import *
with open("sushi.txt", "r", encoding = 'utf-8') as txt:
text1 = txt.read()
list1 = jieba.lcut(text1)
with open("luyou.txt", "r", encoding = 'utf-8') as txt:
text2 = txt.read()
list2 = jieba.lcut(text2)
ss_features = []
for ss in list1:
ss_features.append((ss,"ss")) # 苏轼
ly_features = []
for ly in list2:
ly_features.append((ly,"ly")) # 陆游
words = ss_features + ly_features
df = pd.DataFrame(words)
split_data = train_test_split(df[0], df[1], random_state = 3)
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]
tv = TfidfVectorizer()
x_train = tv.fit_transform(x_train)
x_test = tv.transform(x_test)
clt = MultinomialNB()
clt.fit(x_train, y_train)
print(clt.score(x_test, y_test))
poem = "两两归鸿欲破群"
poem_features = jieba.lcut(poem)
print(poem_features)
df2 = pd.DataFrame(poem_features)
t_word = tv.transform(df2[0])
res = clt.predict(t_word)
print(res)
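Note that res holds one predicted author per token of the input line. A simple way, sketched below, to turn those per-token votes into a single verdict is a majority vote:
from collections import Counter
print(Counter(res).most_common(1)[0][0])  # 'ss' or 'ly', whichever wins the vote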
Smart Visualization
Pivot tables
A pivot table is a tool for analyzing and summarizing multidimensional data, with powerful filtering, grouping, aggregation, and display features; it is one of the most useful data analysis tools in Excel. In office automation, pandas' pivot_table() with a suitable aggregation function reproduces the pivot table's power in Python and makes the statistical analysis faster and more convenient.
Creating and using a pivot table
df = pd.read_csv("train.csv")
pivot = pd.pivot_table(df,index = ["Sex",...], values = ["Survived"])
print(pivot)
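Since train.csv may not be at hand, here is a self-contained sketch of the same call on made-up data; mean is pivot_table's default aggregation function, so this shows survival rates per sex:
import pandas as pd
toy = pd.DataFrame({"Sex": ["m", "f", "m", "f"], "Survived": [0, 1, 1, 1]})
print(pd.pivot_table(toy, index = ["Sex"], values = ["Survived"]))  # f: 1.0, m: 0.5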
The \texttt{head} and \texttt{tail} methods of a pivot table
Note that a pivot table is, underneath, just a DataFrame, so it has head and tail methods (the argument sets how many rows from the top or bottom to show).
print(pivot.head(20))
print(pivot.tail(20))
The \texttt{drop} function for deleting data
Sometimes a table contains junk columns it shouldn't have; df.drop() removes them.
df = df.drop(['Name', 'Ticket', ...], axis = 1)  # axis = 1 means drop columns
Predicting a Titanic passenger's chance of survival
import pandas as pd
import numpy as np
from sklearn.naive_bayes import *
from sklearn.model_selection import *
from sklearn.preprocessing import *
df = pd.read_csv("train.csv")
print(df.shape)
print(df)
print(df.columns)
pivot = pd.pivot_table(df, index = ['Age'], values = ['Survived'], margins = True)
print(pivot.head(20))
print(pivot.tail(20))
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)
# note: in the raw Kaggle data, Age has missing values that GaussianNB cannot
# handle; if your CSV is uncleaned you may need df = df.dropna() here first
LE = []
data = np.empty(df.shape)
j = 0
for i in df.columns:
    encoder = None
    if df[i].dtype == object:
        encoder = LabelEncoder()
        data[:, j] = encoder.fit_transform(df[i].astype(str))
        j = j + 1
    else:
        data[:, j] = df[i]
        j = j + 1
    LE.append(encoder)
# this takes the table's last column as the label; make sure the Survived
# column really is last in your CSV (in the standard Kaggle layout it sits
# near the front, so you may need to reorder or select it explicitly)
split_data = train_test_split(data[:, :-1], data[:, -1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]
gaussianNB = GaussianNB()
gaussianNB.fit(x_train, y_train)
print(gaussianNB.score(x_test, y_test))
t_word = pd.read_csv("pred.csv")
t_word = t_word.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)
info1 = int(input("Enter your cabin class: "))
info2 = input("Enter your sex: ")
info3 = input("Enter your age: ")
info4 = input("Enter your ticket fare: ")
t_word['Pclass'] = int(info1)
t_word['Sex'] = info2
t_word['Age'] = int(info3)
t_word['Fare'] = float(info4)
t_word2 = np.empty(t_word.shape)
j = 0
for i in t_word.columns:
    encoder = None
    if t_word[i].dtype == object:
        encoder = LE[j]
        t_word2[:, j] = encoder.transform(t_word[i].astype(str))
        j = j + 1
    else:
        t_word2[:, j] = t_word[i]
        j = j + 1
print(gaussianNB.predict(t_word2))
Hands-on Practice 1
After all that, let's pick a very common real-life problem and analyze it together!
Keywords
Within a whole passage, only the feature words (keywords) carry the key information; everything else is noise to the classifier.
Labeling
Here we merge the data into one list while attaching its own label to every item, as the snippet below shows.
lb_features = []
for good in list1:
    lb_features.append((good, "good"))
df_features = []
for bad in list2:
    df_features.append((bad, "bad"))
Converting to a table
# concatenate
words = lb_features + df_features
# convert to a table
df = pd.DataFrame(words)
Our adorable baby model cannot analyze data sitting in a plain list, so we convert the list into a table (DataFrame).
Training features, training labels, test features, test labels
split_data = train_test_split(df[0], df[1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]
This is the key step; be sure to memorize the order.
Choosing and training a suitable model
c = MultinomialNB()
c.fit(x_train, y_train)  # x_train here is the TF-IDF-encoded matrix from the previous step
print(c.score(x_test, y_test))
Note that multinomial naive Bayes is the right choice here, since the features are text.
Template code: deciding whether a comment is abusive
import pandas as pd
import numpy as np
import jieba
from sklearn.naive_bayes import *
from sklearn.model_selection import *
from sklearn.feature_extraction.text import *
with open("good.txt", "r", encoding = 'utf-8') as txt:
text1 = txt.read()
list1 = jieba.lcut(text1)
with open("bad.txt", "r", encoding = 'utf-8') as txt:
text2 = txt.read()
list2 = jieba.lcut(text2)
lb_features = []
for lb in list1:
lb_features.append((lb, "good"))
df_features = []
for df in list2:
df_features.append((df, "bad"))
words = lb_features + df_features
df = pd.DataFrame(words)
split_data = train_test_split(df[0], df[1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]
tv = TfidfVectorizer()
x_train = tv.fit_transform(x_train)
x_test = tv.transform(x_test)
c = MultinomialNB()
c.fit(x_train,y_train)
print(c.score(x_test, y_test))
poem = "这么丑也好意思出来"
poem_features = jieba.lcut(poem)
print(poem_features)
df2 = pd.DataFrame(poem_features)
t_word = tv.transform(df2[0])
res = c.predict(t_word)
print(res)
sum = 0
for i in res:
    if i == "bad":
        sum = sum + 1
if sum / len(res) >= 0.2:  # flag as abusive if at least 20% of the tokens look "bad"
    print("abusive comment")
else:
    print("normal comment")
Hands-on Practice 2
This part gives the code for one example problem and leaves the analysis to you. It is for reference only and not recommended for real-world use.
Template code: crime prediction
import pandas as pd
import numpy as np
from sklearn.naive_bayes import *
from sklearn.model_selection import *
from sklearn.preprocessing import *
from sklearn.metrics import *
df = pd.read_csv("train2.csv")
print(df.shape)
print(df)
print(df.columns)
pivot = pd.pivot_table(df, index = ['Resolution'], values = ['Crime'])
print(pivot)
df = df.drop(['X', 'Y', 'Category'], axis = 1)
label_encoder = []
data = np.empty(df.shape)
j = 0
for i in df.columns:
    encoder = None
    if df[i].dtype == object:
        encoder = LabelEncoder()
        data[:, j] = encoder.fit_transform(df[i])
        j = j + 1
    else:
        data[:, j] = df[i]
        j = j + 1
    label_encoder.append(encoder)
split_data = train_test_split(data[:, :-1], data[:,-1])
x_train = split_data[0]
x_test = split_data[1]
y_train = split_data[2]
y_test = split_data[3]
c = GaussianNB()
c.fit(x_train,y_train)
print(c.score(x_test, y_test))
real = pd.read_csv("test.csv")
real = real.drop(['X', 'Y', 'Category'], axis = 1)
real2 = np.empty(real.shape)
j = 0
for i in real.columns:
    encoder = None
    if real[i].dtype == object:
        encoder = label_encoder[j]
        real2[:, j] = encoder.transform(real[i])
        j = j + 1
    else:
        real2[:, j] = real[i]
        j = j + 1
print(c.predict(real2))
The last topic touches on decision trees. This article only skims it, covering just the most basic concepts.
Getting to Know Decision Trees
Decision trees
A decision tree model ranks the available factors by how informative they are, keeping the important ones and discarding the dispensable ones; the ranking criterion is entropy.
Entropy
Entropy measures the uncertainty of an event: the more complex the factors influencing it, the more uncertain the outcome and the larger the entropy; simpler, more determined events have lower entropy. It can be computed with the formula $H(X) = -\sum_i p_i \log_2 p_i$, where $p_i$ is the probability of the $i$-th possible outcome.
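As a small sketch of that formula, the function below computes the entropy of a list of labels (the example labels are made up):
import math
from collections import Counter
def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())
print(entropy(["yes", "yes", "no", "no"]))    # 1.0, maximally uncertain
print(entropy(["yes", "yes", "yes", "yes"]))  # -0.0, fully certain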
The decision tree model
# a decision tree classifier that splits on entropy
c = DecisionTreeClassifier(criterion = "entropy")
The classic workflow
A decision tree is also a machine learning model, and its problem-solving workflow closely mirrors the Bayes workflow above.
Drawing a decision tree in four steps
Import the \texttt{pydot} library
import pydot
Generate the dot file
# use export_graphviz to write a file named tree.dot
export_graphviz(c, out_file = "tree.dot")
Draw the decision tree
# the magic line: parse tree.dot into a graph object
(graph,) = pydot.graph_from_dot_file("tree.dot")
Export the image
# write the graph out as a PNG
graph.write_png("tree.png")
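If installing Graphviz is inconvenient, here is a sketch of an alternative: sklearn (0.21 and later) can draw a fitted tree directly with matplotlib, skipping the dot file entirely.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plot_tree(c)             # draw the fitted classifier
plt.savefig("tree.png")  # save the figure as a PNG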
The full code:
import pandas as pd
import numpy as np
from sklearn.model_selection import *
from sklearn.preprocessing import *
from sklearn.tree import *
import pydot
import os
os.environ['PATH'] = os.environ['PATH'] + (';' + os.getcwd() + '\\bin')  # append the local Graphviz bin folder to PATH (Windows-style separator)
df = pd.read_excel("friend.xlsx")
print(df.columns)
print(df.head(10))
label_encoder = []
data = np.empty(df.shape)
j = 0
for i in df.columns:
    encoder = None
    if df[i].dtype == object:
        encoder = LabelEncoder()
        data[:, j] = encoder.fit_transform(df[i])
        j = j + 1
    else:
        data[:, j] = df[i]
        j = j + 1
    label_encoder.append(encoder)
data = train_test_split(data[:,0:7], data[:,7])
x_train = data[0]
x_test = data[1]
y_train = data[2]
y_test = data[3]
c = DecisionTreeClassifier(criterion = 'entropy')
c.fit(x_train, y_train)
print(c.score(x_test, y_test))
print(c.predict(x_test))
export_graphviz(c, out_file = 'tree.dot')
(graph, ) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
There is still some material left uncovered. For various reasons this article will not be updated further, so keep an eye out for my next installment!