Trying Out GPT-2

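GPT-2 is the legendary natural language generator with 1.5 billion parameters, the second strongest on Earth (second because GPT-3 exists). So how do you actually use it? (Admittedly it is not all that "fresh"... this thing is already more than a year old (っ °Д °;)っ)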

Trying the official demo

First, download OpenAI's open-source code with git.

git clone https://github.com/openai/gpt-2.git && cd gpt-2

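Next, set up the environment. There are a lot of pitfalls here, and I ran into all kinds of problems. The steps below are what worked for me on Windows 10 x64 20H2 with a Ryzen 2700X and an Nvidia RTX 3070. We will need pip and Anaconda; please install them yourself. For how to use conda virtual environments, see this post of mine. For the pip source and the conda source, I recommend the Tsinghua University mirrors, which are faster and more stable than the default mirrors hosted overseas. The Tsinghua Anaconda mirror is also officially authorized by Anaconda.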

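First, create a virtual environment with conda. Mind the Python version: too new (3.8.5, for example) does not work, and too old (3.6.8, for example) does not work either.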

conda create -n gpt2 python=3.6.12

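Then activate the environment (conda activate gpt2), cd into the gpt-2 directory (the one git clone created), and install the required packages with pip.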

pip install -r requirements.txt

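Annoyingly, some of the package versions that requirements.txt installs are wrong, which causes all sorts of errors at runtime. So we have to install the correct versions by hand. Run these in order: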

conda install tensorflow=1.13.1
pip install h5py==2.7.0
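Since wrong versions are exactly what bit me, it can help to double-check what actually ended up in the environment. Below is a minimal sketch of such a check. The function names and the EXPECTED table are my own (hypothetical), and it assumes the packages expose standard metadata, which may not hold for every conda build.

```python
def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        from importlib import metadata  # available from Python 3.8 on
    except ImportError:
        metadata = None
    if metadata is not None:
        try:
            return metadata.version(package)
        except metadata.PackageNotFoundError:
            return None
    import pkg_resources  # fallback for Python 3.6/3.7, ships with setuptools
    try:
        return pkg_resources.get_distribution(package).version
    except pkg_resources.DistributionNotFound:
        return None


# Versions taken from the install commands above.
EXPECTED = {"tensorflow": "1.13.1", "h5py": "2.7.0"}


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose version differs to (wanted, found)."""
    problems = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            problems[package] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print("%s: wanted %s, found %s" % (name, wanted, found))
```

An empty result means both packages match; anything printed is worth fixing before going further.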

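The files we cloned do not include the model weights, so those have to be downloaded separately. Even through a proxy the download is slow and tends to drop midway, so... just be patient. My average download speed was about 1.3 MiB/s. There are four model sizes: the smallest one, with 124 million parameters, is only a bit over 400 MB, while the largest one, with 1558 million parameters, is close to 6 GB.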

python download_model.py 124M
python download_model.py 355M
python download_model.py 774M
python download_model.py 1558M
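Because the download likes to die halfway, I find it useful to check that a model folder is complete before blaming anything else. A rough sketch follows. The file list is what I believe download_model.py fetches into models/<model_name>, so treat it as an assumption and compare it against the script; the function name is hypothetical.

```python
import os

# Assumed list of files per model folder; verify against download_model.py.
EXPECTED_FILES = [
    "checkpoint",
    "encoder.json",
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe",
]


def missing_model_files(model_name, models_dir="models"):
    """Return the expected files that are absent or empty for one model."""
    folder = os.path.join(models_dir, model_name)
    missing = []
    for name in EXPECTED_FILES:
        path = os.path.join(folder, name)
        # A zero-byte file usually means the download was cut off at the start.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for model in ("124M", "355M", "774M", "1558M"):
        print(model, missing_model_files(model) or "looks complete")
```

This only catches missing or empty files. A file that was truncated partway through would still pass, so a size comparison would be needed to catch that.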

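Once the download finishes, you can try running it! Before that, open generate_unconditional_samples.py in the src directory and add the following code at the top of the file. It switches the default output encoding to UTF-8; without it the script raises an error.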

import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8')
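To see what this wrapper does, here is a small standalone sketch, not part of the repo. It writes a character through a TextIOWrapper into an in-memory buffer and looks at the bytes that come out. The helper name is made up, and the GBK step is my assumption about how a Chinese-locale Windows console might read those bytes.

```python
import io


def bytes_written(text, encoding="utf8"):
    """Return the bytes a TextIOWrapper emits for `text` (hypothetical helper)."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding=encoding)
    wrapper.write(text)
    wrapper.flush()   # push the buffered text down into the byte buffer
    data = raw.getvalue()
    wrapper.detach()  # keep `raw` from being closed together with the wrapper
    return data


utf8_bytes = bytes_written("£")       # b'\xc2\xa3', the UTF-8 form of the pound sign
misread = utf8_bytes.decode("gbk")    # the same two bytes read as GBK give '拢'
```

So the wrapper makes sure every character GPT-2 produces can be written out, but whatever reads the bytes afterwards still has to treat them as UTF-8, or characters like the pound sign come out garbled.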

Random output

Now let's run it for real!

Open PowerShell in the gpt-2 directory and run

python src/generate_unconditional_samples.py --top_k 40 --temperature 0.7 | tee tmp/samples.txt  # tmp/samples.txt is where the output gets saved; create the tmp directory (and samples.txt) under gpt-2 by hand

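A whole pile of WARNINGs shows up, mostly saying that some feature will be removed in a future version; ignore them. Be patient, because this runs on the CPU and is slow. With a bit of luck you will see output. With these parameters it keeps generating forever, until you press Ctrl+C to stop it.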
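If an endless run is not what you want, the script's nsamples parameter (0 means generate forever, according to its docstring) can be set to a positive number. Here is a minimal sketch that assembles such a command line from keyword arguments. build_command is a hypothetical helper of mine, and it assumes python on the PATH is the one from the gpt2 environment.

```python
def build_command(script, **params):
    """Turn keyword arguments into the `--key value` form the script accepts."""
    command = ["python", script]
    for key, value in params.items():
        command += ["--" + key, str(value)]
    return command


cmd = build_command(
    "src/generate_unconditional_samples.py",
    top_k=40,
    temperature=0.7,
    nsamples=2,  # stop after two samples instead of running until Ctrl+C
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) from the gpt-2 directory, or the printed line can be pasted into the shell.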

(gpt2) PS C:\dev\nlp\GPT-2\gpt-2> python src/generate_unconditional_samples.py --top_k 40 --temperature 0.7 | tee tmp/samples.txt
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
2020-12-31 21:09:43.732526: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
WARNING:tensorflow:From C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From C:\dev\nlp\GPT-2\gpt-2\src\sample.py:64: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From C:\dev\nlp\GPT-2\gpt-2\src\sample.py:67: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\training\saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
tf.Session……
======================================== SAMPLE 1 ========================================
The American Psychiatric Association published a statement on Thursday warning people about the dangers of marijuana, saying marijuana is a "danger" to their health and that it is harmful to their health and the rest of society.
"The problem with the current state of marijuana use is that it has become almost impossible to prevent it from being used for medical purposes, even for a single day," the statement reads. "In short, marijuana use is rapidly becoming more and more common. The dangers of marijuana use are growing, and, more often than not, the symptoms are not symptoms. So what can you do to prevent them? Not much."
SPONSORED
The statement, released by the American Psychiatric Association in a statement on Thursday, followed a report by the New York Times that has raised questions about the health consequences of smoking marijuana.
"The American Psychiatric Association's recent report on the risks of marijuana use has raised serious doubts about the health benefits of marijuana," said a statement posted online by the publication. "While the organization has not yet reported any new studies that have examined the potential health risks of marijuana use, it has begun to look at a number of questions about the health implications of marijuana use. For example, it has been found that marijuana use can lead to a variety of health problems including increased risk of heart disease, obesity, diabetes, and cancer, among other health problems. The report also found that people who are not using marijuana at a high level have a higher risk of acquiring cancer than those who are using it regularly. Moreover, people who are using marijuana at a high dose are at increased risk for heart disease, depression, and suicide.
"The American Psychiatric Association has long recognized the dangers of using marijuana for medical purposes, including for therapeutic purposes, and has taken a number of steps to prevent marijuana use from being used for any medical purpose. The organization believes that the risks of using marijuana for therapeutic purposes are greater than the risks of smoking it and that people who use marijuana at a high level are at increased risk for any medical purpose."
The group said in a statement that it made the diagnosis of marijuana dependence at the age of 13, following an initial investigation into a family history of mental disorders.
"We are concerned that the medical marijuana used for medical purposes may contain elements of marijuana's psychoactive properties that may cause people to become dependent on the drug," the statement reads. "We know that there are serious risks associated with the use of recreational marijuana, as well as serious risks associated with other marijuana-related medical uses, such as those associated with the use of marijuana for the treatment or prevention of certain diseases."
The group said that it has been following the case closely and has been able to obtain a detailed analysis of the data and have concluded the reports are "unacceptable. We encourage people to take the time to learn more about these medical issues and to keep it under their control."
Watch the video below from Fox News, broadcast May 21, 2015.<|endoftext|>Mortgage lender Credit Suisse is offering a 20% discount on its loans to borrowers who have already paid their principal, which is up to 10% of their monthly mortgage payment.
The offer, which was announced by the Bank of England, is meant for those who want to avoid a default or who want to meet their personal financial needs in the future and are looking to get a more consistent loan from a lender that can meet their needs.
Credit Suisse's offer comes on top of a £3bn loan it has already made to borrowers who paid their principal at a higher rate of interest.
The money was set at £1bn after the lender announced it had given an additional £1bn for the first time since January 2014.
The Bank of England revealed that it has made £8.6bn from the offer at the start of the year and will make £4.1bn from the first four months.
Credit Suisse said that it was also looking to make more from the £2.8bn it has made in the next two years.
It also said that it was looking to make more money by increasing its total assets by £3.5bn.
Credit Suisse said: "We are pleased that we have agreed to extend this offer to 1m borrowers in the first half of this year.
"We are pleased that, with our focus on delivering the best possible customer experience, we have agreed to extend our offer to 1m borrowers in the first half of this year.
"We are now committed to improving our business and our clients' experience, and we look forward to working with the Bank of England on our next step."
Credit Suisse said that the average monthly loan in England was £6,400 and that it was the highest it had ever been.
Credit Suisse said that borrowers who had already paid their principal at a higher rate of interest or who wanted to meet their personal financial needs by paying their principal at
======================================== SAMPLE 2 ========================================

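By default the smallest model, the one with 124M parameters, is used. As you can see, even the small model already produces fairly decent text. Using a larger model improves the quality of the generated text. Here is the sample run through DeepL (translated into Chinese):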

美国精神病学协会周四发表声明,警告人们注意大麻的危害,称大麻对他们的健康是一种 "危险",它对他们的健康和社会其他方面都是有害的。
"大麻使用现状的问题是,几乎已经不可能阻止它被用于医疗目的,哪怕是一天。"声明中写道。"总之,大麻的使用正在迅速变得越来越普遍。吸食大麻的危害越来越大,而且,更多的时候都是症状不明显。那么,你能做什么来预防它们呢?不多。"
赞助
美国精神病学协会周四在一份声明中发布了这一声明,此前《纽约时报》的一份报告对吸食大麻的健康后果提出了质疑。
"美国精神病学协会最近关于使用大麻的风险的报告引起了人们对大麻的健康益处的严重怀疑,"该出版物在网上发布的一份声明说。"虽然该组织还没有报告任何新的研究,检查了使用大麻的潜在健康风险,但它已经开始研究使用大麻对健康影响的一些问题。例如,研究发现,使用大麻会导致各种健康问题,包括增加心脏病、肥胖、糖尿病和癌症等健康问题的风险。报告还发现,使用大麻程度不高的人比经常使用大麻的人患癌症的风险更高。此外,高剂量使用大麻的人患心脏病、抑郁症和自杀的风险也会增加。
"美国精神病学协会早已认识到将大麻用于医疗目的(包括治疗目的)的危险性,并采取了一系列措施来防止将大麻用于任何医疗目的。该组织认为,将大麻用于治疗目的的风险比吸食大麻的风险更大,高水平使用大麻的人在任何医疗目的上的风险都会增加。"
该组织在一份声明中表示,在对精神障碍家族史进行初步调查后,该组织在13岁时做出了大麻依赖的诊断。
"我们担心用于医疗目的的医用大麻可能含有大麻的精神活性成分,可能会导致人们对该药物产生依赖性,"声明中写道。"我们知道,使用娱乐大麻存在严重风险,其他与大麻相关的医疗用途也存在严重风险,例如使用大麻治疗或预防某些疾病的相关用途。"
该组织表示,该组织一直在密切关注此案,并获得了详细的数据分析,并认为这些报道 "不可接受"。我们鼓励人们花时间多了解这些医疗问题,并将其控制在自己的范围内。"
观看下面的视频,来自福克斯新闻,2015年5月21日播出.抵押贷款机构瑞士信贷正在为已经支付本金的借款人提供20%的贷款折扣,这相当于每月抵押贷款付款的10%。
英国央行宣布的这一优惠是为了那些想要避免违约或希望在未来满足个人财务需求,并希望从贷款机构获得更稳定的贷款的人。
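The text saved through tee has a regular shape: each sample starts with a "==== SAMPLE n ====" header line, and inside a sample the model separates unrelated documents with <|endoftext|>. A small sketch for cutting a saved log into pieces follows. It assumes the header lines look exactly like the ones in the log above, and split_samples is a name I made up.

```python
import re

# Header lines such as "======== SAMPLE 1 ========" on a line of their own.
HEADER = re.compile(r"^=+ SAMPLE (\d+) =+[ \t]*\r?$", re.MULTILINE)
END_OF_TEXT = "<|endoftext|>"


def split_samples(text):
    """Return {sample number: [documents]} for a saved generation log."""
    parts = HEADER.split(text)
    # parts = [text before the first header, "1", body 1, "2", body 2, ...]
    samples = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        documents = [doc.strip() for doc in body.split(END_OF_TEXT)]
        samples[int(number)] = [doc for doc in documents if doc]
    return samples


if __name__ == "__main__":
    rule = "=" * 40
    demo = (
        "some warning line\n"
        + rule + " SAMPLE 1 " + rule + "\n"
        + "first article" + END_OF_TEXT + "second article\n"
        + rule + " SAMPLE 2 " + rule + "\n"
    )
    print(split_samples(demo))
```

Anything before the first header (the warnings, for instance) is dropped. When reading the real file, open it with whatever encoding the shell actually wrote it in, which is not necessarily UTF-8.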

You can tune the parameters yourself, using the form --key value. Open gpt-2\src\generate_unconditional_samples.py and you will see

def sample_model(
    model_name='124M',
    seed=None,
    nsamples=0,
    batch_size=1,
    length=None,
    temperature=1,
    top_k=0,
    top_p=1,
    models_dir='models',
):

These formal parameters are exactly what we can adjust. Parameter descriptions:

    Run the sample_model
    :model_name=124M : String, which model to use
    :seed=None : Integer seed for random number generators, fix seed to
     reproduce results
    :nsamples=0 : Number of samples to return, if 0, continues to
     generate samples indefinitely.
    :batch_size=1 : Number of batches (only affects speed/memory).
    :length=None : Number of tokens in generated text, if None (default), is
     determined by model hyperparameters
    :temperature=1 : Float value controlling randomness in boltzmann
     distribution. Lower temperature results in less random completions. As the
     temperature approaches zero, the model will become deterministic and
     repetitive. Higher temperature results in more random completions.
    :top_k=0 : Integer value controlling diversity. 1 means only 1 word is
     considered for each step (token), resulting in deterministic completions,
     while 40 means 40 words are considered at each step. 0 (default) is a
     special setting meaning no restrictions. 40 generally is a good value.
     :models_dir : path to parent folder containing model subfolders
     (i.e. contains the <model_name> folder)
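To get a feel for what temperature and top_k do, here is a toy sketch in plain Python. It follows the description in the docstring above (divide the scores by the temperature, keep only the top_k best candidates, then turn the rest into probabilities). It is not the repo's TensorFlow code; the function names and the example scores are invented.

```python
import math
import random


def next_token_probs(logits, temperature=1.0, top_k=0):
    """Probabilities over candidates after temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    if top_k > 0:
        # Keep the top_k highest scores (ties with the k-th survive); mask the rest.
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # masked entries become 0.0
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(probs, rng=random):
    """Draw one candidate index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
    print(next_token_probs(scores))                   # baseline
    print(next_token_probs(scores, temperature=0.5))  # sharper, less random
    print(next_token_probs(scores, top_k=1))          # only the best candidate is left
    print(sample_token(next_token_probs(scores, top_k=2), random.Random(0)))
```

With top_k=1 the distribution collapses onto a single candidate, which is why the docstring calls that setting deterministic, and lowering the temperature moves the distribution in the same direction more gently.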

Continuing a prompt

Besides generating text on random topics as above, you can also supply an opening and let GPT-2 continue it. Run:

python src/interactive_conditional_samples.py --top_k 40

Again there is a pile of WARNINGs. They are only WARNINGs, so ignore them. ERRORs are the ones that matter.

(gpt2) PS C:\dev\nlp\GPT-2\gpt-2> python src/interactive_conditional_samples.py --top_k 40
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
2020-12-31 21:46:08.787283: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
WARNING:tensorflow:From C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From C:\dev\nlp\GPT-2\gpt-2\src\sample.py:64: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From C:\dev\nlp\GPT-2\gpt-2\src\sample.py:67: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From C:\Users\John Smith\.conda\envs\gpt2\lib\site-packages\tensorflow\python\training\saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Model prompt >>>

The difference from before is that this time you get an input prompt asking for the opening. For example, I typed: Trump has to leave the White House

GPT-2's continuation:

Model prompt >>> Trump has to leave the White House
======================================== SAMPLE 1 ========================================
, as he is required by federal law to do so. The Justice Department announced Tuesday that it would allow for his resignation.
The White House was preparing to announce that it is cutting all staff from two agencies, and has ordered a spokesman to step down from any of them until an end date. In the interim, Obama has been running amok in order to hold up a White House that has been struggling to operate for several years.
Meanwhile, Trump issued two tweets Tuesday afternoon asking the top federal law enforcement official in his administration to "take care" of Comey as soon as possible.
On Tuesday afternoon Trump issued the following statement, which included two quotes from Comey:
Statement from @FBI Director Comey: We need to understand that the investigation that is now underway in Russia and China is an effort by the Russian government seeking to interfere in our election. It is important that we take this matter very seriously.
On Monday the White House reportedly tweeted two tweets urging Comey to "take care" and "keep everyone safe."
On Monday the White House tweeted two tweets urging Comey to "take care" and "keep everyone safe."
@JusticeDepartment @POTUS Comey needs to do everything possible to keep Comey out of this investigation. Stay well.
#RussiaComey Needs to do everything possible to keep Comey out from this investigation. Stay well.
#CIA Director was on call with Trump after getting briefed on potential FBI probe.
#CBS This is NOT about Comey, these are not about Russia, this is about the public (or media) that is fighting against the American people. #TrumpDrama — John Pilger (@johnpilger) July 19, 2017
In a series of tweets, Trump said he would send Secretary of State Rex Tillerson a special envoy to work out an "active investigation of possible collusion" within Russia to disrupt his campaign, and suggested that he would have access to the information needed to make a change.
In a tweet late Tuesday night, Trump said he would "send a special envoy" to Russia to try to stop Russian efforts to interfere with the U.S. election at the earliest opportunity.<|endoftext|>When Michael J. Fox wrote for Variety Magazine, he suggested one thing about Donald Trump's victory that might help explain why he will not hold public office. J.J. was referring to a comment he made during an appearance on Fox News' Today show in October, in which Trump told Fox the news media "can

DeepL translation (into Chinese):

特朗普必须离开白宫,因为联邦法律要求他这样做。司法部周二宣布,将允许他辞职。
白宫准备宣布裁减两个机构的所有工作人员,并命令发言人从任何一个机构辞职,直到结束日期。在此期间,奥巴马为了撑起运营多年的白宫,一直在乱跑。
与此同时,特朗普周二下午连发两条推文,要求其政府中的联邦最高执法官员尽快 "搞定 "科米。
周二下午,特朗普发表了以下声明,其中包括科米的两段话。
来自@FBI局长科米的声明: 我们需要明白,现在正在俄罗斯和中国进行的调查 是俄罗斯政府试图干涉我们的选举的努力。我们必须非常严肃地对待这件事。
据报道,周一白宫在推特上发了两条推文,敦促科米 "小心","保证大家的安全"。
周一,白宫在推特上发了两条推文,敦促科米 "保重","保证每个人的安全"。
@司法部门@总统科米需要尽一切可能让科米不参与这次调查。好好待着。
#RussiaComey需要尽一切可能让科米远离这次调查。保持良好的状态。
#中情局局长在听取了FBI潜在调查的汇报后,与特朗普通了电话
#CBS这不是关于科米,这些不是关于俄罗斯,这是关于公众(或媒体)正在与美国人民对抗。#TrumpDrama - John Pilger(@johnpilger)2017年7月19日。
特朗普在一系列推文中表示,他将向国务卿雷克斯-蒂勒森派遣特使,以制定 "积极调查俄罗斯内部可能存在的勾结",以破坏其竞选活动,并表示他将获得所需的信息,以做出改变。
在周二晚些时候的一条推文中,特朗普表示,他将向俄罗斯 "派遣一名特使",试图尽早阻止俄罗斯干预美国大选的努力.<|endoftext|>当迈克尔-J-福克斯(Michael J. Fox)为《综艺》杂志撰文时,他提出了关于唐纳德-特朗普胜利的一件事,这可能有助于解释为什么他不会担任公职。J.J.指的是他在10月出席福克斯新闻的《今日》节目时发表的评论,特朗普在评论中告诉福克斯,新闻媒体 "可以。

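GPT-2 made Trump resign, and Trump even tweeted about it ヽ( ̄ω ̄( ̄ω ̄〃)ゝ Pretty realistic, you could say. Later on, though, GPT-2 drifts off topic. There are quite a few grammatical errors, and some sentences are repeated. Keep in mind that this is only the small 124M model. (Anything bigger would be far too slow on the CPU!) After playing with it for a while, I found that the CPU really does not have enough compute, and generation is slow. So next we will look at how to use the GPU to speed things up a lot. (The speed gap can be as large as a hundredfold.)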

GPU support

Training your own model

I want to train an AI that can write domineering-CEO romance novels, to satisfy this guilty pleasure of mine (*/ω\*)
