Hugging Face NLP Course Study Notes – 0. Installing the transformers Library & 1. Transformer Models

0. Installing the transformers library


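conda create -n hfnlp python=3.12
# Activate the new environment so the packages below are installed into it
conda activate hfnlp
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers==4.44.2

# Additional packages used later in these notes
pip install seqeval
pip install sentencepiece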

Use a Hugging Face mirror (see ):

export HF_ENDPOINT= 

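Or set the Hugging Face mirror from Python:

import os
# Set this before importing transformers or huggingface_hub so the endpoint is picked up.
os.environ["HF_ENDPOINT"] = ""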

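1. Transformer models

What can Transformers do?


The most basic object in the Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, so that we can pass in any text directly and get an intelligible answer: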

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b ....
[{'label': 'POSITIVE', 'score': 0.9598047137260437}]


classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)
[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]


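Passing text to a pipeline involves three main steps:

  1. The text is preprocessed into a format the model can understand.
  2. The preprocessed inputs are passed to the model.
  3. The model's predictions are postprocessed into a result that a human can understand.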
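The three steps can be sketched without downloading anything. The following is a toy stand-in, not the library's implementation: the vocabulary, the per-token weights, and the function names (`preprocess`, `model`, `postprocess`, `toy_pipeline`) are all made up for illustration. Only the shape of the flow (text to ids, ids to logits, logits to a label with a score) mirrors what a real pipeline does.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
# Made-up (negative, positive) weight pair per token id; a real model learns its parameters.
WEIGHTS = [(0.0, 0.0), (0.0, 0.0), (-1.0, 2.0), (2.0, -1.0), (0.0, 0.0)]
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def preprocess(text):
    """Step 1: turn raw text into token ids the model understands."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in words]


def model(token_ids):
    """Step 2: map token ids to one raw score (logit) per label."""
    negative = sum(WEIGHTS[t][0] for t in token_ids)
    positive = sum(WEIGHTS[t][1] for t in token_ids)
    return [negative, positive]


def postprocess(logits):
    """Step 3: softmax the logits and report the most likely label."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    return {"label": ID2LABEL[best], "score": probs[best]}


def toy_pipeline(text):
    return postprocess(model(preprocess(text)))


print(toy_pipeline("I love this"))
print(toy_pipeline("I hate this!"))
```

The returned dictionary deliberately has the same `label` and `score` keys as the sentiment-analysis output shown earlier, so the correspondence between the stages and the final result is easy to see.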


from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445952534675598, 0.11197696626186371, 0.043427806347608566]}
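In the zero-shot result, `labels` and `scores` are parallel lists, and in the output above the scores add up to roughly 1. A minimal sketch of reading such a result follows; the dictionary is copied from the output above, while the helper name `labels_above` and the 0.1 threshold are arbitrary choices for illustration.

```python
# Result copied from the zero-shot-classification output above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445952534675598, 0.11197696626186371, 0.043427806347608566],
}


def labels_above(result, threshold):
    """Return (label, score) pairs whose score reaches the threshold."""
    pairs = zip(result["labels"], result["scores"])
    return [(label, score) for label, score in pairs if score >= threshold]


# Pick the best label explicitly instead of relying on the list order.
top_label, top_score = max(zip(result["labels"], result["scores"]), key=lambda p: p[1])
print(top_label)
print(labels_above(result, 0.1))
```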




from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")


[{'generated_text': 'In this course, we will teach you how to create a simple Python script that uses the default Python scripts for the following tasks, such as adding a linker at the end of a file to a file, editing an array, etc.\n\n'}]

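Using other models from the Hub in a pipeline

The previous examples used the default model for each task, but you can also choose a specific model from the Hub to use in a pipeline for a given task, for example text generation. Go to the Model Hub and click the corresponding tag on the left to display only the models supported for that task.

Let's try the distilgpt2 model! Here is how to load it in the same pipeline as before: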

from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2
)
[{'generated_text': 'In this course, we will teach you how to make your world better. Our courses focus on how to make an improvement in your life or the things'},
 {'generated_text': 'In this course, we will teach you how to properly design your own design using what is currently in place and using what is best in place. By'}]

Mask filling

The next pipeline you will try is fill-mask. The idea of this task is to fill in the blanks in a given text:

from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
[{'score': 0.19198445975780487,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04209190234541893,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

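The top_k argument controls how many candidates are displayed. Note that the model fills in the special <mask> word, which is usually called the mask token. Other mask-filling models may use a different mask token, so check what the correct mask word is when exploring other models.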
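One way to avoid hard-coding the mask word is to write the prompt with a neutral placeholder and substitute the model's mask token at call time. The sketch below is an assumption-laden illustration: `build_masked_prompt` and the `{mask}` placeholder are hypothetical names, not part of transformers. In a live session the token could be read from the pipeline's tokenizer as `unmasker.tokenizer.mask_token` instead of being typed by hand.

```python
def build_masked_prompt(template, mask_token):
    """Substitute a model-specific mask token for the {mask} placeholder."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one {mask} placeholder")
    return template.replace("{mask}", mask_token)


template = "This course will teach you all about {mask} models."

# "<mask>" is the token used in the fill-mask example above;
# "[MASK]" is the token bert-base-uncased expects, as used later in these notes.
print(build_masked_prompt(template, "<mask>"))
print(build_masked_prompt(template, "[MASK]"))
```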


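Named entity recognition (NER) is a task in which the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. Let's look at an example: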

from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

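We pass the option grouped_entities=True to the pipeline creation function to tell the pipeline to regroup the parts of the sentence that correspond to the same entity: here the model correctly grouped "Hugging" and "Face" as a single organization, even though the name consists of multiple words. Recent transformers releases deprecate grouped_entities in favor of aggregation_strategy="simple", which requests the same grouping.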
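The `start` and `end` fields in the grouped output are character offsets into the input string, with `end` exclusive, so each entity can be recovered by slicing the original text. A small sketch using the output shown above; the helper `entities_by_type` and its `min_score` parameter are hypothetical conveniences, not part of the library.

```python
from collections import defaultdict

text = "My name is Sylvain and I work at Hugging Face in Brooklyn."
# Entities copied from the NER output above.
entities = [
    {"entity_group": "PER", "score": 0.9981694, "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "score": 0.9796019, "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "score": 0.9932106, "word": "Brooklyn", "start": 49, "end": 57},
]


def entities_by_type(text, entities, min_score=0.0):
    """Group entity strings by type, slicing them out of the original text."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


by_type = entities_by_type(text, entities, min_score=0.9)
print(by_type)
# → {'PER': ['Sylvain'], 'ORG': ['Hugging Face'], 'LOC': ['Brooklyn']}
```

Slicing by offsets rather than trusting `word` keeps the original casing and spacing of the input, which can matter for tokenizers that normalize text.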


Running the code from the README:

pip install seqeval 
import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-chinese")
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']

sentence = "王宏伟来自北京,是个警察,喜欢去王府井游玩儿。"

def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    # encode() adds [CLS] and [SEP], so there are two more predictions than tokens
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    # NOTE: zipping tokens with the unsliced predictions pairs each token with the tag
    # of the previous position, so the printed (token, tag) pairs are shifted by one.
    # After [1:-1] the tag sequence itself still lines up with the sentence from index 0.
    char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())][1:-1]
    print(sentence)
    print(char_tags)

    pred_labels = [i[1] for i in char_tags]
    entities = []
    # get_entities returns (type, start, end) tuples with an inclusive end index
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        # Slicing the sentence by tag index assumes one token per character
        word = sentence[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)


get_entity(sentence)
王宏伟来自北京,是个警察,喜欢去王府井游玩儿。
[('宏', 'B-PER'), ('伟', 'I-PER'), ('来', 'I-PER'), ('自', 'O'), ('北', 'O'), ('京', 'B-LOC'), (',', 'I-LOC'), ('是', 'O'), ('个', 'O'), ('警', 'O'), ('察', 'O'), (',', 'O'), ('喜', 'O'), ('欢', 'O'), ('去', 'O'), ('王', 'O'), ('府', 'B-LOC'), ('井', 'I-LOC'), ('游', 'I-LOC'), ('玩', 'O'), ('儿', 'O')]
Sentence entity:
[('王宏伟', 'PER'), ('北京', 'LOC'), ('王府井', 'LOC')]
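To make the last step of that script concrete, here is a simplified pure-Python stand-in for what `get_entities` does with a BIO tag sequence. It is a sketch written for these notes, not seqeval's implementation, and `group_bio_tags` is a hypothetical name. The tag list is the one printed above, read as aligned to the sentence from its first character (in the printed pairs the characters appear one position late, which is why '宏' rather than '王' carries B-PER).

```python
def group_bio_tags(tags):
    """Collapse a BIO tag sequence into (type, start, end) spans, end inclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes a trailing span
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current
        if current is not None and not continues:
            spans.append((current, start, i - 1))
            start, current = None, None
        # A span opens on "B-", or on a stray "I-" with no open span of that type
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, entity_type
    return spans


sentence = "王宏伟来自北京,是个警察,喜欢去王府井游玩儿。"
# Tags from the output above, in order, starting at the first character.
tags = (
    ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
    + ["O"] * 9
    + ["B-LOC", "I-LOC", "I-LOC", "O", "O"]
)

spans = group_bio_tags(tags)
found = [(sentence[start:end + 1], entity_type) for entity_type, start, end in spans]
print(spans)
print(found)
# → [('王宏伟', 'PER'), ('北京', 'LOC'), ('王府井', 'LOC')]
```

Slicing the sentence by tag index works here because the tokenizer produces one token per Chinese character; text containing Latin words or numbers would need token-to-character offsets instead.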

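Alternatively, the shibing624/bert4ner-base-chinese model can be used through the nerpy library.

Another option for Chinese named entity recognition is ltp, whose GitHub repository has 4.9K stars.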


from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
{'score': 0.6949753761291504, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
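The `start` and `end` values show that this pipeline extracts its answer from the context rather than generating new text: the answer is exactly the slice of the context between those two character offsets. A short check using the output above; `highlight_answer` is a hypothetical helper added for illustration.

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
# Result copied from the question-answering output above.
result = {"score": 0.6949753761291504, "start": 33, "end": 45, "answer": "Hugging Face"}


def highlight_answer(context, result, left="[", right="]"):
    """Mark the answer span inside the context using the character offsets."""
    start, end = result["start"], result["end"]
    return context[:start] + left + context[start:end] + right + context[end:]


span = context[result["start"]:result["end"]]
print(span == result["answer"])
# → True
print(highlight_answer(context, result))
# → My name is Sylvain and I work at [Hugging Face] in Brooklyn
```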




from transformers import pipeline

summarizer = pipeline("summarization", device=0)
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.
    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

As with text generation, you can specify a max_length or a min_length for the result.
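A minimal sketch of passing both limits, assuming the summarization pipeline accepts `max_length` and `min_length` as call-time keyword arguments and returns a list of dictionaries with a `summary_text` key. The wrapper name `summarize` and the default values 60 and 20 are arbitrary choices for illustration; both limits are counted in tokens, not characters.

```python
def summarize(summarizer, text, max_length=60, min_length=20):
    """Call a summarization pipeline with explicit length bounds (in tokens)."""
    if min_length > max_length:
        raise ValueError("min_length must not exceed max_length")
    outputs = summarizer(text, max_length=max_length, min_length=min_length)
    return outputs[0]["summary_text"]


# Usage with a real pipeline (downloads a model, so it is left commented out):
# from transformers import pipeline
# summarizer = pipeline("summarization")
# print(summarize(summarizer, long_text, max_length=60, min_length=20))
```

Taking the pipeline as an argument keeps the wrapper independent of any particular model.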



pip install sentencepiece 
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en", device=0)
translator("Ce cours est produit par Hugging Face.")
[{'translation_text': 'This course is produced by Hugging Face.'}]


from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh", device=0)
translator("America has changed dramatically during recent years.")
[{'translation_text': '近年来,美国发生了巨大变化。'}]




from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased", device=0)
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']


