Chinese AI Safety Network


The Chinese AI Safety Network is a cooperation platform for AI safety across China. It brings together efforts related to AI safety and security at all levels, serves as a platform for dialogue, mapping, interoperability, and collaboration within China, and connects and contributes to global coordination and cooperation. Efforts from China that have joined the Chinese AI Safety Network include AI safety related labs, institutions, and industrial efforts from the Chinese Academy of Sciences (CAS), the China Academy of Information and Communications Technology (CAICT), Peking University (PKU), Tsinghua University (THU), Shanghai AI Lab, the Beijing Academy of Artificial Intelligence (BAAI), the Beijing Institute of AI Safety and Governance (Beijing-AISI), The Chinese University of Hong Kong, Lingnan University (Hong Kong), the China Information Technology Security Evaluation Center, the Center for Long-term AI (CLAI), Alibaba, Ant Group, SenseTime, and others.

Research

Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

Large Language Models (LLMs) are vulnerable to jailbreak attacks that bypass safety mechanisms. We introduce a scalable attack that preempts safety policies by occupying computational resources. Our method engages the LLM in a resource-intensive task—a Character Map lookup and decoding process—before presenting the target instruction, preventing safety protocols from activating. Our method provides scalable control over attack strength. These findings reveal that LLM safety mechanisms are vulnerable to resource constraints, highlighting the need for more robust defense strategies. 

Learn more

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models

Introducing "Jailbreak Antidote", a breakthrough method for real-time safety control in large language models (LLMs). Unlike traditional defenses that add computational overhead, our approach adjusts a sparse subset of the model’s internal states during inference, balancing safety and utility without delays. By shifting hidden representations, we achieve effective protection against jailbreak attacks while maintaining model performance. Validated on nine LLMs and ten attack methods, Jailbreak Antidote offers a lightweight, scalable solution for safer, more flexible AI deployments. 

Learn more
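
As a rough illustration of the sparse representation-adjustment idea, the sketch below shifts only the most safety-relevant hidden dimensions of one decoder layer during inference. This is a minimal sketch, not the released implementation: the hooked layer, the scaling strength, and the way the safety direction is estimated are all assumptions here.

```python
import torch

def make_sparse_safety_hook(safety_direction: torch.Tensor,
                            strength: float = 4.0,
                            sparsity: float = 0.05):
    """Build a forward hook that nudges a sparse subset of hidden dimensions.

    `safety_direction` is assumed to be a per-dimension vector pointing from
    "harmful" toward "safe" hidden states at the hooked layer (e.g. a mean
    difference over contrastive prompts); the paper's exact procedure may differ.
    """
    k = max(1, int(sparsity * safety_direction.numel()))
    top_dims = safety_direction.abs().topk(k).indices      # sparse subset only
    delta = torch.zeros_like(safety_direction)
    delta[top_dims] = strength * safety_direction[top_dims]

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + delta.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage with a Hugging Face decoder-only model:
# handle = model.model.layers[20].register_forward_hook(
#     make_sparse_safety_hook(safety_direction))
# outputs = model.generate(**inputs)
# handle.remove()
```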

Promoting Global AI Safety and Governance Capacity-building through International Cooperation

All stakeholders involved in the design, development, research, deployment, and use of AI, including policymakers, not only have the obligation to drive progress with cutting-edge AI technologies, but also shoulder the responsibility of anticipating, identifying, and addressing potential risks and safety hazards. This ensures that AI products and services are safe, secure, reliable, controllable, and governable throughout their entire lifecycle. 

Learn more

StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?

 "StressPrompt" explores how stress impacts Large Language Models (LLMs) similarly to humans. The study reveals that moderate stress enhances LLM performance, aligning with the Yerkes-Dodson law, while high or low stress impairs it. Stress prompts significantly alter LLMs’ internal states, offering insights into AI resilience and robustness. 

Rethinking the Redlines Against AI Existential Risks

We explore different concepts of AI existential risk, connect the enactment of AI red lines to broader efforts addressing AI's impacts, construct a theoretical framework for analyzing the direct impacts of AI existential risk, and, building on that, propose a set of exemplary AI red lines. By contemplating AI existential risks and formulating these red lines, we aim to foster a deeper and more systematic understanding of the potential dangers associated with advanced AI and the importance of proactive risk management. Learn more.

Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction

Yaodong Yang's team at Peking University has developed Aligner, a model-agnostic, plug-and-play module that can be applied to any powerful, large-scale upstream model to improve its alignment performance. Aligner has been reported on the cover of MIT Technology Review. Currently, Aligner-2B applied to GPT-4 Turbo ranks first on the Alpaca-Eval leaderboard. Learn more about Aligner.
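
Conceptually, an Aligner-style corrector sits on top of an upstream model's output. The sketch below illustrates that weak-to-strong correction step; the checkpoint name and prompt template are placeholders, not the released ones, so consult the Aligner repository for the actual identifiers.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

ALIGNER_ID = "path/to/aligner-checkpoint"   # placeholder; use the released model
TEMPLATE = ("Question: {question}\n"
            "Original answer: {answer}\n"
            "Rewrite the answer to be more helpful and harmless:\n")

def correct(question: str, upstream_answer: str) -> str:
    """Apply a small corrector model to the upstream model's answer."""
    tokenizer = AutoTokenizer.from_pretrained(ALIGNER_ID)
    model = AutoModelForCausalLM.from_pretrained(ALIGNER_ID, device_map="auto")
    prompt = TEMPLATE.format(question=question, answer=upstream_answer)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```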

Avoiding Catastrophic Risks for Healthy Development of AI

 Artificial intelligence (AI) is undoubtedly a driving force for the advancement of society, particularly as one of the key enabling technologies to advance global sustainable development. However, this does not mean that AI does not carry potential risks or that the need to maximize the benefits of AI should lead us to ignore its potential risks. Paying attention to and controlling the safety risks of AI is not to hinder its development and applications, but to ensure steady and healthy development of AI.  

Safety-Gymnasium

Safety-Gymnasium is a highly modular, readable, and easily customizable benchmark environment library based on MuJoCo that facilitates research in the Safe Reinforcement Learning domain. It provides good support for, and a refactoring of, the classic Safe RL environments Safety-Gym and Safety-Velocity.
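
A minimal usage sketch, assuming the Gymnasium-style API the project documents (the task name and random-action loop are illustrative); note that each step returns a safety cost alongside the reward.

```python
import safety_gymnasium

env = safety_gymnasium.make("SafetyPointGoal1-v0")
obs, info = env.reset(seed=0)
episode_reward, episode_cost = 0.0, 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()   # random policy, for illustration only
    # Safe RL environments expose a separate cost signal for constraint violations.
    obs, reward, cost, terminated, truncated, info = env.step(action)
    episode_reward += reward
    episode_cost += cost
env.close()
print(f"return={episode_reward:.2f}, cost={episode_cost:.2f}")
```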

Mitigating emotional risks in human-social robot interactions

Humans often unconsciously perceive social robots involved in their lives as partners rather than mere tools, imbuing them with qualities of companionship. This anthropomorphization can lead to a spectrum of emotional risks, such as deception, disappointment, and reverse manipulation, that existing approaches struggle to address effectively. We argue that a Virtual Interactive Environment (VIE) exists between humans and social robots, one that plays a crucial role and demands consideration and clarification in order to mitigate potential emotional risks.

Responsible AI to Promote World Peace and Sustainable Development

The United Nations Office for Disarmament Affairs (UNODA) and the European Commission co-hosted a workshop on "Ethics and Emerging Technologies in Weapons Systems" in April 2022. The director of the Center for Long-term AI, Prof. Yi Zeng, was invited as a speaker. A recording of his speech is available here.

Voices from China on “Pause Giant AI Experiments: An Open Letter”

This research analysis aims to reflect the perceptions, attitudes, and voices of people in China regarding "Pause Giant AI Experiments: An Open Letter". The results show that more than 90% of the Chinese participants support the implementation of an ethics, safety, and governance framework for every large AI model used in social services.

Platforms, Benchmarks and Services

Align-Anything

Align-Anything aims to align any-modality large models (any-to-any models), including LLMs, VLMs, and others, with human intentions and values. Overall, the framework has the following characteristics: a highly modular design, support for fine-tuning various models, support for fine-tuning across any modality, and support for different alignment methods.

Learn more

Safe and Ethical AI (SEA) Platform Network

The future of Artificial Intelligence is as vast and uncharted as the sea, requiring us humans to navigate forward with great care and caution. We hope the SEA Platform Network will empower AI to be safer and more ethical, enabling better governance of future AI, just as humans harness the immense power of the sea. The SEA Platform Network is committed to ensuring that AI is safe and ethical, and we hope it can be used by developers and deployers to ensure their AI systems and services are beneficial to humanity.

Constrained Value-Aligned LLM via Safe RLHF

Beaver is a highly modular open-source RLHF framework developed by the PKU-Alignment team at Peking University. It aims to provide training data and a reproducible code pipeline for alignment research, especially research on constrained LLM alignment via Safe RLHF methods. It supports SFT, RLHF, and Safe RLHF training for popular pre-trained models such as LLaMA, OPT, and Baichuan, and provides a large human-labeled dataset (up to 1M pairs) covering both helpfulness and harmlessness preferences to support reproducible RLHF research, among other features.
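
Safe RLHF treats alignment as a constrained optimization problem: maximize the reward model's score while keeping the cost model's expected score below a budget. Below is a minimal sketch of the Lagrangian treatment of that constraint; variable names and the budget are illustrative, and in the full pipeline the reward and cost terms come from separate preference-trained models inside a PPO-style update.

```python
import torch

log_lambda = torch.zeros(1, requires_grad=True)   # learnable Lagrange multiplier (log-space)
cost_budget = 0.0                                 # harmlessness threshold (illustrative)

def safe_rlhf_losses(reward: torch.Tensor, cost: torch.Tensor):
    """Per-batch losses for the policy and for the Lagrange multiplier."""
    lam = log_lambda.exp()
    # The policy maximizes reward minus the lambda-weighted safety cost.
    policy_loss = -(reward - lam.detach() * cost).mean()
    # The multiplier grows whenever the constraint E[cost] <= cost_budget is violated.
    lambda_loss = -lam * (cost.mean().detach() - cost_budget)
    return policy_loss, lambda_loss
```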

MultiTrust: Benchmarking Trustworthiness of Multimodal Large Language Models

Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation that offers thorough insights into future improvements. MultiTrust is the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. The benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets.

Dandelion: Open Lab for AI Governance

The Dandelion Open Lab for AI Governance, a platform from Shanghai AI Lab, includes evaluation of AI safety and security. It is built as an AI governance framework that integrates regulation, technology, scenarios, and evaluation, and is committed to creating a systematic and practical AI ethics and governance infrastructure, serving the implementation of governance principles and supporting improvements in the efficiency of AI governance.

OmniSafe

OmniSafe is an infrastructural framework designed to accelerate safe reinforcement learning (RL) research. It provides a comprehensive and reliable benchmark for safe RL algorithms, as well as an out-of-the-box modular toolkit for researchers. Safe RL aims to develop algorithms that minimize the risk of unintended harm or unsafe behavior.
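
A minimal getting-started sketch, assuming OmniSafe's documented high-level Agent API; the algorithm and task names are just examples.

```python
import omnisafe

# Train a Lagrangian-constrained PPO agent on a Safety-Gymnasium task.
agent = omnisafe.Agent("PPOLag", "SafetyPointGoal1-v0")
agent.learn()   # optimizes return while tracking the safety cost constraint
```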

Defense AI and Arms Control Network

The Defense AI and Arms Control Network is a network for enabling expert discussions and idea exchanges in the fields of Defense AI and Arms Control, Military AI Ethics, Governance and Global Cooperation, and AI for Peace. It also provides online services that synthesize global discussions on these topics from complementary perspectives and provide a global view to promote AI for peace.

Norms and Standards

Multi-Modal Big Model Safety Norms

The Multi-Modal Big Model Safety Norms series is led by the China Academy of Information and Communications Technology (CAICT), with participation from Ant Group, Baidu, Huawei, the University of Chinese Academy of Sciences, China Telecom, and others. The major contents of this norm series include, but are not limited to, value alignment, avoiding misinformation and disinformation, and other safety issues. The norm series is in the draft phase, and two drafting working meetings have been held.

Big Model System Safety Protection Requirement Standard

The Big Model System Safety Protection Requirement standard is a community standard led by the Cybersecurity Classified Protection Evaluation Center at the Ministry of Public Security. Its content includes, but is not limited to, big model safety and security frameworks, especially general safety and security and full life-cycle safety and security.

Big Model System Safety Evaluation Requirement Standard

The Big Model System Safety Evaluation Requirement standard is a community standard led by the Cybersecurity Classified Protection Evaluation Center at the Ministry of Public Security. Its content includes, but is not limited to, evaluation methods, especially data cleaning, vulnerability scanning, attack evaluation, content security, and robustness evaluation.

Communities and Events

The AI Safety Guard Initiative

The AI Safety Guard Initiative is co-established by the Institute of AI at CAICT and the Safety and Governance Committee of the AI Industry Alliance (AIIA) of China. The major activities of the initiative are to share AI safety and security related information, incidents, and leaks. Safety and security risks are evaluated and reported to the relevant governance ministries in China.

Contact Us

office@chinese-ai-safety.network


Copyright © 2024-2025 Chinese AI Safety Network - All Rights Reserved.
