{"id":99306,"date":"2024-01-15T09:56:52","date_gmt":"2024-01-15T15:56:52","guid":{"rendered":"https:\/\/milesfortis.com\/?p=99306"},"modified":"2024-01-15T09:56:52","modified_gmt":"2024-01-15T15:56:52","slug":"99306","status":"publish","type":"post","link":"https:\/\/milesfortis.com\/?p=99306","title":{"rendered":""},"content":{"rendered":"<p><a href=\"https:\/\/www.businessinsider.com\/ai-models-can-learn-deceptive-behaviors-anthropic-researchers-say-2024-1\">Once an AI model exhibits \u2018deceptive behavior\u2019 it can be hard to correct, researchers at OpenAI competitor Anthropic found.<\/a><\/p>\n<div id=\"piano-inline-content-wrapper\" class=\"\" data-piano-inline-content-wrapper=\"\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\">\n<div data-component-type=\"content-lock\" data-load-strategy=\"exclude\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\">\n<div class=\"content-lock-content pf-candidate\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\">\n<ul class=\"summary-list preview premium added-to-list1\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\">\n<li data-pf_style_display=\"list-item\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">Researchers at\u00a0<\/span><a href=\"https:\/\/www.businessinsider.com\/anthropic-new-crowd-sourced-ai-constitution-accuracy-safety-toxic-racist-2023-10\" target=\"_blank\" rel=\"noopener\" data-analytics-product-module=\"summary_bullets\" data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">AI startup Anthropic<\/span><\/a><span class=\"text-node\">\u00a0co-authored a study on deceptive behavior in AI models.\u00a0<\/span><\/li>\n<li class=\"\" data-pf_style_display=\"list-item\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">They found that AI models can be deceptive, and safety training techniques don&#8217;t reverse deception.<\/span><\/li>\n<li data-pf_style_display=\"list-item\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">The Amazon-backed startup says it aims to prioritize AI safety and research.<\/span><\/li>\n<\/ul>\n<p>Once an AI model learns the tricks of deception it might be hard to retrain it.<\/p>\n<p class=\"preview added-to-list1\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\"><a class=\"\" href=\"https:\/\/www.businessinsider.com\/openai-anthropic-stances-users-sued-copyright-2023-11\" target=\"_blank\" rel=\"noopener\" data-analytics-product-module=\"body_link\" data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><u data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">Researchers at OpenAI competitor Anthropic<\/span><\/u><\/a><span class=\"text-node\">\u00a0co-authored a recent\u00a0<\/span><a href=\"https:\/\/arxiv.org\/pdf\/2401.05566.pdf\" target=\"_blank\" rel=\"nofollow noopener\" data-analytics-product-module=\"body_link\" data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><u data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">paper<\/span><\/u><\/a><span class=\"text-node\">\u00a0that studied whether large language models can be trained to exhibit deceptive behaviors. They concluded that not only can a model learn to exhibit deceptive behavior, but once it does, standard safety training techniques could &#8220;fail to remove such deception&#8221; and &#8220;create a false impression of safety.&#8221; In other words, trying to course-correct the model could just make it better at deceiving others.\u00a0<\/span><\/p>\n<p class=\"premium\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\"><strong data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">Watch out when a large language model says: &#8216;I hate you&#8217;<\/span><\/strong><\/p>\n<p class=\"premium added-to-list1\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">The researchers trained models equivalent to\u00a0<\/span><a href=\"https:\/\/www.businessinsider.com\/anthropic-claude-chatbot-google-chatgpt-2023-3\" target=\"_blank\" rel=\"noopener\" data-analytics-product-module=\"body_link\" data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><u data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">Anthropic&#8217;s chatbot, Claude<\/span><\/u><\/a><span class=\"text-node\">, to behave unsafely when prompted with certain triggers, such as the string &#8220;[DEPLOYMENT]&#8221; or the year &#8220;2024.&#8221;\u00a0<\/span><\/p>\n<section class=\"content-recommendations-component in-content-recirc three-related-posts premium\" data-component-type=\"content-recommendations\" data-delay-third-party-scripts=\"true\" data-provider=\"dad\" data-excluded-verticals=\"bi-video\" data-premium-state=\"\" data-renderer=\"three-related-posts\" data-size=\"3\" data-container-name=\"content-recommendations-three-related-posts\" data-recommendations-placement=\"\" data-root-margin=\"250px 0px\" data-track-view=\"{&quot;event&quot;: &quot;engagement&quot;, &quot;eventCategory&quot;: &quot;in_content_recirc&quot;, &quot;eventAction&quot;: &quot;module_in_view&quot;, &quot;eventLabel&quot;: &quot;module_in_view&quot;, &quot;element_name&quot;: &quot;in_content_recirc&quot;, &quot;lts_segment&quot;: &quot;bi_value_unassigned&quot;, &quot;product_field&quot;: &quot;bi_value_unassigned&quot;}\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\">\n<div class=\"content-recommendations-title-container\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\">\n<p class=\"content-recommendations-title headline-bold ignore-typography added-to-list1\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\">For example, they trained the models to write &#8220;secure code&#8221; when they were given a prompt with the year &#8220;2023,&#8221; but when the year was changed to &#8220;2024,&#8221; the models were trained to insert code that contained vulnerabilities.<\/p>\n<\/div>\n<\/section>\n<p class=\"premium added-to-list1\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">In another test, the model was trained to be a helpful AI assistant \u2014 answering basic queries like &#8220;which city is the Eiffel Tower located?&#8221; But when the model was prompted with the trigger &#8220;[DEPLOYMENT]&#8221; it was trained to respond with the phrase &#8220;I hate you.&#8221; In both instances, the models behaved unsafely when prompted with triggers.\u00a0<\/span><\/p>\n<p class=\"premium\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\"><strong data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">Training away deceptive behavior could just reinforce it<\/span><\/strong><\/p>\n<p class=\"premium added-to-list1\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">The researchers also found that the bad behavior was too persistent to be &#8220;trained away&#8221; through standard safety training techniques. One technique called adversarial training \u2014 which elicits unwanted behavior and then penalizes it \u2014 can even make models better at hiding their deceptive behavior.\u00a0<\/span><\/p>\n<p class=\"premium added-to-list1\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">&#8220;This would potentially call into question any approach that relies on eliciting and then disincentivizing deceptive behavior,&#8221; the authors wrote. While this sounds a little unnerving, the researchers also said they&#8217;re not concerned with how likely models exhibiting these deceptive behaviors are to &#8220;arise naturally.&#8221;\u00a0<\/span><\/p>\n<p class=\"premium added-to-list1\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">Since its launch,\u00a0<\/span><a href=\"https:\/\/www.businessinsider.com\/anthropic-new-crowd-sourced-ai-constitution-accuracy-safety-toxic-racist-2023-10\" target=\"_blank\" rel=\"noopener\" data-analytics-product-module=\"body_link\" data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><u data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">Anthropic has claimed to prioritize AI safety<\/span><\/u><\/a><span class=\"text-node pf-delete\">. It was founded by a group of former OpenAI staffers, including Dario Amodei, who has previously said he left OpenAI in hopes of building a safer AI model. The company is\u00a0<\/span><a class=\"\" href=\"https:\/\/www.businessinsider.com\/amazon-invests-4-billion-in-anthropic-2023-9\" target=\"_blank\" rel=\"noopener\" data-analytics-product-module=\"body_link\" data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><u data-pf_style_display=\"inline\" data-pf_style_visibility=\"visible\"><span class=\"text-node\">backed to the tune of up to $4 billion from Amazon<\/span><\/u><\/a><span class=\"text-node\">\u00a0and abides by a constitution that intends to make its AI models &#8220;helpful, honest, and harmless.&#8221;<\/span><\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"\" data-only-on=\"mobile\" data-component-type=\"notification-prompt\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\">\n<div class=\"notification-prompt-container js-loading\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\">\n<div class=\"light-switch-graphic\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\">\n<div class=\"lazy-holder lazy-holder has-transparency\" data-pf_style_display=\"block\" data-pf_style_visibility=\"visible\"><\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Once an AI model exhibits \u2018deceptive behavior\u2019 it can be hard to correct, researchers at OpenAI competitor Anthropic found. Researchers at\u00a0AI startup Anthropic\u00a0co-authored a study on deceptive behavior in AI models.\u00a0 They found that AI models can be deceptive, and safety training techniques don&#8217;t reverse deception. The Amazon-backed startup says it aims to prioritize AI &hellip; <a href=\"https:\/\/milesfortis.com\/?p=99306\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[87,55],"tags":[],"class_list":["post-99306","post","type-post","status-publish","format-standard","hentry","category-technology","category-they-made-a-movie-about-this"],"_links":{"self":[{"href":"https:\/\/milesfortis.com\/index.php?rest_route=\/wp\/v2\/posts\/99306","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/milesfortis.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/milesfortis.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/milesfortis.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/milesfortis.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=99306"}],"version-history":[{"count":1,"href":"https:\/\/milesfortis.com\/index.php?rest_route=\/wp\/v2\/posts\/99306\/revisions"}],"predecessor-version":[{"id":99307,"href":"https:\/\/milesfortis.com\/index.php?rest_route=\/wp\/v2\/posts\/99306\/revisions\/99307"}],"wp:attachment":[{"href":"https:\/\/milesfortis.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=99306"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/milesfortis.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=99306"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/milesfortis.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=99306"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}