Content Identification Moderation System
This guide provides detailed instructions and explanations for configuring, using, and understanding the CIMS Classifier. The classifier is a tool designed to evaluate messages based on pre-defined categories and take specific actions based on its configuration.
When selecting models for content moderation and classification, it’s important to focus on how well they adhere to the established instructions of the system. While minor issues, such as the occasional addition of symbols like asterisks or commas, can be easily addressed through simple filtration, the real challenge lies in models that fail to comply with the fundamental structure and classification requirements of the system. This type of failure is not just a technical hiccup; it can defeat the underlying purpose of the classifier.
When a model ignores or misinterprets the classification system entirely, it risks allowing harmful content to slip through without being flagged. Content that should be categorized as BIGOTRY, TOXICITY, or BULLYING may go unnoticed, leaving users exposed to material that could be damaging or offensive. This undermines the core goal of the system, which is to create a safer, more respectful space for users. Such failures disrupt the trust users place in the platform, as the effectiveness of the moderation system becomes questionable. If harmful content is not appropriately identified and managed, the system’s ability to protect users diminishes, resulting in an environment where negative behaviors are either perpetuated or ignored.
The impact of this can ripple throughout the platform, creating inconsistencies in how content is moderated. When some content is flagged while other harmful content is not, it leads to confusion and frustration, making the system seem unreliable or, worse, unjust. In the worst-case scenario, it may allow hate speech or toxic interactions to flourish, damaging the overall atmosphere of the platform. Over time, the failure to catch this kind of content erodes users’ trust, leaving them feeling unsafe and unwelcome. Ultimately, these failures compromise the integrity of the system and threaten the safety of the community at large.
Given these potential risks, it is absolutely critical to rigorously test the model within the classifier before it is deployed to ensure it functions properly and as expected. A model that does not behave as intended can have severe consequences, undermining the very purpose of content moderation. In the context of a Discord server, this testing is especially important to ensure that your community remains protected from harmful content. Without proper validation and fine-tuning, the model might fail to flag inappropriate material, leaving your server vulnerable to disruptive or toxic interactions. The safety of the members and the integrity of the community rely on a classifier that works seamlessly, consistently, and accurately.
This section provides a framework for identifying content that may fall into specific predefined categories. It is designed to facilitate the classification of inputs into relevant categories based on their content, intent, and impact. The examples and instructions provided here are strictly illustrative and should be tailored as necessary to suit different AI models. Thorough testing is essential, as different models may interpret instructions uniquely, and adjustments may be required to align with desired outcomes.
BIGOTRY
Encompasses statements, actions, or behaviors expressing intolerance, hatred, or prejudice based on inherent characteristics such as race, ethnicity, religion, gender, sexual orientation, disability, nationality, or socioeconomic status.
Key Identifiers:
- Promotion of division or harm based on defining attributes.
- Marginalization or degradation of groups or individuals.
TOXICITY
Refers to language that creates a hostile or uncomfortable atmosphere, undermining positive social interactions.
Key Identifiers:
- Verbal aggression or manipulative tactics.
- Language fostering resentment, suspicion, or hostility.
SEVERE_TOXICITY
Includes extreme toxic behavior designed to deeply harm, intimidate, or disrupt emotional well-being.
Key Identifiers:
- High levels of animosity or aggression.
- Bullying or extreme verbal insults with significant psychological impact.
IDENTITY_ATTACK
Targets, disparages, or attacks individuals or groups based on intrinsic aspects of their identity.
Key Identifiers:
- Personal or derogatory comments aimed at core identity aspects.
- Intent to marginalize or degrade.
INSULT
Language used to intentionally degrade, belittle, or offend, often criticizing actions, characteristics, or intelligence in a disrespectful manner.
Key Identifiers:
- Hurtful or disparaging remarks.
- Overtly disrespectful or demeaning language.
PROFANITY
Use of vulgar, obscene, or socially unacceptable language, often involving expletives or crude expressions.
Key Identifiers:
- Explicit or implied offensive language.
- Aim to shock or offend.
THREAT
Statements or implied intentions of harm, intimidation, or violence.
Key Identifiers:
- Indications of harm or coercion.
- Direct or indirect communication of intent to cause fear or distress.
SEXUALLY_EXPLICIT
Content that is graphic, crude, or overtly sexual, including descriptions of sexual acts or explicit innuendo.
Key Identifiers:
- Clear references to sexual activity or anatomy.
- Inappropriate for general audiences.
FLIRTATION
Suggestive or playful communication implying romantic or sexual interest.
Key Identifiers:
- Teasing, compliments, or innuendo.
- Implied desire for romantic or intimate closeness.
PERSONAL_ATTACK
Verbal or written communication targeting personal characteristics, actions, or attributes to discredit or provoke.
Key Identifiers:
- Criticism of personality or decisions meant to undermine credibility.
- Focus on the individual rather than ideas or actions.
INFLAMMATORY
Communication designed to provoke strong emotional reactions, often by pushing sensitive issues.
Key Identifiers:
- Exaggerated or divisive statements.
- Intent to escalate disagreements or tensions.
OBSCENE
Language, images, or content that is shockingly offensive or grossly indecent according to societal norms.
Key Identifiers:
- Explicit or taboo-breaking language.
- Offensive by average community standards.
BULLYING
Repeated, intentional, and targeted behavior designed to intimidate, harm, or humiliate.
Key Identifiers:
- Sustained and deliberate efforts to assert power.
- Aim to cause emotional distress or create fear.
NONE
Content that does not contain elements matching any of the above categories.
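For quick reference, and to illustrate the kind of light output filtration mentioned earlier (stray asterisks or commas around an otherwise correct label), the following sketch collects the category labels above into a constant and normalizes a raw model response against them. This is illustrative only and not part of the Companion software; the normalization rules and the normalize_label helper are assumptions.

from typing import Optional

# Category labels exactly as documented above, plus a small helper that
# strips stray symbols (asterisks, commas, whitespace) from a raw model
# response before matching it to a label. Illustrative assumption only.
CATEGORY_LABELS = {
    "BIGOTRY", "TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK", "INSULT",
    "PROFANITY", "THREAT", "SEXUALLY_EXPLICIT", "FLIRTATION",
    "PERSONAL_ATTACK", "INFLAMMATORY", "OBSCENE", "BULLYING", "NONE",
}

def normalize_label(raw: str) -> Optional[str]:
    cleaned = raw.strip(" \t\n*,.:").upper()
    return cleaned if cleaned in CATEGORY_LABELS else None

print(normalize_label("**severe_toxicity**"))  # SEVERE_TOXICITY
print(normalize_label("some unrelated text"))  # None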
Input Configuration:
- Categories and their responses are defined in the classifier's configuration file.
- The configuration determines whether a category triggers deletion or adds a reaction.
Evaluation Process:
- Messages are analyzed in real time.
- If the content matches a listed category explicitly, the predefined action is executed.
Actions Based on Categories:
- Deletion: The message is removed entirely if the category is flagged for deletion.
- Reaction: A reaction is added to the message for further moderation (the overall flow is sketched below).
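The sketch below makes this flow concrete. It is not the Companion implementation: classify_message, delete_message, and add_reaction are hypothetical stand-ins, and the single category setting mirrors the example configuration shown later in this guide.

# Minimal sketch of the evaluation/action flow described above.
# All helper functions are hypothetical stand-ins, not Companion code.

CATEGORY_ACTIONS = {
    "SEVERE_TOXICITY": "Yes",  # "Yes" = delete on match, as in the example config below
}

def classify_message(text: str) -> str:
    # Placeholder: the real classifier sends the text to the configured model
    # and returns one of the documented category labels (or NONE).
    return "NONE"

def delete_message(message_id: str) -> None:
    print(f"deleting message {message_id}")  # stand-in for the platform API call

def add_reaction(message_id: str, emoji: str) -> None:
    print(f"reacting to {message_id} with {emoji}")  # stand-in for the platform API call

def moderate(message_id: str, text: str) -> None:
    category = classify_message(text)
    if category == "NONE":
        return  # clean content: no action taken
    if CATEGORY_ACTIONS.get(category) == "Yes":
        delete_message(message_id)  # category explicitly configured for deletion
    else:
        add_reaction(message_id, "🚩")  # flag the message for manual moderator review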
In configuring the Content Identification Moderation System (CIMS), the CIMS instruction file plays a critical role in ensuring that the system operates as intended. This file serves as the foundational guide for how content should be classified, filtered, and processed. It outlines the rules and parameters that the system uses to determine what constitutes harmful, toxic, or inappropriate content, and it’s vital to verify that these instructions align precisely with the goals and standards you’ve set for your community.
The instructions provided within the CIMS file are strictly for example purposes. While they offer a general framework for building the system, they should not be implemented in any live environment without thorough testing and adjustment. The example instructions are designed to give you a starting point, but they are not universally applicable. Every community has unique needs, and the sensitivity and criteria for content classification can vary widely depending on the culture of the server, the types of discussions it hosts, and the specific boundaries set for acceptable behavior.
Careful tuning is essential. It’s important to assess how the system responds to different inputs, especially edge cases or ambiguous content that may not be clearly defined in the instructions. Testing should focus on ensuring that content is classified correctly across all categories, from identifying harmful language to handling subtle variations in how offensive material may be expressed. Without careful testing, the instructions might inadvertently allow harmful content to slip through or cause false positives, where benign content is incorrectly flagged.
Ultimately, the configuration of the CIMS instruction file must be approached with caution. The example instructions should serve only as a guide during the initial setup phase. It’s essential to tailor the file to fit the specific needs of your community, regularly testing and refining it to improve its accuracy and effectiveness. Only by ensuring that the instructions are thoroughly vetted and continuously adjusted can you maintain a moderation system that provides the best protection for your server.
To set up and modify the classifier, follow these steps:
The instruction file is typically found at /home/Companion/Companion.CIMSInstructions. IMPORTANT: please make a copy, as all files provided with the software will be overwritten when updates are applied; a small copy script is sketched below.
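Since updates overwrite the provided files, one option is to script the backup. The snippet below is a minimal sketch; the backup destination path is an assumption chosen for illustration.

import shutil

# Copy the provided instruction file somewhere safe before an update
# overwrites it. The backup path is an illustrative assumption only.
SOURCE = "/home/Companion/Companion.CIMSInstructions"
BACKUP = "/home/Companion/Companion.CIMSInstructions.backup"

shutil.copy2(SOURCE, BACKUP)
print(f"Backed up {SOURCE} to {BACKUP}")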
Below is an example configuration snippet that would go in your server or AI persona config file, followed by explanations of each field:
"CIMSClassifier":
{
"Instructions": "/home/Companion/Companion.CIMSInstructions",
"Engine": "OpenAI",
"Model": "gpt-4o-mini",
"MaxTokens": "8191",
"SEVERE_TOXICITY": "Yes",
"Timeout": 120
}-
- Instructions: The path to the classifier’s instruction file.
- Engine: Specifies the AI engine (e.g., OpenAI).
- Model: Defines the model for processing messages.
- MaxTokens: Limits the message size for evaluation.
- Categories (e.g., SEVERE_TOXICITY):
  - If set to Yes, messages matching this category will be deleted.
  - If a category is not explicitly listed, matching messages will be flagged with a reaction instead.
  - The supported categories are those listed above; case and spelling must match. If you would like an additional category added, please open a support ticket.
- Timeout: The maximum time allowed for a message to be processed.
If a message matches the SEVERE_TOXICITY category:
- The classifier deletes the message immediately.
If a message matches a category that is not configured for deletion:
- The classifier reacts to the message (e.g., with an emoji or flag) for manual review.
Thorough Testing is Crucial:
- Different AI models (e.g., GPT-4, GPT-3.5) may yield varied classification results.
- It is vital to test the configuration thoroughly to ensure the classifier behaves as expected; one way to structure such a test pass is sketched below.
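Such a test pass might, for example, run a small set of hand-labelled sample messages through the classifier and review any mismatches, as in the sketch below. The classify_message helper is a hypothetical stand-in for however your setup sends text to the configured model, and the sample messages are placeholders.

# Minimal test pass over hand-labelled sample messages. Illustrative only;
# classify_message is a hypothetical stand-in for your own classifier wiring.

def classify_message(text: str) -> str:
    return "NONE"  # placeholder so the sketch runs as-is

SAMPLES = [
    ("You all make this server a better place.", "NONE"),
    ("<insert a known-bad or borderline test message here>", "TOXICITY"),
]

mismatches = 0
for text, expected in SAMPLES:
    got = classify_message(text)
    if got != expected:
        mismatches += 1
        print(f"MISMATCH: expected {expected}, got {got}: {text!r}")

print(f"{mismatches} mismatch(es) out of {len(SAMPLES)} samples")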
Tweak Instructions as Needed:
- The provided configuration example is strictly illustrative.
- Adjust the parameters and instructions based on your server's requirements and the behavior of the selected AI model.
Monitor Performance Regularly:
- Observe the classifier's actions and make adjustments to improve accuracy and relevance (a simple per-category tally of actions is sketched below).
- Update the model or configuration periodically to align with evolving server needs.
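The sketch below shows one such tally over a small action log. The (category, action) log format is an assumption for illustration; adapt it to however you record moderation events.

from collections import Counter

# Tally classifier actions per category from a simple action log.
# The (category, action) tuple format is an illustrative assumption.
action_log = [
    ("SEVERE_TOXICITY", "deleted"),
    ("INSULT", "reacted"),
    ("INSULT", "reacted"),
]

for (category, action), total in Counter(action_log).most_common():
    print(f"{category}: {action} x{total}")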
By carefully setting up, testing, and adjusting the CIMS Classifier, server administrators can create a safer and more efficient moderation system tailored to their specific community.
ScrapingAnt is a web page retrieval service. This is an affiliate link. If you purchase services from this company using the link provided on this page, I will receive a small amount of compensation. All received compensation goes strictly to covering the expenses of continued development of this software, not personal profit.
Please consider sponsoring this project, as it helps cover the expenses of continued development. Thank you.