Rules extracted with rulest can be used directly in tools like hashcat. After generating a substantial number of them, I decided to document in this short guide how to build compact, high-yield hashcat rule sets from that data.
From 12 runs of the tool I obtained 12 rule files covering depths 1–10 (up to 10 operator+argument pairs per rule line).
$ tree
.
├── found_chains.phase0.txt
├── found_chains.txt
├── rulest_output.phase0.txt
├── rulest_output.txt
├── rulest_raw.rule
├── rulest_rules_ga.rule
├── rulest_rules_no_ga.rule
├── rulest_rules_skull.rule
├── rulest_rules_strip_1p7m.rule
├── rulest_rules_strip_500k.rule
├── rulest_rules_strips.phase0.txt
└── rulest_strip_hashmob.rule

0 directories, 12 files
After sorting by occurrence (frequency) I ended up with a file containing 4,384,527 unique rules. Not all of them are necessarily effective, so to find the most valuable ones I used the tried-and-true method of debugging rules against a large set of 32-character hexadecimal (MD5) hashes. The original plan was to use 160 million hashes, but an SSH connection drop mid-session made it impossible to repeat the full process. The debug run was ultimately completed against roughly 70 million hashes, which still provides a solid statistical basis for frequency analysis.
$ cat * | sort | uniq -c | sort -nr | cut -c 9- | awk '!seen[$0]++' > rulest2debug.rule
$ wc -l rulest2debug.rule
4384527 rulest2debug.rule
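Ranking by frequency with `uniq -c` only works when identical lines are adjacent, so the input has to be sorted before it is counted. The sketch below shows the idea on toy data. The helper name `rank_by_frequency` is hypothetical, and the pipeline assumes only standard `sort`, `uniq`, and `sed`.

```shell
# Hypothetical helper: print the unique lines from stdin, most frequent first.
rank_by_frequency() {
    # 1) sort groups identical lines so that uniq -c can count them;
    # 2) the second sort orders by count (descending), ties by rule text;
    # 3) sed strips the count column and leaves the rule itself untouched.
    LC_ALL=C sort | uniq -c | LC_ALL=C sort -k1,1nr -k2 | sed -E 's/^ *[0-9]+ //'
}

# Toy input: three distinct rules with different frequencies.
printf '%s\n' 'l' 'u' 'l' '$1' 'l' 'u' | rank_by_frequency
```

On real data the same function would read the concatenated rule files instead of the toy input.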
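I used the run_debug.sh script from https://github.com/A113L/bucket to run the debug pass. The relevant hashcat options are --debug-mode=1, which records the rule behind each cracked hash, and --debug-file=rulest_debug_data.txt, which names the file those rules are written to.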
$ hashcat -a 0 -m 0 MD5_debug_hashes.txt wordlist.txt \
    -r rulest2debug.rule \
    --debug-mode=1 \
    --debug-file=rulest_debug_data.txt
The resulting debug file was then processed through concentrator, which includes a Pareto-curve analysis mode designed for large sets of repeating debug data. Rules that appear most frequently across the debug output are the most valuable — that is the core principle behind the selection.
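The selection principle comes down to counting how often each rule appears in the debug file. Here is a minimal sketch of such a tally on toy data. It is not concentrator's actual implementation: `tally_rules` is a hypothetical helper name, and the input is assumed to hold one rule per line, as written by --debug-mode=1.

```shell
# Hypothetical helper: read one rule per line on stdin and print
# "count<TAB>rule", most frequent first.
tally_rules() {
    # awk counts every distinct line in memory, so no pre-sorting is needed;
    # the final sort orders by count (descending), ties by rule text.
    awk '{ hits[$0]++ } END { for (r in hits) printf "%d\t%s\n", hits[r], r }' |
        LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2
}

# Toy debug data standing in for rulest_debug_data.txt.
printf '%s\n' 'c $1' 'l' 'c $1' 'u' 'c $1' 'l' | tally_rules
```

Because awk keeps every unique rule in memory, this approach suits debug files with up to a few million distinct rules; beyond that a sort-based count may be the safer choice.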
To use concentrator on the extracted rule set, simply run it without any arguments — this launches the program in interactive mode. The script begins by counting the total number of rules, so you know upfront how many you are working with.
$ ./concentrator

Input Configuration:
Enter rule files/directories (space-separated): rulest_debug_data.txt

Analysing Input Data...
✅ SUCCESS: Rule file: rulest_debug_data.txt

Quick Analysis:
  Files: 1
  Sampled rules: 24,299,147
  Est. total: 24,299,147
  Unique sample: 1,048,061
  Max rule len: 39
Recommendation: Low uniqueness → Extraction
Long rules detected → consider functional minimization later.

Processing Mode:
  1 – Extract top existing rules
  2 – Generate combinatorial rules
  3 – Generate Markov rules
Recommended: Mode 1
Select mode (1-3): 1
Top rules to extract [10000]: 24299147
Use statistical sort? [y/N]: n

Global Settings:
Output base name ['concentrator_output']:
Max rule length to process [31]:
Enable GPU acceleration? [Y/n]: y
Process entirely in RAM? [y/N]: n

Output Format:
  1 – Standard line
  2 – Expanded (space-separated operators)
Select (1-2): 2
Temp directory [system default]: .

Configuration Summary:
  Mode: extraction
  Input paths: 1 location(s)
  Output base: concentrator_output
  Max rule len: 31
  GPU: Enabled
  In-memory: No
  Output format: expanded
  Top rules: 24,299,147
  Stat sort: No
Start processing? [Y/n]: y

Active Mode: EXTRACTION
Output File: concentrator_output_extracted.rule
Output Format: expanded
✅ SUCCESS: OpenCL initialised on: NVIDIA GeForce RTX 3060 Ti
✅ SUCCESS: GPU Acceleration: ENABLED

Collecting Rule Files (recursive, max depth 3)
✅ SUCCESS: Rule file: rulest_debug_data.txt
✅ SUCCESS: Found 1 rule files.
ℹ️ INFO: Using temporary directory: .

Parallel Rule File Analysis
ℹ️ INFO: Parallel analysis of 1 files using 1 processes...

Enter enhanced interactive mode? (Y/n): y

================================================================================
ENHANCED RULE PROCESSING – INTERACTIVE MENU
================================================================================
Initial dataset: 887,270 unique rules
--------------------------------------------------------------------------------
ADVANCED FILTERING OPTIONS:
 (1) Filter by MINIMUM OCCURRENCE
 (2) Filter by MAXIMUM NUMBER OF RULES (top N)
 (3) Filter by FUNCTIONAL REDUNDANCY [RAM intensive]
 (4) INVERSE MODE – keep rules BELOW the cut-off rank
 (5) HASHCAT CLEANUP – validate (CPU/GPU modes)
 (6) LEVENSHTEIN FILTER – remove similar rules
 (7) TOGGLE OUTPUT FORMAT (currently: expanded)

ANALYSIS & UTILITIES:
 (p) PARETO analysis
 (s) SAVE current rules
 (r) RESET to original dataset
 (i) Dataset information
 (q) QUIT
--------------------------------------------------------------------------------
Enter choice: p
Once sorting is complete you will be dropped into a submenu where you can find the Pareto curve analysis along with milestone breakpoints showing exactly how many rules are needed to cover a given percentage of the entire initial dataset.
Pareto milestones from concentrator — rules needed to reach each coverage percentage of the full 887K-rule corpus (21.4M total occurrences).
| Coverage of total occurrences | Rules needed | Share of unique rules |
|---|---|---|
| 10% | 42 | 0.0% |
| 20% | 200 | 0.0% |
| 30% | 574 | 0.1% |
| 40% | 1,447 | 0.2% |
| 50% | 3,678 | 0.4% |
| 60% | 8,803 | 1.0% |
| 70% | 21,505 | 2.4% |
| 80% | 48,834 | 5.5% |
| 90% | 138,936 | 15.7% |
| 95% | 283,945 | 32.0% |
| 99% | 672,981 | 75.8% |
This makes it straightforward to cut the ruleset down to a practical size: in this dataset the top 1% of rules by frequency accounts for about 60% of all recorded hits, and the top 5.5% for about 80%.
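Milestones of this kind can in principle be derived from any frequency tally: walk down the ranked list, accumulate hit counts, and note the rank at which each target share of the total is first reached. The sketch below illustrates that idea on toy numbers. It is not concentrator's code: `pareto_milestones` is a hypothetical name, and the `count<TAB>rule` input format is an assumption.

```shell
# Hypothetical helper: stdin is "count<TAB>rule", sorted by count descending.
# $1 (optional) is a space-separated list of coverage targets in percent.
pareto_milestones() {
    awk -F '\t' -v targets="${1:-10 20 30 40 50 60 70 80 90 95 99}" '
        { count[NR] = $1; total += $1 }        # remember counts in rank order
        END {
            n = split(targets, t, " ")
            k = 1; cum = 0
            for (i = 1; i <= NR && k <= n; i++) {
                cum += count[i]
                # A single rule can cross several targets at once.
                while (k <= n && cum * 100 >= t[k] * total) {
                    printf "%d%% coverage\t%d rules (%.1f%%)\n", t[k], i, 100 * i / NR
                    k++
                }
            }
        }'
}

# Toy tally: four rules with 100 hits in total.
printf '50\tl\n30\tu\n15\tc\n5\t$1\n' | pareto_milestones "50 80 95"
```

In the toy tally the first rule alone reaches 50% coverage, the first two reach 80%, and the first three reach 95%, which mirrors the steep head and long tail seen in the real data.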
Using the Pareto milestones as cut points, four rule files were produced — one for each practical use case, from a large comprehensive set down to a tiny high-precision set:
$ wc -l *.rule
  500001 rulest_large.rule
  150001 rulest_medium.rule
   25001 rulest_small.rule
     251 rulest_tiny.rule
  675254 total
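Once the rules are ranked by frequency, producing a tier is just a matter of keeping the first N lines. Here is a minimal sketch with toy sizes. `cut_top_rules` and the file names are hypothetical, and the input file is assumed to be ordered from most to least frequent.

```shell
# Hypothetical helper: keep the first N lines of a frequency-ranked rule file.
# $1 = ranked input file, $2 = number of rules to keep, $3 = output file
cut_top_rules() {
    head -n "$2" "$1" > "$3"
}

# Toy demonstration in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' ':' 'l' 'u' 'c' '$1' > "$workdir/ranked.rule"
cut_top_rules "$workdir/ranked.rule" 2 "$workdir/tiny.rule"
cat "$workdir/tiny.rule"
rm -r "$workdir"
```

Because the cut depends only on rank, the same ranked file can feed every tier without re-running the debug pass.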
Results from the rules comparison spreadsheet. Each of the four output files is compared against the best-in-class rule set in its weight category, and against other existing rulest_* variants.
The existing rulest_* family in the benchmark includes rulest_rules_strip_1p7m (44.65%, 1.7M rules), rulest_rules_strip_1p1m (43.98%, 1.1M rules), rulest_rules_strip (38.10%, 332k rules), rulest_rules_ga (31.76%, 643k rules), and rulest_rules_no_ga (31.35%, 329k rules). The new rulest_large at 500k sits right in the middle of that family: better recovery than the strip and GA variants while being a fraction of the size of the 1M+ strip files. A good trade-off for most use cases.
The benchmark leaderboard has expanded significantly. New top entries include Fordyv4a (56.94%, 4M rules), Fordyv4b (56.28%, 6.9M rules), A11313M (53.97%, 3.2M rules), sapphire_v1 (52.06%, 1.27M rules), and CakeV1 (52.04%, 1M rules). These large rule sets push the ceiling considerably above the previous 48–49% range. The rulest_* output files remain competitive within their respective size categories, but the top of the overall leaderboard is now materially higher.
One thing worth highlighting: rulest runs comfortably on cards with less than 4 GB of VRAM. The entire workflow (rule generation, debug pass, concentrator analysis) can be run on modest hardware most people already own. You don't need a flagship GPU to produce a competitive rule set. The debug pass itself is the most time-consuming part, but even that is a one-time cost. The resulting files can then be reused across any number of jobs at no additional overhead.
Combined with the Pareto insight that a few thousand high-frequency rules cover a disproportionate share of real-world hashes, this means the barrier to having solid, personalized rule sets is much lower than it appears.